Troubleshooting damaged systems is essential skill of every SysAdmin, SRE or DevOps engineer. Everyone of us runs into OS related issues from time to time and it's better to be prepared when things go terribly wrong. It's especially beneficial to be able to identify and act on the issue quickly to prevent any significant damage. To help with that in this article - we will go over a few common problems that you might encounter as well as ways to gather information, troubleshoot and solve these issues.
Note: This article uses RHEL 8/CentOS. But, examples/concepts below can be applied to any Linux distribution.
Recovering root
Password
What if you lose root
password and you don't have access to any privileged user? If you still have access to the machine, then there is way to solve this inconvenient situation:
First, start by rebooting the machine. When machine starts, hit any key to access boot menu:
In boot menu hit e
to edit boot options. Using arrows move to the line starting with linux
and append rd.break
- this breaks boot process early on. Optionally you can also append enforcing=0
, to pause SELinux enforcing. Next, hit CTRL+X
to let machine boot.
After few seconds of booting, you should get shell. At this point you have access to system in read-only mode. So, in order to change anything in the system - like root
password - we need to make filesystem read-write. We can do that by running mount -o remount,rw /sysroot
. Next thing we need to do is enter root jail using chroot /sysroot
- this changes root of filesystem to be /sysroot
instead of /
. This is required, so that any further commands we run will be in regards to /sysroot
directory. Now we can change root
password using passwd
.
If you added enforcing=0
to boot options, you can now hit CTRL+D
(or type exit
) and let system fully boot. If not, run touch /.autorelabel
to trigger SELinux system relabel. This is needed because changing password results in /etc/password
having incorrect SELinux security context. Therefore we need to relabel whole filesystem during next boot (this can take some time, depending on size of filesystem).
As an alternative solution, you could also access Linux debug-shell
. This can be done - again - by accessing GRUB during boot and appending systemd.debug-shell
instead of rd.break
. When you let system boot with this option, you will end up in normal shell session, which isn't really helpful. If you however try to access terminal 9 using CTRL+ALT+F9
, you will open debug-shell
with full root
permissions. Here, you can change password normally. At this point you can switch back to normal shell (CTRL+ALT+F1
) and login. You shouldn't forget to stop the debug-shell
though, as it is huge vulnerability to the system. You can do that by running systemctl stop debug-shell.service
(you can still switch back to debug-shell
but it will be unresponsive - killed-off).
Fixing Unmountable Filesystems
Creating new partitions, creating filesystems, mounting filesystem, etc. are common tasks for most SysAdmins. But, even though these are basic tasks it's easy to make mistake which may render your system unbootable. Let's see how you can solve problems related unmountable filesystems:
As with previous solutions, we start by rebooting the machine, accessing the boot menu and editing it, this time appending systemd.unit=emergency.target
. This tells your system to boot into emergency target instead of default one (multi-user or graphical). When system boots and we get shell, we login as root and we again remount the filesystem using mount -o remount,rw /
. Now we can try mounting all filesystems by running mount -a
. If there is a problem with mounting some specific filesystem, you might see error message like mount: /wrong-mount: mount point does not exist.
or mount /wrong-mount: special device /dev/sdb1 does not exist.
. These kind of an issues need to be fixed inside /etc/fstab
:
/dev/sda1 /mnt xfs defaults 0 0
/dev/sdb1 /wrong-mount xfs defaults 0 0 <- fix, comment out or remove this line
After fixing the issue in /etc/fstab
run systemctl daemon-reload
, so that systemd
picks up the changes. Now, run mount -a
again. If the issue was indeed fixed, you should see no error (no news, is good news). You can now exit using CTRL+D
and let system boot normally.
Aside from mistyped device or mount point name, you might also encounter issues with VDO (Virtual Data Optimizer) or Stratis, which require extra mount arguments e.g. x-systemd.requires=vdo.service
or x-systemd.requires=stratisd.service
, without which system won't boot properly. Another common and easily fixable mistake might be missing quote when using UUID="...
to specify device (use /etc/fstab
syntax highlighting, it can save you lot problems).
Troubleshooting SELinux Problems
This one is not a life and death kind of a situation, but it can cause a lot problems, so it's beneficial to be able to identify it quickly when it happens.
It's important to realise that most of the time, SELinux is doing it's job correctly. But it might just happen that you are trying to achieve something SELinux doesn't expect. Some of the problem you might encounter may include issues with incorrect file context, for example after moving file from one place to another. Sometimes the issue might be with overly restrictive policies (SELinux booleans) or blocked service ports.
One can troubleshoot all of these problems by first temporarily changing SELinux to non-enforcing mode using setenforce 0
and retrying the action that wasn't working previously. If the problem got fixed by switching SELinux to non-enforcing mode, then we know that the problem was caused by some SELinux violation.
Now, if we turn SELinux back on using setenforce 1
, we can try to analyze and fix the violation:
First install setroubleshoot-server
using yum -y install setroubleshoot-server
. This troubleshooting server will listen to /var/log/audit/audit.log
and send summary messages to /var/log/messages
. Next, to analyze these messages run grep sealert /var/log/messages
which should give you messages like this:
...
Apr 6 08:03:46 server1 setroubleshoot[24548]: SELinux is preventing /usr/sbin/httpd from name_bind access on the tcp_socket port 8012. For complete SELinux messages run: sealert -l d2c359a9-e9c5-48cb-af8b-ba73d982f21b
As an example here, I configured httpd
to run on port 8012
which is blocked because of SELinux service's allowed ports. If we were not aware of this, then it would be quite hard to find root cause of this issue. The output above can help with that. We can see description of the SELinux violation as well as command that can help us troubleshoot further, so let's try it out:
[root@server1 ~]# sealert -l d2c359a9-e9c5-48cb-af8b-ba73d982f21b
SELinux is preventing /usr/sbin/httpd from name_bind access on the tcp_socket port 8012.
***** Plugin bind_ports (92.2 confidence) suggests ************************
If you want to allow /usr/sbin/httpd to bind to network port 8012
Then you need to modify the port type.
Do
# semanage port -a -t PORT_TYPE -p tcp 8012
where PORT_TYPE is one of the following: http_cache_port_t, http_port_t, jboss_management_port_t, jboss_messaging_port_t, ntop_port_t, puppet_port_t.
... omitted
Additional Information:
Source Context system_u:system_r:httpd_t:s0
Target Context system_u:object_r:unreserved_port_t:s0
Target Objects port 8012 [ tcp_socket ]
Source httpd
Source Path /usr/sbin/httpd
Port 8012
Host ...
Source RPM Packages httpd-2.4.37-10.module+el8+2764+7127e69e.x86_64
Target RPM Packages
Policy RPM selinux-policy-3.14.1-61.el8.noarch
Selinux Enabled True
Policy Type targeted
Enforcing Mode Enforcing
... omitted
This produces full report of what caused the violation. Including suggested (not necessarily the most appropriate) fix. If you have some experience with SELinux you might realise that most appropriate way to fix this issue is to add the relevant port to SELinux service (http_port_t
). This can be done by running semanage port -a -t http_port_t -p tcp 8012
.
This pattern of replicating the violation, looking for sealert
messages in var/log/messages
and viewing the report and analyzing the report can be applied to any SELinux violation/problem, not just the one example above.
Alternatively you can also search directly in /var/log/audit/audit.log
using ausearch
. Specific command you would want to run: ausearch -m AVC -ts recent
. This shows all recent denials. Output should look something like this (same information, little less user friendly):
time-<Mon Apr 6 08:10:56 2020
type=PROCTITLE msg=audit(1586160656.860:235): proctitle=2F7573722F7362696E2F6874747064002D44464F524547524F554E44
type=SYSCALL msg=audit(1586160656.860:235): arch=c000003e syscall=49 success=no exit=-13 a0=3 a1=562ebb162280 a2=10 a3=7ffcd89eb21c items=0 ppid=1 pid=24612 auid=4294967295 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=(none) ses=4294967295 comm="httpd" exe="/usr/sbin/httpd" subj=system_u:system_r:httpd_t:s0 key=(null)
type=AVC msg=audit(1586160656.860:235): avc: denied { name_bind } for pid=24612 comm="httpd" src=8012 scontext=system_u:system_r:httpd_t:s0 tcontext=system_u:object_r:unreserved_port_t:s0 tclass=tcp_socket permissive=0
Getting Logs From Crashing System
By default, logs stored in /run/log/journal
are not persisted across system reboots. That might become a problem if you need to debug logs on crashing system. To preserve journal logs we need to modify /etc/systemd/journald.conf
. More specifically the Storage
parameter:
# /etc/systemd/journald.conf
[Journal]
#Storage=auto # <- This one
#Compress=yes
#Seal=yes
#SplitMode=uid
...
By uncommenting and changing Storage
to persistent
, we tell systemd
to store all logs in var/log/journal
. Aside from this change we also need to run systemctl reload systemd-journald
to make sure that the change takes effect.
Even though this change will persist logs on your system, it won't keep all of them forever. By default, journald
is configured to not exceed 10% of filesystem or leave system with less then 15% of free space.
Now, to actually inspect the previously stored logs. First switch to root
user. Run journalctl --list-boots
. This will give you list like this:
-3 620d709db36440dc9a0b165f85f9f346 Sun 2020-03-08 18:58:41 CET—Fri 2020-04-03 08:29:25 CEST
-2 89a7ba7725ab4ac5bc91517d29fbc31e Fri 2020-04-03 08:33:19 CEST—Fri 2020-04-03 08:35:02 CEST
-1 e93ec544729f4008ac126b80b4b0c207 Fri 2020-04-03 08:35:53 CEST—Sun 2020-04-05 15:21:16 CEST
0 58a205834d2d4d7b9b228d8a126b9e81 Sun 2020-04-05 15:21:40 CEST—Fri 2020-04-10 14:57:49 CEST
Based on the dates and times, choose from which boot you want to see logs. For example, to view logs from boot with id -2
with log level err
or higher:
~ $ journalctl -b -2 -p err
-- Logs begin at Sun 2020-03-08 18:58:41 CET, end at Fri 2020-04-10 14:59:59 CEST. --
apr 03 08:33:21 my-pc bluetoothd[1187]: ...
apr 03 08:33:25 my-pc bluetoothd[1187]: ...
apr 03 08:33:25 my-pc kernel: ...
apr 03 08:33:36 my-pc wpa_supplicant[1198]: ...
apr 03 08:33:36 my-pc wpa_supplicant[1198]: ...
...
If the logs above are not enough to troubleshoot your issues, then there are other log files to check:
/var/log/messages
- Most of the syslog messages should be here./var/log/boot.log
- System startup messages (Not related to syslog)
Alternatively, if you can't boot your machine normally, then you can access emergency.target
as shown above a view logs there the same way.
Conclusion
There is a lot more that can go wrong with Linux machine than what I shown in sections above. These examples/approaches though, can be applied to variety of other problems that you might encounter. Also, not all of them are life-and-death kind of a situation, but it's always preferable to be able solve them rather quickly, especially if this problematic machine is a production system. Solving most of the issues depends of getting the right information and being able to restore previous configurations, therefore it's crucial to always store logs and to backup critical files on your system before modifying them.