Linux is legendary for its stability - once set up correctly, a Linux box, left to its own devices, will run trouble-free for a very long time. Most problems arise soon after installation or major configuration changes, and are the result of misconfiguration, typographical errors or the occasional hardware failure.
However, from time to time accidents do happen, even in the best-regulated environments . . .
A Linux Troubleshooting Toolkit
The best way to minimise the impact of those unforeseeable events is to prepare for them by assembling the recovery tools in advance.
Tom's Root Boot Disk
An essential part of every Linux professional's bag of tricks, this tiny (by today's standards) package unpacks to create a 1.722 MB floppy disk that is a complete Linux distribution with a selection of recovery tools - until you see how it's done you'll find it hard to believe a single floppy can contain so much!
An alternative version comes in El Torito (bootable CD-ROM) format. You can download tomsrtbt from http://www.toms.net/rb/
Knoppix
This is a popular Linux distribution, based on Debian, which boots and runs entirely from CD-ROM. While it is popular for demonstrations, or for letting interested users get a taste of Linux without having to install a distribution on the hard drive, it is also incredibly useful as a system repair tool. You can download Knoppix from http://www.knopper.net/knoppix/index-en.html (read the notes on software patents, then click on the KNOPPIX link - it's still there).
mkbootdisk
Most Linux distributions have a command to build a bootable floppy disk which can be used to repair a system. Red Hat Linux, for example, has the mkbootdisk command. To use it, you only need to know the kernel version to write to the floppy, and you can find the current kernel version with the uname -r command:
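A minimal sketch of the two steps on a Red Hat system. The /dev/fd0 device path is an assumption about your floppy drive, and make_boot_floppy is a hypothetical helper name; it is only defined here, so nothing is written until you call it as root with a blank floppy inserted.

```shell
# Show the version of the running kernel.
kver=$(uname -r)
echo "Running kernel: $kver"

# Write a boot floppy for that kernel (Red Hat's mkbootdisk; run as root).
# /dev/fd0 is assumed to be the first floppy drive.
make_boot_floppy() {
    mkbootdisk --device /dev/fd0 "$kver"
}
```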
Other Boot Disks
Most Linux distributions allow you to boot from the first installation CD in a system repair or 'rescue' mode. For Red Hat, for example, using the first CD-ROM to boot with the command 'linux rescue' will boot the system and then attempt a number of basic repairs automatically. The repair script will attempt to identify all the Linux partitions on your hard drives and mount them in the correct location. At the end of this process, you should wind up with the system completely assembled and mounted under /mnt/sysimage.
Red Hat Linux Professional boxed sets of recent vintage also include a rather neat credit-card-sized rescue CD, and similar CDs are sometimes available from Linux-related company stands at trade shows.
Problems:
Can't Boot?
Watch the system closely as it boots, and take note of any error messages that appear. If the system complains that it is unable to mount the root filesystem, for example, this can be for any of several reasons:
- The BIOS cannot find the boot loader. This sometimes happens after you've installed Linux to dual-boot with Windows, but - out of concern to not misconfigure the system - have asked the install program to place the boot loader in the Linux root (or /boot) filesystem. The problem is that the BIOS can't see it there, unless you make that the active partition. The simplest fix is to reinstall Linux and this time, let it place the LILO or GRUB boot loader into the Master Boot Record - don't worry, the Linux boot loaders are automatically set up to let you choose Linux or Windows at boot time. It is possible to perform a more complex fix, for example by copying the Linux boot loader sector into a file, and setting up the Windows NT/2K/XP boot loader to chain to it - but that is too complex to describe here (see http://www.lesbell.com.au/Home.nsf/web/Using+the+NT+Boot+Loader+to+Boot+Linux?OpenDocument where you'll find a longer article describing how to use the NT boot loader to boot Linux).
- The kernel doesn't have a device driver to access the hard drive (e.g. a SCSI drive). Fix this by using the mkinitrd script to build a new initrd file that contains the correct drivers, or recompile the kernel to include the driver code. This usually happens because you've built a new kernel and slightly messed up the configuration.
- The kernel doesn't have a filesystem driver to access the root partition. For example, if the root filesystem is formatted with ext3, then you will need the ext3 and jbd modules in the initrd or compiled into the kernel. Fix as for the previous problem. Again, this usually happens after building a new kernel.
- The partition table has been modified, for example, by the installation of another operating system. In this case, edit the kernel command line (in /etc/lilo.conf or /boot/grub/menu.lst) and the contents of /etc/fstab to contain the correct entries. If you use LILO, re-run the lilo command after editing lilo.conf so that the change takes effect.
- Filesystems are corrupted, due to a power failure or system crash. Generally, after a system crash or power outage (what? No UPS?), the system will come up and repair itself. If you are using a journalling filesystem like ext3fs, jfs, xfs or reiserfs, it will usually perform a roll-forward recovery from its journal file and carry on. Even with the older ext2fs, the system usually runs an fsck (file system check) on the various file systems and repairs them automatically. However, just occasionally manual intervention is required; you might have to answer 'Y' to a string of questions (answering 'N' will get you nowhere unless you intend to perform really low-level repairs yourself in a last-ditch attempt to avoid data loss). In the worst case, you might have to reboot from rescue media and manually run the e2fsck (or similar) command against each filesystem in turn. For example:
e2fsck -p /dev/hda7
If the program complains that the superblock - the master block that links to everything else - is corrupted, it is useful to remember that the superblock is so critical that backup copies are kept throughout the filesystem (every 8192 blocks on a filesystem with 1 KB blocks), and you can tell e2fsck to use one of the backups:
e2fsck -b 8193 /dev/hda7
- One or more filesystems cannot be found and mounted: Check the contents of /etc/fstab - in making quick alterations here, typographical errors are common. You can use the e2label command to view the label of each filesystem: some distributions set these to the mount point so you can figure out what is what.
Forgot root password
If you have - really have - forgotten the root password for your system, it is still possible, in many cases, to log in and fix this. On some distributions, you can boot in single-user maintenance mode (runlevel 1) by appending a '1' or 'single' on the end of the normal kernel boot command line. With the LILO boot loader, for example, you can type
However, some distributions will still request the root password in runlevel 1. For those, you should append the option 'init=/bin/bash' to the kernel command line, e.g.
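As a sketch, assuming the LILO image is labelled 'linux' (use whatever label your own lilo.conf defines), the boot prompt entries would look like the comments below. The reset_root_password function is a hypothetical helper showing what you might then run at the resulting root shell; it is only defined here, not executed.

```shell
# Typed at the LILO "boot:" prompt, not in a shell
# ("linux" is an assumed image label):
#   boot: linux single
#   boot: linux init=/bin/bash

# Once at the root shell, set a new password. With init=/bin/bash the root
# filesystem is normally mounted read-only, so remount it read-write first,
# then flush the change to disk and go back to read-only.
reset_root_password() {
    mount -o remount,rw / &&
    passwd root &&
    sync &&
    mount -o remount,ro /
}
```

Because init is not running in the init=/bin/bash case, the usual shutdown commands may not work; remounting read-only before resetting the machine keeps the filesystem clean.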
Security Warning!
Now that everyone knows this tip, you should take care to set a LILO or GRUB password to stop an attacker from editing the boot command line and breaking into your system this way. Of course, an attacker could also remove the root password by booting from floppy or CD, so you should set the system to boot from hard drive first, and then password-protect the BIOS settings, too!
Can't Eject CD-ROM?
You can normally eject a CD using the eject command (and you can close the drive again later with eject -t). But what if you get a message:
No sound
Sound configuration is fairly tricky unless you know exactly what type of sound hardware you have - the chipset, not the brand of card. The simplest solution is to use the distribution's own sound configuration command - for Red Hat, this is redhat-config-soundcard or sndconfig (for the older versions).
X resolution too low or too high
Try using the left Ctrl and Alt keys with the + and - keys on the numeric pad to cycle through the various resolutions available on your system. You can also manually edit the XF86Config file (look in /etc/X11/ or nearby for this, depending on your distribution), then find the relevant Modes line, and comment out or remove inappropriate modes.
For example, if my monitor couldn't cope with 1400 x 1050 resolution, I would remove that entry from the Modes line in my XF86Config file:
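A sketch of that edit done from the command line. The Screen section below is a made-up sample, not taken from any real configuration, and the work is done on a scratch copy so that the real XF86Config is untouched. It assumes GNU sed for the -i option.

```shell
# Build a scratch file containing a sample (hypothetical) Screen section.
cfg=$(mktemp)
cat > "$cfg" <<'EOF'
Section "Screen"
    SubSection "Display"
        Depth 24
        Modes "1400x1050" "1280x1024" "1024x768" "800x600"
    EndSubSection
EndSection
EOF

# Remove the unsupported mode (and the space before it) from the Modes line.
sed -i 's/ *"1400x1050"//' "$cfg"

# Show the result.
grep Modes "$cfg"
```

On a real system you would make a backup copy of XF86Config first, apply the same change to it, and restart the X server.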
Find the Right Driver Module
You can make the system attempt to load every device driver module of any given type in turn by using the command
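One way to do this by hand is sketched below. It is an illustration only: module_name and try_modules are hypothetical helper names, and the directory layout under /lib/modules is an assumption that differs between kernel series. Nothing is loaded until you call try_modules as root.

```shell
# Derive a module name from a module file path,
# e.g. .../drivers/net/3c59x.o -> 3c59x
module_name() {
    name=${1##*/}                  # strip the directory part
    printf '%s\n' "${name%%.*}"    # strip .o, .ko, .ko.gz and similar
}

# Try each module under a driver directory until one loads (run as root).
# $1 is the driver type, e.g. "net" or "scsi" (assumed directory names).
try_modules() {
    dir=/lib/modules/$(uname -r)/kernel/drivers/$1
    find "$dir" -type f \( -name '*.o' -o -name '*.ko' -o -name '*.ko.*' \) |
    while read -r path; do
        mod=$(module_name "$path")
        if modprobe "$mod" 2>/dev/null; then
            echo "Loaded $mod"
            break
        fi
    done
}
```

For example, try_modules net would walk the network drivers. On newer kernels many drivers load successfully even when no matching hardware is present, so check the kernel messages afterwards rather than trusting the first success.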
Trouble-shooting techniques
Use pairs of similarly-configured systems
Quick things to check:
Is a filesystem full? This can show up in lots of different ways: being unable to save files, print jobs not spooling correctly (especially on Samba print/file servers), and so on. Use the df command to see available space:
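A short sketch: df -h gives the overview, and the hypothetical full_filesystems helper lists only the mount points at or above a chosen percentage (90 by default, an arbitrary threshold).

```shell
# Report free space on all mounted filesystems in human-readable units.
df -h

# List mount points whose usage is at or above a percentage threshold.
# Uses the portable (-P) output format: column 5 is Capacity, 6 the mount point.
full_filesystems() {
    df -P | awk -v limit="${1:-90}" '
        NR > 1 {
            use = $5
            sub(/%/, "", use)
            if (use + 0 >= limit + 0) print $6, $5
        }'
}

full_filesystems 90
```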
If you need to make space by deleting some large files, use the command 'ls -lS' to get a directory listing that is sorted by file size. To scan an entire filesystem (e.g. /home or /var) for the largest files, use the command:
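One possible form of such a scan, as a sketch: largest_files is a hypothetical helper name, and it relies on the -printf option of GNU find. Sizes are printed in bytes, largest first.

```shell
# Print the N largest regular files under a directory, biggest first,
# without crossing into other mounted filesystems (-xdev).
# $1 = directory to scan, $2 = number of files to show (default 20).
largest_files() {
    find "$1" -xdev -type f -printf '%s %p\n' 2>/dev/null |
        sort -rn | head -n "${2:-20}"
}
```

For example, largest_files /home 10 shows the ten biggest files under /home. Run it as root to include directories that ordinary users cannot read.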
Adding New Drives
Sometimes the growth of a filesystem - particularly /home - means that it is necessary to find it a new home; in other words, add another physical disk and relocate the filesystem to its new home where there is room to grow.
Here is the procedure for adding another drive, with a single partition which will become the new /home filesystem (I'm assuming fdisk has already been used to partition it):
As root:
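The steps might look like the sketch below. It makes several assumptions: the new partition is /dev/hdb1, /mnt/newhome is a free temporary mount point, the new filesystem is ext3, and /home is currently an ordinary directory on the root filesystem rather than a separate mount. The helper names are hypothetical, and move_home is only defined here, so nothing happens until you call it as root.

```shell
# Copy a directory tree, preserving ownership, permissions and timestamps
# (including hidden files).
copy_tree() {
    cp -a "$1"/. "$2"/
}

# Move /home onto a new partition. Run as root, ideally in single-user mode
# so that no files change during the copy.
move_home() {
    dev=/dev/hdb1                       # assumed device name
    mke2fs -j "$dev" &&                 # create an ext3 filesystem
    mkdir -p /mnt/newhome &&
    mount "$dev" /mnt/newhome &&        # mount it temporarily
    copy_tree /home /mnt/newhome &&     # copy the existing contents
    umount /mnt/newhome &&
    mv /home /home.old &&               # keep the old copy until verified
    mkdir /home &&
    mount "$dev" /home &&
    echo "$dev  /home  ext3  defaults  1 2" >> /etc/fstab
}
```

Once you have checked that the new /home is complete, /home.old can be removed to free the space on the original filesystem.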
Network Problems
Use the ifconfig command to check whether an interface has been configured and is up. For example:
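A small sketch, assuming the interface is called eth0. An interface that is up shows the UP flag in its ifconfig output; the hypothetical has_up_flag helper simply looks for that word.

```shell
# Succeeds if the ifconfig output read from stdin carries the UP flag.
has_up_flag() {
    grep -qw 'UP'
}

iface=eth0   # assumed interface name

# Only attempt the check where ifconfig is actually installed.
if command -v ifconfig >/dev/null 2>&1; then
    if ifconfig "$iface" 2>/dev/null | has_up_flag; then
        echo "$iface is up"
    else
        echo "$iface is down or not configured"
    fi
fi
```

Running ifconfig -a lists every interface the kernel knows about, including those that are down.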
Long delays while starting daemons at boot time
If the system seems to stop for 30 seconds or more while starting - particularly when starting network daemons like sendmail or NFS - then the problem is likely to be either DNS misconfiguration, a DNS outage, or no network connection at all. Check that /etc/resolv.conf contains the correct DNS addresses, check that /etc/hosts contains the correct IP address and names for this machine, and then check that the network interface is up.
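The first two checks can be sketched as a pair of helpers. The function names are hypothetical; each takes an optional file argument and defaults to the standard location.

```shell
# List the DNS servers configured in a resolv.conf-style file.
nameservers() {
    awk '$1 == "nameserver" { print $2 }' "${1:-/etc/resolv.conf}"
}

# Show the hosts-file entries that mention this machine's node name.
host_entries() {
    grep -w -- "$(uname -n)" "${1:-/etc/hosts}"
}
```

Calling nameservers with no argument prints the servers from /etc/resolv.conf; if host_entries prints nothing, this machine's name is missing from /etc/hosts.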
Troubleshooting Techniques and Skills
The first rule is: Use the log files - they are the primary source of debugging information and clues. You can examine the main log file with the command:
If trying to resolve boot-time problems, use the command:
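A sketch of both checks. It assumes the main log is /var/log/messages, the usual syslog location on Red Hat-style systems (other distributions may use a different file), and the helper names are hypothetical.

```shell
# Show the last lines of the main system log (usually needs root).
# $1 = log file (default /var/log/messages), $2 = number of lines (default 50).
recent_log() {
    tail -n "${2:-50}" "${1:-/var/log/messages}"
}

# Page through the kernel's boot-time messages.
boot_messages() {
    dmesg | less
}
```

To watch new entries arrive while you reproduce a problem, tail -f on the same log file is a handy variation.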
The next rule is to compare similarly-configured systems, if you have them. Often, you can see obvious differences in the configuration files between a working system and the broken system.
Next: if you are stumped, talk the problem over with a colleague or friend. They don't have to know the perfect solution - often, their suggestions can trigger a new line of thinking or remind you of something you have overlooked.
If you don't have someone you can talk to, then use online resources. Get to know how to perform searches at http://www.google.com/linux, and how to search the comp.os.linux and similar newsgroups at http://groups.google.com. On many occasions, I've turned up answers online after exhausting my own ideas.
Problem Avoidance Techniques
Keep a system change log. Whenever you make changes to the system, write them into the log. In general, a system that is never changed will just keep running, so when it does break, the problem is usually related to a recent change.
Before making changes to critical system configuration files, make a backup copy which you can restore if everything goes pear-shaped. For example:
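A minimal sketch of one way to do this: backup_file is a hypothetical helper that copies a file to the same name with today's date appended, so that several generations can sit side by side.

```shell
# Copy a file to <name>.<YYYYMMDD>, preserving mode and timestamps.
backup_file() {
    cp -p "$1" "$1.$(date +%Y%m%d)"
}
```

For example, backup_file /etc/fstab (as root) before editing the file; restoring is then a matter of copying the dated file back over the original.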
And, of course, the most important System Administration Rule of all: Never make changes after three p.m. on a Friday!
The chroot Command
The chroot command is extremely useful for both system security and for system repair. Its basic syntax is:
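As a sketch, the general form accepted by the GNU version of the command is shown in the comment below, followed by a hypothetical wrapper, run_in_root, that runs a single command inside another root directory. It requires root privileges and is only defined here, not executed.

```shell
# General form (GNU coreutils):
#   chroot NEWROOT [COMMAND [ARG]...]
# If COMMAND is omitted, chroot starts an interactive shell.

# Run a command with its root directory changed to $1 (requires root).
run_in_root() {
    newroot=$1
    shift
    chroot "$newroot" "$@"
}
```

For example, run_in_root /mnt/sysimage /bin/ls /etc lists the /etc directory of the system mounted under /mnt/sysimage rather than that of the running system.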
The chroot command is often used to start network daemons on servers - this is so that if an attacker manages to compromise the daemon, perhaps through a buffer overflow, he is unable to navigate around the entire system directory tree, but is instead constrained within a 'chroot jail'.
A major use of the chroot command is to change the root directory of the system after booting from a repair floppy or CD. For example, if you boot a Red Hat installation CD with the command 'linux rescue', the root file system is actually a RAM disk, and the root filesystem on your hard drive is mounted as /mnt/sysimage. Commands you give will load programs from /bin and /sbin on the RAM disk, which is obviously limited. To get access to those directories on the hard drive, you will need to change your root directory with the command
chroot /mnt/sysimage