Had a very odd thing happen with a system at work. Near as we can tell, something went wrong in the kernel and we couldn’t recover from it — had to reboot. (Naturally, it was part of the critical path on a very important day at work: we couldn’t take the downtime hit during the day, not even fifteen minutes, so we had to nurse it through a couple of hours to make the daily maintenance window.)
The machine had been running fine for over 300 days. At about 10 AM, I was able to ssh in normally and run sudo crontab -e for another user to add a cron job. I also added a moderately sized chunk of data to a running MySQL instance — pretty ordinary fare.
A little bit later, at 11:28 AM, several things happened:
- syslog-ng stopped recording updates: no logs, local or remote, received any updates after 11:28.
- ssh sessions would not go through. Initially, ssh -v showed everything proceeding normally (key exchange, authentication, tty allocation), and then it would hang (consistent with a problem on the filesystem holding one’s home directory, maybe). Later on, maybe an hour into troubleshooting, ssh started simply dropping the connection almost immediately; I didn’t sniff the packets, but it happened within the first 4 or 5 lines of ssh -v output. Connection reset, IIRC.
- sudo and su attempts from ordinary users logged on the console would hang indefinitely. Ordinary unprivileged users could log in on the console fine. Root logins, however, would hang and then time out after 60 seconds.
- This machine does not rely on any external authentication protocols (NIS, LDAP, etc).
- Inside of /etc, nothing had changed since last week.
- /etc/passwd and /etc/group were intact. (I didn’t have access to read /etc/shadow, but it appeared to be a normal size.)
- Nothing pam-related had changed.
- net-snmpd continued to run (standalone), and continued to supply information to the monitoring server when it connected to query.
- MySQL continued to run fine, even accepting new remote connections and allowing me to run mysqldump without incident.
- Any attempt to use logger to write syslog messages would hang. /dev/log was present, a unix socket file, and had the correct permissions.
- The cron parent process would spawn children on schedule, but none of them ever did anything or exited.
- Inside /proc, various out-to-lunchy processes (such as a cron child) had missing data: the ‘exe’, ‘cwd’, etc. symlinks pointed nowhere, and ls -l inside the (mode 0755) directory returned “permission denied” for them.
- Obviously we could not strace anything running as root since any attempt to elevate privileges resulted in a hang (and a lost console, which is significant when you’ve only got 6 to work with).
- When, in desperation, I ran find / -mmin -90 -ls 2>/dev/null to see what might have changed on the system since it was last working properly, I turned up nothing more interesting than log files.
- There was nothing in dmesg, or in the log files I had permission to read (messages, auth, etc.).
- The date and time were correct on the system.
- One of the network-listening daemons runs as a user we use interactively, so I was able to look at some of its entries in /proc. Nothing interesting turned up there; the open libraries were limited to typical stuff like libnsl, libsocket, libld, and the pam libraries.
- Memory use was typical for the system, and not exhausted. The system’s load fluctuated slightly, but never really went over 1, and that only briefly; usually it hovered around 0.10 or so.
- So far as we can tell, nothing external precipitated the change (i.e., no one launched a new daemon or started a new program running on it, nor did anyone really connect to the available network services beyond the ordinary).
- syslog-ng was in a sleep state (Ss, according to ps).
- An strace of logger showed it opening the socket and then calling send() on it, but the call never returned (see the sketch just after this list).
- wtmp kept being updated with normal logins/logouts of unprivileged users.
- There was nothing wrong with the disks. Reads and writes continued totally normally.
- You couldn’t strace anything useful because, of course, an unprivileged strace can’t attach to a process running under another userid, and launching strace su was pointless too, since the kernel won’t honor the setuid bit on a binary traced by a non-root user (and, indeed, it just hung a terminal, which Certain Developers Who Shall Remain Nameless wouldn’t believe until I proved it – hrmph).
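(The sketch promised above: roughly how I’d try to pin down the logger hang next time. It assumes strace and a coreutils timeout are on the box, the probe message is made up, and nothing here needs root.)

```
# Confirm /dev/log is still a unix socket with sane ownership and permissions.
stat /dev/log

# Bound the hang and watch which syscall logger stops in. If the last thing
# shown is a send()/sendto() on the /dev/log socket and the command only ends
# when timeout kills it, whatever should be draining that socket is wedged,
# not logger itself.
timeout 10 strace -f -e trace=network \
    logger -p user.notice "syslog probe $(date +%s)"
echo "exit status: $?"
```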
We brainstormed about the differences between root-privileged and ordinary unprivileged logins, but in the end all we could think of was some additional logging, plus the fact that root’s home directory is on a different filesystem from our users’. (There is nothing interesting in root’s .profile…and anyway, sudo doesn’t set up root’s environment unless you specifically tell it to, which I wasn’t doing.)
In the end, we concluded that something was interfering with either socket access or syslog calls specifically, maybe some sort of stuck lock. My personal feeling is that something was wrong with the network stack inside the kernel, causing logging to die when it tried to access the /dev/log socket and also choking the ssh attempts. (In other news, am I totally wrong in thinking /dev is kind of an odd place to go leaving unix sockets lying about? It just seems to stick out.) Arguments against my theory include the fact that MySQL continued happily accepting and serving new network connections the entire time (although it only had the one listener and never had to open any new ports to listen on)… and also the fact that a screwed-up TCP/IP stack in a kernel version/patchlevel we run all over the place would be very, very bad.
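One check I wish I’d thought of at the time, to separate “syslog-ng simply stopped reading its socket” from “the kernel socket layer itself is wedged”: look at the queue on the /dev/log socket and at what the daemon is sleeping on. A rough sketch, assuming iproute2’s ss and procps’ pgrep are available (neither is guaranteed on a box of that vintage):

```
# Unix-domain sockets with their queues; a steadily growing Recv-Q on /dev/log
# would point at syslog-ng not reading, rather than at the stack as a whole.
ss -x | grep -F /dev/log

# The kernel function the daemon is sleeping in; /proc/<pid>/wchan is
# world-readable, so no privilege escalation needed.
cat /proc/$(pgrep -o syslog-ng)/wchan; echo
```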
Unfortunately, we had to crash the machine in the end. We obviously couldn’t elevate privileges enough to run reboot, and despite all my furious find / …-ing, I never did find a root-owned, setuid, writable file I could cat bash over top of to launch myself a root shell without authenticating (a sketch of that search is below). So I shut down MySQL and the userland processes I had access to, let it rest a minute to quasi-sync the disks, and then power-cycled it. (I feel bad every time I do that to one of my poor machines.) To its credit, it came back up nicely afterwards, with just a brief fsck to make me feel guilty. And the problem didn’t resurrect itself even after we brought the production services back online and ran some tests. It’s certainly an odd one.
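For the record, the escalation hunt mentioned above looked roughly like this. It’s a sketch only (-writable needs a reasonably modern GNU find; substitute -perm -o+w on older versions, which only catches world-writable files), and on any sanely configured box it should come back empty:

```
# Root-owned setuid files that the current (unprivileged) user can write to.
find / -xdev -type f -user root -perm -4000 -writable -ls 2>/dev/null
```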
I did learn a new trick, though. While taking some stabs in the dark to make sure certain requisite devices were still functioning, I ran a couple of dd if=/dev/(random|urandom)… jobs. Since I neglected to add of=/dev/null, though, I trashed my terminal pretty thoroughly with all the pretty binary characters. Neither reset nor stty sane had any effect, to my woe (by this time I was down to two working terminals, and now one of them was typed out in smiley faces and other useless garbage characters). A quick visit to the Googletron, though, revealed this tidbit: ^N and ^O switch one’s terminal between its secondary and primary character sets. While your primary character set is probably ASCII or something similarly letter-ish, the secondary is more likely to be line-art stuff for drawing curses menus and such, so if the binary spew splashing across your terminal includes a ^N, it’ll trip you over into line-art-land. In that case, a quick echo ^O (remember the ^V before the literal ^O) returns you to the primary character set, and legibility (see the snippet below). Instant resurrected terminal! I was very pleased. Do you have any idea how many terminals I’ve trashed by catting binary junk across them in the past 15 years? Seriously, I wish someone had told me this years ago; it would have saved me lots of terminal kill-and-restarts.
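For reference, here’s the same fix expressed as commands, for when typing the control characters by hand is awkward. The first line emits the Shift In (SI, 0x0f) character, i.e. the literal ^O; the second is the bigger hammer, a full terminal reset, in case the spew changed other modes too:

```
printf '\017'    # Shift In: switch back to the primary (G0) character set
printf '\033c'   # RIS: full terminal reset, if things are still scrambled
```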
At any rate. What questions do you ask when you come across a particularly weird problem you’ve never seen before? What questions would you have asked that I didn’t note above? What ideas do you have about what the problem might have been?