This is a few days stale. In all the excitement of starting a new job, I’m slow to get the good gossip!

From The Register:

The Linux kernel has been purged of a bug that gave root access to untrusted users – again.

The vulnerability in a component of the operating system that translates values from 64 bits to 32 bits (and vice versa) was fixed once before – in 2007 with the release of version 2.6.22.7. But several months later, developers inadvertently rolled back the change, once again leaving the OS open to attacks that allow unprivileged users to gain full root access.

For more information: RedHat Bugzilla page, RHEL patch info, Fedora patch info, CentOS patch info, SLES/openSUSE update info, Slackware ChangeLog info, Gentoo discussion thread.

 

You know what I hate? I hate producing XML that looks perfectly valid, validates with the schema checker, and doesn’t cause the parser to throw errors when actually processing it, yet completely fails to do what it’s supposed to do.

In case you’re interested, the difference between

<interfaces config:type=”list”>

and

<interfaces>

in an AutoYaST install profile is, the first configures your network interfaces for you, and the second says “Oh, this bit here with the nic card and the IP address? Naaaah…. don’t bother. I never liked the Internet anyways. Let’s just mess around with xfractint or something. Networking is for people who can’t make their own fun.”

Bah, humbug, AutoYaST. Bah, humbug.

 

This is how I feel about a little shell script I threw together this morning:

Garfield

I wrote it, I ran it, it ran without error in test mode… it ran without error in a state where it didn’t have anything to do… it ran without error in a state where it had to fix something… I never write a shell script without at least screwing up one ” mark with a ‘ mark or something…

I KNOW I PUT A TYPO OR A BUG OR SOMETHING IN THERE SOMEWHERE, NOW WHERE IS IT?!?!?!

 

Had a very odd thing happen with a system at work. Near as we can tell, something went wrong in the kernel and we couldn’t recover from it — had to reboot. (Naturally, it was a part of the critical path during a highly significant day at work: we couldn’t take the downtime hit during the day, even fifteen minutes. We had to nurse it through a couple hours to make the daily maintenance window.)

The machine had been running fine for over 300 days. At about 10 AM, I was able to ssh in normally, and sudo crontab -e for another user to add a cron job. I also added a moderately sized chunk of data to a running MySQL instance — pretty ordinary fare.

A little bit later, at 11:28 AM, several things happened:
- syslog-ng stopped recording updates: no logs, local or remote, received any updates after 11:28.
- ssh sessions would not go through. Initially, ssh -v showed everything normal, the key exchange, authentication, tty allocation, and then it would hang (consistent with a problem with the filesystem one’s home directory is on, maybe). Later on, maybe an hour into troubleshooting, ssh started simply dropping the connection nearly immediately — I didn’t sniff the packets, but it was within 4 or 5 of the first lines of ssh -v output. Connection reset, IIRC.
- sudo and su attempts from ordinary users logged on the console would hang indefinitely. Ordinary unprivileged users could log in on the console fine. Root logins, however, would hang and then time out after 60 seconds.
- This machine does not rely on any external authentication protocols (NIS, LDAP, etc).
- Inside of /etc, nothing had changed since last week.
- /etc/passwd and /etc/group were intact. (I didn’t have access to read /etc/shadow, but it was what looked like a normal size.)
- Nothing pam-related had changed.
- net-snmpd continued to run (standalone), and continued to supply information to the monitoring server when it connected to query.
- MySQL continued to run fine, even accepting new remote connections and allowing me to run mysqldump without incident.
- Any attempt to use logger to write syslog messages would hang. /dev/log was present, a unix socket file, and had the correct permissions.
- The cron parent process would spawn children on schedule, but none of them would ever do anything, or exit.
- Inside /proc, various out-to-lunchy processes (such as a cron child) would have missing data — the symlinks ‘exe,’ ‘cwd,’ etc. pointed nowhere and would return “permission denied” when I ran ls -l inside the (mode 0755) directory.
- Obviously we could not strace anything running as root since any attempt to elevate privileges resulted in a hang (and a lost console, which is significant when you’ve only got 6 to work with).
- When, in desperation, I ran find / -mmin -90 -ls 2>/dev/null, to see what might have changed on the system since I was last able to function, I turned up nothing more interesting than log files.
- There was nothing in dmesg, or in the log files which I had permissions to read (messages, auth, etc).
- The date and time were correct on the system.
- One of the network-listening daemons runs as a user we use interactively, so I was able to get access to look at some stuff in /proc. Nothing was interesting in there — open libraries were limited to typical stuff like libnsl, libsocket, libld, and pam libraries.
- Memory use was typical for the system, and not exhausted. The system’s load fluctuated slightly, but never really went over 1, and that only briefly; usually it hovered around 0.10 or so.
- So far as we can tell, nothing external precipitated the change (i.e., no one launched a new daemon or started a new program running on it, nor even really did anyone connect to the available network services beyond the ordinary).
- syslog-ng was in a sleep state (Ss, according to ps).
- An strace of logger showed it opening the socket connection and then trying to send() to it, but it never returned from that.
- wtmp kept being updated with normal logins/logouts of unprivileged users.
- There was nothing wrong with the disks. Reads and writes continued totally normally.
- You couldn’t strace anything useful because, of course, a userland strace can’t attach to a process running under another userid, so launching strace su was pointless (and, indeed, just hung a terminal, which Certain Developers Who Shall Remain Nameless wouldn’t believe me about until I proved it – hrmph).

We brainstormed about the differences between root privileged and ordinary unprivileged logins, but in the end, all we could think of was some additional logging, and the fact that root’s home directory is on another filesystem than our users. (There is nothing interesting in root’s .profile…and anyway, sudo doesn’t set up root’s environment unless you specifically tell it to, which I wasn’t.)

In the end, we concluded that what had happened was that there was something interfering with either socket access or syslog calls specifically, maybe some sort of stuck lock. My personal feeling is that there was something wrong with the network stack inside the kernel, causing logging to die when it tried to access the /dev/log socket, and also choking the ssh attempts. (In other news, am I totally wrong in thinking /dev is a kind of odd place to go leaving unix sockets lying about? It just seems to stick out.) Arguments against my theory include the fact that mysql continued happily accepting and serving new network connections the entire time (although it only had the one listener and didn’t have to open up any new ports to listen to)… also, the fact that a screwed-up tcp/ip stack in the kernel (in a kernel version/patchlevel we run all over the place) would be very, very bad.

Unfortunately, we had to crash the machine in the end — since obviously we couldn’t elevate privs enough to run reboot (and despite all my furious find / …-ing, I never did find a root-owned, setuid, writable file I could cat bash over top of and launch myself a root shell without authenticating), I shut down MySQL, and the processes running in userland that I had access to, let it rest a minute to quasi-sync the disks, and then power-cycled it. (I feel bad every time I do that to one of my poor machines.) To its credit, it came back up nicely afterwards, with just a brief fsck to make me feel guilty. And the problem didn’t resurrect itself even after we brought the production services back online and ran some tests. It’s certainly an odd one.

I did learn a new trick though. While taking some stabs in the dark to make sure certain requisite devices were still functioning, I ran a couple dd if=/dev/(random|urandom)… jobs. Since I neglected to do of=/dev/null, though, I trashed my terminal pretty thoroughly with all the pretty binary characters. Neither reset nor stty sane had any effect, to my woe (by this time I was down to two working terminals, and now one of them was typed out in smiley faces and other useless garbage characters). A quick visit to the Googletron, though, revealed this tidbit: ^N and ^O switch one’s terminal between its secondary and primary charactersets — and while your primary charset is probably ascii or something similarly letter-ish, the secondary is more likely to be line art stuff for drawing curses menus and such. So if the binary spew splashing across your terminal includes a ^N, it’ll trip you over into line art-land. In that case, a quick echo ^O (remember the ^V before the literal ^O) returns you back to the primary character set, and legibility. Instant resurrected terminal! I was very pleased. Do you have any idea how many terminals I’ve trashed by catting binary junk across them in the past 15 years? Seriously, I wish someone would have told me this years ago, it would have saved me lots of terminal kill-and-restarts.

At any rate. What questions do you ask when you come across a particularly weird problem you’ve never seen before? What questions would you have asked that I didn’t note above? What ideas do you have about what the problem might have been?

 

In our infrastructure, where I work, we use configuration management to manage our software installs. Since we’re using OpenSUSE, that means (by default) using RPM. I’ve got a YUM repository that I maintain our packages in, and that source is added to zypper as a part of the system tweaks at install time, and then cfengine uses zypper to do installs. It’s actually not half bad at all — now that I’ve learned to love the spec file. (At the outset, I admit, I thought I was doing my usual over-engineering of the solution, rather than merely Doing the Right Thing. Now that it works, though: obviously I was right all along!)

We recently bought licenses for SarCheck, which is a package designed to take your collected system statistics and translate them into plain English, so you can go from looking at statistics like X% blocked read i/o on device D to being told “You should split /var/foo off onto a new disk so it’s not fighting /var/bar for I/O bandwidth.” It’s actually pretty slick. At a previous employer, I used it on several Solaris machines. Now, I use it on Linux. Normally, I would just deploy the software — since there’s no way I want to go and hand-install it on 50 hosts — by dropping the RPM into the friendly local yum repository and rebuilding my indices, then tell cfengine to drop it where I want it to go. The only problem with this plan is that it’s not distributed as an RPM — you’re given a compressed tarball, which installs into /opt/sarcheck.

I really don’t like that part. :)

I’m not normally an enormous stomper and shouter about The Rules and The Standards. I like things to be orderly, but the strength of UNIX is flexibility, after all, right? But. Let’s not be all crazy here. The flexibility is the chocolate inside the delicious candy coating of history and tradition, and yes, standards. The Filesystem Hierarchy Standard is a nice, short one, which most of us already know: /bin is where you put your binaries, /var is for data, and /opt is for add-on packages. Coming back to the original topic, part of the deal with SarCheck is that, in the tarball as distributed, it writes data files within /opt/sarcheck — it stores information on the stuff it gets out of /proc to analyze, and it optionally can also store ps output for analysis. This is, I think, data that should not be living on /opt — especially since most of my systems have /opt on /, and I don’t want data living there. (Rest assured that any criticism you’d like to level at my site after that revelation has already been made.)

So now, not only do I need to get this tarball into an RPM — oh, I didn’t mention the part where it’s not open source, did I? It’s not open source. — but I also need to go mucking around in its innards.
Continue reading »

© 2012 /etc/system Suffusion theme by Sayontan Sinha