Grrrr.
You know what I hate? I hate producing XML that looks perfectly valid, validates with the schema checker, and doesn’t cause the parser to throw errors when actually processing it, yet completely fails to do what it’s supposed to do.
In case you’re interested, the difference between
<interfaces config:type="list">
and
<interfaces>
in an AutoYaST install profile is, the first configures your network interfaces for you, and the second says “Oh, this bit here with the nic card and the IP address? Naaaah…. don’t bother. I never liked the Internet anyways. Let’s just mess around with xfractint or something. Networking is for people who can’t make their own fun.”
Bah, humbug, AutoYaST. Bah, humbug.
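Since neither the schema checker nor the parser complains about the missing attribute, a small pre-flight check can catch it before an install does. This is a minimal sketch, not anything AutoYaST ships: the function name `untyped_lists` and the `list_names` parameter are made up for illustration, and the `config` namespace URI is assumed to be the one declared at the top of a typical profile, so check it against your own.

```python
import xml.etree.ElementTree as ET

# Assumption: the URI bound to the "config:" prefix in the profile header.
CONFIG_NS = "http://www.suse.com/1.0/configns"
TYPE_ATTR = "{%s}type" % CONFIG_NS


def untyped_lists(profile_xml, list_names=("interfaces",)):
    """Return the names of elements that should carry config:type="list" but don't."""
    root = ET.fromstring(profile_xml)
    missing = []
    for elem in root.iter():
        local_name = elem.tag.rsplit("}", 1)[-1]  # strip any "{namespace}" prefix
        if local_name in list_names and elem.get(TYPE_ATTR) != "list":
            missing.append(local_name)
    return missing


if __name__ == "__main__":
    import sys
    with open(sys.argv[1]) as handle:
        for name in untyped_lists(handle.read()):
            print("missing config:type=\"list\" on <%s>" % name)
```

Which element names belong in `list_names` depends on the profile; only `interfaces` comes from the example above.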
Raiding on my parade
I had to do something kind of weird the other day — I had to take my perfectly well mirrored disks and split up the raid volumes, because I wanted to start doing some testing on that server, and use the second disk for space. As it turns out, there are a great many tutorials on the vast googletrons to help you set up, manage, or break Linux raid mirrors, but not so many when it comes to wiping away all evidence that you’d ever entrusted your system to such a beast.
Well, on the bright side, at least that means I get material for a blog post out of it!
It turns out that it’s actually fairly easy to do, but a shade tricky. As you probably already know if you’ve ever had to boot off one raided disk after losing part of your mirror, there’s really nothing special about the partitions that you’ve put in your raid volumes — they’re still secretly ext3, or whichever filesystem type you’ve chosen. But the partition type is set to 0xfd, Linux raid, rather than 0x83, Linux. And the raid superblock on each partition records the fact that it’s a member of a raid set — which is the reason that even if you boot off alternate media with no config files or anything, the system will see the slices’ raid information and happily treat them as raid members, even while you are tearing your hair out trying to make it see a partition as just a plain old partition.
So here’s the steps I was following to wash that raid right out of my hair:
- Used cfdisk -P r to dump out my disk partition information in case I accidentally trashed it.
- Saved my /proc/mdstat and df output so I knew which filesystems were on which partitions and wouldn’t have to bother guessing.
- Booted off alternate media. (I used a net boot server but you could use IPMI virtual media, a DVD, USB stick, etc.)
- mdadm /dev/mdX --fail /dev/sdbX to disable half of the mirror, then mdadm /dev/mdX --remove /dev/sdbX to remove it from the volume. Repeat for all slices on /dev/sdb. (I am not really certain if this step is truly necessary — but I did use it, so it’s here for completeness’ sake. If I get a chance to down-convert another raided host I’ll note this appropriately.)
- mdadm --stop /dev/md0 to stop the raid volume.
- mdadm --zero-superblock /dev/sdaX and mdadm --zero-superblock /dev/sdbX to remove the raid information from the filesystems.
- fdisk /dev/sda and fdisk /dev/sdb to change the partition types from 0xfd to 0x83.
- mount /dev/sda1 /mnt; mount /dev/sdaX /mnt/boot to mount the root and /boot filesystems from one half of the ex-mirror (substituting your own partition numbers).
- Edit /mnt/etc/fstab to have the new plain sdaX partitions instead of mdX volumes.
- Edit /mnt/boot/grub/menu.lst to have the appropriate devices as well.
- Fix any other references to /dev/mdX you might have littered around your system.
- Umount /mnt/boot and /mnt, and repeat for sdb as appropriate.
- Reboot!
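The two bookkeeping steps in that list (saving /proc/mdstat, then swapping mdX for sdaX in fstab) can be done by a small helper rather than by hand. This is a minimal Python sketch and not part of the procedure as it was actually run: the function names `md_members` and `rewrite_fstab` are hypothetical, and it assumes the usual /proc/mdstat layout in which each array line looks like `md0 : active raid1 sdb1[1] sda1[0]`. It only transforms text and never touches a disk.

```python
import re


def md_members(mdstat_text, disk):
    """Map each md device to its member partition on the given disk (e.g. "sda")."""
    mapping = {}
    for line in mdstat_text.splitlines():
        header = re.match(r"(md\d+)\s*:\s*(.*)", line)
        if not header:
            continue  # skip "Personalities", block counts, and other detail lines
        # Members are listed as name[slot], optionally followed by flags such as (F).
        for partition in re.findall(r"(\w+)\[\d+\]", header.group(2)):
            if partition.startswith(disk):
                mapping[header.group(1)] = partition
    return mapping


def rewrite_fstab(fstab_text, mapping):
    """Replace /dev/mdN with the matching plain partition; leave unknown devices alone."""
    def swap(match):
        md_name = match.group(1)
        if md_name in mapping:
            return "/dev/" + mapping[md_name]
        return match.group(0)
    return re.sub(r"/dev/(md\d+)", swap, fstab_text)
```

Feed it the mdstat output saved before booting off alternate media and the contents of /mnt/etc/fstab, then review the result before writing it back. The same mapping is a handy reference when fixing menu.lst by hand.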
It wasn’t that difficult, really. In fact, although it took me a while to bumble my way through it the first time (mostly since I wasn’t aware of the superblock aspect, and it was driving me nuts every time I’d boot up in some alternative fashion and it would resurrect my raid volumes from the should-have-been-dead), it would probably take me all of about 15 minutes to do now, and most of that would be time spent rebooting.
Of course, after all the fuss, we ended up not really using that machine as a test host after all, but at least it was a good learning opportunity. It’s not nearly often enough that you get the chance to potentially totally blow up all your filesystems with no consequences. :)
I know it’s in there.
This is how I feel about a little shell script I threw together this morning:
I wrote it, I ran it, it ran without error in test mode… it ran without error in a state where it didn’t have anything to do… it ran without error in a state where it had to fix something… I never write a shell script without at least screwing up one ” mark with a ‘ mark or something…
I KNOW I PUT A TYPO OR A BUG OR SOMETHING IN THERE SOMEWHERE, NOW WHERE IS IT?!?!?!
“Good enough, move on”
The Slacker Manager blog covered this topic several months ago in an article about a concept they named GEMO: “Good enough, move on”:
You work at something, you begin to run out of steam or you know more needs to be done but there are other projects and things that need to be done so you say, GEMO. You move on and you know you can come back to it and improve it later.
In the past few days, the thing all the cool kids are talking about is The Cult of Done Manifesto, which is a 13-rule list of things which boil down to, more or less, “do something, anything, whatever, it doesn’t matter — just do it.”
The nature of ‘production’
What does it mean for a system to be ‘in production’?
This came up for me on Friday. We have a NAS unit, a shiny new NetApp filer, which was recently purchased and set up, and although we were testing things with it, we hadn’t yet taken the final steps — it was not directly doing anything mission critical. Oh, it’s connected to several systems all right, some of which are mission critical, but itself, it’s not in the line of fire. Its documentation was not finished. But I was getting ready to pull the trigger for one application, replacing one Linux-based NFS server with the NetApp. I planned to take the weekend to sync the data and swap the NFS clients over, easy peasy. In fact, I was so not worried about it (despite the fact that the NFS server it was replacing is of paramount importance to one production software platform) that I took Thursday off to attend a full-day workshop extolling the virtues of NetApp and how to leverage various features to save time doing backups, maximize resource usage, etc.
This made it extra ironic when I got paged at 3PM that the NetApp was “down.”
A stumper
Had a very odd thing happen with a system at work. Near as we can tell, something went wrong in the kernel and we couldn’t recover from it — had to reboot. (Naturally, it was a part of the critical path during a highly significant day at work: we couldn’t take the downtime hit during the day, even fifteen minutes. We had to nurse it through a couple hours to make the daily maintenance window.)
The machine had been running fine for over 300 days. At about 10 AM, I was able to ssh in normally, and sudo crontab -e for another user to add a cron job. I also added a moderately sized chunk of data to a running MySQL instance — pretty ordinary fare.
A little bit later, at 11:28 AM, several things happened:
- syslog-ng stopped recording updates: no logs, local or remote, received any updates after 11:28.
- ssh sessions would not go through. Initially, ssh -v showed everything normal, the key exchange, authentication, tty allocation, and then it would hang (consistent with a problem with the filesystem one’s home directory is on, maybe). Later on, maybe an hour into troubleshooting, ssh started simply dropping the connection nearly immediately — I didn’t sniff the packets, but it was within 4 or 5 of the first lines of ssh -v output. Connection reset, IIRC.
- sudo and su attempts from ordinary users logged on the console would hang indefinitely. Ordinary unprivileged users could log in on the console fine. Root logins, however, would hang and then time out after 60 seconds.
- This machine does not rely on any external authentication protocols (NIS, LDAP, etc).
- Inside of /etc, nothing had changed since last week.
- /etc/passwd and /etc/group were intact. (I didn’t have access to read /etc/shadow, but it looked like a normal size.)
- Nothing pam-related had changed.
- net-snmpd continued to run (standalone), and continued to supply information to the monitoring server when it connected to query.
- MySQL continued to run fine, even accepting new remote connections and allowing me to run mysqldump without incident.
- Any attempt to use logger to write syslog messages would hang. /dev/log was present, a unix socket file, and had the correct permissions.
- The cron parent process would spawn children on schedule, but none of them would ever do anything, or exit.
- Inside /proc, various out-to-lunchy processes (such as a cron child) would have missing data — the symlinks ‘exe,’ ‘cwd,’ etc. pointed nowhere and would return “permission denied” when I ran ls -l inside the (mode 0755) directory.
- Obviously we could not strace anything running as root since any attempt to elevate privileges resulted in a hang (and a lost console, which is significant when you’ve only got 6 to work with).
- When, in desperation, I ran find / -mmin -90 -ls 2>/dev/null, to see what might have changed on the system since I was last able to function, I turned up nothing more interesting than log files.
- There was nothing in dmesg, or in the log files which I had permissions to read (messages, auth, etc).
- The date and time were correct on the system.
- One of the network-listening daemons runs as a user we use interactively, so I was able to get access to look at some stuff in /proc. Nothing was interesting in there — open libraries were limited to typical stuff like libnsl, libsocket, libld, and pam libraries.
- Memory use was typical for the system, and not exhausted. The system’s load fluctuated slightly, but never really went over 1, and that only briefly; usually it hovered around 0.10 or so.
- So far as we can tell, nothing external precipitated the change (i.e., no one launched a new daemon or started a new program running on it, nor even really did anyone connect to the available network services beyond the ordinary).
- syslog-ng was in a sleep state (Ss, according to ps).
- An strace of logger showed it opening the socket connection and then trying to send() to it, but it never returned from that.
- wtmp kept being updated with normal logins/logouts of unprivileged users.
- There was nothing wrong with the disks. Reads and writes continued totally normally.
- You couldn’t strace anything useful because, of course, a userland strace can’t attach to a process running under another userid, so launching strace su was pointless (and, indeed, just hung a terminal, which Certain Developers Who Shall Remain Nameless wouldn’t believe me about until I proved it - hrmph).
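Several of the symptoms above come down to a write to /dev/log that never returns, and every hung logger cost a terminal. One way to poke at the socket without risking that is to put a timeout on the send. This is a minimal Python sketch under stated assumptions, not something that was run on the machine in question: the function name `probe_syslog` is hypothetical, and a timeout only helps if the send is blocked in the ordinary way (for example, on a full socket buffer) rather than stuck somewhere uninterruptible in the kernel.

```python
import errno
import socket


def probe_syslog(path="/dev/log", timeout=2.0):
    """Return "ok", "timeout", or "error: ..." instead of hanging like logger did."""
    result = "error: no socket type accepted"
    # Syslog daemons listen on either a datagram or a stream socket; try both.
    for kind in (socket.SOCK_DGRAM, socket.SOCK_STREAM):
        sock = socket.socket(socket.AF_UNIX, kind)
        sock.settimeout(timeout)
        try:
            sock.connect(path)
            # <14> is facility "user", severity "info".
            sock.send(b"<14>probe: syslog socket check\n")
            return "ok"
        except socket.timeout:
            return "timeout"
        except OSError as exc:
            result = "error: %s" % (exc.strerror or exc)
            if exc.errno != errno.EPROTOTYPE:
                break  # wrong socket type is the only error worth retrying
        finally:
            sock.close()
    return result


if __name__ == "__main__":
    print(probe_syslog())
```

A "timeout" result would point at the daemon not draining its socket, while a probe that itself never comes back would point further down, toward the kernel.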
We brainstormed about the differences between root privileged and ordinary unprivileged logins, but in the end, all we could think of was some additional logging, and the fact that root’s home directory is on a different filesystem than our users’ home directories. (There is nothing interesting in root’s .profile…and anyway, sudo doesn’t set up root’s environment unless you specifically tell it to, which I wasn’t.)
In the end, we concluded that what had happened was that there was something interfering with either socket access or syslog calls specifically, maybe some sort of stuck lock. My personal feeling is that there was something wrong with the network stack inside the kernel, causing logging to die when it tried to access the /dev/log socket, and also choking the ssh attempts. (In other news, am I totally wrong in thinking /dev is a kind of odd place to go leaving unix sockets lying about? It just seems to stick out.) Arguments against my theory include the fact that mysql continued happily accepting and serving new network connections the entire time (although it only had the one listener and didn’t have to open up any new ports to listen to)… also, the fact that a screwed-up tcp/ip stack in the kernel (in a kernel version/patchlevel we run all over the place) would be very, very bad.
Unfortunately, we had to crash the machine in the end — since obviously we couldn’t elevate privs enough to run reboot (and despite all my furious find / …-ing, I never did find a root-owned, setuid, writable file I could cat bash over top of and launch myself a root shell without authenticating), I shut down MySQL, and the processes running in userland that I had access to, let it rest a minute to quasi-sync the disks, and then power-cycled it. (I feel bad every time I do that to one of my poor machines.) To its credit, it came back up nicely afterwards, with just a brief fsck to make me feel guilty. And the problem didn’t resurrect itself even after we brought the production services back online and ran some tests. It’s certainly an odd one.
I did learn a new trick though. While taking some stabs in the dark to make sure certain requisite devices were still functioning, I ran a couple dd if=/dev/(random|urandom)… jobs. Since I neglected to do of=/dev/null, though, I trashed my terminal pretty thoroughly with all the pretty binary characters. Neither reset nor stty sane had any effect, to my woe (by this time I was down to two working terminals, and now one of them was typed out in smiley faces and other useless garbage characters). A quick visit to the Googletron, though, revealed this tidbit: ^N (shift out) and ^O (shift in) switch one’s terminal between its secondary and primary character sets — and while your primary charset is probably ASCII or something similarly letter-ish, the secondary is more likely to be line art stuff for drawing curses menus and such. So if the binary spew splashing across your terminal includes a ^N, it’ll trip you over into line art-land. In that case, a quick echo ^O (remember the ^V before the literal ^O) returns you to the primary character set, and legibility. Instant resurrected terminal! I was very pleased. Do you have any idea how many terminals I’ve trashed by catting binary junk across them in the past 15 years? Seriously, I wish someone had told me this years ago; it would have saved me lots of terminal kill-and-restarts.
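The same rescue can be scripted for the next time it happens. A small Python sketch (the function name `restore_primary_charset` is made up for illustration) that writes the Shift In control character, which is the byte that ^O sends:

```python
import sys

SHIFT_OUT = "\x0e"  # Ctrl-N: switch to the secondary (line-drawing) character set
SHIFT_IN = "\x0f"   # Ctrl-O: switch back to the primary character set


def restore_primary_charset(stream=sys.stdout):
    """Send Shift In so a terminal stuck in the secondary set becomes legible again."""
    stream.write(SHIFT_IN)
    stream.flush()


if __name__ == "__main__":
    restore_primary_charset()
```

From a shell prompt, `printf '\017'` sends the same byte without needing the ^V dance.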
At any rate. What questions do you ask when you come across a particularly weird problem you’ve never seen before? What questions would you have asked that I didn’t note above? What ideas do you have about what the problem might have been?
How not to sell me your staffing services
*ring ring*
Me: Hello, this is Sabrina.
$(Colleague): Hey, Sabrina, I have a phone call for you. What’s your outside number?
Me: *gives it to her*
$(Colleague): Okay, thanks. I’ll put him through now.
*clicky*
Me: Hello?
$(Recruiter): Hello! This is ICantRead with WellKnownRecruiting Group!
Me: …okay.
$(Recruiter): …
*click*
*ring* *ring*
Me: Hello, this is Sabrina.
$(Recruiter): Hello! This is ICantRead with WellKnownRecruiting Group!
Me: Yes. You just called me.
$(Recruiter): I have some Windows/Cisco engineering candidates I’d like to talk to you about!
Me: I don’t hire Windows or Cisco people. How did you get my contact information?
$(Recruiter): Are you more on the Unix side?
Me: Yes. How did you get my contact information?
$(Recruiter): I talked to your assistant!
Me: I don’t have an assistant. You talked to someone else inside the company and then hung up on me. How did you get my contact information?
$(Recruiter): We must have gotten disconnected!
Me: No, you said “Hello, this is ICantRead from WellKnownRecruiting Group,” and I said “okay,” and then you hung up on me. How did you get my contact information?
$(Recruiter): I talked to your assistant.
Me: I don’t have an assistant. How did you get that phone number?
$(Recruiter): I dialed YourCompany!
Me: But how did you get my contact information?
$(Recruiter): There was a job posting on Simple Hire!
Me: I’ve never heard of that site, but I do have a job posting on Dice, and that one says “No phone calls and no recruiters.”
$(Recruiter): Well, this was on Simple Hire!
Me: That’s nice, but I don’t know what Simple Hire is, and my only job posting says “No recruiters.”
$(Recruiter): So you work more with Linux?
Me: Have a good weekend.
*click*
I assume the approach philosophy here is “if you annoy them enough, they will give in and buy your services just to get you to shut up.”
Using RPM to install SarCheck
In our infrastructure, where I work, we use configuration management to manage our software installs. Since we’re using openSUSE, that means (by default) using RPM. I’ve got a YUM repository that I maintain our packages in, and that source is added to zypper as a part of the system tweaks at install time, and then cfengine uses zypper to do installs. It’s actually not half bad at all — now that I’ve learned to love the spec file. (At the outset, I admit, I thought I was doing my usual over-engineering of the solution, rather than merely Doing the Right Thing. Now that it works, though: obviously I was right all along!)
We recently bought licenses for SarCheck, which is a package designed to take your collected system statistics and translate them into plain English, so you can go from looking at statistics like X% blocked read i/o on device D to being told “You should split /var/foo off onto a new disk so it’s not fighting /var/bar for I/O bandwidth.” It’s actually pretty slick. At a previous employer, I used it on several Solaris machines. Now, I use it on Linux. Normally, I would just deploy the software — since there’s no way I want to go and hand-install it on 50 hosts — by dropping the RPM into the friendly local yum repository and rebuilding my indices, then tell cfengine to drop it where I want it to go. The only problem with this plan is that it’s not distributed as an RPM — you’re given a compressed tarball, which installs into /opt/sarcheck.
I really don’t like that part. :)
I’m not normally an enormous stomper and shouter about The Rules and The Standards. I like things to be orderly, but the strength of UNIX is flexibility, after all, right? But. Let’s not be all crazy here. The flexibility is the chocolate inside the delicious candy coating of history and tradition, and yes, standards. The Filesystem Hierarchy Standard is a nice, short one, which most of us already know: /bin is where you put your binaries, /var is for data, and /opt is for add-on packages. Coming back to the original topic, part of the deal with SarCheck is that, in the tarball as distributed, it writes data files within /opt/sarcheck — it stores information on the stuff it gets out of /proc to analyze, and it optionally can also store ps output for analysis. This is, I think, data that should not be living on /opt — especially since most of my systems have /opt on /, and I don’t want data living there. (Rest assured that any criticism you’d like to level at my site after that revelation has already been made.)
So now, not only do I need to get this tarball into an RPM — oh, I didn’t mention the part where it’s not open source, did I? It’s not open source. — but I also need to go mucking around in its innards.
Read the rest of this entry »
Link roundup
Some useful stuff:
- Tools of the Trade — Iostat, Vmstat, and Netstat - article at the now-defunct SysAdmin magazine about using the named tools to troubleshoot performance issues. Good intro if you’re unfamiliar; syntax used is for Solaris and AIX.
- Sharding and Time-Base Partitioning - article from MySQL Performance Blog discussing aspects of dividing data horizontally to scale performance.
How to make your hiring manager cry
I feel a little let down. You see, I’m hiring for an open sysadmin position, and … it turns out that it is hard work. Curse you, recession, keeping everybody from wanting to job-hop willy-nilly!
Over the past couple of weeks, I’ve posted the job to LOPSA and SAGE’s job boards. (Representin’ my sysadminly peeps, yo.) I even posted to chi.jobs (eek!). I pinged via LinkedIn, I pinged via people I drink beer with. I pinged and pinged and generally looked sad and overworked yet hopeful. The only firm ping I’ve gotten in return, however, was a cold-calling recruiter who wanted to pitch some of her candidates to me. So today I used our corporate account and posted to Dice. And then… I started browsing resumes.
Ai yi yi! So many typos. So many people claiming proficiency in the programming language “Linux.” So many with vi skills. (Guys. I am a vi user myself. I love me my visual editor. Still… it is a text editor. It’s like saying that, as a car driver, you have experience with moving the shifter. — Now, if you were a coder and you were badass with lisp, I could see listing emacs. But let’s be real. It’s vi. It’s not that hard.) Aieee. Also, if the only employment experience you are listing is a two-year stint as a student employee at your college help desk, don’t insult me: you do not have 10 years professional experience with Linux. I’ll accept that you may have been playing with it since you were 10, but you probably weren’t doing it 40 hours a week, okay? (And if you were, someone needs to arrest your parents.)
Someone. Save me. Get me an awesome mid-level sysadmin candidate who knows the difference between RPM and .deb. Please. Save me from reading more of these resumes before my mind numbs over for good.
I signed up for a job seeker account as well, so as to scope out the competition. Seems like there are a bunch of “unique,” “prestigious,” and “fast-paced” trading firms hiring out there. I was tempted to rewrite my posting to call us “fun,” “laid-back,” and “awesome yet not stuck up about it,” just to taunt the other guys. I didn’t, though. That would have been snarky. (Also, my posting already has more personality than theirs out of the gate — especially the one that started out with the slightly hostile note, “We do NOT accept unsolicited calls - we ONLY talk to candidates with an appointment - We frown unfavorably on this!” Yeahhhh… I’m totally gonna want to work for you, Ms. Frownypants.)
So, after a brief brush with unpleasant reality (what… you mean qualified, ideal candidates aren’t just going to fall into my lap?? NO FAIR.), this is where I am at. Grumbling about typos and wishing.
Star light, star bright
First star I see tonight
I wish I may, I wish I might
Hire a freaking sysadmin before I die of old age.
ObRelatedLinks: The Top Eight Ways Your Resume Disqualifies You For My Open Job Posting (my blog), 36 Beautiful Resume Ideas That Work (JobMob), How to Edit Your Resume like a Professional Resume Writer (Brazen Careerist).
