Blogs@Baruch Semester in Review: Part One, Triumph and Tribulation

We’re winding down another eventful semester on Blogs@Baruch, and over the next few days I’d like to offer some reflections about where we’ve been and where we’re going. Our usership has tripled, and we’ve also expanded to serve a much broader range of constituencies at the college. This broadening and deepening has taught me much about the opportunities and challenges of supporting Baruch’s use of this powerful open source publishing platform.

Mikhail Gershovich accepts the Mike Ribaudo Award at the 8th Annual CUNY IT Conference

Two events over the last ten days drew into sharp focus what we have accomplished and also some of the challenges we face. At the 8th Annual CUNY IT Conference, the Schwartz Institute was awarded the Michael Ribaudo Award for Innovation in Technology. Mikhail, Suzanne, Tom, and I were recognized along with administrative teams from John Jay and the CUNY First project, as well as our good friend Matt Gold, Project Director for the CUNY Academic Commons. The Commons is like a sister project to Blogs@Baruch, since we’re using the same software, and we share ideas, labor, and a philosophy about what support for technology at the university level should entail.

It was an honor to be recognized for our innovations and, especially, to share the honor with Matt, since it signaled to the broader CUNY community that the work we’re undertaking is not only viable, but forward-looking and vital to the work of the University. At the risk of sounding like an ingrate, though, I noted that the certificates we received read that this was an “Information Technology” award. I’ve made the point before, and will make it again: instructional technology is not information technology. This is actually acknowledged in how the Ribaudo is awarded, as it’s split between the two areas (even if the split is not represented on the certificate). This is more than a semantic argument: we need to encourage our communities to understand the differences and to constantly reexamine how the University’s information technology architecture relates to and interacts with the deployment of technology in the service of teaching, learning, and scholarship.

It’s always nice to get an award, and last week brought hearty congratulations from inside and outside the Baruch community. In the midst of these pats on the back, however, I learned a little bit more about the difference between information technology and instructional technology. At approximately 7 pm on Wednesday evening I happened to look at one of our blogs, and saw the dreaded:

[Screenshot: the “Error Establishing a Database Connection” message]

(What follows is a bit technical; skip ahead to “Here’s the rub” if you’d rather.)

The error appeared on all subdirectory blogs, while the main blog was completely white. I logged into the command line, verified that MySQL was running, and saw that the load on our server was fine. The documentation I was able to find suggested either a MySQL problem or a plugin conflict; I deleted all plugins, with no improvement. Now, instead of the “Error Establishing a Database Connection” I was getting what geeks refer to as the “White Screen of Death” across the entire installation. Having pretty much exhausted my command line knowledge, I sent out emails to our contacts at BCTC, and waited for a response.

A couple of hours later, I was contacted by a sysadmin at BCTC; he had gamely returned to work on his way home from the gym to take a look at our server. He immediately noticed that the directory that holds Blogs@Baruch was about 98% full. We knew that we were approaching space limits, but I had (mis)calculated that we could make it to the end of the semester (when we’ll be moving the entire installation over to a new server). I was puzzled, however, because we had this issue once before and it didn’t cause an outage; it just caused an error in our database backups that resolved as soon as we opened up space. I hoped opening space would clear up our problem, but it did not.
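A check along the following lines could flag a filling disk before it causes trouble. This is only a minimal Python sketch, not something that was running on our server: the function names, the 90% threshold, and the use of "." as a stand-in for the Blogs@Baruch directory are all assumptions made for illustration.

```python
import shutil


def disk_usage_percent(path):
    """Return the percentage of the filesystem holding `path` that is in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def check_space(path, warn_at=90.0):
    """Return a warning string once usage reaches `warn_at` percent, else None.

    The 90% default is an arbitrary, assumed threshold.
    """
    percent = disk_usage_percent(path)
    if percent >= warn_at:
        return f"WARNING: {path} is {percent:.0f}% full"
    return None


if __name__ == "__main__":
    # "." is a placeholder for the directory that holds the installation.
    print(check_space(".") or "Disk space looks fine.")
```

Run on a schedule and wired to email, something like this would have turned a rough mental calculation about remaining space into a daily measurement.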

We both thought that the database needed to be repaired, but neither of us was comfortable issuing the repair commands. The admin at BCTC contacted MySQL, and got assistance repairing and then restarting MySQL. At 1 am, still no improvement. We’d have to wait until morning.

At 6 am I took another look at the server to see if I had missed anything, and began to respond to users who were emailing about the site. I posted a query to our premium support forum with Automattic describing the problem, and got a quick response from Donncha, the lead developer of WPMu. Unfortunately, my question included a distracting error from the log, one caused by a bad phpinfo file I had put on our server (in my haste I wrote the file in TextEdit at home, which put additional characters into the file that I wasn’t able to see). Donncha thought we might have been hacked, and asked me to check our .htaccess files, which looked OK. I caught my mistake, and explained it (along with a note apologizing for not being a system administrator). Apparently I wasn’t clear, because Donncha kept pursuing the PHP error… we weren’t communicating well. He suggested I use error_log() to track down where the PHP problem was.
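Characters that an editor hides can still be found by looking at a file's raw bytes. The Python sketch below is illustrative only: exactly what the editor added to the phpinfo file isn't recorded here, so the things it looks for (a byte order mark, a rich-text header, non-ASCII bytes such as curly quotes, stray control bytes) are assumed typical culprits, and the function name is hypothetical.

```python
def find_hidden_bytes(data):
    """Return (offset, description) pairs for bytes a plain editor view can hide."""
    findings = []
    start = 0
    if data.startswith(b"\xef\xbb\xbf"):
        findings.append((0, "UTF-8 byte order mark"))
        start = 3  # skip the mark so its bytes are not reported twice
    if data.startswith(b"{\\rtf"):
        findings.append((0, "RTF header (file was saved as rich text)"))
    for offset in range(start, len(data)):
        byte = data[offset]
        if byte >= 0x80:
            findings.append((offset, f"non-ASCII byte 0x{byte:02x}"))
        elif byte < 0x20 and byte not in (0x09, 0x0A, 0x0D):
            # Tabs, line feeds and carriage returns are expected in text files.
            findings.append((offset, f"control byte 0x{byte:02x}"))
    return findings
```

To use it, read the suspect file in binary mode (`open(path, "rb").read()`) and pass the bytes in; an empty list means nothing unusual was found.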

In the meantime, emails and phone calls from users were flowing in, and I did my best to explain to as many as possible that we were investigating the problem and should be live again soon. Internally, though, I wasn’t so sure; we had exhausted our knowledge and the knowledge in the free forums, and the premium forum to which I was posting wasn’t yielding results. Jim Groom suggested we contact Ron and Andrea Rennick, whom I refer to as the “WPMu Wonder Couple,” to see if they might be able to help us out.

Within three hours of Jim’s suggestion, BCTC had vetted Ron and granted him temporary access to our server; he located and fixed the problem in about 20 minutes. In the meantime, Barry Abrahamson, who runs the servers for WordPress.com and also posts to the premium support forum, had offered to do the same.

Turns out the problem was one that I had caused while trying to fix the space issue. When I deleted the plugins in mu-plugins, I failed to delete the Supercache file that sits outside of the plugins folder, inside of wp-content. I also deleted the existing cached pages. Ron concluded that:

Once you ran out of disk space, pages expiring in supercache were being refreshed as empty files. Eventually nearly all of your pages were cached as empty files. I disabled supercache by renaming advanced-cache.php in wp-content. MU checks for the file and includes it in the processing if it exists.

He later added:

I did some testing locally and reproduced the white screen by deleting the contents of the cached version of the index.
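Ron’s diagnosis suggests two things worth checking for: cached pages that have been written out as empty files, and the advanced-cache.php file that switches the caching on. Here is a minimal Python sketch of both, under stated assumptions: the function names are hypothetical, the cache directory location is passed in rather than assumed, and the ".disabled" suffix is an invented choice (the name Ron actually gave the file isn’t recorded above).

```python
import os


def find_empty_files(cache_dir):
    """Walk `cache_dir` and return a sorted list of zero-byte files."""
    empty = []
    for root, _dirs, files in os.walk(cache_dir):
        for name in files:
            path = os.path.join(root, name)
            if os.path.getsize(path) == 0:
                empty.append(path)
    return sorted(empty)


def disable_advanced_cache(wp_content_dir):
    """Rename advanced-cache.php so it is no longer found under that name.

    Returns the new path, or None if the file was not there.
    The ".disabled" suffix is an assumption made for this sketch.
    """
    source = os.path.join(wp_content_dir, "advanced-cache.php")
    if not os.path.exists(source):
        return None
    target = source + ".disabled"
    os.rename(source, target)
    return target
```

A long list from `find_empty_files` pointed at the cache directory would have been a strong hint that the white screens were empty cached pages being served, not a broken database.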

Here’s the rub: we got through it. Ultimately this was two small problems masquerading as a big one. We ran out of space, then I failed to properly disable a powerful plugin running on our system, which disabled the entire install. We were down less than 20 hours, and that was only because I wasn’t systematic enough to pick up on the way Supercache works. To a certain extent, something like this was inevitable. All sites go down, even the Big G. It’s the risk you run when you work online, and reasonable end users can accept it. It helps if those running the site aspire towards transparency.

The outage confirmed my belief in open source applications, and particularly the communal ethos that (often) animates them. Three friends (Boone Gorges, Jim, and Zach Davis) offered assistance as soon as they learned of the problem, and moral support because they’ve each been in similar situations. The offers of hands-on help were reassuring, but I didn’t really need them because I was already in contact with the three most knowledgeable WPMu people in the world.

The outage also reminded me that being able to type stuff at the command line and get stuff in return does not make one a system administrator. I’m a humble educational technologist, and I depend on information technology to get my work done. When the lines are blurred (and I blurred them here more out of necessity than conceit), trouble may ensue. Had I been able to look holistically at the problem and troubleshoot it methodically, I probably could have caught the error. But inexperience and the pressure of supporting 3k+ users clouded my vision and convinced me the solution to the problem was out of my reach. These are valuable lessons to carry forward on this project.

Within an hour of Blogs@Baruch going back up, Baruch College’s enews arrived in my mailbox, containing congratulations to the Institute on the Ribaudo Award. I clicked on a link and landed happily at our pretty little homepage, which was humming nicely along. When I closed my laptop, I still managed to feel pretty good about the week.

PS: I’ve learned that the following cultural artifact can help one oversee an enterprise publishing platform: