
His forecast: Cloudy, with a chance of downtime

John Halamka, in charge of network computing at both Harvard Medical School and Beth Israel Deaconess Medical Center, considers recent public cloud outages (from Amazon to Blogger) and says he remains optimistic about the basic concept, in part because:

Problems on centralized cloud architecture that is homogenous, well documented, and highly staffed will be more rapidly resolved than problems in distributed, poorly staffed one-off installations.

He describes some of the issues his own campus clouds have had over the past year, including:

HMS has clustered thousands of computing cores together to create a highly robust community resource connected to a petabyte of distributed storage nodes. In theory it should be invincible. In practice it went down. A user with limited high performance computing experience launched a poorly written job to 400 cores in parallel that caused a core dump every second contending for the same disk space. Storage was overwhelmed and went offline for numerous applications.


Comments

I guess Harvard has demonstrated, yet again, that there is no such thing as an "idiot proof" system. Idiot resistant, perhaps ... idiot proof, no.


Plan for failure.

Netflix did that and was able to re-route around Amazon, when Amazon's cloud collapsed a couple weeks back. Reddit didn't, and it had no choice but to wait for Amazon to figure out what had gone wrong.


A lot of that comes down to how critical the service is. Reddit's still relatively small (especially on staffing/funding), whereas Netflix is probably overflowing with developers & cash for servers. If Reddit goes down, no one loses anything they've spent money on. If Netflix goes down, they have to deal with tons of paying customers. It's probably cheaper for Reddit to just let the site fail occasionally, whereas it's cheaper for Netflix to not deal with lots of annoyed paying customers.

I'd assume hospitals in the area follow the Netflix model, if not even more redundant.


and less time goofing off on the internet blogging and working so many different jobs.

The cluster he's talking about is called "Orchestra", and calling it "robust" is a pathetic joke. It goes down constantly. The system that handles the submission of new jobs goes down. The storage subsystem, which they just installed a year ago and is supposed to be completely bulletproof and "cloud like"... well, that goes down on a regular basis too, and it's so slow that researchers will try to get access to any other cluster they can. Up until recently the cluster was running an ancient version of Linux; now it's running a slightly-less-ancient version.


If that unfortunate user had been a Harvard undergrad, they would have kicked him out for a year and ruined his permanent record.
