Omar Shahine, a PM on Windows Live Hotmail, has an excellent post entitled "Designing for Services Dependencies," written from the Hotmail perspective. Reading it brought back a bunch of memories and lessons learnt from Messenger Server. (I don’t have to think about service reliability these days, as we have more "qualified" people to do so.) Let me focus on the "reliability" aspect rather than the "dependency" aspect.
Who remembers the 8+ day outage of MSN Messenger back in July 2001? I confess that I was "not a fan" of MSN Messenger back in those days (I had just started to use AIM over ICQ), so it didn’t affect me. But the legend of that outage can still be heard if you seek out the members of that team from back in the day (who are now scattered all over the place–everywhere, including Google and Yahoo!, except Messenger). You can read about the whole ordeal here. (As far as I know, though, the outage had nothing to do with .NET; the .NET Messenger Service was just a PR branding exercise.) What about the outage in early 2003? I don’t have first-hand accounts of those weeks, as I wasn’t around during that time.
In the old days, what we would do is schedule several hours for server upgrades, during which we would kick off the entire Messenger user base (that’s millions of people around the world), take the entire cloud offline, deploy new binaries to the machines, restart the machines, smoke-test a bit, and then finally start taking traffic. It was a heavily manual, labor-intensive process, often taking many hours. These were scheduled during the lowest-traffic period of the week, which happened to be Friday nights, around 9 PM PST (sorry, Asia!). We would be down for several hours while we upgraded bits. Some people would be in the S.O.C. while the rest of us would be in a large lecture hall, watching a movie beforehand and then watching the "action" on the big screen during deployment. People got tired and sometimes made mistakes. [Note that in the really old days, the servers were rebooted by developers to "try" out new fixes/features. In the really, really old days the servers were boxes under someone’s desk. It would have been cool to be around then.]
So when was the last time you saw the pop-up dialog when using Messenger: "The Messenger Service will be performing maintenance in 5 minutes"? It should have been some time late in 2004. That maintenance, if I recall correctly, lasted several hours, during which time 100% of the Messenger user base was disconnected. (I remember my sister on the east coast IM’ing me, "AHH!! No! No maintenance! I need to talk to my partner to get this project done!!!!") The maintenance event before that was in fact the weekend of October 9-10, 2004. That was a ‘fun’ weekend. According to this article:
By early afternoon Monday, a representative of Microsoft said the company had fixed the issues that had prevented its users from logging on to Messenger.
"The system is now back up and running," the spokesperson said at 1 p.m. PDT. "We believe that the problem is now fixed." …
The spokesperson would not give further details about the problem, except to say that the Monday morning outage was due to "administrative maintenance."
Indeed, that was a big SNAFU. (How much can I divulge here without getting fired? Just use your imagination.) It took a long while to recover from that outage, and we had learnt our lesson. It was interesting to note, as well, that after an outage it actually takes a while before the number of online users climbs back to pre-outage levels. I suppose it’s not surprising that you lose some percentage of users when your service is out of commission, but you don’t really wrap your head around the fact that even a small percentage of millions of people is a lot of people. Nine hours of downtime in a year (= 99.9% uptime) may not sound like much, but nine hours all at once passes by fairly quickly, and you’re guaranteed to hit many other blips along the way.
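The "nine hours" arithmetic is just the availability target times the period. A minimal sketch of that back-of-the-envelope math (the function name is mine, not anything from the Messenger tooling):

```python
# Translate an availability target into an allowed-downtime budget.
# downtime = (1 - availability) * period

HOURS_PER_YEAR = 365 * 24  # 8760

def downtime_budget_hours(availability: float, period_hours: float = HOURS_PER_YEAR) -> float:
    """Maximum downtime (in hours) permitted by an availability target."""
    return (1.0 - availability) * period_hours

for target in (0.99, 0.999, 0.9999):
    print(f"{target:.2%} uptime -> {downtime_budget_hours(target):.2f} hours down per year")
```

At 99.9%, the budget works out to about 8.76 hours a year, which is why "an extra 9" matters so much: each one cuts the budget by a factor of ten.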
How do we measure reliability? Surprisingly enough (it surprised me), we don’t actually use the statistics that clients upload to us, since the checkbox to "Join our Customer Experience Improvement Program" is turned off by default, unfortunately, and not very many people turn it on. (By the way, you really should check that box: Tools > Options > General > Quality Improvement.) Instead, we have these little programs that run against the cloud, simulating actual clients, every X minutes or so. Using a little math, one can turn these into a rough measure of percentage uptime. I always found this to be somewhat arbitrary. For instance, it’s easy to figure out what happens if Passport is down. As Omar says, you’re down on your ass. But what does it mean, for example, if you can log in and get the status of your buddies, but you can’t establish an IM session with them? How do you weight that for ‘service reliability’? The best way (in my opinion) would be to get actual client data–how often people try to set up IM sessions compared to how often they fail; but we don’t have all that data (and definitely not all of it in real time), so we figure out some complicated formula. We try to be objective, but it’s totally subjective.
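To make the subjectivity concrete, here is a hypothetical sketch of turning synthetic-probe results into an uptime number. The probe capabilities and weights are entirely invented for illustration; the actual Messenger formula was never public:

```python
# Hypothetical scoring of synthetic "fake client" probes. Each probe checks a
# few capabilities; a subjective weight table decides how much each one counts.
from dataclasses import dataclass

@dataclass
class ProbeResult:
    login_ok: bool        # could we authenticate (Passport)?
    presence_ok: bool     # could we see buddy status?
    im_session_ok: bool   # could we actually establish an IM session?

# Subjective weights: the "complicated formula" in miniature.
WEIGHTS = {"login": 0.4, "presence": 0.2, "im_session": 0.4}

def probe_score(r: ProbeResult) -> float:
    """Score one probe run between 0 (fully down) and 1 (fully up)."""
    if not r.login_ok:
        return 0.0  # if login/Passport is down, everything is down
    score = WEIGHTS["login"]
    score += WEIGHTS["presence"] if r.presence_ok else 0.0
    score += WEIGHTS["im_session"] if r.im_session_ok else 0.0
    return score

def uptime_percent(results: list[ProbeResult]) -> float:
    """Average per-probe scores into a rough 'percentage uptime'."""
    return 100.0 * sum(probe_score(r) for r in results) / len(results)

# Example: login and presence work everywhere, but IM sessions fail for
# half of the probe runs. Fully up, or 80% up? The weights decide.
samples = [ProbeResult(True, True, True)] * 50 + [ProbeResult(True, True, False)] * 50
print(f"{uptime_percent(samples):.1f}% up")
```

The point of the sketch is that the 80% figure it prints is an artifact of the weight table, not an objective fact about user experience; change the weights and the "uptime" changes with them.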
Keeping a service up and "reliable" may sound extremely boring to many folks, but there are people here dedicated to doing exactly that. These people covet an extra 9 and will do anything to get that extra bit. Funny enough, it’s a constant struggle between those who want to keep the service reliable ("don’t touch what ain’t broke") and others who want to roll out new features to users as quickly as possible. Interesting dilemma there, too.
Messenger has gotten much, much better at dealing with downtime since 2004. Part of that, of late, has to do with some platformizing and leveraging some cool work done in Search. Which is why, although you may experience "unable to connect" every now and then, you won’t see "maintenance" any time soon.