Tuesday, February 26, 2013

In the Aftermath of Unexpected Worldwide Windows Azure Storage SSL Certificate Expiration

Sometime between Feb 22, 2013 and Feb 23, 2013 the SSL certificate of the Windows Azure Storage service unexpectedly expired. Proofs: one and two. A lot of software is configured in such a way that once an SSL certificate expires, the software no longer trusts the service and refuses to connect to it. This happened in all geographical regions, so no amount of data replication would have helped (except replicating to some other provider, of course).

A fair amount of chaos followed. Microsoft replaced the certificate within several hours, and life goes on.

Now it's time to analyze this situation. Here we have a third-party service that a lot of other services depend on, and it turns out the service provider let the certificate expire.

Suppose your service uses Windows Azure Storage for storing data and you find yourself in the situation described above. What lessons will you learn from it?

The standard way to handle this situation is the following. It was Microsoft who was responsible for the certificate and so it's Microsoft's fault. Let's get some Jack Daniel's and have a good time.

This approach no longer works. At least not with the cloud services responsibility model.

If you read the Windows Azure Storage SLA (highly recommended) you'll see that in no event are you eligible for a refund greater than the sum you paid for the service. This means that if you were affected by that incident you can likely get a refund of several USD, not much more. With that refund you then have to go to your service's customers and explain why your service was affected.

Note that Microsoft is not scamming you: you were shown that SLA upfront and your lawyers have likely read it.

Now follow either of the two links above and look carefully at the picture with the certificate information. Where is that picture from? When you open any HTTPS-enabled site like https://twitter.com your browser shows a visual indication of an encrypted connection. If you click it you can read who owns the site and who issued the certificate, and there's a kind of "more info" button that brings you to the certificate information dialog. This way you can see that the https://twitter.com SSL certificate expires (as of this writing) on May 11, 2014, which is quite far from now.

So it turns out you can look at the SSL certificate of any HTTPS-enabled site at any time and see its expiration date.
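You don't even need a browser for that. Here is a minimal sketch in Python, using only the standard `ssl` and `socket` modules, that connects to a host and reads the "not after" date from its certificate. The function names `parse_not_after` and `fetch_expiry` are made up for this illustration, and the host in the usage comment is just an example.

```python
import socket
import ssl
from datetime import datetime, timezone


def parse_not_after(not_after: str) -> datetime:
    """Convert a certificate 'notAfter' string (e.g. 'May 11 00:00:00 2014 GMT')
    into a timezone-aware UTC datetime."""
    seconds = ssl.cert_time_to_seconds(not_after)
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


def fetch_expiry(host: str, port: int = 443, timeout: float = 10.0) -> datetime:
    """Open a TLS connection to host:port and return the expiry time
    of the certificate the server presents."""
    context = ssl.create_default_context()  # verifies the certificate chain
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            cert = tls.getpeercert()
    return parse_not_after(cert["notAfter"])


# Example usage (needs network access):
#     print(fetch_expiry("twitter.com"))
```

Note that the default context verifies the certificate, so if the certificate has already expired the handshake itself fails with an `ssl.SSLCertVerificationError`. That is exactly the "refuses to connect" behavior described earlier.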

Soooo… Unlike any other kind of unexpected event (lightning, a storm, an earthquake, an intern spilling coffee onto critical equipment), this time anyone could have seen the disaster coming. The information was publicly available weeks in advance and no one noticed the upcoming problem.

Soooo… If your service uses a third-party service via SSL you have only two options. Either you monitor that service's certificate, or you risk that certificate unexpectedly expiring and sending you into chaos.
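What could such monitoring look like? Below is a rough sketch in Python, not a finished tool. It assumes you have already collected the expiry date of each certificate you depend on (for example with Python's `ssl` module) and only decides which ones need attention. The names `certificate_status` and `report`, the 30-day warning window and the host names are all assumptions picked for illustration.

```python
from datetime import datetime, timedelta, timezone


def certificate_status(expires_at: datetime, now: datetime, warn_days: int = 30) -> str:
    """Classify a certificate by how much time is left until it expires."""
    remaining = expires_at - now
    if remaining <= timedelta(0):
        return "EXPIRED"
    if remaining <= timedelta(days=warn_days):
        return "EXPIRING SOON"
    return "OK"


def report(expiries: dict, now: datetime = None, warn_days: int = 30) -> dict:
    """Map each host to its status; 'expiries' maps host name -> expiry datetime (UTC)."""
    if now is None:
        now = datetime.now(timezone.utc)
    return {
        host: certificate_status(expires_at, now, warn_days)
        for host, expires_at in expiries.items()
    }


# Example with hypothetical hosts and dates:
#     expiries = {"storage.example.com": datetime(2013, 2, 22, tzinfo=timezone.utc)}
#     print(report(expiries, now=datetime(2013, 2, 1, tzinfo=timezone.utc)))
```

Run something like this daily from a scheduler and send yourself a message for anything that is not "OK". A warning window of a few weeks leaves time to contact the provider or prepare a workaround.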

It doesn't matter that it's Microsoft's (or Twitter's) certificate. If your service depends on that certificate you have to monitor it. That's how clouds work. If you don't comply with this model, all you get is a crowd of upset customers, a ton of upset blog posts and a refund worth no more than several USD.