With the outage as widespread as it was, we highly doubt that we will ever get a straight answer from Google, so all we have are theories and guesses. Still, it was certainly an across-the-board issue for Google, affecting Google+ IM and Gmail chat in addition to Google Talk. When we attempted to use the service, it appeared to load properly and even allowed us to authenticate. However, as soon as we tried to open a chat with someone, we received errors and were ultimately disconnected. Considering the usual way that federated messaging services run, this might be a problem with the SIP gateway servers that allow external access to the internal paths for message transfer.
We have seen this happen in Microsoft Office Communications Server before, when there were networking issues between the pool servers and the gateway. In one case, a certificate issue brought the whole IM network down because the systems refused to talk to each other. Granted, this is pure speculation on our part, as we do not have details on how Google operates its messaging network.
After Google’s issue came word that Microsoft’s vaunted Azure was down in many places across Western Europe. This is another issue where we will not get any details about what is really going on. All Microsoft posted was a note saying that it was actively investigating the issue and assuring people that their storage would not be affected in any way. The outage affected Microsoft’s hosted services and prevented people from provisioning (or, in some cases, accessing) their virtual servers. Some speculated that there might have been issues at the Dublin data center. This center has had power issues in the past that have brought Amazon and Microsoft down for short periods of time when power providers failed to keep things running smoothly. At the time of this writing, Microsoft says that all services are back online. Still, it makes you wonder about their push to the cloud with Windows 8… at least it makes us wonder.
Last on the list was Twitter, which also had an issue in two data centers that disrupted the service for several hours. The Twitter issue is a big one in that it illustrates that even with failover and backup you can still have outages. It honestly reminds me of a situation where not two but three different failovers did not kick in. The failure, which was later traced to the very system that was supposed to switch automatically from one power source to the other and back, delayed a multi-million dollar project by over 60 days. Fortunately, that failure happened during the power systems validation phase, BEFORE we even started to switch over to the servers in this data center, so the company involved did not suffer any downtime.
Twitter was only down for about an hour, but the service was affected by sluggish performance for around four hours in total. As we said, it was not a good day for supporters of the cloud, but hopefully the issues will lead to better services; at least that is what we would like to see. Unfortunately, many companies will only spend the time and money to Band-Aid the issue and move on, which means that we are likely to see more outages and downtime in the future.