On the 16th of April 2014 at 19:00 hours (7:00 PM), Endurance International experienced a network failure at its Provo data centre (in Provo, Utah). This failure took Hostgator and Bluehost servers offline for a period of 14 hours.
This outage, the second within a week (a similar 20 to 40 minute outage occurred on the 15th), is raising questions about how Endurance International manages the operations around its servers and the Provo data centre. The latest outage disconnected all servers in the Provo data centre under the Endurance International banner, regardless of whether they were shared hosting servers, virtual private servers or dedicated servers.
Some servers were intermittently available during the outage, so some connections succeeded while others failed. Personally, I received around 32 notifications stating that the servers had gone offline and come back online. This dropped the availability of the website to 98.36% for the last week and 89.28% for the past 24 hours, far short of the 99.99% up-time advertised by the hosting companies. A 99.99% up-time figure allows for only 52.56 minutes of downtime in a year, so this outage alone puts the services provided outside the advertised up-time.
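For anyone who wants to check the arithmetic, here is a minimal Python sketch of the downtime budget calculation. The function names are my own, and it assumes a 365-day year and a 14-hour outage as described above; it is not how either host actually measures up-time.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, assuming a non-leap year


def allowed_downtime_minutes(uptime_percent, period_minutes=MINUTES_PER_YEAR):
    """Return the downtime (in minutes) that an up-time percentage permits over a period."""
    return (1 - uptime_percent / 100) * period_minutes


def downtime_budget_exceeded(outage_minutes, uptime_percent, period_minutes=MINUTES_PER_YEAR):
    """Return True if a single outage is longer than the whole permitted downtime."""
    return outage_minutes > allowed_downtime_minutes(uptime_percent, period_minutes)


if __name__ == "__main__":
    budget = allowed_downtime_minutes(99.99)
    outage = 14 * 60  # the 14-hour outage, in minutes
    print(f"Yearly downtime allowed at 99.99%: {budget:.2f} minutes")
    print(f"Outage length: {outage} minutes")
    print(f"Budget exceeded: {downtime_budget_exceeded(outage, 99.99)}")
```

At 840 minutes, the outage is roughly sixteen times the yearly allowance on its own.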
The teams in the Provo data centre should have identified the issue with the networking equipment on the 15th of April, when the data centre equipment failed to switch over to redundant equipment and/or backup links during that downtime. The latest issue, however, was caused by a bug or fault in the firmware of the main networking equipment, which caused packet loss.
Endurance International Group now needs to analyse its practices within the data centre, how it manages faults and its backup plans. Many customers could potentially have lost millions of dollars along with large numbers of their own customers, and the same can be said for Endurance International and all the hosting companies under its banner, whose names have also been tarnished.
That being said, Endurance International and the hosting companies now have to regain the lost faith of their customer bases and hopefully learn from their mistakes, while customers evaluate the benefit of staying with the above companies versus moving to another provider.
Update as of 18th April 2014
Hostgator and Bluehost have both now made announcements regarding the downtime, with Hostgator providing additional information regarding the issue. Hostgator's statement can be found at http://forums.hostgator.com/showpost.php?p=518345&postcount=12 while a copy of Bluehost's statement can be found at http://arweth.com/wp-content/uploads/2014/04/bluehost_statment.png
Additional information provided by Hostgator states that the previous issues were not related to this downtime. However, given the previous downtimes, they should have prepared redundancy systems. As stated on Hostgator's site, the data centre is equipped with 9 different provider links, so a failover should be in place to use backup equipment and these links in the event of a networking issue, either in house or upstream from the centre. As a side note, the information regarding the providers may be incorrect, as it still talks about Softlayer as the centre provider, which is no longer the case for most if not all servers.
In response to the public’s complaints about Endurance International’s involvement, Hostgator stated that Endurance provided additional resources to help rectify the issue sooner.
With regard to the companies (Hostgator and Bluehost) leaving Softlayer, the statement says that moving to the Provo data centre allowed them to have more control over the equipment and its management, enabling better support of the systems, and that in the long run this will be more beneficial to customers. This is one of those statements that can only be judged by what has happened so far, and at the moment it is not going very well at the Provo data centre. They seem to feel that the best course of action is to stay at the Provo centre, and this could prove to be the case: in future it may provide better stability while allowing lower costs, but there is nothing to say which way it will go.
Lastly, on the biggest question of whether this issue will happen again, their response was the traditional one: they will endeavour to prevent it happening again but are unable to guarantee that it will not occur. They do state that they are investing large amounts of resources to increase the systems' reliability and prevent such an issue from occurring. This, however, is the generic response you would expect to hear after an issue like this. They are at least being honest in saying that it could happen again, though that is not what customers really want to hear.
Overall the downtime is regrettable, and the companies should now look at putting backup measures in place to ensure something like this does not happen again. Leaving Bluehost or Hostgator for another provider will not guarantee that an issue like this will not occur again, as such issues can arise even in stable systems, as this one did. As long as there is no considerable downtime with Hostgator or Bluehost in the coming weeks and months, there would be no reason to leave them as a hosting provider; in my opinion they will work harder to build their reputation back up and ensure system reliability.