Sandbox service disruptions August 2011

Back in August there were a number of lengthy outages on the CS3 and CS8 Sandbox instances. Following a root cause analysis, the email below was sent out to customers with details of what caused the problem and the measures salesforce.com is taking to ensure that it doesn't happen again.

Executive Summary
At salesforce.com the trusted success of our customers is our top priority. We want to provide a further update on the CS3 and CS8 Sandbox service disruptions we experienced in August 2011.

We sincerely apologize for the impact that these incidents may have caused to your business. We have taken these incidents very seriously and have made the investigation into their root cause a top priority. The full root cause of both incidents is outlined below.

Root Cause Analysis
On August 23, 2011 starting at 8:38 AM US/Pacific, customers residing on the CS3 sandbox instance experienced an inability to log in while trying to access test.salesforce.com services. The salesforce.com Technology Team performed systematic troubleshooting to isolate the cause and to restore services. On August 28, 2011 at 10:50 PM US/Pacific, we restored service on CS3 for all customers with sandbox organizations that had been accessed within the past 90 days.

In the CS3 incident, one of the servers in our database tier experienced a fatal hardware error. This failure interrupted system I/O operations and corrupted files that are core to database functionality.

When our database prepares a block to be written to disk, it calculates a checksum value that is also stored in the block header, completes a sanity check, and sends the block to be written to disk. Just prior to the hardware failure, our database initiated two simultaneous but separate I/O operations to write the block to two key database files. Because both files resided on the same database disk partition, and because of the timing of the hardware failure, both I/O operations were interrupted and the blocks on disk were damaged. While the same block was corrupted in both files, it was corrupted in different ways: in one file the block failed the checksum test; in the other, the block header was damaged and failed the sanity check.
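
To make the write-path checks above concrete, here is a minimal, hypothetical sketch of block-level checksum and header sanity checks. It is not salesforce.com's or the database vendor's actual implementation; the block layout, field names, and magic value are illustrative assumptions.

```python
import struct
import zlib

BLOCK_MAGIC = 0xDA7ABA5E          # hypothetical "sane header" marker
HEADER_FMT = ">II"                # magic (4 bytes) + CRC-32 checksum (4 bytes)
HEADER_SIZE = struct.calcsize(HEADER_FMT)

def prepare_block(payload: bytes) -> bytes:
    """Compute a checksum over the payload and store it in the block header
    before the block is handed to the I/O layer (analogous to the write path
    described above)."""
    checksum = zlib.crc32(payload)
    header = struct.pack(HEADER_FMT, BLOCK_MAGIC, checksum)
    return header + payload

def verify_block(block: bytes) -> str:
    """Re-run the header sanity check and the checksum test on a block read
    back from disk; either failure indicates corruption."""
    if len(block) < HEADER_SIZE:
        return "corrupt: block truncated"
    magic, stored_checksum = struct.unpack(HEADER_FMT, block[:HEADER_SIZE])
    if magic != BLOCK_MAGIC:
        return "corrupt: header failed sanity check"
    if zlib.crc32(block[HEADER_SIZE:]) != stored_checksum:
        return "corrupt: payload failed checksum test"
    return "ok"

# The same logical block written to two files; an interrupted I/O damages each
# copy differently, so each copy fails a different test.
block = prepare_block(b"row data ...")
damaged_header = b"\x00" * HEADER_SIZE + block[HEADER_SIZE:]   # header clobbered
damaged_payload = block[:HEADER_SIZE] + b"garbage bytes"       # payload clobbered

print(verify_block(block))            # ok
print(verify_block(damaged_header))   # corrupt: header failed sanity check
print(verify_block(damaged_payload))  # corrupt: payload failed checksum test
```

The relevant point is that each damaged copy fails a different check, which is why neither copy of the block could be used for a standard recovery.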

While our database solution provides scalability and a high degree of availability, it was not able to protect against this type of failure. The corruption affected a critical function at the database level, making a standard database recovery impossible.

On September 4, 2011 starting at 3:58 PM US/Pacific, customers residing on the CS8 sandbox instance experienced an inability to log in while trying to access test.salesforce.com services. The salesforce.com Technology Team performed systematic troubleshooting to isolate the cause and to restore services. On September 7, 2011 at 2:32 AM US/Pacific, we restored service on CS8 for all customers with sandbox organizations. We were able to restore the sandbox to a recovery point of approximately 3:43 PM US/Pacific on September 4, 2011.

In the CS8 incident, a system driver bug caused the database server to crash, and key database files were again corrupted. This issue was similar to the CS3 incident, but the root cause was a different underlying bug.

Action Plan
As a result of our in-depth root cause analysis of these incidents, we have taken the following actions:

• Deployed a new operating system version and system firmware to resolve the issues experienced on CS3 and CS8;
• Deployed a third level of resiliency in our production environment that will minimize the duration of any restoration efforts and protect us from corrupt data being replicated to our DR data centers. We believe this will enable us to restore the system from data corruption in well under 24 hours;
• Enhanced our incident management and triage process to confirm that logs are adequately reviewed; and
• Diversified copies of key database files across multiple database disk partitions, rather than keeping them on a single partition, so that severe corruption cannot affect all critical database files at once (a simplified illustration follows this list).
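
As a rough illustration of that last measure (not salesforce.com's actual tooling), multiplexing copies of a key file across separate partitions might look like the sketch below. The mount points and helper names are hypothetical, and is_valid stands in for a verification routine such as the checksum checks sketched earlier.

```python
from pathlib import Path

# Hypothetical mount points on separate disk partitions; the paths are illustrative.
PARTITIONS = [Path("/u01/dbfiles"), Path("/u02/dbfiles"), Path("/u03/dbfiles")]

def write_multiplexed(name: str, data: bytes) -> None:
    """Write one copy of a key database file to each partition, so damage
    confined to a single partition cannot corrupt every copy."""
    for mount in PARTITIONS:
        mount.mkdir(parents=True, exist_ok=True)
        (mount / name).write_bytes(data)

def recover_copy(name: str, is_valid) -> bytes:
    """During recovery, return the first copy that still passes validation
    (for example, the block checksum and header checks sketched earlier)."""
    for mount in PARTITIONS:
        candidate = mount / name
        if candidate.exists():
            data = candidate.read_bytes()
            if is_valid(data):
                return data
    raise RuntimeError(f"no intact copy of {name} was found")
```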

Based on our investigations and subsequent preventative actions, we are confident that we are in a much stronger position to recover from a similar incident in the future.

Our goal is to provide world-class service to our customers. We are continuously reviewing, revising, and improving our tools, processes, and architecture in order to provide customers with the best service possible. We sincerely apologize for any impact these service disruptions may have caused to your business, and we appreciate your continued trust as we improve our processes and services.

Best regards,
salesforce.com Customer Support