Salesforce performance degradation March 2010


As I’m sure many users will have noticed, during March 2010 there were a number of performance issues with the Salesforce platform, as well as a couple of (reasonably short) outages. Some brief explanations were posted on the website; however, for those of you who are interested, here is the version of events as explained by Salesforce Premier Support, which offers more detail about what happened and what they are doing to ensure it does not happen again:

RCM: 3/8/10 – 3/28/10 Performance degradation

1. Problem Summary
Salesforce experienced service degradations due to bugs in a vendor network firewall tier in our common infrastructure. These service incidents occurred on the following dates:
o    3/8/10 16:35 – 17:14 UTC
o    3/11/10 15:08 – 15:46 UTC
o    3/17/10 03:28 – 03:48 UTC
o    3/20/10 07:52 – 08:45 UTC
o    3/22/10 15:41 – 16:11 UTC
o    3/23/10 20:44 – 23:25 UTC
o    3/28/10 19:51 – 20:06 UTC
During these incidents, customers may have experienced sporadic service degradation, including service interruptions.
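For a sense of the overall impact, the seven windows listed above add up to just under six hours of degraded service across the three weeks. A quick sketch to total them (dates and times taken directly from the list above):

```python
from datetime import datetime

# Outage windows (UTC) as listed in the RCM
windows = [
    ("2010-03-08 16:35", "2010-03-08 17:14"),
    ("2010-03-11 15:08", "2010-03-11 15:46"),
    ("2010-03-17 03:28", "2010-03-17 03:48"),
    ("2010-03-20 07:52", "2010-03-20 08:45"),
    ("2010-03-22 15:41", "2010-03-22 16:11"),
    ("2010-03-23 20:44", "2010-03-23 23:25"),
    ("2010-03-28 19:51", "2010-03-28 20:06"),
]

FMT = "%Y-%m-%d %H:%M"
# Duration of each incident as a timedelta
durations = [
    datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    for start, end in windows
]
total_minutes = sum(d.total_seconds() for d in durations) / 60
print(f"Incidents: {len(windows)}, total degradation: {total_minutes:.0f} minutes")
# → Incidents: 7, total degradation: 356 minutes
```

Note the longest single incident was the 3/23 window at 2 hours 41 minutes; the rest were under an hour each.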

2. Root Cause
The issue was caused by a vendor bug in the architecture of the network devices used in the shared infrastructure. It was tracked to an underlying hardware design problem that exposed a previously unknown set of events. The issue was resolved by replacing the hardware with another model with a different architecture which does not exhibit the problem.

3. Actions Taken / Timeline
∙    3/8/10 UTC Incident observed: network device showing high session and bandwidth utilization. Vendor engaged by way of a Severity 1 case.
∙    3/9/10 UTC Network devices tested and later patched to recommended version based on vendor feedback. Changed configuration parameters for the networking devices.
∙    3/11/10 UTC Second incident causing more severe behavior, resulting in unresponsiveness from the networking devices. Executive escalation to vendor. In addition, a subset of network traffic flow was redistributed for more efficiency and risk mitigation. Patch levels were discussed with the vendor and another patch level was recommended, tested and implemented.
∙    3/17/10 UTC An additional event occurs resulting in high session and bandwidth usage. Further escalation followed by vendor technology team onsite engagement for monitoring and troubleshooting. Additional traffic redirected. Added additional diagnostic monitoring.
∙    3/20/10 UTC Network devices downgraded to a version which had been running the longest without issues within the infrastructure. Exploration of alternate vendor solution for the network devices initiated.
∙    3/22/10 UTC Vendor conveys that there may be a more serious architectural problem causing the behaviors we experienced. Recommendation is to move to an alternate model based on a different architecture. Devices are being shipped, tested and ready for migrating traffic.
∙    3/24/10 UTC Currently installed production network device models replaced with the recommended alternate models.
∙    3/28/10 UTC Maintenance for isolating a class of internal network traffic from direct customer traffic completed for better traffic management and isolation. During this movement some users saw minimal performance degradation for up to 15 minutes.

4. Remaining Issue/Risk
The Technology team successfully resolved the issue by working with our vendor to replace the production network devices. After these were replaced the vendor and the team have been on-site monitoring traffic and running diagnostics to ensure everything is working as expected.
No issues have been observed and the service has been operating as expected since; as such we consider this issue to be closed.

5. Actions to Prevent Future Incidents
∙    The technology team has replaced all critical firewalls and continues to replace all models that could be exposed to the issue
∙    Network traffic has been redistributed for improved efficiency, isolation and greater fault tolerance
∙    Monitoring is being further enhanced to address such issues in a timely manner