SAP SuccessFactors Learning Life Sciences User Group Discussions

Ashburn Data Center Unavailability

kishore_krish12
Galactic 3

Does anyone know or has been communicated the reason for the system outage?

28 REPLIES

Brian_Boegs
Galactic 6

We are down too. This is the 4th time down in the last 2 months. Not good.

8 days later and another outage. SAP better check their data center for loose cables.

1/18/2022            (1 Hour 34 Minutes)

1/10/2022            (6 hours)

12/30/2021         (8 minutes)

12/10/2021         (5 minutes)

11/18/2021         (3 hours 46 minutes)
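
For anyone tallying these for a deviation record, here is a minimal Python sketch that adds up the durations listed above. The `outages` dictionary and the `total_minutes` helper are illustrative names, not anything from SAP or the Cloud Availability Center, and the figures are only the ones posted in this thread.

```python
# Outage durations as listed in this post, keyed by month/day,
# stored as (hours, minutes). Values are copied from the thread only.
outages = {
    "1/18": (1, 34),
    "1/10": (6, 0),
    "12/30": (0, 8),
    "12/10": (0, 5),
    "11/18": (3, 46),
}


def total_minutes(durations):
    """Return the combined downtime in minutes."""
    return sum(h * 60 + m for h, m in durations.values())


total = total_minutes(outages)
hours, minutes = divmod(total, 60)
print(f"{len(outages)} outages, {hours} h {minutes} min total downtime")
# prints "5 outages, 11 h 33 min total downtime"
```

The total covers only what is listed here; check your own CAC notifications for the exact start and end times that applied to your tenant.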

kishore_krish12
Galactic 3
0 Kudos

Agree. Our business has pointed out the increasing frequency of these outages. This will play a major role when the license renewal comes up.

0 Kudos

Just passed the 3 hour mark. No word on how much longer.  Support updated my ticket and said to just keep checking the cloud availability site. Not much help when I have tons of people asking when we can expect to be back online. 

0 Kudos

Same here. I am checking CAC every 2 minutes. We may end up opening a deviation.

kevin_weatherbe
Galactic 3
0 Kudos

We just received this update from SAP:

The core SuccessFactors platform is almost back online now. We still need to restore Boomi services and that will be done once the core SFSF services are up and running. I am working on getting a more concrete ETA for you on the full restoration now

0 Kudos

Where did you get this update? On a support ticket? Because CAC has the same message as before:

Impact Description:
You will not be able to access the system(s) until resolution.
Status Update:

We are continuing to investigate the issue impacting your SuccessFactors Applications. Recovery efforts are in progress to restore application access. We are investigating the issue and will check back with you as we learn more information.

eoinfinn
Employee
0 Kudos

My customer has received a resolved notification but their preview environment is still disrupted 

Does anyone know when this will be restored?

0 Kudos
Both prod and staging instance are back up for us. We did some basic testing and it looked ok


nadeem_a
Galactic 3
0 Kudos

We were unable to access the community during the outage (we kept receiving login errors), but it looks like you guys were able to. Agree on the increasing frequency of these issues.

Ibrosseau
Galactic 3
0 Kudos

Same here, I was not able to access the community on Jan 10.
Do we know if there will be a root cause explanation (email?) that I can use for my deviation?

0 Kudos

There's usually an RCA within 10 days of the outage. That should go to the same people who got the outage notification.

kevin_weatherbe
Galactic 3
0 Kudos

Our users are receiving the same error when trying to access our Production environment as we were last week during the outage. The Cloud Availability Center isn't showing any issues. Is anyone else experiencing any issues?

0 Kudos

We are experiencing the same issue on our production instance as well! 

0 Kudos

Yes same issue, we logged a P1 incident with SAP

Maria_Molvin
Galactic 4
0 Kudos

Yes we are having the same issue - again!

Maria Molvin

Learning Technologist

GOJO Industries, Inc.

0 Kudos

Same here, and this is no good.

0 Kudos

Same here. Not working!

0 Kudos

Unfortunately, our company is completely down as well.

Wick_Orley
Galactic 4
0 Kudos

We are down again too...though no communication from SAP.

nadeem_a
Galactic 3
0 Kudos

Yep, I can get into the system but see the error "internal server error" when trying to access training in my plan.

Jesus_Gandia_Ga
Galactic 2
0 Kudos

The same issue for us. I checked the trend, and this is #4 in 3 months.

Brabham_Marla
Galactic 4
0 Kudos

Same here; no update from SuccessFactors about the cause. Have not gotten an RCA from the last outage.

chadhughes
Galactic 2
0 Kudos

This is the second time we've been completely down in 8 days.  The last time there was an issue with the Ashburn Data Center as well.  SAP still has not provided us with a root cause. 

I think the SLA for root cause analysis is 10 days (working days), so we should have something by end of week or Monday.

eoinfinn
Employee
0 Kudos

Does someone have an email for the outages team, or the team responsible for LMS? A customer is requesting a call due to the recent outages.

SF-Dan
Employee
0 Kudos

Per KBA 2116114 - Root Cause Analysis (RCA) Policy for SuccessFactors Cloud:
Timeline:
 The target timeline for the RCA to be published in CAC is within 20 days of the incident resolution.
The Jan 10th event EV8166899 has RCA already published to CAC ahead of this timeline. 

KAC06610
Galactic 3
0 Kudos

This came on 1/20:

Dear Customer,

We have completed the root cause analysis for this incident and would like to share it with you.

Incident Description:

On January 10, 2022, customers hosted in the Ashburn (DC08) Data Center experienced a service disruption while accessing multiple SuccessFactors applications.

The disruption was caused by a network connectivity issue that continuously caused multiple network switches to restart. The constant restarting of the switches did not allow application requests to process.

As a result, the vCenter (central management utility used to manage virtual machines (VMs) from a centralized location) lost connectivity to the underlying storage devices causing all the VMs hosted in DC8 to become non-responsive.

Incident Resolution:

The network team disabled an authentication configuration on the impacted switches to resolve the issue, which recovered the switches and restored network connectivity.

After the network resolution, Cloud Operations recovered the vCenter, then restarted all VMs in sequential order. The recovery process was complex due to several service interdependencies. After appropriate validation checks, Cloud Operations confirmed service restoration on January 10, 2022.

Root Cause Analysis:

The root cause of this incident has been attributed to a software defect on a specific version of the network vendor switch firmware. The vendor had recommended an iterative upgrade approach using the interim version of firmware as a first step, followed by an upgrade to the final firmware release. It was during this iterative approach/interim version that the defect was introduced and led to the network connectivity issue/service disruption for all SuccessFactors applications.

Preventive Measures:

Measure #1: Upgrade the network vendor switch firmware to the final firmware release. (Completed: January 15, 2022)

Measure #2: Audit checks of the firmware version on all remaining network devices not affected by the issue, and an upgrade to final firmware version where required. (Target Date: February 25, 2022)

Measure #3: Future network switch upgrades to be performed after thorough verifications and checks by the network vendor. (Target Date: February 25, 2022)

Measure #4: Create an automated service restoration process for data center level outages to restore services as quickly as possible if server restarts are required for resolution. (Target Date: February 17, 2022)