Unraveling the Chaos: What We Learned from a Data Center Electrical Disturbance

February 13, 2026

In complex electrical systems, even a small error can trigger a cascade of major disruptions. That is what happened at a data center, where an unexpected electrical disturbance set off a series of failures across the system. The incident tested the resilience of the facility’s infrastructure and highlighted the importance of following detailed procedures when operating complex electrical systems.

The Incident

It all started when facilities personnel inadvertently initiated a “load test” of a standby diesel generator instead of the intended “no load” engine test. This caused an unintended open transition transfer to the generator, and downstream loads ran on battery power for approximately three seconds. The error most likely resulted from pressing the wrong button on the Woodward DTSC-200 controller.

UPS Failure and Load Transfer

Figure: MIB3 control board

The situation quickly escalated when the retransfer to utility power was initiated. As expected, the generator main breaker opened during the open transition retransfer. The UPS ran on battery for approximately 179 milliseconds, then raised alarms indicating breaker position inconsistencies and opened its output contactors, which de-energized the UPS output. Damage to the external interface control board within the UPS caused the internal logic to misinterpret the breaker statuses and conclude that the UPS output was paralleled with the maintenance bypass source. The UPS then attempted to transfer to static bypass, which was unavailable because the utility main had not yet closed.

STS Retransfers and Manual Interventions

All five STSs associated with the system simultaneously transferred the load to the backup UPS, using optimized transfer algorithms to limit inrush current. The transfer took at least 5 milliseconds, during which the bulk capacitors in the server power supplies were partially drained of their stored energy. Once the load reached the backup UPS, the power supplies drew a large current to recharge those capacitors (236% of the backup UPS rating), which overloaded the backup UPS. The overload forced the backup UPS to transfer to static bypass while its inverter was out of phase with the bypass source. The out-of-phase transfer caused the power distribution units (PDUs) to draw large inrush currents. Given the impedance of the power source, the high current depressed the voltage on the secondary side of the downstream PDUs and ultimately led to partial de-energization of some loads in the critical environment.
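The energy balance behind this sequence can be sketched in a few lines. This is a rough illustration only: the capacitor size, bus voltage, per-supply load, and UPS rating below are hypothetical values chosen for the example, not measurements from the incident. Only the 5 millisecond gap and the 236% figure come from the description above.

```python
import math


def cap_voltage_after_gap(c_farads, v_start, load_watts, gap_seconds):
    """Voltage remaining on a bulk capacitor after it alone supplies a
    constant-power load for gap_seconds (simple energy balance, no losses)."""
    e_start = 0.5 * c_farads * v_start ** 2      # stored energy in joules
    e_left = e_start - load_watts * gap_seconds  # energy left after the gap
    if e_left <= 0:
        return 0.0                               # capacitor fully drained
    return math.sqrt(2.0 * e_left / c_farads)


def overload_percent(demand_kva, rating_kva):
    """Demand expressed as a percentage of the UPS rating."""
    return 100.0 * demand_kva / rating_kva


# Hypothetical single power supply: 470 uF bulk capacitor charged to 390 V,
# feeding a 500 W load through a 5 ms transfer gap.
v_after = cap_voltage_after_gap(470e-6, 390.0, 500.0, 0.005)
print(f"bulk capacitor voltage after gap: {v_after:.0f} V")

# Hypothetical 500 kVA backup UPS seeing 1180 kVA of recharge demand.
print(f"overload: {overload_percent(1180.0, 500.0):.0f}% of rating")
```

Each supply loses only a small fraction of its stored energy in this model, but thousands of supplies recharging at the same instant add up to a demand well above the UPS rating.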

Physical Observations and Equipment Findings

Figure: Final resting place of the screw after MIB3 was unmounted from the enclosure

Onsite investigation with the original contractor identified the MIB3 control board as a probable cause of the UPS failure. Upon further inspection, a small loose screw was found behind the board with evidence of arc damage, indicating that a short circuit had disrupted the breaker status signals. This conclusion was further supported by a blown fuse on the emergency power supply board feeding MIB3, which made board replacement necessary. The signal acquisition board (MIB3) was replaced, and loose plugs found inside the UPS were reseated. Testing after the repairs showed normal operation with no recurrence of the failure symptoms.

The board failure shorted out the control power to the board, causing the UPS internal control logic to perceive the maintenance bypass breaker (MBB) as closed and the UPS output breaker (UOB) status as undefined. This incorrect conclusion triggered a shutdown and an attempted transfer to static bypass, which was unavailable at that time. The STSs operated correctly, transferring load simultaneously to the backup UPS system with minimal inrush current. The large current spike observed after the transfer was primarily due to server power supply load rather than transformer inrush, and it caused the backup UPS to overload and transfer to static bypass. The total event duration from UPS output failure to UPS voltage recovery was approximately 24 to 27 milliseconds and involved multiple brief outages and significant voltage disturbances that likely disrupted server operations.
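The decision path described above can be illustrated with a simplified sketch. This is not the UPS vendor's control logic. The `BreakerStatus` enum, the `ups_response` function, and the returned action strings are all hypothetical names introduced for the example, and the rule is reduced to the single case the investigation describes.

```python
from enum import Enum


class BreakerStatus(Enum):
    OPEN = "open"
    CLOSED = "closed"
    UNDEFINED = "undefined"  # status signal missing or contradictory


def ups_response(mbb, uob, static_bypass_available):
    """Hypothetical, simplified reaction of a UPS running on inverter.

    mbb: perceived maintenance bypass breaker status
    uob: perceived UPS output breaker status
    """
    # Assumption: a closed MBB or an unreadable UOB status is treated as
    # the output being paralleled with the maintenance bypass source.
    inconsistent = (mbb is BreakerStatus.CLOSED
                    or uob is BreakerStatus.UNDEFINED)
    if not inconsistent:
        return "stay on inverter"
    if static_bypass_available:
        return "transfer to static bypass"
    return "output de-energized"


# Healthy signals: MBB open, UOB closed.
print(ups_response(BreakerStatus.OPEN, BreakerStatus.CLOSED, False))

# Signals as perceived during the event, with the utility main still open.
print(ups_response(BreakerStatus.CLOSED, BreakerStatus.UNDEFINED, False))
```

In this model the breakers themselves never move. Only the perceived status changes, which is why a damaged signal board is enough to drop the output.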

Next Steps and Recommendations

To prevent similar incidents in the future, HP&D made several recommendations:

  • Investigate server ride-through capabilities, load characteristics during voltage disturbances, and potential adjustments to server startup configurations to mitigate load spikes
  • Clarify UPS overload thresholds and consider settings changes to improve UPS performance during similar events, including replacing Signal Acquisition Boards in other UPS units as per field service bulletins
  • Develop interim operational measures and MOP revisions to ensure cautious STS transfers and improved monitoring with power quality meters, alongside enhanced personnel training and escalation protocols
  • Consider synchronizing critical electrical component log timestamps and upgrading meters to include waveform capture capabilities to improve the accuracy of event logging
  • Consult power supply manufacturers to understand the impact of waveform distortion and voltage dips on server power supplies during such events
  • Determine pre-transfer delay times and UPS ride-through capabilities to inform future system design and settings
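
The recommendation to synchronize log timestamps matters because the whole event lasted only tens of milliseconds, so even a small clock error can reorder the apparent sequence. The sketch below shows one possible way to merge device logs after correcting for known clock offsets. The device names, offsets, and log entries are invented for illustration, and `align_events` is a hypothetical helper, not part of any monitoring product.

```python
from datetime import datetime, timedelta


def align_events(logs, offsets_ms):
    """Merge per-device event logs into one timeline.

    logs: {device: [(timestamp, message), ...]}
    offsets_ms: {device: milliseconds the device clock runs ahead of
                 the reference clock (negative if it runs behind)}
    """
    merged = []
    for device, events in logs.items():
        correction = timedelta(milliseconds=offsets_ms.get(device, 0))
        for ts, msg in events:
            merged.append((ts - correction, device, msg))
    return sorted(merged)  # ordered by corrected timestamp


# Hypothetical data: the UPS clock runs 230 ms ahead of the STS clock.
t0 = datetime(2026, 1, 1, 12, 0, 0)
logs = {
    "UPS-A": [(t0 + timedelta(milliseconds=250), "output contactor open")],
    "STS-1": [(t0 + timedelta(milliseconds=40), "transfer to source 2")],
}
offsets_ms = {"UPS-A": 230, "STS-1": 0}

timeline = align_events(logs, offsets_ms)
for ts, device, msg in timeline:
    print(ts.isoformat(timespec="milliseconds"), device, msg)
```

With the raw timestamps the STS transfer appears to precede the UPS fault. After correction the UPS event comes first, which matches the cause-and-effect order.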

Conclusion

The electrical disturbance at the facility was a complex issue with multiple contributing factors. Through investigation and collaboration, we identified the root causes and implemented corrective actions. This incident highlights the importance of rigorous procedures, regular equipment maintenance, effective load transfer management, and expert consultation. Investing in advanced monitoring tools and taking proactive steps can reduce the risk of future disruptions and help maintain smooth operations.