Days-long downtime spurs disaster recovery tune-up

Apr 7, 2004

A network outage is the bane of every information technology (IT) administrator's professional life. In healthcare it's more than an inconvenience -- it can be a devastating, potentially life-threatening crisis.

If one or two systems within a medical campus go down, the institution can implement stopgap measures to compensate. If the entire network of a medical center shuts down, the flow of vital patient information, such as images, pathology results, and pharmacology, grinds to a halt, putting critical care in peril.

In mid-November 2002, IT executives Alice Lee and Cathy Ball of CareGroup Healthcare System in Boston experienced the nightmare of a three-day complete system outage at their facility. In a presentation at the 2004 Healthcare Information and Management Systems Society (HIMSS) conference in Orlando, FL, the duo shared the lessons they learned.

"On Wednesday afternoon, November 13, the IT support center began receiving calls from clinicians (complaining) of slow response and loss of access to applications. In addition, intermittent network downtimes were reported. By late that evening, applications were restored," said Alice Lee, vice president of IS clinical systems at CareGroup.

However, trouble started again on Thursday as the staff attempted to use the network as usual. The IT staff fanned out through the facility to troubleshoot systems as various departments reported problems.

By Friday, workarounds implemented over the previous 36 hours were having little effect in bringing the systems back to the necessary performance levels.

"At 4 p.m., the chief operating officer (COO) decided to remain in full downtime mode until 24 hours of continuous service could be attained. Staff access to applications was blocked as some personnel still attempted to use the system despite campus-wide instructions to stop," Lee said.

This was not a trivial decision on the part of CareGroup management. More than 120 applications affecting every aspect of the healthcare continuum at the facility were shut down.

"Clinical users were forced to create paper-based records for orders and results, drug interaction checking was done manually, we were forced to film rather than PACS images, e-mail and home directories were not available, and $3 million a day in patient billing activity ceased," Lee reported.

Downtime diligence

As the IT staff went into 24/7 mode to fix the system problems, each department in the medical center implemented its own operational downtime plans. According to Lee, the departments quickly found out that:

Many of the downtime plans were outdated, including the contact lists.

The plans did not reflect what was needed in an extended downtime situation.

The staff had difficulty locating paper forms to be used in place of electronic ordering and documentation.

The hospital responded to the system shutdown by establishing a command center staffed by senior administrators. They designated runners from each department to carry specimens, tests, supplies, results, and communications from the center so that patient care could continue at the facility.

"It was like stepping back in time 20 years as we returned to the manual paper method," Lee said. "Our older clinicians were able to adapt quickly to the downtime manual method, having done work this way in the past. The house staff and younger clinicians were unable to visualize other than electronic methods, requiring more training and guidance."

By Sunday (November 17), the IS staff had put a solution in place, and the decision was made to monitor the stability of the system for 24 hours before allowing the use of clinical applications by the CareGroup staff. Bringing the network back online turned out to be only part of the task facing the IT group.

Wrestling with recovery

Even though the technical work of fixing the network had been accomplished, recovery plans had to be implemented for each of the more than 120 clinical applications on the campus. In addition, the financial applications needed to go through a recovery cycle.

"Not all the applications had a recovery plan, as we hadn't experienced or anticipated an extended downtime," said Cathy Ball, director of IS clinical systems for CareGroup.

Over the weekend, the missing recovery plans were identified, drafted with clinicians, and signed off by senior administration. By Monday morning, the system had maintained stability for the requisite 24 hours, and the COO gave permission for users to access all systems.

The recovery implementation rolled out from Monday through Wednesday, with provider order entry scheduled over two full days. In addition, parallel processes and the command center were kept in place through Thursday, November 21, in case further problems arose.

Post-mortem

Although Ball and Lee didn't share the specifics of what caused the network outage, they said the most important lesson the IT group learned through its experience was the need to plan for extended network outages.

"We discovered that a hard downtime is preferable to systems that are going up and down. Also, off-line systems are much easier to troubleshoot," Lee said.

The pair's other recommendations:

Have plans at the ready to accommodate IS/IT staff overnight.

Designate a dedicated IS liaison for consistent communication to hospital leadership.

Prepare for media coverage.

Keep downtime plans and contact lists up to date, with extra paper forms available.

Ensure that all IS/IT staff are oriented to clinical and ancillary areas.

Don't allow any one individual in the IT/IS group to become a single point of failure. Ensure that each position has a backup person.

Have two-way radios available as a backup to the phone system.

Each quarter, CareGroup checks its ongoing readiness for disaster recovery by updating its contact lists and its downtime and recovery policies and procedures, and by periodically exercising them, Ball said.

Although a complete system shutdown could have had catastrophic effects on patient health, the November 2002 downtime was well managed by the CareGroup staff.

"Care was delayed, but a review indicated that clinical outcomes were not affected by the network outage," Lee said.

By Jonathan S. Batchelor
AuntMinnie.com staff writer
April 8, 2004
