A Disaster Recovery Perfect Storm
Last week I had a customer with a downed POWER5 system due to hardware failure.
The system bezel said that it couldn’t find the disk drives, which could have pointed to any of several failing parts. This system had a single disk drive mirrored to another. The backup was questionable (i.e., the customer couldn’t guarantee what was on it) and was on ¼ inch tapes, so it was not really compatible with current technology. Also, the customer did not have an active hardware/software maintenance contract. The system was set up for Operations Console, but there was neither a cable nor an Operations Console PC in the building. No LIC DVD was available either. These are the facts of the situation.
Settling on a plan:
An IBM customer engineer (CE) was dispatched at their hourly rate with a four-hour minimum and a purchase order required up front. That’s standard. IBM arrived on-site at 2 PM as they were a 90-minute drive away. Since there was no console and no LIC DVD on-site, and the CE hadn’t brought a 5.4 LIC DVD, there was not much the CE could do that late in the day. When you’re paying IBM time and materials, they don’t necessarily work on the weekend. Time and materials means when they have the time and if they have the materials, within a 9 to 5 window during the business day. Customers with 24×7 maintenance contracts come first, so we talked through the scenarios with IBM to decide how to move forward.
We settled on a two-pronged approach: we would prepare a cloud recovery and IBM would continue diagnosing the problem locally.
First thing Monday morning the customer shipped us their ¼ inch tapes, which we had to convert to our Virtual Tape Library. Once the tapes were converted, we would attempt a restore to our cloud offering, iInTheCloud.
A great effort by IBM
The IBM CE went above and beyond what was required. He returned Monday and configured a LAN console via the bezel and diagnosed the problem fully with a D-mode IPL after having brought a 5.4 LIC DVD. He stayed on-site well past 5 PM, so I owe the guy dinner next time I see him. The list price for the replacement hardware via IBM would’ve exceeded the cost of a tiny new POWER9, three years’ worth of maintenance and an external LTO7 drive.
Where iTech came in
In the interim, we had spun up a cloud partition for them, and on Tuesday night, after the tapes arrived, we recovered the IBM i 5.4 system to a 7.3 partition. Luckily, their objects had previously passed an observability check (ANZOBJCVN), so bringing the workload to 7.3 wasn’t too much of a problem. The customer shipped two tapes: one labeled as a full system save from a couple of years ago (which, sadly, it was not) and a nightly save tape from last week which, as luck would have it, contained security data, configuration, DLOs, and all user libraries.
All in all, the customer was down from Friday through Wednesday morning.
Takeaways that can help you:
Any type of disaster recovery can be quite costly in terms of having your business down for days or longer. This type of scenario underscores some valuable points:
- If you haven’t tested your nightly backup or full system save, it’s suspect. Don’t wait for a disaster to test your disaster recovery. Also, know exactly what you save on a nightly basis. User profiles? Configuration? User libraries? Only changed objects? You need to know this.
- Full system saves will save you. Know where the tapes are located and test them too. If you don’t have any spare iron to do so, we can assist with ours. We have a lot of it either in our head office in Danbury or via our iInTheCloud offering. You should never have to look for your Save 21 tape.
- Ensure you have a working console. This will enable you to both perform diagnostics on a D-mode IPL and recover the system. When the IBM CE asks what kind of console you have you’d better have an answer. Don’t waste valuable time doing a 65+21+11 over and over on the front panel. We literally had IBM quickly look on the back of the box to see if there was a twinax block just to rule it out.
- Ensure you have a Licensed Internal Code (LIC) DVD to do a D-mode IPL. This DVD should be at the same version and release as your installed LIC, and at a respin level equal to or greater than the one you have installed. When a new LIC respin is released, download a copy.
- Keep your systems under 24/7 maintenance. Yes, it can get a little expensive, especially on older iron; however, it ensures that IBM will work overnight and through the weekend to get you back in business. It also ensures that the parts needed to get you back in business will not cost you your firstborn. Time and materials means when they have the time and if they have the materials. There are no guarantees. It’s best effort.
- Keep your assets current. Older parts are not as easy to come by. In this scenario, replacement disk drives were available at the local depot but the backplane and disk controller would have had to be shipped in from three hours away. Plus, those parts are not cheap. From a recovery compatibility standpoint, old ¼ inch tapes need to be converted before we can do anything with them. If the customer had been on a newer LTO drive, it would have sped up the recovery dramatically. LTO is designed to last longer. If they had been using LTO for the last 10 years, I would’ve had more confidence in our ability to read the tapes. We didn’t know if we could even read the ¼ inch tapes until we tried. For all we knew they were blank or degraded to the point of being unusable.
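If it helps to turn the takeaways above into something you can actually run through once a quarter, here is a minimal Python sketch of a readiness check. It is only an illustration: the class and function names are made up for this example, and it assumes you record LIC respin levels as plain integers that can be compared, which may not match how your respin identifiers are actually labeled.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class LicLevel:
    """Hypothetical record of a LIC level: version, release, respin."""
    version: int
    release: int
    respin: int  # assumption: respins are tracked as comparable integers


@dataclass
class RecoveryReadiness:
    """Hypothetical snapshot of the items from the takeaway list."""
    installed_lic: LicLevel
    lic_dvd: Optional[LicLevel]   # None if no LIC DVD is on hand
    console_works: bool
    full_save_tested: bool
    nightly_save_tested: bool
    maintenance_24x7: bool


def lic_dvd_is_usable(dvd: LicLevel, installed: LicLevel) -> bool:
    """Same version and release, and respin equal to or greater than installed."""
    same_vr = (dvd.version, dvd.release) == (installed.version, installed.release)
    return same_vr and dvd.respin >= installed.respin


def readiness_gaps(r: RecoveryReadiness) -> List[str]:
    """Return a plain-language list of gaps; an empty list means none found."""
    gaps = []
    if r.lic_dvd is None:
        gaps.append("No LIC DVD on hand for a D-mode IPL")
    elif not lic_dvd_is_usable(r.lic_dvd, r.installed_lic):
        gaps.append("LIC DVD does not match installed version/release/respin")
    if not r.console_works:
        gaps.append("No working console")
    if not r.full_save_tested:
        gaps.append("Full system save has not been test-restored")
    if not r.nightly_save_tested:
        gaps.append("Nightly save has not been test-restored")
    if not r.maintenance_24x7:
        gaps.append("System is not under 24/7 maintenance")
    return gaps


if __name__ == "__main__":
    # Roughly the situation in this story; the respin number is illustrative.
    customer = RecoveryReadiness(
        installed_lic=LicLevel(5, 4, 1),
        lic_dvd=None,
        console_works=False,
        full_save_tested=False,
        nightly_save_tested=False,
        maintenance_24x7=False,
    )
    for gap in readiness_gaps(customer):
        print("-", gap)
```

The point of returning a list of gaps rather than a single pass/fail is that each item maps back to one bullet above, so whoever runs it knows exactly what to go fix.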
Use this cautionary tale to protect your business.