Fun Friday #49: What could go wrong that DID go wrong

Happy Friday everyone!

Today I want to trade ‘war stories’ of IT mishaps and lessons learned.

Have you ever felt psychic, as if you could see something about to go horribly wrong, and then it DID go horribly wrong? Maybe you had the overwhelming urge to say “I told you so”? Maybe you did tell them so and got promoted or fired!

Let’s hear the stories!

My story that inspired this Fun Friday happened a fair amount of time ago, so it feels okay to discuss now.

Customer was going through some financial difficulties, cash flow was reduced, survival mode engaged. Cut all the costs that you can find.

The support contract for the backup server was due for renewal. This server was already 5 years old and, with their financial forecast, a replacement was not on the table. We tried to be price sensitive and offered third-party Next Business Day support, which was as cheap as we could get. Customer asked if there was anything cheaper; we told them there wasn’t. Customer declined support renewal. I made a personal appeal in the conversation that this server only has a RAID 5 and is their ‘all-in-one’ backup server: if it stops, all backups stop.

Customer still did not renew.

Fast forward 6 months, it happened. A disk failed in the array. Customer needs a quote for a disk, we get the quote, but the supply chain says it’ll take 10 days to get there. Panic sets in.

Customer then got a disk-failing warning for a second disk, so a second disk was ordered. Customer was stressing a lot, with “have you not got any spares???” type questioning of us. The first drive turned up, the failed disk was replaced, and the rebuild completed over the weekend. Before they got anyone to site to replace the failing disk, it failed too; cue more stress whilst they awaited the second RAID rebuild.

Side note: The third-party NBD support contract cost less than the two disks.
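To put a rough number on why a 10-day parts lead time matters for a degraded RAID 5, here is a minimal Python sketch. It assumes independent disks with a constant annual failure rate (a simple exponential model); the function name, the disk count, and the 5% failure rate are all hypothetical and not taken from the story. Same-age disks in one chassis tend to fail together, so treat the result as optimistic.

```python
import math


def second_failure_probability(remaining_disks, annual_failure_rate, exposure_days):
    """Chance that at least one remaining disk fails during the exposure window.

    Assumes independent disks with a constant failure rate (exponential model).
    """
    if remaining_disks < 0 or annual_failure_rate < 0 or exposure_days < 0:
        raise ValueError("inputs must be non-negative")
    exposure_years = exposure_days / 365.0
    return 1.0 - math.exp(-remaining_disks * annual_failure_rate * exposure_years)


# Hypothetical figures: 5 surviving disks, 5% annual failure rate each.
next_day = second_failure_probability(5, 0.05, exposure_days=1)
ten_days = second_failure_probability(5, 0.05, exposure_days=10)
print(f"1 day degraded:   {next_day:.4%}")
print(f"10 days degraded: {ten_days:.4%}")
```

The point is only the comparison: the longer the array sits degraded waiting for a disk, the larger the window in which one more failure means the backups are gone.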


The famous saving attempts…

I once had a customer that did not want the “expensive” support from the manufacturer of a tape library and ordered third-party support instead.

After some time there was a fault and the library logged some strange messages. The third-party technician came, looked at the library and said “You have to update the firmware first. Prior to this I cannot do anything.”

Ok, give the new versions to us and we will do this.

Problem was, firmware downloads were not included in the third-party support… So we asked the manufacturer for the firmware.

To make a long story short, the firmware download was more expensive than the whole third-party support for several years. The original support (with firmware support) would have been less expensive for a whole year than the single download…

Mission to save money perfectly accomplished… Double the expense, and a massive number of hours spent by us to get this going….

Well this is not Veeam related but we decided to implement a new product for offline archiving. The software was “brand new” so technically we were Beta testing it. I was rushed through implementing and testing it to get to Production by a specific date. I met the deadline but tried to explain that if we release this too quickly that I can foresee many issues due to how new the software was.

So I pushed back each time they wanted to bring on Petabyte customers to this service and in the end had to give up somewhat. We brought customers on and initially had issues as I forecasted but once we got over the hump the service has been rock solid until a few months ago. But after some tweaking, etc. we are back to stability.

Lesson learned - no matter how quickly you want to get an application into Production, be sure that you have thoroughly tested it and done all your checks and balances first. Especially when it comes to other applications that may interact with the one you are working on.

Buckle up kids...this is a long one!

TL;DR: Client is on really old hardware. We recommend replacing it. Hardware fails as expected. Veeam and some creative engineering get them back online. Client has a hard time paying the bill, makes a claim to insurance, does eventually pay us, but we eventually fire them.

The long version:

Have/had a client (small rural community hospital) that was running on a 7-year-old EqualLogic PS6100 storage array and 6-year-old Dell ESXi hosts - all were retired/donated hardware from other businesses. No hardware warranty/support. Told the client that this was a huge risk to hospital operations: this hardware could fail and take everything down. We gave them a proposal to replace the hardware, but they were short on money of course and were going to ride it out without a plan. Nearly a year later, we went back to them to let them know they were at even greater risk. We in fact went on-site, sat down with them and our proposal, and walked them to their server room so that they could lay eyes on their systems and understand what hardware we were talking about. They said they understood and would talk it over, but funding was still an issue as they were a low-traffic facility that averaged less than two patients a day. CEO went fairly deep into how the grant process and Medicare funding works, etc. I felt for the guy...he knew what he was talking about, and I knew they didn’t have the money for the hardware but were working on it, just very slowly.

Over the course of time, the hospital is doing construction, and twice the server room overheats because the AC exhausts get blocked with dust and debris and the AC just turns off. We tell the client that the server room getting over 100 degrees (F) is really hard on the hardware. Client is paying for some of the construction with COVID (CARES) money and plans on using it for infrastructure upgrades as well, but ends up scaling back on what they can spend on IT/Phones/Cameras/Access Control because the money doesn’t go quite as far as they wanted. That delays purchasing the needed IT hardware.

A couple months later, the entire environment is down. On the weekend of course. One of my engineers drives 2 hours on-site to troubleshoot but doesn’t know what the issue is. I hear through the grapevine, but I’m not yet escalated to. Engineer drives 2 hours back during the night to our office to get some hardware and then 2 hours back to the client site. Morning rolls around...still having issues, and the problem is escalated to me. I head to the office, grab a car, grab some hardware and head on out for the 2 hour drive as we now know that the storage array has failed. I console into the controllers and find that the supercapacitor board on both controllers has failed due to age causing BSD kernel panics and both controllers are boot looping. Turns out the spare SAN I brought (PS6000) doesn’t use the same modules because it’s a generation older - it has actual batteries.

I slap extra hard drives into one host and set up a local datastore on the added disks, build a new Veeam server (VBR was a virtual box backing up to a NAS), connect it to our iland VCC repo to grab the config database backup, restore it and take a look at what I have for backups. Turns out all VMs were successfully backed up the night before except for one Domain Controller, which was missed because the repo was getting full….no biggie, just restore the other DC and bring it up first, then bring up the older restore point of the VM that wasn’t as recently backed up and let AD replicate.

Sit for a few hours while VMs restore, the VOIP team builds new servers/appliances because those were never added to the backups. Hospital is back online. All is well. Two hour drive home. We call the boss man and tell him not to let the initial responding engineer drive home, as he’s now been up for 40 hours and needs to sleep…..

The next day the client reports that some of their restored files are out of date. How can that be….turns out the DC that was restored from an older restore point also happened to be a file server. So now I need to rebuild the failed array that I had brought back to the office with me. Fortunately, I had another client that managed to trash a PS4100 that was equally old, and I knew they had it in storage. That one I know failed because they tried to do power maintenance while running on UPS batteries that ran out of juice; the array went hard down and had 5 out of 12 drives fail. Email that client, ask if he’d be willing to donate his old array that’s taking up space and might have the right parts that I need to help this little hospital. He is, and I drive 2 hours round trip to pick up the array. Parts match, and I’m able to bring the PS6100 back online. Plug it into my lab environment, figure out the iSCSI CHAP authentication, map some datastores, and bring up the VM in question so that I can extract the files from that server. Copy the files to the DC back in production, and put them alongside the files that had since been changed on the restored VM - they now have duplicates of some data, but at least they have the old data back as well. Turns out to be 125MB worth of data, mostly a spreadsheet that is updated by nurses as they make their rounds or something like that. But hey….zero data loss, mostly due to Veeam covering us. I have no doubt that if the client was still using ShadowProtect from their previous MSP, we would not have been able to restore what we did, nor as quickly.

In the end, the client did find the money to get the hardware purchased. We throw in a new PowerVault, two new hosts, and a new physical Veeam server with local storage so that we don’t have to delay to build a new Veeam server next time. Client is claiming the replacement hardware to insurance due to the service life being cut short by the AC failures. Insurance company, probably partially due to COVID, goes through about 3 claims adjusters through the process. I provided a root cause analysis after the incident and we reprovide it to the adjuster. In the RCA, I did state that the hardware was already really old but that the AC failures didn’t help the life of the old hardware. I also had a phone conversation with one of the claims adjusters to give some context to the RCA, as the client is walking a fine line in my opinion, and I want to make sure we’re not aiding in some sort of insurance fraud.

Client eventually has issues paying us for our services. Threats of lawsuits ensue as the client is trying to wait out getting money from the insurance company to pay us for our services. We eventually get paid but end up firing the client a few months later because they’re just not a good fit for us, as much as it pains everyone involved. We have to do what’s right for us, but we give them plenty of time to get onboarded by someone else. They end up hiring one of our former engineers that left for a different company to handle the offboarding, and he gets roped into their insurance claim as well. Insurance company states they’ll pay them the depreciated value of the failed hardware, but won’t pay for the new hardware as that was an improvement and not a like-for-like replacement. Seems fair to me. Client wants us to provide a value of the failed hardware. Everything was end of life and only suitable for use in a lab environment, if even that. We don’t give hardware values to the client but I’d estimate the value of all the replaced hardware at about $500 tops, and I fully expect to see the insurance company cut a check to the client for maybe $100.

Client still isn’t offboarded yet, but soon. Really curious to see what the insurance pays out, but whatever it is, I’m pretty sure it’s “go away” money.

In the above post, I noted that I had a client with a very old EqualLogic PS4100 array that failed. Here’s some backstory.

Client has some old hardware and is trying to not spend money (there’s a theme in these stories). They have a PowerEdge R610 and a PowerEdge R710. The R710 has a failed iDRAC, so the server always runs at 100% fans and they have to press a key to continue booting, every time. They had since purchased a replacement motherboard for the server but never got it installed. They also have several Windows Server 2003 and 2008 VMs….this is about 3 years ago mind you. Some of those boxes are public DNS servers. There’s a lot of home-brewed applications here, but all the developers have since left and nobody knows how the apps work. I replaced their old Cisco ASA firewall with a Barracuda NextGen and managed to find why their network is running slowly (all traffic is traversing their old Cisco Voice router, and bypassing it fixes throughput). Lots of hairpinning rules, but in the end we get it replaced.

At some point, maybe a year or so down the road, they’re concerned about their old EqualLogic array failing. We propose a replacement, along with new hosts and an ESXi upgrade, because the array is not listed as compatible with ESXi 5.x. They balk at the cost and the proposal goes nowhere. I’m told that the execs won’t sign off on replacing hardware that is still working…..

A year or so later they continue to downsize and decide to move offices to a smaller location that they’re rehabbing. I write up a proposal to do the work, am VERY clear that their hardware is super old, I don’t trust it, and fully expect it to fail during the move to the new office. We take great care to put sensitive hardware in a vehicle with a squishy suspension, server racks in our box truck, and I have a spare host in the car just in case. Set everything up at the new office, and to my surprise everything survived. My newish engineer quits that night because the stress is too much for him. Fair enough, but this move really wasn’t that bad. But he has his own issues that I’m aware of and we wish him well.

I still get called here and there for some consulting with the client. I know they have to do a server room move down the hall to work around their construction, but they handle that and all is still well. Note, drywall dust is really fine and goes EVERYWHERE. Can’t help old hardware, but I digress.

A few months later I get an email from the client’s personal email address. Everything is down. Not sure why. But they moved the power feeds coming into the building over the weekend and they attempted to leave the hardware online on the UPS thinking there was enough capacity to carry through. Spoiler, there wasn’t and everything went hard down.

I arrive onsite, console into the PS4100 array that doesn’t appear to be booting. Find some “technician” commands and find that out of 12 drives, 3 are failed and 2 are unknown. Total data loss. Client is really only concerned with getting email back up for now. I go back to the office and generate a proposal for a new Office 365 tenant rather than on-premises Exchange. Client doesn’t like the recurring cost and proposes a bad solution that we won’t sign off on because we’re not putting our name on something so hacked together. Signs off on a single new host that will run minimal VMs locally. I reach out to a contact at one of our competitors that the client used to be with to see if they have any info on the client’s old Exchange 2013 licensing, which would still be supported (barely) at the time. Get enough info together to send to the client to contact Microsoft, and it turns out we can still get it installed and licensed with the information they provided to MS and what MS is able to send back to them.

I build the replacement VMs (Server 2012 R2, mind you, because of their existing licensing) on my temp hardware until the new server comes, and then I migrate the VMs over to it. Guess they didn’t need all those old apps after all. We also manage to send Veeam to them under the VCSP program and repurpose one of their old NASes, so now we’ll at least have backups next time they make bad decisions and have catastrophic, but predictable, failures.

Still get queries from that client on occasion for some random stuff but they’re pretty low maintenance since they’re still keeping costs down.

Some amazing stories there @dloseke! Enjoyed reading them

Anyone else wanna share?

When I worked at IBM we often had customers drop their maintenance, and when things would break the cost for hardware was often 10x to 40x the price of buying it outright. The catch is you can’t put the machine back on maintenance until it’s verified that no hardware issues are present. The takeaway is: don’t drop maintenance for production machines.

That being said, I’ll keep my old SANs around but ONLY for swing space and testing, such as when I need to pull something from tape to grab a file. I usually save the DR SAN as spare parts and provide my own “maintenance” because of my previous experience as an IBM tech.

I don’t let other people use it as I can’t guarantee uptime or resiliency, but it works great for temporary space of data that can be lost with no repercussions. (Testing a restore of a few hundred TB is an example)

I have so many horror stories from that job of customers losing data. 99.999% of it is usually from a bad decision, improper config, or poor policy. I signed too many NDAs to get into specifics, but if you follow 3-2-1 and best practices, you will be safe. Doing a health check is never a bad idea either.
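For anyone who hasn’t met the 3-2-1 rule mentioned above (at least three copies of the data, on at least two different types of media, with at least one copy off-site), here is a minimal Python sketch of a check. The `BackupCopy` class, its field names, and the example inventory are hypothetical and only there for illustration; this is not how any particular backup product models it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BackupCopy:
    name: str      # hypothetical label for the copy
    media: str     # e.g. "disk", "tape", "cloud"
    offsite: bool  # True if the copy lives away from the primary site


def meets_3_2_1(copies):
    """Return True if the copies satisfy the 3-2-1 rule."""
    enough_copies = len(copies) >= 3
    enough_media = len({c.media for c in copies}) >= 2
    has_offsite = any(c.offsite for c in copies)
    return enough_copies and enough_media and has_offsite


# Hypothetical inventory: production data, a local NAS backup, and a cloud copy.
inventory = [
    BackupCopy("production", "disk", offsite=False),
    BackupCopy("nas-backup", "disk", offsite=False),
    BackupCopy("cloud-copy", "cloud", offsite=True),
]
print(meets_3_2_1(inventory))
```

Dropping the cloud copy from the example leaves two copies on one media type with nothing off-site, which is roughly where the clients in the stories above found themselves.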

Agreed, I keep old SANs for temporary loaners when production has failed and it’s going to be a bit (such as waiting for a new SAN to arrive), lab gear, etc. But once things hit a certain age and lack of support, they shouldn’t be in production. Sure...you might get a few more years out of them….but is it worth the risk to your business, and what’s the cost of downtime once you factor in lost revenue, client frustration, and having employees sitting there twiddling their thumbs while they can’t do anything?
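The downtime question can be put into rough numbers with a back-of-the-envelope sketch like the one below. The function name and every figure in it are hypothetical, and it only counts lost revenue and idle wages; client frustration and reputation are real costs too, but they don’t fit neatly in a formula.

```python
def downtime_cost(hours_down, revenue_per_hour, idle_employees, hourly_wage):
    """Rough outage cost: lost revenue plus wages paid to idle staff."""
    lost_revenue = hours_down * revenue_per_hour
    idle_wages = hours_down * idle_employees * hourly_wage
    return lost_revenue + idle_wages


# Hypothetical outage: 8 hours down, $1,000/hour revenue, 20 idle staff at $30/hour.
cost = downtime_cost(hours_down=8, revenue_per_hour=1000.0, idle_employees=20, hourly_wage=30.0)
print(f"Estimated outage cost: ${cost:,.2f}")
```

Comparing that figure against the yearly price of a support contract or a hardware refresh is usually the quickest way to make the case to whoever signs the checks.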

They are great for personal labs and running I/O meter all day too :)
