Fun Friday: Backup Designs That Sounded Better in Theory


Userlevel 7
Badge +20

Happy Friday everyone!

 

Sorry it’s been a bit quiet from me recently on this front, so let’s jump back into things with a new Fun Friday topic: Backup Designs That Sounded Better in Theory.

 

Firstly, what do I mean by this? I mean those designs where everything looks great, or a particular use case or limitation was believed to have been overcome, only to fall over for one reason or another.

Before the realisation!

My story is of an interaction I had with a DBA at a large company, who was about to learn that not all backups are created equal.


 

I was reviewing a company’s disaster recovery strategy from an application-level perspective; that is, I was ensuring that the DR strategy would be able to return the application and all of its dependencies to a production state.

 

One of the dependencies was a database, for which I noticed no backups were being generated by Veeam.

Internal Panicking

The DBA then informed me that they backed the database up using SQL Server maintenance plan-based backup jobs. I asked why they were choosing this over Veeam, as Veeam would still provide all of the integrations they needed. The DBA’s response was that they were always messing about with their backups and preferred to manage them themselves, as they couldn’t guarantee how swiftly a backup engineer would be able to assist them.

 

I asked where these backups were going, and was informed they were being saved to a file share (thankfully not on the same server!). I probed further and asked if they knew if/when the file share was being backed up; some slight panic started to set in as the DBA didn’t know. I assured them it was okay, the file share WAS being backed up. I then asked when they were backing up the required databases to the share, and was informed it was midnight.

 

I told them we had a problem: the file share was backed up at 10pm. The reason this was a problem wasn’t clear to the DBA, so they asked me why. I explained: it’s midday at the moment, and if this server’s database was corrupted, you’d want to go back to a backup, correct?

The DBA nodded.

Me: So you’d want to go back to last night’s backup to minimise RPO, correct?

The DBA agreed.

Me: Well, what if the file share was corrupted too, such as if the whole company was hit with a ransomware attack?

DBA: Well, we’d just get the backup from the file share once that’s restored.

Me: But the file share is backed up two hours before you create a new database backup, so I’d be giving you the backup from the day before. You’ve gone from losing 12 hours of data to 36.

Moment of Realisation

DBA: *moment of realisation* 🤯

 

The initial response was to re-schedule the database backups to take place before the file-share backup, and then a longer-term strategy was started to leverage native backups.
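For anyone who wants to see the arithmetic behind the 12 vs 36 hours, here’s a tiny Python sketch. The dates and times are made up to match the story (midnight database dumps, a 10pm file-share backup, a failure at midday); nothing else reflects the real environment.

```python
from datetime import datetime

# Illustrative timeline only: midnight DB dumps, 10pm file-share backups.
failure       = datetime(2024, 1, 3, 12, 0)                        # DB corrupted at midday on day 3
db_dumps      = [datetime(2024, 1, d, 0, 0) for d in (1, 2, 3)]    # dumps taken at midnight
share_backups = [datetime(2024, 1, d, 22, 0) for d in (1, 2)]      # share backed up at 10pm

# Scenario 1: the file share is intact, so the newest dump on it is usable.
newest_dump = max(d for d in db_dumps if d <= failure)
print("Share intact:  ", failure - newest_dump)          # 12:00:00 -> 12 hours of loss

# Scenario 2: ransomware hits the share too, so it has to be restored from its own
# 10pm backup, which only contains dumps that existed before that point.
last_share_backup  = max(b for b in share_backups if b <= failure)
newest_recoverable = max(d for d in db_dumps if d <= last_share_backup)
print("Share restored:", failure - newest_recoverable)   # 1 day, 12:00:00 -> 36 hours of loss
```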

 

 

So, that’s my story of the day; what’s yours?


6 comments

Userlevel 7
Badge +17

Yes, the never-ending story with DBAs 😎
I have the same discussion every time the DBA changes.

 

But it is the same with many application admins, too. 😀

Userlevel 7
Badge +20

Having a discussion as to why a Dedupe appliance is not a good design and primary target for backups! Then having my fears come true trying to restore a SQL server for Production, which took 3 days. 😮🤐

Userlevel 7
Badge +12

Really sounds familiar 😅 If such a maintenance plan is set up right, with monitoring and failure handling, and is also coordinated with the backup team, then it’s OK in my opinion. If not, then I would suggest doing an application-consistent backup with Veeam in addition; if the DB backup fails, you still have a safety net.

A common design which often only looks good in theory is the implementation of deduplication appliances. They can be really useful if sized correctly and for specific use cases. Where things can easily go wrong:

  1. The dedupe rate was overestimated and therefore the appliance fills up before achieving the planned retention (see the rough numbers in the sketch below). This can happen if the change rate is too high or the data can’t be deduplicated.
  2. They’re used as primary repositories, and with the first restore you notice that you can’t achieve your recovery time objective. Over time the random read performance decreases, so you should always use a fast performance tier.
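To put rough numbers on point 1, here’s a quick back-of-the-envelope sketch in Python. All figures (the 20 TB daily backup, 100 TB usable capacity, and the dedupe ratios) are invented for illustration, not from any real sizing.

```python
# Illustrative only: how an optimistic dedupe ratio eats into planned retention.
daily_backup_tb        = 20.0    # assumed logical size of a daily backup, in TB
appliance_usable_tb    = 100.0   # assumed usable capacity of the appliance
planned_retention_days = 30

for label, dedupe_ratio in [("promised 10:1", 10.0), ("achieved 4:1", 4.0)]:
    stored_per_day_tb = daily_backup_tb / dedupe_ratio
    days_until_full   = appliance_usable_tb / stored_per_day_tb
    print(f"{label}: ~{days_until_full:.0f} days of retention (planned: {planned_retention_days})")

# promised 10:1: ~50 days of retention (planned: 30)
# achieved 4:1: ~20 days of retention (planned: 30)
```

With the optimistic ratio the appliance comfortably covers the planned retention; with the ratio actually achieved it fills up well before it.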
Userlevel 7
Badge +12

Having a discussion as to why a Dedupe appliance is not a good design and primary target for backups! Then having my fears come true trying to restore a SQL server for Production, which took 3 days. 😮🤐

Oh I didn’t see your post while writing mine @Chris.Childerhose ...looks like we’ve had the same experience here 😅

Userlevel 7
Badge +20

Having a discussion as to why a Dedupe appliance is not a good design and primary target for backups! Then having my fears come true trying to restore a SQL server for Production, which took 3 days. 😮🤐

Oh I didn’t see your post while writing mine @Chris.Childerhose ...looks like we’ve had the same experience here 😅

Nice to see there are others with this pain. 😂

Userlevel 7
Badge +6

I’m pretty constantly revising my own plans for backup architecture. Moving away from Synology NASes has been helpful, but I was unfortunately down that track for about 3 or 4 years because that was always how we did it; we finally convinced people to start using purpose-built Dell servers with local storage. That said, getting folks to put the VBR backup server at the recovery site instead of the primary site has been an issue in some cases, but that’s getting easier. Using ReFS was usually not an issue, but I have a coworker who has had issues with ReFS in the past and is very afraid of it, and I’ve run into it as well with Microsoft’s various patches that cause ReFS volumes to show as RAW, etc.

Now I’m starting down the road of the correct architecture for Linux native immutable backups. The plan I had put down for a client was that at the primary site there is a purpose-built Dell server with local storage. Great. The NAS they are currently using for primary storage (it will be repurposed as a copy repository at the recovery sites once the actual SAN, which is on backorder until hopefully no later than next month, arrives) will host the data. I was going to attach it as an RDM to a Linux VM, but as others have pointed out, that’s a weakness, since access to the ESXi console would allow access to the Linux VM console… so, back to the drawing board. I have a single host with gobs of disk space that’s going to the recovery site, but it’s going to be an ESXi host acting as the replication target. So... I might have to repurpose a different server to act as a physical Linux box for the copy repo.

Fortunately, I don’t run into many people-based architecture issues, such as with DBAs and app management folks, because I generally have full (or full-ish) control over the environments I support. We’re pretty solidly in the SMB space with our clients, and my team generally does a good job of managing things with some logic on covering all the bases.
