For those of you who don’t spend much time or have much experience with SQL Server, Brent Ozar is a well-known and respected Microsoft SQL consultant. In addition to consultancy and training services, Brent Ozar offers tools such as the First Responder Kit, and a SQL Server monitoring service called SQL ConstantCare.
SQL ConstantCare is a paid-for monitoring service, designed to provide a daily report of what’s going on within your SQL Server platform, collecting data and notifying you if it sees genuine room for improvement; when things are good, it stays quiet.
It’s the content of these alerts that I want to discuss today. Brent Ozar recently published a blog post discussing the alerts that people most commonly muted, and I found it really interesting to see the elements of data protection in there. So, let’s take a look:
SQL ConstantCare – Sample Size and Quality of Samples
Firstly, some background on the data set Brent Ozar has worked with, to provide context. At the time of writing (21st June 2023), this service was consumed by just over 400 users, across more than 2,600 SQL Servers. I appreciate there’s a small time delta between Brent Ozar’s blog post and this one, but I don’t believe it’s a significant enough period of time for these numbers to drift far from the true figures at the time of Brent Ozar’s original post.
I’d also like to highlight that the percentages relate to users that have muted alerts, which implies there are additional non-muted alerts being triggered for these issues, pushing the percentage points higher still; those figures aren’t shared within the blog post.
Now onto the quality of the samples. This tooling’s target audience primarily consists of consultants, database administrators, and software vendors. So, there’s a general assumption that these people know SQL Server better than the average sysadmin, which makes me concerned about the percentages mentioned above, and about how many databases out there suffer the same problems.
No Recent Full Backups
Of course, I was going to start by talking about database backups, as I mainly talk about backup & recovery all day long!
Firstly, I appreciate the filtering that Brent Ozar uses on this rule. SQL ConstantCare is only reporting on databases that are ‘actively online’ in Brent’s words. Not restored recently, not offline etc. So, these are persistently accessible databases.
Next, what does Brent say constitutes a recent backup? A backup in the past week! Realistically even a test database should be backed up weekly.
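As a sketch, you can run a similar check yourself against msdb’s backup history. This is my own illustrative query, not SQL ConstantCare’s rule; the 7-day threshold and the tempdb exclusion are assumptions you can adjust:

```sql
-- Databases that are online but have no full backup ('D') in the past 7 days.
SELECT d.name,
       MAX(b.backup_finish_date) AS last_full_backup
FROM sys.databases AS d
LEFT JOIN msdb.dbo.backupset AS b
       ON b.database_name = d.name
      AND b.type = 'D'                 -- 'D' = full database backup
WHERE d.name <> 'tempdb'               -- tempdb can't be backed up
  AND d.state_desc = 'ONLINE'
GROUP BY d.name
HAVING MAX(b.backup_finish_date) IS NULL
    OR MAX(b.backup_finish_date) < DATEADD(DAY, -7, SYSDATETIME());
```

Any rows returned are candidates for the conversation this alert is trying to start.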
Now, what percentage of users have muted this for at least one database or server? 19%!
It’s easy to think, when surrounded by like-minded and data protection conscious peers, that surely ALL databases are being backed up by now? Surely it’s not still seen as an unnecessary task because we have <insert high availability technology overlay here>. I’d be interested to see the breakdown of these servers and their databases to determine just how many of those databases really should be protected.
Transaction Log File Too Large
There are two criteria that SQL ConstantCare is looking for when checking this rule. The transaction log file must be at least 5GB, and the transaction log file must be larger than the database.
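For illustration, a rough equivalent of that two-part test can be written against sys.master_files (file sizes there are in 8KB pages, hence the arithmetic; this is my own sketch, not SQL ConstantCare’s implementation):

```sql
-- Databases where the log is >= 5 GB AND larger than the data files.
SELECT DB_NAME(mf.database_id) AS database_name,
       SUM(CASE WHEN mf.type_desc = 'LOG'  THEN CAST(mf.size AS bigint) END) * 8 / 1024 AS log_mb,
       SUM(CASE WHEN mf.type_desc = 'ROWS' THEN CAST(mf.size AS bigint) END) * 8 / 1024 AS data_mb
FROM sys.master_files AS mf
GROUP BY mf.database_id
HAVING SUM(CASE WHEN mf.type_desc = 'LOG' THEN CAST(mf.size AS bigint) END) * 8 / 1024 >= 5120
   AND SUM(CASE WHEN mf.type_desc = 'LOG' THEN CAST(mf.size AS bigint) END)
     > SUM(CASE WHEN mf.type_desc = 'ROWS' THEN CAST(mf.size AS bigint) END);
```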
There can be multiple reasons for this. If you’re using the full recovery model with SQL Server, you must take a log backup before the consumed log space can be considered free and subsequently overwritten. If you don’t perform backups in a way that notifies SQL Server that a successful backup has been taken, and that it can consider its logs protected, then SQL Server will continue to grow the transaction log indefinitely, until a configured maximum size is reached or the disk becomes full.
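SQL Server will actually tell you why it can’t reuse log space. A quick diagnostic, plus the log backup that clears the most common cause (the database name and backup path below are placeholders):

```sql
-- 'LOG_BACKUP' in log_reuse_wait_desc means the log is waiting on a log backup.
SELECT name, recovery_model_desc, log_reuse_wait_desc
FROM sys.databases;

-- In the full recovery model, only a log backup frees log space for reuse:
BACKUP LOG [YourDatabase]                        -- placeholder database name
TO DISK = N'X:\Backups\YourDatabase.trn';        -- placeholder path
```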
This final point is one that I was having a conversation earlier this year with a friend about. They were reviewing a SQL Server with approximately 1TB of consumed space, and over 85% of that drive was transaction logs due to the lack of an application-aware backup.
Backups aren’t the only reason this can happen. It could be as simple as a templated initial log file size that is too big relative to the database; it could be a temporary backup issue that caused the log to grow further than typically expected, which nobody has noticed or been able to shrink back down sufficiently; or performance demands may have prevented taking the temporary performance impact of shrinking the log file.
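Once the root cause is fixed (for example, log backups are running again), the one-off remediation is a shrink. A sketch, with a placeholder database, logical file name, and target size:

```sql
-- Shrink the oversized log back to a sensible size after backups resume.
-- Find the logical log file name first via: SELECT name FROM sys.database_files;
USE [YourDatabase];
DBCC SHRINKFILE (N'YourDatabase_log', 8192);   -- target size in MB (placeholder)
```

Note this is a remediation step, not routine maintenance; regularly shrinking a log that simply regrows just adds churn.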
Now for the numbers: 22% of SQL ConstantCare users had at least one database or instance on which this alert has been muted.
No Recent Corruption Checks
Okay, I get that your data protection solution doesn’t conduct this on your behalf, but there’s so much to consider here I felt it was worth expanding out and discussing.
You might disagree with that statement above and say ‘my data protection solution does XYZ data validation checks’, and you’d have a point, but those checks are around the integrity of the backup. You might also point to something like Veeam’s SureBackup tooling to validate that the database came online; that’s NOT what I’m talking about. Brent Ozar is talking about corrupted data within your SQL Server database(s), and that’s what I’m referring to as well.
SQL Server has ‘DBCC’ (Database Console Commands) statements within its programming language that allow you to perform various maintenance and validation tasks amongst others, but validation is where this conversation shall remain. And the one command that should be a must for any SQL Server instance is ‘DBCC CHECKDB’. This post won’t be a write-up of all the DBCC CHECKDB functionality, instead if you want to learn more about the command and its parameters, read Microsoft’s documentation here.
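As a minimal example, a baseline integrity check looks like this (the database name is a placeholder; the options shown just keep the output focused on actual problems):

```sql
-- Full logical and physical integrity check of one database.
-- NO_INFOMSGS suppresses the informational chatter; ALL_ERRORMSGS shows
-- every error per object rather than truncating the list.
DBCC CHECKDB (N'YourDatabase') WITH NO_INFOMSGS, ALL_ERRORMSGS;
```

No output means a clean pass; any errors returned are your cue to investigate before they age past your backup retention.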
At a high level however, ‘DBCC CHECKDB’s purpose is to validate the logical and physical integrity of all the objects within your database(s). It’s looking at every table, view, catalog, disk space allocation, link-level consistency, Service Broker data, and indexed view. The command also has parameters to perform database repairs where it finds consistency issues, and I’ll quote Microsoft directly here:
Warning: The REPAIR_ALLOW_DATA_LOSS option is a supported feature but it may not always be the best option for bringing a database to a physically consistent state. If successful, the REPAIR_ALLOW_DATA_LOSS option may result in some data loss. In fact, it may result in more data lost than if a user were to restore the database from the last known good backup.
DBCC CHECKDB (Transact-SQL) – SQL Server | Microsoft Learn – 21/06/2023
Microsoft’s recommendation, when you find data corruption issues, is to leverage known good backups to recover from the errors. However, if you aren’t using ‘DBCC CHECKDB’ to validate that your databases are good in the first place, how do you know when the corruption occurred? And therefore, how do you know which point to restore data from?
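On recent versions (SQL Server 2016 SP2 and later, to my understanding), you can ask SQL Server directly when each database last passed a clean CHECKDB, which makes spotting neglected databases straightforward:

```sql
-- Last clean DBCC CHECKDB per database; '1900-01-01' effectively means 'never'.
SELECT name,
       DATABASEPROPERTYEX(name, 'LastGoodCheckDbTime') AS last_good_checkdb
FROM sys.databases;
```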
SQL ConstantCare reported that 24% of its users had at least one database or instance for which this alert is being muted. Whilst, yes, corruption checks might be offloaded elsewhere (again, SureBackup/DataLabs could be a great candidate for performing these without production environment impact), the most likely conclusion is that some databases are being neglected.
Conclusion
Within data protection, we can only protect what we can see, and everyone needs to collaborate to provide an effective backup & recovery process: visibility of applications, integration with applications, communication of application health status, and remediation of issues. If SQL Server’s data is bad by the time your data protection solution sees it, the backups will be bad too.
If you’ve got a SQL Server instance that’s a bit of a ‘black box’, it might be worth using Brent Ozar’s First Responder Kit to see what’s going on, and how healthy it is.
And finally, although this blog post might sound like an advertisement for Brent Ozar’s services, it’s not. Brent Ozar had no idea I was writing this post, and I’ve certainly not received any form of favour or compensation for writing this, it’s just a topic near & dear to my heart, and Brent Ozar’s tools have helped me many times in the field, so I want to spread awareness of them.