
For those of you who don’t spend much time with, or don’t have much experience of, SQL Server, Brent Ozar is a well-known and respected Microsoft SQL Server consultant. In addition to consultancy and training services, Brent Ozar offers tools such as the First Responder Kit, and a SQL Server monitoring service called SQL ConstantCare.

SQL ConstantCare is a paid-for monitoring service designed to provide a daily report of what’s going on within your SQL Server platform. It collects data and notifies you when it sees genuine room for improvement; when things are good, it stays quiet.

It’s the content of these alerts that I want to discuss today. Brent Ozar recently published a blog post discussing the alerts that most people mute, and I found it really interesting to see the data protection elements in there. So, let’s take a look:


SQL ConstantCare – Sample Size and Quality of Samples


First, some background on the data set Brent Ozar has worked with, to provide context. At the time of writing (21st June 2023), the service was used by just over 400 users, across more than 2,600 SQL Servers. I appreciate there’s a small time delta between Brent Ozar’s blog post and this one, but I don’t believe it’s a significant enough period of time for these numbers to have drifted far from the true figures at the time of Brent Ozar’s original post.

I’d also like to highlight that the percentages are around users that have muted alerts, which implies there are additional non-muted alerts being triggered for these issues, bumping the real percentages higher still; those figures aren’t shared within the blog post.

Now onto the quality of the samples. This tooling’s target audience consists primarily of consultants, database administrators, and software vendors. So, there’s a general assumption that these people know how to work SQL Server better than the average sysadmin, which makes me all the more concerned about the percentages below, and about how many other databases out there are suffering from the same problems.


No Recent Full Backups


Of course, I was going to start by talking about database backups, as I mainly talk about backup & recovery all day long!

Firstly, I appreciate the filtering that Brent Ozar uses on this rule. SQL ConstantCare only reports on databases that are ‘actively online’, in Brent’s words: not recently restored, not offline, etc. So, these are persistently accessible databases.

Next, what does Brent say constitutes a recent backup? A backup in the past week! Realistically, even a test database should be backed up weekly.
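
If you want to run a similar check against your own estate, a rough sketch (my own approximation, not SQL ConstantCare’s actual rule) is to query msdb’s backup history for online databases with no full backup in the last seven days:

-- Sketch: online databases with no full ('D') backup recorded in msdb in the last 7 days.
-- Note: this only sees backups that write history to msdb, so agent/VDI backups that skip it won't appear.
SELECT d.name AS database_name,
       MAX(b.backup_finish_date) AS last_full_backup
FROM sys.databases AS d
LEFT JOIN msdb.dbo.backupset AS b
       ON b.database_name = d.name
      AND b.type = 'D'
WHERE d.state_desc = 'ONLINE'
  AND d.name <> 'tempdb'
GROUP BY d.name
HAVING MAX(b.backup_finish_date) IS NULL
    OR MAX(b.backup_finish_date) < DATEADD(DAY, -7, GETDATE());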

Now, what percentage of users have muted this for at least one database or server? 19%!

It’s easy to think, when surrounded by like-minded and data-protection-conscious peers, that surely ALL databases are being backed up by now? Surely it’s not still seen as an unnecessary task because we have <insert high availability technology overlay here>? I’d be interested to see the breakdown of these servers and their databases to determine just how many of those databases really should be being protected.


Transaction Log File Too Large


There are two criteria that SQL ConstantCare is looking for when checking this rule. The transaction log file must be at least 5GB, and the transaction log file must be larger than the database.
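
As a rough illustration of those two criteria (again, my own approximation rather than SQL ConstantCare’s exact logic), you can compare data and log file sizes per database using sys.master_files:

-- Sketch: databases whose log file(s) are at least 5 GB AND larger than the data file(s).
-- sys.master_files reports size in 8 KB pages, hence /128.0 to convert to MB.
SELECT DB_NAME(mf.database_id) AS database_name,
       SUM(CASE WHEN mf.type_desc = 'ROWS' THEN mf.size ELSE 0 END) / 128.0 AS data_mb,
       SUM(CASE WHEN mf.type_desc = 'LOG'  THEN mf.size ELSE 0 END) / 128.0 AS log_mb
FROM sys.master_files AS mf
GROUP BY mf.database_id
HAVING SUM(CASE WHEN mf.type_desc = 'LOG' THEN mf.size ELSE 0 END) / 128.0 >= 5 * 1024
   AND SUM(CASE WHEN mf.type_desc = 'LOG' THEN mf.size ELSE 0 END)
     > SUM(CASE WHEN mf.type_desc = 'ROWS' THEN mf.size ELSE 0 END);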

There can be multiple reasons for this. If you’re using the full recovery model, SQL Server must see a transaction log backup before the consumed log space can be considered free and subsequently overwritten. If you don’t perform backups in a way that notifies SQL Server that the log has been successfully backed up, so it can consider its logs protected, then SQL Server will continue to grow the transaction log indefinitely, until either a configured maximum size is reached or the disk becomes full.
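
A quick way to see whether this is what’s happening is to ask SQL Server why it can’t reuse the log space; a value of ‘LOG_BACKUP’ means it’s waiting for a transaction log backup. The database name and backup path below are placeholders for illustration:

-- Why can't the log space be reused? LOG_BACKUP = waiting for a log backup.
SELECT name, recovery_model_desc, log_reuse_wait_desc
FROM sys.databases;

-- Illustrative native log backup (placeholder database name and path):
BACKUP LOG [YourDatabase]
TO DISK = N'X:\SQLBackups\YourDatabase_log.trn'
WITH COMPRESSION, CHECKSUM;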

This final point is one I was discussing with a friend earlier this year. They were reviewing a SQL Server with approximately 1TB of consumed space, and over 85% of that drive was transaction logs, due to the lack of an application-aware backup.

Backups aren’t the only reason this can happen. It could be as simple as a templated initial log file size that is too big relative to the database; it could be a temporary backup issue that caused the log to grow further than typically expected, and nobody has noticed or been able to shrink the log file back down; or performance demands may have ruled out the temporary performance impact of shrinking the log file.
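
For completeness, once log backups are flowing again and the log is truncating normally, shrinking the oversized file back down is a one-off operation along these lines (the database name, file name, and target size are hypothetical):

-- Sketch: shrink an oversized log file back to ~8 GB once log reuse is healthy again.
-- Run as a one-off fix, not as a scheduled job.
USE [YourDatabase];
DBCC SHRINKFILE (N'YourDatabase_log', 8192);  -- target size in MB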

Now for the numbers: 22% of SQL ConstantCare users had at least one database or instance on which this alert has been muted.


No Recent Corruption Checks


Okay, I get that your data protection solution doesn’t conduct this on your behalf, but there’s so much to consider here that I felt it was worth expanding on and discussing.

You might disagree with that statement above and say ‘my data protection solution does XYZ data validation checks’, and you’d have a point, but those checks are around the integrity of the backup. You might also point to something like Veeam’s SureBackup tooling to validate that the database came online; that’s NOT what I’m talking about. Brent Ozar is talking about corrupted data within your SQL Server database(s), and that’s what I’m referring to as well.

SQL Server has ‘DBCC’ (Database Console Commands) statements that allow you to perform various maintenance and validation tasks, amongst others, but validation is where this conversation shall remain. And the one command that should be a must for any SQL Server instance is ‘DBCC CHECKDB’. This post won’t be a write-up of all the DBCC CHECKDB functionality; if you want to learn more about the command and its parameters, read Microsoft’s documentation here.
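
For reference, a typical scheduled integrity check looks something like the below (the database name is a placeholder; NO_INFOMSGS simply keeps the output down to actual errors):

-- Validate logical and physical integrity; report every error, suppress informational messages.
DBCC CHECKDB (N'YourDatabase') WITH NO_INFOMSGS, ALL_ERRORMSGS;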

At a high level, however, the purpose of ‘DBCC CHECKDB’ is to validate the logical and physical integrity of all the objects within your database(s). It looks at every table, view, and indexed view, the database catalog, disk space allocation, link-level consistencies, and Service Broker data. The command also has parameters to perform database repairs where it finds consistency issues, and I’ll quote Microsoft directly here:

Warning: The REPAIR_ALLOW_DATA_LOSS option is a supported feature but it may not always be the best option for bringing a database to a physically consistent state. If successful, the REPAIR_ALLOW_DATA_LOSS option may result in some data loss. In fact, it may result in more data lost than if a user were to restore the database from the last known good backup.

DBCC CHECKDB (Transact-SQL) – SQL Server | Microsoft Learn – 21/06/2023

Microsoft’s recommendation, when you find data corruption issues, is to leverage known good backups to recover from the errors. However, if you aren’t using ‘DBCC CHECKDB’ to validate that your databases are good in the first place, how do you know when the corruption occurred? And therefore, how do you know which point to restore data from?
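
SQL Server does record when CHECKDB last completed cleanly, so if you’re unsure whether a database has ever been checked, you can surface that date. The sketch below uses the undocumented (but widely used) DBCC DBINFO output; recent builds also expose it via DATABASEPROPERTYEX, though availability depends on your version:

-- Last clean DBCC CHECKDB, via the (undocumented but widely used) DBCC DBINFO output:
DBCC DBINFO (N'YourDatabase') WITH TABLERESULTS;
-- Look for the row where Field = 'dbi_dbccLastKnownGood'.

-- On newer SQL Server builds this is also exposed directly (assumption: property available on your version):
SELECT DATABASEPROPERTYEX(N'YourDatabase', 'LastGoodCheckDbTime') AS last_good_checkdb;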

SQL ConstantCare reported that 24% of its users had at least one database or instance for which this alert is being muted. Whilst, yes, corruption checks might be offloaded somewhere else (again, SureBackup/DataLabs could be a great candidate for performing this without impacting the production environment), the most likely conclusion is that some databases are being neglected.


Conclusion


Within data protection, we can only protect what we can see, and everyone needs to collaborate to provide an effective backup & recovery process. This includes visibility of applications, integration with applications, communication of application health status, and remediation of issues. If SQL Server’s data is bad by the time your data protection solution sees it, the backups will be bad too.

If you’ve got a SQL Server instance that’s a bit of a ‘black box’, it might be worth using Brent Ozar’s First Responder Kit to see what’s going on, and how healthy it is.

And finally, although this blog post might sound like an advertisement for Brent Ozar’s services, it’s not. Brent Ozar had no idea I was writing this post, and I’ve certainly not received any form of favour or compensation for writing it. It’s just a topic near & dear to my heart, and Brent Ozar’s tools have helped me many times in the field, so I want to spread awareness of them.

I actually may look into this further when I have time, as we have SharePoint SQL Servers which tend to get maxed out occasionally by our vendor who takes care of them. Annoys me to no end.

Thanks for the share Michael.


Definitely check out the First Responder Kit, as it’s a couple of stored procedures with helpful links to explain why something is a problem! It’s free!


Once it highlighted to me that a SQL Server with 10 CPU cores could only use 8: the VM had been provisioned with 2 cores per socket, so 5 sockets, and SQL Server Standard is restricted to 4 CPU sockets, so 2 cores were completely unused! (But licensed 💰💰💰)


I don’t even know what version we’re using, as my Director is the one who created these servers to begin with for our service vendors to utilize. If I had to guess, I’d say Standard though.

Thanks for the info Michael.


Excellent write up! Out of curiosity, I will have to check “SQL ConstantCare” out later using the free trial.


I follow Brent as well with his SQL stuff. Great blog, I hadn’t read it prior to here. Will need to look more into this for sure.


I’m not a DBA, but it’s very interesting to know..thanks @MicoolPaul 

