A Severe Error Occurred! 5 Ways to Detect Database Corruption Early

A Severe Error Occurred! 5 Ways to Detect Database Corruption Early

SQL Server database corruption can make you sick

No one likes to think about how to detect database corruption. But, Imagine this: you’re arriving at your job expecting an ordinary day. You begin receiving calls that the main application is returning a weird error when submitting or updating data;

A severe error occurred on the current command.  The results, if any, should be discarded.

You give it a try and receive the same error, then proceed to call the support desk, and the conversation may go something like this.

You: “Hey, I’ve been hearing reports about some issues with the application’s database. Do you have any insights on what might be happening?”

Support: “Sure, the server is online and accessible; all the monitoring dashboards are showing green indicators.”

You: “Great, thanks for checking that. What about the database itself? Is it online and accessible for direct querying outside of the application?”

Support: “Absolutely, I’ve verified that as well. I actually used SQL Server Management Studio to run a few test queries directly on the database. The data is coming back without any errors.”

You: “Hmm, that’s really puzzling. If the server and database seem to be in good shape, then why are users encountering problems? Can you do a specific test for me? Try performing an update on the dbo.Sales table, focusing on the row that’s causing issues in the application.”

Support: “Sure, I’ll give it a shot right now… Oh, that’s strange. I just attempted the update, and I’m getting the same error that users are reporting.”

It’s at this point you realize the database may be corrupted, and you’re beginning to experience abdominal pain (database corruption has a funny way of doing that). Now what?

5 Ways to Detect Database Corruption Early

You’ll likely encounter this situation if you use databases long enough. Corruption is going to happen. The key is to detect database corruption early and to know how best to recover from it. For some great insights on dealing with database corruption, see Paul Randal’s https://www.sqlskills.com/blogs/paul/sqlskills-sql101-dealing-with-sql-server-corruption/. I can’t recommend his post enough. It’s great.

Let’s look at 5 ways to detect database corruption early.

  1. Backups (with checksum)
  2. Integrity Checks
  3. PAGE_VERIFY database option
  4. Test restores
  5. SQL Server Agent alerts

1. Backups (with checksum)

Backups are extremely important in most production environments. Having backups in place and regularly testing restores ensures you can recover from corruption, if needed. Make sure the process meets your RTO / RPO requirements (see here for more details on RTO / RPO and setting SQL Server backups).

Save the backups on a different server via UNC path and possibly in the cloud as a secondary location. Azure Blob Storage is one possible solution.

2. Integrity Checks

Another key method to detect database corruption early is to run database integrity checks often.

DBCC CHECKDB checks the logical and physical integrity of all the objects within the specified database. It performs the following operations; DBCC CHECKALLOC, DBCC CHECKTABLE, DBCC CHECKCATALOG, validates the contents of every indexed view, validates link-level consistency between table metadata and file system directories and files when storing varbinary(max) data in the file system using FILESTREAM, and validates the Service Broker data in the database.

We recommend using Ola Hallengren’s maintenance solution to perform integrity checks. These scripts are freely available and are very well documented. If you don’t already have a maintenance solution, grab Ola’s scripts here.

Don’t have enough time to run DBCC CHECKDB due to database size or limited resources? Brent Ozar has a few methods that can be used to help speed things up. See here for details.

It’s extremely important to ensure you’re running integrity checks on important databases. It’s better to find corruption early vs waiting months or even years before knowing and being unable to recover when needed.

3. PAGE_VERIFY database option

The third way to detect database corruption early is to enable PAGE_VERIFY for all databases that you care about. There are three PAGE_VERIFY database options:

  • NONE – Database page writes won’t generate a CHECKSUM or TORN_PAGE_DETECTION value. SQL Server will not verify a checksum or torn page during a read even if a CHECKSUM or TORN_PAGE_DETECTION value is present in the page header.
  • TORN_PAGE_DETECTION – SQL Server saves a specific 2-bit pattern for each 512-byte sector in the 8-kilobyte (KB) database page and stored in the database page header when the page is written to disk. When the page is read from the disk, the torn bits stored in the page header are compared to the actual page sector information. Unmatched values indicate that only part of the page was written to disk, and error message 824 is reported to the SQL Server error log and Windows event log.
  • CHECKSUM – SQL Server calculates a checksum over the contents of the whole page and stores the value in the page when written to disk. When the page is read from disk, the checksum is recomputed and compared to the checksum value stored in the page header. If the values don’t match, error message 824 is reported to the SQL Server error log and Windows event log. A checksum failure indicates an I/O path problem and requires further investigation.

SQL Server 2000 used a PAGE_VERIFY default setting of TORN_PAGE_DETECTION for all databases. If you’ve migrated databases over the years and are now running on a more recent version of SQL Server, check this setting. Restoring from an older version will not automatically set PAGE_VERIFY to the new default of CHECKSUM.

Verify that all databases are using the CHECKSUM page_verify option. You can use the t-sql statement below to do a quick check. Any that return a page_verify_option of NONE or TORN_PAGE_DETECTION should be evaluated.

USE MASTER
GO
select [name], page_verify_option_desc from sys.databases

4. Test restores

I’d recommend a regularly scheduled restore for your most critical databases. Having backups are great, but only if they can be used to recover a database when the need arises. I’d restore the databases and then run DBCC CHECKDB to ensure no corruption is encountered. Once every 90 days, or more frequently if needed, is where we generally start.

One solution you can use to test backups is the Test-DBALastBackup commandlet found in the dbatools module.

5. SQL Server Agent alerts for I/O errors

SQL Server includes built-in monitoring capabilities. However, you’ll need to configure the agent alerts, operators, and notifications manually. During our SQL Server health check, we’ll check for common alerts that should be enabled. Those include alerts for 823, 824, and 825 errors which could serve as early warning indicators of catastrophic disk I/O errors:

  • Error 823 – SQL Server uses Windows APIs to perform file I/O operations. After performing these I/O operations, SQL Server checks for any error conditions associated with these API calls. If the API calls fail with an Operating System error, then SQL Server reports Error 823. These errors usually indicate a problem with the underlying storage system or hardware or a driver that is in the path of the I/O request. See more on this error here.
  • Error 824 – SQL Server encounters this error message when a logical consistency check fails after reading or writing a database page. The error indicates that Windows reports that the page is successfully read from disk, but SQL Server has discovered something wrong with the page. It’s similar to 823, except Windows didn’t detect the error. There could be a problem with the I/O subsystem such as failing disk drives, firmware problems, faulty device drivers, and so on. See more on this error here.
  • Error 825 – This message indicates that the read operation had to be reissued at least one time, and indicates a major problem with disk hardware. This message does not currently indicate a SQL Server problem, but the disk problem could cause data loss or database corruption if not resolved. See more on this error here.

Worried about database corruption?

Looking for more information about detecting database corruption and what to do about it? Read Paul’s post mentioned earlier. He provides an overview and steps on how you may potentially be able to recover. Our hope is that you never find yourself in a situation where you can’t recover.

Worried that your database is corrupt? Let’s set up a SQL Server health check.

If you’ve already detected database corruption, give us a call. We can help.

 

Leave a Reply

Your email address will not be published. Required fields are marked *