Analyzing Disk Latency in Azure Virtual Machines – Part 1
If you’ve moved your SQL Server workload to Azure Virtual Machines, you’ve noticed there’s no shortage of options: virtual machine sizes, disk types, caching vs. no caching, support for bursting, and more.
When hosting SQL Server on Azure Virtual Machines, choosing the right combination of VM size and disk types is essential from both a cost and performance perspective.
However, workloads change over time and may require resizing the Azure VM or disks. For example, perhaps you deployed a new application that utilizes SQL Server on an Azure Virtual Machine. At first, usage was low, but now the application has become critical to the organization, and usage has increased significantly. Disk-related waits in SQL Server have increased, end users have started to complain, and you are wondering how to address the latency.
Tracking down disk performance issues can be tricky. There are multiple layers to consider. Is it the virtual machine or the disk(s) causing the bottleneck?
In Azure, both disks and VMs have caps on the throughput and IOPS they can deliver. When looking into disk performance issues, it’s essential to account for the following limits at both the VM level and the disk level, as each has its own caps:
- VM – Max cached and temp storage throughput (IOPS/MBps). Not all VMs support premium storage caching.
- VM – Max uncached disk throughput (IOPS/MBps)
- VM – Max data disks
- Disk – Max disk size
- Disk – Max throughput (MB/s)
- Disk – Max IOPS
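As a rough illustration of how the two levels interact, the sketch below computes the uncached ceiling for a VM and its attached disks as the smaller of the VM cap and the combined disk caps. The `Limit` class and `effective_uncached_limit` function are hypothetical names, and the VM figures are illustrative rather than taken from a specific size's documentation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Limit:
    iops: int
    mbps: int


def effective_uncached_limit(vm: Limit, disks: list[Limit]) -> Limit:
    """Ceiling for uncached IO: the smaller of the VM cap and the summed disk caps."""
    return Limit(
        iops=min(vm.iops, sum(d.iops for d in disks)),
        mbps=min(vm.mbps, sum(d.mbps for d in disks)),
    )


# Illustrative numbers only: a VM capped at 12,800 IOPS / 192 MBps uncached,
# with four P30-class disks (5,000 IOPS each; 200 MBps each is assumed here).
vm_cap = Limit(iops=12_800, mbps=192)
p30 = Limit(iops=5_000, mbps=200)

ceiling = effective_uncached_limit(vm_cap, [p30] * 4)
print(ceiling)  # Limit(iops=12800, mbps=192)
```

With these numbers the four disks could deliver 20,000 IOPS between them, but the VM cap is lower, so adding more disks would not raise the ceiling.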
To understand the VM-level caps mentioned above, let’s review the three paths an IO can take: cached, uncached, and local/temp disk.
Azure Virtual Machine Disk IO Paths
Cached
Purpose: Host caching in Azure is designed to improve performance by storing frequently accessed data closer to the VM. This caching mechanism enhances read operations and can significantly reduce latency.
Types of Operations: It’s typically used for read-heavy workloads. You can configure host caching to be read-only or read/write (not recommended for disks hosting transaction log files). Read-only caching is ideal for workloads that predominantly involve read operations, while read/write caching is suitable for a balance of read and write operations.
Performance: Cached I/O allows for higher IOPS and throughput as it utilizes the VM’s cache, which is faster than accessing data directly from the disk.
Limitations: The amount of storage available for host caching is limited and specified in the VM’s documentation. Also, cached I/O counts towards the VM’s cached limits.
Data Integrity: For read/write caching, writes are initially written to the cache and later written to the disk, which can be a concern for workloads requiring immediate persistence on disk.
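To make the caching choice concrete, here is a minimal sketch of a lookup that suggests a host caching setting per SQL Server file role. It assumes the commonly cited guidance of read-only caching for data file disks and no caching for transaction log disks. The `suggest_host_caching` helper and the role names are hypothetical and not part of any Azure API.

```python
# Host caching values as Azure names them: "None", "ReadOnly", "ReadWrite".
# This mapping is a hypothetical starting point, not an Azure API.
RECOMMENDED_CACHING = {
    "data": "ReadOnly",  # read-heavy data files benefit from the host cache
    "log": "None",       # log writes need to reach the disk, so skip the cache
}


def suggest_host_caching(file_role: str) -> str:
    """Return a starting-point host caching setting for a SQL Server file role."""
    try:
        return RECOMMENDED_CACHING[file_role]
    except KeyError:
        raise ValueError(f"no suggestion for file role {file_role!r}") from None


for role in ("data", "log"):
    print(role, "->", suggest_host_caching(role))
```

Roles outside the mapping raise an error on purpose, since the right setting for other files depends on the workload.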
Uncached
Purpose: Uncached disk I/O involves direct interaction with the disk without the intermediate caching layer. It is used for less frequently accessed data or when the latest data is required.
Types of Operations: Suitable for workloads that involve a significant amount of write operations or where data integrity and immediate persistence are crucial.
Performance: Generally, uncached disk I/O has lower performance compared to cached I/O due to the absence of the caching layer. The IOPS and throughput are limited by the disk’s and VM’s capabilities.
Limitations: Performance is constrained by the disk type, the disk size, and the VM size. For example, a P30 disk can handle up to 5,000 IOPS.
Data Integrity: Since all operations are directly on the disk, there is immediate persistence of data, which is critical for certain applications and data security protocols.
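A disk has both an IOPS cap and a throughput cap, and which one you reach first depends on IO size, since throughput is roughly IOPS multiplied by IO size. The sketch below works through that arithmetic for the P30 example above. The 200 MB/s throughput figure is an assumption added for illustration, `first_limit_hit` is a hypothetical helper, and 1 MB is treated as 1,024 KiB for simplicity.

```python
def first_limit_hit(io_size_kib: float, max_iops: int, max_mbps: float) -> tuple[str, float]:
    """Return which disk cap is reached first and the IOPS achievable at that IO size."""
    # IOPS the throughput cap allows at this IO size (1 MB treated as 1,024 KiB).
    iops_allowed_by_throughput = max_mbps * 1024 / io_size_kib
    if iops_allowed_by_throughput < max_iops:
        return "throughput", iops_allowed_by_throughput
    return "iops", float(max_iops)


# P30-class caps: 5,000 IOPS from the text; 200 MB/s is assumed.
for size in (8, 64):
    limit, iops = first_limit_hit(size, max_iops=5_000, max_mbps=200)
    print(f"{size} KiB IOs: {limit} cap reached first at about {iops:,.0f} IOPS")
```

Under these assumptions, 8 KiB IOs reach the 5,000 IOPS cap first, while 64 KiB IOs reach the throughput cap at about 3,200 IOPS. A disk can therefore be throttled well below its advertised IOPS when IOs are large.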
Local/Temp disk
Purpose: Temporary disks provide short-term storage for applications and processes running on the VM. They are primarily used to store data that doesn’t need to be persistent, such as swap files, system caches, page files, or temporary data files (tempdb for example).
Performance: Temporary disks are typically faster than standard storage disks because they are located on the same physical machine as the VM. The performance of the temporary disk is tied to the VM size and offers high I/O throughput and low latency, making them ideal for temporary workloads and caching.
Data Persistence and Reliability: Data on the temporary disk is volatile. This means it persists only during the lifetime of the VM instance. If the VM is stopped or deallocated (not just restarted), data on the temporary disk is likely to be lost. Data on the temporary disk can also be lost during some maintenance events or when the VM is moved to different host hardware.
Therefore, temporary disks should not be used for any data you need to keep.
To learn more about how Azure high-scale VMs utilize disk caching, see the Azure documentation on virtual machine and disk performance.
Azure VM Size Limitations
Each Azure Virtual Machine size has the following properties defined (some VM sizes include additional properties such as burstable credits and cached throughput):
- vCPU – Number of virtual CPUs
- Memory GiB
- Temp Storage
- Max Data Disks – Number of data disks that can be attached.
- Max Network Bandwidth (Gbps)
- Max NICs – Max number of network interface cards.
- Max uncached disk throughput (IOPS/MBps) – The combined total across all attached disks. For example, if one disk uses all of the available IOPS, the remaining disks will encounter latency.
We’ll focus on the disk-related properties throughout this series.
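Because the uncached limit described above is shared by every attached disk, it can help to think of it as a single budget. The following sketch, using a hypothetical `remaining_vm_iops` helper and made-up usage numbers, shows how one busy disk leaves little headroom for the others.

```python
def remaining_vm_iops(vm_uncached_iops: int, iops_in_use: dict[str, int]) -> int:
    """IOPS left in the VM's shared uncached budget after current usage."""
    return max(vm_uncached_iops - sum(iops_in_use.values()), 0)


# Hypothetical snapshot: the data disk is consuming most of a 12,800 IOPS VM budget.
usage = {"data": 12_000, "log": 300}
headroom = remaining_vm_iops(12_800, usage)
print(f"Headroom for all disks combined: {headroom} IOPS")
```

Here only 500 IOPS remain for every disk on the VM, so a burst on the log disk would be throttled at the VM level even though the log disk itself is far from its own cap.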
Conclusion
In part 2, we’ll review an Azure VM experiencing high disk latency. We’ll use Azure VM Metrics to help identify where the bottleneck may be and how to alleviate the issue.
In the meantime, here are a few other articles that may be helpful.
- Azure shared disks – Failover Clustered Instance – SQL Server 2016
- Azure shared disks – Failover Clustered Instances
- How to Create SQL Server 2019 Failover Clustered Instances in Azure
- Identifying SQL Server Disk Latency
- 10 Data Storage Considerations for Growing Companies
If you’d like some assistance assessing your SQL Server workload on an Azure Virtual Machine, reach out. We’re happy to help.