One of the great things about working in the SME space is that you get exposure to lots of different technologies. I originally wrote this post for our internal blog and it was well received, so I thought I’d share it publicly.
There’s plenty written online on this topic so I won’t go deep into too many of the areas I discuss here but I thought it would be useful to highlight something which I think is sometimes overlooked. I’m not a storage expert by any measure and this is based on my experience (a lot from a scaled online backup platform that I designed) and the research I’ve done on this subject.
In modern computing (clients and servers) the disk subsystem is an area that can be overlooked which can lead to poor system or application performance. I see this a lot, especially in virtualised environments. Storage design decisions are often made in the early stages of building a customer environment and they’re not always easy to change later on. From what I’ve seen in the field Silverbug is much better at designing these storage solutions than some of our competitors, and with the advent of bigger and bigger drives the problem requires even more attention; more on why that is later.
I’m a big advocate of SSDs in client computers because for what I think is a reasonable cost they can more than double the speed at which you can use your computer. That’s a greater performance gain than you would get from any reasonable memory or processor upgrade. It’s the machine you use 8 hours a day and if you’re fast it means you can be a lot more productive.
SSDs are so much faster than traditional drives. In low memory situations, the OS can thrash the page file on the disk for virtual memory without dramatically impacting system performance. A good example of this is that on my MacBook Air I was able to build 2 separate domains with 2 Lync installations totaling 7 VMs whilst only sacrificing 6GB of RAM to VMware Workstation. A lot of virtual memory was used to support the VMs, but both of the Lync environments and the OS were usable. This is only possible because of the speed of the SSD.
On the flip side of this, what I sometimes see is people over-allocating vRAM to virtual machines to make up for poor disk performance (over-allocating in the sense of more RAM than the VM needs, rather than more RAM than the hypervisor host has). A Lync server doesn’t need 32GB of RAM unless you’re hosting thousands of users for enterprise voice. A Domain Controller for a small client can easily survive on 2GB, and an Exchange server for a company of 50 people should never need more than 20GB.
If you use Microsoft’s stress-testing tools or the resource calculators for Lync or Exchange, you’ll see that the minimum system requirements per user are tiny. If you find yourself having to allocate many gigabytes of memory then there’s probably an underlying issue that needs to be investigated.
Whilst throwing other system resources at the problem may resolve the issue for a time, it’s not really fixing it; it’s simply masking a problem that will probably come back and bite you later. Assigning more memory to VMs so that they can load more data into memory and save I/Os is crazy when you consider the cost of RAM vs disks. Equally, assigning multiple vCPUs to under-performing VMs won’t help you except in certain special circumstances.
There’s more to understanding disk performance than just making an educated guess based on your experience. There are some key principles to understand first. I’m going to discuss Input/Output Operations Per Second (IOPS) here. Some claim that raw throughput needs to be considered too, but this is something I almost never see being an issue.
IOPS - Input/Output Operations Per Second
IOPS is a common performance measurement for storage systems and disks. It is basically the number of times per second a service “touches” a disk. All services at one point or another need to access their disks, whether it’s to load services into memory, retrieve an email for Exchange, or perform some other data retrieval or storage function. Non-SSD disks have moving parts (read/write heads, spinning disks) so there’s a finite number of these random I/Os they can perform in a given time. SSDs, on the other hand, are solid state as the name implies, and can rapidly service requests as there’s no reliance on physically moving a read/write head to an area of the physical disk.
According to Wikipedia (and this is a very rough guide), commonly accepted averages for random I/O operations can be calculated as 1/(seek + latency) = IOPS, where seek is the average seek time and latency is the average rotational latency, both in seconds. There are IOPS penalties for RAID configurations, hypervisor overheads and other factors that should be considered.
As you can see, the faster a drive revolves the more IOPS it can service.
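To make the 1/(seek + latency) formula concrete, here is a minimal Python sketch. The average seek times are my own illustrative assumptions rather than figures from any datasheet, so treat the results as ballpark numbers only. The rotational latency is taken as the time for half a revolution.

```python
def rotational_latency_ms(rpm: float) -> float:
    """Average rotational latency: the time for half a revolution, in ms."""
    return 0.5 * 60_000.0 / rpm


def estimated_iops(rpm: float, avg_seek_ms: float) -> float:
    """Rough random IOPS from 1 / (seek + latency), with both times in seconds."""
    service_time_s = (avg_seek_ms + rotational_latency_ms(rpm)) / 1000.0
    return 1.0 / service_time_s


# ASSUMPTION: illustrative average seek times in ms, not vendor figures.
ASSUMED_DRIVES = {
    "7.2K RPM": (7_200, 9.0),
    "10K RPM": (10_000, 4.5),
    "15K RPM": (15_000, 3.5),
}

if __name__ == "__main__":
    for name, (rpm, seek_ms) in ASSUMED_DRIVES.items():
        print(f"{name}: roughly {estimated_iops(rpm, seek_ms):.0f} IOPS")
```

Swap in the seek time from your own drive's datasheet to get a more realistic estimate, and remember that RAID and hypervisor overheads are not modelled here.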
SSDs are considerably higher (some go up to 10,000,000 IOPS) but vary depending on model. Unless you’re in a high performance scenario or dealing with very large scale storage requirements you don’t need to worry about IOPS when using SSDs. With SSDs you can be very conservative with vRAM allocation allowing an increased density of VMs saving on extra hardware.
You start to get problems when you request more IOPS from a disk or disk system than it can service. When you hit this saturation point the performance of services drops dramatically. In most scenarios where you see these sorts of problems, the services are virtual machines and the services that run on them. In a traditional direct install server a single disk array can easily support an OS and services. With the advent of virtualization, a single server or SAN (Storage Area Network) could be hosting many virtual machines.
Consider a server with 96GB RAM. You could easily host 15 VMs on the hardware, allocating 4GB vRAM to each VM. Processor speed is rarely the constraining resource in virtualised environments these days. Each VM can vary greatly, but having seen some performance issues because of oversubscribed arrays I use a guide of 40 IOPS per VM. Now, this is a very simple measure and there are tools online, as well as within Windows and through hypervisors, that you can use to accurately measure how many IOPS a server is generating. A few I checked for this post:
Exchange server hosting around 100 users with mailbox and online archive databases totalling 2TB. The Exchange server generates 300-400 IOPS.
Lync front end server servicing < 40 users averages less than 10 IOPS.
Services that are constantly reading from the disk such as SQL, Exchange and file servers, generate considerably more IOPS than those hosting services such as AD, VPN, Lync, DHCP, DNS, TMG etc.
Back to our example. This is an oversimplification but it gives you an idea of how it works.
15 VMs at 40 IOPS = 600 IOPS.
Using the table above we’re going to need 6 * 10K RPM SAS drives to support the VM IOPS requirements. With the advent of larger and larger disks I sometimes see capacity rather than IO performance being the deciding factor in these situations. You may think you could stick in 2 * 900GB 10K SAS disks, but you would really suffer with performance issues. You’re always better going for more disks of a smaller capacity than fewer disks at a higher capacity. I’ll explain why. Double the number of disks and, roughly speaking, you double your IOPS capacity. Having a smaller number of large disks gives you the same storage capacity but it can service fewer IOPS.
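The sizing arithmetic in this example can be sketched in a few lines of Python. The figure of 100 IOPS per 10K disk is an assumption I have inferred from the numbers in this post (600 IOPS across 6 drives), and the sketch ignores RAID penalties and hypervisor overheads.

```python
import math

# ASSUMPTION: per-disk figure inferred from this post (600 IOPS / 6 drives).
ASSUMED_IOPS_PER_10K_DISK = 100


def required_iops(vm_count: int, iops_per_vm: int) -> int:
    """Total IOPS the VMs are expected to generate."""
    return vm_count * iops_per_vm


def disks_needed(total_iops: int, iops_per_disk: int) -> int:
    """Smallest number of disks whose combined IOPS covers the requirement."""
    return math.ceil(total_iops / iops_per_disk)


def iops_shortfall(total_iops: int, disk_count: int, iops_per_disk: int) -> int:
    """How many IOPS the array falls short by (0 if it can keep up)."""
    return max(0, total_iops - disk_count * iops_per_disk)


if __name__ == "__main__":
    total = required_iops(vm_count=15, iops_per_vm=40)
    print("Required IOPS:", total)
    print("Disks needed:", disks_needed(total, ASSUMED_IOPS_PER_10K_DISK))
    print("Shortfall with 2 disks:",
          iops_shortfall(total, 2, ASSUMED_IOPS_PER_10K_DISK))
```

Under these assumptions the two large disks leave the array well short of what the VMs ask for, which is the point about capacity not being the deciding factor.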
I’ve seen certain LUNs in production environments that are oversubscribed and the performance is really, really bad. Once you start to generate more IOPS than you can service it can be a tricky situation. The only ways to improve things are to reduce IOPS by decommissioning or consolidating VMs, or to add more disks. The latter is sometimes impossible if an existing chassis is full, and if this happens then your upgrade options can suddenly get very expensive.
The key things I look for are disk queue length and disk latency. Both can be quickly checked in Windows and VMware. Disk queue length shouldn’t really ever go over 3 in Windows for a prolonged period of time. Disk latency over 100ms can cause you problems.
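As a rough illustration of those two checks, here is a small Python sketch that flags a series of disk samples. The DiskSample type, the function names and the idea of treating 3 consecutive samples as a "prolonged period" are all my own assumptions for the example. It doesn't collect anything itself, so you would feed it numbers you have already gathered from Windows or VMware.

```python
from dataclasses import dataclass
from typing import Iterable, List

QUEUE_LENGTH_LIMIT = 3      # from the rule of thumb above
LATENCY_LIMIT_MS = 100.0    # from the rule of thumb above


@dataclass
class DiskSample:
    """One hypothetical reading of disk queue length and latency."""
    queue_length: float
    latency_ms: float


def longest_run(flags: Iterable[bool]) -> int:
    """Length of the longest unbroken run of True values."""
    longest = current = 0
    for flag in flags:
        current = current + 1 if flag else 0
        longest = max(longest, current)
    return longest


def disk_warnings(samples: List[DiskSample], sustained_samples: int = 3) -> List[str]:
    """Return warnings for sustained queueing or any high-latency sample.

    ASSUMPTION: 'prolonged' means `sustained_samples` consecutive readings.
    """
    warnings = []
    over_queue = (s.queue_length > QUEUE_LENGTH_LIMIT for s in samples)
    if longest_run(over_queue) >= sustained_samples:
        warnings.append("queue length above 3 for a sustained period")
    if any(s.latency_ms > LATENCY_LIMIT_MS for s in samples):
        warnings.append("latency above 100 ms")
    return warnings
```

A single spike in queue length is ignored on purpose, since it's the sustained queueing that points to an oversubscribed array.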