By Jim Hannan (@HoBHannan)
I was working on some documentation for a client and started thinking back on how long we have been using IOMeter — my best guess is since 2007. It is a great tool. You could argue that there are newer, better suited tools for database I/O testing available like:
- Oracle ORION
Nevertheless, I continue to use and recommend IOMeter. I think it is because of its approachability and simplicity. In this blog I will cover the usage and history of IOMeter, and how to configure and run it.
The History of the Open Unofficial Storage Thread
Back in March 2009, a group of vSphere administrators got together and created a standard set of tests to run on their storage units. The thread is still active (although you are redirected to a new thread branch) here.
The purpose of the thread was to compare and test storage arrays before virtualizing workloads. The results were uploaded as CSV files or pasted into the forum to compare against other storage units. It essentially became a database of comparison data. The thread lives on today with hundreds, maybe thousands, of results. You can often find your exact model and compare results. Obviously the model is not the only determining factor in throughput, but it gives administrators a really good way to determine whether performance is where it should be.
IOMeter Only on Windows
All of the tests in the Open unofficial storage thread are conducted using a tool called IOMeter. IOMeter is licensed under the GPL (GNU General Public License) and was written and distributed by Intel back in 1998. The primary application runs on Windows — there is a Linux binary, but it was written for an older Linux kernel that can do only synchronous I/O, not the faster, more efficient asynchronous mechanism of today. For those of you unfamiliar with the difference between the sync and async models, it is dramatic. Think of your I/O as water dripping out of a faucet; now compare it to a faucet running full blast. For this reason the only approachable solution is to run it on Windows, which for many customers makes it less attractive for I/O testing. That being said, I am still an advocate for running IOMeter in Linux-based shops. Remember that you are testing for I/O throughput, not the OS. Something else I should tell you: NTFS v5 is faster than Linux ext3.
Since 2007, HoB has run thousands of tests using IOMeter for customers and recorded the results. Below I have included the steps for setting up IOMeter on Windows Server 2008.
To install IOMeter, double-click on the executable and follow the prompts. After the installation is finished, click on the Iometer icon to start the GUI and complete the basic configuration.
To run IOMeter on Windows Server 2008
- Right-click IOMeter and select Run as administrator to run IOMeter with administrative privileges.
- In the User Account Control window, select Allow.
- After IOMeter starts, a window titled C:\Program Files (x86)\I… opened with Iometer is displayed. This process drives the I/O and file creation. Do not close it.
- Select Open and go to the OpenPerformanceTest.icf configuration file location. The default configuration file is a standard agreed upon by a group of users who test VMware I/O performance. You can modify the file, but the default configuration file is often a good starting point for testing. The file and user forum are available at http://communities.vmware.com/thread/73745.
- Select Worker 1, and then select the drive. This creates the test output file if it does not already exist. Iometer creates a 4GB test output file by default.
- Under the Access Specifications tab, select a test and click Add. This example selects Max Throughput - 100% Read.
- Under the Test Setup tab, confirm that the settings are correct. The default values might be OK. You can modify the length of the test in the Run Time section.
- For a write test, select the Max Throughput - 100% Read test, select Edit Copy, and change the Percent Read/Write Distribution slider to 100% Write.
For a read test, select Max Throughput - 100% Read. This test usually exposes any problems with the storage configuration, ESX/ESXi host, HBAs, or drivers.
- Under the Results Display tab, change the Update Frequency to 4 and select Last Update. Click Run Test (the green flag icon).
IOMeter asks for a filename to save the results. The results are written to a .CSV file with the data from the test. After the results file is specified, the test begins.
IOMeter first prepares a file for use by the test. If the defaults are used from the .icf configuration file, the file is 4GB and is named iobw.tst. The first time IOMeter runs it generates this file, so the actual test is delayed until the file is ready. Subsequent tests on the same drive use the existing file, so a new file is not generated.
The iobw.tst file must be manually cleaned up after testing is complete.
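As a side note, when comparing your numbers against the thread it helps to keep in mind the simple arithmetic linking IOPS, transfer size, and throughput. Here is a minimal sketch of that conversion in Python; the IOPS figure and the 32KB transfer size are hypothetical placeholders, so substitute the values from your own results file and access specification.

```python
def mbps_from_iops(iops, transfer_size_kb):
    """Convert an IOPS figure to approximate MB/s for a given transfer size."""
    return iops * transfer_size_kb / 1024.0

if __name__ == "__main__":
    observed_iops = 3200      # hypothetical value read from a results .csv
    transfer_size_kb = 32     # hypothetical access specification transfer size
    print("~%.1f MB/s" % mbps_from_iops(observed_iops, transfer_size_kb))
```

This is only a sanity check for comparing runs that used different transfer sizes; IOMeter itself reports both IOPS and MB/s in the results file.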
At HoB we really encourage customers to adopt a standard set of benchmarking tests. There is so much to gain by doing so. A big part of benchmarking is testing I/O — and IOMeter is a good tool to assist with that. If you have any questions or comments, let me know @HoBHannan.
By Dave Welch (@OraVBCA)
The most impressive session I have taken in from the last two VMworlds was 2011 US session BCO2874 “vSphere High Availability 5.0 and SMP Fault Tolerance – Technical Overview and Roadmap”. This session was reprised at VMworld 2012 as session INF-BCO2655 “VMware vSphere Fault Tolerance for Multiprocessor Virtual Machines—Technical Preview and Best Practices”. Don’t miss these sessions’ live demo of a 4 vCPU workload failover.
I’ve been watching Fault Tolerance (FT) eagerly since Mendel Rosenblum did a live demo of FT alpha code at VMworld 2007. Alas, FT’s 2009 GA release has been little more than bait for future capability to House of Brick customers as none of those customers’ Tier-1 workloads fit inside FT’s current single vCPU limit.
I wonder if VMware’s executive leadership has any idea what it is sitting on with SMP Fault Tolerance (SMP FT). I’m wondering if SMP FT could turn out to be the most disruptive technology anyone has seen in years. SMP FT certainly threatens a massive disruption to the clustered relational database market. I make that prediction due to two key SMP FT features that Oracle RAC can’t touch: approximately single-second failover, and no client disconnect.
The VMworld 2012 US session included engineering's confession that the alpha code still has very substantial performance latency (they use the word “overhead”, which I am replacing with “latency” for clarity). I got the impression that they're quite worried about the latency, and I wondered whether they consider their current performance numbers fatal to a GA release. Here's a screen shot of their latency numbers (at video minute 35):
The presenters then refer participants to the vSphere 4 FT Performance Whitepaper (“Section 3. Fault Tolerance Performance” beginning on p. 7) for further elaboration on their performance measurement methodology.
I’m betting there are plenty of extremely HA SLA-sensitive workload owners that will be more than happy to tolerate that much latency. For example, look no further than the stock exchanges where extremely distributed transactions must commit all transaction phases within a few seconds to avoid substantial revenue loss let alone regulatory penalties.
A corollary to Moore’s law is Intel executive David House’s prediction that chip performance would double every 18 months. I’ve read clarifications by other parties arguing that chip performance actually doubles every 20 to 24 months. Fine. Let’s call it 24 months. In his 2006 blog post, John Sloan extends that math to roughly a 30-fold performance improvement each decade (five doublings in ten years: 2^5 = 32). (Related to that, my non-scientific mental survey tells me that less than 10% of House of Brick’s performance diagnostic engagements find CPU-constrained workloads.) In light of the fact that we are swimming in a world of CPU performance and it just keeps getting better, how obsessed do I imagine I’d be with SMP FT’s latency if I were the CTO of NYSE or NASDAQ? Not very. I could stand up my SMP FT-protected workload in a hardware refresh and still run circles around any x86 chip vendor's offering of just two hardware depreciation lifecycles ago.
Accordingly, I suggest the SMP FT product and engineering teams gate the product’s dot-zero release based on code stability only. Subsequent to that, I suggest the engineering/product teams prioritize their engineering efforts as follows:
- vCPU horizontal scalability
- Latency (and a distant third at that)
During the 2012 VMworld SMP FT session Q&A, Jim Chow answered a question by saying that SMP FT engineers were still considering the subject of the question in their architectural discussions. That answer gave me the impression that SMP FT’s GA release was probably at least a year away at that time. Despite the wait, I believe enterprises would do well to evaluate SMP FT’s business continuity promise now and get project planning underway accordingly.
By David Klee (@kleegeek)
Welcome to the sixth part in our series of posts regarding the virtualization of your business critical SQL Servers. In this post, I discuss Disaster Recovery of your SQL Servers while running on VMware. I wrap up the post with a discussion of techniques used to help demonstrate the power of virtualization, and the benefits it brings to those individuals who might continue to fear (or not understand) it.
High Availability and Disaster Recovery
To start the discussion, let’s begin with a reminder of the definitions of High Availability (HA) and Disaster Recovery (DR).
High Availability (HA) is a design approach where systems are architected to meet a predetermined level of operational uptime, such as a Service Level Agreement (SLA). This means systems should have appropriate levels of redundancy while still keeping the environment as simple as possible, so as not to introduce more failure points.
Disaster Recovery (DR) is the process of preparing for, and recovering from, a technology infrastructure failure that is critical to a business. The core metrics of DR are the Recovery Point Objective (RPO) and Recovery Time Objective (RTO). The RPO is a measurement of the maximum time window where data could be lost in the event of a disaster. The RTO is the measurement of the maximum time that a core service or system can be offline after a disaster.
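To make the two metrics concrete, here is a minimal sketch of how an actual incident would be measured against RPO and RTO targets. The timestamps and targets below are invented for illustration only; they are not drawn from any real environment.

```python
from datetime import datetime, timedelta

# Hypothetical targets, for illustration only.
rpo_target = timedelta(minutes=15)   # maximum tolerable data-loss window
rto_target = timedelta(hours=2)      # maximum tolerable downtime

# Hypothetical incident timeline.
last_recoverable_data = datetime(2012, 10, 1, 3, 45)  # last transaction safely at the DR site
disaster_struck       = datetime(2012, 10, 1, 4, 0)
service_restored      = datetime(2012, 10, 1, 5, 30)

data_loss_window = disaster_struck - last_recoverable_data   # measured against the RPO
downtime         = service_restored - disaster_struck        # measured against the RTO

print("Data loss window:", data_loss_window, "RPO met:", data_loss_window <= rpo_target)
print("Downtime:        ", downtime, "RTO met:", downtime <= rto_target)
```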
High availability is not disaster recovery, and disaster recovery is not high availability. Do not confuse the two, and do not assume that a solution for one is the same solution for the other.
In my previous post I talked about the options for SQL Server HA on VMware. We now continue this topic with a deeper dive into the DR options for SQL Server on VMware.
Disaster Recovery Discussion Topics
A proper DR configuration must be geographically separated by a reasonable amount of distance. Here in the Midwest, the extreme weather plays a key role in determining DR distances. For example, a tornado can take out a few square miles of land, but whatever it does strike it will probably level. On the other hand, a blizzard could come through in the winter months and knock out power to large parts of the area for days. In that scenario the equipment suffers no real damage, but it is just as unavailable. Both scenarios knock you offline for an indefinite period of time.
Be smart in your placement of your resources. One of the worst examples of this was from a friend whose company had its primary datacenter in Miami. The secondary datacenter was in New Orleans. Does anyone remember Hurricane Katrina? It was about to knock out the primary datacenter, so the company failed over to the New Orleans site. Four days later Katrina destroyed the secondary datacenter, and because the primary was still down, they were left with a datacenter on the third floor of a building with 15 feet of water in the lower floors. The business was down for weeks while off-site backups were gathered, a new datacenter and equipment were allocated, and the business was restored to the new site. It was an awful process.
The RPO and RTO take on more importance when discussing DR strategies. The recovery point and recovery time objectives must be determined before you start the DR planning process, because they drive certain technology choices in the architecture. These numbers can quickly eliminate several options and, sometimes, can help the business rethink the numbers.
One of the major points that few people discuss is the fail-back process. How do you get your data back to the primary datacenter after you have failed over and now have new transactions? What if your downtime requirements are tight?
So, without further ado, here are the methods for achieving DR with SQL Server on VMware.
VMware Site Recovery Manager
VMware Site Recovery Manager (SRM) and its little sister, vSphere Replication, are both means of replicating data to a recovery site for disaster recovery purposes. The recovery time objectives with both of these products are low and, most importantly, the human steps in the failover process are few (or sometimes nonexistent).
SRM also handles failover and failback processes and can perform automatic testing of the replication process. It’s fantastic for those environments that have to routinely demonstrate the DR process without impacting the servers. Multiple failover and failback plans can be defined and carried out as required.
Single instance SQL Server on VMware works great with SRM, as long as you can demonstrate transactional integrity during the replication. Many SAN vendors have SRM guides for properly configuring the environment to work with SRM. Test transactional integrity with your normal (and maximum) workloads before putting these solutions into production.
However, the smallest RPO per VM or LUN replication is 15 minutes, and sometimes longer, depending on the SAN vendor. Keep this in mind. Sometimes, a combination of this strategy and a more real-time replication, such as database mirroring or AlwaysOn, could be used to complement the SRM strategy and reduce the RPO accordingly.
RTO: Very low compared to other solutions. Measured in minutes. Can dictate order in which VMs are powered on.
RPO: No shorter than 15 minutes, which can be problematic for some environments.
Failover Process: Simple, repeatable, and multiple plans can be defined for various situations.
Failback Process: Simple, repeatable, and multiple plans can be defined for various situations.
Pros: Low human intervention means failover process has a lower chance of errors during failover or failback. Can be tested and audited periodically without impacting operations.
Cons: RPO could be higher than business can allow.
vSphere Replication is a new feature of vSphere 5.1 that is included for free with all editions from Essentials Plus and above. Instead of orchestrating SAN-to-SAN LUN replication, this technology handles site-to-site replication of the changed blocks of individual virtual machine disks. Failover is manual, but for a free product, this technology handles replication very well.
As with VMware SRM, the smallest RPO per VM replication is 15 minutes. Keep this in mind; sometimes a combination of this strategy and a more real-time replication, such as database mirroring or AlwaysOn, could be used to complement the replication strategy and reduce the RPO accordingly.
RTO: Very low compared to other solutions. Measured in minutes.
RPO: No shorter than 15 minutes, which can be problematic for some environments.
Failover Process: Simple, manual, and repeatable.
Failback Process: A manual process but is simple and repeatable.
Pros: Included with the core vSphere 5.1 suite at almost all licensing levels, and it is simple to set up DR with commodity hardware and a 15-minute RPO.
Cons: RPO could be higher than business can allow. Failback process is manual.
SQL Server Asynchronous Mirroring
SQL Server database mirroring has been around since SQL Server 2005 Service Pack 1. Asynchronous database mirroring allows near-real-time replication of data between the two nodes. If a failover occurs, the secondary node fires up and picks up where the other left off. However, data might be missing if failover occurs while data is still in transit.
When running in synchronous (high-safety) mode, a witness server can also be used to facilitate automatic failover. With asynchronous mirroring, the decision to promote a secondary server to principal is not automatic.
Also, asynchronous mirroring is only available in the Enterprise edition of the product. Standard edition gets you ‘full safety’ mode, which is real-time synchronous replication only. If the link between the two sites is slow or has high latency, you will feel the performance impact rather quickly. You could even experience significant delays in your production operations because of this impact. Unless the bandwidth AND latency between the two sites can keep up with the transactional rates required by your workload during ALL periods of the day, synchronous mirroring is not recommended as a solution for production DR.
You will also notice that if the bandwidth between the two sites is lower than required, data can stack up waiting to be replicated and the recovery point objective might not be attainable.
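To make that point concrete, here is a rough back-of-the-envelope sketch: if the transaction log generation rate exceeds what the link can move, the send queue grows and so does the effective RPO. All of the figures below are hypothetical.

```python
# Hypothetical peak-hour figures, for illustration only.
log_generation_mb_per_min = 120.0   # how fast the workload generates transaction log
usable_link_mb_per_min    = 90.0    # what the WAN link can realistically move

backlog_growth = log_generation_mb_per_min - usable_link_mb_per_min
if backlog_growth > 0:
    # every minute of peak load adds this much unsent, at-risk data
    print("Send queue grows by about %.0f MB per minute of peak load" % backlog_growth)
else:
    print("The link keeps up; the RPO is governed by latency rather than bandwidth")
```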
RTO: Very quick, as the roles are flipped and the secondary node fires right up.
RPO: Variable, depending on transactional volume and available bandwidth.
Failover Process: Quick and generally without human intervention.
Failback Process: Resynchronize and fail the nodes back to primary. It is very straightforward and simple.
Pros: Quick failover times. Simple and not error-prone failover or failback processes.
Cons: Expensive licensing. Application must support mirroring failover target. Failover is at the database level and not the instance, so applications with multiple databases might require scripting.
SQL Server 2012 AlwaysOn Asynchronous Replication
SQL Server 2012 AlwaysOn is a blend of the virtual IP address and failover capabilities of Microsoft Failover Clustering and an improved derivative of SQL Server database mirroring. Because of the options covered earlier in this series, AlwaysOn can serve as both a DR and an HA solution, depending on the configuration. Again, I stress asynchronous replication here because of the bandwidth and latency logistics of synchronous replication over a WAN.
RTO: Failing a node over to a DR site could be measured in seconds, depending on transactional volume and replication rates.
RPO: Could be as low as a second or less, depending on transactional volume and replication rates.
Failover Process: As simple as it gets. Happens in seconds. The Availability Group moves over and the application reconnects to the same IP address as before.
Failback Process: Even simpler. It’s the same process as failover, and just works!
Pros: Easy to setup and easier to manage. Extremely simple failover and failback processes.
Cons: Potentially expensive licensing. Must go through the upgrade process to SQL Server 2012.
SQL Server Transactional Replication
Transactional replication allows the replication of data and schema changes from one server to one or more other servers. Data can be replicated at various time intervals in a number of ways. One good note about transactional replication is that you can select which tables get replicated. If some tables need to be replicated and others can be ignored, you can do this!
No automatic means for failover currently exist, so failover is a manual process. Data sources must have their targets changed, or for the more technically savvy, DNS aliases would need to be updated to point to the new server. The failover process would need to be documented and heavily tested. No automated fail-back procedure exists, so this process would also need to be planned for and tested heavily as well.
RTO: Varies depending on the complexity of the environment and the applications that connect to the database(s).
RPO: Varies depending on the replication time period to the distributor plus the time to have the subscribers fetch and apply the transactions.
Failover Process: Purely manual process.
Failback Process: Purely manual process.
Pros: Can selectively replicate databases or data subsets to one or more target servers on whatever time period you require.
Cons: Potentially complex and can be cumbersome for failover and failback processes.
SQL Server Log Shipping
Log shipping is a process for backing up, copying, and then restoring database transaction logs from a source database to a standby server. It has been around quite a while, and has been used successfully for DR purposes for years.
RTO: Relatively short, generally 15 minutes or less.
RPO: Varies, based on the configuration, and can be very short.
Failover Process: Purely manual process.
Failback Process: Purely manual process.
Pros: Reliable once configured and established.
Cons: Purely manual failover and failback. Downtime is guaranteed while failover occurs. Application must be reconfigured to point to the new server, or a DNS alias adjusted.
Backup and Restore
If a low RPO is not a requirement (and yes, those environments actually DO exist!), a simple replicated backup and an automated restore could suffice. If a somewhat lower RPO is needed, periodic transaction log backups could be replicated as well.
RTO: Varies. Many factors affect this figure: size of the database, speed of the servers and storage at the DR site, and the state of the replicated backup files, just to name a few.
RPO: Varies, and is probably poor. It could be as long as the time between backups plus the time to replicate the backup file(s) to the DR location.
Failover Process: Restore the last known good and successfully replicated backup. If they exist, restore any replicated post-backup transaction log backups.
Failback Process: Create a new backup and restore at the primary site.
Pros: Least complex and simple to manage.
Cons: Long RPO and possibly RTO. Manual process.
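Before looking at what did not make the list, here is a rough sketch that pulls the RPO characteristics of the options above into a simple decision helper. The thresholds are simplifications of the discussion in this post, not hard rules, and should only be a starting point for your own planning.

```python
def candidate_dr_options(rpo_minutes):
    """Return the DR options from this post that roughly fit a target RPO."""
    options = []
    if rpo_minutes >= 15:
        # SRM and vSphere Replication bottom out at a 15-minute RPO
        options += ["VMware SRM", "vSphere Replication"]
    else:
        # tighter targets call for near-real-time database-level replication
        options += ["Asynchronous database mirroring", "AlwaysOn asynchronous replication"]
    if rpo_minutes >= 60:
        options.append("Backup and restore")
    # transactional replication and log shipping vary too much with
    # configuration to pin to a single threshold; evaluate them case by case
    return options

print(candidate_dr_options(5))     # tight RPO
print(candidate_dr_options(240))   # relaxed RPO
```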
Why isn’t Clustering in this List?
Clustering is not a solid DR solution. Why not? Think about the following points.
- Shared storage is available at a single location, practically speaking. Extending a SAN’s high speed interconnects across a WAN is impractical.
- Your one shared-storage SAN is still a single point of failure. I don’t care what any SAN vendor says. It can still fail. It still needs maintenance. It still has a power cord that can be tripped over. Etc. etc.
Now ask that question again. Clustering is for high availability, not disaster recovery.
Now What? Demonstrating that Virtualization Helps
So now what? You now have a solid understanding of each component in the physical and virtual stacks. You know the methods for profiling the existing physical servers and building the appropriately sized virtual equivalents. You know about HA and DR strategies.
But what is the hardest part about virtualizing business-critical servers?
It’s the environment.
Resistance comes in many forms. It can come from employees who were burned by a failed virtualization attempt. It can come from skeptics (my favorite challenge). It can come from an organization that is naturally resistant to change. It can come from those with no frame of reference who fear adding another black-box layer to their environments.
So what do you do?
- You educate.
- You demonstrate.
- You amplify.
- You support.
Educate the stack and application owners and stakeholders. Identify and educate the organizational advocates that exist in any group of people. Educate a team on what the other teams expect from them. Educate a team on what advantages virtualization brings to them specifically. Educate them on the technologies and how to identify the key performance metrics that matter to them.
Education will make believers of most people.
Next, set up a virtualization proof of concept on similar, if not equivalent, hardware to what you run in production. Identify your strangest, largest, or most complicated servers, and then add the servers you fear. You know, those servers that have been cranky whenever they were changed, so they sit isolated and no one touches them. Clone the databases and workloads onto new, properly sized virtual machines. Simulate production-like workloads and measure all of the key performance metrics you identified when you profiled your physical servers. I know that if you followed the steps outlined in the previous posts, you will find that the performance of those virtualized workloads at least matches the physical equivalent. The technology has caught up with the business need and is now functionally transparent.
Now that you have objectively demonstrated equivalent performance in the virtualized environment, run up and down every hallway at your organization shouting ‘Virtualization works!’ No, wait. That might be counter-productive in the long run. Instead, demonstrate the performance equivalence to everyone in the organization who needs to believe in the technology. Demonstrate core vSphere functionality, such as vMotion, snapshots, or resource changes, and watch every response become ‘Wow!’
Make the right people believers by letting them see it for themselves. These people can become your biggest virtualization advocates.
Finally, support everyone through the production migration process. Sure, there might be hiccups along the way, but the hardest part is done. Once the virtualization of your business-critical applications is complete, the shift sticks: all new servers are virtualized up-front, and discussions happen only when a workload is suspected to be unsuitable for virtualization (fringe cases do exist, but they are rare). If performance issues arise, virtualization will no longer be everyone’s immediate scapegoat.
You have succeeded.
Special thanks go to my coworkers at House of Brick. We all fight the virtualization fight and enjoy the challenge. I believe we have developed the best business-critical workload virtualization team in the world, and I thank you all for constantly pushing yourselves and each other to stay in front of this groundbreaking shift in IT’s core.
We all believe in the technology, and I hope it shows in our eagerness and determination to progress as the technologies continue to evolve. After reading this blog post series, I hope that you believe in the technology as much as we do. We will continue to post new and exciting technologies to this blog, all of which are furthering the quest for performance for your data. Stay tuned!
By Jim Hannan
My next few blog posts will feature VMFS, In-guest, and RDM storage presentations. This will be a comparison of the different methods, how they differ, and what tooling options are available. In the final post I will review some Use Cases we have implemented.
VMFS vs. RDM Part I
Storage is one of the most important components when providing infrastructure to business critical applications--databases in particular. Most DBAs and administrators are well aware of this. In conversations with our customers, we are asked a common question during the design stage:
“Should I use VMFS, RDM or In-Guest storage?”
The question refers to storage presentation types. Storage presentation is the methodology used to present storage to your workloads. The storage options available to you are:
- VMFS (VMDK files on a VMFS datastore)
- RDM in physical compatibility mode (RDM-P)
- RDM in virtual compatibility mode (RDM-V)
- In-guest storage (storage mounted directly by the guest OS, for example over iSCSI or NFS)
I like this question because it leads into important conversations far beyond just performance. You are also deciding on tooling, and the choice of tooling will determine what options you have available. Are you going to use SAN-level snapshots? Will the database files exist on volumes that can use SAN tools for cloning and backups? Will you be using VMware SRM (Site Recovery Manager), Fault Tolerance, vCloud Director, or VMware Data Director? This should all be considered when selecting the storage presentation type.
Before we get into tooling considerations, we should talk about performance. In my experience, the storage presentations mentioned above all offer Tier 1 performance. Implementation is key in determining whether a storage type offers good performance, and understanding your storage vendor’s best practices is fundamental to getting it. For example, iSCSI and NFS are optimized to use jumbo frames. If you skip the procedure of enabling jumbo frames, your storage may run at half the throughput it is capable of.
What about RAID level? It is an interesting time in the world of storage. MetaLUNs and SSD caching have changed the way we approach storage compared to 5 years ago, when high-end storage was always built on RAID 10 with separation of “hot” files onto separate LUNs. For the reasons mentioned above, we at HoB stay away from the generic recommendation that your (business critical) database must live on RAID 10.
The graphic below is something that we often run into when helping customers implement storage infrastructure. We call it our Apples and Oranges Comparison. Within 5 seconds of showing this slide we typically see storage and vSphere administrators’ heads nodding up and down in agreement.
The comparison depicts a traditional physical environment with dedicated storage that is not shared. Conversely, the VMware environment is set up to share storage among many virtual machines. The fundamental difference that we find is that the vSphere environment is built exclusively with consolidation in mind, and this is flawed. Do not misunderstand our message -- building for heavy consolidation is a great way to build a vSphere cluster or clusters. But for your Tier 1 workloads you will need to think differently; the same procedures and consolidation ratios will not always meet business critical application SLAs.
We encourage customers to think about two criteria when making their decision: performance (previously discussed) and tooling. For tooling, VMFS offers the maximum number of options; there really is no limitation on vSphere tooling or its complementary products like VMware SRM or vCloud Director. The other storage presentations, RDMs and in-guest storage, do limit your VMware tooling options. An example of this is VMware Fault Tolerance, which is currently supported only with VMFS. Below is a diagram Dave Welch, our CTO and Chief Evangelist, put together. I like this slide because it simply displays the four primary storage presentation types. The top of the slide starts with our preferred storage type and finishes with our least recommended storage solution.
IMPORTANT: This does not mean that one of the storage presentation types is the only best practice, or that one is simply wrong for your IT organization. In fact, we have assisted with implementations that used each of the storage types. Our general rule is that we prefer VMFS over all other storage options until inputs lead us elsewhere.
It should be noted that there are two flavors of RDMs (physical and virtual). I will discuss the specifics of each later in the blog.
You are probably wondering what the difference is between RDM-P and RDM-V, and what In-Guest storage is. First let us look at RDMs. Part II will explore In-Guest storage, and Part III of this series will discuss use cases for the different storage presentation types.
RDM-P and RDM-V
You can configure RDMs in virtual compatibility mode (or RDM-V), or physical compatibility mode (or RDM-P).
RDM-V specifies full virtualization of the mapped device. In RDM-V mode the hypervisor is responsible for SCSI commands. With the hypervisor as the operator for SCSI commands, a larger set of virtualization tooling is available.
In RDM-P the SCSI commands pass directly through the hypervisor unaltered. This allows for SAN management tooling of the LUN. The cost of RDM-P is losing vSphere snapshot-based tooling.
Note: The exception to this is the REPORT LUN command. REPORT LUN is used for LUN ownership isolation. For more information, see the online vSphere Storage documentation under the topic RDM Virtual and Physical Compatibility Modes.
A common implementation for use of RDM-Physical is for SCSI reservation technologies like Microsoft Clustering. A SCSI reservation is a way for a host to reserve exclusive access to a LUN in a shared storage configuration.
Oracle Cluster Services does not use SCSI reservations. Instead, Oracle relies on its own software mechanisms to protect the integrity of the shared storage in a RAC configuration. HoB recommends RDM-V rather than RDM-P for Oracle RAC configurations for those shops that choose not to follow the VMDK recommendation.
RDM-P allows SAN tooling access to the storage. For this reason, shops that heavily leverage such tooling might be tempted to configure RDM-P.
Finally, I leave you with a comparison of RDM-P vs. RDM-V. In the next blog I will cover In-guest storage and begin to discuss some storage presentation use cases.
As I find my social media chops, no doubt I’ll post more frequently with less content per post. But meanwhile…
I’m told that Oracle’s Larry Ellison also began contributing to the Twitterverse this summer.
House of Brick’s Oracle VBCA Boot Camp Sunday 9/30 came off better than I expected, and I had high expectations. I think the attendees feel good about it, too, based on their end-of-session group chorus calling for more. I’m inclined to put extra weight on this particular group’s feedback. They were more interactive than most groups and therefore challenged me in helpful ways. I’ll be channeling the feedback into House of Brick’s SQL Server boot camp Tuesday November 6th in Seattle in conjunction with the SQL Pass conference.
I’m declaring Colin Bieberstein of Husky Energy – Calgary as my boot camp guest of honor. I got the impression that Colin may have made significant personal sacrifices to attend on just two weeks’ notice.
As for Database 12c, I’m pleased that the announcement made before Larry Ellison took the stage aligned with what I’ve already been telling everyone. A few months ago Larry predicted the 12c GA release this December or January. That was never going to happen, minimally due to the increasing complexity of the release. Add to that the coding and QA challenges Oracle faces with what has always given the appearance of involving an Exa code branch.
The announcer said sometime in calendar year 2013. I’ve been telling everyone not a minute before June, but more probably toward December 2013. I’ve also been saying don’t look for release stability worthy of production systems anytime before the middle of 2014. That’s not a hit on Oracle. It is just the nature of the beast with code this complex underpinning business-critical systems. I continue to be in love with their red stack software, and sincerely hope that the love affair continues.
As for Database 12c’s multitenancy (pluggable databases on top of a consolidated instance), I firmly believe it is tooling the wrong way at the wrong stack layer. That was my immediate reaction on December 19th, 2011 when I first got wind of the release’s architectural direction. On the contrary, I have been preaching for three years an architectural direction that I have dubbed “atomicity.” That is, in vSphere environments, move toward an alignment of one database instance, middle tier component, or utility per guest OS.
Atomicity accomplishes two major objectives. It makes architecting with tinker-toy components much easier for those building initial stack prototypes, with less technical administrator involvement. Think vCloud Director. That accelerates time to market. Smaller memory/CPU workload alignments also dramatically facilitate live migration. That facilitates dramatically better utilization of an Oracle processor-based license, which in many cases can be a stack’s most expensive component.
I am constantly challenged in my travels by hardware-centric DBAs and System Administrators justly concerned about the prospect of managing four to ten times as many Oracle executable and OS instances. I tell them that 100% of their vSphere-experienced peers that I interact with in shops that know what they’re doing with vSphere say they would never go back. The model is to allocate a fraction of the savings that vSphere provides to patch automation tooling.
Database 12c is attempting virtualization not one but two layers higher in the stack than the vSphere platform that already accomplishes virtualization in spades. When architecting system stacks, always begin the discussion by tooling at the lowest layer of the stack possible, unless there are very compelling business and/or technical justifications to do otherwise.
Oracle VBCA is predictable when you know what you’re doing. I tell people that the one-time replatform from big iron to x86 is the hard part, minimally because, unlike us, enterprises don’t do it every day. I’m convinced the RISC UNIX-to-x86 hop is the single most important thing organizations can plan for in their move toward the world’s premier platform, vSphere. Accordingly, Jeff Browning’s replatforming preso gets my nod for OOW best of show. He sleuthed out the fact that RMAN CONVERT’s sys.dbms_backup_restore endian translation used to work just fine into any x86 platform. That was the case until October 2008, when patch 13340675 “fixed” it to work only with Exadata. You can also pick up Jeff’s preso recording at VMworld 2012.
Jeff invited me to lunch yesterday. The more time I spend with Jeff, the more humbled I am by how big his heart is. Thanks, Jeff. I continue to benefit from our professional and personal friendship.
The VMware 2012 Pavilion is appropriately themed with vFabric at every turn. Charles Fan is a humble, self-effacing man who would never bring up the incredible things he’s done for Joe Tucci. Charles’ vision for what is now called vFabric is really finding its voice, and that’s clear as you stroll around the VMware pavilion.
Yesterday vFabric’s Bill Bonin shared perspective with me--a month after VMware’s Chief Performance Officer Richard McDougall did the same thing--on how big the in-memory Hadoop market is projected to be in just a few short years. Guys, thanks to both of you for your one-on-one attention to make sure I keep Hadoop in my periscope.
VMware, I need you to align vFabric Data Director with vCloud Director strategically, if not integrate them technically. Yet they are being marketed independently and have segregated internal organizations. I’m at a loss to explain the lack of product alignment as I now add the vFabric Data Director message to what I have always felt was VMware’s best tooling: what was Lab Manager and is now vCloud Director. There’s an apparent massive operational deployment intersection here. This is no different than my incessant internal harping to you years ago that Lab Manager and Stage Manager were the same thing, despite the fact that they were separate code bases. You eventually merged Stage Manager into Lab Manager. C’mon, vFabric Data Director was announced a year ago. As one who provided architectural guidance into what was code-named “Aurora” and as a lightning rod evangelist for VBCA, I’m struggling to articulate the cogent unified vision for these separate products. It’s got to be even harder for your prospects to capture that vision. It’s going to get even worse now with your phenomenally shrewd acquisition of DynamicOps.
I ran into a former Oracle RAC employee yesterday who confirmed what we always knew. The statement that got deleted from the published My Oracle Support note (off the top of my head) “There are technical restrictions that prohibit the certification of RAC in a VMware environment” had to do with clock drift. (Clock drift went away with vSphere 4 in 2009).
Ron Zellars from the world’s largest ice cream factory - Wells Dairy - stopped by the VMware pavilion to say hi. We are proud to have helped them over a year ago with their EBS R12 upgrade and replatform of RAC to vSphere.
Brian and Richard from the U.S.’s largest appliance manufacturer - G.E. Appliances and Lighting - also stopped by to say hi. Brian’s business card now sports “Chief Evangelist” because CTO Lance Weaver thought it was cool on my card when GE and HoB got introduced here last year. I may have to declare Brian’s business card title as my biggest professional accomplishment for 2011!
Having said that, my most enjoyable encounter of the day was with Wize Commerce’s DBA Selina Lin out of San Mateo, CA. Selina took in my “Oracle RAC and VMware HA Tooling - A Decision Tree” VMware theater preso but couldn’t return Tuesday for the replay of my “Business Critical Applications Performance – VM vs. Native” preso. So we found a couple chairs just outside exhibit hall doors and took our time with that preso one-on-one. Selina, our ½ hour working session made my day. All the best to you.
My esteemed colleague Cisco’s Tushar Patel invited me to lunch today. Tushar was the VMware-side engineering force behind HoB’s groundbreaking VMware Oracle Solutions Lab at OOW 2007.
My only VMware 2012 pavilion complaint: inadequate white board space. Thank goodness for the iPad Paper app!
And speaking of apps, Uber is the coolest most practical iPhone/iPad app I’ve seen lately! Map all the available drivers in your geo and summon up the nearest one with a click without placing a call to a cab dispatcher.
Run Uber on the iPad 3 you’re going to win when you are physically present Tuesday and Wednesday at 5:30 in the VMware Pavilion. Your odds are pretty good, as I’m thinking there just weren’t that many bodies present given the size of the daily prize.
To close out this mega-post: I was asked yesterday for feedback on a partner strategy event earlier this year. I answered with a concept that I’ve introduced into my professional activities from my volunteer teaching experience: “Never let planned content interfere with a productive discussion.”
By Jim Hannan
I recently discovered a white paper published by VMware on tuning latency-sensitive workloads:
Best Practices for Performance Tuning of Latency-Sensitive Workloads in vSphere VMs
Being part of a performance team that virtualizes business critical applications, we are always looking for better methodologies. So obviously a white paper written by VMware on how to improve performance for latency-sensitive workloads would be of interest.
This blog discusses some of the tweaks and settings introduced in the white paper. I have also provided recommendations on whether we would suggest using each performance tweak. I think it is important to stress two things before considering any of these settings.
- VMware very clearly states that for most workloads these tweaks are unnecessary. The category of applications that would benefit from these settings is truly latency-sensitive applications with SLAs in the sub-second range.
- In some cases we have not had the opportunity to benchmark or examine the results of these settings, so care should be taken. At HoB, we believe the best approach is to first benchmark in a test environment before implementing in production.
Turn off BIOS Power Management
This is something we hope all customers are doing. The Intel Nehalem platform has two power management options:
- Intel Turbo Boost
- C-states
According to VMware, C-states can increase memory latency and are not recommended. Intel Turbo Boost should be left on; according to VMware, it will increase the frequency of the processor should the workload need more power.
Tickless Kernel in RHEL 6
We have watched this from afar for 5-plus years. In early versions of ESX, the guest would suffer from clock drift. Clock drift is when the clock of the OS falls behind. This was common for SMP workloads that were busy doing work, or on an ESX host with constrained resources.
Moving from RHEL 5 to RHEL 6 and its tickless kernel can reduce application latency. The tickless kernel is a better timekeeping mechanism. Additionally, VMware claims it can offer better performance for latency-sensitive applications.
NUMA Node Affinity
NUMA affinity basically assigns a VM to a NUMA node. This can be monitored with ESXTOP. I would not recommend this until it has been determined that NUMA latency is an issue. We say this because each application handles NUMA differently. Oracle, for example, chooses not to use NUMA as of 11g.
To monitor NUMA latency with ESXTOP
esxtop > f > g > enter > (capital) V
If N%L < 80 for any VM, then the workload may have NUMA latency issues.
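Here is a minimal sketch of that check, assuming the per-VM N%L values have already been pulled out of an esxtop batch-mode export or read off the screen; the VM names and numbers below are hypothetical.

```python
# Hypothetical per-VM NUMA locality (the esxtop N%L counter), for illustration.
numa_locality = {
    "oradb01": 97,
    "oradb02": 74,
    "appsrv01": 100,
}

for vm, pct_local in sorted(numa_locality.items()):
    if pct_local < 80:
        print("%s: N%%L=%d -- possible NUMA latency issue" % (vm, pct_local))
```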
Physical NIC Interrupt Moderation
Most NICs support a feature called interrupt moderation or interrupt throttling. This basically buffers (or queues) interrupts from the network card so the host CPU does not get overwhelmed. It can be disabled on most NICs to give latency-sensitive applications lower network latency. VMware recommends this in only the most extreme cases. We agree that this should be used carefully. We would consider it for Oracle RAC workloads that are suffering latency issues on the RAC interconnect.
VMXNET3 Interrupt Coalescing
As of vSphere 5, VMXNET3 supports interrupt coalescing. Interrupt coalescing is similar to a physical NIC's interrupt moderation. The philosophy behind interrupt coalescing is to benefit the entire cluster by reducing the CPU overhead of TCP traffic. However, for latency-sensitive applications--like Oracle RAC--it is best to disable interrupt coalescing.
Go to VM Settings →Options tab →Advanced General →Configuration Parameters and add an entry for ethernetX.coalescingScheme with the value of disabled.
If you have a host that runs only latency-sensitive applications, you can also disable it for the entire host (repeating the change on every host in the cluster) with the setting below.
Click the host, go to the Configuration tab → Advanced Settings → Networking, and set the performance option CoalesceDefaultOn to 0 (disabled).
VMXNET3 Large Receive Offload (LRO)
Similar to the feature above, the VMXNET3 feature LRO aggregates multiple received TCP segments into a large segment before delivery to the guest TCP stack.
We recommend that you disable LRO on all Oracle virtual machines. In a Linux guest using the VMXNET3 driver:
First, unload the driver: # modprobe -r vmxnet3
Then add the following line to /etc/modprobe.conf (file location is Linux version dependent): options vmxnet3 disable_lro=1
Finally, reload the driver: # modprobe vmxnet3
Prevent De-scheduling of vCPUs
I have a hard time imagining that heavy workloads need to use this setting. You can ensure that a virtual machine is never de-scheduled from a pCPU (physical CPU) by configuring the advanced setting monitor_control.halt_desched. This setting is similar to a tight loop an application might have when a process never sleeps; it just continually executes a loop. VMware describes this setting as the VM becoming the owner of the pCPU (or pCPUs), which indicates to me that the pCPU is not eligible to schedule other workloads; it in effect becomes dedicated to the “pinned” workload.
Go to VM Settings →Options tab →Advanced General →Configuration Parameters and add monitor_control.halt_desched with the value of false.
By Jim Hannan
For many customers, virtualization has been a part of their infrastructure for years. They have benefited from virtualization far beyond consolidation ratios. vSphere’s feature-rich tooling has offered them fast provisioning, better availability, better distribution of infrastructure resources, and many more benefits. In many of my consulting engagements I am asked by customers, “What additional tooling should we be looking at?” To which I simply reply:
You should look at VMware SRM and VMware vCloud Director.
At House of Brick, we believe that every customer can benefit from the above-mentioned tools. First, VMware SRM (Site Recovery Manager) offers approachable disaster recovery. VMware SRM achieves this by simplifying the run book (the procedures followed in a DR scenario) and by providing a simplified DR strategy for all applications.
About 5 years ago, VMware introduced a product called VMware Lab Manager. Lab Manager offered customers the ability to clone and build DEV/Test environments. It was uniquely positioned in the market because it offered some key features that were truly only achievable on virtually backed infrastructure. The first feature, called linked clones, allowed for fast provisioning and a small storage footprint. The second feature was something called a network fence. Lab Manager’s fenced network works similarly to your home firewall: it creates a layer 2 separated network behind a NAT firewall. This allowed users to clone production systems without changing the network stack.
A few years ago, VMware introduced vCloud Director. VMware has dubbed this product the replacement for Lab Manager. It was built on the great features that Lab Manager offered, but has greatly improved upon these features and added some interesting new ones.
What is vCloud Director?
VMware vCloud Director is an IaaS (Infrastructure as a Service) tool. Built on top of VMware vSphere virtualization, VMware vCloud Director enables the consumption of virtual infrastructure resources. vCloud Director abstracts the consumer from infrastructure components like datastores, DRS, HA, ESX hosts, and vMotion. The consumer is allowed to consume resources as a tenant inside of the vCloud Director framework.
Use Cases and Features of vCloud Director
- Testing New Software – Many customers use vCloud Director to test new software. vCloud Director allows users to quickly build vApps from organization-approved VM catalogs. In some cases, the test may need to be done in isolation to protect the production servers.
- Building vApps for Testing and Developing Applications – Build vApps of business critical applications for testing and developing. Clone production vApps and keep the network stack intact by using vCloud Director vShield Edge appliances.
- Free Operational Resources for Provisioning Test and DEV Systems – Most organizations’ infrastructure teams spend about 40% of their time provisioning and administering test and DEV systems. With vCloud Director you can free up infrastructure resources to work on production systems by allowing users to self-provision.
- Enhance VM Build Standardization – vApps can be built from pre-approved catalogs of VM templates.
- Fast Provisioning – Deploy vApps in 60 seconds with vCloud Director fast provisioning.
- Security – Secure applications with vShield Edge appliances.
Would you like to see a demo of the product?
You can reach us through the contact page on our website. We will be able to setup a demo of the product and discuss some of the use cases for how the product is being used by customers today.
By Jim Hannan
This blog post focuses on VMFS-5 enhancements in vSphere and how they improve Virtual Business Critical Applications (VBCA) performance and scalability. VMFS-5 offers several improvements over the previous version, VMFS-3.
How VMFS-5 has improved in scalability and performance:
- Newly created VMFS-5 datastores use a single unified block size of 1MB. There is no more choosing between 1MB and 8MB block sizes during VMFS creation. The standard block size of 1MB allows for improved performance, but more on this later.
- VMFS-5 uses GUID Partition Table (GPT) rather than MBR, which allows for pass-through RDM (RDM-P or RDM Physical) files up to 60TB. RDM-V has a maximum size limit of 2TB.
- VMFS-5 does not use SCSI-2 reservations, but instead uses the Atomic Test and Set (ATS) VAAI primitive. SCSI-2 reservations and ATS are used when, among other things, a lock is needed. A lock is required to perform some operations, like creating a VMDK or creating and deleting snapshots. The ATS mechanism allows for more efficient locks and improved performance. ATS requires the VMFS-5 file system and a VAAI-enabled SAN.
- VMFS-5 uses SCSI_READ16 and SCSI_WRITE16 commands for I/O (VMFS-3 used SCSI_READ10 and SCSI_WRITE10). Of the new features, this is the technology improvement I am least familiar with. My understanding of SCSI_READ10 vs. SCSI_READ16 is that SCSI_READ16 offers larger storage capacity. SCSI_READ10 uses a 32-bit Logical Block Addressing (LBA) field, which with 512-byte sectors tops out at 2TB (2^32 x 512 bytes), while SCSI_READ16 uses a 64-bit LBA field that can address vastly more. Obviously, a VMFS datastore cannot grow that large, but you can see the potential.
Reference: vSphere 5 FAQ: VMFS-5
Upgrading to VMFS-5
HoB recommends creating new VMFS-5 datastores rather than performing in-place upgrades of VMFS-3 to VMFS-5. When upgrading a VMFS volume, you only partially benefit from the new features. As an example, VMFS-3 volumes upgraded to VMFS-5 retain the original block size instead of the new 1MB block size. This affects features like Storage vMotion and ATS. In the case of ATS, the ESX hypervisor will revert to the slower SCSI-2 reservation. With Storage vMotion, a slower datamover mechanism called fs3dm will be used instead of the faster fs3dm-hw. In my next blog, Part 3 – Storage vMotion enhancements for VBCA, I will address Storage vMotion and its datamovers (fsdm, fs3dm, and fs3dm-hw).
by Jim Hannan
What is VBCA?
What is Virtualizing Business Critical Applications (VBCA)? A Business Critical Application is exactly what it sounds like, an application that is critical to business success and day-to-day operations. Without it, businesses would struggle to function. Common VBCA applications are: Oracle RDBMS, SQL Server, Exchange, and SAP to name a few.
I believe vSphere 5, as much as anything else, is a VBCA release. What I mean by that is, when you compare it to previous releases, the new features are primarily enhancements aimed at providing a platform for VBCAs. VBCA workloads typically have large resource requirements like CPU or memory and, in the case of databases, I/O throughput. Creatively, this type of workload has been dubbed by VMware a 'Monster VM'.
With vSphere 5, VMware has created a platform more than capable of running Monster VMs. This blog, and the next 5-6 following it, will highlight some of the key features that, in my opinion, give vSphere 5 the accolade of a “VBCA release”.
Part I - Scalable Virtual Machines and Enhanced co-scheduler
At this point it is very well known that the maximum number of vCPUs a single VM can run is 32. For me, this number is staggeringly high. At HoB, we have been virtualizing VBCA workloads as far back as ESX versions 3 and 4, with maximums of 4 and 8 vCPUs respectively. In our estimation, 90% of workloads can fit into a configuration of 8 vCPUs (or fewer). At Indiana University we experienced this first hand when assisting them with virtualizing their OnCourse system. The OnCourse Oracle database was previously on an AIX Power5 LPAR with 12 CPUs allocated. During peak workloads the Oracle database was consuming 9.5 processors. After virtualizing the workload and load testing, we determined not only that the workload would fit within the 8 vCPU max, but that it was only using between 35% and 50% of the CPU. This left lots of scalability headroom for the database. Fast forward to today: with a maximum of 32 vCPUs, the door has opened for virtualizing 95% of the workloads in existence.
How did VMware make the jump from 8 vCPU maximum to 32 vCPU maximum?
This is an intriguing question. The best public information available on this achievement is discussed in the VMware whitepaper, The CPU Scheduler in VMware ESX 4. The “relaxed” co-scheduler was first introduced in ESX version 3.0, with a maximum single-VM vCPU configuration of 4. In this version, the VMware engineers adopted a cell model, in which cells mapped to groups of pCPUs (physical CPUs) and all of a VM's vCPUs had to be scheduled within a single cell. A common processor configuration during the ESX 3 release was a physical processor with 4 cores. As pCPU core counts increased with AMD and Intel chips, VMware determined that the cell model was no longer adequate.
In ESX 4, the VMware engineers moved from the 4 vCPU limit to 8 vCPUs by eliminating the cell architecture in favor of finer-grained locks. This allowed a single VM to span multiple pCPUs and allowed individual vCPUs to be scheduled for certain tasks. This gives the guest OS the ability to schedule one vCPU for a single process or thread, and greatly reduces the overhead, or cost, of CPU scheduling compared to the previous ESX version.
ESXi 5 further enhanced SMP scheduling, increasing SMT application performance. This increase in some cases can reach 10% - 30% (see What’s New in Performance in VMware vSphere 5.0). And, of course, the new maximum of 32 vCPUs per single VM increased from the previous maximum of 8 vCPUs.
Here’s the evolution timeline of the co-scheduler:
There is some discord and significant misinformation in the Oracle community as to what the contractual obligations are when licensing only a subset of a vSphere cluster’s hosts for Oracle. Furthermore, there is all manner of opinion out there regarding the risks of asserting one’s contractual rights. This definitive opinion piece distills the relevant Oracle License and Services Agreement language into a summary that you can get your hands around. It includes observations on what we are seeing and, more importantly, what we are not seeing in terms of legal actions with regard to the OLSA.
I just offered a fairly comprehensive Oracle-on-vSphere licensing opinion piece to my colleague Jeff Browning, who is EMC’s chief spokesman for all things Oracle storage-related. My post appears in Jeff’s Oracle sub-cluster licensing thread, dated April 2nd.