By David Klee (@kleegeek)
How much time are you wasting every month with your SQL Servers because of nonstandard server configurations? Sit back and add it up.
I’ll pause while you pick your jaw up off the floor.
Every server is different. They have different purposes. They could be from different vendors, from different purchase cycles in different years, and have different operating systems at different patch levels. Even identical hardware can have different BIOS and driver revisions. Windows and SQL Server can be wildly different with patches and Service Packs. How can you keep everything up to date when every configuration is unique?
Do you even try?
How standard are your SQL Server configurations? From an ongoing operations standpoint, as your servers converge towards a common configuration, you save time and reduce risk over the lifecycle of that server.
Want an example? Look at Southwest Airlines, a prime case of standardization. The airline standardized on Boeing 737 aircraft for its entire fleet. By standardizing on one aircraft, it has only one aircraft type to maintain and operate at all of its airports, while other airlines have many more types in operation. The maintenance crews can master just one aircraft, simplifying operations, and spare part stores can be kept at a minimum. You get the picture.
Why do I bring up this topic? Consider asking your server admins and DBAs to determine the number of distinct server configurations in your environment. Now, filter out those systems that require a special configuration (vendor requirements, SQL Server collation settings, security, etc.).
I will guess that this number is a bit shocking.
Now consider the time spent handling patches, firmware and driver updates, and routine maintenance that is tailored to each system. Every different configuration requires special considerations and care to manage properly, and each unique configuration requires its own amount of update rollback planning prior to each maintenance cycle. You DO have rollback plans, don’t you?
These numbers are astounding. The amount of time you just calculated could represent most of someone’s full-time job. Or, more frequently, it is the amount of time that pushes an employee into dangerous levels of overtime.
Where does virtualization fit in?
By now, you have read some of my posts and know that I am a huge proponent of virtualization. Virtualization can help you standardize your infrastructure. You can create a master SQL Server virtual machine template on any modern hypervisor, and use it to deploy your entire server infrastructure with the same standardized and pre-approved configuration. Of course, server customizations are normal, but the differences between your templates and your production deployments are few. The benefits of this practice can be tremendous.
Some benefits are obvious.
- Pre-approved templates mean guaranteed consistency with your server builds. The server was built to your standards going in, and you can guarantee that those standards are maintained with each server deployment.
- You also dramatically reduce your new server deployment time. Server deployments can now take mere minutes, rather than the weeks or months of a traditional physical server procurement and deployment cycle.
- You also eliminate the need to constantly monitor and upgrade things such as BIOS and hardware driver revisions. The only component that needs updating is the VMware Tools, and as of vSphere 5.1, those updates do not require a reboot.
Now, some of the benefits of template-based virtualization are much less obvious.
- Your maintenance and upgrade operation rollback plans drop from terrible, time-consuming processes each and every time you have to update or patch — down to just a few clicks, thanks to VM-level snapshots.
- The time and resources required to perform the inevitable hardware refresh drop to zero once the initial VM migration has been completed. Simply hot-migrate (via vMotion or Live Migration) the VM to the new hypervisor hosts and you are done! This act alone will pay great dividends across the organization if you factor staff time into the ROI calculations.
- You can now hot-clone a production server to an isolated network and test various configuration changes (and witness the effects) on a common server configuration, thanks to VM-level cloning operations. This task can reduce the risk of unexpected impacts of configuration changes on those servers with common templates.
Now, how do you handle standardization when your business treats server configurations like it is the Wild West?
- The business must sign off that these templates will be used across the board, and that one-off configuration changes be documented once they are approved.
- The business should also standardize on common configurations, policies, pre-installed tools, etc. Some businesses have different organizational units that each have their own server build standards. Physical and virtual server sprawl can turn into template sprawl if not kept in check, and the goal of VM standardization goes right out the window.
- Audit server configuration changes periodically, and check your audit history for compliance. Anyone sneaking configuration changes onto a server without going through a change control process should be discovered.
- Patch as needed, but try to keep the deployed servers as close to the same patch level as possible. This keeps the server configuration close, and makes testing patches and updates that much easier.
Keep your server builds as close as possible, and watch the time spent on rote operations drop! Your administrators will thank you for it.
House of Brick is pleased to announce the hiring of Bob Lindquist, industry veteran and leader in the virtualization of business critical applications (VBCA), as the business development director for the west region.
“Bob was one of the early visionaries in the industry to grasp the importance of virtualizing business critical applications such as Oracle and SQL Server-based systems,” said Jim Ogborn, VP of Client Services at House of Brick. “I am excited that he will bring his vision of virtualization and cloud computing coupled with House of Brick’s industry-leading VBCA services to our customers and partners in the west region.”
House of Brick is leading the industry in virtualizing business critical systems, having provided services to dozens of the Fortune 500 companies, as well as hundreds of other enterprises of all sizes. Service bundles such as the Oracle or SQL Server on VMware Enablement, and the Oracle License Review and Architecture Optimization have helped House of Brick customers save countless millions of dollars while improving high availability, disaster recoverability, and new feature time to market.
“Virtualizing business critical systems, such as those built on top of Oracle DB or MS SQL Server is sometimes an intimidating thing for our customers,” says Nathan Biggs, House of Brick CEO. “This fear can come from not having done this type of virtualization successfully before, and can also come from misunderstandings about performance, licensing, and support from their software vendors. House of Brick knocks down the technical and emotional barriers that our customers face in taking on projects like this, and provides key services for successful implementations.”
In commenting on the hiring of Bob Lindquist, Biggs said “Bob will bring tremendous energy to the growth we are experiencing at House of Brick. With the increased VBCA activity we are seeing in the west, Bob is a welcome and critical addition to our team.”
"In my 25 years as a customer, solution provider, and consultant, I have seen the transformation of IT in all organizations due to enabling technologies like virtualization,” said Bob Lindquist. “VBCA is the next enabler for providing IT-as-a-Service to lines of business. House of Brick is the industry specialist in this space, and I feel privileged to join such a strong team of IT visionaries."
Bob Lindquist immediately assumes business development responsibility for House of Brick’s customers and partners in the western states region, including Arizona, California, Colorado, Nevada, New Mexico, and Utah. Come meet Bob at House of Brick’s Suite at Mandalay Bay during VMware’s Partner Exchange, February 23rd through the 28th in Las Vegas.
By Jim Hannan (@HoBHannan)
As virtualization adoption increases for Tier 1 workloads, many organizations are looking for ways to automate deployments. Oracle VM has touted this as one of its premier features (Users Can Deploy Oracle Real Application Clusters Up to 10 Times Faster with Oracle VM Templates). In my opinion, Oracle has done a really good job with this. As a virtualization software provider, Oracle recognized early on the importance of automating the deployment of Oracle workloads. Oracle primarily does this with Oracle VM templates, which I believe have some shortcomings (the same shortcomings apply to VMware templates). Templates for both Oracle and VMware follow the same concept: a virtual machine is created and then copied to a template for future deployments. The problem with templates is that they are out of date as soon as the next security patch is released or an organizational standard changes. Managing and updating the template itself becomes a burden, with diminishing returns the older the template gets.
Many organizations have recognized the shortcomings of templates and have moved to automated deployments. HoB has had the opportunity to assist customers with automation. Each organization we have worked with has taken a slightly different approach to the solution, but fundamentally they all follow similar steps:
- Create a virtual machine (vCPU, memory, network)
- Boot the virtual machine (typically using a boot image)
- Begin automated OS install (most common for Linux is kickstart)
- Use scripts to further customize the build
- Install Oracle software (process varies)
- Create the Oracle database (process varies)
The table below contains the different steps outlined above along with the various tool options. Note that the tool column is numbered; each number represents a different tool option for that particular step.
In my next blog, I will do a walk through of some of the tools mentioned above.
By David Klee (@kleegeek)
The week of November 5th was a very busy week here at House of Brick. First, knowing that the SQL PASS Summit was taking place in Seattle, we held a five-hour boot camp for Virtualizing Your Business Critical SQL Servers at the Sheraton Seattle Hotel -- just two blocks from the Seattle Convention Center. We consider ourselves the leaders in the virtualization of Business-Critical Oracle and SQL Server workloads on VMware. Based on the positive comments from SQL Saturday and SQL Server User Group attendees over the last year, we decided to release the full five-hour training at this free event.
I co-presented the boot camp with Dave Welch (HoB's CTO and Chief Evangelist), and I must say that I had a blast. We had a solid attendee roster, and the enthusiasm and conversation from the attendees was fantastic. I feel as if we could have spoken for twice the amount of time and still not tackled every component of this topic.
The outline of the training was as follows:
- Introduction to SQL Server Virtualization
- Virtualization Trends
- Common Objections and Misconceptions
- Physical Stack Fundamentals
- SQL Server Licensing Concepts
- Storage, vSphere Host, and Networking
- Virtual Machine Layer
- VM and Guest Operating System Customizations
- Virtual Storage and Presentation Options
- Installing and Optimally Configuring SQL Server
- SQL Server on VMware Prototype
- Benchmarking and Baselining Performance
- Workload Selection
- Beyond the Prototype
- Disaster Recovery and High Availability Options
- SQL Server Clustering
- SQL Server 2012 AlwaysOn
This event was House of Brick's first full-length SQL Server training opportunity, but it most certainly will not be the last. After the overwhelmingly positive feedback from the attendees, we are working to develop an official 2013 training curriculum for both Oracle and SQL Server. More details will be posted here once these offerings are formalized.
(Image Source: David Klee)
Once SQL PASS Summit started, I spent a lot of time in the exhibit hall in the VMware Corporation booth. Over the next three days, I answered hundreds of questions on all topics, from ‘What is Virtualization?’ to advanced questions such as ‘Is the PVSCSI controller really better than the LSI SAS controller?’ I spent quite a bit of time talking about VMware’s latest offering in the database space, the next version of vFabric Data Director. This product rev was announced at the show, and once it is released and generally available, it will support truly private-cloud-based SQL Server databases in addition to the Oracle and vPostgres database platforms supported by previous versions.
By far, House of Brick's greatest exposure of the week came from the opportunity I had to co-present with the infamous Kevin Kline in a session called ‘Managing SQL Server in a Virtual World’. To speak with Kevin in this setting -- a presentation room filled almost to capacity, and a terrific array of questions and comments from attendees -- was one of the best professional moments of the year for me.
(Image Source: Kevin Kline)
After this event, we are actively planning our 2013 SQL Server community involvement avenues, and simply cannot wait to get started!
By David Klee (@kleegeek)
Welcome to the sixth part in our series of posts regarding the virtualization of your business critical SQL Servers. In this post, I discuss Disaster Recovery of your SQL Servers while running on VMware. I wrap up the post with techniques for demonstrating the power of virtualization, and its benefits, to those individuals who might still fear (or not understand) it.
High Availability and Disaster Recovery
To start the discussion, let’s begin with a reminder of the definitions of High Availability (HA) and Disaster Recovery (DR).
High Availability (HA) is a design approach where systems are architected to meet a predetermined level of operational uptime, such as a Service Level Agreement (SLA). This means systems should have appropriate levels of redundancy while keeping the environment as simple as possible, so as not to introduce more failure points.
Disaster Recovery (DR) is the process of preparing for, and recovering from, a technology infrastructure failure that is critical to a business. The core metrics of DR are the Recovery Point Objective (RPO) and Recovery Time Objective (RTO). The RPO is a measurement of the maximum time window where data could be lost in the event of a disaster. The RTO is the measurement of the maximum time that a core service or system can be offline after a disaster.
High availability is not disaster recovery, and disaster recovery is not high availability. Do not confuse the two, and do not assume that a solution for one is the same solution for the other.
In my previous post I talked about the options for SQL Server HA on VMware. We continue this topic now with a greater deep-dive on DR options for SQL Server on VMware.
Disaster Recovery Discussion Topics
A proper DR configuration must be geographically separated by a reasonable amount of distance. Here in the Midwest, the extreme weather plays a key role in determining DR distances. For example, a tornado can take out only a few square miles of land, but whatever it does strike it will probably level. On the other hand, a blizzard could come through in the winter months and knock out power for large portions of the area for days. In that scenario, the equipment suffers no real damage, but it is just as unavailable. Both scenarios knock you offline for an indefinite period of time.
Be smart in the placement of your resources. One of the worst examples I know of came from a friend whose company had its primary datacenter in Miami and its secondary datacenter in New Orleans. Does anyone remember Hurricane Katrina? The storm was about to knock out the primary datacenter, so the company failed over to the New Orleans site. Four days later, the same storm destroyed the secondary datacenter, and because the primary was still down, they were left with a datacenter on the third floor of a building with 15 feet of water in the lower floors. The business was down for weeks while off-site backups were gathered, a new datacenter and equipment were allocated, and the business was restored to the new site. It was an awful process.
The RPO and RTO take on even greater importance when discussing DR strategies. Both objectives must be determined before you start the DR planning process, because they drive certain technology choices in the architecture. These numbers can quickly eliminate several options and, sometimes, can help the business rethink the numbers themselves.
One of the major points that few people discuss is the fail-back process. How do you get your data back to the primary datacenter after you have failed over and now have new transactions? What if your downtime requirements are tight?
So, without further ado, here are the methods for achieving DR with SQL Server on VMware.
VMware Site Recovery Manager
VMware Site Recovery Manager (SRM) and its little sister, vSphere Replication, are both means to handle SAN-to-SAN replication for disaster recovery purposes. The recovery time objectives with both of these products are low and, most importantly, the human steps in the failover process are few (or sometimes nonexistent).
SRM also handles failover and failback processes and can perform automatic testing of the replication process. It’s fantastic for those environments that have to routinely demonstrate the DR process without impacting the servers. Multiple failover and failback plans can be defined and carried out as required.
Single instance SQL Server on VMware works great with SRM, as long as you can demonstrate transactional integrity during the replication. Many SAN vendors have SRM guides for properly configuring the environment to work with SRM. Test transactional integrity with your normal (and maximum) workloads before putting these solutions into production.
However, the smallest RPO per VM or LUN replication is 15 minutes, and sometimes longer, depending on the SAN vendor. Keep this in mind. Sometimes, a combination of this strategy and a more real-time replication, such as database mirroring or AlwaysOn, could be used to complement the SRM strategy and reduce the RPO accordingly.
RTO: Very low compared to other solutions. Measured in minutes. Can dictate order in which VMs are powered on.
RPO: No shorter than 15 minutes, which can be problematic for some environments.
Failover Process: Simple, repeatable, and multiple plans can be defined for various situations.
Failback Process: Simple, repeatable, and multiple plans can be defined for various situations.
Pros: Low human intervention means failover process has a lower chance of errors during failover or failback. Can be tested and audited periodically without impacting operations.
Cons: RPO could be higher than business can allow.
vSphere Replication is a new feature of vSphere 5.1 that is included for free with all editions from Essentials Plus and above. Instead of orchestrating SAN-to-SAN LUN replication, this technology handles site-to-site replication of the changed blocks of individual virtual machine disks. Failover is manual, but for a free product, this technology handles replication very well.
As with VMware SRM, the smallest RPO per VM replication is 15 minutes. Keep this in mind; a combination of this strategy and a more real-time replication technology, such as database mirroring or AlwaysOn, could be used to complement vSphere Replication and reduce the RPO accordingly.
RTO: Very low compared to other solutions. Measured in minutes.
RPO: No shorter than 15 minutes, which can be problematic for some environments.
Failover Process: Simple, manual, and repeatable.
Failback Process: A manual process but is simple and repeatable.
Pros: Included with the core vSphere 5.1 suite at almost all licensing levels, and makes it simple to set up DR with commodity hardware and a 15-minute RPO.
Cons: RPO could be higher than business can allow. Failback process is manual.
SQL Server Asynchronous Mirroring
SQL Server database mirroring has been around since SQL Server 2005 Service Pack 1. Asynchronous database mirroring allows near-real-time replication of data between the two nodes. If a failover occurs, the secondary node fires up and picks up where the other left off. However, data might be missing if the failover occurs while data is still in transit.
Note that automatic failover requires synchronous (high-safety) mode with a witness server. In asynchronous mode, the decision to promote the mirror to the principal role is never automatic; it is a manual, forced failover.
Also, asynchronous mirroring is only available in the Enterprise edition of the product. Standard edition gets you ‘full safety’ mode, which is real-time synchronous replication only. If the link between the two sites is slow or has high latency, you will feel the performance impact rather quickly, and you could even experience significant delays in your production operations. Unless the bandwidth AND latency between the two sites can keep up with the transactional rates required by your workload during ALL periods of the day, synchronous mirroring is not recommended as a solution for production DR.
You will also notice that if the bandwidth between the two sites is lower than required, data can stack up waiting to be replicated, and the recovery point objective might not be attainable.
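For illustration, here is a minimal T-SQL sketch of establishing an asynchronous mirroring session. The server and database names are placeholders, and it assumes the mirroring endpoints already exist on both instances and the mirror copy of the database has been restored WITH NORECOVERY.

```sql
-- Hypothetical sketch: establish mirroring for [SalesDB], then switch to
-- asynchronous (high-performance) mode. Names and endpoints are placeholders.

-- On the mirror server: point the database at the principal's endpoint.
ALTER DATABASE [SalesDB]
    SET PARTNER = 'TCP://sqlprod01.corp.local:5022';

-- On the principal server: point the database at the mirror's endpoint.
ALTER DATABASE [SalesDB]
    SET PARTNER = 'TCP://sqldr01.corp.local:5022';

-- On the principal: turn off transaction safety to run asynchronously
-- (high-performance mode, Enterprise edition only).
ALTER DATABASE [SalesDB] SET PARTNER SAFETY OFF;
```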
RTO: Very quick, as the roles are flipped and the secondary node fires right up.
RPO: Variable, depending on transactional volume and available bandwidth.
Failover Process: Quick and generally without human intervention.
Failback Process: Resynchronize and fail the nodes back to primary. It is very straightforward and simple.
Pros: Quick failover times. Simple and not error-prone failover or failback processes.
Cons: Expensive licensing. Application must support mirroring failover target. Failover is at the database level and not the instance, so applications with multiple databases might require scripting.
SQL Server 2012 AlwaysOn Asynchronous Replication
SQL Server 2012 AlwaysOn is a blend of the virtual IP address and failover of Microsoft Failover Clustering and an improved derivative of SQL Server database mirroring. Because of the options that I previously covered here in this post, AlwaysOn can serve as both a DR and an HA solution, depending on the configuration. Again, I stress asynchronous replication here because of the impact of the speed and latency logistics around WAN synchronous replication.
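To give this a concrete flavor, here is a minimal sketch of defining an Availability Group with a synchronous local replica for HA and an asynchronous replica in the DR site. The server and database names are placeholders, and it assumes the underlying Windows failover cluster and HADR endpoints are already in place.

```sql
-- Hypothetical sketch: one synchronous HA replica, one asynchronous DR replica.
CREATE AVAILABILITY GROUP [AG_Sales]
FOR DATABASE [SalesDB]
REPLICA ON
    N'SQLPROD01' WITH (
        ENDPOINT_URL      = N'TCP://sqlprod01.corp.local:5022',
        AVAILABILITY_MODE = SYNCHRONOUS_COMMIT,
        FAILOVER_MODE     = AUTOMATIC),
    N'SQLDR01' WITH (
        ENDPOINT_URL      = N'TCP://sqldr01.corp.local:5022',
        AVAILABILITY_MODE = ASYNCHRONOUS_COMMIT,
        FAILOVER_MODE     = MANUAL);
```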
RTO: Failing a node over to a DR site could be measured in seconds, depending on transactional volume and replication rates.
RPO: Could be as low as a second or less, depending on transactional volume and replication rates.
Failover Process: As simple as it gets. Happens in seconds. The Availability Group moves over and the application reconnects to the same IP address as before.
Failback Process: Even simpler. It’s the same process as failover, and just works!
Pros: Easy to setup and easier to manage. Extremely simple failover and failback processes.
Cons: Potentially expensive licensing. Must go through the upgrade process to SQL Server 2012.
SQL Server Transactional Replication
Transactional replication allows the replication of data and schema changes from one server to one or more other servers. Data can be replicated at various time intervals in a number of ways. One good note about transactional replication is that you can select which tables get replicated. If some tables need to be replicated and others can be ignored, you can do this!
No automatic means for failover currently exist, so failover is a manual process. Data sources must have their targets changed, or for the more technically savvy, DNS aliases would need to be updated to point to the new server. The failover process would need to be documented and heavily tested. No automated fail-back procedure exists, so this process would also need to be planned for and tested heavily as well.
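As a rough sketch of what the publication side looks like in T-SQL (names are placeholders, a Distributor is assumed to already be configured, and the agent and snapshot setup steps are omitted):

```sql
-- Hypothetical sketch of publishing one table for transactional replication.
USE [SalesDB];

-- Enable the database for publishing.
EXEC sp_replicationdboption
     @dbname = N'SalesDB', @optname = N'publish', @value = N'true';

-- Create a continuous transactional publication.
EXEC sp_addpublication
     @publication = N'Sales_DR_Pub', @repl_freq = N'continuous';

-- Add only the tables you want replicated.
EXEC sp_addarticle
     @publication = N'Sales_DR_Pub',
     @article = N'Orders', @source_object = N'Orders';

-- Subscribe the DR server.
EXEC sp_addsubscription
     @publication = N'Sales_DR_Pub',
     @subscriber = N'SQLDR01', @destination_db = N'SalesDB';
```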
RTO: Varies depending on the complexity of the environment and the applications that connect to the database(s).
RPO: Varies depending on the replication time period to the distributor plus the time to have the subscribers fetch and apply the transactions.
Failover Process: Purely manual process.
Failback Process: Purely manual process.
Pros: Can selectively replicate databases or data subsets to one or more target servers on whatever time period you require.
Cons: Potentially complex and can be cumbersome for failover and failback processes.
SQL Server Log Shipping
Log shipping is a process for backing up, copying, and then restoring database transaction logs from a source database to a standby server. It has been around quite a while, and has been used successfully for DR purposes for years.
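Expressed as bare T-SQL, the pattern looks like the sketch below. In practice the built-in log shipping wizard and SQL Server Agent jobs handle the backup, copy, and restore steps on a schedule; the paths and names here are placeholders.

```sql
-- On the primary: back up the transaction log to a share the standby can reach.
BACKUP LOG [SalesDB]
    TO DISK = N'\\dr-share\logship\SalesDB_20130101_1200.trn';

-- On the standby: restore the log, leaving the database ready for more restores.
RESTORE LOG [SalesDB]
    FROM DISK = N'\\dr-share\logship\SalesDB_20130101_1200.trn'
    WITH NORECOVERY;

-- At failover time: bring the standby online.
RESTORE DATABASE [SalesDB] WITH RECOVERY;
```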
RTO: Relatively short, generally 15 minutes or less.
RPO: Varies, based on the configuration, and can be very short.
Failover Process: Purely manual process.
Failback Process: Purely manual process.
Pros: Reliable once configured and established.
Cons: Purely manual failover and failback. Downtime is guaranteed while failover occurs. Application must be reconfigured to point to the new server, or a DNS alias adjusted.
Backup and Restore
If a low RPO is not a requirement (and yes, those environments actually DO exist!), a simple replicated backup and an automated restore could suffice. If a somewhat lower RPO is needed, periodic transaction log backups could be replicated as well.
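At the DR site, the recovery itself is just a scripted restore. A minimal sketch, with placeholder paths and database names:

```sql
-- Restore the last replicated full backup, then any replicated log backups,
-- then recover the database.
RESTORE DATABASE [SalesDB]
    FROM DISK = N'\\dr-share\backups\SalesDB_full.bak'
    WITH NORECOVERY, REPLACE;

RESTORE LOG [SalesDB]
    FROM DISK = N'\\dr-share\backups\SalesDB_log_01.trn'
    WITH NORECOVERY;

RESTORE DATABASE [SalesDB] WITH RECOVERY;
```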
RTO: Varies. Many factors affect this figure: size of the database, speed of the servers and storage at the DR site, and the state of the replicated backup files, just to name a few.
RPO: Varies, and is probably poor. It could be as long as the time between backups plus the time to replicate the backup file(s) to the DR location.
Failover Process: Restore the last known good, successfully replicated backup. If present, restore any replicated transaction log backups taken after it.
Failback Process: Create a new backup and restore at the primary site.
Pros: Least complex and simple to manage.
Cons: Long RPO and possibly RTO. Manual process.
Why isn’t Clustering in this List?
Clustering is not a solid DR solution. Why not? Think about the following points.
- Shared storage is available at a single location, practically speaking. Extending a SAN’s high speed interconnects across a WAN is impractical.
- Your one shared-storage SAN is still a single point of failure. I don’t care what any SAN vendor says. It can still fail. It still needs maintenance. It still has a power cord that can be tripped over. Etc. etc.
Now ask that question again. Clustering is for high availability, not disaster recovery.
Now What? Demonstrating that Virtualization Helps
So now what? You now have a solid understanding of each component in the physical and virtual stacks. You know the methods for profiling the existing physical servers and building the appropriately sized virtual equivalents. You know about HA and DR strategies.
But what is the hardest part about virtualizing business-critical servers?
It’s the environment.
The resistance comes from people, and it comes in many forms – from employees who were burned by a failed virtualization attempt, from skeptics (my favorite challenge), from an organization that is naturally resistant to change, and from those with no frame of reference who fear adding another black-box layer to their environments.
So what do you do?
- You educate.
- You demonstrate.
- You amplify.
- You support.
Educate the stack and application owners and stakeholders. Identify and educate the organizational advocates that exist in any group of people. Educate a team on what the other teams expect from them. Educate a team on what advantages virtualization brings to them specifically. Educate them on the technologies and how to identify the key performance metrics that matter to them.
Education will make believers of most people.
Next, set up a virtualization proof of concept on similar, if not equivalent, hardware to what you run in production. Identify your strangest, largest, or most complicated servers, and then add the servers you fear. You know, those servers that get cranky whenever changes are made, so they sit isolated and no one touches them. Clone the databases and workloads onto new, properly sized virtual machines. Simulate production-like workloads and measure all of the key performance metrics you identified when you profiled your physical server. If you followed the steps outlined in the previous posts, you will find that the performance of those virtualized workloads at least matches the physical equivalent. The technology has caught up with the business need and is now functionally transparent.
Now that you have objectively demonstrated equivalent performance in the virtualized environment, run up and down every hallway at your organization shouting ‘Virtualization works!’ No, wait. That might be counter-productive in the long run. Instead, demonstrate the performance equivalence to everyone in the organization who needs to believe in the technology. Demonstrate core vSphere functionality, such as vMotion, snapshots, or resource changes, and watch every response turn into ‘Wow!’
Make the right people believers by letting them see it for themselves. These people can become your biggest virtualization advocates.
Finally, support everyone through the production migration process. Sure, there might be hiccups along the way, but the hardest part is done. Once your business-critical applications have been virtualized, the shift becomes the norm: all new servers are virtualized up front, and a discussion happens only when a workload is assumed to be unsuitable for virtualization (fringe cases do exist, but they are rare). And if performance issues arise, virtualization will no longer be everyone's immediate scapegoat.
You have succeeded.
Special thanks go to my coworkers at House of Brick. We all fight the virtualization fight and enjoy the challenge. I believe we have developed the best business-critical workload virtualization team in the world, and thank you all for constantly pushing yourselves and each other to stay in front of this groundbreaking shift in IT’s core.
We all believe in the technology, and I hope it shows in our eagerness and determination to progress as the technologies continue to evolve. After reading this blog post series, I hope that you believe in the technology as much as we do. We will continue to post new and exciting technologies to this blog, all of which are furthering the quest for performance for your data. Stay tuned!
As I find my social media chops, no doubt I’ll post more frequently with less content per post. But meanwhile…
I’m told that Oracle’s Larry Ellison also began contributing to the Twitterverse this summer.
House of Brick’s Oracle VBCA Boot Camp Sunday 9/30 came off better than I expected, and I had high expectations. I think the attendees feel good about it, too, based on their end-of-session group chorus calling for more. I’m inclined to put extra weight on this particular group’s feedback. They were more interactive than most groups and therefore challenged me in helpful ways. I’ll be channeling the feedback into House of Brick’s SQL Server boot camp Tuesday November 6th in Seattle in conjunction with the SQL Pass conference.
I’m declaring Colin Bieberstein of Husky Energy – Calgary as my boot camp guest of honor. I got the impression that Colin may have made significant personal sacrifices to attend on just two weeks’ notice.
As for Database 12c, I’m pleased that the announcement made before Larry Ellison took the stage aligned with what I’ve already been telling everyone. A few months ago Larry predicted the 12c GA release this December or January. That was never going to happen, minimally due to the increasing complexity of the release. Add to that the coding and QA challenges Oracle faces with what has always appeared to involve an Exa code branch.
The announcer said sometime in calendar year 2013. I’ve been telling everyone not a minute before June but more probable toward December 2013. I’ve also been saying don’t look for release stability worthy of production systems anytime before the middle of 2014. That’s not a hit on Oracle. It is just the nature of the beast with code this complex underpinning business-critical systems. I continue to be in love with their red stack software, and sincerely hope that the love affair continues.
As for Database 12c’s multitenancy (pluggable databases on top of a consolidated instance), I firmly believe it is tooling the wrong way at the wrong stack layer. That was my immediate reaction on December 19th, 2011 when I first got wind of the release’s architectural direction. On the contrary, I have been preaching for three years an architectural direction that I have dubbed “atomicity.” That is, in vSphere environments, move toward an alignment of one database instance, middle tier component, or utility per guest OS.
Atomicity accomplishes two major objectives. It makes architecture with tinker toy components much easier for those building initial stack prototypes and with less technical administrator involvement. Think vCloud Director. That accelerates time to market. Smaller memory/CPU workload alignments also dramatically facilitate live migration. That facilitates dramatically better utilization of an Oracle processor-based license, which in many cases can be a stack’s most expensive component.
I am constantly challenged in my travels by hardware-centric DBAs and System Administrators justly concerned about the prospect of managing four to ten times as many Oracle executable and OS instances. I tell them that 100% of their vSphere-experienced peers that I interact with in shops that know what they’re doing with vSphere say they would never go back. The model is to allocate a fraction of the savings that vSphere provides to patch automation tooling.
Database 12c is attempting virtualization not one but two layers higher in the stack than the vSphere platform, which accomplishes virtualization in spades. When architecting system stacks, always begin the discussion by tooling at the lowest layer of the stack possible, unless there are very compelling business and/or technical justifications to do otherwise.
Oracle VBCA is predictable when you know what you’re doing. I tell people that the one-time replatform from big iron to x86 is the hard part, minimally because, unlike us, enterprises don’t do it every day. I’m convinced the RISC UNIX-to-x86 hop is the single most important thing organizations can plan for in their move toward the world’s premier platform, vSphere. Accordingly, Jeff Browning’s replatforming preso gets my nod for OOW best of show. He sleuthed out the fact that RMAN CONVERT’s sys.dbms_backup_restore endian translation used to work just fine into any x86 platform -- that is, until October 2008, when patch 13340675 “fixed” it to only work with Exadata. You can also pick up Jeff’s preso recording at VMworld 2012.
Jeff invited me to lunch yesterday. The more time I spend with Jeff, the more humbled I am by how big his heart is. Thanks, Jeff. I continue to benefit from our professional and personal friendship.
The VMware 2012 Pavilion is appropriately themed with vFabric at every turn. Charles Fan is a humble, self-effacing man who would never bring up the incredible things he’s done for Joe Tucci. Charles’ vision for what is now called vFabric is really finding its voice, and that’s clear as you stroll around the VMware pavilion.
Yesterday vFabric’s Bill Bonin shared perspective with me--a month after VMware’s Chief Performance Officer Richard McDougall did the same thing--on how big the in-memory Hadoop market is projected to be in just a few short years. Guys, thanks to both of you for your one-on-one attention to make sure I keep Hadoop in my periscope.
VMware, I need you to align vFabric Data Director with vCloud Director strategically, if not integrate them technically. Yet they are being marketed independently and have segregated internal organizations. I’m at a loss to explain the lack of product alignment as I now add the vFabric Data Director message to what I have always felt was VMware’s best tooling: what was Lab Manager and is now vCloud Director. There’s an apparent massive operational deployment intersection here. This is no different than my incessant internal harping to you years ago that Lab Manager and Stage Manager were the same thing, despite the fact that they were separate code bases. You eventually merged Stage Manager into Lab Manager. C’mon, vFabric Data Director was announced a year ago. As one who provided architectural guidance into what was code-named “Aurora,” and as a lightning rod evangelist for VBCA, I’m struggling to articulate the cogent unified vision for these separate products. It’s got to be even harder for your prospects to capture that vision. It’s going to get even worse now with your phenomenally shrewd acquisition of DynamicOps.
I ran into a former Oracle RAC employee yesterday who confirmed what we always knew. The statement that got deleted from the published My Oracle Support note (off the top of my head) “There are technical restrictions that prohibit the certification of RAC in a VMware environment” had to do with clock drift. (Clock drift went away with vSphere 4 in 2009).
Ron Zellars from the world’s largest ice cream factory - Wells Dairy - stopped by the VMware pavilion to say hi. We are proud to have helped them over a year ago with their EBS R12 upgrade and replatform of RAC to vSphere.
Brian and Richard from the U.S.’s largest appliance manufacturer - G.E. Appliances and Lighting - also stopped by to say hi. Brian’s business card now sports “Chief Evangelist” because CTO Lance Weaver thought it was cool on my card when GE and HoB got introduced here last year. I may have to declare Brian’s business card title as my biggest professional accomplishment for 2011!
Having said that, my most enjoyable encounter of the day was with Wize Commerce’s DBA Selina Lin out of San Mateo, CA. Selina took in my “Oracle RAC and VMware HA Tooling - A Decision Tree” VMware theater preso but couldn’t return Tuesday for the replay of my “Business Critical Applications Performance – VM vs. Native” preso. So we found a couple chairs just outside exhibit hall doors and took our time with that preso one-on-one. Selina, our ½ hour working session made my day. All the best to you.
My esteemed colleague Cisco’s Tushar Patel invited me to lunch today. Tushar was the VMware-side engineering force behind HoB’s groundbreaking VMware Oracle Solutions Lab at OOW 2007.
My only VMware 2012 pavilion complaint: inadequate white board space. Thank goodness for the iPad Paper app!
And speaking of apps, Uber is the coolest, most practical iPhone/iPad app I’ve seen lately! Map all the available drivers in your area and summon the nearest one with a click, without placing a call to a cab dispatcher.
Run Uber on the iPad 3 you’re going to win when you’re physically present in the VMware Pavilion at 5:30 on Tuesday or Wednesday. Your odds are pretty good, as I’m thinking there just weren’t that many bodies present given the size of the daily prize.
To close out this mega-post: I was asked yesterday for feedback on a partner strategy event earlier this year. I answered with a concept that I’ve introduced into my professional activities from my volunteer teaching experience: “Never let planned content interfere with a productive discussion.”
By David Klee (@kleegeek)
Welcome to part three in our series of posts regarding the virtualization of your business critical SQL Servers. This post continues our discussion with details around how to architect the VMware infrastructure and the new virtual machine so that it meets or exceeds the physical server specifications. These best practices are a direct result of years of virtualizing SQL Server, and include all of our lessons learned on infrastructure and configuration.
Scale Up or Scale Out?
The first architectural challenge I always encounter is the discussion around scaling up or scaling out. Scaling up means more resources assigned to a single virtual machine, along with either more than one SQL Server instance or many distinct applications all connecting to each instance. In contrast, scaling out your SQL Server virtual machines means more independent virtual machines with smaller workloads.
I instinctively choose to scale out whenever possible.
If you have a huge workload already that is centralized around one application, a single workload is your only option. You could try sharding, or try using SQL Server 2012 AlwaysOn and a smart data-access layer to split your reads from your writes, but this is not usually feasible. Most organizations use virtualization to consolidate workloads onto a smaller physical footprint, but if you have a lot of huge virtual machines, you run a greater risk of encountering limitations that can induce performance problems and reduce some of the flexibility inherent with virtual machines. Such risks include:
- CPU scheduling penalties from idle resource scheduling or competing thread requests.
- I/O and network path bottlenecks.
- Additional overhead on the ESXi hosts with resources dedicated to resource management (CPU, memory overhead).
- Increased difficulty scheduling routine maintenance if a lot of applications are dependent on one database server.
On the flip side, scaling up your virtualized workloads has its advantages in certain situations. For example, certain licensing scenarios might limit the number of virtual machines you are allowed to operate. You have fewer servers to patch and manage. That means less of a security footprint, fewer IPs and broadcast traffic on the network, etc.
However, in the vast majority of the scenarios that I encounter, scaling out makes more sense. I try to stick with the atomic model whenever it is appropriate.
The atomic model implies that you are treating your workloads more as single-purpose appliances than consolidated general-purpose servers. If your licensing allows, divide your workload into small units. Doing so will increase flexibility and workload balancing by keeping the workload units small.
Physical Server Architecture
First, we’ll focus on your physical hardware. I always go through the hardware with the following checklist.
At the physical server level (we’ll call it the VM host), make sure you disable any “green” settings inside the BIOS. You’re already saving quite a bit of power through the consolidation you get through virtualization. You do not want your consolidated server host running artificially slower than top performance when the CPUs are ramped up.
Make sure the CPUs are set to high performance mode. Enable virtualization extensions in your CPUs (i.e. Intel VT-x). You get the full CPU feature set exposed to VMware, which boosts performance.
Disable any form of automatic reboot on hardware failure in the BIOS as well. HP, for example, has a feature called Automatic Server Recovery (ASR) that will physically reboot a server in the event of a hardware problem. All this does for you is re-introduce a possibly faulty server back into your server farm so that it can fail again and cause more disruptions.
Enable Hyper-Threading on your Intel-based CPUs. After Intel’s first attempt at Hyper-Threading, I never thought I would say this: that original implementation was poor, and very few applications benefited from it. Today’s implementation of Hyper-Threading works much better, and I endorse enabling it.
Virtual Machine Construction
Do everything you can to ensure all of your VMs are 64-bit. VMware stopped adding performance enhancements for 32-bit servers years ago, and the latest flavors of Windows Server are 64-bit only. 32-bit servers introduce architectural difficulties with SQL Server and memory management that I want the world to put behind them.
When architecting your virtual machine, you should now have a good idea of the amount of CPU resources that you require to run your workload. Please do not assume that you will require an eight vCPU virtual machine if your workload runs on an eight-core physical server. If your virtual machine’s workload does not require (and utilize) all eight vCPUs to properly run, the overhead of the additional idle resource scheduling will actually slow your virtual machine down.
Sometimes this overhead is enough to notice with application performance. This overhead is measured in a statistic called CPU Ready Time. CPU Ready Time is a measurement of how long a vCPU has to wait in a ready-to-run state before it is scheduled for physical CPU time.
Our requirement for your business critical SQL Servers is that CPU Ready Time never exceeds 500ms, and that the average stays below 300ms. As more large VMs are placed on a single host, you will experience more CPU contention, even if the CPU statistics on the host look good at a glance.
The chart above shows the impact on CPU Ready Time by simply balancing a virtual machine workload. CPU Ready Time went from abysmally terrible to just bad after we balanced the workload without making any other changes. Further work went into getting these numbers to under 250ms across the cluster.
Be conservative when you allocate vCPUs to your workloads. You can always add more if the workload requires it.
Never overcommit your business-critical workload memory on a single host. Period.
Customers rarely run out of CPU resources on a server before they run out of memory resources, and therefore they seemingly always try to overcommit memory to squeeze more VMs onto a host. All this does is force your hosts to try to reclaim memory from running VMs. If you see red at a host level, you are in trouble. It means that VMware is beginning the process to reclaim VM memory.
Guess what? SQL Server usually has the largest memory allocation on a given host, and therefore VMware will try to snag memory from that VM first. SQL Server will consume all the memory you give it for its buffer cache, and generally does not like to give up that memory without a fight.
Set a full memory reservation for your business-critical SQL Servers so that you never have to worry about VMware attempting to snag memory from these VMs. And never disable the balloon driver within VMware Tools; with a full reservation in place, you never have to worry about it anyway!
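On the guest side, a common companion step (not a substitute for the reservation) is to cap SQL Server's buffer pool so the operating system and other processes keep some headroom. A quick sketch, assuming a hypothetical VM with 32 GB of reserved memory; the 28 GB cap is purely illustrative:

```sql
-- Cap SQL Server's memory, leaving headroom for the OS and other processes.
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'max server memory (MB)', 28672;  -- 28 GB on a 32 GB VM
RECONFIGURE;
```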
To start with your virtual machine storage, validate your storage subsystem performance. Performance is the absolute bottom line in any virtualized environment. To be honest, I couldn't care less what RAID type or number of disks is in your arrays. I only care about three metrics:
- Maximum throughput
- Maximum IOPs
- Minimal latency
Performance is the top priority, and some features in arrays can offset the negative attributes of other features. Your performance for any given virtual machine must meet or exceed what you require, be it virtualized or otherwise. I always say you need an average I/O latency under 25ms. I suggest that you stay under 50% spindle busy in your disk pools. I also look for at least 60MB/s for sustained write operations (both sequential and random) with a workload that has pierced your cache. The spindle busy metric comes from your SAN management interface, but the other two metrics can be quickly and easily measured with tools such as SQLIO (see my SQLIO Analyzer tool) or IOMeter.
All non-operating-system virtual disks attached to your SQL Server virtual machines should use the VMware Paravirtual SCSI (PVSCSI) driver. VMware built this driver to reduce the CPU overhead associated with I/O operations, and it will give you a measurable performance improvement as well.
The only other design requirement that I have is that at least two paths be available to any storage LUN and, if possible, that multipathing drivers be installed so that VMware can actively use all available paths. Tools such as EMC PowerPath/VE or the Dell EqualLogic MEM driver can be used to accomplish this task.
For those still running Windows Server 2003, partition disk block alignment is a major concern and is simple to fix. You could be reading two or more blocks from your SAN for one simple read within SQL Server, and that is guaranteed to slow overall performance of your SQL Server. Fixing it gives dramatic performance improvements, as the following customer study demonstrates.
In part four of this series, I discuss a number of the best practices that we follow that you should be aware of when installing and configuring your SQL Server instances inside the virtual machines. These include specific details around the operating system and SQL Server instance configuration tweaks. The final part in this series will discuss the methodology I use for demonstrating that a virtual SQL Server performs as well as the physical counterpart in an apples-to-apples comparison.
Stay tuned and check back in a couple of weeks for the next installment!
By David Klee (@kleegeek)
Welcome to part two in our series of posts regarding the virtualization of your business critical SQL Servers. This installment continues our discussion, detailing how to understand the actual workload of the physical server. To properly virtualize a server, the performance of the physical server must be understood so that you can objectively demonstrate the raw performance of the virtualized equivalent. This means that a proper performance benchmarking methodology should be created and system baselines maintained. These benchmarks are then repeated periodically and turned into baselines. When virtualizing your business critical SQL Servers, the virtualized proof-of-concept (POC) server is benchmarked and compared against the baselines of the physical server to objectively demonstrate equivalent performance.
I have yet to see the ‘virtualization means a lack of performance’ argument hold up when I can produce objective results showing that the servers perform equivalently. Here is how to do it!
First, if I were to ask you for an average run time for nightly SQL Server backups today, and a projection of how long they will take six months from now, could you provide it?
I didn’t think so.
Very few organizations that I encounter can produce a proper system baseline. Regardless of the infrastructure underneath it, every one of your systems should be performance baselined so you have a solid understanding of how it performs today. Baselines can then be maintained to help generate projections for growth of CPU, memory and disk consumption, transactional volume, and anything else measurable in your environment.
The easiest way to start collecting performance data is to use Perfmon, which is built into any modern Windows platform. I set it to collect data every five minutes, store to a day-stamped file, and then rotate the file every night at midnight. This gives me a solid reference for how the system performs every day so that I can start to understand performance characteristics. For example, what sort of disk and CPU hit does my system take every night when system backups or an antivirus scan runs? Is a normal Monday’s workload different from a Friday?
We use specific Perfmon counters to understand these performance characteristics. This list includes the following operating system counters:
- Processor
  - % Processor Time
  - % Privileged Time
- System
  - Processor Queue Length
- Memory
  - Available Mbytes
- Paging File
  - % Usage
- Physical Disk
  - Avg. Disk sec/Read
  - Avg. Disk sec/Write
  - Disk Reads/sec
  - Disk Writes/sec
- Process (sqlservr.exe)
  - % Processor Time
  - % Privileged Time
This list of Perfmon counters also includes the following list of SQL Server counters:
- SQL Server:Access Methods
  - Forwarded Records/sec
  - Full Scans/sec
  - Index Searches/sec
- SQL Server:Buffer Manager
  - Buffer cache hit ratio
  - Free List Stalls/sec
  - Free Pages
  - Lazy Writes/sec
  - Page Life Expectancy
  - Page Reads/sec
  - Page Writes/sec
- SQL Server:General Statistics
  - User Connections
- SQL Server:Locks
  - Lock Waits/sec
  - Number of Deadlocks/sec
- SQL Server:Memory Manager
  - Total Server Memory (KB)
  - Target Server Memory (KB)
- SQL Server:SQL Statistics
  - Batch Requests/sec
  - SQL Compilations/sec
  - SQL Re-Compilations/sec
- SQL Server:Latches
  - Latch Waits/sec
All of the SQL Server counters can be collected from within SQL Server via system DMVs, but the Windows counters cannot.
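For example, the SQL Server counters can be sampled from inside the engine with a quick query against sys.dm_os_performance_counters:

```sql
-- Point-in-time sample of a few SQL Server counters. Cumulative counters
-- still require two samples and a delta to be meaningful.
SELECT [object_name], counter_name, instance_name, cntr_value
FROM sys.dm_os_performance_counters
WHERE counter_name IN (N'Page life expectancy',
                       N'Buffer cache hit ratio',
                       N'Batch Requests/sec',
                       N'User Connections');
```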
Details on how to take the Perfmon output file and convert it to a readable format are located here: http://blogs.technet.com/b/askperf/archive/2008/05/20/two-minute-drill-relog-exe.aspx
Once you get these items into a readable (or importable) format, the sky is the limit on what you can do with this data. You can build a macro that compares and contrasts data in MS Excel. You could import them into a database. Build whatever system works best for you and your environment.
Raw disk performance is paramount to solid SQL Server performance. To better quantify that performance, no better free tool exists than SQLIO. It is available from Microsoft at the following URL: http://www.microsoft.com/en-us/download/details.aspx?id=20163
I use this tool almost daily to benchmark client disk subsystems. It allows an administrator to select a drive and workload size and then drive varying types of I/O tests in varying degrees of intensity. I have created a test script and a workload analyzer that assists with a quick setup, test run, and then a full analysis. It is available at http://tools.davidklee.net.
SQL Server and Windows Metrics
My first resource for investigating a server that I am unfamiliar with is Glenn Berry’s SQL Server Diagnostic Queries at http://sqlserverperformance.wordpress.com. He maintains a version of his queries for each version of SQL Server. Download and execute the query set for the version of SQL Server you are running, and look at the output of each query. If you find a set of results that you are interested in, store the output and include it in some data container for future reference.
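As a flavor of what these diagnostic queries look like, here is a simple, representative wait statistics query (a sketch of my own, not one of Glenn's):

```sql
-- Top waits accumulated since the last service restart, ignoring a few benign waits.
SELECT TOP (10)
       wait_type,
       wait_time_ms,
       signal_wait_time_ms,
       waiting_tasks_count
FROM sys.dm_os_wait_stats
WHERE wait_type NOT IN (N'SLEEP_TASK', N'LAZYWRITER_SLEEP', N'BROKER_TASK_STOP')
ORDER BY wait_time_ms DESC;
```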
In addition to these queries, any number of other SQL Server and Windows metrics should be collected. Some of the items that I routinely collect are:
- Windows disk capacity versus used numbers.
- SQL Server maintenance timings (full/diff/log backups, index and statistic maintenance, dbcc checkdb, any custom nightly jobs, etc.).
Anything measurable that is important to your environment could (and probably should) be repeatedly measured for your benchmarks.
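The backup timings above, for example, can be pulled straight from the msdb backup history. A quick sketch with a placeholder database name:

```sql
-- Recent full backup durations for one database, assuming msdb history is retained.
SELECT TOP (30)
       database_name,
       backup_start_date,
       DATEDIFF(SECOND, backup_start_date, backup_finish_date) AS duration_sec
FROM msdb.dbo.backupset
WHERE database_name = N'SalesDB'
  AND type = 'D'   -- 'D' = full, 'I' = differential, 'L' = log
ORDER BY backup_start_date DESC;
```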
One of the best measures of server performance is to execute timed queries against a sample workload that has been baselined in the production environment. Through workload generation, you can reproduce a production-like workload in your testing environment. You can then execute queries against the test VM to compare runtimes and metrics like CPU and memory impact.
Here’s a little hint – did you know you can add a flag before a query runs to get its logical I/Os and exact runtime? This little trick works great for benchmarks.
Normally, if you run a query such as this (using the DVDStore DS2 database as a sample workload):
You get the following output:
You can see that it runs in 16 seconds.
What about running the same query with the statistics flags enabled?
You get the following statistics:
This is much better. You can now see the logical reads and the exact execution time. You also know that tempdb was used to assist this query, based on the creation of a ‘Worktable’.
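The flags in question are SET STATISTICS IO and SET STATISTICS TIME. A minimal illustration of the pattern follows; the query itself is just an illustrative stand-in against DS2-style tables:

```sql
-- Turn on per-query logical I/O and timing output, run the query, then turn them off.
SET STATISTICS IO ON;
SET STATISTICS TIME ON;

SELECT c.CUSTOMERID, COUNT(*) AS order_count
FROM dbo.ORDERS AS o
JOIN dbo.CUSTOMERS AS c ON c.CUSTOMERID = o.CUSTOMERID
GROUP BY c.CUSTOMERID;

SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;
-- The Messages tab now shows logical reads per table (including any 'Worktable'
-- activity in tempdb) plus parse/compile and execution times in milliseconds.
```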
The bottom line with benchmarks is to create a repeatable, consistent methodology for benchmarking your environment. It must be repeatable in order to quickly reproduce it in the event of an emergency. This allows you to determine if a performance problem exists in your system stack. It should be documented and clear to follow.
What if a system configuration change alters the performance of a system component, and it is reported a month after the change happens? How easy is it to pinpoint that change as the culprit? This is why we recommend benchmarking your system after every major system change. A major system change can include any of the following examples:
- OS / application version upgrades
- Service pack application
- Hardware updates
- BIOS / firmware upgrades
- Networking changes
- Additional workloads placed on shared devices (such as storage, VMware hosts, etc.)
Performing quick benchmarks after these changes can identify performance issues very quickly, and you know exactly what changed at that point. It amazes me how small things can dramatically alter a system’s performance levels, and how hard it usually is to identify after the fact.
Now, what good is a benchmark if you do not have a running average metric to compare it against?
A baseline is a rolling average of your repeatable benchmarks. You should routinely (and not just once a year either) benchmark your systems and compare the results against a rolling baseline to see how things are performing. Once each benchmark is completed, update your baseline with the new data.
You could even develop a system that collects periodic benchmarks and maintains a rolling baseline automatically. Constructing one of these systems is not conceptually difficult, but it does take time. There are also many products on the market that capture some of these metrics for you. Again, the sky is the limit with what you can build here.
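As one minimal sketch of such a system (the table, column names, and the 90-day window are my own assumptions, not a prescribed design), benchmark results could be stored in a table and each new run compared against a rolling baseline like this:

-- Store each benchmark run as one row per metric
CREATE TABLE dbo.BenchmarkResults (
    RunDate     DATETIME2(0)   NOT NULL DEFAULT SYSDATETIME(),
    MetricName  VARCHAR(100)   NOT NULL,   -- e.g. 'DS2 sample query runtime (ms)'
    MetricValue DECIMAL(18,2)  NOT NULL
);
GO

-- Compare the latest run against a rolling 90-day baseline for each metric
;WITH Baseline AS (
    SELECT MetricName, AVG(MetricValue) AS BaselineValue
    FROM   dbo.BenchmarkResults
    WHERE  RunDate >= DATEADD(DAY, -90, SYSDATETIME())
    GROUP BY MetricName
),
Latest AS (
    SELECT MetricName, MetricValue,
           ROW_NUMBER() OVER (PARTITION BY MetricName ORDER BY RunDate DESC) AS rn
    FROM   dbo.BenchmarkResults
)
SELECT l.MetricName,
       l.MetricValue AS LatestValue,
       b.BaselineValue,
       l.MetricValue - b.BaselineValue AS Delta
FROM   Latest   AS l
JOIN   Baseline AS b ON b.MetricName = l.MetricName
WHERE  l.rn = 1;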
The bottom line with benchmarks is that not only do you have an objective measure of the average performance of your systems, but, in the event of a problem, you have an objective means of defending your system. It can even help point out performance problems in other systems. If a performance problem does show up in your system, you have the ability to quickly determine the area that requires more focus, and the means to prove when it is resolved.
In the next installment of this series, I discuss how to proceed with the virtual machine architecture now that you have objective means to demonstrate your physical server’s raw performance. I will present a number of the best practices that we follow that you should be aware of when building your SQL Server virtual machines. These are specific details around the virtual machine, operating system, and SQL Server instance configuration tweaks. I will also discuss how to prove that a virtual SQL Server performs as well as the physical counterpart in an apples-to-apples comparison.
Stay tuned, and check back for the 4th part of this series where I will discuss the perfect build of a SQL Server virtual machine!
By David Klee (@kleegeek)
Welcome to a series of posts regarding the virtualization of your business critical SQL Servers. Throughout this series we will be dispelling the various myths and misconceptions around this topic. We will also present specific details around our best practices for a business critical SQL Server virtual machine, operating system, and instance. We will also talk through the process of how to prove that in an apples-to-apples comparison of a physical and virtual SQL Server, the performance is at least equivalent.
Look for us at a SQL Saturday (www.sqlsaturday.com) near you! This topic is one that is near and dear to my heart, and I present on this topic frequently.
What is a Business Critical SQL Server, and Why Virtualize It?
A business critical SQL Server is just that – it is a SQL Server that your business absolutely depends on. If this server crashes, or data is lost, your business could fail. At a minimum, your employees could be left with nothing to do while it is down. These are the systems on which management places the most resources and the greatest care for high availability and disaster recovery. They absolutely must be running and recoverable.
The benefits of virtualization on these systems are tremendous. Individually, each one of the following benefits should be enough to get an organization excited about virtualization. Together, these benefits revolutionize the way datacenters are architected and managed.
Added Flexibility, Efficiency, and Agility
When virtualized, the application is effectively freed from the underlying infrastructure. VMs can move from resource node to resource node transparently, allowing the resources underneath to automatically handle growth and spikes without a disruption to business. An administrator can provision new servers in minutes instead of days with pre-configured VM templates.
Improved Disaster Recovery
Due to the decoupled nature of virtual machines from the underlying hardware, disaster recovery of virtual machines is much simpler than that of their physical counterparts. Through multiple means, VMs can be continually replicated to a DR location, audited and tested, and failed over and failed back quickly.
Increase your application uptime with built-in features such as VMware High Availability (HA) and Fault Tolerance (FT). VMware HA can minimize application outages in the event of a hardware failure. VMware FT can eliminate the application outages altogether. Avoid the downtime normally associated with hardware maintenance with vMotion and Storage vMotion.
Easier Development / Test / QA Environments
Ordinarily, constructing development, test, and QA environments that match production requires the same sorts of hardware as the production environment. Keeping these environments in technology and configuration sync with production can be cost prohibitive. With VMware, entire production stacks can be cloned and placed in a development, test, or QA role. These systems can be routinely refreshed with just a few clicks. This will accelerate the application development lifecycle because developers receive a development environment that is seemingly identical to production.
One of the obvious benefits of virtualization is server consolidation. The hardware server count is reduced, which lowers server support and warranty costs, reduces the hardware footprint in the datacenter, and lowers power and cooling costs. Licensing can also be optimized to save even more capital. I normally speak less about consolidation when virtualizing business-critical systems than I do with other environments or tiers of servers.
Why Virtualize Business-Critical Systems?
First of all, ask yourself - why not? The technology has evolved to the point where it is functionally transparent to the stack.
As of 2010, there are more virtual machines on this planet than there are physical servers. Even though physical servers are now in the minority, they still host the vast majority of business-critical systems. Take a look at the bell curve below.
The vast majority of lower-tier servers have already been virtualized. Businesses are sitting at the chasm, waiting to cross over to their business-critical systems. This is the area where the vast majority of maintenance capital is spent. This is the area where the lion's share of productivity and revenue lies. This is the area that is most vital to the business. This is the area where businesses are most cautious when addressing virtualization.
This is the area where virtualization can benefit the organization the most.
Myths and Misconceptions
A number of myths and misconceptions exist around virtualizing business-critical systems. With proper education, planning, and understanding, these can be eliminated, and virtualization can do what it does best – help your organization’s bottom line.
People seem to harbor a tremendous number of misconceptions around performance. I am constantly shocked when I talk to people who insist that virtualization still inflicts a serious performance penalty because of hypervisor overhead. Some hypervisors have more overhead than others, and older versions of VMware vSphere did have a noticeable overhead, but VMware vSphere 5 has become transparent. In a benchmark of a pre-release version of vSphere 5, the hypervisor added roughly 100 microseconds of latency per storage I/O, and that latency scaled linearly all the way to one million IOPS. The only reason the benchmark ended there was that the team ran out of storage to attach to the testbed. When it takes a dedicated benchmarking team to measure your system’s minute virtualization overhead, I declare it functionally transparent.
More often than not, these performance concerns come from some sort of virtualization trial (or even worse - a failed production go-live) that was performed in the past. Poor results (rightfully so) put a bad taste in people’s mouths. However, the investigation of their virtualization trial normally demonstrates a bad and unfair test. For example, the following diagram demonstrates a typical virtualization proof-of-concept system stack.
On the left is an average production system stack. On the right is the virtualization POC system stack.
What is wrong with this picture? Five dramatic items are different between the two stacks.
- The workloads are not the same. The POC has a much heavier workload placed on it.
- Only one storage path exists.
- The disk configuration is different – RAID-5 versus RAID-10.
- SATA disks are used instead of Fibre Channel drives.
- The amount of service processor read/write cache in the SAN is much lower.
In my experience, most people butcher a business-critical virtualization POC because the host hardware is dangerously overcommitted and the storage is completely overwhelmed. In this scenario, the CPU utilization is guaranteed to cause CPU Ready times to shoot through the roof, which negatively impacts VM performance. Storage performance is already at a disadvantage due to the disk configuration. RAID-5 suffers a write performance penalty when compared to RAID-10, and the lack of cache only magnifies the difference. SATA disks are also rated for far fewer IOPS than Fibre Channel disks.
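To put rough rule-of-thumb numbers on that write penalty (these figures are illustrative assumptions, not measurements from the diagram above): a random write on RAID-10 costs roughly 2 physical I/Os, while on RAID-5 it costs roughly 4 (read data, read parity, write data, write parity). Ten 15K RPM Fibre Channel disks at roughly 180 IOPS each yield about 1,800 raw IOPS, or roughly 900 effective write IOPS in RAID-10. Ten 7,200 RPM SATA disks at roughly 80 IOPS each yield about 800 raw IOPS, or only around 200 effective write IOPS in RAID-5 – and that is before the difference in controller cache is even considered.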
This poorly constructed virtualization POC of business-critical systems is doomed to fail. As a result, the organization will now declare that virtualization cannot handle their top-tier systems.
It does not have to be this way.
With an apples-to-apples virtualization POC – or with proper architecture, if using equipment that is not at the same performance level as the production stack – the business-critical POC can succeed, paving the way to full production virtualization.
We always seem to field a lot of questions regarding support stances from different vendors. In truth, some vendors are nicer than others when answering this question. Some vendors have support statements in writing that differ from what their salespeople say. Some vendors have not certified their software for use on a virtual platform and refuse to support it (which tells me they are so woefully ignorant of the topic, and/or so lazy that they cannot perform simple engineering validation tests, which are guaranteed to pass).
Microsoft has supported and embraced virtualization for years, and has a published support policy for their applications on VMware. It is officially supported via the Server Virtualization Validation Program. In a nutshell, if the server hardware and hypervisor platform have been validated (and if it is on the VMware HCL, chances are it is), Microsoft supports it.
You can read VMware’s official customer support statement at http://vmware.com/support/policies/ms_support_statement.html.
You can read Microsoft’s support statement, located in KB897615 at http://support.microsoft.com/kb/897615.
Another misconception is that virtualization performance will be progressively negatively affected as the database size increases.
Database size has no impact on performance. Period.
If it works in the physical world, it will work in the virtual world. If you have serious storage performance degradation due to a workload size, you have a misconfiguration somewhere in the stack outside of the virtual infrastructure.
The only factors that matter for database performance are execution counts, concurrent connections, and SQL I/O access paths. Space has nothing to do with it. As the database size grows, concerns such as backup and recovery throughput, disaster recovery operations, and migrations do emerge. However, these concerns have nothing to do with virtualization; they exist no matter the platform. No distinction between the physical and virtual environments exists!
Licensing is one of those fun topics that, when mentioned, everyone cringes. It should not be. When done right, virtualization can potentially lower your licensing costs. A dedicated SQL Server cluster could be constructed where the physical cores are licensed, or a sub-cluster of an existing vSphere cluster could be licensed. It all depends on your environment, your agreements with Microsoft, and your server architecture. But, never fear licensing. Evaluate your environment and determine how much licensing money virtualization could save you.
In the upcoming parts of this series, I discuss a number of the best practices that we follow that you should be aware of when building your SQL Server virtual machines. These are specific details around the virtual machine, operating system, and SQL Server instance configuration tweaks. I will also discuss how to prove that a virtual SQL Server performs as well as the physical counterpart in an apples-to-apples comparison.
Stay tuned, and check back in a couple of weeks for the next part of this series where I discuss the perfect build of a SQL Server virtual machine!
Part 4 – vSphere 5 vMotion
By Jim Hannan
In Part 4 of vSphere 5 Advantages for a VBCA, I turn my focus to vMotion and its enhancements in vSphere 5. VMware vMotion offers the ability to live-migrate virtual machines from one ESX host to another.
It amazes me to think how long vMotion has been around. Most of us remember our first demo of a vMotion migration. vMotion, originally introduced in ESX 2.0 with vCenter 1.0, offered a feature unlike anything in the industry. For many organizations, it was the quintessential reason to virtualize their workloads.
Care to guess what interactive application was used by VMware to demo the first migrations?
How vMotion Works
The majority of the work performed by vMotion is copying the virtual machine memory over to the destination ESX host. This memory migration can be broken down into the following 3 phases:
Phase 1: Guest Trace Phase
Consider this an accounting step. vMotion needs to account for each memory page in the virtual machine and for any changes to those pages during a vMotion operation. Traces are placed on each guest memory page so that vMotion can track any modifications made while the migration is in progress.
Phase 2: Pre-copy Phase
The pre-copy phase runs in iterations. First, a full sweep and copy of all memory blocks is performed. The second iteration copies only the blocks that have changed since the first. From here, additional iterations may be run, depending on how well the previous iterations have kept up with the changing blocks.
Phase 3: Switchover Phase
The switchover phase is the final step in the migration. This is the cutover step, which quiesces the source VM and resumes the destination VM. This step typically occurs in less than one second.
All three of these phases are greatly optimized with the vSphere 5 vMotion enhancements.
- Multiple network adapters: the VMkernel can transparently load balance a single vMotion across multiple network adapters.
- The round-trip latency limit for vMotion networks has increased from 5 milliseconds to 10 milliseconds. This, in House of Brick’s opinion, is a future-looking enhancement that will natively allow for long-distance vMotion.
- Improved memory tracing (Phase 1): According to VMware documentation, the tracing mechanism has been optimized to place traces faster.
- Improvements that allow vMotion to effectively use the full bandwidth of a 10GbE interface. This may come as a surprise to some people, but many applications struggle with this, mostly due to the CPU overhead that TCP/IP introduces into the network stack. From my observations, this is no longer an issue with vMotion.
- Stun During Page Send (SDPS): Ensures the vMotion will be successful during memory block changeover. This new feature should be viewed with caution for VBCA applications that are latency sensitive. Let's take a look at what SDPS does and what it means for your VBCA in the next section.
SDPS for VBCAs
In some rare cases, a VM will have memory changes that occur faster than the vMotion iteration can keep up with. In these cases, SDPS will slow the virtual machine down enough to allow the vMotion to complete the migration. This “slow down” could cause unwanted application latency. In the case of Oracle RAC and its interconnect, this could cause node eviction.
The SDPS feature can be disabled, but this is not recommended by HoB. Instead, we recommend that vSphere administrators leave this feature enabled and build a vMotion network capable of moving memory blocks fast enough to prevent an overrun.
VMware recently benchmarked vSphere 5 vMotion against vSphere 4.1. One of the many tests VMware ran to compare vMotion performance was a database workload test. Using a SQL Server running the open-source DVD Store Version 2 (DS2) workload, VMware generated an RDBMS workload that ran during the vMotion operations. As you can see in the graph below, one test running with 2 NICs for vMotion was approximately 42% faster than vSphere 4.1 with one NIC.