May 17
2012

Alerting Versus Monitoring

Posted by Solutions Architects in VMware, SQL Server, Oracle, Monitoring, Business Critical Applications, Alerting

By David Klee (@kleegeek)

In our world, monitoring an IT infrastructure is defined as having accurate, up-to-date knowledge of the current state and health of critical servers. Sending alerts based on monitoring metrics means that administrators should be notified when something is unhealthy and that admins can (and should) take corrective measures. However, people sometimes blur the lines and alert on items that should not be alerted on. Alternatively, overlooking critical issues and not alerting on them can destroy a business.

When we set up monitoring on business critical systems (specifically VMware, Windows, and SQL Server), many key metrics are monitored. Occasionally we store them for historical purposes. These items include host and guest CPU, memory and disk activity, disk used and capacity, and event and error tracking. Sometimes we will run in-depth SQL Server performance data collection and collect items such as Page Life Expectancy, Buffer Cache Hit Ratio, recompilations, data or log file autogrows, tempdb usage, and other important items to profile and trend.

Most of these items are initially thought to be important enough to warrant an alert. However, is there really anything you should do about your CPU running higher than average for five minutes? What if my Page Life Expectancy falls below some predetermined threshold for ten minutes during a backup? What if I miss an alert that I care about because other alerts were bulk deleted and this alert was in the middle of the list?

At House of Brick, we have two levels of alerting: warning and critical. A warning causes a notification to go out, but sometimes it is more informational. Other times the notification is something to be investigated during the next business day. Either way, these items are not critical and administrators should not be alerted in the middle of the night. Critical alerts, on the other hand, are very important. These alerts are for items that simply cannot wait. Even more important is that the alerts are delivered in a timely manner. If a business critical server crashes, the business demands that it get resolved as soon as possible. The following items jump to mind as examples for critical alerting:

  • SQL Server errors with severity levels 17-25
  • Core services are stopped
  • Database integrity problems
  • Scheduled job failures
  • Disks approaching full in the operating system or datastores approaching full in VMware
  • Host memory ballooning that is not resolving itself
  • Active Directory domain controllers failing to replicate to other DCs
  • And the obvious server or device failure

Warning alerts are less critical. Examples of these conditions are:

  • A sample benchmark query is running a bit slower than average, but still returns a recordset
  • SAN to SAN replication is backed up but still transmitting
  • VMware host memory utilization is over 80% but below 90%
  • A virtual disk is approaching 80% full
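The two-level scheme can be sketched in a few lines of Python. This is only an illustration: the 80% warning line comes from the host memory example above, while treating 90% and above as critical, and the function and class names, are assumptions made for the sketch.

```python
from enum import Enum


class Level(Enum):
    OK = "ok"
    WARNING = "warning"
    CRITICAL = "critical"


def classify_host_memory(utilization_pct, warn_at=80.0, crit_at=90.0):
    """Classify host memory utilization (percent) into an alert level.

    warn_at mirrors the 80% warning example; crit_at=90 is an assumed
    critical threshold and should be tuned for each environment.
    """
    if utilization_pct >= crit_at:
        return Level.CRITICAL
    if utilization_pct > warn_at:
        return Level.WARNING
    return Level.OK


def should_page(level):
    """Only critical alerts should interrupt someone after hours."""
    return level is Level.CRITICAL
```

A warning produced this way would go into the next-business-day queue, while only the critical level would trigger an after-hours notification.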

Other alerting is subtler: What if a scheduled job runs long? What if you have ten times the average number of inbound database connections? Even more subtle: What if my primary data loading scheduled job is taking twice as long today as it did six months ago? What if my database is growing at a rate that will completely run the organization out of available space in four months? How do you detect these sorts of trends? How do you classify these?

Monitoring and alerting should be well thought out before going live. A full list of devices, performance metrics, error states, and scenarios for your specific environment should be developed and then analyzed. Alerts should be configured by resource type so that the critical alerts are appropriate for the underlying cause. Critical alerts should all be items that you would welcome being woken up for in the middle of the night, or interrupted for on weekends, so that they cause the least disruption to the business. They should be something you can actively fix. Do not send out critical alerts on items that the administrators cannot take action on immediately.

Monitoring your environment also means that you proactively review the environment for baseline changes. Are things running slower today than they were six months ago? If so, can you quantify specifically how much of a difference exists? Have you automated the analysis and had the reports automatically delivered so you do not skip the routine checks?
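Two of the trend questions above reduce to simple arithmetic. The following Python sketch is an illustration only; the function names are hypothetical, and it assumes linear growth, which real databases rarely follow exactly.

```python
def slowdown_ratio(baseline_seconds, current_seconds):
    """How many times longer a job runs now than at the baseline."""
    if baseline_seconds <= 0:
        raise ValueError("baseline_seconds must be positive")
    return current_seconds / baseline_seconds


def months_until_full(capacity_gb, used_gb, growth_gb_per_month):
    """Months of headroom left, assuming constant (linear) growth.

    Returns None when the data is not growing.
    """
    if growth_gb_per_month <= 0:
        return None
    free_gb = max(capacity_gb - used_gb, 0)
    return free_gb / growth_gb_per_month
```

For example, a load job that took 30 minutes six months ago and takes 60 minutes today has a ratio of 2.0, and a 1000 GB volume with 800 GB used and 50 GB of monthly growth has 4.0 months left. Numbers like these can feed a scheduled report rather than a middle-of-the-night alert.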

You should be monitoring your environment and capturing performance statistics. You should be alerting on items that are of high importance. You should be proactively protecting your business.

May 10
2012

vSphere 5 Advantages for VBCA -- Part 2

Posted by Solutions Architects in vSphere 5, vSphere, VMFS 5, VBCA, Storage Performance

By Jim Hannan

This blog post focuses on VMFS-5 enhancements in vSphere and how they improve Virtual Business Critical Applications (VBCA) performance and scalability. VMFS-5 offers several improvements over the previous version, VMFS-3.


How VMFS-5 has improved in scalability and performance:

  • Newly created VMFS-5 datastores use a single unified block size of 1MB. There is no more choosing between 2MB – 8MB during VMFS creation. Standard block size of 1MB allows for improved performance, but more on this later.
  • VMFS-5 uses GUID Partition Table (GPT) rather than MBR, which allows for pass-through RDM (RDM-P or RDM Physical) files up to 60TB. RDM-V has a maximum size limit of 2TB.
  • VMFS-5 does not use SCSI-2 reservations, but uses the Atomic Test and Set (ATS) VAAI primitive. SCSI-2 reservations and ATS are used when, among other things, a lock is needed. A lock is required to perform some operations, like creating a VMDK or creating and deleting snapshots. The ATS mechanism allows for more efficient locks and therefore improved performance. ATS requires the VMFS-5 file system and a VAAI-enabled SAN.
  • VMFS-5 uses SCSI_READ16 and SCSI_WRITE16 cmds for I/O (VMFS-3 used SCSI_READ10 and SCSI_WRITE10 cmds for I/O). Of the new features, this is the technology improvement I am least familiar with. My understanding of SCSI_READ10 vs. SCSI_READ16 is that SCSI_READ16 offers larger storage capacity. SCSI_READ16 uses a 64-bit Logical Block Addressing (LBA) field, allowing it to address storage far beyond the roughly 2TB reachable with the 32-bit LBA field of SCSI_READ10 (assuming 512-byte blocks). Obviously, a VMFS datastore cannot grow this large, but you can see the potential.
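The capacity difference between the two command sets follows directly from the width of the LBA field. The Python sketch below works through the arithmetic; it assumes traditional 512-byte logical blocks, and the helper name is made up for illustration.

```python
BLOCK_BYTES = 512  # assumption: 512-byte logical blocks


def max_addressable_bytes(lba_bits, block_bytes=BLOCK_BYTES):
    """Largest capacity reachable with an LBA field of the given width."""
    return (2 ** lba_bits) * block_bytes


TIB = 2 ** 40
ZIB = 2 ** 70

# READ(10)/WRITE(10) carry a 32-bit LBA: 2^32 blocks * 512 B = 2 TiB
read10_limit = max_addressable_bytes(32)

# READ(16)/WRITE(16) carry a 64-bit LBA: 2^64 blocks * 512 B = 8 ZiB
read16_limit = max_addressable_bytes(64)

print(read10_limit // TIB, "TiB vs", read16_limit // ZIB, "ZiB")
```

The 2 TiB figure for the 32-bit field lines up with the 2TB limits seen throughout the VMFS-3 era, which is why the move to 16-byte commands matters for larger volumes.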

Reference: vSphere 5 FAQ: VMFS-5

Upgrading to VMFS-5

HoB recommends creating new VMFS-5 datastores rather than performing in-place upgrades of VMFS-3 to VMFS-5. When you upgrade a VMFS volume, you benefit from only some of the new features. As an example, VMFS-3 volumes upgraded to VMFS-5 will retain the original block size instead of the new 1MB block size. This affects features like Storage vMotion and ATS. In the case of ATS, the ESX hypervisor will revert to the slower SCSI-2 reservation. With Storage vMotion, a slower datamover mechanism called fs3dm will be used instead of the faster fs3dm-hw. In my next blog, Part 3 – Storage vMotion enhancements for VBCA, I will address Storage vMotion and its datamovers (fsdm, fs3dm, and fs3dm-hw).

May 03
2012

How SQL Server 2012 AlwaysOn Blurs the Lines Between High Availability and Disaster Recovery

Posted by Solutions Architects in SQL Server 2012, High Availability, Disaster Recovery, AlwaysOn

By David Klee (@kleegeek)

Over the last month, I have been exploring the new features within SQL Server 2012. I have paid special attention to AlwaysOn, a new feature that effectively blurs the lines between high availability (HA) and disaster recovery (DR).

In previous versions of SQL Server, these two items were mutually exclusive. Both were managed separately and, in some cases, even managed by different people.

Historical Options for High Availability include:

  • Microsoft Failover Clustering - A SAN must be used to present storage to all of the cluster nodes. Dealing with the quorum and the shared drive failover was cumbersome, and even routine maintenance like rolling patches could cause headaches. Most people continue to cringe at the mention of MSFC.
  • Database mirroring - Mirroring in synchronous mode worked well, but since the mirrored pair presents two IP addresses for the application to connect to, the application needed to be mirror-aware. This was easier said than done; mirroring only worked in pairs, but was configured to fail over at a single database level. Consideration was not given to applications that had two or more databases, such as Microsoft SharePoint. Scripting was needed to fail over more than one database.

Historical Options for Disaster Recovery include:

  • Database mirroring - Mirroring in asynchronous mode also worked well but the application still needed to be mirror-aware. Scripting was still needed to fail over more than one database.
  • Transaction log shipping - This option was complex and failover was not automatic. However, you could have multiple nodes in the background for reporting or other purposes.

SQL Server 2012 AlwaysOn takes the best of these technologies, adds features to fill in some gaps, and presents it in an extremely simple to manage package.

Microsoft Failover Clustering, running without shared storage, handles the virtual IP address for all cluster nodes. Availability Groups, or groups of databases common to an application or purpose, can even have their own virtual IP address.

AlwaysOn allows you to set up four nodes in the Availability Group. Each secondary node can be configured for synchronous or asynchronous replication. Each can also be configured to allow read-only access, so database backups and reporting can now be processed from these nodes. Asynchronous nodes can be set up at a DR location for branch-level reporting, and failover and failback are automatic.

Each Availability Group can contain multiple databases, which eliminates the need to manually script and keep things up to date as the environment changes.

In terms of raw manageability, the amount of time saved by using these features is easily demonstrated through reduced complexity and simplification of maintenance procedures.

In the event of an HA or DR event, you want the simplest, most robust solution possible. Your business demands the highest uptime possible. Can your business afford NOT to evaluate SQL Server 2012 AlwaysOn?

Apr 26
2012

vSphere 5 Advantages for VBCA -- Part 1

Posted by Solutions Architects in vSphere, VBCA, Oracle on VMware, Oracle

By Jim Hannan

What is VBCA?

What is Virtualizing Business Critical Applications (VBCA)? A Business Critical Application is exactly what it sounds like, an application that is critical to business success and day-to-day operations. Without it, businesses would struggle to function. Common VBCA applications are: Oracle RDBMS, SQL Server, Exchange, and SAP to name a few.

I believe vSphere 5, as much as anything else, is a VBCA release. What I mean by that is, when you compare it to previous releases the new features are primarily enhancements for providing a platform for VBCAs. VBCA workloads typically have large resource requirements like CPU or memory and, in the case of databases, I/O throughput. Creatively, this type of workload has been dubbed by VMware as a 'Monster VM'.

With vSphere 5, VMware has created a platform more than capable of running Monster VMs. This blog, and the next 5-6 following it, will highlight some of the key features that, in my opinion, give vSphere 5 the accolade of a “VBCA release”.

Part I - Scalable Virtual Machines and Enhanced co-scheduler

At this point it is very well known that the maximum number of vCPUs a single VM can run is 32. For me, this number is staggeringly high. At HoB, we have been virtualizing VBCA workloads as far back as ESX versions 3 and 4, with a maximum of 4 and 8 vCPUs respectively. In our estimation, 90% of the workloads can fit into a configuration of 8 vCPUs (or fewer). At Indiana University we experienced this first hand when assisting them with virtualizing their OnCourse system. The OnCourse Oracle database was previously on an AIX Power5 LPAR with 12 CPUs allocated. During peak workloads the Oracle database was consuming 9.5 processors. After virtualizing the workload and load testing, we determined not only that the workload would fit within the 8 vCPU max, but that it was only using between 35% and 50% of the CPU. This left lots of room for the database to scale. Fast forward to today. With a maximum of 32 vCPUs, the door has opened for virtualizing 95% of the workloads in existence.

How did VMware make the jump from 8 vCPU maximum to 32 vCPU maximum?

This is an intriguing question. The best public information available on this achievement is in the VMware whitepaper The CPU Scheduler in VMware ESX 4. The “relaxed” co-scheduler was first introduced in ESX version 3.0 with a maximum single VM vCPU configuration of 4. In this version, the VMware engineers adopted a cell model. The cells were assigned to pCPUs (physical CPUs). A common processor configuration during the ESX 3 release was a physical processor with 4 cores. As pCPU core counts increased with AMD and Intel chips, VMware determined that the cell model was no longer adequate.



In ESX 4 the VMware engineers moved from the 4 vCPU limit to 8 vCPUs by replacing the cell architecture with finer-grained locks. This allowed a single VM to span multiple pCPUs and allows a single vCPU to be scheduled for a certain task. This gives the guest OS the ability to schedule one vCPU for a single process or thread and greatly reduces the overhead, or cost, of CPU scheduling compared with the previous ESX version.

ESXi 5 further enhanced SMP scheduling, increasing SMT application performance. In some cases this increase can reach 10% - 30% (see What’s New in Performance in VMware vSphere 5.0). And, of course, the maximum for a single VM increased from the previous 8 vCPUs to 32 vCPUs.

Here’s the evolution timeline of the co-scheduler:

Apr 12
2012

Oracle-on-vSphere Licensing

Posted by Dave Welch in vSphere, Oracle support, Oracle on VMware, Oracle

There is some discord and significant misinformation in the Oracle community as to what contractual obligations are associated with licensing only a subset of a vSphere cluster’s hosts for Oracle. Furthermore, there is all manner of opinion out there regarding the risks of asserting one’s contractual rights. This definitive opinion piece boils down the relevant Oracle License and Services Agreement (OLSA) language to a summary that you can get your hands around. It includes observations on what we are seeing and, more importantly, what we are not seeing in terms of legal actions with regard to the OLSA.

I just offered a fairly comprehensive Oracle-on-vSphere licensing opinion piece to my colleague Jeff Browning, who is EMC’s chief spokesman for all things Oracle storage-related. My post appears in Jeff’s Oracle sub-cluster licensing thread, dated April 2nd.

Jeff then took the initiative to replicate my post to his own blog and to the EMC Community Network.

Apr 11
2012

SQL Server 2012 AlwaysOn On VMware

Posted by Solutions Architects in vSphere, SQL Server 2012, MS SQL Server, AlwaysOn

By David Klee (@kleegeek)

A couple of weeks ago Microsoft released SQL Server 2012 to the wild. Even though this release is in its infancy, organizations and their database administrators should begin to explore it and plan for the adoption of this release into their corporate technology roadmap. Multiple features in this release are individually tremendous and, as a whole, can change the way a business thinks about database management. Database recovery and high availability have just been revolutionized.

Prior to SQL Server 2012, SQL Server had multiple solutions for high availability and disaster recovery built in. However, each solution had at least one major limitation. For example, failover clustering is nontrivial to configure and manage, and its shared disk requirement is a single point of failure. The shared disk configuration on VMware forced administrators to use Raw Device Mappings (RDMs) or in-guest disk presentation. Database mirroring provides fantastic failover and high availability features, but lacks a single IP address that legacy applications can connect to in order to utilize the failover features.

Additional features are included in SQL Server 2012 that should make database administrators even happier. These new features include: offloading backups to different servers to reduce their performance impact, index improvements that dramatically simplify their management, self-contained databases that make migrations and transportability simpler, and tools and features to allow a much more granular insight into processes and events inside the engine itself.

Our best practices for installing and configuring SQL Server on a VMware vSphere environment have not changed much with this latest release. As always, we are actively evaluating the new features and working to refine our recommendations to make this version perform at top speed.

New features that Database Administrators Should Care About

AlwaysOn

In our opinion, SQL Server 2012 AlwaysOn is the single best feature to come out of the SQL Server team in this release, revolutionizing database recovery and availability.

AlwaysOn takes the automatic failover process and virtual IP features of Microsoft Failover Clustering (MFC) but removes the dependency on shared disks. On VMware, this removes the complexity of RDM or in-guest disk management. It also adds functionality, originally added for mirroring of databases between two servers, that allows up to four independent systems to be part of this mirroring replication group. Up to four instances can be members of the AlwaysOn cluster, and the replication can be split between synchronous and asynchronous to different members at the same time. Secondary replicas can even be opened up for read-only activity without having to deal with painful database snapshots. These same replicas can also be used to move the impact of a database backup off of the primary system and onto the replicas.

Failover can be configured to be automatic and seamless to all applications. This now includes legacy applications that could not be configured for a failover node in previous releases. SQL Server’s disaster recovery options have been blended with its high availability tooling to provide a seamless, simple to manage package.

Database availability groups can be configured, which lets multiple databases fail from one system to another. This eliminates the scripted failover processes if a single application requires more than one database on an instance and all of the databases are mirrored. You can even fail over across subnets now.

This feature alone is enough to make us re-tool existing environments and go back to former clients and advocate for the immediate investigation of SQL Server 2012.

Note: Database mirroring is still enabled in SQL Server 2012. However, it is now deprecated in favor of AlwaysOn. AlwaysOn currently requires Enterprise Edition. Our collective prediction is that when database mirroring is finally removed from future versions, AlwaysOn will be available in Standard Edition.

Extended Events GUI

Extended Events allows a DBA to extract event-driven information about the engine. DBAs must learn this tool, as the legacy Profiler tool is deprecated and will soon be gone. Extended Events were first introduced in SQL Server 2008. They were very difficult to manage and maintain because no GUI existed, the T-SQL required to create them was nontrivial, and the output was XML and took time to parse. Industry adoption was negligible.

SQL Server 2012 adds a GUI to manage the Extended Event sessions. This new version also adds a Data Viewer to let you view the real-time output of your Sessions. These tasks are much lighter on the system than a Profiler trace and can be stored on the server for re-use at any time.

Contained Databases

One of the trickiest tasks that constantly plagues database administrators is migrating one or more databases to a different instance. These headaches are due to the structural nature of the engine itself: namely, how it maps system logins to database users, how scheduled jobs are managed, how servers are linked together, or even more obscure items such as default and database collations (SAP anyone?). Contained databases now bundle these entries into more manageable items to save DBAs hours during planning and deployments.

ColumnStore Indexes

ColumnStore indexes are the engine’s first column-based indexing feature. They allow for simple creation of a covering index on one or more columns, but split each column into its own group of pages. Compression ratios benefit from having similar values on the same group of pages, and aggregate queries not only become sequential, but also must read fewer pages. The performance boosts are reported to be substantial.

The best use for this feature is for reporting database tables, as these indexes are (currently) read only. This might sound bad, but an existing architecture could be tweaked with partitioning so that enabling this feature does not break an application.

Microsoft released a great whitepaper on this subject entitled: “Columnstore Indexes for Fast Data Warehouse Query Processing in SQL Server 11.0”.

Licensing

SQL Server 2012 has been simplified to three main editions – Standard, Business Intelligence (BI), and Enterprise. Datacenter Edition is no longer present. Licensing has also shifted from socket-based to core-based. Server and Client Access Licenses (CALs) are still available, but only for Standard and BI editions. For virtualized SQL Server 2012, it is now also recommended that the organization purchase Software Assurance (SA). If SA is not purchased, vMotion activities are greatly restricted by the licensing terms.

What do all of these changes mean for you?

You must now purchase at least four licenses, as the model requires a minimum of four core licenses per socket. They are sold in two-core packs. If you virtualize, SA should be purchased if you wish to allow the hypervisor to freely move your virtual machines to different cluster nodes. Otherwise, you are subject to a rule requiring a 90-day gap between vMotion operations. Purchase licenses for all of your cores in the cluster (or sub-cluster if using means to confine your VMs to certain machines) and you can run “unlimited number of instances [up to] the number of core licenses assigned to the server” (SQL Server 2012 Licensing Quick Reference Guide, page 10).
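As a rough worked example of the per-core rules just described, the Python sketch below counts core licenses and two-core packs for a physical host. It models only the four-core-per-socket minimum and the two-core pack size mentioned above; the function names are hypothetical, and the Licensing Quick Reference Guide remains the authority on what you actually owe.

```python
import math

MIN_CORES_PER_SOCKET = 4  # minimum core licenses per socket, per the text
CORES_PER_PACK = 2        # licenses are sold in two-core packs


def core_licenses_for_host(sockets, cores_per_socket):
    """Core licenses needed to cover every core on a physical host."""
    return sockets * max(cores_per_socket, MIN_CORES_PER_SOCKET)


def packs_needed(core_licenses):
    """Two-core packs required to cover the given number of core licenses."""
    return math.ceil(core_licenses / CORES_PER_PACK)
```

A host with two six-core sockets needs 12 core licenses, or 6 packs. A host with two dual-core sockets still needs 8 core licenses (4 packs) because of the per-socket minimum.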

The total price for your organization to run this technology has probably just gone up.

If you are interested in more details around the new licensing model, review the SQL Server 2012 Licensing Quick Reference Guide.

SQL Server 2012 on VMware vSphere Best Practices

  • First and foremost, go through the exercise of verifying and validating the amount of licensing you need to successfully maintain compliance with Microsoft licensing regulations.
  • Tweak your BIOS settings for optimal performance: set CPUs to high performance mode, enable turbo boost.
  • Never overcommit resources! Use VM resource reservations. Never disable the balloon driver.
  • Use eagerzeroedthick disks for all of your SQL Server data, log, and tempdb object drives.
  • Use multiple virtual SCSI adapters to optimize I/O distribution.
  • Evaluate the Paravirtual SCSI driver (PVSCSI) to check for CPU reduction and/or I/O performance improvements.
  • Evaluate the VMXNET3 virtual network adapter driver for CPU reduction and network performance improvements.
  • Schedule processor affinity for specific NUMA nodes (VM settings – Resources – Advanced CPU – Scheduling Affinity and NUMA Memory Affinity).
  • Ensure that your virtual machine CPU NUMA configuration best matches your physical machine’s configuration.
  • Disable physical network adapter interrupt moderation, virtual interrupt coalescing, and Large Receive Offload (LRO).
  • Configure and use Windows Large Pages and ‘Lock Pages in Memory’ privileges for service account.
  • Configure and use Instant File Initialization.
  • Set Min and Max Server Memory in SQL Server.
  • Ensure absolute redundancy and eliminate single points of failure in your AlwaysOn design. Use anti-affinity rules to ensure that AlwaysOn members are never placed on the same ESXi hosts.
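The anti-affinity recommendation in the last item can be checked with a small script against an inventory export. The Python sketch below is an illustration only: the VM and host names are invented, and it assumes you already have a mapping of each VM to its current ESXi host from whatever inventory tool you use.

```python
from collections import defaultdict


def anti_affinity_violations(placement, group_members):
    """Return hosts that run more than one member of the same group.

    placement: dict mapping VM name -> ESXi host name (hypothetical data)
    group_members: list of VM names in one AlwaysOn Availability Group
    """
    by_host = defaultdict(list)
    for vm in group_members:
        by_host[placement[vm]].append(vm)
    return {host: vms for host, vms in by_host.items() if len(vms) > 1}


# Example with made-up names: sql1 and sql3 share a host, which is a violation.
placement = {"sql1": "esx1", "sql2": "esx2", "sql3": "esx1"}
violations = anti_affinity_violations(placement, ["sql1", "sql2", "sql3"])
```

An empty result means every member sits on its own host. A check like this complements, but does not replace, the anti-affinity rules configured in the cluster itself.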

As customer adoption of this new version increases, these best practices will be explored, baselined and benchmarked, reviewed, and any new best practices that we develop will be shared on this blog.

Stay tuned – coming soon we plan on developing a "SQL Server 2012 AlwaysOn on VMware" series of posts to help you get the most out of your SQL Server 2012 installations.

Jan 19
2012

Large Pages Performance on SQL Server 2008R2 on VMware vSphere 5.0 – Part One

Posted by Solutions Architects in VMware, Virtualization, SQL Server, Large Pages

This week while going through some old VMware documentation, I discovered a reference to performance gains while using Windows Large-Page allocations. It is located on page 11 of the Performance and Scalability of Microsoft SQL Server on VMware document (http://www.vmware.com/files/pdf/perf_vsphere_sql_scalability.pdf). However, while it claims performance of SQL Server improves when Large Pages are enabled, the document is from the ESX 3.5 era. What happens to the performance on a modern system like vSphere 5.0 and SQL Server 2008R2? If there are benefits, is it worth the drawbacks? Talk to any seasoned SQL Server DBA and most will tell you to avoid Large Pages unless Microsoft explicitly tells you to enable it, but most do not have solid reasons or experience backing their opinions. Why the FUD?

What are Large Pages?

First, let’s talk about Windows memory. Windows uses the Virtual Address Space to manage its memory map. These Virtual Addresses are stored and managed by Windows in a structure called Page Tables. Each Virtual Address has a corresponding Page Table Entry in the Page Table. This table has to be searched when the OS references memory pages. To increase performance of this lookup process, the processor maintains a cache called the Translation Look-Aside Buffer (TLB). Memory is stored in the Virtual Address Space in two block sizes – Large and Small. For this conversation, I’m ignoring the Itanium platform, which has different block sizes. Small Pages are 4KB in size. Large Pages are 2MB in size. When Large Pages are enabled, the number of entries in the TLB is dramatically reduced. More translation is required to reference Small Pages than Large Pages, and a penalty in performance is felt.
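To put the two page sizes in perspective, the short Python sketch below counts how many pages are needed to map the same amount of memory at each size. The 8GB figure is just an example, and the helper name is made up for illustration.

```python
SMALL_PAGE = 4 * 1024         # 4KB small page
LARGE_PAGE = 2 * 1024 * 1024  # 2MB large page


def pages_needed(memory_bytes, page_bytes):
    """Number of pages required to map memory_bytes (rounded up)."""
    return -(-memory_bytes // page_bytes)  # ceiling division


eight_gb = 8 * 1024 ** 3
small_pages = pages_needed(eight_gb, SMALL_PAGE)  # 2,097,152 pages
large_pages = pages_needed(eight_gb, LARGE_PAGE)  # 4,096 pages

print(small_pages, large_pages, small_pages // large_pages)
```

Each large page covers the same range as 512 small pages, so far fewer translations have to compete for the limited number of TLB entries.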

Large Pages and VMware

VMware ESX and ESXi hosts can accommodate large page requests of a guest, and this feature has been supported and enabled by default since ESX 3.5. The host CPU must support hardware-assisted memory virtualization (i.e. Intel EPT Hardware Assist and AMD RVI Hardware Assist). VMware has published studies describing Large Page performance improvements of guest workloads at the following URLs.

  • www.vmware.com/files/pdf/large_pg_performance.pdf
  • http://www.vmware.com/files/pdf/techpaper/vsp_41_perf_memory_mgmt.pdf
  • http://communities.vmware.com/docs/DOC-6912

VMware claims up to a 19% performance gain in SQL Server 2008 by enabling Large Pages in SQL Server and enabling hardware-assisted memory virtualization in the CPUs. Reference http://www.vmware.com/files/pdf/perf_vsphere_sql_scalability.pdf for more details.

Large Pages and SQL Server

Trace flag 834 allows SQL Server to use Large Pages for the Buffer Pool. A few restrictions apply.

  • Windows Server and SQL Server must both be 64-bit.
  • The service account that SQL Server is running under must have the “Lock Pages in Memory” privilege set.

Previously, Enterprise edition was required to support Large Pages natively. However, trace flag 845 on appropriately patched servers (reference MS KB 970070 for more details) allows Standard Edition SQL Servers to join the party.

To add the trace flag into the SQL Server startup parameters list, start the SQL Server Configuration Manager and add the following entry into the SQL Server instance Advanced Startup Parameters property: ;-T834.

When the trace flag is set and the SQL Server service is restarted, you will notice that the ERRORLOG now contains entries for ‘Using large pages for buffer pool’ and the amount of memory used.

Make sure to set the maximum buffer pool memory by configuring the ‘Maximum server memory’ setting so that SQL Server does not claim more of the guest’s memory than intended at startup.

Drawbacks

Enabling Large Pages might give a boost to performance, but it does not come without downsides. First, VMware’s Transparent Page Sharing (TPS) is rendered ineffectual for this virtual machine. Unlike a 4KB block, a 2MB memory block is almost guaranteed to be unique, so do not count on TPS to save any memory on the host. If a host is overcommitted, this could reduce performance of other guest VMs because the balloon driver could be more aggressive in ballooning for memory reclamation. However, best practices dictate that a host not be overcommitted for business critical systems, so this should be a nonissue.

Second, verify the boot time of your SQL Server when Large Pages are enabled. SQL Server must claim and zero out the memory it will be using before the service can completely start up. This process can take quite a while: it can stretch a SQL Server startup from a minute or two to over 30 minutes, depending on your environment. Be very careful that by enabling Large Pages you do not inadvertently violate your SLA during an unplanned outage. Remember, VMware HA requires a restart of a server on the new host if a host fails. This reboot time could be substantial if large amounts of memory are allocated.

Another problem can arise if the service needs to restart and the memory it requires is no longer available in contiguous blocks. If so, SQL Server will fail to start and will log an error in the ERRORLOG file.

Performance Benefits

Check back soon for our performance summary of SQL Server 2008 R2 on vSphere 5.0 Large Pages performance!

Summary

So, in summary:

Requirements

  • Minimum guest memory allocation of 8GB
  • ‘Lock Pages in Memory’ permissions granted to service account
  • 64-bit Windows and SQL Servers only
  • Max SQL Server buffer pool memory setting must be configured

Pros

  • Potential measurable performance improvement (YMMV)

Cons

  • Potential significant delay in service startup time
  • Eliminates transparent page sharing for this VM on VMware host
  • SQL Server service restarts success rate could be an issue

Keep in mind that Large Pages is not a VMware feature – it’s a Windows feature. VMware simply extends it. Large Pages can be enabled whether the workload is physical or virtual, but all workloads should be virtualized, right?

Jan 17
2012

Solution Architects Introduction

Posted by Solutions Architects in Untagged 

Welcome to the House of Brick Technologies Solutions Architect blog.

Who are we?

We are Jim Hannan and David Klee (@kleegeek), Solutions Architects from House of Brick. We form the foundation of the technical pre-sales and advocacy group. Jim focuses on Oracle on VMware. David focuses on SQL Server on VMware. Both focus on advanced VMware and environmental performance tuning topics.

Jim Hannan began working with Oracle technologies in 2003 while helping to maintain an Oracle E-Business application. Jim joined the team at House of Brick in 2006, where he got his first taste of virtualization on VMware ESX 3.01. Since that time he has helped many of House of Brick's customers virtualize their Oracle Databases and Oracle Application Servers. Jim specializes in virtualization technologies, Oracle, storage and Linux/Unix administration. Jim is a VMware Certified Professional. When Jim is not working with technologies he enjoys bicycling and Creighton basketball and baseball games.

David Klee is a technologist who loves making things go faster. He has been with House of Brick since January of 2010. He uses bleeding edge technology and environment tuning to help infrastructures and databases perform at their maximum potential. He specializes in SQL Server, VMware and other virtualization technologies, programming, system administration, storage, networking, and project management. He speaks periodically at SQL Server and VMware events. He is certified as a Microsoft Certified IT Professional in SQL Server 2005 and 2008 Database Administration as well as VMware Certified Professional and VMware Certified Advanced Professional in Datacenter Design. Outside of work David enjoys just about anything with at least one wheel and a motor, and occasionally you can find him around town playing billiards.

What topics will we be writing about?

We will be focusing on Oracle and SQL Server news, architecture, performance, configuration, and troubleshooting while in physical and virtual environments. We will post our experiences while in the field, technologies we are experimenting with, items of interest in the technology news circles, and other items that we feel strongly about.

Why are we starting this blog?

We are committed to sharing our knowledge with the technical community in the hopes that our knowledge can spark interest in virtualization and performance tuning and perhaps help you in your professional endeavors.

We hope you enjoy reading! If you have any questions or comments, we’ll be very interested to hear what you think.

Jim and David

Oct 06
2011

Reflections On Steve Jobs - His Impact on My Career

Posted by Nathan Biggs in Untagged 

I hope that you will indulge a personal retrospective in this post. Just moments ago, my daughter looked on her iPhone and gasped as she read the news that Steve Jobs had passed away. For a family as committed to the Apple ecosystem as we are, the announcement took us all by surprise. It was almost like losing a family friend. Of course, I never met Steve Jobs, but his impact on my career has been profound. His passing has caused me to reflect on just how much my career has been intertwined with Apple's iconic visionary.

It was in the spring of 1984 that I first had a glimpse of an Apple Macintosh computer. I was in my senior year of high school, and in the school's inaugural computer science class. My friend Bryan and I had the opportunity to go to the office where it was being demonstrated. I was not sure what to think. The person doing the demo pushed it as a tool for making banners and other WYSIWYG functions. I fancied myself a programmer, not a banner publisher, and so that function did not initially appeal to me. What was fascinating though was the idea of a mouse and movable pointer. I immediately went to work on programming my Commodore 64 with the joystick pointer to see if I could mimic the functionality.

Bryan and I spent the next two years in missionary service, Bryan in Japan and I in Korea. We both came back to Arizona to pursue a college education, but the entrepreneurial bug bit both of us. We decided to launch a Japanese/Korean translation service called Trans-Pacific Mediation. This was my first business venture since selling eggs from our chickens door-to-door as a kid. I read about these two college students that had started Apple Computer in California, and just knew that if they could launch something then so could we. Bryan and I decided to focus on hotels in the Phoenix area for selling our translation services. We offered to translate and produce business cards, menus, signs, etc. One of our challenges was finding a computer that could produce print-quality materials in English, Japanese, and Korean. Remembering our first experience with a Mac, we knew that was what we needed. I took $3,000 out of savings, and with a student discount, purchased my first Macintosh computer: a Mac SE with a 20MByte hard drive.

Trans-Pacific Mediation did not last long (my first, but not last hard lesson in the requirement that a business be adequately funded to even have a chance). The business reason for buying the Mac went away, but it was too late...I was hooked. Since that time, over 20 years ago, I have purchased 13 more Macs, ranging from the original iMac, to Mac Minis to PowerBooks, and MacBook Pros. None of the subsequent Macs cost as much as my very first Mac SE. I have never purchased a Windows computer (unless you count the iMac that came with Windows for dual-boot).

I went to Arizona State University and entered the Computer Systems Engineering program. I had two options for my course of study: an Intel path where we would learn the 8086 family of processors, or the Motorola path where we studied the 68xx processors. Of course, as a newly converted Mac addict, I was just sure that the 68000 processor in my Mac SE had to be better than what Intel was doing at the time, so I took the Motorola path. That was not the most popular choice. MS-DOS, and the soon-to-follow versions of Windows, were staking out a dominant position in the industry. My trusty Mac was right there with me though. While other students were spending countless hours in the lab, I was able to use my modem to dial in to the school's VAX and do my programming from home. While other students printed their programming code on green and white folded dot matrix paper, I was publishing mine on a laser printer with headers and footers. I don't know if it got me any better scores (probably not), but I liked the professionalism that the Mac allowed me to convey.

While I was in school at ASU, I was able to pick up an internship at Microchip Technology in Chandler, AZ. They were looking for an engineering student that knew how to use a Macintosh to produce technical documentation. Once I graduated, I got my first job as an engineer at Microchip. My devotion to the Macintosh followed me throughout my tenure at Microchip, even though they were dark times for Apple.

During Steve Jobs' absence from Apple Computer in the 90's, I became an Evangelista (you would have to have been there to know). I followed Guy Kawasaki and the faithful few who knew that Apple had a better platform, and that Bill Gates must have done something underhanded in getting Windows out there so prevalently. Guy Kawasaki was Apple's Chief Evangelist, a title that I insisted on for House of Brick's Dave Welch (which he has subsequently adopted emotionally, even if only on paper before).

Steve Jobs, and now Guy Kawasaki, were my emotional mentors when it came to technology and entrepreneurship. I moved to Nebraska by connecting to the (new) Internet and finding a job on Careerlink.org. I became an owner of a software development company called Priority Technologies. It was there that I got my first iPod, a click wheel unit with 10GBytes of hard drive space for $500. I have bought countless iPods since then, but none as small in capacity, large in size, or as expensive as the first.

Now as CEO at House of Brick, I have started to make headway in the proliferation of Apple products. We have become avid iPad users, and are now getting more people requesting iPhones. My family has ditched cable in favor of Mac Minis on all our HD TVs, streaming iTunes, Netflix, and Hulu.

Through all these years, at every major decision point that I have encountered in my career, an Apple product has been there seemingly enabling or influencing my decision. I have taken courage as I saw Steve Jobs overcome and face his own career and life challenges. I have fancied myself as a Jobs-like visionary in the limited scope of my own pursuits. Even if I do not have the same visionary outlet as Steve has nurtured in creating the Apple ecosystem, I patterned myself technically, creatively, and entrepreneurially after his determination and drive. I saw how the iCEO revolutionized computing with the iMac, music with iTunes and the iPod, mobile phones with the iPhone, and last but not least, a whole new compute paradigm with the iPad.

The industry will go on in the coming days, weeks, and even years about how Steve Jobs has made such a huge impact on society. Everything they say will be right on about his life-changing vision. For me though, Steve Jobs has had a real and lasting impact on me personally and professionally. I feel a real sense of loss at Steve's passing. Not because I knew him, but because he has been there with me at every step of my career.

 

Nathan

Oct 03
2011

Storage Performance Metrics

Posted by Dave Welch in vSphere , VMware , Storage Performance , Oracle

This blog entry is a continuation of the blog entry dated August 5, 2011.

Average Read Time

  • 20ms or lower
  • Sustained peaks of >20ms for no more than about five minutes

Attributes:

  • Un-cached random reads
  • Assuming 90% read, 10% write, with writes colliding

The Average Read Time performance attribute is important because there are schema/application designs that can pierce the cache, regardless of how big that cache is.

Average read time can be pulled from Oracle Statspack and AWR reports at the tablespace and datafile level.

Read time can also be pulled from the vSphere Client:

Select a host or a VM in the navigation pane > Performance Tab > [Advanced] > select Data Store > select Real Time > click the Read Latency box.

Spindle Busy Average

< 50%

To display device, kernel, guest, and queue latency, start an r/esxtop session:

$ esxtop > v > [Enter]

Look at the %USD column.

LUN percent busy

 

$ nmon > d


SCSI Queue Depth

<= teens (after having reconfigured all storage path devices to 128, up from default of 32)

The default SCSI queue depth for ESX is 32.

$ esxtop > d > f > [Enter].  Verify that QSTATS is selected with an asterisk to the left as pictured below.

Storage Latency

For day-to-day monitoring of disk throughput, latency is the ideal metric. It is more accurate in determining whether your database is suffering from I/O throughput issues. The performance redline for disk latency is 20ms: latency should stay at or below it. Burst periods of higher than 20ms are acceptable, but consistently high latency is not.
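The redline rule can be sketched in a few lines of Python. This is only an illustration, not part of any VMware or Oracle tool: the function names are made up for this example, the samples are assumed to be evenly spaced latency readings exported from whatever you monitor with, and the five-minute burst allowance is borrowed from the Average Read Time guideline earlier in this entry.

```python
from typing import Sequence

REDLINE_MS = 20.0        # latency redline from the text
MAX_BURST_SECONDS = 300  # "about five minutes" of sustained peaks (assumed limit)


def longest_burst_seconds(samples_ms: Sequence[float], interval_s: int,
                          redline_ms: float = REDLINE_MS) -> int:
    """Return the longest continuous stretch, in seconds, spent above the redline."""
    longest = 0
    current = 0
    for latency in samples_ms:
        if latency > redline_ms:
            current += interval_s
            longest = max(longest, current)
        else:
            current = 0  # the burst ended, start counting again
    return longest


def latency_is_healthy(samples_ms: Sequence[float], interval_s: int) -> bool:
    """True when no burst above the redline lasts longer than the allowance."""
    return longest_burst_seconds(samples_ms, interval_s) <= MAX_BURST_SECONDS
```

With one-minute samples, two short spikes pass the check, while six consecutive minutes above 20ms do not.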

R/esxtop monitors disk latency at three distinct layers: the device or HBA, the kernel or ESX hypervisor, and the guest or virtual machine. Use the free IOmeter storage benchmarking tool in a Windows guest (only the Windows version can do asynchronous writes, not the Linux version). Configure the tool for the Oracle database default block size and a 32GB streaming write (large enough to pierce the write cache of many SANs).

To display device, kernel, guest, and queue latency, start an r/esxtop session:

$ esxtop > v > f > h > i > j > [Enter]

  • The GAVG or guest latency is actually a combination of the device latency + hypervisor latency + any additional guest OS overhead.
  • The KAVG is the kernel latency or hypervisor latency. High latency reported at the kernel can be due to high SCSI queues or device drivers.
  • DAVG is the device latency or HBA latency. We use the term HBA generically, including an actual Fibre Channel HBA or iSCSI NIC. Device latency usually indicates a bottleneck at the storage or SAN layer.
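The relationship between the three counters can be sketched as a small Python helper. It is a rough illustration only: the function name and the layer labels are invented for this example, and the inputs are assumed to be the GAVG, KAVG and DAVG values (in milliseconds) read off an r/esxtop session.

```python
def dominant_latency_layer(davg_ms: float, kavg_ms: float, gavg_ms: float) -> str:
    """Name the layer contributing most to guest latency.

    Relies on the relationship described in the text:
    GAVG is roughly DAVG + KAVG + additional guest OS overhead.
    """
    # Whatever is left after device and kernel latency is attributed to the guest.
    guest_overhead_ms = max(gavg_ms - davg_ms - kavg_ms, 0.0)
    contributions = {
        "device": davg_ms,          # storage or SAN layer
        "kernel": kavg_ms,          # SCSI queues or device drivers
        "guest": guest_overhead_ms  # guest OS overhead
    }
    return max(contributions, key=contributions.get)
```

For example, DAVG of 45ms with KAVG of 1ms and GAVG of 47ms points at the device layer, which is where you would start talking to the storage team.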

By monitoring these distinct layers, we could clearly see that the latency at the HBA pointed to an issue with the SAN configuration. If just GAVG had been high, then we would have looked at the Linux guest. This customer had very severe device latency (DAVG). This related back to a SAN issue that was later fixed by the storage vendor with a firmware upgrade.

Storage Throughput

Given a 4Gb Fibre Channel fabric, you should be looking at >100MB/sec/storage path sustained. If you are getting less than that, consider splitting out storage paths.

CPU Ready Time

The VMware definition of CPU Ready is, "the time a guest waits for CPU from the host while in a ready-to-run state" (VMware ESX Server 3: Ready Time Observations - http://www.vmware.com/pdf/esx3_ready_time.pdf). We also refer to CPU Ready Time as the "guest heartbeat."

We generally monitor CPU Ready Time through the VI client. ESXTOP can also be used. However, we prefer the VI client because it measures CPU Ready in milliseconds.

Select the virtual machine > Performance tab > click Advanced > click Chart Options... > CPU > Real Time > and select Ready


It is normal for a guest to average between 0 and 50ms of CPU Ready Time. Anything over 300ms and you will experience performance problems. We’re comfortable with up to 300ms CPU Ready Time on average, with a high water mark of 500ms.
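These thresholds can be sketched as a small Python check. Treat it as an illustration rather than an official formula: the function names and the three labels are invented here, and the percentage conversion assumes the real-time chart's 20-second sample interval, so adjust interval_ms if your chart uses a different rollup.

```python
REALTIME_INTERVAL_MS = 20_000  # assumed: real-time charts sample every 20 seconds


def ready_percent(ready_ms: float, interval_ms: int = REALTIME_INTERVAL_MS) -> float:
    """Convert a CPU Ready value in milliseconds to a percentage of the interval."""
    return ready_ms * 100 / interval_ms


def classify_ready(avg_ms: float, peak_ms: float) -> str:
    """Apply the guidelines from the text to average and peak CPU Ready Time."""
    if avg_ms > 300 or peak_ms > 500:
        return "investigate"  # above the 300ms average or 500ms high water mark
    if avg_ms > 50:
        return "acceptable"   # above the normal range but within our comfort zone
    return "normal"           # 0 to 50ms on average
```

Under that assumption, 300ms of ready time in one sample works out to 1.5 percent of the interval.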

ESX Memory Oversubscription

Do not oversubscribe memory shared by Tier-1 Oracle workloads.  VCS should report no ballooning.

Oracle is very aggressive at using all of the resources available to it. When an Oracle instance starts up, it allocates memory for the Oracle SGA. This allows Oracle to use a contiguous space of memory for caching of data and SQL statements. As long as the instance is running, it will not give the memory back even if blocks are unused. Oracle memory is less dynamic than CPU. It tends to level out after the Oracle SGA has been allocated. You may see brief periods of spikes when the PGA is being used for an RMAN backup, but other than slight anomalies it tends to stay static. As with CPU, measure memory during peak workloads.

We recommend memory sizing by using the following 3 tools:

  • Oracle SGA Advisor
  • Oracle PGA Advisor
  • OS metric collection (for example, nmon)

Start with the Oracle advisors to determine if the SGA and PGA are accurately sized. We consider the SGA critical to virtual machine sizing. If, for example, the Oracle SGA is undersized, this will affect your buffer cache hit ratio and shared pool, negatively impacting performance. It will also skew your OS metric collection. After the SGA and PGA are sized accurately, use the OS collected metrics to determine the appropriate memory size of the virtual machine.

Allocate enough ESX physical memory to at least cover the Oracle SGA, and preferably as large as the sum of the SGA, the PGA high water mark, and memory used by the shadow processes.  If you use reservations, make sure that there is enough extra physical memory in the cluster that the failed over VM will not refuse to start.
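The sizing rule above is simple arithmetic, sketched here in Python. The function and parameter names are made up for this example, and the inputs are assumed to come from the Oracle advisors and your OS metric collection. It covers only the Oracle side of the VM, not the guest OS itself.

```python
from typing import Tuple


def oracle_memory_targets_gb(sga_gb: float, pga_high_water_gb: float,
                             shadow_process_gb: float) -> Tuple[float, float]:
    """Return (minimum, preferred) physical memory to back an Oracle VM.

    minimum:   at least cover the SGA
    preferred: SGA + PGA high water mark + memory used by shadow processes
    """
    minimum = sga_gb
    preferred = sga_gb + pga_high_water_gb + shadow_process_gb
    return minimum, preferred
```

A 16GB SGA with a 4GB PGA high water mark and 2GB of shadow process memory gives a 16GB floor and a 22GB preferred figure.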

In vSphere Client, select the VM > Performance tab > click Advanced > Chart Options… > Memory > Real Time > Active


Network Performance

We confess that we rarely run into network configuration issues that impede VMware Infrastructure performance.  Just the same, we’ll offer a standard.  There is always the possibility that a workload changes its virtual machine memory blocks at a high rate.  A workload so hot that it generates dirty memory blocks faster than the proprietary vMotion interconnect can move in thirty seconds is justification for beefing up the interconnect hardware capacity through faster NICs (or converged technology), teaming NICs, or both.

iperf Network Load Test

Conceptually, the network load test should be used to force the max throughput of the network fabric and interfaces. Similar to the memory test, it flushes out any potential driver issues, hardware issues or network topology issues. The goal is to validate performance by pushing the interfaces to their max throughput and watching for stability problems.

Iperf is a network testing tool. It is able to create a TCP or UDP data stream between two nodes (or virtual machines). Iperf is open source software available for both Linux and Windows at http://sourceforge.net/projects/iperf/. Below is a simple test that runs for 500 seconds testing the maximum throughput of the network interfaces.

Server Node

$ iperf -s -i 5

Client Node

$ iperf -c <server_address> -t 500 -i 5 -m

Dropped Packets

For network packets on a LAN, you ideally should not see dropped packets. Dropped packets typically indicate congestion in the network or failing hardware. One percent dropped packets in either direction can throttle throughput by as much as 15 percent.
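To put a number on dropped packets, a small Python helper like the one below can be used. It is a sketch under assumptions: the function name is invented, and the two counters are assumed to be totals you have read from your monitoring tool for the same interval and direction, with the total including the dropped packets.

```python
def drop_percent(dropped_packets: int, total_packets: int) -> float:
    """Return dropped packets as a percentage of all packets in one direction."""
    if total_packets == 0:
        return 0.0  # no traffic in the interval, so nothing was dropped
    return dropped_packets * 100 / total_packets
```

For example, 50 drops out of 5,000 packets is 1 percent, which is already enough to hurt throughput according to the guideline above.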

In vSphere, dropped packets can be monitored by selecting the ESX host > Performance tab > Advanced > Network > Real Time > select None in Counters > then check Receive packets dropped and Transmit packets dropped
