Archive

Posts Tagged ‘Performance’

VMware breaks the 50,000 SPECweb2005 barrier using VMware vSphere 4

June 17th, 2009 No comments

Looking forward to seeing if it delivers on the promises of performace, here it is an iteresting reading. You may be interested to have a look at this white paper first What’s New in VMware vSphere™ 4: Performance Enhancements

VMware breaks the 50,000 SPECweb2005 barrier using VMware vSphere 4

VMware has achieved a SPECweb2005 benchmark score of 50,166 using VMware vSphere 4, a 14% improvement over the world record results previously published on VI3. Our latest results further strengthen the position of VMware vSphere as an industry leader in web serving, thanks to a number of performance enhancements and features that are included in this release. In addition to the measured performance gains, some of these enhancements will help simplify administration in customer environments.

The key highlights of the current results include:

  1. Highly scalable virtual SMP performance.
  2. Over 25% performance improvement for the most I/O intensive SPECweb2005 support component.
  3. Highly simplified setup with no device interrupt pinning.

Let me briefly touch upon each of these highlights.

Virtual SMP performance

The improved scheduler in ESX 4.0 enables usage of large symmetric multiprocessor (SMP) virtual machines for web-centric workloads. Our previous world record results published on ESX 3.5 used as many as fifteen uniprocessor (UP) virtual machines. The current results with ESX 4.0 used just four SMP virtual machines. This is made possible by several improvements that went into the CPU scheduler in ESX 4.0.

From a scheduler perspective, SMP virtual machines present additional considerations such as co-scheduling. This is because in case of a SMP virtual machine, it is important for ESX scheduler to present the applications and the guest OS running in the virtual machine with the illusion that they are running on a dedicated multiprocessor machine. ESX implements this illusion by co-scheduling the virtual processors of a SMP virtual machine. While the requirement to co-schedule all the virtual processors of a VM was relaxed in the previous releases of ESX, the relaxed co-scheduling algorithm has been further refined in ESX 4.0. This means the scheduler has more choices in its ability to schedule the virtual processors of a VM. This leads to higher system utilization and better overall performance in a consolidated environment.

ESX 4.0 has also improved its resource locking mechanism. The locking mechanism in ESX 3.5 was based on the cell lock construct. A cell is a logical grouping of physical CPUs in the system within which all the vCPUs of a VM had to be scheduled. This has been replaced with per-pCPU and per-VM locks. This fine-grained locking reduces contention and improves scalability. All these enhancements enable ESX 4.0 to use SMP VMs and achieve this new level of SPECweb2005 performance.

Very high performance gains for workloads with large I/O component

I/O intensive applications highlight the performance enhancements of ESX 4.0. These tests show that high-I/O workloads yield the largest gains when upgrading to this release.

In all our tests, we used SPECweb2005 workload which measures the system’s ability to act as a web server. It is designed with three workloads to characterize different web usage patterns: Banking (emulate online banking), E-commerce (emulates an E-commerce site) and Support (emulates a vendor support site that provides downloads). The performance score of each of the workloads is measured in terms of the number of simultaneous sessions the system is able to support while meeting the QoS requirements of the workload. The aggregate metric reported by the SPECweb2005 workload normalizes the performance scores obtained on the three workloads.

The following figure compares the scores of the three workloads obtained on ESX 4.0 to the previous results on ESX 3.5. The figure also highlights the percentage improvements obtained on ESX 4.0 over ESX 3.5. We used an HP ProLiant DL585 G5 server with four Quad-Core AMD Opteron processors as the system under test. The benchmark results have been reviewed and approved by the SPEC committee.

Sw2005_KL

We used the same HP ProLiant DL585 G5 server and the physical test infrastructure in the current as well as the previous benchmark submission on VI3. There were some differences between the two test configurations (for example, ESX 3.5 used UP VMs while SMP VMs were used on ESX 4.0; ESX 4.0 tests were run on currently available processors that have a slightly higher clock speed). To highlight the performance gains, we will look at the percentage improvements obtained for all the three workloads rather than the absolute numbers.

As you can see from the above figure, the biggest percentage gain was seen with the Support workload, which has the largest I/O component. In this test, a 25% gain was seen while ESX drove about 20 Gbps of web traffic. Of the three workloads, the Banking workload has the smallest I/O component, and accordingly had relatively smaller percentage gain.

Highly simplified setup

ESX 4.0 also simplifies customer environments without sacrificing performance. In our previous ESX 3.5 results, we pinned the device interrupts to make efficient use of hardware caches and improve performance. Binding device interrupts to specific processors is a technique common to SPECweb2005 benchmarking tests to maximize performance. Results published in the http://www.spec.or/osg/web2005 website reveal the complex pinning configurations used by the benchmark publishers in the native environment.

The highly improved I/O processing model in ESX 4.0 obviates the need to do any manual device interrupt pinning. On ESX, the I/O requests issued by the VM are intercepted by the virtual machine monitor (VMM) which handles them in cooperation with the VMkernel. The improved execution model in ESX 4.0 processes these I/O requests asynchronously which allows the vCPUs of the VM to execute other tasks.

Furthermore, the scheduler in ESX 4.0 schedules processing of network traffic based on processor cache architecture, which eliminates the need for manual device interrupt pinning. With the new core-offload I/O system and related scheduler improvements, the results with ESX 4.0 compare favorably to ESX 3.5.

Conclusions

These SPECweb2005 results demonstrate that customers can expect substantial performance gains on ESX 4.0 for web-centric workloads. Our past results published on ESX 3.5 showed world record performance in a scale-out (increasing the number of virtual machines) configuration and our current results on vSphere 4 demonstrate world class performance while scaling up (increasing the number of vCPUs in a virtual machine). With an improved scheduler that required no fine-tuning for these experiments, VMware vSphere 4 can offer these gains while lowering the cost of administration.

View article…

Dtrace for Windows? Windows Performance Toolkit

June 17th, 2009 No comments

So you have performance troubles on Windows, you probably already pulled the sysinternals from the shelve. But did you already know the Windows Performance toolkit for hardcore performance troubleshooting?

This toolkit has three tools;

xperf.exe – Captures traces, post-processes them for use on any machine, and supports command-line (action-based) trace analysis.

xperfview.exe – Visual Trace Analysis Tool – Presents trace content in the form of interactive graphs and summary tables

xbootmgr.exe – Automates on/off state transitions and captures traces during these transitions.

So what do these tools do?

Performance Analyzer is built on top of the Event Tracing for Windows (ETW) infrastructure. ETW enables Windows and applications to efficiently generate events, which can be enabled and disabled at any time without requiring system or process restarts. ETW collects requested kernel events and saves them to one or more files referred to as “trace files” or “traces.” These kernel events provide extensive details about the operation of the system. Some of the most important and useful kernel events available for capture and analysis are context switches, interrupts, deferred procedure calls, process and thread creation and destruction, disk I/Os, hard faults, processor P-State transitions, and registry operations, though there are many others.

One of the great features of ETW, supported in WPT, is the support of symbol decoding, sample profiling, and capture of call stacks on kernel events. These features provide very rich and detailed views into the system operation. WPT also supports automated perf testing. Specifically, xperf is designed for scripting from the command line and can be employed in automated performance gating infrastructures (it is the core of Windows PerfGates). xperf can also dump the trace data to an ANSI text file, which allows you to write your own trace processing tools that can look for performance problems and regressions from previous tests.

More info:

http://blogs.msdn.com/ntdebugging/archive/2008/04/03/windows-performance-toolkit-xperf.aspx
http://msdn.microsoft.com/en-us/performance/cc825801.aspx
http://download.microsoft.com/download/5/E/6/5E66B27B-988B-4F50-AF3A-C2FF1E62180F/COR-T594_WH08.pptx

Download the tools here:

YouTube Preview Image