Information Age (London, UK)

Benchmarking relevance?

Like many of the world’s largest software vendors, database software giant Oracle uses product benchmarking results as the cornerstone of multi-million dollar marketing campaigns. One of Oracle’s recent advertisements compared the performance of its flagship 8i database with rival IBM’s DB2 database: “Oracle runs SAP [applications] four times faster.”

Oracle is hardly alone. BEA Systems gleaned publicity after boasting that its application server, WebLogic, could run four times faster than IBM’s own application server, WebSphere, under IBM’s own ‘Trade2’ online brokerage benchmark. BEA also says that under the Java ‘Pet Store’ benchmark, its product is over 50% faster than Oracle’s 9i Application Server.

Effective marketing material, no doubt, but IT decision-makers need to ask themselves to what extent such claims of superiority should be taken at face value.

Not at all, argues Ulrich Marquard, vice president of performance, benchmarking and data archiving at German software giant SAP. “In the Oracle case, our research shows the numbers Oracle compared [to DB2] were produced on hardware with absolutely different numbers of CPUs [central processing units],” he says. In fact, Oracle’s database was tested on a 128-way Symmetric Multiprocessing (SMP) computer architecture, while IBM’s was tested on a 24-way SMP system.
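To see why the CPU counts matter, consider a rough back-of-the-envelope check, sketched here in Python. It assumes, purely for illustration, that the advertised "four times faster" figure comes from the same 128-way and 24-way runs, and that dividing throughput by CPU count is a fair normaliser. In practice SMP systems do not scale linearly, so this is a crude indicator rather than a corrected result. The function name `per_cpu_ratio` is hypothetical.

```python
def per_cpu_ratio(speedup: float, cpus_a: int, cpus_b: int) -> float:
    """Ratio of system A's per-CPU throughput to system B's.

    speedup is A's raw throughput divided by B's raw throughput.
    Assumes (unrealistically) that throughput scales linearly with CPUs.
    """
    cpu_advantage = cpus_a / cpus_b  # how many times more CPUs A had
    return speedup / cpu_advantage


# Figures taken from the article: a claimed 4x speedup, 128 CPUs versus 24.
ratio = per_cpu_ratio(4.0, cpus_a=128, cpus_b=24)
print(f"Per-CPU throughput ratio: {ratio:.2f}")
```

On those assumptions, the system with 128 processors had more than five times the CPUs but delivered only four times the throughput, so a ratio below 1.0 means the "faster" product did less work per processor.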

Perhaps not surprisingly, technology industry analysts’ reaction to the way suppliers use benchmarking data is scathing. “It is useless,” says Ted Schadler, group director at Forrester Research’s TechRankings business, which assesses technology product strengths and weaknesses across a broad set of subjective and objective categories.

Benchmarking, in itself, is not without its uses, of course. In the early 1990s, for example, software performance benchmarking data (generated by the independent Transaction Processing Performance Council, or TPC) was very helpful to IT decision-makers, says Susan Dallas, research director at analyst Gartner. When client-server systems were first introduced, organisations struggled to get all of the different components of an IT system to work together and to figure out its potential throughput and the source of bottlenecks, she explains.

But over the last few years, the validity of benchmarks has dwindled as vendors have come to see them as key marketing vehicles – even going as far as to build benchmarking optimisers into product code. So just how relevant are benchmarks in helping users make good software purchasing decisions?

Reality check

A major problem is that most benchmarks by their very nature are carried out under laboratory conditions. “Benchmarks are not the real world [of most enterprise IT architectures],” says Mark Brockbank, a consultant in the software group at IBM.

For example, application server software that is capable of processing large volumes of web transactions per second in a test environment may struggle to reproduce a fraction of that performance on an organisation’s actual systems. “This could be down to an organisation using different hardware, but our research shows it is the design of the [organisation’s own] application that determines 90% of the performance characteristics that they get in the end,” adds IBM’s Brockbank.

However, apart from generating performance data, analysts still see an important role for benchmarking within specific technology markets, particularly when applied to emerging technologies. This is certainly the case with security software, says Dallas at Gartner.

The NSS Group, for example, provides independent third-party testing and certification services for security products. In particular, NSS specialises in performance- and feature-oriented tests on firewalls, intrusion detection system products and public key infrastructure products. Bob Walder, director of NSS, acknowledges that “vendors will present their test results in a certain way that will make their product look better.”

To support his argument, he contrasts how NSS and an intrusion detection system (IDS) software vendor, which he declines to name, benchmarked the vendor’s software. NSS used a system to blast web traffic at the IDS software at a rate of 148,000 64-byte packets per second to see how many malicious attacks, such as denial of service packets, it recognised. This test focuses on pure performance or “raw sniffing speed”, says NSS Group’s Walder.

The IDS software vendor, by contrast, used 1,514-byte packets instead of 64-byte packets to perform its own benchmarking tests. Dramatically increasing the packet size means that far fewer packets travel through the IDS software for the same level of network load. “This way, the vendor could say it achieved a detection rate of over 90% at 100% network saturation, but when we tested it [with 64 byte packets] it scored way, way lower than that, and in fact, the software crashed,” says Walder.
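The effect of packet size on packet rate is simple arithmetic, and a short Python sketch makes the gap concrete. The article does not state the link speed; the code assumes a saturated 100 Mbit/s Ethernet link, because that is the speed at which 64-byte frames top out at roughly 148,000 per second. It also assumes standard Ethernet framing overhead (8-byte preamble, 12-byte inter-frame gap, 4-byte frame check sequence) and that the 1,514-byte figure excludes the frame check sequence. The function name is hypothetical.

```python
ETHERNET_OVERHEAD_BYTES = 20  # 8-byte preamble + 12-byte inter-frame gap
FCS_BYTES = 4                 # frame check sequence appended to every frame


def max_frames_per_second(frame_bytes: int,
                          link_bps: int = 100_000_000,
                          includes_fcs: bool = True) -> float:
    """Theoretical maximum frame rate on a fully saturated Ethernet link.

    link_bps defaults to 100 Mbit/s, an assumption not stated in the article.
    """
    on_wire = frame_bytes + ETHERNET_OVERHEAD_BYTES
    if not includes_fcs:
        on_wire += FCS_BYTES
    return link_bps / (on_wire * 8)


small = max_frames_per_second(64)                       # minimum-size frames
large = max_frames_per_second(1514, includes_fcs=False) # near-maximum frames

print(f"64-byte frames:    {small:,.0f} per second")
print(f"1,514-byte frames: {large:,.0f} per second")
print(f"Ratio:             {small / large:.1f}x")
```

Under these assumptions both tests can honestly be described as running at "100% network saturation", yet the small-packet test asks the IDS to inspect roughly 18 times as many packets every second.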

In addition to security product performance capabilities, end users are now demanding far more detailed benchmarking criteria about software, including functionality, reliability and integration, according to analysts.

This is an area that Forrester Research moved into in 2000 with its web-based TechRankings service. In partnership with product testing laboratory Doculabs, Forrester provides analysis and testing data on products from eight software markets.

Each product is tested against between 400 and 600 criteria, says Schadler.

In this format, benchmarking appears to be of more use to prospective customers – an acceptable ‘second best’ to an organisation running its own real-life target applications against different products. Benchmarking, however, will never replace the need to carry out quality assurance and fine tuning of whole systems and their individual components.

Stress tests

The most common benchmarks are bespoke – that is, tests conceived and run internally within organisations to gauge the performance characteristics of their existing or proposed applications. But over the past decade, a slew of standard software benchmarks has emerged that score the performance of core software products.

TPC (Database applications)

By far the most prominent of the independent benchmarking and testing organisations is the Transaction Processing Performance Council (TPC). Founded in 1988, the TPC is a non-profit organisation that provides a strict set of parameters for benchmarking hardware and systems software as they process database transactions and queries. The tests are audited and verified by the TPC – albeit on behalf of hardware and software vendors.

The TPC tests cover four areas: TPC-C simulates an order-entry environment; TPC-H measures the capability of a system to process ad hoc queries; TPC-R tests configurations for their ability to handle advanced decision-support processes; and TPC-W is a transactional web benchmark measuring web interaction processing.

Tests are typically carried out by server vendors in cooperation with one of the major database software companies and are scored both on raw performance and price/performance. But there is one caveat: while results have to be approved by the TPC, vendors are under no obligation to publish unfavourable results.
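The difference between the two scores is worth spelling out. Price/performance is the total cost of the tested system divided by its throughput, so the fastest configuration is not necessarily the best value. The Python sketch below illustrates this with two entirely hypothetical results; the system names, throughput figures and costs are invented for the example and do not correspond to any published TPC result.

```python
from dataclasses import dataclass


@dataclass
class BenchmarkResult:
    name: str
    throughput: float   # transactions per minute (hypothetical figures)
    system_cost: float  # total cost of the tested configuration

    @property
    def price_performance(self) -> float:
        """Cost per unit of throughput: lower is better."""
        return self.system_cost / self.throughput


# Invented numbers, for illustration only.
results = [
    BenchmarkResult("System A", throughput=500_000, system_cost=10_000_000),
    BenchmarkResult("System B", throughput=200_000, system_cost=2_000_000),
]

best_raw = max(results, key=lambda r: r.throughput)
best_value = min(results, key=lambda r: r.price_performance)

for r in results:
    print(f"{r.name}: {r.throughput:,.0f} tpm, "
          f"{r.price_performance:.2f} per transaction-per-minute")
print(f"Fastest: {best_raw.name}; best value: {best_value.name}")
```

A vendor holding both results could truthfully lead its advertising with either one, which is why it pays to check which of the two scores a headline claim refers to.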

Pet Store (Application servers)

Vendors themselves also provide benchmarks for software infrastructure technologies. This is the case with the Java Pet Store application benchmark developed by Sun Microsystems. Pet Store has become the industry benchmark for testing the capabilities of application servers that adhere to the Java 2 Enterprise Edition (J2EE) interoperability standard. The idea is to simulate ecommerce activities of an online pet store.

Trade2 (Application servers)

Systems and software giant IBM has also developed the Trade2 benchmark to show off the performance characteristics and cost benefits of its WebSphere Application Server product. Formerly known as the WebSphere eBusiness Benchmark, Trade2 tries to replicate the real-world application behaviour of an online brokerage firm to measure the overall performance, as well as the individual components, of WebSphere. However, the benchmark can be run against other application servers.

SAP SAB (Various)

A different approach to benchmarking has been taken by German software giant SAP. Since 1993, it has provided a standard set of application benchmarks for other vendors to test new hardware, system software components, and database systems against SAP applications. The company’s Standard Application Benchmarks (SAP SAB) are available for most core modules of its enterprise resource planning system, mySAP.com, including financials, retail, sales and distribution. For example, a relational database software vendor might use SAP SAB to test performance capabilities, including scalability, concurrency and multi-user behaviour of systems running on its database.

This also benefits SAP because it can measure how well specific products work with SAP applications.

Hardware trials

Compared to the complexity of software benchmarks, hardware tests are far more focused on measuring fixed elements of systems, such as the performance of a computer’s processors, rather than overall system performance. The de facto standard performance test for processors is that of the Standard Performance Evaluation Corporation (SPEC), a non-profit organisation formed by hardware vendors to create standardised benchmarks that can genuinely rate servers and workstations.

But SPECrate results cannot exist in a vacuum; a system’s performance can be substantially influenced by interconnected components such as a computer’s disk and memory. This is one reason why hardware vendors have relied on the benchmarks of the Transaction Processing Performance Council, which aim to measure entire system workloads.

Rateable value

Market research organisation Forrester Research’s TechRankings service aims to provide organisations with a level of data about software products that empirical benchmarks lack.

It combines Forrester’s research data, including vendor interviews and analysis, with laboratory-based product testing by its partner Doculabs. However, unlike other benchmark tests, such as the TPC’s, once products are tested vendors cannot prevent publication of the analysis – even if it is unfavourable.

Forrester claims that end users are now putting pressure on vendors to put their products through its TechRankings test – to prove their products are up to scratch. Product test areas include content management software, integration servers, and commerce platforms.

COPYRIGHT 2002 Information Age Media Ltd.