The challenges of data management in biotech: booming life sciences research has IT resources bursting at the seams – Storage Networking

Christine Taylor Chudnow

Pharmaceutical companies typically spend over $200 million and around 12 years to hypothesize, develop, and win FDA approval for a new drug. Pharmaceutical and other life sciences companies are desperately seeking faster and cheaper ways to discover new drugs and are hoping to capitalize on the tremendous advances in genomic and proteomic research. To aid in their research, life sciences companies have made tremendous use of biotech tools: computing hardware and software developed specifically for the life sciences sector.

Life sciences researchers depend heavily on three key biotech areas:

* In silico biology: computational tools that translate raw data into workable models or simulations, guiding target selection and drug development.

* Bioinformatics and genome research: computational tools for processing the overwhelming amount of data gathered through genome research, such as the Human Genome Project. (Human genome research is not the only game in town; the other hottest genomic research subjects are mice, worms, yeast, and, believe it or not, puffer fish.)

* Proteomics: high-throughput protein expression analysis and characterization, which impacts diagnostic and therapeutic product development.

During the tremendously challenging drug-discovery process, life sciences researchers and manufacturers commonly use dozens of specialized technologies to identify commercially important genes, discover their functions, validate drug and product development targets, and identify and develop clinical candidates. For example, Millennium Pharmaceuticals uses six different types of technologies during its drug-discovery procedures:

* Biosensors, which are highly sensitive microchips that use tiny samples to analyze protein-protein and protein-small-molecule interactions.

* High-throughput DNA sequencing software, which automates the process of capturing, storing, and analyzing data about the DNA sequence.

* High-throughput screening, which draws from a library of validated targets to test multiple compounds against each drug target, discovering and reporting appropriate responses.

* Imaging, which displays data while retaining correct spatial relationships between individual data points.

* Informatics, which are sophisticated computational tools that access and interpret public and proprietary databases on a wide range of scientific information.

* Microfluidics, which tests genomes and drugs in minuscule volumes measured in nanoliters.

The Search for Power

A common search function underscores the industry's need for processing power. Scientists often search across genomic and proteomic databases to compare sequences, which are letter strings that represent genes and proteins. Gene sequences are written in a four-letter alphabet, with each letter standing for a different nucleotide; protein sequences use a twenty-letter alphabet, with each letter standing for a different amino acid. Since these sequences can match on many different substrings, sequence comparison requires high-powered, specialized search programs such as BLAST (Basic Local Alignment Search Tool) that can quickly range across multi-terabyte databases.
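
Tools like BLAST get their speed from a seed-and-extend strategy: rather than aligning a query against every position in the database, they first index short fixed-length substrings (k-mers) and only extend alignments where a seed matches exactly. The following Python sketch shows just that seeding step; the sequences and the k-mer length are illustrative, and real BLAST adds scored extension, alignment statistics, and heavily optimized indexing.

```python
from collections import defaultdict

def build_kmer_index(db_seq, k=3):
    """Index every k-letter substring (k-mer) of a database sequence."""
    index = defaultdict(list)
    for i in range(len(db_seq) - k + 1):
        index[db_seq[i:i + k]].append(i)
    return index

def seed_matches(query, index, k=3):
    """Return (query_pos, db_pos) pairs where a k-mer of the query occurs
    exactly in the database sequence -- the 'seeds' that a BLAST-style
    tool would extend into longer local alignments."""
    hits = []
    for j in range(len(query) - k + 1):
        for i in index.get(query[j:j + k], []):
            hits.append((j, i))
    return hits

# Toy data: a 15-letter "database" and a 6-letter query.
idx = build_kmer_index("ATGGCGTACGTTAGC")
print(seed_matches("GCGTAC", idx))  # [(0, 3), (1, 4), (1, 8), (2, 5), (3, 6)]
```

Because only the seed positions are examined further, the expensive alignment work is limited to a handful of candidate regions instead of the whole multi-terabyte database.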

To bring this kind of processing power to bear on searches, bioinformatics operations, and in silico experiments, biotech has adopted supercomputing, clustering, and grid computing.

Supercomputers: The Bioinformatics Center of the Institute for Chemical Research (ICR) at Japan's Kyoto University uses supercomputers as the network servers for its KEGG (Kyoto Encyclopedia of Genes and Genomes), a major global genome database system. Three Sun Microsystems Sun Fire 15K computers work at the system's core, each one containing 72 CPUs, 144GB of memory, and 15TB of storage. The Sun Fire offers SMP (symmetric multiprocessing) architecture, foundational to many life sciences applications.

As part of its Blue Gene life sciences computing project, IBM has partnered with the Department of Energy's (DOE) National Nuclear Security Administration. Working with Lawrence Livermore National Laboratory, IBM is jointly designing a new Blue Gene supercomputer called Blue Gene/L. It is slated to operate at about 200 teraflops (200 trillion floating-point operations per second). (IBM and Lawrence Livermore co-developed the world's current record-breaking supercomputer, the "ASCI White" machine now in operation at the lab.) Compaq is racing IBM to see if it can better its record in the life sciences and is collaborating with genetic database company Celera Genomics and the DOE's Sandia National Labs. The Compaq supercomputer should sustain at least 100 teraflops, about eight times faster than the existing ASCI White but considerably slower than Blue Gene/L. Both have target dates in 2004.

Clusters: Many life sciences companies can't afford to keep a Sun Fire in the basement and, therefore, turn to clusters for running parallel compute operations. A cluster is a group of tightly integrated servers that acts as a single computer. Although clusters running proprietary clustering software can be costly, a popular PC-based alternative is the Beowulf cluster, which combines open-source software (usually Linux), economical PC servers, and a high-speed backbone. This configuration can yield affordable, virtual supercomputing for applications whose data or tasks can be processed in parallel. The NIH, for example, has invested heavily in a 176-node cluster dubbed "Biowulf" (proving that life sciences researchers actually do have a sense of humor).
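
Clusters suit the life sciences because so many workloads are embarrassingly parallel: each sequence or compound can be analyzed independently, so work is simply scattered to nodes and the results gathered. The sketch below illustrates that scatter/gather pattern in Python, with threads standing in for cluster nodes and a GC-content calculation standing in for the per-item analysis; all names and data here are illustrative, not any particular cluster's software.

```python
from concurrent.futures import ThreadPoolExecutor

def gc_content(seq):
    """Fraction of G/C bases -- a typical independent per-sequence statistic."""
    return (seq.count("G") + seq.count("C")) / len(seq)

def scatter_gather(sequences, workers=4):
    """Cluster-style scatter/gather: farm independent work items out to
    workers and collect the results in order. On a Beowulf cluster the
    workers would be nodes; here threads stand in for them."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(gc_content, sequences))

print(scatter_gather(["ATGC", "GGGG", "ATAT"]))  # [0.5, 1.0, 0.0]
```

The same divide-and-distribute structure is what lets a rack of commodity PC servers approximate a supercomputer for this class of problem.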

Grids: Another way to provide massive processing power along with collaborative features is grid computing, which seeks to create virtual supercomputers by expanding the clustering concept across multiple locations and domains. Though we are far from a theoretical, world-spanning "Grid," today's grids can flexibly support dynamically changing organizations and computing requirements. Grids operate across geographical distances and over multiple networks, hardware platforms, and operating systems. Unlike clustered servers, computers that contribute processing resources to a grid do not have to be dedicated solely to it. Grids also allow partners to share data and application resources, which benefits biotech companies: most are suffering from a slow rate of internal drug development, and by collaborating they hope to sustain growth rates and satisfy investors.

Security needs loom large for life sciences industries such as healthcare, which must meet a variety of strict federal privacy and security regulations. Life sciences computing security must work across a range of processes: anti-virus, intrusion detection, vulnerability assessment, firewalls, and VPNs (virtual private networks). There has been a dramatic rise in external security threats, and the costs of dealing with network intrusions are climbing. Most internal security breaches are due to incorrect IT configurations and end-user ignorance, and are not usually malicious. That is not true of external security breaches, which have grown far more sophisticated since teenaged-hacker days. Corporate piracy is not unknown, and the industry cannot rule out the possibility of cyber-terrorist attacks on its biological and chemical laboratories.

Security software developer Riptech divides the major life sciences security threats into four categories: inadequate border protection, remote access systems with weak access controls, application servers running widely available faulty scripts, and misconfigured or default-configured systems.

Inadequate border protection: The easiest way to penetrate a network is to identify an Internet gateway without a well-configured firewall. Missing or inadequate firewalls are not uncommon in areas such as healthcare and life science start-ups where IT might not be properly staffed or funded. At the University of Washington Medical Center, for example, a hacker easily entered a corporate database and downloaded thousands of patient medical records. The network access point lacked a firewall.

Remote-access systems with weak access controls: Hackers frequently bypass firewalls and access controls by dialing directly into remote access points such as dial-in servers. In addition to the missing firewall, the UW Medical Center had weak network username and password policies.

Application servers that use well-known, exploitable scripts: Many application servers commonly depend on well-known scripts with equally well-known security holes. Unless these scripts are patched or replaced, the servers remain at risk. Although biotech application servers are unlikely to use well-known scripts, most of them exist on the same network as at-risk general servers.

Misconfigured and default-configured systems: Inexperienced network administrators can easily misconfigure operating systems and firewall applications, or install them using default configurations. Default configuration parameters can make installation easier, but hackers are often familiar with the default settings. Vulnerabilities often include easily available, high-risk services and default user accounts with known passwords. Once a hacker identifies a target network, his or her next step is often to find a system with default configurations or common misconfigurations.

Riptech suggests that companies use firewalls to protect critical network connection points, VPNs to provide safe access to internal network resources via the Internet, intrusion detection technology to generate alerts on suspicious network activity, and strong authentication technologies to protect remote access systems.

For example, First Genetic Trust provides genetic data handling and bioinformatics services to pharmaceutical companies, medical researchers, and health care providers engaged in genetic research. The company is also an online portal for genetic information, education, and counseling services for individual patients’ decisions regarding the use of their private genetic information. The company called on HP to help build and deploy a genetic banking technology platform. This platform manages and protects genetic data critical to clinical trials, helping accelerate research in genetic-based medicine, diagnosis, and treatment. First Genetic Trust’s infrastructure is built on UNIX servers using HP Virtualvault, an enterprise security application that protects the Web server, applications, transactions, and critical data.

External security also deeply impacts grid computing, which is being actively developed in the life sciences arena. Security is a large, and largely unanswered, part of the life sciences grid equation, especially concerning broad public or global grids. Grid tool developer Globus cites three main requirements for grid security: authenticated, secure communication between elements of a computational Grid; cross-organizational security support, making a centrally managed security system unnecessary; and support for single sign-on sessions for Grid users across multiple resources and/or sites.

Data and Knowledge Management

Due to the nature and volume of life sciences data, the industry suffers from inefficient data collection, data sharing, information distribution, and decision-making procedures. Data warehousing, for example– which allows users to transparently query and analyze data from multiple, heterogeneous, distributed sources–is rare in the sciences. Scientific data is volatile. As scientists conduct research, their understanding of data frequently changes. This requires data warehouse administrators to continually update the warehouse to read the source data’s new format, which is demanding and not easily automated. In addition, researchers often issue multipart queries that must access several different databases, yet must return a unified answer.

Software approaches to the problem of data integration concentrate on the following key requirements: accessing the original data sources, handling redundant and missing data, normalizing analytical data from different data sources, conforming terminology to industry standards, exposing the integrated data as a single logical repository, and using metadata to traverse domains.
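
Several of those requirements can be illustrated in a few lines of Python: two hypothetical sources with different field names and different gene-name spellings are normalized into one logical view, with missing values kept explicit and a source tag retained as metadata. The synonym table, field names, and records below are all invented for illustration.

```python
# Terminology mapping: synonyms conform to one standard gene symbol.
SYNONYMS = {"p53": "TP53", "Tp53": "TP53"}

# Two hypothetical sources using different field names for the same concepts.
source_a = [{"gene": "p53", "expr": 2.1}, {"gene": "BRCA1", "expr": None}]
source_b = [{"symbol": "Tp53", "level": 0.8}]

def normalize(record, gene_key, value_key, source):
    """Map one source record into the unified schema."""
    gene = record[gene_key]
    return {
        "gene": SYNONYMS.get(gene, gene),  # conform terminology
        "value": record[value_key],        # missing data stays explicit (None)
        "source": source,                  # provenance metadata
    }

# The "single logical repository": one list, one schema, many sources.
unified = ([normalize(r, "gene", "expr", "A") for r in source_a] +
           [normalize(r, "symbol", "level", "B") for r in source_b])
print([r["gene"] for r in unified])  # ['TP53', 'BRCA1', 'TP53']
```

Production systems face the same steps at vastly larger scale, plus the schema churn described above as scientists' understanding of the data changes.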

The Interoperable Informatics Infrastructure Consortium (I3C) is an international body of life sciences and information technology organizations dedicated to developing common protocols and specifications for data exchange and knowledge management for the life sciences community. According to the I3C, drug development in today's post-genomic era requires a tremendous technology investment in hardware and software in order to make use of the incredible masses of genomic data, and it must integrate the tremendous body of work that already exists in databases throughout the world. However, there is, as yet, no widely accepted interoperability framework for sharing and exchanging massive amounts of data across databases, and among pharmaceutical, healthcare, and biotech organizations and academia.

There are promising approaches today, largely based on flavors of XML and/or Java-enabled middleware. XML-based integrated databases are separate databases that share a common DTD (document type definition), which enables XML-enabled applications to query each of them. However, although some DTDs are widely accepted in the life sciences industry, there is no commonly accepted language. For example, database tool developer Incogen developed a public markup language called Visual Bioinformatics Markup Language (VBML), while LabBook, which markets a Genomic XML Viewer, uses BSML (Bioinformatic Sequence Markup Language).
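
The idea can be illustrated with Python's standard XML library: if two databases export records against the same document structure (standing in for a shared DTD), a single query routine serves both. The element names and data below are invented for illustration and are not VBML or BSML.

```python
import xml.etree.ElementTree as ET

# Two sources exporting records against the same (hypothetical) structure.
DOC_A = "<sequences><seq id='S1' organism='mouse'>ATG</seq></sequences>"
DOC_B = "<sequences><seq id='S2' organism='yeast'>GTA</seq></sequences>"

def query(xml_text, organism):
    """Return sequence ids for a given organism. Because both sources
    honor the shared document structure, one code path serves them all."""
    root = ET.fromstring(xml_text)
    return [e.get("id") for e in root.findall("seq")
            if e.get("organism") == organism]

print(query(DOC_A, "mouse") + query(DOC_B, "mouse"))  # ['S1']
```

Without the shared structure, each source would need its own parser, which is exactly the proliferation the I3C hopes a common framework will avoid.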

The University of Minnesota has been using Java and JDBC technologies. JDBC is an API that lets users access tabular data sources such as spreadsheets and flat files, and also provides cross-DBMS connectivity to a number of SQL databases. Professor Ernest Retzel, director of the Center for Computational Genomics and Bioinformatics, is developing a set of public-domain Java visualization tools that can cull the results of multiple gene research applications into a unified interface. "Some people still take the approach that they're working on a gene, instead of an entire system of genes," said Retzel. "If you want to just explore one thing at a time, you can do that with a text report. But if you're looking at 40,000 or 50,000 properties in an organism, and how things change when you salt stress or heat stress or freeze something, it's very different. You can't look at text reports anymore; there's too much information to deal with."

IBM, a large life sciences player, developed the DiscoveryLink suite to handle gene and protein database queries. Based on DB2 and DB2 Relational Connect middleware, DiscoveryLink provides single-query access to existing databases, applications, and search engines. The suite is not confined to DB2 databases, but also accesses other relational database management systems (RDBMSs), including Oracle, Sybase, and Microsoft SQL Server. It also uses DB2 Life Sciences Data Connect, which allows a DB2 federated system to integrate research data from a variety of scientific disciplines and heterogeneous distributed sources. The application presents queries to disparate databases using XML-based wrappers, a type of middleware database interpreter that allows users to access differing data sources as if from a single database.
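
The federated idea, one query spanning several physically separate databases, can be sketched with SQLite's ATTACH command. This is only an analogy for what DiscoveryLink does across Oracle, Sybase, and DB2 via wrappers; the table names and data are illustrative.

```python
import sqlite3

# Two separate databases stand in for heterogeneous back ends.
conn = sqlite3.connect(":memory:")
conn.execute("ATTACH DATABASE ':memory:' AS proteins")

conn.execute("CREATE TABLE gene (symbol TEXT, chromosome TEXT)")
conn.execute("CREATE TABLE proteins.protein (gene_symbol TEXT, name TEXT)")
conn.execute("INSERT INTO gene VALUES ('TP53', '17')")
conn.execute("INSERT INTO proteins.protein "
             "VALUES ('TP53', 'Cellular tumor antigen p53')")

# A single query joins across both databases, as if they were one.
row = conn.execute(
    "SELECT g.symbol, g.chromosome, p.name "
    "FROM gene g JOIN proteins.protein p ON g.symbol = p.gene_symbol"
).fetchone()
print(row)  # ('TP53', '17', 'Cellular tumor antigen p53')
```

The researcher writes one query and never needs to know that gene and protein records live in different systems, which is the whole point of federation.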

Life sciences needs the help. The stock market has rewarded public companies that are farther along in the drug discovery cycle but is punishing companies that the market perceives as pure R&D outfits. Private funding has been more forgiving–there is tremendous potential in gene and protein research for medicine and agribusiness–but even professional investors want to see some return on their money. In the scramble to prove a healthy ROI, biotech companies are looking to processing power and robust data management tools to speed up the drug discovery process.

Knowledge management is a vital aspect of data sharing. KM in the life sciences may include any and all information, data, programs, and documents; embedded knowledge (systems, processes, and the like); tacit knowledge (skills and experience); innovative and deductive capabilities; communication processes and cultures; and relations with customers and other partners. It involves human intellectual processes (values, knowing, company culture, and the cultural background of the worker), the nature of knowledge (subjective interpretation of data/information), and IT tools to facilitate knowledge management.

Life sciences companies use common groupware suites that lend themselves to KM, such as Lotus Notes and Microsoft Exchange. These applications, which include email, group scheduling, contact management, threaded discussion, and extensive document-sharing features, prove useful in life sciences environments. However, groupware developed specifically for life sciences centers around the in silico discovery process. For example, Accelrys integrates biological modeling and discovery lab applications into its Discovery Studio Project Knowledge Manager (DS ProjectKM), an Oracle-based groupware infrastructure that captures and stores the information that Discovery Studio applications generate. The application can then enable research teams to re-use the resulting information.

Other types of applications also serve KM in the life sciences industry, such as content management, personalization engines, portals, catalog management, document management, and digital-asset management. Of these, biotech portals are extremely common and have been in academic use for more than a decade. These portals can range in sophistication from simple Web link pages and user communities to intensive database search portals. Content searchers are important as well: while federated databases use software to integrate different database formats, content searchers go out to a variety of content sources to return pattern matches. For example, British KM developer Autonomy uses advanced pattern-matching technology (non-linear, adaptive, digital signal processing) to extract a document's digital essence and determine the characteristics that give the text meaning. Once the software has identified and encoded the key concepts' unique signatures, it creates Concept Agents to seek out similar ideas in a variety of content sources.
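
Leaving aside Autonomy's proprietary signal-processing approach, the basic content-matching idea, scoring documents by how closely their term patterns resemble a query's rather than by exact database joins, can be sketched with a simple bag-of-words cosine similarity. All documents and terms below are invented.

```python
import math
from collections import Counter

def bag(text):
    """Bag-of-words vector: term -> count."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two bag-of-words vectors (0.0 to 1.0)."""
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    norm = (math.sqrt(sum(v * v for v in a.values())) *
            math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

query_doc = "kinase inhibitor binding assay"
corpus = {
    "doc1": "binding assay for a novel kinase inhibitor",
    "doc2": "annual report on storage hardware revenue",
}
scores = {name: cosine(bag(query_doc), bag(text))
          for name, text in corpus.items()}
print(max(scores, key=scores.get))  # doc1
```

Real engines add stemming, term weighting, and learned concept signatures, but the principle of ranking by pattern similarity rather than exact matching is the same.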

Knowledge research is a vital part of life sciences, but few of these companies host their own research databases: the sector has produced so much data over the last few years that it quickly outstrips any individual company's storage capacity. Most life sciences companies augment their internal research by searching huge public or proprietary life sciences databases. For example, major biotech developer Incyte hosts the Proteome BioKnowledge Library, a collection of databases compiled from available protein information. Incyte uses proprietary tools to search biological literature for relevant entries, extracts them, and uses them to populate its databases. The company then presents the data to subscribing researchers in a simplified user interface. Many public databases, usually attached to universities, also see heavy use.

In spite of limited returns, governments and private investors continue to pour money into the life sciences sector, which remains a hotbed of possibilities for medicine, husbandry, and agriculture. Biotech is a necessary tool to store, compute, analyze, and share our growing knowledge of complex living organisms.

COPYRIGHT 2002 West World Productions, Inc.

COPYRIGHT 2003 Gale Group