Software Magazine

Distributed drumbeat still a bit hollow – Data Resource Management: Distributed DBMS Technology – buyers guide

Janis Herter

DISTRIBUTED DRUMBEAT STILL A BIT HOLLOW

The MIS director threw his hands up in the air and exclaimed, “Sure, our databases are distributed. If you see any of them, would you please send them home?”

In that sense, distributed databases have been around for a long time. However, technology is just emerging to fit a more contemporary definition: distributed DBMSs reside at multiple sites in a network, but appear to the user and applications as one database at the user’s own site.

Although the modern definition seems simple, there are many ways to satisfy it, each with trade-offs. “If you’re looking to purchase, realize that ‘distributed database’ is not a binary state of ‘have it’ or ‘don’t,’” cautioned Jeffrey Tash, president of Database Decisions, a consulting firm in Newton, Mass.

DRIVING FORCES

The apocryphal MIS director has already achieved the first step in understanding how to go about implementing a distributed DBMS solution. People, more than technology, distribute data. If technology alone were the answer, probably the most reliable and cost-efficient solution would be huge centralized mainframe-based database managers. But the current is running the other way–user demand spurred the development of low-end and midrange computing.

That trend brought a whole new set of problems. Personal computers, departmental computing and geographically distributed mainframes led to the need to unify information, but the data’s owners are only willing to share it, not give it away. Culturally speaking, the issue is that Americans especially hate to depend on anyone else, and want control over their own data. For many users, PCs were ego-effective long before they were cost-effective. That is one big reason why we arrived at today’s state of affairs.

A major force behind today’s distributed DBMS evolution is the need to access or consolidate privately owned data that is physically distributed. For the most part, data remains physically distributed; centralizing the data is not cost-effective or politically possible.

For many new systems, and many new bodies of data, centralization is clearly the second choice. In the case of distributing systems that five years ago might have been centralized, Ken Jacobs of Oracle Corp., Redwood City, Calif., stated, “It’s the economics of mid- to low-range computing. Compared to a central site, networks are reliable and economical, although difficult to administer.”

Another benefit of distributed databases is the ability to increase computing capacity through adding more computers rather than expanding existing computers. “If your cart becomes too much for your horse to pull, do you trade for a bigger horse, or get a second horse?” asked Umang Gupta, president of Gupta Technologies, Menlo Park, Calif.

“But the most common reason for using distributed systems is to give access to users from their desktop to data stored everywhere,” he continued. “An insurance agent might keep the data on the local customers, and make a policy change by accessing corporate actuarial information, changing the local database, and then sending the update back to the mainframe.”

Distributed database systems can be categorized as homogeneous or heterogeneous–for example, three Ingres sites from Ingres Corp., Alameda, Calif., vs. a system built on one Oracle and two IBM DB2 databases.

Under one approach to distributed data, which many vendors support, databases are linked through a gateway, or intelligent interface. The location of the data is not transparent to the user, so, strictly speaking, this approach doesn’t satisfy the definition of distributed DBMS used earlier: users have to know where to go to get their data.

Usually the database being linked to is available for query, not update. This is more properly referred to as remote data access (RDA), networked systems or distributed processing. Focus, from Information Builders Inc., New York, is one of the most advanced and complete systems of this sort.
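
The distinction is easy to see in miniature. Below is a rough sketch, with “sites” faked as in-memory dictionaries; SITES, run_query and run_distributed_query are illustrative names, not any vendor’s API. Under remote data access, the application must name the site; under a true distributed DBMS, the system resolves the location itself.

    # Sketch: remote data access vs. location transparency.
    # Each "site" is faked as an in-memory table for illustration.
    SITES = {
        "chicago": {"customers": [("C1", "Acme"), ("C2", "Bell")]},
        "boston":  {"orders": [("O1", "C1")]},
    }

    def run_query(site, table):
        """RDA style: the caller must name the site holding the table."""
        return SITES[site][table]

    def run_distributed_query(table):
        """Distributed-DBMS style: the system finds the site itself."""
        for tables in SITES.values():
            if table in tables:
                return tables[table]
        raise KeyError(table)

    print(run_query("chicago", "customers"))   # caller knows the location
    print(run_distributed_query("customers"))  # location is transparent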

This is certainly a valuable technology in today’s business computing world, but the scope of distributed databases extends past data access. Today’s gateways would be greatly improved if the RDA standard for handling updates and messages among different products were in place. There is some progress–IBM is working with OSI to help resolve differences between its LU6.2 protocols and OSI’s Distributed RDA.

From a user or application program perspective, the next level of distributed DBMS complexity involves granting query and update capability across multiple physical databases. For query purposes, the databases may be homogeneous or heterogeneous, but for good performance the requests must be optimized: an optimized query follows the most direct route to where the data resides, whether the data is pulled from one database (homogeneous) or several (heterogeneous).

However, in most current distributed DBMS implementations offering an update capability, the databases are probably homogeneous, because homogeneity makes locking and integrity controls easier to enforce. The ongoing concern is that once an update capability is granted, users can change data, and a mishandled change compromises the data’s integrity, especially when a single update must touch several databases at once across heterogeneous environments.

The highest level of complexity is heterogeneous distributed databases with full update capability. This capability is available today in only certain combinations of products, and with varying degrees of programming effort required. The available products still fall short of the goal that a distributed system should look exactly like a nondistributed system.

According to Eric Wasiolek, marketing manager at Ingres Corp., a distributed database system should handle the following three functions: distributed queries, with optimization; distributed updates, with recovery; and distributed database administration. The administration function is critical, yet neglected by most vendors.

According to Database Decisions’ Tash, “Today’s single biggest need is not for a distributed database. It is for a distributed data dictionary. This isn’t even a technical problem; it’s a data administration problem. We’ve got a Tower of Babel.”

Ingres/Star, Ingres’ distributed database product, offers a global data dictionary that resides on one central machine. While technically sound, the approach shares the same problem that centralizing previously distributed data creates. “Politically, it’s often not feasible to centralize what was previously separate. The sites won’t give up control. It only works when a new system is centralized from the start, and then deployed,” said Tash.

Ingres/Star’s global data dictionary contains information about the location of data, and this is used to optimize requests. The reward is significantly improved performance. The penalty occurs when the central site goes down. To compensate for this vulnerability, the network must be more robust to prevent central site failure or bottleneck, or the central site must be replicated, increasing overhead.
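
The trade-off can be sketched in a few lines, using hypothetical names rather than Ingres/Star’s actual interfaces: every request is routed through one catalog node, so an outage there stalls the entire network.

    # Sketch of a centralized global data dictionary. One catalog node
    # maps each table to its home site; if that node fails, no request
    # anywhere can be routed.
    class CentralDictionary:
        def __init__(self, locations):
            self.locations = locations   # table name -> site name
            self.up = True               # availability of the central site

        def locate(self, table):
            if not self.up:
                raise RuntimeError("central dictionary site is down")
            return self.locations[table]

    catalog = CentralDictionary({"customers": "chicago", "policies": "hq"})
    print(catalog.locate("customers"))   # fast, direct routing
    catalog.up = False                   # a central-site failure...
    # catalog.locate("policies")        # ...would now fail for every site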

However, a central site is not a necessary evil for query optimization. Informix Software, Inc., Menlo Park, Calif., has developed an optimizer without centralizing the data dictionary. It does this by sending messages to dictionaries at other nodes to determine how to satisfy the request most efficiently.
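
The decentralized alternative might look something like the sketch below (illustrative only, not Informix’s actual protocol): probe each node’s local dictionary for an estimated cost, then route the request to the cheapest node. Every probe is a network message, which is the overhead Tash questions next.

    # Sketch: decentralized optimization by messaging per-node dictionaries.
    NODE_DICTIONARIES = {
        "chicago": {"customers": 1200},   # table -> estimated rows to scan
        "boston":  {"customers": 40},     # a much smaller local copy
    }

    def cheapest_site(table):
        costs = {}
        for node, stats in NODE_DICTIONARIES.items():  # one message per node
            if table in stats:
                costs[node] = stats[table]
        return min(costs, key=costs.get)

    print(cheapest_site("customers"))   # 'boston': cheapest place to ask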

Tash noted, “This is certainly more advanced than Ingres’ technique, but with all of these messages, at what point does the optimization cost more than the query?”

DBMS INDEPENDENCE

C.J. Date, a consultant now with Codd & Date, Inc., San Jose, Calif., developed 12 rules for an ideal distributed database system. Today, no software meets all of these criteria. None of the 12 rules states that a distributed system must use one DBMS, but rule 12–DBMS independence–cannot be satisfied with today’s technology, Date asserts.

Enabling different DBMSs to act as equal partners in a distributed system should be possible, but it will certainly be more difficult than using one DBMS product at multiple sites. One reason, among many, for this difficulty is the number of different SQL dialects. However, in the interim, imperfect heterogeneous distributed DBMS implementations may prove to be more valuable than perfect implementations of a homogeneous system because most organizations run multiple DBMSs. Reality always wins.

Another of Date’s rules concerns data replication. The advantages of replicated data are fewer remote accesses (due to more local copies), and the availability of another copy if the first is down or locked for update. The major disadvantage is that if one copy is updated, all should be updated–automatically. Few vendors offer this level of replication. And keeping all replicas synchronized can cause tremendous network traffic.
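
In its simplest synchronous form, the cost looks something like the sketch below (illustrative structures only): one logical update fans out into a message per replica.

    # Sketch: synchronous replica maintenance. Updating one copy forces
    # an immediate update message to every other copy.
    REPLICAS = {
        "chicago": {"C1": "Acme"},
        "boston":  {"C1": "Acme"},
        "denver":  {"C1": "Acme"},
    }

    def update_everywhere(key, value):
        for copy in REPLICAS.values():   # one network message per replica
            copy[key] = value            # all copies change together

    update_everywhere("C1", "Acme Corp.")
    print(REPLICAS)   # every site now holds the new value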

However, realtime synchronization is not always necessary. Herbert Edelstein, a principal of Euclid Associates, a consulting firm in Berkeley, Calif., lists several exceptions he has encountered. “Day-to-day synchronization of the replicas may be sufficient. Or perhaps the replicas consist of primary key values. These are usually stable over time,” he said, “so changes wouldn’t swamp the network. Or if there are additions, often a business rule can be imposed that restricts the use of that value until it’s replicated. If Chicago adds a new customer, only Chicago can book orders for it that first day.”
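
Edelstein’s new-customer rule reduces to a simple check, sketched below with hypothetical record fields: until a key has been replicated everywhere, only its originating site may use it.

    # Sketch of the business rule: a key added at one site is usable
    # only there until replication to the other sites completes.
    customers = {"C9": {"origin": "chicago", "replicated": False}}

    def may_book_order(site, customer_id):
        record = customers[customer_id]
        return record["replicated"] or record["origin"] == site

    print(may_book_order("chicago", "C9"))  # True: the originating site
    print(may_book_order("boston", "C9"))   # False: wait for replication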

Dick Hatch, chairman of the board and CEO of Hatch and Fortwangler, Inc., a consulting firm in Pembroke, Maine, uses the replication facilities of CA-Star with Datacom databases from Computer Associates International, Garden City, N.Y., for U.S. Customs in Washington. The system is distributed across multiple mainframes due to sheer size. Replicas are used for performance and load balancing, and for maintaining “hot duplicates” in case of disk drive failure.

“If one of the replicas goes down, perhaps due to communications or node failure, I have a choice of either halting updates to all copies and continuing read-only, or continuing updates to the other copies and later resynchronizing the failed replica,” said Hatch. “Most vendors don’t give you that second choice.”

TWO-PHASE COMMIT

One of the other big problems with replication is ensuring consistent updates across multiple copies. The protocol that does this, commonly called two-phase commit, is a critical function of any distributed system.

Basically, the protocol ensures that all copies of an object are updated, or none are. If an update fails at one site, the updates at the other sites are rolled back. This makes updates fail more often, but guarantees integrity whenever an update succeeds. It also requires numerous messages to cross the network between nodes for each update.
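
The mechanics can be sketched in a few lines (a toy coordinator, not any vendor’s implementation): phase one collects a vote from every site, phase two commits everywhere or rolls back everywhere, and each phase costs a round of network messages.

    # Sketch: two-phase commit with a toy coordinator and sites.
    class Site:
        def __init__(self, name, healthy=True):
            self.name, self.healthy, self.pending = name, healthy, None

        def prepare(self, update):       # phase 1: vote yes or no
            self.pending = update
            return self.healthy

        def commit(self):                # phase 2a
            print(self.name, "committed:", self.pending)

        def rollback(self):              # phase 2b
            print(self.name, "rolled back")
            self.pending = None

    def two_phase_commit(sites, update):
        votes = [site.prepare(update) for site in sites]  # one message round
        if all(votes):
            for site in sites:
                site.commit()            # all voted yes: commit everywhere
            return True
        for site in sites:
            site.rollback()              # any no vote: undo everywhere
        return False

    ok = two_phase_commit([Site("chicago"), Site("boston", healthy=False)],
                          "raise premium")
    print("update succeeded:", ok)   # False: one failed vote undoes all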

Two-phase commit is available with Sybase from Sybase, Inc., Emeryville, Calif.; Ingres/Star; InterBase from Interbase Software Corporation, Bedford, Mass.; Rdb from Digital Equipment Corporation, Maynard, Mass.; Empress from Empress Software, Inc., Greenbelt, Md.; and CA-Star from Computer Associates.

WAITING TO DO IT RIGHT

The notable exceptions are Oracle, Informix and DB2. According to Oracle’s Jacobs, “Oracle is waiting to provide two-phase commit until we can do it properly, with user transparency and full automatic recovery across all network topologies and across heterogeneous databases.”

IBM’s DB2 offers only single-site update, even on version 2 release 2. According to consultant Tash, “Two-phase commit doesn’t happen in DB2 until they get to ‘distributed unit of work’; and before that arrives they’re going through the stages of ‘remote request,’ which is like a micro-mainframe link; and ‘remote unit of work,’ which is like client/server. Then comes this ‘distributed unit of work,’ and then finally ‘distributed request,’ which is what most people think of when they think of distributed database. By then it’ll be 1994.”

Today’s two-phase commit offerings can be categorized as automatic or programmatic. John Kornatowski, president of Empress Software, said, “Our automatic two-phase commit is built into the Empress kernel and simply initiated by the developer. It allows updating of views which span multiple nodes, transparent to the user.” Ingres/Star and Interbase also offer automatic two-phase commit.

An automatic implementation is not always desirable. Steven Olson, senior software engineer for SQL Solutions, a Sybase-owned company in Burlington, Mass., worked on a Sybase distributed application for Northern Telecom, Research Triangle Park, N.C., to monitor the status of telephone lines. Sybase uses a programmatic implementation, which allows some flexibility. Olson said, “If someone with a backhoe causes a failure in the network, the system must detect it and notify the repairmen of the location and severity.” The network does not come to a halt.

“It would have been nice if one transaction could have updated all copies of the data, but it was unacceptable that if a node was down, the update was rolled back everywhere,” Olson said.

At U.S. Customs, consultant Hatch said, “Even though CA-Star offers an excellent two-phase commit, there would be problems if our tables were frequently down. Our two and a half million CICS transactions per day generate about half a million database requests. Fortunately, we don’t face typical communications problems, because all of our mainframes are in one location.”

A different approach to distributed databases is the fault-tolerant hardware-software solution developed by Tandem Computers, Cupertino, Calif. Tandem has solved many of the problems with which other distributed databases are still wrestling.

Data stored on a Tandem system and managed by Tandem’s NonStop SQL distributed database can be accessed using tools provided by Ingres, Oracle, Focus and Sybase.

WHAT’S MISSING?

Distributed technology must overcome several hurdles. Distribution across a wide-area network increases the network traffic problems tremendously compared to a local-area network. And traffic only increases with such desirable capabilities as two-phase commits, data replication and distributed transaction management. Query optimization becomes essential to avoid bringing the data mountain to Mohammed.

According to Edelstein from Euclid Associates, “I’m adding rule 13–performance transparency. If you can tell your system is distributed by using a stopwatch or even a calendar, then there’s something missing.”

Heterogeneous distributed database access is crippled by a lack of standards. A SQL standard must emerge, and it cannot be the least common denominator of the dialects. The RDA standard is also sorely needed. Today, if n DBMSs want to talk to m DBMSs, n*m gateways are needed. With an RDA standard, only n would be needed.
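
To make the arithmetic concrete, a two-line illustration:

    # Gateway arithmetic from the paragraph above: pairwise gateways grow
    # as n*m, while a shared RDA standard needs one interface per product.
    n, m = 5, 4                # say, 5 DBMSs that must reach 4 others
    print(n * m)               # 20 pairwise gateways without a standard
    print(n)                   # 5 RDA interfaces with one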

Other holes that need to be filled include network security support and security administration, and standardized name servers to locate a user or a printer across nodes.

Even if problems of performance and interproduct access standards were solved today, the data is often not ready to be joined. What if two databases store pricing information in different currencies? What if one system reflects a one-to-one relationship and another calls it one-to-many?

Work is needed on application deployment–how does one provide a new version of an application to 5,000 PCs?

DB2 must become fully distributed, which according to IBM’s Bob McIvor, senior planner in the strategy and market planning organization of the Santa Teresa Laboratory, will occur in the mid-’90s. “Our fundamental objective is to protect the integrity of the data. Our distributed version will have all the things our customers expect in a single DBMS–security, integrity, recoverability and performance.

“IBM’s Distributed Relational Database Access (DRDA) will encompass DB2 on MVS, SQL/DS on VM, and the OS/400 and OS/2EE database managers,” said McIvor.

Consultant Tash commented, “Today, IBM offers some distribution between DB2 and DB2, but people want OS/2 to DB2.” In any market other than the computer industry, if the leader lagged by five years in an important feature, the competitors would seize the throne. But here, the other vendors must wait for DB2 to catch up. For most vendors, it is more important to offer distributed access to DB2 than to their own databases.

With all of these hurdles, is distributed the way to go? “Certainly, it’s still in its infancy,” said Keith Toleman, marketing manager, Microproducts Division, Information Builders, Inc. “The security, the communications, the networking–all must be so much more advanced. That’s why we promote distributed applications more than distributed databases.”

CLIENT/SERVER APPROACH

He continued, “This is a client/server approach where you keep the data in one place on the mainframe, and your application runs on the PC to handle your screens and data formatting and so on. The data looks local, and you offload mainframe cycles. And with [Information Builders’] Focus, your host platform and its database can be whatever you have or want.”

Other traditional mainframe software vendors are also developing distributed product strategies based on the client/server model. Cincom Systems, Inc. in Cincinnati, Ohio, for example, previewed a distributed version of Supra, which is based on a client/server paradigm, at its recent user conference.

With the current lack of standards and flux in the market, other factors come into play. Consultant Tash said, “Even more than distributed databases, keep your eyes on these three: cooperative processing, snapshots and partitioned data.” Cooperative processing involves distributed processing but centralized data. Tash predicted that small distributed systems will not replace the large mainframes. “That would require giving up control.

“Summary-level snapshots will be taken from DB2 once a day or once a week and put on LANs and servers,” Tash continued, “and we will see partitioned data for local or departmental applications on machines like DEC VAXes.”

During the next year, progress is likely on proposed relational database access standard protocols, to improve heterogeneous database query and update capabilities.

COPYRIGHT 1990 Wiesner Publications, Inc.
