Web Wonders – research on improving Web search engines
Imagine: Search engines that actually find what you’re looking for
TRYING TO FIND SOMETHING ON the Internet is a lot like rummaging through countless cardboard boxes in a dark attic. Even if the search is eventually successful, it almost always turns up too much junk. Worse, no search is ever comprehensive. The dark attic of the Internet is only infiltrated as far as the very limited capabilities of a particular search engine. Yahoo’s directory, for example, is merely an index of selected sites cataloged by anonymous editors. Other popular indexers such as Excite, Lycos, and Alta Vista send spindly bits of software known as spiders meandering from Web site to Web site, sucking up bits of text from each page they encounter and making a list of the keywords they find. This patchy list–not the Web itself–is what a search engine actually searches.
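The keyword list a spider builds can be pictured as a lookup table from words to pages. The Python sketch below is only an illustration of that idea, not the code any real search engine runs; the page addresses, the sample text, and the function names `build_index` and `search` are all invented for the example.

```python
import re
from collections import defaultdict

def build_index(pages):
    """Map each keyword to the set of page addresses that contain it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(url)
    return index

def search(index, query):
    """Return the pages that contain every word in the query."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results

# Hypothetical pages a spider might have visited.
pages = {
    "a.example/cameras": "Digital cameras reviewed and compared",
    "b.example/attic": "Cardboard boxes in a dark attic",
    "c.example/photo": "Choosing digital photo printers",
}
index = build_index(pages)
print(sorted(search(index, "digital cameras")))
```

A query never touches the pages themselves, only the table, so anything the spider missed or has not revisited is invisible to the search.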
Not surprisingly, search engines are having a tough time keeping up with the exponential growth of the Net, which already boasts more than 5 million Web sites and about 1.5 billion total pages of information. Experts estimate that the amount of information on the Net is expanding by 2 million pages a day and that the best search engines cover no more than 16 percent of those pages, down sharply from 34 percent in 1997. Spiders take months to finish one incomplete sweep of the Web, during which time tens of millions of pages have appeared, changed, or disappeared.
So is it hopeless? Some Web engineers are trying to compensate by developing swifter, more efficient spiders. Others look for ways to transform the very nature of the search process by making it more human. Fast Search and Transfer (FAST), a new company with headquarters in Oslo, Norway, has created spiders on steroids. FAST’s programs can index 80 million pages a day–twice as many as spiders dispatched by conventional search engines–and may be able to index all of the Web by early next year. Instead of collecting data from spiders in one big list of keywords the way other search sites do, FAST divides the information into several hundred more manageable chunks and scans all of them simultaneously, performing more than 600 searches per second.
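The divide-and-scan idea can be sketched in a few lines of Python. This is a toy under stated assumptions, not FAST's actual design: the article does not say how FAST splits its data, so the round-robin split, the thread pool, and the names `split_into_shards`, `search_shard`, and `parallel_search` are all hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def split_into_shards(pages, n_shards):
    """Deal pages round-robin into n_shards smaller collections."""
    shards = [{} for _ in range(n_shards)]
    for i, (url, text) in enumerate(sorted(pages.items())):
        shards[i % n_shards][url] = text
    return shards

def search_shard(shard, word):
    """Scan one shard for pages whose text contains the word."""
    word = word.lower()
    return [url for url, text in shard.items() if word in text.lower().split()]

def parallel_search(shards, word):
    """Send the query to every shard at once, then merge the hits."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partial = list(pool.map(search_shard, shards, [word] * len(shards)))
    return sorted(url for hits in partial for url in hits)

# Hypothetical pages, split into two chunks.
pages = {
    "a.example": "spiders crawl the web",
    "b.example": "a dark attic full of boxes",
    "c.example": "the web keeps growing",
    "d.example": "football pools and point spreads",
}
shards = split_into_shards(pages, 2)
print(parallel_search(shards, "web"))
```

Each chunk is small enough to scan quickly, and because the chunks are searched side by side, the total wait is roughly that of the slowest chunk rather than the sum of all of them.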
But brute force doesn’t make the job of sorting through all that information any easier. Marc Krellenstein, chief technology officer of Northern Light in Cambridge, Massachusetts, believes finesse is ultimately more important than strength. His spiders do not have the sheer muscle of FAST, and he couldn’t care less. “It may only be a quarter or a third of the Web that we’ve indexed, but it’s a better part of it in terms of its quality,” he says. Northern Light classifies Web pages according to various criteria, including a custom list of 25,000 subject areas. A team of real, live human librarians spot-checks the classifications. When someone does a search, the engine organizes the results into folders, each with a subject title.
Both FAST and Northern Light still depend primarily on random searches by dumb spiders, so the next step is to give spiders the ability to recognize connections between different Web sites. Reka Albert, Albert-Laszlo Barabasi, and Hawoong Jeong of Notre Dame are studying how the so-called six degrees of separation phenomenon relates to the Internet. In the 1960s, Harvard social psychologist Stanley Milgram observed that two complete strangers are likely to be connected by a network of six or fewer acquaintances. The Notre Dame researchers have proven that the Internet is a small world too. They did a statistical analysis of hyperlink connections and discovered that the average number of clicks needed to get between two randomly chosen pages on the World Wide Web is 19. Even if the Web expands to 10 times its present size, that average will only increase to 21 clicks. The upshot is that people could quickly reach any page on the Web if only they knew the right place to start and how to navigate. Spiders programmed to recognize degrees-of-separation patterns might be an answer. “If something goes in intelligently and can see what links it is following, it doesn’t have to index everything,” says Albert.
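The jump from 19 clicks to only 21 makes sense if the average path grows with the logarithm of the Web's size rather than with the size itself. The Python sketch below assumes a fit of the form 0.35 + 2.06 × log10(N), where N is the number of pages. Those coefficients are not given in this article; they are recalled from the Notre Dame group's published analysis and should be treated as an assumption. The page count of 1.5 billion is the figure quoted earlier in this article, and `average_clicks` is a name invented for the example.

```python
import math

def average_clicks(n_pages):
    """Assumed logarithmic fit: average clicks between two random pages."""
    return 0.35 + 2.06 * math.log10(n_pages)

today = 1.5e9          # total pages, as quoted earlier in the article
tenfold = 10 * today   # the Web at 10 times its present size

print(round(average_clicks(today)))
print(round(average_clicks(tenfold)))
print(round(average_clicks(tenfold) - average_clicks(today), 2))
```

Under this assumed fit, every tenfold increase in pages adds the same fixed amount, about two clicks, no matter how large the Web already is.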
Such clever spiders are already in the works. Sridhar Rajagopalan and his colleagues at IBM’s Almaden Research Center in San Jose have designed a prototype search engine called CLEVER, which uses spiders endowed with a dose of humanlike intelligence. They seek out hubs–mini-index sites with a large collection of hyperlinks on a single subject–and authorities–pages that a large number of other Web sites put on their lists of hyperlinks. CLEVER scores each page based on the quality of the pages it points to and the quality of those that point to it, which in turn affects the scores of all those linked pages. By repeating this circular process, the search engine quickly locates the best hubs and authorities. Instead of relying on the judgment of editors at index sites like Yahoo or America Online, CLEVER gathers the opinions of those who really matter–the millions of people who have created Web pages and added links to their favorite sites.
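The circular scoring described above can be sketched as a short loop. This Python example follows the description in the text and is not IBM's actual CLEVER code: a page's authority score is the sum of the hub scores of the pages pointing to it, its hub score is the sum of the authority scores of the pages it points to, and the two are recomputed in turn. The tiny link graph, the iteration count, and the function name `hubs_and_authorities` are assumptions made for illustration.

```python
import math

def hubs_and_authorities(links, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # A good authority is pointed to by good hubs.
        auth = {p: 0.0 for p in pages}
        for source, targets in links.items():
            for target in targets:
                auth[target] += hub[source]
        # A good hub points to good authorities.
        hub = {p: sum(auth[t] for t in links.get(p, [])) for p in pages}
        # Rescale so the scores stay comparable from round to round.
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return hub, auth

# Hypothetical link graph: three index-style pages and one stray page.
links = {
    "hub1": ["a", "b", "c"],
    "hub2": ["a", "b"],
    "hub3": ["a"],
    "lonely": ["c"],
}
hub, auth = hubs_and_authorities(links)
print(max(auth, key=auth.get), max(hub, key=hub.get))
```

No editor ranks anything here. The scores come entirely from who links to whom, which is the sense in which the links act as the opinions of the people who made them.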
Rajagopalan considers that an advance over the current Web chaos. “We’re giving people access to the collective wisdom of the Web,” he says. A group led by Bernardo Huberman and Lada Adamic of the Internet Ecologies Group at Xerox PARC is pursuing a related approach using data from Google, one of the most sophisticated of the current search engines.
Natalie Glance of Xerox Research Center Europe in Meylan, France, hopes to add to the collective wisdom by fostering more of a sense of personal connection on-line. “There are possibly millions of people doing the same search as you, so how can we bring in the power of the community to help?” Her answer: a software program that acts as a community search assistant by cataloging queries made by many people looking for the same kind of information. For example, a generic search for digital cameras might generate a list of cameras other people are considering and pinpoint the popular models. Glance has also created an automated process, called Knowledge Pump, which would make it possible for people to submit comments about Web sites and receive recommendations based on past preferences.
One surefire way to foster more community spirit is to make the search process less forbidding. That advance means doing away with endless, incomprehensible lists of text. WebTheme, a cybergraphics program developed by Jim Thomas and his colleagues at Pacific Northwest National Laboratory in Richland, Washington, displays search results as a glowing galaxy, in which each point of light represents an individual Web page and star clusters denote related topics. WebTheme can also represent data as a topographical map on which peaks show the largest subjects.
Some Web visionaries imagine a day when intelligent spiders will weave together connections between related bits of the Web, creating threads that become something meaningful in themselves–a cybertapestry. When many people weigh in on a topic, a process called group filtering occurs, which yields predictions that tend to be extremely accurate. Robert Lucky, an electrical engineer and corporate vice president of Telcordia Technologies in Morristown, New Jersey, compares the phenomenon to the national point spread in football pools, which often produce uncannily accurate predictions of game scores. “Somehow, this process of everybody contributing a guess settles out to something that is statistically accurate,” he says.
Lucky is confident a similar process will make searching the Internet more fruitful. “A lot of information in the world isn’t stuff written down in books; it’s nuances in people’s heads. If you put them all together, you can form something better than any individual opinion. As they say, nobody is as smart as everybody.”
COPYRIGHT 2000 Discover
COPYRIGHT 2000 Gale Group