Twitterverse Ponders The 'Deep Web'

There's a fascinating article in Sunday's New York Times about "a new breed of technologies...that will extend the reach of search engines into the Web's hidden corners," far beyond what even Google can do today.

It's creating quite a buzz on Twitter, where links to the article have been "tweeted" seemingly hundreds of times today. If you're not on Twitter, trust me: That's a sure indication that something is having a great impact. (I've included a blog post and several comments below from Twitter users who want to share their thoughts. They're worth a read.)

Google, as we all know, is the dominant search engine of our time, with the ability to search more than a trillion web pages. Take it, NYT's Alex Wright:

But as impossibly big as that number may seem, it represents only a fraction of the entire Web. Beyond those trillion pages lies an even vaster Web of hidden data: financial information, shopping catalogs, flight schedules, medical research and all kinds of other material stored in databases that remain largely invisible to search engines.

The challenges that the major search engines face in penetrating this so-called Deep Web go a long way toward explaining why they still can't provide satisfying answers to questions like "What's the best fare from New York to London next Thursday?" The answers are readily available -- if only the search engines knew how to find them.

Search engines rely on programs known as crawlers (or spiders) that gather information by following the trails of hyperlinks that tie the Web together. While that approach works well for the pages that make up the surface Web, these programs have a harder time penetrating databases that are set up to respond to typed queries. ...

With millions of databases connected to the Web, and endless possible permutations of search terms, there is simply no way for any search engine -- no matter how powerful -- to sift through every possible combination of data on the fly.

One project, called DeepPeep, has set a goal of crawling and indexing every database on the web. Run by Professor Juliana Freire of the University of Utah, DeepPeep essentially begins by shooting off sample queries to get a sense of what's in a particular database, in the same way that a space probe is sent out to get preliminary data from a planet or solar system. Once it receives answers allowing it to get its bearings, DeepPeep then "fires off automated search terms in an effort to dislodge as much data as possible," according to Wright.
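
For the technically curious, here is a rough sketch of that probe-then-expand idea in Python. To be clear, this is not DeepPeep's actual algorithm, which the article doesn't spell out. The FLIGHTS records, the search_form function and the probe function are all made up for illustration, and the "database" is just a list that can only be reached through typed queries, the way a real one sits behind a search form with no links for a crawler to follow.

```python
from collections import deque

# Hypothetical stand-in for a form-backed database: there are no links to
# follow, only a search function that answers typed queries.
FLIGHTS = [
    "new york to london thursday fare 420",
    "new york to paris friday fare 510",
    "boston to london thursday fare 390",
    "chicago to tokyo monday fare 880",
]


def search_form(term):
    """Return every record that contains the term as a whole word."""
    return [record for record in FLIGHTS if term in record.split()]


def probe(search, seed_terms, max_queries=50):
    """Send seed queries, then reuse words from the answers as new queries.

    Returns the set of records uncovered and the number of queries issued.
    """
    queue = deque(seed_terms)
    tried = set()
    found = set()
    while queue and len(tried) < max_queries:
        term = queue.popleft()
        if term in tried:
            continue
        tried.add(term)
        for record in search(term):
            if record not in found:
                found.add(record)
                # Words in newly uncovered records become future probes.
                queue.extend(w for w in record.split() if w not in tried)
    return found, len(tried)


records, queries_used = probe(search_form, ["london"])
print(len(records), "records found using", queries_used, "queries")
```

A single seed word is enough to reach every record in this toy case because the records share common words. A real database offers no such guarantee, and the max_queries cap hints at the cost problem Wright describes: the number of possible queries grows much faster than anyone can afford to send.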

Of course, Google didn't get where it is by being slow-footed, so naturally the company is working on its own Deep Web technology, which involves building a predictive model from the results a database returns when Google queries it with specific terms.

In a way, Google is in a tough spot. It currently reigns supreme in search and thus would prefer not to see the market paradigm change. But in one form or another, Deep Web searching -- or the Semantic Web, the web of linked or interconnected data -- is inevitable, so Google has to experiment and be willing to change a formula that has enabled it to become an Internet giant. But what happens if the company can't do Deep Web or semantic search as well as a competitor? Anyone remember AltaVista?

Here are thoughts on the Deep Web from several Twitter folks:

Sam Han @caughtintheweb: Why the name "Deep Web" misses the point

@dataspora: NYT muses on the 'Deep Web,' but in a shallow article leaves out most (all) of the important players.

@stakats: NYT on "deep web" ignores even "deeper web" of commercial and otherwise gated data stores

@chadskelton: Thinking if Google figures out how to index the "Deep Web" that's good news for database journalism.

@asktonyc: @LLiu The deep web is like dark matter in our galaxy: huge, full of material that can have an impact on our lives, FTW!
