Deep Web Research 2009

Bots, Blogs and News Aggregators is a keynote presentation that I have been delivering over the last several years, and much of its material comes from my extensive research into the “invisible” web, or what I like to call the “deep” web. The Deep Web covers somewhere in the vicinity of 1 trillion pages of information, located throughout the World Wide Web in various files and formats that current Internet search engines either cannot find or have difficulty accessing. By comparison, search engines find about 20 billion pages at the time of this publication.

In the last several years, some of the more comprehensive search engines have written algorithms to search the deeper portions of the World Wide Web by attempting to find files such as .pdf, .doc, .xls, .ppt, .ps, and others. These files are predominantly used by businesses to communicate information within their organizations or to disseminate information to external communities. Searching for this information using deeper search techniques and the latest algorithms allows researchers to obtain a vast amount of corporate information that was previously unavailable or inaccessible. Research has also shown that even deeper information can be obtained from these files by searching and accessing their “properties” information (embedded metadata such as author and title).
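As a rough illustration of file-type searching, the sketch below builds one query string per document format named above. It assumes the `filetype:` and `site:` operators, which several major search engines accept but whose exact behavior varies by engine. The function names, the sample phrase, and the `example.com` domain are hypothetical and not taken from any resource in this guide.

```python
from urllib.parse import quote_plus

# File extensions named in the paragraph above.
DOCUMENT_TYPES = ["pdf", "doc", "xls", "ppt", "ps"]


def build_filetype_queries(phrase, site=None, types=DOCUMENT_TYPES):
    """Return one query string per file type.

    Assumes the target engine understands the `filetype:` and `site:`
    operators; check the engine's own help pages before relying on them.
    """
    queries = []
    for ext in types:
        parts = [f'"{phrase}"']          # exact-phrase match
        if site:
            parts.append(f"site:{site}")  # restrict to one domain
        parts.append(f"filetype:{ext}")   # restrict to one document format
        queries.append(" ".join(parts))
    return queries


def as_url_parameter(query):
    """Encode a query string so it can be placed in a URL query parameter."""
    return quote_plus(query)


if __name__ == "__main__":
    for q in build_filetype_queries("annual report", site="example.com"):
        print(q, "->", as_url_parameter(q))
```

Running each generated query separately, rather than combining all file types in one search, keeps the results for each format distinct and makes it easier to see which formats an organization actually publishes.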

This guide provides a wide range of resources for understanding the history of deep web research. It also includes resources, classified by category, that let you search the currently available web and find key sources of information once you understand how to search the “deep web”.

This Deep Web Research 2009 article is divided into the following sections:

  • Articles, Papers, Forums, Audios and Videos
  • Cross Database Articles
  • Cross Database Search Services
  • Cross Database Search Tools
  • Peer to Peer, File Sharing, Grid/Matrix Search Engines
  • Presentations
  • Resources – Deep Web Research
  • Resources – Semantic Web Research
  • Bot Research Resources and Sites
  • Subject Tracer Information Blogs


ARTICLES, PAPERS, FORUMS, AUDIOS AND VIDEOS

99 Resources to Research & Mine the Invisible Web by Jessica Hupp

Academic and Scholar Search Engines and Sources

All of OCLC’s WorldCat Heading Toward the Open Web by Barbara Quint

An Interactive Clustering-based Approach to Integrating Source Query interfaces on the Deep Web by W. Wu, C. Yu, A. Doan, W. Meng

Annotation for the Deep Web

Automatic Extraction of Web Search Interfaces for Interface Schema Integration by H. He, W. Meng, C. Yu, Z. Wu

Automatic Information Extraction From Semi-Structured Web Pages By Pattern Discovery

Automatic Meaning Discovery Using Google by Rudi Cilibrasi and Paul M. B. Vitanyi

Benevolent “Virus” Helps Reveal the Hidden Web

Beyond Google: The Invisible Web – Tools for Teaching the Invisible Web

Bibliomining Bibliography

Bibliomining for Automated Collection Development in a Digital Library Setting: Using Data Mining to Discover Web-Based Scholarly Research Works by Dr. Scott Nicholson

Bot Research

Client-Side Deep Web Data Extraction

Clustering E-Commerce Search Engines by Q. Peng, W. Meng, H. He, C. Yu

Common Information Environment Seeks To Reveal the Hidden Web

Crawling the Hidden Web by Sriram Raghavan and Hector Garcia-Molina

Current Awareness Discovery Tools on the Internet

Data Extraction and Label Assignment for Web Databases

Deep Content – Guide To Effective Searching of the Internet

Deep Web – Exploring the Secrets of the Hidden Internet by Marcus P. Zillman, M.S., A.M.H.A. – 23 minutes – Internet/Technology Channel

Deep Web Navigation in Web Data Extraction

Desperately seeking Web Search 2.0

DigiCULT Thematic Issue 6: Resource Discovery Technologies for the Heritage Sector, June 2004 (high-resolution .pdf, 4.9 MB)

Diving in the Deep End of the Web by Suzanne Ross

Efficient and Effective Metasearch Project

Google Teams Up with 17 Colleges to Test Searches of Scholarly Materials By Jeffrey R. Young

Graph Structure in the Web

Grey Literature

Grey Literature Network Service (GreyNet)

Gray Literature: Resources for Locating Unpublished Research by Brian S. Mathews

Gray Literature Subject Guide

Information Retrieval and the Semantic Web by Tim Finin, James Mayfield, Clay Fink, Anupam Joshi, and R. Scott Cost

In Search of the Deep Web

Invisible Web Gets Deeper

Invisible Web Revealed

IR and IE on the Web – PhD and MSc Dissertations

JEP: The Deep Web

LLRX: Book Review: The Invisible Web

LLRX: Deep Web Research

LLRX: Deep Web Research 2005

LLRX: Deep Web Research 2006

LLRX: Deep Web Research 2007

LLRX: Deep Web Research 2008

LLRX: Mining Deeper Into the Invisible Web

LLRX: ResearchWire: Exposing the Invisible Web

Metadata? Thesauri? Taxonomies? Topic Maps! by Lars Marius Garshol

Mining Newsgroups Using Networks Arising From Social Behavior

Mining the Deep Web: Search Strategies That Work by Lee Ratzan

Mining the Deep Web With Specialized Drills

Mining the Peanut Gallery: Opinion Extraction and Semantic Classification of Product Reviews

Mining Topic-Specific Concepts and Definitions on the Web

Modelling and Mining of Network Information Systems Publications

Net Plan Builds in Search by Kimberly Patch

Online or Invisible?

OntoMiner: Bootstrapping and Populating Ontologies From Domain Specific Web Sites

OpenIndex – Creating a Public Internet Index

Out-googling Google: Federated Searching and the Single Search Box

PhysicsWeb: The Physics of the Web

Publications about Web Analysis, Web Search, Citation Indexing, Digital Libraries, Machine Learning, Neural Networks [Steve Lawrence, Google Labs]

QProber: Classifying and Searching “Hidden-Web” Text Databases

Research Beyond Google: 119 Authoritative, Invisible, and Comprehensive Resources

Researchers Map of the Web

Scientific American: Featured Article: The Semantic Web

Search Engine Meeting 2005 Boston, Massachusetts – White Papers and Presentations

Search Engine Meeting 2006 Boston, Massachusetts – White Papers and Presentations

Search Engine Meeting 2007 Boston, Massachusetts – White Papers and Presentations

Search Engine Meeting 2008 Boston, Massachusetts – White Papers and Presentations

Search Engine Technology and Digital Libraries

Searching the Deep Web by Alex Wright

Searching the Deep Web

Searching the Deep Web – Video

Searching the Deep Web Online Streaming Tutorial

Searching the Internet (White Paper, Audio and Video)

Seeing through the ‘invisible’ Web

SemaForm – Semantic Wrapper Generation for Querying Deep Web Data Sources

Semantic Web Content Accessibility Guidelines for Current Research Information Systems (CRIS) by A. Lopatenko

Smart Search – Advanced Search Engines Link Many Data Sources

Structured Databases on the Web: Observations and Implications

Testbed for Information Extraction from Deep Web

The Deep Web

The Deep Web: Surfacing Hidden Value by Michael K. Bergman

The Future Of News: The Digital Information Librarian

The Hidden Potential of the Web

The Invisible Web by Chris Sherman

The Invisible Web: What it is, Why it exists, How to find it, and Its Inherent Ambiguity

The Invisible Web: Where Search Engines Fear To Go

The Mechanics of Deep Net Meta Search

The Ultimate Guide to the Invisible Web

Timeline of Events Related to the Deep Web

Topological Measures and Maps Of the Web

Toward the Semantic Deep Web by James Geller, Soon Ae Chun, and Yoo Jung An

Towards Automatic Incorporation of Search Engines Into A Large-Scale Metasearch Engine

Traffic-Based Feedback on the Web by Jonathan Aizen, Daniel Huttenlocher, Jon Kleinberg, and Antal Novak

Travel Industry and Deep Web: Exclusive Interview with Marcus P. Zillman

UMBC – AgentNews

Understanding Metadata

Using the Internet As a Dynamic Resource Tool for Knowledge Discovery

Web Characterization Project

Web Data Extractors White Paper Link Compilation

Web Pages Search Engine Based on DNS by Wang Liang, Guo Yi-Ping, and Fang Ming

WebScales: Towards a Highly Scalable Metasearch Engine

What Is the Deep Web? A WhatIs Podcast 15 Minute Interview with Marcus P. Zillman

What is the Invisible Web? A Crawler Perspective by Natalia Arroyo, Laboratorio de Internet

WISE-Cluster: Clustering E-Commerce Search Engines Automatically by Q. Peng, W. Meng, H. He, C. Yu

Yahoo and the Deep Web


CROSS DATABASE ARTICLES

Basic Functional Requirements for Cross Search Service

Digital Libraries – Cross-Database Search: One-Stop Shopping

Search Tools Reports: Searching for Text Information in Databases

The Right Solution: Federated Search Tools by Roy Tennant

UK Web Archiving Consortium


CROSS DATABASE SEARCH SERVICES

ARC – A Cross Archive Search Service

Entrez – The Life Sciences Cross-Database Search Engine

EnergyFiles – Subject Pathways

GPO Access – Search Across Multiple Databases

King County Library System

NLM Gateway Search


Scitopia – Deep Federated Search

The Metasearch Infrastructure Project


CROSS DATABASE SEARCH TOOLS

Bright Planet

Copernic

Cross Database Search Tools Summary

Dieselpoint Java Search and Navigation Software

DbVisualizer – The Universal Database Tool

Dublin Core Metadata Initiative (DCMI)

EEVL Xtra – Cross Database Search


Gold Rush – Database Search Tool


MetaSearch Initiative

Project – Getting OAI-PMH For Free


Peter’s PolySearch Engines

PBCore – The Public Broadcasting Metadata Dictionary

Registry of Library Knowledge Bases

Search Federal Research and Development

SRU – Search/Retrieve via URL

STINET Multisearch

The Flamenco Search Interface Project

VIAF: The Virtual International Authority File



PEER TO PEER, FILE SHARING, GRID/MATRIX SEARCH ENGINES

ALPINE Network – SourceForge: Project

An Efficient Scheme for Query Processing on Peer-to-Peer Networks

Azureus – Vuze Java Bittorrent Client


Between Rhizomes and Trees: P2P Information Systems by Bryn Loban



BitTorrent FAQ and Guide

Bit Torrent Official Site and Search Engine

Bitzi – The Free Universal Media Catalog


BotSpot®: File-sharing Bots

BTjunkie – Bittorrent Search Engine

Coral – The Coral P2P Content Distribution Network

Capn’s PHP Gnutella Search

Crackle – Stream On

Current P2P Search Implementations – P2P Networks

Deepnet Explorer – P2P/RSS-ATOM Web Browser

Distributed Search Engines

Distributed Search in P2P Networks

FAROO – P2P Web Search

Filetopia

Free Haven Project

Frost Project – Freenet Messaging and File Sharing Client

FuzzBox: Tangent Research Artificial Intelligence and Robotics

GNUnet – GNU Project – Free Software Foundation (FSF)


GRACE – GRid seArch and Categorization Engine

Grid Resources


Open Source, Distributed Internet Crawler!

HyperCuP – Shaping Up Peer-to-Peer Networks

Clarke’s Blog

IM and P2P Threat Center

iMesh

International Workshop on Peer-to-Peer Knowledge Management (P2PKM)

Internet Movie Database (IMDb)

Hunt – IRC and Bit Torrent Search Engine

JXTA Project

Kademlia: A Peer-to-peer Information System Based on the XOR Metric

Kazaa Media Desktop



LionShare P2P Project – Legitimate File-Sharing Among Individuals and Educational Institutions

Lphant – The Full P2P Solution

MoleSter – A Tiny File-Sharing Application



MysterNetworks – The Evolution of Peer-to-Peer

NeuroGrid – P2P Search

Open Directory – File Sharing

Open Directory – MP3 Search Engines

OpenNap: Open Source Napster Server

Oyster – Managing, Searching and Sharing Ontology Metadata in a Peer-to-Peer Network.

P2P and the Future of Private Copying by Peter K. Yu, Michigan State University College of Law

P2PNet – Updated P2P News

P2P News from Topex

PeerCast P2P Radio

PeerMind – P2P Monitor

Piolet

Port Knocking

PowerFolder – P2P Whole Folder Synchronization

Rodi – Tiny P2P Client/Host

ScrapeTorrent

Skype

Slyck – File Sharing News and Info


Speckly – Torrent Search Simplified

Super-Peer-Based Routing and Clustering Strategies for RDF-Based Peer-to-Peer Networks

SwarmStream™ SDK

The Anthill Project

The Pirate Bay – BitTorrent Tracker

The Chord Project

The Freenet Project

The Peer-to-Peer Weblog

The Role of Peer to Peer File Sharing in Law Firm Marketing by Andy Havens


Torrent Finder

Torrent Reactor

Torrent Typhoon (TT)

Tranche Project – Secure P2P for the Scientific Community

Tribler – A Social Community That Facilitates Filesharing Through P2P


Understanding BitTorrent: An Experimental Perspective by Arnaud Legout, Guillaume Urvoy-Keller, and Pietro Michiardi

URLBlaze: URL Sharing Network

Videora – Personal Video Using P2P and RSS


WiPeer – Serverless Peer to Peer Collaboration

YaCy – Distributed P2P Based Web Indexing and Anonymous Search Engine

Yahoo! Directory Peer-to-Peer File Sharing

YAPPERS: A Peer-to-Peer Lookup Service over Arbitrary Topology

YouServ – A P2P (peer-to-peer) Web Hosting/File Sharing System



PRESENTATIONS

From Theory To Practice – Bielefeld Academic Search Engine

Gumshoe Librarian

Quick Introduction to OWL Web Ontology Language

Searching the Internet and the Invisible Web

The Future of the Internet: Bots, Blogs and News Aggregators

RESOURCES – Deep Web Research

A Roadmap for Web Mining: From Web to Semantic Web



Bot Research

BrainBoost – Question Answering Search Engine

BrightPlanet’s Deep Federation Portal™ (DFP)

Can’t Find On Google

COLLATE – Collaboratory for Annotation, Indexing and Retrieval of Digitized Historical Archive Material

Comet Way

CompletePlanet – 70,000 Databases and Speciality Search Engines

Creative Commons RDF-Enhanced Search

Cuil Search – Search 121,617,892,992 Web Pages

Cyber Cemetery

CyberFiber

Cybermetrics – First Generation Tools – Invisible Web

Data Fountains: Open Source Internet Resource Discovery and Metadata/Full-Text Generation Service

Data Mining Resources

DeepDyve – Deep Web Search Engine

Deep Web Research

Deep Web Technologies

DigiCULT Resources – Resource Discovery & Information Retrieval

digitalAGORA

Directory Resources

Direct Search

eFinancial Bot Deep Meta Search Engine

eHealthcare Bot Deep Meta Search Engine

eMarketing Bot Deep Meta Search Engine


Engineering Village 2

Hakia – Search For Meaning

Find Articles

Freely Accessible Databases for the Public

Ghostscript, Ghostview and GSview

GlobalSpec – Engineering Search Engine

Google Labs

Google Scholar

HighWire Press – Largest Repository of Free Full-Text Life Science Articles in the World

iBoogie™

IncyWincy – The Invisible Web Search Engine


Instant Information Systems

Institutional Archives Registry

Intelligence Center


Internet Archive

Internet Search Environment Number (ISEN)

Intute

Invisible Library

Kapow Web Collector

KDnuggets: Data Mining, Web Mining, and Knowledge Discovery Guide


Knowledge Discovery

Large-Scale Deep Web Integration: Incomplete Bibliography

Librarians’ Index to the Internet


Mamma – Deep Web Search Engine

Mappa.Mundi Magazine

Microsoft Web Search Research and Patents

Mining the Deep Web for Economic Data

Mooter Search

MSN Sandbox

News Group Search

New Zealand Digital Library

OAI-PMH Implementation Guidelines – Conveying rights expressions about metadata in the OAI-PMH framework


OneLook Dictionary Search

Open Archives Initiative

OpenIndex – Creating a Public Internet Index

QProber: Classifying and Searching “Hidden-Web” Text Databases – PERSIVAL Project

Quigo Technologies

Powerset – Natural Language Semantic Based Web Search Engine

Pretrieve Search – Free Public Record Search Engine

Recommended Gateway Sites for the Deep Web

Science Accelerator – Search Key Resources from DOE OSTI


Science and Technology Sources on the Internet

Scientific and Technical Information Network (STINET)

Science Commons – FirstGov for Science – Government Science Portal

Scirus – Search Engine for Scientific Information

SDARTS – A Protocol and Toolkit for Metasearching

Search Adobe PDF Online

STN International – Databases in Science and Technology

Swoogle – Semantic Bot

TechDeepWeb – How-To Guide to the Deep Web for IT Professionals

TechXtra – Indepth Academic and Scholar Search

Testbed for Information Extraction from Deep Web

The Internet Sleuth

The Deep Web

The Invisible Web

THOR: Deep Web Data Extraction

Those Dark Hiding Places: The Invisible Web Revealed


UNESCO Information Services – Databases

Wall Street Executive Library

Web Data Extractors

Web Farming

WebFountain™

Web Intelligence Consortium

Web IR & IE

WebScales: Towards a Highly Scalable Metasearch Engine

Web-Searching Agents

RESOURCES – Semantic Web Research

AIS SIGSEMIS – SIGSEMIS: Semantic Web and Information Systems

Analyzing Social Networks on the Semantic Web


Combining RDF and OWL with SOAP for Semantic Web

DARPA Agent Markup Language

DBin Project – Semantic Web P2P and/or Semantic Newsgroup Client.

DERI International – Digital Enterprise Research Institute

Digital Object Identifier (DOI)

Fabl – A Native Programming Language for the Semantic Web

FOAF Project – A Semantic Web Application

Foundation for Intelligent Physical Agents (FIPA)

Go3R – Knowledge Based Semantic Search Engine To Avoid Animal Experiments

hakia – Search for Meaning

HP Labs Semantic Web Research

Infomesh’s Semantic Web Introduction

International Journal of Metadata, Semantics and Ontologies (IJMSO)

International Journal on Semantic Web and Information Systems (IJSWIS)

Jena – A Semantic Web Framework for Java

Journal of Web Semantics

Journal of Web Semantics: Preprint Server

Knowledge Discovery


Knowledge Search

Language Engineering for the Semantic Web: A Digital Library for Endangered Languages

Magpie – The Semantic Filter and Tool For the Semantic Web

MetaData at W3C

Metadata FAQ – Metadata for Education

MindRaider – Semantic Web Outliner



OASIS – Advancing eBusiness Standards

OIL – Ontology Inference Layer

Ontologies for Education (O4E)

Ontology Matching

Ontology Metadata Vocabulary (OMV)


O’Reilly’s Semantic Web Primer

Potential Advantages Of Semantic Web For Internet Commerce by Yuxiao Zhao and Kristian Sandahl

Powerset – Natural Language Semantic Based Web Search Engine

pOWL – Semantic Web Development Plattform

Practical Semantic Analysis of Web Sites and Documents

RDF Context Tools

RDF – Resource Description Framework

Rules and Rule Markup Languages for the Semantic Web – RuleML-2003

Science and the Semantic Web

Semantic Blogging: Spreading the Semantic Web Meme

Semantic Desktop Environment – gnowsis

Semantic Email by Luke McDowell, Oren Etzioni, Alon Halevy, and Henry Levy

Semantic Interoperability of Metadata and Information in unLike Environments (SIMILE)

Semantic Knowledge Technologies and Language Computation

Semantic Markup Deconstructed Example

Semantic Routing BOF

Semantic Translator for Enhanced Retrieval by the Bremen University (BUSTER) – The Semantic Web Community Portal

Semantic Web Activity Statement

Semantic Web Application Platform – SWAP

Semantic Web Feeds

Semantic Web for AURIS-MM

Semantic Web Laboratory

Semantic Web Primer for Object-Oriented Software Developers

Semantic Web Publications

Semantic Web Roadmap

Semantic Web Services Challenge

Semantic Web W3C

SemText – Semantic Hypertext – Making Latent Semantics Blatant

SIG SEMIS Semantic Web and Information Systems

SIMAC – Foafing the Music – Semantic Interaction with Music Audio Contents

SIMILE Project – Semantic Interoperability of Metadata and Information in unLike Environments

Sindice – The Semantic Web Index

SOAPAgent – An Open SOAP Directory Project Info – OWL API

Swoogle – Semantic Bot

SWRL: A Semantic Web Rule Language Combining OWL and RuleML

Technology Review: Sir Tim Berners-Lee – The Semantic Web

The Cover Pages

The Memetic Web

The ontoprise® GmbH

The RDF Query Language (RQL)

The Semantic Grid

The Semantic Social Network by Stephen Downes

The Semantic Web: An Introduction

The Semantic Web By Tim Berners-Lee, James Hendler and Ora Lassila

The Semantic Web In Breadth

The Semantic Indexing Project – Creating Tools To Identify the Latent Knowledge Found in Text

The Semantic Web Is Your Friend

Transforming and Enriching Documents for the Semantic Web by Dietmar Roesner, Manuela Kunze, Sylke Kroetzsch

Twine – A Semantic Web Application That Allows You To Share, Organize, and Find Information

UDDI – Universal Description, Discovery, and Integration

Web Semantics: Science, Services and Agents on the World Wide Web

Web Service Modeling Ontology

Wilbur Toolkit for Semantic Web Programming

World Wide Web Reference Semantic Web

Yahoo Groups – SemanticWeb


BOT RESEARCH RESOURCES AND SITES

1st Spot

Agent Construction Tools



Agent Model Yields Leadership

Agent Portal AI


AgentSheets – Authoring Tool to Create Agents

Alarm Growing Over Bot Software by Robert Lemos

ALICEBot

Android World

Applied Soft Computing

Search Robots – The Robots.txt File

Bookmach – Track Your Favorite Subject Using Sticky Zine and Blog Search

Bot A Blog

Bots, Blogs and News Aggregators


BrowseEngine – Real-Time Meta-Data Search Engine

Build a Web Spider on Linux – A Simple Spider and Scraper Collects Internet Content

Cetus Links – Mobile Agents


Connotate – Intelligent Agent Technology and Competitive Intelligence Tools

Data Mining Resources

DataparkSearch Engine – Full-Featured Open Source Web-Based Search Engine


Deep Web Research

Design of a Parallel and Distributed Web Search Engine by Salvatore Orlando, Raffaele Perego, and Fabrizio Silvestri

Dictionary of Algorithms and Data Structures

Eliza – The Original ChatterBot

FAME (Facilitating Agents in Multiculture Exchange) Project

Fantomas Spider Spy™ The BotBase

Foundation for Intelligent Physical Agents


GeneSys Middleware

Google Guide

IEI’s Graphical Programming Toolbox

iMacros™ – Browser Based Macro Recorder and Intelligent Agent

Imagination Engines

Indexing Robot Crawler Checklist

Institute for Human and Machine Cognition (IHMC)

Intellexer – Custom Built Search Engines, Knowledge Management Tools, Natural Language Processing

International Journal of Agent-Oriented Software Engineering (IJAOSE)

Internet Mathematics


Knowledge Discovery

Koders – Source Code Search Engine

LAIR – Research Projects of the Laboratory of Applied Informatics Research

List of User-Agents (Spiders, Robots, Crawler, Browser)

Minimal-Intelligence Agents for Bargaining Behaviors in Market-Based Environments by Dave Cliff and Janet Bruten

MIT Media Lab: Software Agents

Modelling and Mining of Network Information Systems



OpenKapow – Serving Mashups For the Long Tail of the Web

Open Source Web Information Retrieval (OSWIR05)

Oxyus Search Engine

ParsCit Project – Reference String Parsing – Web Spider and Search Engine

Robots.Txt Checker – Validator for Robots.txt Files

Searchbots – Uniquely Searching the Internet

Search Engine Robots

Search Engine Watch News

Search Tools – Information Guides and News

Semantic Indexing and Search

Semantic Web


Smarter Bots

SocSciBot3 and SocSciBot 4

Spider Hunter

Spidering Hacks

Spinn3r: RSS Content, News Feeds, News Content, News Crawler and Web Crawler APIs

Structure and Interpretation of Computer Programs – Video Lectures by Hal Abelson and Gerald Jay Sussman

Supybot, A Superb Python IRC Bot

Swoogle – Semantic Bot

The Intelligent Software Agents Lab

The Lemur Toolkit – Language Modeling and Information Retrieval Research

The Search Engine Project (TSEP)

The Simon Laven Page

The Web Robots Pages

TSEP – The Search Engine Project

UMBC AgentWeb

UMBC eBiquity

Webbot – the W3C libwww Robot

Web Curator Tool (WCT)

Web Data Extractors – White Paper Link Compilation

Web Information Retrieval/Natural Language Processing Group (WING)

Web Intelligence Consortium

Web IR & IE

Words, Extended – Internet Text Information Retrieval, Extraction and Display Bot
