Deep Web Research 2007

Bots, Blogs and News Aggregators is a keynote presentation that I have been delivering for the last several years. Much of its material comes from my extensive research into the “invisible” web, or what I like to call the “deep” web. The Deep Web covers somewhere in the vicinity of 900 billion pages of information located throughout the World Wide Web in various files and formats that current Internet search engines either cannot find or have difficulty accessing. By comparison, the search engines find about 20 billion pages as of the publication date of this guide.

In the last several years, some of the more comprehensive search engines have written algorithms to search the deeper portions of the World Wide Web by attempting to find files such as .pdf, .doc, .xls, .ppt, .ps and others. These files are predominantly used by businesses to communicate information within their organization or to disseminate information to the public. Searching for this information using deeper search techniques and the latest algorithms allows researchers to obtain a vast amount of corporate information that was previously unavailable or inaccessible. Research has also shown that even deeper information can be obtained from these files by searching and accessing their “properties” information. I wrote about this research and posted it on my personal blog a few months ago.
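As a rough illustration of the file-type searching described above, the following Python sketch builds one query per document format using the `filetype:` operator. It assumes the target search engine supports that operator (several major engines do). The base URL `https://search.example/search`, the `q` parameter name, and both helper functions are hypothetical placeholders, not part of any particular engine's API.

```python
from urllib.parse import urlencode

# File extensions named in the paragraph above.
DOCUMENT_TYPES = ["pdf", "doc", "xls", "ppt", "ps"]


def build_filetype_queries(topic, extensions=DOCUMENT_TYPES):
    """Return one query string per extension, e.g. 'annual report filetype:pdf'."""
    return [f"{topic} filetype:{ext}" for ext in extensions]


def build_search_url(base_url, query):
    """Encode the query as a 'q' parameter on a (placeholder) search URL."""
    return f"{base_url}?{urlencode({'q': query})}"


if __name__ == "__main__":
    # Hypothetical endpoint; substitute the search engine you actually use.
    for q in build_filetype_queries("annual report"):
        print(build_search_url("https://search.example/search", q))
```

Running one query per format, rather than a single combined query, keeps each result set limited to one document type, which makes it easier to review formats that a general search tends to bury.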

This article and guide is designed to give you the resources you need to better understand the history of deep web research, as well as various classified resources that allow you to search the currently available web for those key nuggets of information that can be found only by understanding how to search the “deep web”.

This Deep Web Research 2007 article is divided into the following sections:

Articles, Papers, Forums, Audios and Videos
Cross Database Articles
Cross Database Search Services
Cross Database Search Tools
Peer to Peer, File Sharing, Grid/Matrix Search Engines
Presentations
Resources – Deep Web Research
Resources – Semantic Web Research
Bot Research Resources and Sites
Subject Tracer™ Information Blogs

Articles, Papers, Audios and Videos (Current and Historical)

Academic and Scholar Search Engines and Sources

A Crisis for Web Preservation by Florence Olsen

All of OCLC’s WorldCat Heading Toward the Open Web by Barbara Quint

An Interactive Clustering-based Approach to Integrating Source Query interfaces on the Deep Web by W. Wu, C. Yu, A. Doan, W. Meng

Annotation for the Deep Web

Automatic Extraction of Web Search Interfaces for Interface Schema Integration by H. He, W. Meng, C. Yu, Z. Wu

Automatic Information Extraction From Semi-Structured Web Pages By Pattern Discovery

Automatic Meaning Discovery Using Google by Rudi Cilibrasi and Paul M. B. Vitanyi

Benevolent “Virus” Helps Reveal the Hidden Web

Beyond Google: The Invisible Web – Tools for Teaching the Invisible Web

Bibliomining Bibliography

Bibliomining for Automated Collection Development in a Digital Library Setting: Using Data Mining to Discover Web-Based Scholarly Research Works by Dr. Scott Nicholson

Bot Research

Client-Side Deep Web Data Extraction

Clustering E-Commerce Search Engines by Q. Peng, W. Meng, H. He, C. Yu

Common Information Environment Seeks To Reveal the Hidden Web

Crawling the Hidden Web by Sriram Raghavan and Hector Garcia-Molina

Current Awareness Discovery Tools on the Internet

Data Extraction and Label Assignment for Web Databases

Deep Content – Guide To Effective Searching of the Internet

Deep Web – Exploring the Secrets of the Hidden Internet by Marcus P. Zillman, M.S., A.M.H.A., – 23 minutes – Internet/Technology Channel

Deep Web Navigation in Web Data Extraction

Desperately seeking Web Search 2.0

DigiCULT Thematic Issue 6
Resource Discovery Technologies for the Heritage Sector, June 2004
Download Thematic Issue 6: HiRes .pdf (4.9 MB)

Diving in the Deep End of the Web by Suzanne Ross

Easy Topic Maps

Efficient and Effective Metasearch Project

Farewell, Web 1.0! We Hardly Knew Ye by Steven Levy

Fugitive Documents Evade Federal Depositories

Google Teams Up with 17 Colleges to Test Searches of Scholarly Materials By Jeffrey R. Young

Graph Structure in the Web

Gray Literature: Resources for Locating Unpublished Research by Brian S. Mathews

Gray Literature Subject Guide

Guardian Unlimited: Search for the Invisible Web

Indexing Deep Web Content By Paul Bruemmer

Information Detective – The Invisible Web Mini-Tutorial Streaming Video

Information Foraging and Extraction Techniques for Internet-Based Literature and Data

Information Retrieval and the Semantic Web by Tim Finin, James Mayfield, Clay Fink, Anupam Joshi, and R. Scott Cost

In Search of the Deep Web

Invisible Web Gets Deeper

Invisible Web Revealed

IR and IE on the Web – PhD and MSc Dissertations

JEP: The Deep Web

Library Journal: Breaking Through the Invisible Web

LLRX: Book Review: The Invisible Web

LLRX: Deep Web Research

LLRX: Deep Web Research 2005

LLRX: Deep Web Research 2006

LLRX: Mining Deeper Into the Invisible Web

LLRX: ResearchWire: Exposing the Invisible Web

Metadata? Thesauri? Taxonomies? Topic Maps! by Lars Marius Garshol

Mining Newsgroups Using Networks Arising From Social Behavior

Mining the Deep Web With Specialized Drills

Mining the Peanut Gallery: Opinion Extraction and Semantic Classification of Product Reviews

Mining Topic-Specific Concepts and Definitions on the Web

Modelling and Mining of Network Information Systems Publications

Net Plan Builds in Search by Kimberly Patch

Noisy Channels Models Provide Short Answers to FAQs

Old Search Engine, the Library, Tries to Fit Into a Google World

Online or Invisible?

OntoMiner: Bootstrapping and Populating Ontologies From Domain Specific Web Sites

OpenIndex – Creating a Public Internet Index

PhysicsWeb: The Physics of the Web

Publications about Web Analysis, Web Search, Citation Indexing, Digital Libraries, Machine Learning, Neural Networks [Steve Lawrence, Google Labs]

QProber: Classifying and Searching “Hidden-Web” Text Databases

Research Beyond Google: 119 Authoritative, Invisible, and Comprehensive Resources

Researcher Retrain Thyself

Researchers Map of the Web

Scientific American: Featured Article: The Semantic Web

Scraping the Web for Implied Data

Search Engine Hunts for Gold Beneath the Surface of the Web

Search Engine Meeting 2005 Boston, Massachusetts – White Papers and Presentations

Search Engine Technology and Digital Libraries

Searching the Deep Web

Searching the Deep Web – Video

Searching the Internet (White Paper, Audio and Video)

Seeing through the ‘invisible’ Web

Semantic Web Content Accessibility Guidelines for Current Research Information Systems (CRIS) by A. Lopatenko

Smart Search – Advanced Search Engines Link Many Data Sources

Structured Databases on the Web: Observations and Implications

Testbed for Information Extraction from Deep Web

The Deep Web

The Deep Web: Surfacing Hidden Value by Michael K. Bergman

The Future Of News: The Digital Information Librarian

The Hidden Potential of the Web

The Invisible Web by Chris Sherman

The Invisible Web: What it is, Why it exists, How to find it, and Its Inherent Ambiguity

The Invisible Web: Where Search Engines Fear To Go

The Mechanics of Deep Net Meta Search

The Seventh Asia Pacific Web Conference (APWeb05)

Topological Measures and Maps Of the Web

Towards Automatic Incorporation of Search Engines Into A Large-Scale Metasearch Engine

Traffic-Based Feedback on the Web by Jonathan Aizen, Daniel Huttenlocher, Jon Kleinberg, and Antal Novak

UMBC – AgentNews

Understanding Metadata

Using the Internet As a Dynamic Resource Tool for Knowledge Discovery

Web Characterization Project

Web Data Extractors White Paper Link Compilation

Web Pages Search Engine Based on DNS by Wang Liang, Guo Yi-Ping, and Fang Ming

WebScales: Towards a Highly Scalable Metasearch Engine

What Is the Deep Web? A What Is Podcast 15 Minute Interview with Marcus P. Zillman

What is the Invisible Web? A Crawler Perspective by Natalia Arroyo, Laboratorio de Internet

WISE-Cluster: Clustering E-Commerce Search Engines Automatically by Q. Peng, W. Meng, H. He, C. Yu

Yahoo and the Deep Web

Cross Database Articles

Digital Libraries- Cross-Database Search: One-Stop Shopping

Search Tools Reports: Searching for Text Information in Databases

The Right Solution: Federated Search Tools by Roy Tennant

UK Web Archiving Consortium

Cross Database Search Services

Entrez – The Life Sciences Cross-Database Search Engine

EnergyFiles – Subject Pathways

GPO Access – Search Across Multiple Databases

King County Library System

NLM Gateway Search


The Metasearch Infrastructure Project

Cross Database Search Tools

Apple – Mac – Sherlock

Blue Angel Technologies

Bright Planet


Cross Database Search Tools Summary

DbVisualizer – The Universal Database Tool

Dublin Core Metadata Initiative (DCMI)

EEVL Xtra – Cross Database Search


Gold Rush – Database Search Tool


MetaSearch Initiative

mod_oai Project – Getting OAI-PMH For Free


Peter’s PolySearch Engines

PBCore – The Public Broadcasting Metadata Dictionary

Registry of Library Knowledge Bases

Search Federal Research and Development

SRU – Search/Retrieve via URL

STINET Multisearch

The Flamenco Search Interface Project

VIAF: The Virtual International Authority File



Peer To Peer (P2P), File Sharing, Grid and Matrix Search Engines


ALPINE Network – SourceForge: Project

An Efficient Scheme for Query Processing on Peer-to-Peer Networks

Azureus – Java Bittorrent Client


Between Rhizomes and Trees: P2P Information Systems by Bryn Loban



BitTorrent FAQ and Guide

Bit Torrent Official Site and Search Engine

Bitzi – The Free Universal Media Catalog

Blog Torrent


BotSpot®: File-sharing Bots

BTbot – BitTorrent Search Engine

Coral – The Coral P2P Content Distribution Network

Capn’s PHP Gnutella Search

Current P2P Search Implementations – P2P Networks – XDCC Search / File Sharing Portal

Deepnet Explorer – P2P/RSS-ATOM Web Browser

Distributed Search Engines

Distributed Search in P2P Networks


Free Haven Project

FuzzBox: Tangent Research Artificial Intelligence and Robotics

Gnougat: Fully decentralised file caching from the JXTA Project

GNUnet – GNU Project – Free Software Foundation (FSF)



GRACE – GRid seArch and Categorization Engine

Grid Resources


Grouper – P2P Personal Media File Sharing – Open Source, Distributed Internet Crawler!

Hamachi – Secure Mediated Peer To Peer

HyperCuP – Shaping Up Peer-to-Peer Networks

Ian Clarke’s Blog

IM and P2P Threat Center


International Workshop on Peer-to-Peer Knowledge Management (P2PKM)

Internet Movie Database (IMDb)

isoHunt – IRC and Bit Torrent Search Engine

JXTA Project

Kademlia: A Peer-to-peer Information System Based on the XOR Metric

Kazaa Media Desktop

Legal P2P File Sharing Software



LionShare P2P Project – Legitimate File-Sharing Among Individuals and Educational Institutions


Mercora IM P2P Radio

MoleSter – A Tiny File-Sharing Application


Morpheus: Peer-to-Peer File Sharing Software


MysterNetworks – The Evolution of Peer-to-Peer

NeuroGrid – P2P Search

Open Directory – File Sharing

Open Directory – MP3 Search Engines

OpenNap: Open Source Napster Server

Oyster – Managing, Searching and Sharing Ontology Metadata in a Peer-to-Peer Network.

P2P and the Future of Private Copying by Peter K. Yu, Michigan State University College of Law

P2PNet – Updated P2P News

P2P News from Topex

PeerCast P2P Radio

PeerMind – P2P Monitor


Port Knocking

PowerFolder – P2P Whole Folder Synchronization

Project JXTA

Rodi – Tiny P2P Client/Host





Slyck – File Sharing News and Info


Streamload – Share Videos and Photos – Online MP3 Storage and Access

Super-Peer-Based Routing and Clustering Strategies for RDF-Based Peer-to-Peer Networks

Super Powered Peer To Peer

SwarmStream™ SDK

The Anthill Project

The Pirate Bay – BitTorrent Tracker

The Chord Project

The Freenet Project

The Peer-to-Peer Weblog

The Role of Peer to Peer File Sharing in Law Firm Marketing by Andy Havens


Torrent Finder

Torrent Reactor

Torrent Typhoon (TT)


Understanding BitTorrent: An Experimental Perspective by Arnaud Legout, Guillaume Urvoy-Keller, and Pietro Michiardi

URLBlaze: URL Sharing Network

Videora – Personal Video Using P2P and RSS


WiredReach – Powering the User Centric Web

Yahoo! Directory Peer-to-Peer File Sharing

YAPPERS: A Peer-to-Peer Lookup Service over Arbitrary Topology

YouServ – A P2P (peer-to-peer) Web Hosting/File Sharing System



Presentations

From Theory To Practice – Bielefeld Academic Search Engine

Gumshoe Librarian Presentations Series, by Sabrina I. Pacifici and Barbara Fullerton

Information Detective – Online Streaming Tutorial Videos On Searching the Internet including the Deep and Invisible Web

Quick Introduction to OWL Web Ontology Language

Searching the Deep Web – Dudley Knox Library Internet Guides – PowerPoint Slides

Searching the Internet

Searching the Internet: Using Brains and Bots

Seeing the Invisible Web


Resources – Deep Web Research

A Roadmap for Web Mining: From Web to Semantic Web



Bot Research

BrainBoost – Question Answering Search Engine

BrightPlanet’s Deep Federation Portal™ (DFP)

Can’t Find On Google

CiteLine Professional

COLLATE – Collaboratory for Annotation, Indexing and Retrieval of Digitized Historical Archive Material

Comet Way

CompletePlanet – 70,000 Databases and Speciality Search Engines

Creative Commons RDF-Enhanced Search

Cyber Cemetery


Cybermetrics – First Generation Tools – Invisible Web

Data Fountains: Open Source Internet Resource Discovery and Metadata/Full-Text Generation Service

Data Mining Resources

Deep Web Research

Deep Web Search

Deep Web Technologies

DigiCULT Resources – Resource Discovery & Information Retrieval


Direct Search

EEVL’s Ejournal Search Engines


Engineering Village 2

Find Articles

Freely Accessible Databases for the Public

Ghostscript, Ghostview and GSview

GlobalSpec – Engineering Search Engine

Google Labs

Google Scholar

HighWire Press – Largest Repository of Free Full-Text Life Science Articles in the World


IncyWincy – The Invisible Web Search Engine


Instant Information Systems

Institutional Archives Registry

Intelligence Center


Internet Archive


Invisible Library

Kapow Web Collector

KDnuggets: Data Mining, Web Mining, and Knowledge Discovery Guide


Knowledge Discovery

Large-Scale Deep Web Integration: Incomplete Bibliography

Librarians’ Index to the Internet


Mamma – Deep Web Health Search Engine

Mappa.Mundi Magazine

Medical Databases Online

Microsoft Web Search Research and Patents

Mining the Deep Web for Economic Data

Mooter Search

MSN Sandbox


News Group Search

New Zealand Digital Library

OAI-PMH Implementation Guidelines – Conveying rights expressions about metadata in the OAI-PMH framework


OneLook Dictionary Search

Open Archives Initiative

OpenIndex – Creating a Public Internet Index

Open WorldCat-enabled Web Tools

QProber: Classifying and Searching “Hidden-Web” Text Databases – PERSIVAL Project

Quigo Technologies

Pretrieve Search – Free Public Record Search Engine

Recommended Gateway Sites for the Deep Web


Resource Discovery Network

Science and Technology Sources on the Internet

Scientific and Technical Information Network (STINET)

Science Commons – FirstGov for Science – Government Science Portal

Scirus – Search Engine for Scientific Information

SDARTS – A Protocol and Toolkit for Metasearching

Search Adobe PDF Online

STN International – Databases in Science and Technology

TechXtra – Indepth Academic and Scholar Search

Testbed for Information Extraction from Deep Web

The Internet Sleuth

The Deep Web

The Invisible Web

THOR: Deep Web Data Extraction

Those Dark Hiding Places: The Invisible Web Revealed


UNESCO Information Services – Databases

Wall Street Executive Library

Web Data Extractors

Web Farming


Web Intelligence Consortium

Web IR & IE

WebScales: Towards a Highly Scalable Metasearch Engine

Web-Searching Agents


Resources – Semantic Web Research

AIS SIGSEMIS – SIGSEMIS: Semantic Web and Information Systems

Analyzing Social Networks on the Semantic Web


Combining RDF and OWL with SOAP for Semantic Web

Cypher – Plain Language Access to the Semantic Web

DARPA Agent Markup Language

DBin Project – Semantic Web P2P and/or Semantic Newsgroup Client.

DERI International – Digital Enterprise Research Institute

Digital Object Identifier (DOI)

Dublin Core Services

Fabl – A Native Programming Language for the Semantic Web

Foundation for Intelligent Physical Agents (FIPA)

The FOAF Project – A Semantic Web Application

HP Labs Semantic Web Research

Infomesh’s Semantic Web Introduction

International Journal of Metadata, Semantics and Ontologies (IJMSO)

Jena – A Semantic Web Framework for Java

Journal of Web Semantics: Preprint Server


Knowledge Search

Language Engineering for the Semantic Web: A Digital Library for Endangered Languages

Magpie – The Semantic Filter and Tool For the Semantic Web

MetaData at W3C

Metadata FAQ – Metadata for Education

MindRaider – Semantic Web Outliner



OASIS – Advancing eBusiness Standards

OIL – Ontology Inference Layer

Ontologies for Education (O4E)

Ontology Matching

Ontology Metadata Vocabulary (OMV)


O’Reilly’s Semantic Web Primer

Potential Advantages Of Semantic Web For Internet Commerce by Yuxiao Zhao and Kristian Sandahl

pOWL – Semantic Web Development Plattform

Practical Semantic Analysis of Web Sites and Documents

RDF Context Tools

RDF – Resource Description Framework

RDFWeb: Friend of a Friend (FOAF) Project

Rules and Rule Markup Languages for the Semantic Web – RuleML-2003

Science and the Semantic Web

Semantic Blogging: Spreading the Semantic Web Meme

gnowsis – Semantic Desktop Environment

Semantic Email by Luke McDowell, Oren Etzioni, Alon Halevy, and Henry Levy

Semantic Indexing

Semantic Interoperability of Metadata and Information in unLike Environments (SIMILE)

Semantic Knowledge Technologies and Language Computation

Semantic Markup Deconstructed Example

Semantic Planet Weblog

Semantic Routing BOF

Semantic Translator for Enhanced Retrieval by the Bremen University (BUSTER) – The Semantic Web Community Portal

Semantic Web Activity Statement

Semantic Web Application Platform – SWAP

Semantic Web for AURIS-MM

Semantic Web Laboratory

Semantic Web Primer for Object-Oriented Software Developers

Semantic Web Publications

Semantic Web Roadmap

Semantic Web Services Challenge 2006

Semantic Web W3C

SemText – Semantic Hypertext – Making Latent Semantics Blatant

SIG SEMIS Semantic Web and Information Systems

SIMAC – Foafing the Music – Semantic Interaction with Music Audio Contents

SIMILE Project – Semantic Interoperability of Metadata and Information in unLike Environments

SOAPAgent – An Open SOAP Directory Project Info – OWL API

Swoogle – Semantic Bot

SWRL: A Semantic Web Rule Language Combining OWL and RuleML

Technology Review: Sir Tim Berners-Lee – The Semantic Web

The Cover Pages

The Memetic Web

The ontoprise® GmbH

The RDF Query Language (RQL)

The Semantic Grid

The Semantic Social Network by Stephen Downes

The Semantic Web: An Introduction

The Semantic Web By Tim Berners-Lee, James Hendler and Ora Lassila

The Semantic Web In Breadth

The Semantic Indexing Project – Creating Tools To Identify the Latent Knowledge Found in Text

The Semantic Web Is Your Friend

Transforming and Enriching Documents for the Semantic Web by Dietmar Roesner, Manuela Kunze, Sylke Kroetzsch

UDDI – Universal Description, Discovery, and Integration

Web Semantics: Science, Services and Agents on the World Wide Web

Web Service Modeling Ontology

Wilbur Toolkit for Semantic Web Programming


World Wide Web Reference Semantic Web

Yahoo Groups – SemanticWeb


Bot Research Resources and Sites

1st Spot

Agent-Based Software Development

Agent Construction Tools



Agent Model Yields Leadership

Agent Portal AI


Agents Portal

Alarm Growing Over Bot Software by Robert Lemos


Android World

Applied Soft Computing

B.4.1 Search Robots – The Robots.txt File

Bot A Blog



Bots, Blogs and News Aggregators


Build a Web Spider on Linux – A Simple Spider and Scraper Collects Internet Content

Cetus Links – Mobile Agents


Data Mining Resources

Deep Web Research

Design of a Parallel and Distributed Web Search Engine by Salvatore Orlando, Raffaele Perego, and Fabrizio Silvestri

Dictionary of Algorithms and Data Structures

Eliza – The Original ChatterBot

FAME (Facilitating Agents in Multiculture Exchange) Project

Fantomas Spider Spy™ The BotBase

Foundation for Intelligent Physical Agents


GeneSys Middleware

Google Guide

Indexing Robot Crawler Checklist

Institute for Human and Machine Cognition (IHMC)

Intellexer – Custom Built Search Engines, Knowledge Management Tools, Natural Language Processing

Internet Agents – CWS Apps

Internet Mathematics

Journal of Mathematical Modelling and Algorithms


Knowledge Discovery

Koders – Source Code Search Engine


LAIR – Research Projects of the Laboratory of Applied Informatics Research

List of User-Agents (Spiders, Robots, Crawler, Browser)

Minimal-Intelligence Agents for Bargaining Behaviors in Market-Based Environments by Dave Cliff and Janet Bruten

MIT Media Lab: Software Agents

Modelling and Mining of Network Information Systems


Open Source Web Information Retrieval (OSWIR05)

Oxyus Search Engine – Web Spider and Search Engine

Robots, Spiders and Other User Agents: A Resource for WebMasters

Robots.Txt Checker – Validator for Robots.txt Files

Search Engine Robots

Search Engine Watch News

Search Tools – Information Guides and News

Semantic Indexing

Semantic Web


Smarter Bots


Spider Hunter


Spidering Hacks

Structure and Interpretation of Computer Programs – Video Lectures by Hal Abelson and Gerald Jay Sussman

Supybot, A Superb Python IRC Bot

Swoogle – Semantic Bot

The CGI Resource Index: Programs and Scripts: Perl: Searching

The Intelligent Software Agents Lab

The Search Engine Project (TSEP)

The Simon Laven Page

The Web Robots Pages

UMBC AgentWeb

UMBC eBiquity

Webbot – the W3C libwww Robot

Web Data Extractors – White Paper Link Compilation

Web Intelligence Consortium

Web IR & IE

Worm Radar


Subject Tracer™ Information Blogs

The Subject Tracer™ Information Blogs, created and developed by the Virtual Private Library™, combine the best of the latest tools on the Internet. Using bots, blogs and news aggregators, the Subject Tracer™ Information Blogs generate RSS feeds with the latest resources, creating a current flow of information resources through niche subject tracers. I am proud to be the creator of the Internet’s first Subject Tracer™ Information Blogs:

Virtual Private Library

Accessibility Resources

Agriculture Resources

Artificial Intelligence Resources

Astronomy Resources

Auction Resources

Biological Informatics

Biotechnology Resources

Bot Research

Business Intelligence Resources


Data Mining Resources

Deep Web Research

Directory Resources

eCommerce Resources

Elder Resources

Employment Resources

Entrepreneurial Resources

Financial Sources

Finding People

Games Resources

Genealogy Resources

Grant Resources

Grid Resources

Healthcare Resources

Information Futures Markets

Information Quality Resources

Internet Alerts

Internet Demographics

Internet Experts

Internet Hoaxes

Journalism Resources

Knowledge Discovery

Military Resources

Outsourcing/Offshoring Information and Resources

Privacy Resources

Reference Resources

Research Resources


Script Resources


Social Informatics

Statistics Resources

Student Research

Theology Resources

Tutorial Resources

World Wide Web Reference

