Deep Web Research and Discovery Resources 2016

The Deep Web covers somewhere in the vicinity of trillions upon trillions of pages of information located throughout the World Wide Web in various files and formats that current Internet search engines either cannot find or have difficulty accessing. By comparison, current search engines index hundreds of billions of pages at the time of this publication.

In the last several years, some of the more comprehensive search engines have written algorithms to search the deeper portions of the World Wide Web by attempting to find files such as .pdf, .doc, .xls, .ppt, .ps, and others. These files are predominantly used by businesses to communicate information within their organization or to disseminate information to the external world. Searching for this information using deeper search techniques and the latest algorithms allows researchers to obtain a vast amount of corporate information that was previously unavailable or inaccessible. Research has also shown that even deeper information can be obtained from these files by searching and accessing their “properties” (metadata) information.
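As a rough illustration of the file-type searching described above, the sketch below builds a query that limits results to the document formats named in the text. It assumes the search engine accepts the widely used `filetype:` and `site:` operators; support and exact syntax vary between engines. The function names and the `search.example` address are hypothetical placeholders, not part of any real service.

```python
from urllib.parse import urlencode

# File extensions named in the text above (without the leading dot).
DOCUMENT_TYPES = ("pdf", "doc", "xls", "ppt", "ps")


def build_filetype_query(terms, filetypes=DOCUMENT_TYPES, site=None):
    """Return a query string restricted to the given file types.

    Assumes the target engine understands `filetype:` and `site:`.
    """
    type_clause = " OR ".join(f"filetype:{ext}" for ext in filetypes)
    parts = [terms.strip(), f"({type_clause})"]
    if site:
        parts.append(f"site:{site}")
    return " ".join(parts)


def build_search_url(query, base_url="https://search.example/search"):
    """URL-encode the query; base_url is a placeholder, not a real service."""
    return f"{base_url}?{urlencode({'q': query})}"


query = build_filetype_query("annual report", site="example.com")
print(query)
print(build_search_url(query))
```

Passing a shorter `filetypes` tuple, for example `("pdf",)`, narrows the search to a single format, and the `site` argument confines it to one organization's domain.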

This Deep Web Research and Discovery Resources 2016 report and guide is divided into the following sections:

Articles, Papers, Forums, Audios and Videos
Cross Database Articles
Cross Database Search Services
Cross Database Search Tools
Peer to Peer, File Sharing, Grid/Matrix Search Engines
Resources – Deep Web Research
Resources – Semantic Web Research
Bot and Intelligent Agent Research Resources and Sites


Articles, Papers, Forums, Audios and Videos

99 Resources to Research & Mine the Invisible Web by Jessica Hupp

Academic and Scholar Search Engines and Sources

All of OCLC’s WorldCat Heading Toward the Open Web by Barbara Quint

An Interactive Clustering-based Approach to Integrating Source Query interfaces on the Deep Web by W. Wu, C. Yu, A. Doan, W. Meng

Annotation for the Deep Web

An Up-To-Date Layman’s Guide To Accessing The Deep Web

Automatic Extraction of Web Search Interfaces for Interface Schema Integration by H. He, W. Meng, C. Yu, Z. Wu

Automatic Information Extraction From Semi-Structured Web Pages By Pattern Discovery

Automatic Meaning Discovery Using Google by Rudi Cilibrasi and Paul M. B. Vitanyi

Beyond Google: The Invisible Web – Tools for Teaching the Invisible Web

Bibliomining for Automated Collection Development in a Digital Library Setting: Using Data Mining to Discover Web-Based Scholarly Research Works by Dr. Scott Nicholson

Bot Research

BrightPlanet Launches Deep Web Data Feeds: Global News Data Feed Is First Available Data Feed

Client-Side Deep Web Data Extraction

Clustering E-Commerce Search Engines by Q. Peng, W. Meng, H. He, C. Yu

Common Deep Web and Big Data Questions Answered (Part 1)

Common Deep Web and Big Data Questions Answered (Part 2)

Creating Intelligence from Big Data

Current Awareness Discovery Tools on the Internet

Data Extraction and Label Assignment for Web Databases

Deep Web – Exploring the Secrets of the Hidden Internet by Marcus P. Zillman, M.S., A.M.H.A. – 23 minutes – Internet/Technology Channel

Deep Web: Legal Due Diligence by Lisa Brownlee

Desperately Seeking Web Search 2.0

Digging Deeper into Deep Web Databases by Breaking Through the Top-k Barrier

DigiCULT Thematic Issue 6: Resource Discovery Technologies for the Heritage Sector, June 2004

Effective and Scalable Metasearch Project

Efficient Deep Web Crawling Using Reinforcement Learning

Everything You Need To Know About the Deep Web In One Simple Infographic

Experiences In Crawling Deep Web In The Context Of Local Search

Grey Literature

Grey Literature Network Service (GreyNet)

How To Browse the Deep Web Using Tor on iPhone and iPad by Sumeet Sharma

Information Retrieval and the Semantic Web by Tim Finin, James Mayfield, Clay Fink, Anupam Joshi, and R. Scott Cost

In Search of the Deep Web

Invisible Web Gets Deeper

Invisible Web Revealed

IR and IE on the Web – PhD and MSc Dissertations

Journey Into the Hidden Web: A Guide for New Researchers by Ryan Dube

Lessons from the Deep Web That Could Lead To a More Secure IoT by Revathi Subramanian

LLRX: Book Review: The Invisible Web

LLRX: Deep Web Research

LLRX: Deep Web Research 2005

LLRX: Deep Web Research 2006

LLRX: Deep Web Research 2007

LLRX: Deep Web Research 2008

LLRX: Deep Web Research 2009

LLRX: Deep Web Research 2010

LLRX: Deep Web Research 2011

LLRX: Deep Web Research 2012

LLRX: Deep Web Research 2013

LLRX: Deep Web Research 2014

LLRX: Deep Web Research 2015

LLRX: Mining Deeper Into the Invisible Web

LLRX: ResearchWire: Exposing the Invisible Web

Metadata? Thesauri? Taxonomies? Topic Maps! by Lars Marius Garshol

Mining Newsgroups Using Networks Arising From Social Behavior

Mining the Deep Web: Search Strategies That Work by Lee Ratzan

Mining Topic-Specific Concepts and Definitions on the Web

Net Plan Builds in Search by Kimberly Patch

NYU-Poly Researcher Awarded DARPA Contract To Explore the Deep Web by Rhea Kelly

Onion Browser – An Open-Source Privacy Enhancing Web Browser for iOS

Online or Invisible? [Requires Login]

OntoMiner: Bootstrapping and Populating Ontologies From Domain Specific Web Sites

OpenIndex – Creating a Public Internet Index

Out-googling Google: Federated Searching and the Single Search Box

QProber: Classifying and Searching “Hidden-Web” Text Databases

Really Private Browsing: An Unofficial User’s Guide to Tor by Andre Infante

Research Beyond Google: 119 Authoritative, Invisible, and Comprehensive Resources

Search Engine Meeting

Search Engine Technology and Digital Libraries

Searching the Deep Web by Alex Wright

Searching the Deep Web

Searching the Deep Web – Video

Searching the Internet (White Paper, Audio and Video)

Search Interfaces on the Web: Querying and Characterizing by Denis Shestakov

Seeing through the ‘invisible’ Web

Semantic Web Content Accessibility Guidelines for Current Research Information Systems (CRIS) by A. Lopatenko

Structured Databases on the Web: Observations and Implications

Testbed for Information Extraction from Deep Web

The Deep Web: Surfacing Hidden Value by Michael K. Bergman

The Future Of News: The Digital Information Librarian

The Hidden Potential of the Web

The Invisible Web: What it is, Why it exists, How to find it, and Its Inherent Ambiguity

The Invisible Web: Where Search Engines Fear To Go

The New Search Engines Shining a Light On the Deep Web by Carola Frediani

The Ultimate Guide to the Invisible Web

The Virtual Private Library™ and The Deep Web Video by Melissa Barker

Timeline of Events Related to the Deep Web

Topological Measures and Maps Of the Web

TOR For Newbies – When Should You Use It?

Toward the Semantic Deep Web by James Geller, Soon Ae Chun, and Yoo Jung An

Towards Automatic Incorporation of Search Engines Into A Large-Scale Metasearch Engine

Traffic-Based Feedback on the Web by Jonathan Aizen, Daniel Huttenlocher, Jon Kleinberg, and Antal Novak

Travel Industry and Deep Web: Exclusive Interview with Marcus P. Zillman

UMBC – AgentNews

Understanding Metadata

Understanding the Deep Web In 10 Minutes

Using the Internet As a Dynamic Resource Tool for Knowledge Discovery

Web Characterization Activity

Web Data Extractors White Paper Link Compilation

Web Pages Search Engine Based on DNS by Wang Liang, Guo Yi-Ping, and Fang Ming

WebScales: Towards a Highly Scalable Metasearch Engine

What Is the Deep Web? A WhatIs Podcast 15 Minute Interview with Marcus P. Zillman

What is the Invisible Web? A Crawler Perspective by Natalia Arroyo, Laboratorio de Internet

Wikipedia – Deep Web

WISE-Cluster: Clustering E-Commerce Search Engines Automatically by Q. Peng, W. Meng, H. He, C. Yu


Cross Database Articles

Search Tools Reports: Searching for Text Information in Databases

The Right Solution: Federated Search Tools by Roy Tennant

UK Web Archiving Consortium


Cross Database Search Services

EnergyFiles – Subject Pathways [Oil and Gas Production and Forecasting]

FDsys – Search Across Multiple Government Databases

King County Library System

NLM Gateway Search

SUMSearch 2 [Health Sciences]


Cross Database Search Tools

BrightPlanet – Deep Web Intelligence


Dieselpoint Java Search and Navigation Software

Dublin Core Metadata Initiative (DCMI)

EEVL Xtra – Cross Database Search

Gold Rush – Database Search Tool


MetaSearch Initiative


Peter’s PolySearch Engines

PBCore – The Public Broadcasting Metadata Dictionary

Registry of Library Knowledge Bases

Search Federal Research and Development

SRU – Search/Retrieve via URL

The Flamenco Search Interface Project

VIAF: The Virtual International Authority File


Peer to Peer, File Sharing, Grid/Matrix Search Engines

ALPINE Network – SourceForge: Project

Azureus – Vuze Java Bittorrent Client

BadBlue [Uncensored News]

Between Rhizomes and Trees: P2P Information Systems by Bryn Loban


Bitmessage – P2P Communication Protocol To Send Encrypted Messages

Bit Torrent Official Site and Search Engine

Coral – The Coral P2P Content Distribution Network

Capn’s PHP Gnutella Search [Only code is available for download]

ClearBits – BitTorrent distribution of open licensed media

Distributed Search Engines

Distributed Search in P2P Networks

DirecTransFile – P2P File Transfers

FAROO – P2P Web Search

FilesOverMiles – Browser to Browser File Sharing (P2P)

Filetopia – File sharing tool with public key encryption

Free Haven Project

Frost Project – Freenet Messaging and File Sharing Client

FuzzBox: Tangent Research Artificial Intelligence and Robotics

GNUnet – Secure P2P Networking – Free Software Foundation (FSF)

Grid, Distributed and Cloud Computing Resources

GNU GRUB – Multiboot Boot Loader

Ian Clarke’s Blog

infinit – Re-imagining the Way You Send Files

International Workshop on Peer-to-Peer Knowledge Management (P2PKM)

Internet Movie Database (IMDb)

Kademlia: A Peer-to-peer Information System Based on the XOR Metric [Citeseer Login Required]

Lphant – The Full P2P Solution

MoleSter – A Tiny File-Sharing Application

MusicBrainz – Open Music Encyclopedia

MysterNetworks – The Evolution of Peer-to-Peer

Open Directory – File Sharing

Open Directory – MP3 Search Engines

OpenNap: Open Source Napster Server

P2P and the Future of Private Copying by Peter K. Yu, Michigan State University College of Law

Peer-To-Peer Wikipedia

Port Knocking

PowerFolder – P2P Whole Folder Synchronization

Rodi – Tiny P2P Client/Host


Slyck – File Sharing News and Info

Stealth Mode Online Privacy Resources

Super-Peer-Based Routing and Clustering Strategies for RDF-Based Peer-to-Peer Networks [CiteSeer Login Required]

Swarm – A Transparently Scalable Distributed Programming Language

The Anthill Project

The Deep Web: Shutdowns, New Sites, New Tools by Vincenzo Ciancaglini

The Freenet Project

The Peer-to-Peer Weblog [Last updated 2010]

The Role of Peer to Peer File Sharing in Law Firm Marketing by Andy Havens


Torrent Reactor

Transmission – Fast, Easy and Free BitTorrent Client

Tribler – A Social Community That Facilitates Filesharing Through P2P


Understanding BitTorrent: An Experimental Perspective by Arnaud Legout, Guillaume Urvoy-Keller, and Pietro Michiardi

WASTE (Secure P2P communication)

YaCy – Distributed P2P Based Web Indexing and Anonymous Search Engine

YAPPERS: A Peer-to-Peer Lookup Service over Arbitrary Topology [CiteSeer Login Required]

YouServ – A P2P (peer-to-peer) Web Hosting/File Sharing System

Zebra – Structured Text Indexing and Retrieval

Zilok – Peer To Peer Rental Marketplace


Deep Web

From Theory To Practice – Bielefeld Academic Search Engine

Gumshoe Librarian

Searching the Internet Whitepaper

The Virtual Private Library™ and The Deep Web Video by Melissa Barker

RESOURCES – Deep Web Research

AEON (Automatic Evaluation of ONtologies) [code has been archived by Google]

AnkaSearch – Meta Search and Deep Web Search Desktop Tool

Anonymous Web Browsing – Wikipedia

An Up-To-Date Layman’s Guide To Accessing The Deep Web

A Roadmap for Web Mining: From Web to Semantic Web

BASE – Bielefeld Academic Search Engine

Biznar – Deep Federated Search

Bot Research

BrightPlanet – Deep Web Intelligence

Catalog of U.S. Government Publications (CGP)

Cazoodle – Search, Integrate, and Organize — The Real World

Creative Commons RDF-Enhanced Search

Cyber Cemetery

CyberGhost – One of the World’s Most Trusted and Secure Virtual Private Networks

Cybermetrics – First Generation Tools – Invisible Web

Data Mining Resources

DeepDive – Analyze Data On a Deeper Level Than Ever Before


Deep Web Research Resources

Deep Web Search

Deep Web Technologies – federated search

Directory Resources

eFinancial Bot Deep Meta Search Engine

eGreenBot – Green Resources Search Engine

eHealthcare Bot Deep Meta Search Engine

eMarketing Bot Deep Meta Search Engine


Engineering Village

Falcons Semantic Web Search Engine

Federated Search Blog

Freely Accessible Databases for the Public

Google Scholar – Best Deep Web TOR Onion Links

HighWire Press – Largest Repository of Free Full-Text Life Science Articles in the World

Internet Archive

Kapow Web Collector

Karma – Data Integration Tool

KDnuggets: Data Mining, Web Mining, and Knowledge Discovery Guide

Knowledge Discovery

Large-Scale Deep Web Integration: Incomplete Bibliography

Linked Data – Connect Distributed Data Across the Web

LinkingOpenData – W3C SWEO Community Project


Mappa.Mundi Magazine

Mednar – Innovative Medical Search

Mining the Deep Web for Economic Data

New Zealand Digital Library

OAI-PMH Implementation Guidelines – Conveying rights expressions about metadata in the OAI-PMH framework


OECD.StatExtracts – Complete Databases Available Via OECD’s iLibrary

OneLook Dictionary Search

Onion Browser – An Open-Source Privacy Enhancing Web Browser for iOS

Open Archives Initiative

OpenIndex – Creating a Public Internet Index

Open Source Intelligence

Open Vulnerability Assessment System (OpenVAS)

Privacy Resources Subject Tracer™

Project Maelstrom – The Internet We Build Next

QProber: Classifying and Searching “Hidden-Web” Text Databases – PERSIVAL Project

Recommended Gateway Sites for the Deep Web

ReportLinker: Industry Reports, Company Profiles and Market Statistics

SAO/NASA Astrophysics Data System (ADS)


Science and Technology Sources on the Internet

Scientific and Technical Information Network (STINET)

Science Commons – FirstGov for Science – Government Science Portal – Deep Web Search Engine

SciTech Connect

Scrapinghub Crawls the Deep Web

Scrapy Webcrawler

SDARTS – A Protocol and Toolkit for Metasearching

SIMILE Widgets – Free, Open-Source Data Visualization Web Widgets and More

Social Buzz Bot (PDF download)

STN International – Databases in Science and Technology

SurfEasy – Online Privacy

Swoogle – Semantic Bot

SWRC Ontology

TechDeepWeb – How-To Guide to the Deep Web for IT Professionals

Terbium Labs – Matchlight Proactive Security In an Insecure World

Testbed for Information Extraction from Deep Web

The Invisible Internet Project (I2P)

The Invisible Web

The World Bank – Data

THOR: Deep Web Data Extraction

Tor Browser Bundle – Anonymity

TOR For Newbies – When Should You Use It?

TRID – The TRIS and ITRD Database (Transportation Research Board)

TunnelBear – Simple, Private, Free Access to the Global Internet

Twitter/Search #deepweb

UNdata – Data Access System To UN Databases

UNESCO Information Services – Databases

Useful Tips and Tools to Research the Deep Web

Virtual Private Networks Directory of Best Services

Wall Street Executive Library

Web Data Extractors

WebFountain™ – Analytical Engine for Unstructured Data

Web IR & IE

WebScales: Towards a Highly Scalable Metasearch Engine

WTO Statistics Database

Zaba Search – Free People Search and Public Information Search Engine

RESOURCES – Semantic Web Research

4Store – An Efficient, Scalable and Stable RDF Database

Analyzing Social Networks on the Semantic Web

DARPA Agent Markup Language

DBin Project – Semantic Web P2P and/or Semantic Newsgroup Client.

Deep Search, Wide Search and Everything Else You Should Know About Semantic Search

Digital Object Identifier (DOI)

FOAF Project – A Semantic Web Application

Foundation for Intelligent Physical Agents (FIPA)

GistWeb – Gist of Any Web Page Actual Content

Go3R – Knowledge Based Semantic Search Engine To Avoid Animal Experiments

GoodRelations Vocabulary – Semantic Web Based eCommerce

Infomesh’s Semantic Web Introduction

International Journal of Metadata, Semantics and Ontologies (IJMSO)

International Journal on Semantic Web and Information Systems (IJSWIS)

Jena – A Semantic Web Framework for Java

Journal of Biomedical Semantics

Journal of Web Semantics

Journal of Web Semantics: Preprint Server

Knowledge Discovery


Language Engineering for the Semantic Web: A Digital Library for Endangered Languages

Linked Open Data from the New York Times

Magpie – The Semantic Filter and Tool For the Semantic Web

MetaData at W3C

MindRaider – Semantic Web Outliner

OASIS – Advancing eBusiness Standards

Ontology Matching

Ontology Metadata Vocabulary (OMV)

O’Reilly’s Semantic Web Primer

Potential Advantages Of Semantic Web For Internet Commerce by Yuxiao Zhao and Kristian Sandahl [CiteSeer Login Required]

pOWL – Semantic Web Development Plattform

Practical Semantic Analysis of Web Sites and Documents [CiteSeer Login Required]

RDF Context Tools

RDF – Resource Description Framework

Rules and Rule Markup Languages for the Semantic Web – RuleML-2003 – Interlinking the Web of Data

SAO/NASA Astrophysics Data System (ADS)

Semantic Knowledge Technologies and Language Computation – The Semantic Web Community Portal

Semantic Web Activity Statement

Semantic Web Application Platform – SWAP

Semantic Web for AURIS-MM

Semantic Web In Breadth

Semantic Web Primer for Object-Oriented Software Developers

Semantic Web Roadmap

Semantic Web Search Engine

Semantic Web Search Engine (SWSE)

Semantic Web Services Challenge

Semantic Web – The Voice of Semantic Web Technology

Semantic Web W3C

SenseBot – Semantic Search Engine That Finds Sense On the Web

Simile Widgets – Free, Open-Source Data Visualization Web Widgets and More

Sindice – The Semantic Web Index Project Info – OWL API

Swoogle – Semantic Bot

SWRL: A Semantic Web Rule Language Combining OWL and RuleML

Terbium Labs – Matchlight Proactive Security In an Insecure World

The Authoritative Resource List for the Semantic Web by Kaila Strong

The Cover Pages

The RDF Query Language (RQL)

The Semantic Web: An Introduction

The Semantic Web by Tim Berners-Lee, James Hendler and Ora Lassila

The Semantic Web Is Your Friend

Transforming and Enriching Documents for the Semantic Web by Dietmar Roesner, Manuela Kunze, Sylke Kroetzsch

uClassify – Free Text Classifier Web Service

Watson Web – Exploring the Semantic Web

Web Semantics: Science, Services and Agents on the World Wide Web

Web Service Modeling Ontology

Wilbur Toolkit for Semantic Web Programming [Project no longer actively maintained]

World Wide Web Reference Semantic Web

Yahoo Groups – SemanticWeb

Bot and Intelligent Agent Research Resources and Sites

1st Spot

80legs – Powerful and Economical Service Platform for Crawling and Processing Web Content

Agent Construction Tools


Agent Model Yields Leadership [2004 article]


AgentSheets – Authoring Tool to Create Agents

ALICEBot – Speech Interface for Apps and Devices

Applied Soft Computing

Article Search API – New York Times Articles 1981 to Present

Artificial Intelligence Resources

artoo.js – The Client-Side Scraping Companion

Bots, Blogs and News Aggregators


Clara – Digital Employee That Schedules Meetings

Common Crawl – Open Repository of Web Crawl Data Composed Of Over 5 Billion Freely Available Web Pages

cQuery – Content Query Engine

CrawlTrack – Your Web Statistics Tool

Create a Crawler – Extract Data From an Entire Website

Data Mining Resources

Dataminr – Real-time Information Discovery

DataparkSearch Engine – Full-Featured Open Source Web-Based Search Engine

DataRobot – Build Better Predictive Models – Faster

Deep Web Research

Design of a Parallel and Distributed Web Search Engine by Salvatore Orlando, Raffaele Perego, and Fabrizio Silvestri

Dictionary of Algorithms and Data Structures

Digital Footprints – Collect Facebook Data

Eliza – The Original ChatterBot

Ethereum Frontier Release – A Decentralized Software Platform

Facepager – Fetching Public Data From Facebook

FAME (Facilitating Agents in Multiculture Exchange) Project

File Information Tool Set (FITS)

Foundation for Intelligent Physical Agents

Google Guide

Huginn – Your Agents Are Standing By

IBM Watson Services

Imagination Engines – Turn the Web Into Data With Extractors, Crawlers and Connectors

Indexing Robot Crawler Checklist

InfoExtractor – Extract Relevant Information from Various Sources Like Blogs, YouTube, and Wikipedia

Information Retrieval Intelligence

Institute for Human and Machine Cognition (IHMC)

Intellexer – Custom Built Search Engines, Knowledge Management Tools, Natural Language Processing

Intelligent Information Systems Research Laboratory

International Journal of Agent-Oriented Software Engineering (IJAOSE)

jSEO – Web Crawler For Search Engine Optimization

Knowledge Discovery

Koders – Source Code Search Engine

LAIR – Laboratory of Applied Informatics Research

List of User-Agents (Spiders, Robots, Crawler, Browser)

Minimal-Intelligence Agents for Bargaining Behaviors in Market-Based Environments by Dave Cliff and Janet Bruten

MIT Media Lab: Software Agents

Modelling and Mining of Network Information Systems

Motion AI – Artificial Intelligence Made Easy

Mozenda Web Agent Builder – Web Data Extraction


MySpiders [CiteSeer Login Required]

NCapture – Capture Web Content

Networks and agents Network (NaN)

NewsBot – Related News At a Click Of a Button

Nomibot – Bots Scour the Web To Bring You What You Want

Open Source Web Information Retrieval (OSWIR05)

Oxyus Open Source Search Engine

ResearchKit Framework – Medical Research Apps

Robo Brain – Large Scale Computational System That Learns from Publicly Available Internet Resources

Scrapple – A Framework For Creating Web Scrapers and Web Crawlers

Search Engine Robots

Search Engine Watch News

Search Tools – Information Guides and News

SeerSuite – CiteSeerX Toolkit

Semantic Web


Siri – Your Virtual Personal Assistant

Smarter Bots

SocialBuzzBot – The Business and Social Intelligence Search Engine for Information Discovery from Social Communities

SocSciBot – Social Sciences Link Analysis Research

Spidering Hacks

Spinn3r: RSS Content, News Feeds, News Content, News Crawler and Web Crawler APIs

STACKS – Social Media Tracker, Analyzer, & Collector Toolkit at Syracuse

Structure and Interpretation of Computer Programs – Video Lectures by Hal Abelson and Gerald Jay Sussman

Supybot, A Superb Python IRC Bot

Swoogle – Semantic Bot

TextRunner Search – Searches Hundreds of Millions of Assertions Extracted from 500 Million High-Quality Web Pages

The Intelligent Software Agents Lab

The Lemur Toolkit – Language Modeling and Information Retrieval Research

The Search Engine Project (TSEP)

The Simon Laven Page

TSEP – The Search Engine Project

UMBC AgentWeb

UMBC eBiquity

Web Curator Tool (WCT)

Web Data Extractors – White Paper Link Compilation

Web Intelligence Consortium

Web IR & IE

WolframAlpha Computational Knowledge Engine – Trillions of Pieces of Curated Data and Millions of Lines of Algorithms
