Extracting data from the World Wide Web (WWW) has become an important issue in the last few years, as the number of web pages available on the visible Internet has grown to over 20 billion, with over 1 trillion pages available from the invisible web. Tools and protocols to extract all this information are now in demand, as researchers and web surfers alike want to discover new knowledge at an ever-increasing rate. As robots (bots) and intelligent agents are at the heart of many extraction tools, I decided to create a compilation of the latest sources and sites that extract information from the web. A number of eMail extraction tools are still available through the Internet; I have decided not to list these, as they add to the ongoing and increasing problem of spam, except for a readily available DMOZ Directory listing:
Web Data Extractors:
80legs – Powerful and Economical Service Platform for Crawling and Processing Web Content
http://www.80legs.com/
Anthracite
http://freecode.com/projects/anthracite
Aristo – Answer Questions with a Knowledgeable Machine
http://allenai.org/aristo.html
artoo.js – The Client-Side Scraping Companion
http://medialab.github.io/artoo/
Automated RSS Scraper Scripts
http://www.djeaux.com/rss/
Automated Information Solutions
http://www.automated-info-solutions.com/
Automatic Information Extraction From Semi-Structured Web Pages By Pattern Discovery
http://portal.acm.org/citation.cfm?id=640423&dl=ACM&coll=portal
Automation Anywhere – Web Data Extraction Software
http://www.automationanywhere.com/solutions/webDataExt.htm
Beautiful Soup
http://freecode.com/projects/beautifulsoup
Beautiful Soup – HTML/XML Parser for Quick Turnaround Screen Scraping and Web Data Extraction
http://www.crummy.com/software/BeautifulSoup/
BLIASoft Knowledge Discovery
http://www.bliasoft.com/Eindex.html
Bot Research
http://www.BotResearch.info/
BYU Data Extraction Research Group
http://www.deg.byu.edu/
Captiva Software: Digital Information Capture Software
http://www.emc.com/enterprise-content-management/captiva/captiva.htm
ChartSearch Data Search Technology
http://www.ChartSearch.net/
Client-Side Deep Web Data Extraction
http://www.tic.udc.es/~mad/publications/ceceast2004.pdf
Connotate – Web Data Extraction and Monitoring
http://www.connotate.com/
ContextMiner – Tools to Collect Data, Metadata and Contextual Information
http://www.contextminer.org/
cQuery – Content Query Engine
http://cquery.com/
Create a Crawler – Extract Data From an Entire Website
http://support.import.io/knowledgebase/articles/247570-create-a-crawler
cURL groks URLs – Command Line Tool for Transferring Data
http://curl.haxx.se/
Data Extraction Services
http://www.dataextractionservices.com/
Data Mining Resources
http://www.DataMiningResources.info/
Dataminr – Real-time Information Discovery
http://www.dataminr.com/
DataSift – Powerful Social Data Platform
http://datasift.com/
DataWrangler – Data Cleaning and Transformation Tool
http://vis.stanford.edu/wrangler/
Deep Web Research
http://www.DeepWebResearch.info/
DiffBot – Get Data From Web Pages Automatically
http://www.DiffBot.com/
Digital Footprints – Collect Facebook Data
http://digitalfootprints.dk/
DiscoverText – Import, Sort, Distribute and Analyze Electronic Content from eMail, Document Repositories, and Social Media
http://discovertext.com/
Easy PDF Cloud
https://www.easypdfcloud.com/
eGrabber – Data Capture Tools
http://www.egrabber.com/
ExtractData Technologies – SearchExtract Software
http://www.extradata.com/
Facepager – Fetching Public Data From Facebook
https://github.com/strohne/Facepager
FeedsAPI – Extract Content from Web Pages Tool
http://www.feedsapi.com/
Ficstar Software – Web Data Extraction
http://www.ficstar.com/
File Information Tool Set (FITS)
http://projects.iq.harvard.edu/fits
Huginn – Your Agents Are Standing By
https://github.com/cantino/huginn
Imagination Engines
http://www.Imagination-Engines.com/
Import.io – Turn the Web Into Data With Extractors, Crawlers and Connectors
https://import.io/
InfoExtractor – Extracts Relevant Information from Blogs, YouTube and Twitter
http://www.infoextractor.org/
Information Retrieval (IR) and Information Extraction (IE) on the Web
http://www.webir.org/
Introduction to Information Retrieval
http://www-nlp.stanford.edu/IR-book/
iOpus Internet Macros
http://www.iopus.com/imacros/
iRobotSoft – Visual Web Scraping and Web Automation
http://irobotsoft.com/
iWeb Scraping Services
http://www.iwebscraping.com/
jSEO – Web Crawler For Search Engine Optimization
http://codecanyon.net/item/jseo-web-crawler-for-search-engine-optimization/8770392
Junar – Discovering Data
http://www.junar.com/
Karma – Data Integration Tool
https://usc-isi-i2.github.io/karma/
Kimono – Turn Website Into Structured APIs From Your Browser In Seconds
https://www.kimonolabs.com/
Knowledge Discovery Resources
http://www.KnowledgeDiscovery.info/
Knowlesys® – Web Data Extraction, Web Grabber and Screen Scraper
http://www.knowlesys.com/index.htm
LingPipe – Information Extraction and Data Mining Tools
http://alias-i.com/lingpipe/
LoginWorks – Advanced Solutions – Data Mining and Web Scraping
http://www.loginworks.com/
Metadata Extraction Tool
http://meta-extractor.sourceforge.net/
Mozenda – Comprehensive Web Data Gathering
http://www.mozenda.com/
NCapture – Capture Web Content
http://www.qsrinternational.com/products_nvivo_add-ons.aspx
Netlytic – Making Sense of Online Conversations
https://netlytic.org/home/
Newprosoft – Web Data Extraction Software
http://newprosoft.com/
NewsClipper.com – Snip and Ship Dynamic News Content to Your Web Pages
http://www.newsclipper.com/
OutWit Hub – Harvest the Web With Your Own Web Collection Engine
http://www.outwit.com/
Pervasive Data Management and Integration Products
http://www.pervasive.com/
Priceonomics – Crawl Data From the Web
http://priceonomics.com/
QL2 Software – Unstructured Data Management and Web Mining Software
http://www.ql2.com/
REBOL Technologies
http://www.rebol.com/
Semantic Scholar – Free Scientific Literature Search and Discovery
http://allenai.org/semantic-scholar.html
ScissorsFly – Your Web Clipper and Scrapbook
https://alternativeto.net/software/scissorsfly/
ScrapeForge
http://freecode.com/projects/scrapeforge
Scraper
http://freecode.com/projects/scraper
ScraperWiki – Community of Programmers Sifting Information To Give You the Edge
https://scraperwiki.com/
ScrapeShield – Monitor and Track Misuse of Your Content
https://www.cloudflare.com/apps/scrapeshield
Scrapy – Open Source Web Scraping Framework for Python
http://scrapy.org/
Screen-Scraper
http://freecode.com/projects/screenscraper
Screen-Scraper – Extracts Information From Web Sites
http://www.Screen-Scraper.com/
Screenscraping the Senate by Paul Ford
http://www.xml.com/pub/a/2004/09/01/hack-congress.html
Search and Replace with TextPipe Pattern Matching
http://www.datamystic.com/textpipe.html
Social Media Data Collection Tools
http://socialmediadata.wikidot.com/
Spinn3r – Indexing the Blogosphere
http://www.spinn3r.com/
Squirro – Find, Remember, Organize and Share Important Information
http://squirro.com/
STACKS – Social Media Tracker, Analyzer, & Collector Toolkit at Syracuse
https://github.com/bitslabsyr/stack
Texifter – Search, Sift, Sort, Classify and Analyze
http://texifter.com/
TextRazor – Text Analysis Infrastructure
https://www.textrazor.com/
Topicgrazer – Graze On Web Pages and Documents
http://www.topicscape.com/Topicgrazer/help.php
Unit Miner – Web Data Extraction Software
http://www.unitminer.com/
W3C Publishes Data Extraction Language (DEL) as W3C Note
http://xml.coverpages.org/ni2001-11-06-a.html
Web Data Extraction Software
https://www.automationanywhere.com/webdataext
Web Data Extractor
http://www.rafasoft.com/
Web-Harvest – Open Source Web Data Extraction Tool
http://web-harvest.sourceforge.net/index.php
Website Extractor – Offline Browser
http://www.internet-soft.com/extractor.htm
WebSunDew – Advanced Web Scraping Tool
http://www.websundew.com/
Wikimedia Public Data Dumps
http://meta.wikimedia.org/wiki/Data_dumps
XRay Web Scraping Tool
http://freecode.com/projects/xrayguibasedwebscrapingtool
YaCy Web Page Indexer
http://freecode.com/projects/yacy