Unknown Unknown: Open Source Web Crawlers

I am back with one more open source list. Given the good response to my previous posts on open source SNS platforms and open source CMS platforms, here is a list of open source web crawlers.

Before I go further, let me point out that web crawlers are not just for search engines. They can be used to mirror a site, advanced web users can use them to work on particular sites, and people like me who like to browse sites offline can use them too.

The list:

Aperture : Aperture is a Java framework for extracting and querying full-text content and metadata from various information systems (e.g. file systems, web sites, mail boxes) and the file formats (e.g. documents, images) occurring in these systems.
Arachnid : Java based web spider framework.
Arale : Arale can download entire web sites or specific resources from the web.
Grub : Grub is a distributed internet crawler/indexer designed to run on multi-platform systems, interfacing with a central server/database.
Heritrix : Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project.
HyperSpider : Collects the link structure of a website.
J-Spider : A highly configurable and customizable Web Spider engine.
JoBo : A basic web spider, with the main advantage over others being that it can automatically fill out forms (e.g. for automated login) and also use cookies for session handling.
Metis : Collects information from the content of web sites.
Nutch : An open source web crawler and search engine from the Apache stable.
OpenWebSpider : Intended as the base for a new search engine developed by a community of open source developers.
Spider : Spider is a complete standalone Java application designed to easily integrate varied datasources.
Web-Harvest : Collect desired Web pages and extract useful data from them.
WebEater : A Java program for web site retrieval and offline viewing.
WebLeach : A fully featured website downloader/mirror tool in Java.
WebSphinx : A personal, customizable crawler.
YaCy : Peer-to-peer web search engine.
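
For anyone new to the idea, here is a rough sketch of what a link-collecting crawler does at its core: fetch a page, pull out the links, and queue the ones it has not seen yet. It is not taken from any of the projects above. The names `LinkExtractor`, `crawl`, `fetch` and `PAGES` are made up for illustration, and the "site" is an in-memory dictionary standing in for real HTTP requests, so treat it as a toy under those assumptions.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, fetch, max_pages=100):
    """Breadth-first crawl; returns {url: [outgoing links]}.

    `fetch` is a hypothetical callable that takes a URL and returns the
    page's HTML as a string, or None if the page cannot be retrieved.
    """
    graph = {}
    queue = deque([start_url])
    seen = {start_url}
    while queue and len(graph) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        parser = LinkExtractor()
        parser.feed(html)
        # Resolve relative links and drop #fragments so pages are not revisited.
        links = [urldefrag(urljoin(url, href)).url for href in parser.links]
        graph[url] = links
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return graph


# Tiny in-memory "site" (an assumption for the example, not a real host).
PAGES = {
    "http://example.test/": '<a href="/a.html">A</a> <a href="b.html">B</a>',
    "http://example.test/a.html": '<a href="/">Home</a>',
    "http://example.test/b.html": '<a href="a.html#top">A again</a>',
}

site_graph = crawl("http://example.test/", PAGES.get)
```

A real crawler would replace `fetch` with an HTTP client and add politeness rules such as honouring robots.txt and rate limits, which is exactly the kind of work the tools in the list take off your hands.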

[Links/Sources]
Java Based Crawlers
More Java Based Crawlers
Crawl Track: Tracking the Crawlers
8Legs: OpenSource Spiders and Search Engines
Build A Web Spider on Linux
All About Search: Search Tools
