Open Source Web Crawlers

I am back with one more open source list. After the good response to my previous posts on open source SNS platforms and open source CMS platforms, here is a list of open source web crawlers.

Before I go further, note that web crawlers are not only for search engines. They can be used to mirror a site, advanced web users can use them to work with particular sites, and people like me who like to browse sites offline can use them too.
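For readers new to the idea, here is a minimal sketch of the step that crawlers share: read a page, pull out its links, and follow them. It is not taken from any tool in the list. The names `LinkExtractor`, `extract_links`, `crawl`, and the `fetch` parameter are made up for illustration, and `fetch` is assumed to be any function that takes a URL and returns the page's HTML (or `None` if the page cannot be read).

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkExtractor(HTMLParser):
    """Collects absolute http(s) URLs from the href attributes of <a> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page URL and drop "#fragment".
                absolute, _fragment = urldefrag(urljoin(self.base_url, value))
                is_web_link = absolute.startswith(("http://", "https://"))
                if is_web_link and absolute not in self.links:
                    self.links.append(absolute)


def extract_links(html, base_url):
    """Return the list of distinct links found in one HTML page."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    parser.close()
    return parser.links


def crawl(start_url, fetch, max_pages=10):
    """Breadth-first crawl that returns {page URL: [links on that page]}.

    `fetch` is a hypothetical callable: URL in, HTML text (or None) out.
    """
    queue = deque([start_url])
    seen = {start_url}
    link_map = {}
    while queue and len(link_map) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue  # unreachable page: skip it
        links = extract_links(html, url)
        link_map[url] = links
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return link_map
```

Passing `fetch` in as a parameter keeps the sketch testable without network access. A real crawler would plug in an HTTP client there, and would normally also honour each site's robots.txt and limit its request rate.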

The list:

  • Aperture : Aperture is a Java framework for extracting and querying full-text content and metadata from various information systems (e.g. file systems, web sites, mail boxes) and the file formats (e.g. documents, images) occurring in these systems.
  • Arachnid : A Java-based web spider framework.
  • Arale : Arale can download entire web sites or specific resources from the web.
  • Grub : Grub is a distributed internet crawler/indexer designed to run on multi-platform systems, interfacing with a central server/database.
  • Heritrix : Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project.
  • HyperSpider : Collects the link structure of a website.
  • J-Spider : A highly configurable and customizable Web Spider engine.
  • JoBo : A basic web spider, with the main advantage over others being that it can automatically fill out forms (e.g. for automated login) and also use cookies for session handling.
  • Metis : Collects information from the content of web sites.
  • Nutch : An open source web crawler from the Apache stable.
  • OpenWebSpider : OpenWebSpider aims to be the base for a new search engine developed by a community of open source developers.
  • Spider : Spider is a complete standalone Java application designed to easily integrate varied data sources.
  • Web-Harvest : Collects desired Web pages and extracts useful data from them.
  • WebEater : A Java program for web site retrieval and offline viewing.
  • WebLeach : A fully featured website downloader/mirror tool in Java.
  • WebSphinx : A personal, customizable crawler.
  • YaCy : A peer-to-peer web search engine.

Related links:

  • Java Based Crawlers
  • More Java Based Crawlers
  • Crawl Track: Tracking the Crawlers
  • 8Legs: OpenSource Spiders and Search Engines
  • Build A Web Spider on Linux
  • All About Search: Search Tools


Anonymous said...

Thanks for these links, I was looking for a good crawler to adapt it to my needs.