I am back with one more OpenSource list. After the good response to my previous posts on OpenSource SNS Platforms and OpenSource CMS platforms, here is a list of OpenSource web crawlers.
Before I go further, let me point out that web crawlers are not only for search engines. They can be used to mirror a site, advanced web users can use them to work on particular sites, and people like me who like to browse sites offline can use them too.
- Aperture : Aperture is a Java framework for extracting and querying full-text content and metadata from various information systems (e.g. file systems, web sites, mail boxes) and the file formats (e.g. documents, images) occurring in these systems.
- Arachnid : Java based web spider framework.
- Arale : Arale can download entire web sites or specific resources from the web.
- Grub : Grub is a distributed internet crawler/indexer designed to run on multi-platform systems, interfacing with a central server/database.
- Heritrix : Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project.
- HyperSpider : This collects the link structure of a website.
- J-Spider : A highly configurable and customizable Web Spider engine.
- Metis : Collects information from the content of web sites.
- Nutch : An open-source web crawler and search project from the Apache stable.
- OpenWebSpider : OpenWebSpider aims to be the base for a new search engine developed by a community of open-source developers.
- Spider : Spider is a complete standalone Java application designed to easily integrate varied datasources.
- Web-Harvest : Collects desired Web pages and extracts useful data from them.
- WebEater : A Java program for web site retrieval and offline viewing.
- WebLeach : A fully featured website downloader/mirror tool in Java.
- WebSphinx : A personal, customizable crawler.
- YaCy : Peer-to-peer web search engine.
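If you are wondering what these tools do under the hood, most crawlers are built around the same basic loop: fetch a page, pull out its links, and queue the ones not seen yet. Below is a minimal sketch of that loop in Python, not taken from any of the projects listed above. The names `LinkParser`, `extract_links`, `crawl` and the `fetch` parameter are my own hypothetical choices, and `fetch` is assumed to be any function that takes a URL and returns the page HTML (or `None` on failure).

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag, urlparse


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(base_url, html):
    """Return absolute, fragment-free URLs linked from the given HTML."""
    parser = LinkParser()
    parser.feed(html)
    return [urldefrag(urljoin(base_url, href)).url for href in parser.links]


def crawl(start_url, fetch, max_pages=50):
    """Breadth-first crawl restricted to the start URL's host.

    `fetch` is a caller-supplied (hypothetical) function that takes a URL
    and returns its HTML as a string, or None if the page is unavailable.
    Returns a dict mapping each visited URL to its HTML.
    """
    host = urlparse(start_url).netloc
    frontier = deque([start_url])   # URLs waiting to be fetched
    seen = {start_url}              # URLs already queued, to avoid loops
    pages = {}
    while frontier and len(pages) < max_pages:
        url = frontier.popleft()
        html = fetch(url)
        if html is None:
            continue
        pages[url] = html
        for link in extract_links(url, html):
            # Stay on the same host and never queue a URL twice.
            if urlparse(link).netloc == host and link not in seen:
                seen.add(link)
                frontier.append(link)
    return pages
```

To try it without touching the network, you can pass the `get` method of a dict that maps URLs to HTML strings as `fetch`. For real sites you would plug in an HTTP fetcher, and a well-behaved crawler should also honour robots.txt (Python ships `urllib.robotparser` for that) and pause between requests, both of which this sketch leaves out.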
Java Based Crawlers
More Java Based Crawlers
Crawl Track: Tracking the Crawlers
8Legs: OpenSource Spiders and Search Engines
Build A Web Spider on Linux
All About Search: Search Tools