Nutch Resource

Crawling

Configuration of Nutch Crawler

See also http://lucene.apache.org/nutch/tutorial8.html for more information.

  • URLs to start with:
    • e.g. nutch-0.8.x/url/yanel-website.txt (http://yanel.wyona.org/)
    • e.g. nutch-0.8.x/url/yulup-website.txt (http://www.yulup.org/)
  • The crawl scope, i.e. which URLs are parsed and followed (IMPORTANT: both files below need an "accept hosts" entry):
    • nutch-0.8.x/conf/crawl-urlfilter.txt (+^http://yanel.wyona.org/)
    • nutch-0.8.x/conf/regex-urlfilter.txt (+^http://yanel.wyona.org/)
  • Crawl depth: set in crawl.sh (e.g. DEPTH=5)
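
Because a missing "accept hosts" entry in either filter file is easy to overlook, a small check can help before starting a crawl. The following is only a sketch: the function names has_accept_entry and check_filters are hypothetical helpers (not part of Nutch or Yanel), and it assumes the accept rule sits on a line of its own, exactly as written in the list above.

```shell
# Hypothetical helpers, not part of Nutch or Yanel.

# has_accept_entry FILE RULE
# Succeeds if FILE contains RULE as a complete line (fixed string, no regex).
has_accept_entry() {
  grep -qsxF -- "$2" "$1"
}

# check_filters NUTCH_DIR RULE
# Reports for both filter files whether the accept rule is present.
# Returns non-zero if at least one file lacks the rule (or does not exist).
check_filters() {
  status=0
  for name in crawl-urlfilter.txt regex-urlfilter.txt; do
    if has_accept_entry "$1/conf/$name" "$2"; then
      echo "ok: $name"
    else
      echo "missing: $name"
      status=1
    fi
  done
  return "$status"
}
```

It could be called as: check_filters nutch-0.8.x '+^http://yanel.wyona.org/'. Note that Nutch applies the first matching rule in a filter file, so the accept entry normally has to appear before a final catch-all rule such as "-."; this sketch only checks presence, not position.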

Running Nutch Crawler

  • sh crawl.sh
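
The contents of crawl.sh are not shown on this page. As a rough illustration only, a wrapper of this kind might look like the sketch below. It assumes the standard Nutch 0.8 "bin/nutch crawl" command; the variable names NUTCH_HOME, CRAWL_DIR and DRY_RUN and the function run_crawl are assumptions made for this example, and the actual script may differ.

```shell
# Hypothetical sketch of a crawl wrapper; the real crawl.sh may differ.
NUTCH_HOME="${NUTCH_HOME:-nutch-0.8.x}"   # assumed Nutch directory
CRAWL_DIR="${CRAWL_DIR:-crawl}"           # assumed output directory for the crawl
DEPTH="${DEPTH:-5}"                       # crawl depth, as in the example above

run_crawl() {
  # Build the command: seed URL directory, output directory, link depth.
  set -- "$NUTCH_HOME/bin/nutch" crawl "$NUTCH_HOME/url" -dir "$CRAWL_DIR" -depth "$DEPTH"
  if [ -n "${DRY_RUN:-}" ]; then
    # Dry run: only print the command that would be executed.
    echo "$*"
  else
    "$@"
  fi
}
```

Setting DRY_RUN=1 before calling run_crawl prints the command instead of starting the crawl, which is a convenient way to check the paths and depth first.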

Searching

Configuration of Yanel Nutch Resource

...
