SiteWalker.js
A simple web crawler that decides which page to crawl next from a callback you supply.
How to install
$ npm install site-walker
Usage
var SiteWalker = require("site-walker");
var instance = new SiteWalker("http://someawesome.site.com", function (pageStr) {
    //callback is fired when a page is successfully crawled
    //pageStr contains the crawled page, as a string
    //do some scraping here and there
    var nextUrl = "http://someawesome.site.com/page/2"; //assume that page/2 is scraped from the current pageStr
    this.next(nextUrl);
});
instance.… //invoke crawling
You can call this.next(nextUrl) several times during a single callback. The URLs are then crawled in the order they were supplied: the first nextUrl first, then the second, and so on. For example:
//supplied callback
function (pageStr) {
    //scrape, scrape
    this.next(url1);
    this.next(url2);
    if (someConditionIsMet) this.next(url3);
}
The pages will be crawled in this order:
url1 -> url2 -> url1 -> url2
If someConditionIsMet evaluates to true during the callback, the order will be:
url1 -> url2 -> url3 -> url1 -> url2
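The ordering above is what you would get if the supplied URLs were kept in a first-in, first-out queue. The sketch below is only a model of that behaviour, not SiteWalker's actual code: simulateCrawlOrder and makeCallback are hypothetical names, no network request is made, and the URL string stands in for pageStr.

```javascript
// Hypothetical model of the documented ordering; not SiteWalker's real implementation.
function simulateCrawlOrder(callback, maxPages) {
    var queue = [];   // URLs waiting to be crawled, oldest first
    var visited = []; // URLs in the order they were "crawled"
    var context = {
        // Stand-in for this.next(): append the URL to the end of the queue.
        next: function (url) { queue.push(url); }
    };

    // The start page is crawled first; its callback seeds the queue.
    callback.call(context, "start");

    // Crawl queued URLs in FIFO order, stopping after maxPages pages.
    while (queue.length > 0 && visited.length < maxPages) {
        var url = queue.shift();
        visited.push(url);
        callback.call(context, url);
    }
    return visited;
}

// Builds a callback like the one in the example above.
function makeCallback(someConditionIsMet) {
    return function () {
        this.next("url1");
        this.next("url2");
        if (someConditionIsMet) {
            this.next("url3");
        }
    };
}

console.log(simulateCrawlOrder(makeCallback(false), 4).join(" -> "));
// prints "url1 -> url2 -> url1 -> url2"
console.log(simulateCrawlOrder(makeCallback(true), 5).join(" -> "));
// prints "url1 -> url2 -> url3 -> url1 -> url2"
```

The maxPages cut-off exists only so the simulation terminates, since the callback keeps supplying URLs on every page.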
Notes
- Currently, if a URL fails to be crawled, SiteWalker stops execution and throws reject.
- No stop() method is available. So, if you keep supplying nextUrl in the callback, SiteWalker will (theoretically) run forever.
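Since there is no stop(), one way to end a crawl is to stop supplying URLs. The sketch below wraps a callback so that this.next() becomes a no-op once the callback has run a given number of times. limitPages is a hypothetical helper, not part of SiteWalker, and it assumes the callback is invoked with a this that exposes next(), as in the usage example.

```javascript
// Hypothetical helper; limitPages is not part of SiteWalker.
function limitPages(maxPages, handler) {
    var pagesSeen = 0; // how many times the wrapped callback has run
    return function (pageStr) {
        var walker = this; // assumed: the object that exposes next()
        pagesSeen += 1;
        var guarded = {
            next: function (url) {
                // Forward the URL only while the page budget is not used up.
                if (pagesSeen < maxPages) {
                    walker.next(url);
                }
            }
        };
        // Run the real callback with the guarded next() in place of the original.
        handler.call(guarded, pageStr);
    };
}
```

The wrapped function can be passed wherever the plain callback would go. It only stops new URLs from being supplied: any URL already handed to next() before the limit was reached will still be crawled.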