csscrape
A simple, lightweight, promise-based web scraper for Node.js.
Install
$ npm install csscrape
Basic Usage
Example - Scraping the most depended-upon packages from NPM's main page:
const scrape = ; await ;
/* Results (as of 5/25/2020):
[
  'lodash', 'react', 'chalk',
  'request', 'commander', 'moment',
  'express', 'react-dom', 'prop-types',
  'tslib', 'axios', 'debug',
  'fs-extra', 'async', 'bluebird',
  'vue', 'uuid', 'classnames',
  'core-js', 'underscore', 'inquirer',
  'yargs', 'webpack', 'rxjs',
  'mkdirp', 'glob', 'body-parser',
  'dotenv', 'colors', 'typescript',
  'jquery', 'minimist', 'babel-runtime',
  '@types/node', 'aws-sdk', '@babel/runtime'
]
*/
Use simple JSON & CSS to describe the data you want
Example - Same as above, but only get the first two packages and their details:
const scrape = ; await ;
/* Results (as of 5/25/2020):
[
  {
    name: 'lodash',
    info: {
      desc: 'Lodash modular utilities.',
      author: 'jdalton',
      info: 'published 4.17.15 • 10 months ago'
    }
  },
  {
    name: 'react',
    info: {
      desc: 'React is a JavaScript library for building user interfaces.',
      author: 'acdlite',
      info: 'published 16.13.1 • 2 months ago'
    }
  }
]
*/
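To make the idea of "JSON plus CSS" concrete, here is a minimal sketch of how a nested object of selectors can map onto a nested result of the same shape. It is not csscrape's implementation: `resolveSpec`, the selector strings, and the `fakePage` lookup table are all hypothetical stand-ins, and only string leaves are handled.

```javascript
// Hypothetical sketch, not csscrape's code: walk a nested selector spec and
// replace every CSS selector string with whatever `lookup` returns for it.
function resolveSpec(spec, lookup) {
  if (typeof spec === 'string') return lookup(spec);
  const out = {};
  for (const [key, value] of Object.entries(spec)) {
    out[key] = resolveSpec(value, lookup);
  }
  return out;
}

// Selector strings are made up for illustration.
const spec = { name: 'h3', info: { desc: 'p', author: '.author' } };

// Stand-in for a real DOM query: a fixed table of selector -> text.
const fakePage = {
  h3: 'lodash',
  p: 'Lodash modular utilities.',
  '.author': 'jdalton'
};

const result = resolveSpec(spec, (selector) => fakePage[selector]);
console.log(result);
```

The keys of the spec become the keys of the result, which is why the output above mirrors the structure of the description.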
Follow links to scrape related data
Example - Same as above, but follow each package's link to grab the number of dependencies and dependents from its details page:
const scrape = ; await ;
/* Results (as of 5/25/2020):
[
  {
    name: 'lodash',
    info: {
      desc: 'Lodash modular utilities.',
      author: 'jdalton',
      info: 'published 4.17.15 • 10 months ago'
    },
    dependencies: '0',
    dependents: '115,408'
  },
  {
    name: 'react',
    info: {
      desc: 'React is a JavaScript library for building user interfaces.',
      author: 'acdlite',
      info: 'published 16.13.1 • 2 months ago'
    },
    dependencies: '3',
    dependents: '56,464'
  }
]
*/
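Following a link usually means resolving its `href` against the URL of the page it was found on, since links are often relative. The sketch below shows that step with Node's built-in `URL` class. `resolveLink` is a hypothetical helper and the `/package/lodash` path is an assumed example; how csscrape resolves links internally is not documented here.

```javascript
// Hypothetical helper: resolve a link's href against the page it came from.
// Uses the WHATWG URL class, which is a global in Node.js.
function resolveLink(href, base) {
  return new URL(href, base).href;
}

const pageUrl = 'https://www.npmjs.com/browse/depended';

// A root-relative href (assumed example) resolves against the page's origin.
const relative = resolveLink('/package/lodash', pageUrl);

// An absolute href is returned unchanged.
const absolute = resolveLink('https://example.com/a', pageUrl);

console.log(relative); // https://www.npmjs.com/package/lodash
console.log(absolute); // https://example.com/a
```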
Simple API
You can scrape for data using the following five methods:
/**
 * Get the initial url to start the scrape
 * @param
 */;

/**
 * Filter down the current results to create a new data context
 * @param
 */;

/**
 * Select data from the current data context
 * @param {string | string[] | {}} selector - a css selector string/object
 * Note: To select an attribute value instead of text use string array:
 *   Select div text           : .select('div')
 *   Select div title attribute: .select(['div', 'title'])
 */;

/**
 * Find a link in the current data context to follow to continue scraping
 * @param
 */;

/**
 * Marks the current scrape as finished and returns a Promise of the results
 */;
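The string versus string-array convention described for `.select` can be illustrated with a small normalizer. This is only a sketch of the convention, not the library's code: `normalizeSelector` and the `{ css, attribute }` shape are hypothetical.

```javascript
// Hypothetical helper, not part of csscrape: turn either selector form into
// one shape so later code can tell "text" from "attribute" lookups.
function normalizeSelector(selector) {
  if (typeof selector === 'string') {
    // 'div' -> select the element's text
    return { css: selector, attribute: null };
  }
  if (Array.isArray(selector) && selector.length === 2) {
    // ['div', 'title'] -> select the element's title attribute
    const [css, attribute] = selector;
    return { css, attribute };
  }
  throw new TypeError('Expected a string or a [selector, attribute] pair');
}

const textTarget = normalizeSelector('div');
const attrTarget = normalizeSelector(['div', 'title']);

console.log(textTarget); // { css: 'div', attribute: null }
console.log(attrTarget); // { css: 'div', attribute: 'title' }
```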
Command Line Interface
csscrape also provides a CLI.
Usage: csscrape [options] <url>

Options:
  -V, --version            output the version number
  -f, --filter <selector>  Filter to specific data in the results
  -s, --
Install
$ npm install csscrape -g
Example
Same as the first example, but using the CLI:
csscrape www.npmjs.com/browse/depended -s 'main h3'

# Results (as of 5/25/2020):
# [
#   'lodash', 'react', 'chalk',
#   'request', 'commander', 'moment',
#   'express', 'react-dom', 'prop-types',
#   'tslib', 'axios', 'debug',
#   'fs-extra', 'async', 'bluebird',
#   'vue', 'uuid', 'classnames',
#   'core-js', 'underscore', 'inquirer',
#   'yargs', 'webpack', 'rxjs',
#   'mkdirp', 'glob', 'body-parser',
#   'dotenv', 'colors', 'typescript',
#   'jquery', 'minimist', 'babel-runtime',
#   '@types/node', 'aws-sdk', '@babel/runtime'
# ]