LLM Scraper is a TypeScript library that lets you convert any webpage into structured data using LLMs.
[!TIP] Under the hood, it uses function calling to convert pages to structured data. You can find more about this approach here
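To make the function-calling idea concrete, here is a minimal sketch, not LLM Scraper's actual implementation. It assumes the tool-definition format used by OpenAI's Chat Completions API; the `Story` type, the `extract_story` name, and the `parseToolArguments` helper are hypothetical and exist only for this illustration. The model output is simulated so the example runs without an API key.

```typescript
// Hypothetical shape of the data we want; not part of LLM Scraper's API.
interface Story {
  title: string
  points: number
}

// A JSON Schema describing the arguments the model should produce.
const storySchema = {
  type: 'object',
  properties: {
    title: { type: 'string' },
    points: { type: 'number' },
  },
  required: ['title', 'points'],
} as const

// Tool definition in the format accepted by OpenAI's Chat Completions API.
// The name `extract_story` is made up for this example.
const tool = {
  type: 'function',
  function: {
    name: 'extract_story',
    description: 'Extract a story from the page content',
    parameters: storySchema,
  },
} as const

// The model replies with the function arguments as a JSON string,
// which the caller parses and checks before using.
function parseToolArguments(rawArguments: string): Story {
  const parsed: unknown = JSON.parse(rawArguments)
  if (
    typeof parsed !== 'object' ||
    parsed === null ||
    typeof (parsed as Story).title !== 'string' ||
    typeof (parsed as Story).points !== 'number'
  ) {
    throw new Error('Model output does not match the schema')
  }
  return parsed as Story
}

// Simulated model output; a real call would go through the provider's SDK.
const story = parseToolArguments('{"title":"Show HN: LLM Scraper","points":120}')
console.log(tool.function.name, story.title, story.points)
```

Because the schema constrains what the model may return, the result can be checked mechanically instead of being parsed out of free-form text.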
- Supports local (GGUF), OpenAI, and Groq chat models
- Schemas defined with Zod
- Full type-safety with TypeScript
- Based on the Playwright framework
- Streaming when crawling multiple pages
- Supports 4 input modes:
  - `html` for loading raw HTML
  - `markdown` for loading markdown
  - `text` for loading extracted text (using Readability.js)
  - `image` for loading a screenshot (multi-modal only)
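Since the `image` mode only works with multi-modal models, it can help to guard the mode choice before starting a run. The sketch below is purely illustrative: `resolveMode` is a hypothetical helper, not a function exported by LLM Scraper, and the fallback to `html` is an assumption made for this example.

```typescript
// The four input modes named in the list above.
type Mode = 'html' | 'markdown' | 'text' | 'image'

// Hypothetical helper (not part of LLM Scraper): fall back to a text-based
// mode when a screenshot is requested but the model cannot read images.
function resolveMode(
  requested: Mode,
  isMultiModal: boolean,
  fallback: Exclude<Mode, 'image'> = 'html'
): Mode {
  if (requested === 'image' && !isMultiModal) {
    return fallback
  }
  return requested
}

console.log(resolveMode('image', false)) // text-only model: uses the fallback
console.log(resolveMode('image', true)) // multi-modal model: keeps 'image'
console.log(resolveMode('text', false)) // non-image modes pass through
```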
Make sure to give it a star!
- Install the required dependencies from npm:

  npm i zod playwright llm-scraper
- Initialize your LLM:

  OpenAI

  import OpenAI from 'openai'
  const model = new OpenAI()

  Local

  import { LlamaModel } from 'node-llama-cpp'
  const model = new LlamaModel({ modelPath: 'model.gguf' })
- Create a new browser instance and attach LLMScraper to it:

  import { chromium } from 'playwright'
  import LLMScraper from 'llm-scraper'

  const browser = await chromium.launch()
  const scraper = new LLMScraper(browser, model)
In this example, we're extracting the top stories from Hacker News:
import { chromium } from 'playwright'
import { z } from 'zod'
import OpenAI from 'openai'
import LLMScraper from 'llm-scraper'

// Launch a browser instance
const browser = await chromium.launch()

// Initialize LLM provider
const llm = new OpenAI()

// Create a new LLMScraper
const scraper = new LLMScraper(browser, llm)

// Define schema to extract contents into
const schema = z.object({
  top: z
    .array(
      z.object({
        title: z.string(),
        points: z.number(),
        by: z.string(),
        commentsURL: z.string(),
      })
    )
    .length(5)
    .describe('Top 5 stories on Hacker News'),
})

// URLs to scrape
const urls = ['https://news.ycombinator.com']

// Run the scraper
const pages = await scraper.run(urls, {
  model: 'gpt-4-turbo',
  schema,
  mode: 'html',
  closeOnFinish: true,
})

// Stream the result from LLM
for await (const page of pages) {
  console.log(page.data)
}
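The `for await` loop above handles each page as soon as it is ready. If you would rather gather everything before processing, a small helper can drain the stream into an array. This is a sketch under stated assumptions: `collect` and `fakePages` are hypothetical names, the `PageResult` shape (in particular the `url` field) is assumed for illustration, and the generator stands in for the scraper's output so the code runs without a browser or an API key.

```typescript
// Hypothetical result shape; the example above only relies on a `data` field.
interface PageResult<T> {
  url: string
  data: T
}

// Stand-in for the scraper's result stream.
async function* fakePages(): AsyncGenerator<PageResult<{ title: string }>> {
  yield { url: 'https://news.ycombinator.com', data: { title: 'First story' } }
  yield { url: 'https://example.com', data: { title: 'Second story' } }
}

// Drain any async iterable into an array, preserving arrival order.
async function collect<T>(source: AsyncIterable<T>): Promise<T[]> {
  const items: T[] = []
  for await (const item of source) {
    items.push(item)
  }
  return items
}

async function main(): Promise<void> {
  const results = await collect(fakePages())
  console.log(results.map((page) => page.data.title))
}

main().catch((error) => {
  console.error(error)
})
```

Collecting trades the low latency of streaming for simpler downstream code, so it suits small batches of URLs better than large crawls.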
As an open-source project, we welcome contributions from the community. If you run into a bug or want to contribute an improvement, please feel free to open an issue or pull request.