How to Scrape Data from Websites Using LLM and Save to CSV in Node.js
Introduction
In today's data-driven world, the ability to extract structured information from web pages is invaluable. Whether you're a data analyst, a researcher, or a developer, web scraping can unlock a wealth of information. But what if you could leverage the power of large language models (LLMs) to enhance this process? In this post, we'll explore how to use JavaScript, the `@ai-sdk/openai` package, and Zod schemas to scrape data from websites and save it to a CSV file. By the end, you'll have a robust tool at your disposal.
Understanding LLMs and Web Scraping
What is a Large Language Model (LLM)?
LLMs are sophisticated models capable of understanding and generating text. They can interpret messy, unstructured content and return structured output, making them well suited for tasks like web scraping.
Basics of Web Scraping
Web scraping involves extracting data from websites and transforming it into a usable format. While it sounds straightforward, it can be challenging due to dynamic content and anti-scraping measures. JavaScript, with its versatility and robust libraries, is a popular choice for implementing these tasks.
Setting Up Your Environment
Necessary Tools and Libraries
To get started, you'll need Node.js and the following libraries:
- `@ai-sdk/openai` for interacting with LLMs.
- `zod` for schema validation.
- `playwright` for navigating web pages.
- `csv-writer` for exporting data to CSV files.
Installation Steps
- Clone the GitHub repo: Github AI Extract.
- Run `pnpm install` to install all necessary packages.
- Open `index.ts` to define your data extraction schema and prompt.
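Alternatively, if you'd rather start from a fresh project than clone the repo, you can install the dependencies directly (this assumes pnpm and a TypeScript project are already set up; the `ai` core package pairs with `@ai-sdk/openai`):

```bash
pnpm add @ai-sdk/openai ai zod playwright csv-writer slugify turndown
pnpm exec playwright install chromium   # download the browser binary Playwright drives
```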
Using LLM in JavaScript for Web Scraping
Integrating LLM with JavaScript
The `@ai-sdk/openai` package allows seamless interaction with LLMs. Define a Zod schema to ensure the extracted data is structured correctly.
Example Code Snippet
Here's how you can set up your main function to begin data extraction. First, add the links of the website pages you want to scrape to the `links.txt` file.
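For example, `links.txt` holds one URL per line (the URLs below are placeholders):

```
https://example.com/glossary/a
https://example.com/glossary/b
https://example.com/glossary/c
```

Then set up your main function: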
```ts
import { z } from 'zod';
import { start } from './fn';

async function main() {
  // Describe the shape of each record you want the LLM to extract.
  const schema = z.object({
    title: z.string(),
    description: z.string(),
  });

  // Kick off the scraper with the schema and an extraction prompt.
  await start(schema, 'Extract all the glossary headings and descriptions.');
}

main();
```
This code initializes the schema for the data you want to extract and triggers the scraping process.
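If you want tighter control over what lands in each field, Zod's `.describe()` lets you attach per-field hints that the AI SDK passes along to the model with the schema. A small illustrative variation (the hint text here is made up):

```ts
// Field-level descriptions guide the model toward the right content per column.
const schema = z.object({
  title: z.string().describe('The glossary term exactly as it appears on the page'),
  description: z.string().describe('The definition paragraph for that term'),
});
```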
Building the Web Scraper
Identifying the Target Data
Before scraping, analyze the structure of the website to identify the data you wish to extract. List all target URLs in `links.txt` for batch processing.
Writing the Scraper Code
The core functionality is encapsulated in `fn.ts`. This script manages the web scraping process, from opening URLs with Playwright to cleaning and extracting data with the LLM. The code below loops over every link in `links.txt`, extracts the data, and saves it to `file.csv`:
```ts
import { chromium, Page } from 'playwright';
import { createObjectCsvWriter } from 'csv-writer';
import { openai } from '@ai-sdk/openai';
import { z } from 'zod';
// fs, slugify, and Turndown are used by helper functions in the same file.
import fs from 'fs';
import slugify from 'slugify';
import Turndown from 'turndown';

const llm = openai.chat('gpt-4o');

const defaultPrompt =
  'You are an advanced tool for extracting information from web pages. Retrieve the data from the site.';

export async function start(schema: z.ZodObject<any>, prompt: string) {
  const csvWriter = createCsvWriter(schema);
  const urls = readUrlsFromFile();

  // headless: false lets you watch the browser while it scrapes.
  const browser = await chromium.launch({
    headless: false,
  });

  // Wrap the caller's schema in an array so the LLM can return many records,
  // and attach the prompt as the description the model sees.
  const _schema = z.object({
    data: z.array(schema).describe(prompt),
  });

  // Process each URL in its own page; a failure on one URL won't stop the rest.
  for (const url of urls) {
    try {
      const page = await browser.newPage();
      console.log(url);
      await extractData(url, _schema, page, csvWriter);
      await page.close();
    } catch (error) {
      console.log(error);
    }
  }

  await browser.close();
}
```
This snippet shows the setup of the scraping process, where each URL is processed to extract the desired data.
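The snippet calls two helpers, `readUrlsFromFile` and `extractData`, that aren't shown above. Here's a minimal sketch of what they might look like inside the same `fn.ts`, assuming the rendered HTML is converted to Markdown with Turndown and handed to the AI SDK's `generateObject` for structured extraction; treat it as an illustration rather than the repo's exact code:

```ts
// generateObject comes from the AI SDK core package 'ai'.
import { generateObject } from 'ai';

// Read links.txt, one URL per line, skipping blank lines.
function readUrlsFromFile(): string[] {
  return fs
    .readFileSync('links.txt', 'utf-8')
    .split('\n')
    .map((line) => line.trim())
    .filter(Boolean);
}

async function extractData(
  url: string,
  schema: z.ZodObject<any>,
  page: Page,
  csvWriter: ReturnType<typeof createCsvWriter>
) {
  await page.goto(url, { waitUntil: 'domcontentloaded' });

  // Convert the rendered HTML to Markdown so the LLM receives cleaner input.
  const markdown = new Turndown().turndown(await page.content());

  // Ask the model to fill the wrapped schema from the page content.
  const { object } = await generateObject({
    model: llm,
    schema,
    prompt: `${defaultPrompt}\n\n${markdown}`,
  });

  // Tag each record with its source URL before appending it to the CSV.
  await csvWriter.writeRecords(
    object.data.map((row: any) => ({ ...row, url }))
  );
}
```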
Saving Data to a CSV File
Using JavaScript to Create CSVs
The `csv-writer` library simplifies writing data to CSV files. Once extracted, data is formatted and stored in `file.csv`.
```ts
export function createCsvWriter(schema: z.ZodObject<any>) {
  // Derive the CSV columns from the schema's top-level keys.
  const keys = Object.keys(schema.shape);
  const header = keys.map((key) => ({ id: key, title: key }));

  // Add a trailing `url` column so each row records its source page.
  const csvWriter = createObjectCsvWriter({
    path: 'file.csv',
    header: [...header, { id: 'url', title: 'url' }],
  });

  return csvWriter;
}
```
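Rows are then appended with `writeRecords`, where each object's keys match the header ids. A quick illustrative usage (the values are placeholders):

```ts
const csvWriter = createCsvWriter(schema);

// Keys mirror the CSV header ids defined above.
await csvWriter.writeRecords([
  {
    title: 'API',
    description: 'Application Programming Interface',
    url: 'https://example.com/glossary',
  },
]);
```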
Handling Common Challenges
Dealing with Dynamic Content
Websites often load content dynamically. Playwright manages these scenarios by rendering pages as a browser would, ensuring complete data capture.
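For example, if a page fills in content after the initial load, you can wait before grabbing the HTML. A minimal sketch using Playwright's built-in waits (the URL and selector are placeholders; pick ones that match your target page):

```ts
import { chromium } from 'playwright';

const browser = await chromium.launch();
const page = await browser.newPage();

// Wait until network activity settles, then for a specific element to appear,
// so dynamically loaded content is present before extraction.
await page.goto('https://example.com/glossary', { waitUntil: 'networkidle' });
await page.waitForSelector('.glossary-item');
const html = await page.content();

await browser.close();
```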
Avoiding Anti-Scraping Measures
Respect websites' terms of service and their robots.txt files. Incorporate headers and delay requests to minimize the risk of being blocked.
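One way to do both with Playwright is to set a realistic user agent on the browser context and pause between requests. A hedged sketch (the user agent string, URLs, and delay are arbitrary choices):

```ts
import { chromium } from 'playwright';

const browser = await chromium.launch();

// A browser context lets you set headers such as the user agent for every page.
const context = await browser.newContext({
  userAgent:
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0 Safari/537.36',
});
const page = await context.newPage();

const urls = ['https://example.com/a', 'https://example.com/b']; // placeholders

for (const url of urls) {
  await page.goto(url);
  // ... extract data here ...
  // Pause between requests to avoid hammering the server.
  await page.waitForTimeout(2000);
}

await browser.close();
```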
Ethical Considerations and Legal Aspects
Web scraping involves ethical and legal responsibilities. Always ensure you have permission to scrape data and comply with applicable laws and terms of use.
Conclusion
We've explored how to use JavaScript and LLMs to scrape web data effectively. By following this guide, you can create a powerful tool for collecting and organizing data from the web, whether for personal projects or professional applications.