How to Scrape Data from Websites Using LLM and Save to CSV in Node.js

Introduction

In today's data-driven world, the ability to extract structured information from web pages is invaluable. Whether you're a data analyst, a researcher, or a developer, web scraping can unlock a wealth of information. But what if you could leverage the power of Large Language Models (LLMs) to enhance this process? In this post, we'll explore how to use JavaScript, the @ai-sdk/openai package, and Zod schemas to scrape data from websites and save it to a CSV file. By the end, you'll have a robust tool at your disposal.

Understanding LLMs and Web Scraping

What is a Large Language Model (LLM)?

LLMs are models trained to understand and generate text. They can interpret messy, unstructured page content and return structured output, which makes them a good fit for extraction tasks like web scraping.

Basics of Web Scraping

Web scraping involves extracting data from websites and transforming it into a usable format. While it sounds straightforward, it can be challenging due to dynamic content and anti-scraping measures. JavaScript, with its versatility and robust libraries, is a popular choice for implementing these tasks.

Setting Up Your Environment

Necessary Tools and Libraries

To get started, you'll need Node.js and the following libraries:

  • @ai-sdk/openai for interacting with LLMs.
  • zod for schema validation.
  • playwright for navigating web pages.
  • csv-writer for exporting data to CSV files.
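If you prefer to set up a project from scratch rather than cloning the repo below, the same dependencies can be installed with pnpm (a sketch with versions unpinned; slugify and turndown are included because the scraper code imports them):

```shell
# Initialize a project and install the libraries used in this post.
pnpm init
pnpm add @ai-sdk/openai zod playwright csv-writer slugify turndown
pnpm add -D typescript

# Playwright needs its browser binaries downloaded once.
pnpm exec playwright install chromium
```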

Installation Steps

  1. Clone the GitHub repo: Github AI Extract.
  2. Run pnpm install to install all necessary packages.
  3. Open index.ts to define your data extraction schema and prompt.

Using LLM in JavaScript for Web Scraping

Integrating LLM with JavaScript

The @ai-sdk/openai package allows seamless interaction with LLMs. Define a Zod schema to ensure the data is structured correctly.

Example Code Snippet

Here's how you can set up your main function to begin data extraction:

  • Add the links to the website pages you want to scrape to the links.txt file.

import { z } from 'zod';
import { start } from './fn';

async function main() {
  const schema = z.object({
    title: z.string(),
    description: z.string(),
  });

  await start(schema, 'Extract all the glossary headings and descriptions.');
}

main();

This code initializes the schema for the data you want to extract and triggers the scraping process.

Building the Web Scraper

Identifying the Target Data

Before scraping, analyze the structure of the website to identify the data you wish to extract. List all target URLs in links.txt for batch processing.
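Reading those URLs back in can be as simple as loading links.txt and skipping blank lines; here is a minimal sketch of a readUrlsFromFile helper (the repo's actual implementation may differ):

```typescript
import fs from 'node:fs';

// Read links.txt and return one trimmed URL per non-empty line.
export function readUrlsFromFile(path = 'links.txt'): string[] {
  return fs
    .readFileSync(path, 'utf8')
    .split('\n')
    .map((line) => line.trim())
    .filter((line) => line.length > 0);
}
```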

Writing the Scraper Code

The core functionality is encapsulated in fn.ts. This script manages the web scraping process, from opening URLs with Playwright to cleaning and extracting data with the LLM.

The code below loops over every link in the links.txt file, extracts the data from each page, and saves it to file.csv:

import { chromium, Page } from 'playwright';
import { createObjectCsvWriter } from 'csv-writer';
import { openai } from '@ai-sdk/openai';
import { z } from 'zod';
import fs from 'fs';
import slugify from 'slugify';
import Turndown from 'turndown';

const llm = openai.chat('gpt-4o');

const defaultPrompt =
  'You are an advanced tool for extracting information from web pages. Retrieve the data from the site.';

export async function start(schema: z.ZodObject<any>, prompt: string) {
  // createCsvWriter, readUrlsFromFile, and extractData are helpers defined
  // elsewhere in fn.ts.
  const csvWriter = createCsvWriter(schema);
  const urls = readUrlsFromFile();

  const browser = await chromium.launch({
    headless: false, // set to true to scrape without a visible browser window
  });

  // Wrap the user's schema in an array so the LLM can return several records per page.
  const _schema = z.object({
    data: z.array(schema).describe(prompt),
  });

  for (const url of urls) {
    try {
      const page = await browser.newPage();
      console.log(url);
      await extractData(url, _schema, page, csvWriter);
      await page.close();
    } catch (error) {
      console.log(error);
    }
  }

  await browser.close();
}

This snippet shows the setup of the scraping process, where each URL is processed to extract the desired data.

Saving Data to a CSV File

Using JavaScript to Create CSVs

The csv-writer library simplifies writing data to CSV files. Once extracted, data is formatted and stored in file.csv.

export function createCsvWriter(schema: z.ZodObject<any>) {
  // One CSV column per key in the Zod schema.
  const keys = Object.keys(schema.shape);

  const header = keys.map((key) => ({ id: key, title: key }));

  // Append a url column so every row records the page it came from.
  const csvWriter = createObjectCsvWriter({
    path: 'file.csv',
    header: [...header, { id: 'url', title: 'url' }],
  });

  return csvWriter;
}
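csv-writer takes care of quoting and escaping for you. For intuition, here is roughly what that involves, as a plain TypeScript sketch (hypothetical helpers, not the library's API):

```typescript
// Quote a field if it contains a comma, quote, or newline; double any embedded quotes.
function escapeCsvField(value: string): string {
  if (/[",\n]/.test(value)) {
    return `"${value.replace(/"/g, '""')}"`;
  }
  return value;
}

// Render one record as a CSV row, in the column order given by keys.
function toCsvRow(record: Record<string, string>, keys: string[]): string {
  return keys.map((key) => escapeCsvField(record[key] ?? '')).join(',');
}

console.log(toCsvRow({ title: 'A, B', description: 'x' }, ['title', 'description']));
// → "A, B",x
```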

Handling Common Challenges

Dealing with Dynamic Content

Websites often load content dynamically. Playwright manages these scenarios by rendering pages as a browser would, ensuring complete data capture.

Avoiding Anti-Scraping Measures

Respect websites' terms of service and their robots.txt files. Send realistic request headers and add delays between requests to minimize the risk of being blocked.
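One simple way to space out requests is a jittered delay between page loads in the scraping loop; a sketch under my own helper names:

```typescript
// Pick a delay between base and base + jitter milliseconds, so requests
// aren't fired at perfectly regular intervals.
function randomDelayMs(base: number, jitter: number): number {
  return base + Math.floor(Math.random() * jitter);
}

// Promise-based sleep, usable in the loop as:
//   await sleep(randomDelayMs(1000, 2000));
function sleep(ms: number): Promise<void> {
  return new Promise((resolve) => setTimeout(resolve, ms));
}
```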

Web scraping involves ethical and legal responsibilities. Always ensure you have permission to scrape data and comply with applicable laws and terms of use.

Conclusion

We've explored how to use JavaScript and LLMs to scrape web data effectively. By following this guide, you can create a powerful tool for collecting and organizing data from the web, whether for personal projects or professional applications.