Web scraping with n8n allows you to automate the extraction of data from websites, turning unstructured web content into structured, usable information without needing an API. This process primarily uses n8n’s `HTTP Request` node to fetch a website’s HTML code and the `HTML Extract` node to parse and pull specific pieces of data using CSS selectors. By connecting these and other nodes, you can build powerful workflows to gather, process, and send data to virtually any application, from Google Sheets to a custom database.
So, you’ve got your eye on some data living on a website, but there’s no handy API to just ask for it. What do you do? Manual copy-pasting is a nightmare, and writing a full-blown Python script feels like bringing a cannon to a knife fight. This, my friend, is where n8n scraping shines. It hits that sweet spot between power and simplicity.
Before We Dive In: The Ethics of Web Scraping
Let’s get this out of the way first because it’s important. Is web scraping legal? The answer is… it’s nuanced. While grabbing publicly available data is generally okay, you’re essentially a guest on someone else’s server. Always check a website’s Terms of Service (ToS) and its `robots.txt` file (you can usually find it at `www.example.com/robots.txt`). The ToS is your legal guide, and the `robots.txt` is a polite request from the webmaster about which areas they don’t want bots to visit. Scrape responsibly: don’t hammer servers with requests, and don’t use the data for nefarious purposes. Trust me, it’s better to be a good digital citizen.
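If you want to honor `robots.txt` automatically, Python’s standard library can parse it for you. Here’s a minimal, offline sketch (the rules and URLs below are made up for illustration; in practice you’d fetch the real file from the target site):

```python
from urllib.robotparser import RobotFileParser

# A sample robots.txt, parsed offline for illustration.
# In a real workflow you'd download https://example.com/robots.txt first.
robots_txt = """
User-agent: *
Disallow: /private/
Allow: /
""".splitlines()

parser = RobotFileParser()
parser.parse(robots_txt)

# Check whether a generic crawler may fetch each path.
print(parser.can_fetch("*", "https://example.com/products"))   # allowed
print(parser.can_fetch("*", "https://example.com/private/x"))  # disallowed
```

A check like this could run in an n8n Code node at the top of your workflow, short-circuiting the scrape if the path is disallowed.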
Why Use n8n for Web Scraping?
So, why not just use a browser extension or a dedicated script? Because n8n isn’t just a scraper; it’s an entire automation ecosystem. Using n8n for scraping is like having a Swiss Army knife. The scraping tool is one part, but you also have the tools to immediately do something with that data.
Think about it: with n8n, you can build a workflow that does this:
- Scrapes product prices from an e-commerce site every morning.
- Compares them to prices in your Google Sheet.
- Sends you a Slack notification if a price drops below a certain threshold.
- Updates your database with the new price.
Doing all that with separate scripts would be a tangled mess. In n8n, it’s a clean, visual flow of nodes. It’s automation on easy mode.
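The "compare and alert" step in that workflow is just a threshold check. As a hedged sketch (the field names `stored_price` and `scraped_price` are hypothetical, not an n8n convention), the logic a Code node might run looks like this:

```python
# Hypothetical shape of the data after the "compare prices" step:
# each item pairs a freshly scraped price with the price stored in the sheet.

def price_alerts(items, threshold_pct=10):
    """Return items whose scraped price dropped at least threshold_pct
    below the previously stored price."""
    alerts = []
    for item in items:
        old, new = item["stored_price"], item["scraped_price"]
        drop_pct = (old - new) / old * 100
        if drop_pct >= threshold_pct:
            alerts.append({**item, "drop_pct": round(drop_pct, 1)})
    return alerts

items = [
    {"name": "Widget", "stored_price": 50.0, "scraped_price": 40.0},  # 20% drop
    {"name": "Gadget", "stored_price": 30.0, "scraped_price": 29.0},  # ~3% drop
]
print(price_alerts(items))  # only "Widget" crosses the 10% threshold
```

Everything downstream (the Slack message, the database update) just consumes the filtered list.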
Your First n8n Scraping Workflow: A Simple Guide
Let’s build a basic scraper. We’ll use a website made for this purpose: Books to Scrape, a fictional online bookstore.
Our goal: Get the titles and prices of all the books on the first page.
Step 1: Fetch the Web Page with the `HTTP Request` Node
This is your starting point. The `HTTP Request` node acts like your browser, sending a request to a URL and getting the website’s raw HTML back.
- Add an `HTTP Request` node to your canvas.
- Request Method: Set it to `GET`.
- URL: Enter `http://books.toscrape.com`.
- Response Format: Choose `String`.
Execute the node. You should see a giant wall of HTML code in the output. Don’t panic! That’s exactly what we want.
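Under the hood, the node is doing nothing more exotic than a plain GET request. A sketch in Python (the request is constructed but not sent here, to keep the example offline):

```python
from urllib.request import Request, urlopen

# The HTTP Request node is, at heart, a plain GET like this one.
req = Request("http://books.toscrape.com", method="GET")

# Executing it would return the same wall of raw HTML the node shows you:
# html = urlopen(req).read().decode("utf-8")   # (network call, not run here)
print(req.full_url, req.get_method())
```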
Step 2: Finding Your Target with CSS Selectors
Now, how do we tell n8n what to grab from that HTML soup? We use CSS Selectors. Think of them as an address for an element on a webpage.
To find the selector:
- Go to Books to Scrape in your browser.
- Right-click on a book’s title and click “Inspect”.
- Your browser’s developer tools will open, highlighting the HTML for that title. You’ll see it’s inside an `<h3>` tag, which is inside an `<article>` tag with a class of `product_pod`.

So, a good selector to get all the book containers is `article.product_pod`.
Step 3: Extracting the Data with the `HTML Extract` Node
This node is the star of the show. It takes the HTML from the previous node and uses our CSS selectors to pull out the data.
- Add an `HTML Extract` node after the `HTTP Request` node.
- Source Data: It should default to `From Previous Node Input`.
- JSON Property: Set this to `data` (or whatever you named the property in the HTTP Request node).
- Under Extraction Values, click `Add Value` twice to create two extractors.
  - Extractor 1 (Title):
    - Key: `title`
    - CSS Selector: `h3 > a` (This selects the `<a>` tag inside the `<h3>`)
    - Return Value: `Text`
  - Extractor 2 (Price):
    - Key: `price`
    - CSS Selector: `.price_color`
    - Return Value: `Text`
- Enable the Return Array option, since we expect multiple books.
Execute the workflow. Voilà! You should have a clean, structured list of all the book titles and their prices. You’ve just built your first n8n scraper!
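To make the extraction concrete, here’s a rough stand-in for what the `HTML Extract` node does, written against Python’s stdlib `html.parser` (no CSS selector engine, so the matching is hand-rolled). One deviation to note: this sketch reads the title from the `<a>` tag’s `title` attribute, since Books to Scrape truncates the visible link text; the sample markup below is abbreviated from the real site:

```python
from html.parser import HTMLParser

# Minimal stand-in for the HTML Extract node: pull the title out of
# "h3 > a" and the price out of ".price_color" for each product card.
SAMPLE = """
<article class="product_pod">
  <h3><a title="A Light in the Attic" href="#">A Light in the ...</a></h3>
  <p class="price_color">£51.77</p>
</article>
"""

class BookExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.books, self._current = [], None
        self._in_h3 = self._in_price = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "article" and "product_pod" in attrs.get("class", ""):
            self._current = {}
        elif tag == "h3":
            self._in_h3 = True
        elif tag == "a" and self._in_h3 and self._current is not None:
            # Books to Scrape keeps the full title in the "title" attribute.
            self._current["title"] = attrs.get("title", "")
        elif tag == "p" and "price_color" in attrs.get("class", ""):
            self._in_price = True

    def handle_data(self, data):
        if self._in_price and self._current is not None:
            self._current["price"] = data.strip()

    def handle_endtag(self, tag):
        if tag == "h3":
            self._in_h3 = False
        elif tag == "p":
            self._in_price = False
        elif tag == "article" and self._current is not None:
            self.books.append(self._current)
            self._current = None

parser = BookExtractor()
parser.feed(SAMPLE)
print(parser.books)  # [{'title': 'A Light in the Attic', 'price': '£51.77'}]
```

In n8n you never write this by hand, which is rather the point: the two extractors in the node replace all of this code.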
Leveling Up: Tackling Real-World Scraping Challenges
Let’s be honest, most websites aren’t as simple as `books.toscrape.com`. You’ll run into challenges. Here’s how n8n can handle them.
| Challenge | The Problem | n8n Solution |
| --- | --- | --- |
| Pagination | Data is spread across multiple pages (Page 1, 2, 3…). | Use a Loop or Split in Batches node to cycle through page URLs. |
| Dynamic Content | Content is loaded with JavaScript after the page loads; the HTTP Request node won’t see it. | Use a headless browser service like Browserless and call its API via the HTTP Request node. |
| Getting Blocked | Websites block your IP address after too many rapid requests. | Route requests through a proxy service; many expose an API you can call from the HTTP Request node. |
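Pagination is usually just URL generation. A sketch of the list a Loop node would iterate over (Books to Scrape happens to use a `catalogue/page-N.html` pattern; verify the pattern for your own target site before relying on it):

```python
# Generate the page URLs a Loop / Split in Batches node would feed,
# one at a time, into the HTTP Request node.
BASE = "http://books.toscrape.com/catalogue/page-{}.html"

def page_urls(n_pages):
    return [BASE.format(i) for i in range(1, n_pages + 1)]

urls = page_urls(3)
print(urls[0])  # http://books.toscrape.com/catalogue/page-1.html
```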
Case Study: Scraping a Dynamic Website with n8n and Browserless
I once had to scrape data from a modern web application where all the useful information was loaded via JavaScript. My simple `HTTP Request` workflow came back empty. It was so frustrating!
This is where a headless browser comes in. A headless browser is a web browser without a graphical user interface, controlled by code. It loads a page just like a real user, executes all the JavaScript, and then gives you the final HTML.
Services like Browserless.io offer this as a service. And here’s the cool part: you can control it right from n8n.
Here’s the concept:
- Set up Browserless: Get a self-hosted or cloud instance of Browserless running.
- Use the `HTTP Request` node in n8n: Instead of calling the target website directly, you call the Browserless API, passing the target URL in the request.
- Browserless works its magic: It opens the page, waits for the JavaScript to load, and returns the complete, final HTML to n8n.
- Use the `HTML Extract` node: Now you run the `HTML Extract` node on the HTML returned from Browserless.
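For a feel of what the `HTTP Request` node sends to Browserless, here’s a sketch of the request construction. The `/content` endpoint, `token` query parameter, and the example URL below are taken from Browserless’s typical setup, but treat them as assumptions and check the API reference for your own instance:

```python
import json
from urllib.request import Request

# Sketch of the call the HTTP Request node would make to Browserless.
# Endpoint, token parameter, and target URL are illustrative assumptions.
BROWSERLESS = "https://chrome.browserless.io/content?token=YOUR_TOKEN"

payload = json.dumps({"url": "https://example.com/js-heavy-page"}).encode()
req = Request(
    BROWSERLESS,
    data=payload,
    headers={"Content-Type": "application/json"},
    method="POST",
)

# urlopen(req) would return the fully rendered HTML (network call, not run here).
print(req.get_method(), json.loads(req.data)["url"])
```

In n8n, the same request is a `POST` with a JSON body in the `HTTP Request` node; no code needed.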
This turns n8n from a simple static page scraper into a powerhouse capable of tackling almost any modern website. You’re no longer just getting the blueprint of the house; you’re getting the fully furnished, decorated home.
Conclusion: Your Automation Journey Starts Here
Web scraping with n8n opens up a universe of possibilities. You’re no longer limited by whether a service has an API. If you can see the data in a browser, you can most likely automate its extraction with an n8n workflow. Start with the basics, get comfortable with the core nodes, and don’t be afraid to tackle more complex sites. The data is out there waiting for you!