Web scraping with n8n is the process of using automated workflows to extract data from websites, even those without a formal API. By using nodes like HTTP Request to fetch a page’s HTML and the HTML Extract node with CSS selectors to pinpoint specific data, you can build powerful automations to gather information, track prices, or monitor content changes. This visual, low-code approach makes web scraping accessible, allowing you to easily structure the extracted data and send it to other applications like Google Sheets, databases, or even an AI for analysis.
Why Use n8n for Web Scraping?
Let’s be honest, you could write a Python script with Beautiful Soup or a JavaScript function with Puppeteer to scrape a website. I’ve done it countless times. But the real question is, should you? For many, the answer is a resounding no, and here’s why n8n has become my go-to tool for most scraping tasks.
At its core, n8n transforms web scraping from a pure coding challenge into a visual, logical puzzle. You’re not wrestling with syntax or managing dependencies; you’re connecting nodes on a canvas. This visual approach is not just easier; it’s often faster and more maintainable.
Here’s the real magic: n8n is an integration platform first. This means your scraped data doesn’t just sit in a lonely CSV file. You can instantly:
- Push it to a Google Sheet for your team to view.
- Insert it into a PostgreSQL or MySQL database.
- Send a Slack notification when a certain condition is met.
- Feed it to an OpenAI node to summarize the content.
Doing all that with a standalone script would mean writing boilerplate code for every single API. With n8n, you just add another node. It’s that simple.
The Core Toolkit: Your First n8n Scraping Workflow
Ready to get your hands dirty? Building a basic web scraper in n8n revolves around two essential nodes. Think of them as your dynamic duo for data extraction.
Step 1: Fetching the Page with the HTTP Request Node

First, you need the raw material: the website’s HTML code. The HTTP Request node is your tool for this. It’s like knocking on a website’s door and asking, “Can I please see your HTML?”

- Add an HTTP Request node to your workflow.
- Set the Request Method to GET.
- In the URL field, paste the address of the page you want to scrape. For this example, we’ll use http://books.toscrape.com, a sandbox site made for this purpose.
- Under Response Format, select String.
Execute the node, and you’ll see a mountain of HTML code. Don’t panic! This is exactly what we want. Now, we just need to find the treasure buried within.
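Under the hood, the node is simply issuing a plain HTTP GET. If you wanted to reproduce the same fetch outside n8n, a minimal Python sketch using only the standard library might look like this (the User-Agent string is my own example, not something n8n sets):

```python
from urllib.request import Request, urlopen

# Build the same kind of GET request the HTTP Request node sends.
# books.toscrape.com is the sandbox site used in this walkthrough.
req = Request(
    "http://books.toscrape.com",
    headers={"User-Agent": "n8n-tutorial-bot"},
)

print(req.get_method())  # the node's Request Method setting: GET

# Uncomment to actually fetch the page (requires network access):
# html = urlopen(req).read().decode("utf-8")
# print(html[:200])
```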
Step 2: Extracting the Gold with the HTML Extract Node
This is where the real web scraping with n8n happens. The HTML Extract node sifts through that mountain of code to find exactly what you’re looking for. To do this, it uses CSS selectors.
What’s a CSS Selector? It’s simply a specific address for an element on a webpage. The easiest way to find it is to visit the page in your browser, right-click the data you want (like a book title or price), and click “Inspect.” This opens the developer tools, highlighting the element’s code. You can use its class or ID as a selector.
Let’s configure the HTML Extract node:

- Source Data: Should be JSON.
- JSON Property: This should be the field name from the previous node containing the HTML (e.g., data).
- Extraction Values: Click ‘Add Value’ to define what you want to pull.
  - Key: Give it a name, like title.
  - CSS Selector: To get all book titles, use article.product_pod > h3 > a.
  - Return Value: Text.

Add another extraction value for the price, using the CSS selector p.price_color. When you run this node, you’ll get a beautifully structured list of all the book titles and their prices!
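n8n handles the selector matching for you, but to make the idea concrete, here is a rough Python sketch that mimics those two selectors using only the standard library’s html.parser. This is a simplified illustration (it assumes well-nested markup, and the sample HTML is a trimmed-down version of the real page), not what the HTML Extract node actually runs:

```python
from html.parser import HTMLParser

# Trimmed-down sample of the markup books.toscrape.com uses per listing.
SAMPLE = """
<article class="product_pod">
  <h3><a href="a-light-in-the-attic">A Light in the Attic</a></h3>
  <p class="price_color">£51.77</p>
</article>
<article class="product_pod">
  <h3><a href="tipping-the-velvet">Tipping the Velvet</a></h3>
  <p class="price_color">£53.74</p>
</article>
"""

class BookExtractor(HTMLParser):
    """Mimics the two CSS selectors from the HTML Extract node:
    article.product_pod > h3 > a (title) and p.price_color (price)."""

    def __init__(self):
        super().__init__()
        self.path = []        # stack of (tag, classes) we are currently inside
        self.capture = None   # "title" or "price" while inside a matched element
        self.titles = []
        self.prices = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        self.path.append((tag, classes))
        # article.product_pod > h3 > a
        if (tag == "a" and len(self.path) >= 3
                and self.path[-2][0] == "h3"
                and self.path[-3][0] == "article"
                and "product_pod" in self.path[-3][1]):
            self.capture = "title"
        # p.price_color
        elif tag == "p" and "price_color" in classes:
            self.capture = "price"

    def handle_endtag(self, tag):
        if self.path and self.path[-1][0] == tag:
            self.path.pop()
        self.capture = None

    def handle_data(self, data):
        if self.capture == "title" and data.strip():
            self.titles.append(data.strip())
        elif self.capture == "price" and data.strip():
            self.prices.append(data.strip())

parser = BookExtractor()
parser.feed(SAMPLE)
print(parser.titles)  # ['A Light in the Attic', 'Tipping the Velvet']
print(parser.prices)  # ['£51.77', '£53.74']
```

In n8n, of course, all of this bookkeeping is replaced by typing two selectors into a form, which is exactly the point.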
Real-World Example: Building an Automated Price Tracker
Let’s make this practical. I once wanted to buy a specific mechanical keyboard, but I was waiting for a price drop. Instead of checking the site every day, I built a 5-minute n8n workflow to do it for me.
Here’s the blueprint:
- Cron Node: The trigger. I set it to run once a day at 9 AM: 0 9 * * *.
- HTTP Request Node: Fetches the product page URL.
- HTML Extract Node: Extracts two things: the product title and the current price. I found the CSS selectors using the “Inspect” method.
- IF Node: This is the brain. It compares the extracted price to my desired price. The condition was something like {{ $json.price }} <= 100.
- Gmail/Slack Node: Connected to the ‘true’ output of the IF node. If the price was less than or equal to $100, it sent me a message: “Price Drop Alert! ‘{{ $json.title }}’ is now only ${{ $json.price }}!”
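One practical wrinkle: scraped prices usually arrive as strings like “$129.99”, so the IF node comparison only behaves once the value is numeric. Here’s a hedged Python sketch of that cleanup-and-compare step; the function name and threshold are my own illustration, not part of the actual workflow:

```python
TARGET_PRICE = 100.0  # my desired price threshold

def price_drop_alert(title: str, raw_price: str, target: float = TARGET_PRICE):
    """Mimic the IF node: strip currency symbols, compare to the target,
    and return the alert message for the 'true' branch (or None)."""
    price = float(raw_price.replace("$", "").replace("£", "").replace(",", ""))
    if price <= target:
        return f"Price Drop Alert! '{title}' is now only ${price:.2f}!"
    return None

print(price_drop_alert("Mechanical Keyboard", "$89.50"))
# Price Drop Alert! 'Mechanical Keyboard' is now only $89.50!
print(price_drop_alert("Mechanical Keyboard", "$129.99"))  # None
```

In n8n itself you’d do this cleanup with a small expression or a Code node before the IF node, so the comparison sees a number rather than a string.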
I set it up and completely forgot about it. Two weeks later, a notification popped up on my phone. I bought the keyboard. That’s the power of putting your n8n scraping skills to work.
Leveling Up: Handling Dynamic and Complex Websites
Now, here’s where things can get tricky. What happens when the data you want isn’t in the initial HTML? Many modern websites use JavaScript to load content (like product listings or search results) after the page initially loads. Your trusty HTTP Request node won’t see this data because it doesn’t run JavaScript.
This is the point where many new automators get stuck. But fear not, there are solutions!
The Challenge of Dynamic Content
When HTTP Request isn’t enough, you need to simulate a real browser that can run JavaScript. This is where “headless browsers” come into play. These are tools like Browserless, Playwright, or ScrapingBee that run a full browser on a server. You can send them a URL, and they’ll return the fully rendered HTML, JavaScript content and all.

You typically interact with these services via their API, which means you can call them directly from n8n using… you guessed it, the HTTP Request node! You just send the URL you want to scrape to the headless browser’s API endpoint, and it sends back the clean HTML for your HTML Extract node to parse.
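To make the shape of such a call concrete, here is a sketch of the request you would configure in the HTTP Request node, written as Python. The endpoint path, token parameter, and payload shape below follow the general Browserless-style pattern but are assumptions; always check your provider’s API documentation for the real details:

```python
import json
from urllib.request import Request

# Assumed Browserless-style rendering endpoint; YOUR_API_TOKEN is a placeholder.
RENDER_API = "https://chrome.browserless.io/content?token=YOUR_API_TOKEN"

# The page you actually want scraped goes in the JSON body.
payload = json.dumps({"url": "https://example.com/js-heavy-page"}).encode()

req = Request(
    RENDER_API,
    data=payload,
    headers={"Content-Type": "application/json"},
    method="POST",
)

print(req.get_method())       # POST
print(json.loads(req.data))   # {'url': 'https://example.com/js-heavy-page'}

# In n8n, the equivalent HTTP Request node settings would be:
# Method: POST, URL: the rendering service endpoint,
# Body Content Type: JSON, with the target page URL in the body.
# The response is fully rendered HTML, ready for the HTML Extract node.
```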
Scraper Tools & Techniques: A Quick Comparison
| Technique | Best For… | Pros | Cons |
|---|---|---|---|
| HTTP Request + HTML Extract | Static HTML websites (blogs, simple stores) | Fast, simple, no extra setup | Fails on websites that rely heavily on JavaScript |
| n8n + Headless Browser (e.g., Browserless) | Dynamic, complex, JS-heavy web apps | Can scrape virtually anything a user sees | Requires an external service, slower, more complex setup |
| n8n + changedetection.io | Website monitoring and change detection | Specialized for tracking differences | Another tool to self-host and manage, but powerful |
A Quick Word on Legality and Ethics
Just because you can scrape a website doesn’t always mean you should. Be a good internet citizen. Before you start an n8n scraping project:

- Check robots.txt: Most websites have a file at www.example.com/robots.txt that outlines rules for bots. Respect them.
- Read the Terms of Service: Some sites explicitly forbid scraping.
- Don’t Overload Servers: Use a Cron node to schedule your scraping for off-peak hours and don’t make requests too frequently.
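The robots.txt check can even be automated: Python’s standard library ships a parser for exactly this. A small sketch, where the sample rules are made up for illustration (against a live site you would call set_url() and read() instead of parse()):

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt rules for illustration.
SAMPLE_ROBOTS = """
User-agent: *
Disallow: /admin/
Allow: /
"""

parser = RobotFileParser()
parser.parse(SAMPLE_ROBOTS.splitlines())

# For a real site, you would do this instead:
# parser.set_url("http://www.example.com/robots.txt")
# parser.read()

print(parser.can_fetch("my-n8n-bot", "http://www.example.com/products"))     # True
print(parser.can_fetch("my-n8n-bot", "http://www.example.com/admin/stats"))  # False
```

Running a check like this before each scrape is a cheap way to stay on the right side of a site’s stated rules.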
Web scraping is an incredibly powerful skill, and with a tool like n8n, it’s more accessible than ever. Whether you’re tracking prices, gathering sales leads, or just collecting data for a hobby project, you now have the foundational knowledge to build robust and useful automations. Go ahead and start building!