To build a custom web scraper in n8n, you primarily use the HTTP Request node to fetch a website’s raw HTML and the HTML Extract node to parse and pull specific data using CSS selectors. This combination is highly effective for static websites. For dynamic sites that rely on JavaScript to load content, the process involves integrating a headless browser service like Browserless, which you can call using the HTTP Request node to get the fully rendered HTML before extracting the data.
Ever stared at a website, wishing you could just magically pull all its data into a spreadsheet? Maybe it’s product prices from a competitor, job listings from a portal, or articles from your favorite blog. Manually copy-pasting is a soul-crushing task, and let’s be honest, you’ve got better things to do. This is where building a scraper n8n workflow becomes your secret weapon. It’s not just about saving time; it’s about unlocking data that can power your business decisions, side projects, or personal research.
Why Use n8n for Web Scraping?
You might be thinking, “Can’t I just use a Python script with Beautiful Soup?” And you absolutely can! But as someone who has written countless scraping scripts from scratch, I can tell you that n8n offers a different, often more efficient, path.
Think of it like this: writing a scraper in pure code is like building a car from individual parts. You have ultimate control, but you also have to worry about every single nut and bolt. Using n8n is like getting a high-quality, pre-assembled chassis. The engine (data processing), wheels (integrations), and dashboard (visual UI) are already there. You just get to focus on the fun part: steering it where you want to go.
With a scraper n8n workflow, you get:
- A Visual Interface: See your entire process laid out, making it easy to understand, debug, and modify.
- Seamless Integrations: Want to save your data to Google Sheets, a PostgreSQL database, or get a Slack notification when it’s done? It’s just another node away.
- Built-in Scheduling: Run your scraper every hour, every day, or every week with a simple Cron node. No need to set up external schedulers.
Building Your First Scraper: The Core n8n Nodes
To build a basic web scraper, you only need two core nodes. They’re the dynamic duo of data extraction.
The HTTP Request Node: Your Gateway to the Web
This node is your entry point. Its job is simple: to go to a specific URL and bring back the website's source code, usually as HTML. For most basic scraping tasks, you'll set the Request Method to `GET` and paste the target website's URL into the URL field. That's it! It fetches the blueprint of the page.
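If you're curious what the node is roughly doing behind the scenes, here's a minimal TypeScript sketch (assuming Node 18+ so the built-in fetch is available; the URL is just the demo site we'll use later):

```typescript
// Rough equivalent of an HTTP Request node set to GET:
// fetch the page and keep the raw HTML as a string.
async function fetchPage(url: string): Promise<string> {
  const response = await fetch(url);
  if (!response.ok) {
    throw new Error(`Request failed with status ${response.status}`);
  }
  return await response.text(); // the page's source HTML
}

fetchPage("http://books.toscrape.com").then((html) => {
  console.log(html.slice(0, 200)); // peek at the first 200 characters
});
```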
The HTML Extract Node: Finding the Needle in the Haystack
Once you have the HTML, you need to find the exact pieces of information you want. The HTML Extract node does this using CSS Selectors. A CSS selector is like a specific address you give the node to find an element on the page. For example, `h1` targets the main heading, while `.product-price` might target an element with the class “product-price”.
You can easily find these selectors by right-clicking an element on a webpage and choosing “Inspect” in your browser’s developer tools. Just find the element in the code, right-click it, and select Copy > Copy selector.
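To make the idea concrete, here's a small sketch using the cheerio library (not something n8n requires — just a common HTML parser that understands the same CSS selectors):

```typescript
import * as cheerio from "cheerio";

// A tiny HTML snippet to show how CSS selectors address elements.
const html = `
  <h1>Deals of the Day</h1>
  <div class="product"><span class="product-price">£19.99</span></div>
`;

const $ = cheerio.load(html);

console.log($("h1").text());             // "Deals of the Day"  (tag selector)
console.log($(".product-price").text()); // "£19.99"            (class selector)
```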
A Practical Example: Scraping Books to Scrape
Let’s build a real workflow. We’ll scrape book titles and prices from books.toscrape.com, a website designed for this very purpose.
- Step 1: Fetch the Page Content: Add an HTTP Request node. Set the URL to `http://books.toscrape.com`. Execute it, and you’ll see the page’s HTML in the output.
- Step 2: Extract All Book Containers: Add an HTML Extract node. Connect it to the first node. To get all the books on the page, we need a selector that targets each book’s container. Using the browser inspector, we can see that each book is an `<article class="product_pod">`. So, we set the CSS Selector to `article.product_pod`. Crucially, we set the Return Value to `HTML` and enable the Return Array option. This gives us a list where each item is the HTML for a single book.
- Step 3: Process Each Book: Now we have an array of 20 items. We need to loop through them. Add a Split In Batches node. Set the Batch Size to `1`. This node will take our list and run the subsequent nodes once for each item, which is perfect for processing one book at a time.
- Step 4: Extract the Details: Add another HTML Extract node after the Split In Batches node. This time, its input will be the HTML for a single book. We’ll add two Extraction Values:
  - For the title: Key = `title`, CSS Selector = `h3 a`, Return Value = `Text`.
  - For the price: Key = `price`, CSS Selector = `.price_color`, Return Value = `Text`.
- Step 5: Save to a Spreadsheet: Finally, connect a Spreadsheet File node or a Google Sheets node. Map the `title` and `price` fields from the previous step. Execute the workflow, and voilà! You have a perfectly structured list of books and their prices.
The Big Challenge: What About JavaScript-Powered Websites?
Here’s where it gets interesting. The method above is fantastic for simple, static websites. But what happens when content is loaded dynamically with JavaScript? Your HTTP Request node might get an almost empty page because the content hasn’t been rendered yet. It’s like receiving a cake recipe (HTML) instead of the finished cake (the rendered page).
This is a super common hurdle. The solution is to use a headless browser. This is a real web browser, like Chrome, that runs in the background without a graphical interface. It loads the page, executes all the JavaScript, and then gives you the final, fully rendered HTML.
Integrating a Headless Browser Service
Services like Browserless.io or ScrapingBee make this easy. You can run them yourself (if you’re feeling adventurous) or use their cloud services. The workflow pattern looks like this:
| Step | Basic Scraper | Advanced Scraper (with Browserless) |
|---|---|---|
| 1. Fetching | Use HTTP Request node to get raw HTML from the target URL. | Use HTTP Request node to call the Browserless API. |
| 2. Input | Target URL goes directly into the node. | Target URL is passed in the body of the API call to Browserless. |
| 3. Output | Raw HTML, potentially missing dynamic content. | Fully rendered HTML, including all JavaScript-loaded content. |
| 4. Parsing | The output is fed directly into an HTML Extract node. | The output is fed directly into an HTML Extract node. |
Essentially, you’re just swapping out the direct HTTP call for a call to a browser service. The rest of your scraper n8n logic—extracting, processing, and saving—remains exactly the same!
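As a rough illustration, that fetch step might look like the sketch below. The host, endpoint, and token handling are assumptions here — check Browserless's own documentation for the current API before relying on any of it:

```typescript
// Sketch of the "advanced" fetch step: instead of hitting the target URL
// directly, ask a headless-browser service for the fully rendered HTML.
// The host, endpoint, and token handling below are assumptions — consult
// the Browserless docs for the exact API your plan exposes.
const BROWSERLESS_URL = "https://chrome.browserless.io/content"; // hypothetical endpoint
const BROWSERLESS_TOKEN = process.env.BROWSERLESS_TOKEN ?? "";

async function fetchRenderedHtml(targetUrl: string): Promise<string> {
  const response = await fetch(`${BROWSERLESS_URL}?token=${BROWSERLESS_TOKEN}`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ url: targetUrl }), // target URL goes in the body
  });
  if (!response.ok) {
    throw new Error(`Headless browser service returned ${response.status}`);
  }
  return await response.text(); // fully rendered HTML, JavaScript already executed
}
```

In n8n, the same call lives in an HTTP Request node: method set to POST, the target URL in the JSON body, and your API key wherever the service expects it. The rendered HTML then flows into the HTML Extract node exactly as before.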
Be a Good Internet Citizen: Scraping Ethics
With great power comes great responsibility. When you build a scraper, you’re a bot, and it’s important to be a polite one.
- Check `robots.txt`: Most websites have a file at `domain.com/robots.txt` that outlines rules for bots. Respect them.
- Read the Terms of Service: Some sites explicitly forbid scraping. Violating ToS can get your IP address banned.
- Don’t Hammer Servers: Add a Wait node in your loops to pause for a few seconds between requests. This prevents overloading the website’s server.
- Identify Yourself: In your HTTP Request node, set a custom `User-Agent` header that identifies your bot (e.g., `My n8n Price Scraper - contact@myemail.com`). It’s a courteous gesture.
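The last two points are easy to picture in code. Here's a small sketch of a “polite” fetch loop — the User-Agent string and the 3-second pause are just illustrative values, not magic numbers:

```typescript
// A polite fetch helper: identifies the bot and pauses between requests.
const POLITE_HEADERS = {
  "User-Agent": "My n8n Price Scraper - contact@myemail.com", // example contact info
};

const sleep = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms));

async function politeFetch(urls: string[]): Promise<string[]> {
  const pages: string[] = [];
  for (const url of urls) {
    const response = await fetch(url, { headers: POLITE_HEADERS });
    pages.push(await response.text());
    await sleep(3000); // the code equivalent of a Wait node between requests
  }
  return pages;
}
```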
By building your scraper n8n workflows thoughtfully and ethically, you can harness the web’s vast data to automate tasks you never thought possible. Now, what will you build first?