The HTML Extract n8n node is a powerhouse for web scraping, enabling you to pull specific pieces of data—like product prices, article titles, or contact information—directly from a website’s HTML code. It operates by using CSS selectors to precisely target elements you want to extract. It’s vital to know that in recent n8n versions, this specific node has been integrated into the more versatile “HTML” node as the “Extract HTML Content” operation, creating a unified tool for all your HTML manipulation needs within a single, powerful node.
Wait, Where Did the HTML Extract Node Go?
So, you’re diving into a web scraping tutorial, maybe an older one, and you’re told to add the “HTML Extract” node. You search for it in n8n, and… nothing. Did you miss an update? Is it a paid feature? Let’s clear this up right away: you’re not going crazy.
The n8n team didn’t remove it; they gave it a promotion! The functionality you’re looking for now lives inside the main HTML node. Think of it as moving from a small apartment into a spacious house. The old “HTML Extract” node is now an operation called Extract HTML Content within the HTML node, sitting alongside other useful operations like “Generate HTML template” and “Convert to HTML Table.”
This is actually a fantastic improvement. Instead of having multiple nodes for different HTML tasks, you have one central, powerful tool. It’s a common point of confusion, but once you know where to look, you’ll appreciate the streamlined approach.
Getting Started with HTML Extraction in n8n
Ready to pull some data from the web? The process is surprisingly straightforward. It’s essentially a two-step dance: first you grab the webpage, then you pick out the pieces you want.
Step 1: Fetching the Webpage’s HTML
Before you can extract anything, you need the raw material—the HTML source code of a webpage. The go-to tool for this job in n8n is the HTTP Request node.
Simply add an HTTP Request node to your workflow, set the Request Method to `GET`, and paste the URL of the website you want to scrape into the URL field. When you run this node, it will fetch the entire HTML content of that page and pass it along as a single data item, usually in a property named `data`.
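Outside n8n, this fetch step boils down to a plain HTTP GET. Here's a minimal Python sketch using only the standard library; the URL and the User-Agent string are placeholders, not anything n8n itself uses:

```python
from urllib import request

def fetch_html(url: str) -> str:
    """Fetch a page's raw HTML, like n8n's HTTP Request node with Method = GET."""
    req = request.Request(url, method="GET",
                          headers={"User-Agent": "my-scraper/1.0"})
    with request.urlopen(req, timeout=10) as resp:
        # n8n stores this response body in an item property (usually `data`)
        return resp.read().decode(resp.headers.get_content_charset() or "utf-8")

# html = fetch_html("https://example.com")  # the page's full HTML source
```

The decoded string is exactly what the HTML node will receive as its input in the next step.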
Step 2: Configuring the HTML Node for Extraction
Now for the main event. Add an HTML node after your HTTP Request node. In the node’s parameters, you’ll see a dropdown for Operation. This is where the magic happens.
- Select Extract HTML Content from the Operation list.
- For Source Data, you'll typically leave it as `JSON`.
- In the JSON Property field, you need to tell the node which property holds the HTML code. If the HTML node comes right after the HTTP Request node, this is usually just `data`; you can also use an expression pointing at the previous node's output, like `{{ $('HTTP Request').item.json.data }}`.
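For reference, here's roughly how those parameters look if you export the workflow as JSON. Treat this as an illustrative sketch only: the exact field names (`extractionValues`, `dataPropertyName`, and so on) are approximations and can vary between n8n versions.

```json
{
  "parameters": {
    "operation": "extractHtmlContent",
    "dataPropertyName": "data",
    "extractionValues": {
      "values": [
        {
          "key": "productTitle",
          "cssSelector": "h1.product-title-text",
          "returnValue": "text"
        }
      ]
    }
  },
  "name": "HTML",
  "type": "n8n-nodes-base.html"
}
```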
The Core of Extraction: CSS Selectors and Values
Now, here’s where it gets interesting. The Extraction Values section is where you define exactly what data you want to grab. You can add multiple extractors to pull different pieces of information from the same page.
Think of a CSS Selector as a specific address for a piece of data on a webpage. To find it, you can right-click an element on a webpage (like a title or a price) and click “Inspect” in your browser. This opens up the developer tools, where you can find the element’s unique identifier.
Here’s a breakdown of the fields for each extractor:
| Parameter | Description | Example |
|---|---|---|
| Key | The name (or key) for the data you're extracting. This becomes the property name in your final JSON output. | `productTitle` |
| CSS Selector | The address of the HTML element you want to target. This can be a tag (`h1`), a class (`.price`), or an ID (`#main-image`). | `h1.product-title-text` |
| Return Value: Text | The visible text inside the element. | "My Awesome Product" |
| Return Value: HTML | The full HTML code inside the element. | `<span>$</span>99.99` |
| Return Value: Attribute | The value of a specific attribute, like the link (`href`) from an `<a>` tag or the image URL (`src`) from an `<img>` tag. | `/products/awesome-product` |
| Return Array | Should n8n return all matching elements as an array? Turn this on if you're scraping a list (like all blog post titles on a page). | `true` |
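n8n handles the selector matching for you, but the difference between the Text and Attribute return values is easy to see in plain code. This is just an illustration using Python's standard-library `html.parser` against a tiny hand-written snippet (n8n does not use this code internally):

```python
from html.parser import HTMLParser

SAMPLE = ('<div><h1 class="product-title-text">My Awesome Product</h1>'
          '<a class="buy" href="/products/awesome-product">Buy</a></div>')

class Extractor(HTMLParser):
    """Collect the text of h1.product-title-text and the href of a.buy."""
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title_text = []   # Return Value = Text
        self.hrefs = []        # Return Value = Attribute (href)

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "h1" and "product-title-text" in attrs.get("class", "").split():
            self.in_title = True
        if tag == "a" and "buy" in attrs.get("class", "").split():
            self.hrefs.append(attrs.get("href"))

    def handle_endtag(self, tag):
        if tag == "h1":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title_text.append(data)

p = Extractor()
p.feed(SAMPLE)
print("".join(p.title_text))  # Text      -> "My Awesome Product"
print(p.hrefs[0])             # Attribute -> "/products/awesome-product"
```

In a real project you'd reach for a library with proper CSS-selector support (such as BeautifulSoup or cheerio, which is what powers selector matching in Node.js tooling), but the Text-versus-Attribute distinction is the same.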
Real-World Example: Scraping n8n Blog Post Titles
Let’s put this into practice. Imagine we want to create a workflow that automatically fetches the latest post titles and their links from the n8n blog.
Our Workflow:
- HTTP Request Node: Set to `GET` the URL `https://n8n.io/blog/`.
- HTML Node: Set the operation to `Extract HTML Content` and point the JSON Property to the output of the HTTP Request node.
Now, we’ll configure two extractors:
Extractor 1: Get Titles

- Key: `postTitle`
- CSS Selector: `h3.heading-6` (this targets all the blog post titles on the page)
- Return Value: `Text`
- Return Array: On

Extractor 2: Get Links

- Key: `postUrl`
- CSS Selector: `a.blog-card_link-block` (this targets the link element that wraps each blog post card)
- Return Value: `Attribute`
- Attribute: `href`
- Return Array: On
When you run this workflow, the HTML node will output a single item containing two arrays: `postTitle` and `postUrl`. You can then use subsequent nodes to merge this data, save it to a Google Sheet, or send a Slack notification. It's that easy!
A Word of Caution: Pitfalls and Best Practices
Let’s be honest, web scraping isn’t always a walk in the park. Here are a few things to keep in mind:
- Dynamic Websites: The HTTP Request node grabs only the initial HTML. If a site loads its content using JavaScript after the page loads, your node won't see it. For these more complex sites (known as Single-Page Applications, or SPAs), you might need to use a browser rendering service like Browserless or ScrapeNinja, which you can call from your HTTP Request node.
- Website Layouts Change: The biggest challenge in web scraping is maintenance. If the website owner redesigns their page, your CSS selectors will likely break. Be prepared to update your workflows periodically.
- Scrape Ethically: Always check a website's `robots.txt` file (e.g., `https://example.com/robots.txt`) to see their rules for crawlers. Don't hammer a server with too many requests in a short time. Be a good internet citizen!
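The `robots.txt` check can even be automated as part of a scraper. Here's a small Python sketch using the standard library's `urllib.robotparser`, fed an illustrative `robots.txt` body instead of a live fetch (the user-agent name and URLs are placeholders):

```python
import time
from urllib import robotparser

# An illustrative robots.txt body; in practice you'd fetch this from
# https://example.com/robots.txt before scraping the site
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Allow: /
"""

rp = robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

print(rp.can_fetch("my-scraper/1.0", "https://example.com/blog/"))   # True
print(rp.can_fetch("my-scraper/1.0", "https://example.com/admin/"))  # False

# Be polite: pause between requests instead of hammering the server
for url in ["https://example.com/blog/", "https://example.com/about/"]:
    if rp.can_fetch("my-scraper/1.0", url):
        time.sleep(1)  # rate-limit; the actual fetch would go here
```

In n8n, the same rate-limiting idea maps to options like batching with a wait interval on the HTTP Request node rather than firing all requests at once.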
By mastering the HTML Extract functionality (now the Extract HTML Content operation of n8n's HTML node), you've unlocked a way to tap into the vast ocean of data on the web, turning unstructured website content into structured, actionable data for your automations.