LLM-Powered Web Content Extraction with Crawl4AI
How to Use Crawl4AI with a Local LLM to Extract Clean Web Content
If you’re working with large websites and want to extract only the meaningful content (articles, reviews, or tutorials) while ignoring all the clutter (ads, navigation, login prompts, and so on), Crawl4AI paired with a local LLM like llama3 is a powerful combo.
Here’s how to set it up.
Step 1: Set Up Your Crawler and LLM Filter
Using the crawl4ai library, we spin up a headless browser, fetch a web page, and pass the HTML to an LLM-powered content filter. The filter uses custom instructions to clean up the page, keeping only what matters.
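(If the library isn’t installed yet, pip install crawl4ai followed by the crawl4ai-setup command pulls in the browser binaries; the exact steps can vary between releases, so check the project README for your version.)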
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
from crawl4ai.content_filter_strategy import LLMContentFilter
We configure a local LLM (like one served through Ollama):
filter = LLMContentFilter(
    provider="ollama/llama3:latest",  # local LLM
    api_token="no token",             # no token needed for local
    chunk_token_threshold=1024,       # fine-tune for large pages
    instruction="""
    Extract blog content while preserving the original wording.
    Focus on:
    - News articles
    - Reviews
    - Columns
    Remove:
    - Navigation
    - Ads
    - Footers
    Add full links prefixed with https://www.animenewsnetwork.com/
    """,
    verbose=True,
    base_url="https://your-local-ngrok-tunnel.ngrok-free.app",
)
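Note that the constructor above matches older crawl4ai releases; newer versions group the provider settings into an LLMConfig object instead. A minimal sketch, assuming a recent release:

from crawl4ai import LLMConfig

filter = LLMContentFilter(
    llm_config=LLMConfig(
        provider="ollama/llama3:latest",  # same local model as above
        api_token="no token",             # still unused for a local model
        base_url="https://your-local-ngrok-tunnel.ngrok-free.app",
    ),
    chunk_token_threshold=1024,
    instruction="...",  # same instruction string as above
    verbose=True,
)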
Step 2: Crawl and Filter
We define a minimal browser and run configuration (the settings below are just sensible defaults), pick a target URL, then fetch the page and apply the filter (the await calls assume you’re inside an async function):

browser_config = BrowserConfig(headless=True)               # headless browser session
run_config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)  # always fetch a fresh copy
url = "https://www.animenewsnetwork.com/"                   # example target page

async with AsyncWebCrawler(config=browser_config) as crawler:
    result = await crawler.arun(url, config=run_config)
    html = result.cleaned_html
    filtered_content = filter.filter_content(html, ignore_cache=True)
You’ll get a cleaned-up Markdown version of the page—perfect for offline reading, summarization, or training datasets.
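As a side note, crawl4ai can also run the filter inside the crawl itself: attach it to a markdown generator on the run config, and the filtered text lands directly on the result object. A sketch, assuming a recent release that exposes the filtered output as fit_markdown:

from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator

run_config = CrawlerRunConfig(
    cache_mode=CacheMode.BYPASS,
    markdown_generator=DefaultMarkdownGenerator(content_filter=filter),
)

async with AsyncWebCrawler(config=browser_config) as crawler:
    result = await crawler.arun(url, config=run_config)
    print(result.markdown.fit_markdown)  # filtered markdown (attribute name varies by version)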
Step 3: Save and Analyze
Save the filtered content to a file:
with open("filtered_content.md", "w", encoding="utf-8") as f:
    f.write("\n".join(filtered_content))
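Since filter_content returns a list of markdown chunks (one per processed chunk of HTML), a quick sanity check confirms the filter actually produced output:

print(f"Got {len(filtered_content)} chunk(s), "
      f"{sum(len(chunk) for chunk in filtered_content)} characters in total")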
You can also view how many tokens were used during processing:
filter.show_usage()