How to Use Crawl4AI with a Local LLM to Extract Clean Web Content

If you’re working with large websites and want to extract only the meaningful content—like articles, reviews, or tutorials—while ignoring all the clutter (ads, navigation, login prompts, etc.), Crawl4AI paired with a local LLM like llama3 is a powerful combo.

Here’s how to set it up.

Step 1: Set Up Your Crawler and LLM Filter

Using the crawl4ai library, we spin up a headless browser, fetch a web page, and pass the HTML to an LLM-powered content filter. The filter uses custom instructions to clean up the page, keeping only what matters. Start with the imports:

from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
from crawl4ai.content_filter_strategy import LLMContentFilter
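
The crawl code in Step 2 references a browser_config, a run_config, and a url that aren't shown in the original snippet. Here is a minimal sketch of how they might be defined; the headless setting, the cache bypass, and the placeholder URL are assumptions, not part of the original setup:

browser_config = BrowserConfig(headless=True)                # run the browser without a visible window (assumed)
run_config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)   # skip the cache and always fetch fresh HTML (assumed)
url = "https://www.animenewsnetwork.com/"                    # placeholder target page (assumed)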

We configure a local LLM (like one served through Ollama):

filter = LLMContentFilter(
    provider="ollama/llama3:latest",  # local LLM
    api_token="no token",             # no token needed for local
    chunk_token_threshold=1024,       # fine-tune for large pages
    instruction="""
        Extract blog content while preserving the original wording.
        Focus on:
        - News articles
        - Reviews
        - Columns
        Remove:
        - Navigation
        - Ads
        - Footers
        Add full links prefixed with https://www.animenewsnetwork.com/
    """,
    verbose=True,
    base_url="https://your-local-ngrok-tunnel.ngrok-free.app"
)
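
Note that passing provider and api_token directly follows the older crawl4ai interface. In more recent releases these settings are grouped into an LLMConfig object; if your installed version complains about those arguments, the equivalent setup looks roughly like the sketch below (check the docs for your version before relying on it):

from crawl4ai import LLMConfig

filter = LLMContentFilter(
    llm_config=LLMConfig(
        provider="ollama/llama3:latest",
        api_token="no token",    # still unused for a local model
        base_url="https://your-local-ngrok-tunnel.ngrok-free.app",
    ),
    chunk_token_threshold=1024,
    instruction="...",           # same instruction text as above
    verbose=True,
)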

Step 2: Crawl and Filter

We fetch the page and apply the filter:

async with AsyncWebCrawler(config=browser_config) as crawler:
    result = await crawler.arun(url, config=run_config)
    html = result.cleaned_html
    filtered_content = filter.filter_content(html, ignore_cache=True)
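
One detail the snippet glosses over: arun is a coroutine, so this block has to run inside an event loop. A minimal driver, reusing the config objects and filter defined earlier, might look like this:

import asyncio

async def main():
    # Same crawl-and-filter steps as above, wrapped in a coroutine.
    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(url, config=run_config)
        html = result.cleaned_html
        return filter.filter_content(html, ignore_cache=True)

filtered_content = asyncio.run(main())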

You’ll get a cleaned-up Markdown version of the page, returned as a list of text chunks: perfect for offline reading, summarization, or training datasets.

Step 3: Save and Analyze

Save the filtered content to a file:

with open("filtered_content.md", "w", encoding="utf-8") as f:
    f.write("\n".join(filtered_content))

You can also view how many tokens were used during processing:

filter.show_usage()

Video Demo