Building a Markdown Knowledge Base from Web Data
Converting web content into a well-organized knowledge base can be challenging, especially when dealing with multiple sources and formats. In this tutorial, I’ll show you how to use DataFuel to transform website data into a clean, maintainable Markdown knowledge base. For a deeper dive into using this knowledge base with AI, check out our guide on RAG for Websites.
Why Markdown for Your Knowledge Base?
| Feature | Benefit | 
|---|---|
| Human-readable | Easy to read and write without special tools | 
| Version Control | Git-friendly format for tracking changes | 
| Convertible | Easily transforms to HTML, PDF, and other formats | 
| LLM-friendly | Clean text structure ideal for AI processing | 
Getting Started with DataFuel
- Get Your API Key: Sign up at datafuel.dev
- Install Dependencies: You’ll need Python with the requests library (see the install command below)
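If requests isn’t already available in your environment, it installs with pip:

pip install requests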
Basic Implementation
Here’s an example of how to scrape a website or knowledge base into clean Markdown using DataFuel:
import requests
import os
from datetime import datetime
import time
import json
# Load your API key from the environment instead of hardcoding it
API_KEY = os.environ.get('DATAFUEL_API_KEY')
KNOWLEDGE_BASE_DIR = './kb'
os.makedirs(KNOWLEDGE_BASE_DIR, exist_ok=True)  # make sure the output directory exists
def fetch_content(url, depth=5, limit=100):
    headers = {
        'Authorization': f'Bearer {API_KEY}',
        'Content-Type': 'application/json'
    }
    
    # Initial crawl request
    payload = {
        "depth": depth,
        "limit": limit,
        "url": url
    }
    
    response = requests.post('https://api.datafuel.dev/crawl', 
        json=payload,
        headers=headers
    )
    
    job_id = response.json().get('job_id')
    if not job_id:
        raise Exception("No job_id received")
    
    # Poll for results (consider adding a timeout or retry limit for production use)
    while True:
        results = get_crawl_results(job_id, headers)
        if results and all(r['job_status'] == 'finished' for r in results):
            return results
        time.sleep(5)  # Wait 5 seconds before polling again
def get_crawl_results(job_id, headers):
    response = requests.get(
        'https://api.datafuel.dev/list_scrapes',
        headers=headers,
        params={'job_id': job_id}
    )
    return response.json()
def extract_markdown(signed_url):
    try:
        response = requests.get(signed_url)
        response.raise_for_status()
        data = response.json()
        
        if markdown := data.get("markdown"):
            return markdown
        return "No markdown field found in the data."
    except requests.exceptions.RequestException as e:
        return f"An error occurred while fetching the data: {e}"
    except json.JSONDecodeError:
        return "Error: The response is not valid JSON."
def save_to_markdown(content, url, tags=None):
    # Create clean filename from URL
    filename = url.replace('https://', '').replace('/', '_') + '.md'
    filepath = os.path.join(KNOWLEDGE_BASE_DIR, filename)
    
    # Get markdown content from signed URL
    markdown_content = extract_markdown(content['signed_url'])
    
    metadata = {
        'source': url,
        'date_added': datetime.now().strftime('%Y-%m-%d'),
        'tags': tags or [],
        'category': determine_category(url),
        'scrape_id': content.get('scrape_id'),
        'job_id': content.get('job_id')
    }
    
    # Convert metadata to YAML format
    yaml_metadata = '\n'.join([f'{k}: {v}' for k, v in metadata.items()])
    
    final_content = f"""---
{yaml_metadata}
---
{markdown_content}
"""
    
    with open(filepath, 'w', encoding='utf-8') as f:
        f.write(final_content)
# Example usage
def process_website(url, depth=5, limit=100):
    try:
        crawl_results = fetch_content(url, depth, limit)
        for result in crawl_results:
            if result['scrape_status'] == 'success':
                save_to_markdown(result, result['scrape_url'])
    except Exception as e:
        print(f"Error processing website: {e}")Organizing Your Knowledge Base
Organizing Your Knowledge Base
Recommended Structure
kb/
├── 📁 technical/
│   ├── guides/
│   ├── reference/
│   └── tutorials/
├── 📁 product/
│   ├── features/
│   └── use-cases/
└── 📁 research/
    ├── market/
    └── competitors/
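To sort new pages into this tree automatically, the determine_category() helper used earlier can key off URL path fragments. This is a sketch under the assumption that your source URLs contain hints such as /docs/ or /features/; tune the keyword map to your own sites:

def determine_category(url):
    # Map URL path fragments to the kb/ top-level folders above
    keyword_map = {
        'docs': 'technical',
        'guide': 'technical',
        'tutorial': 'technical',
        'features': 'product',
        'use-cases': 'product',
        'market': 'research',
        'competitor': 'research',
    }
    lowered = url.lower()
    for keyword, category in keyword_map.items():
        if keyword in lowered:
            return category
    return 'uncategorized'  # fallback when no keyword matches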
Making Your Knowledge Base LLM-Ready
Best Practices Checklist
✨ Rich Metadata
- 📅 Date added
- 🏷️ Tags
- 📁 Categories
- 🔄 Last updated
📚 Clear Structure
- 📌 Consistent headings
- 🔍 Logical hierarchy
💻 Code Examples
- ✨ Syntax highlighting
- 💭 Clear comments
🔗 Cross-References
- 📎 Internal links
- 🤝 Related content
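Put together, a note that follows this checklist might open with front matter like the following (all values illustrative):

---
source: https://docs.example.com/guides/setup
date_added: 2025-01-15
last_updated: 2025-03-02
tags: [setup, onboarding]
category: technical
---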
Enhanced Metadata Example
def save_to_markdown(content, url, tags=None):
    # This variant assumes the scrape payload also exposes
    # 'title', 'summary', and 'main_content' fields
    metadata = {
        'source': url,
        'date_added': datetime.now().strftime('%Y-%m-%d'),
        'tags': tags or [],
        'category': determine_category(url),
        'summary': content.get('summary', '')
    }

    # Convert metadata to YAML front matter
    yaml_metadata = '\n'.join(f'{k}: {v}' for k, v in metadata.items())

    markdown_content = f"""---
{yaml_metadata}
---
# {content['title']}
{content['summary']}
{content['main_content']}
"""

    filename = url.replace('https://', '').replace('/', '_') + '.md'
    filepath = os.path.join(KNOWLEDGE_BASE_DIR, filename)
    with open(filepath, 'w', encoding='utf-8') as f:
        f.write(markdown_content)
Maintaining Your Knowledge Base
Regular Maintenance Tasks
| Task | Frequency | Purpose | 
|---|---|---|
| Content Updates | Monthly | Keep information current | 
| Link Verification | Weekly | Ensure all links work | 
| Duplicate Check | Monthly | Remove redundant content | 
| Tag Review | Quarterly | Maintain consistent taxonomy | 
Automated Health Check
def audit_knowledge_base():
    for root, _, files in os.walk(KNOWLEDGE_BASE_DIR):
        for file in files:
            if file.endswith('.md'):
                filepath = os.path.join(root, file)
                issues = check_file_health(filepath)
                if issues:
                    print(f"{filepath}: {', '.join(issues)}")
def check_file_health(filepath):
    # Read file content
    with open(filepath, 'r', encoding='utf-8') as f:
        content = f.read()

    # Check for common issues
    issues = []
    if '](broken-link)' in content:
        issues.append('Contains broken links')
    if len(content.split('\n\n')) < 3:
        issues.append('Content might be too condensed')

    return issues
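The weekly link-verification task from the table above can also be automated. This sketch extracts external URLs with a regex and checks each one with a HEAD request; the regex is a simplification and won’t catch every Markdown link form:

import re

def verify_links(filepath):
    with open(filepath, 'r', encoding='utf-8') as f:
        content = f.read()

    broken = []
    # Match [text](http...) style Markdown links
    for url in re.findall(r'\]\((https?://[^)\s]+)\)', content):
        try:
            response = requests.head(url, timeout=10, allow_redirects=True)
            if response.status_code >= 400:
                broken.append(url)
        except requests.exceptions.RequestException:
            broken.append(url)
    return broken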
Final Thoughts
Building a robust Markdown knowledge base requires:
- Consistent Structure
  - Clear organization
  - Regular formatting
  - Predictable patterns
- Quality Content
  - Up-to-date information
  - Well-documented code
  - Comprehensive metadata
- Regular Maintenance
  - Scheduled reviews
  - Automated checks
  - Content updates
Pro Tip: Start small and iterate. Your knowledge base should evolve with your needs and grow organically over time.
Need help getting started? Check out our documentation or join our community.