🦊

Firefox History Crawler

(for macOS)

"What did I even do on the internet last year?"

A cozy little Python script that digs through your Firefox browsing history, crawls all those URLs you visited, and enriches them with titles, descriptions, content, and social media metadata. Perfect for digital archaeologists, data hoarders, and anyone who wants to remember that one article they read at 3am.


┌─────────────────────────────────────────────────────────┐
│  🦊 Firefox Database                                    │
│     └── 56,000 URLs you forgot about                    │
└─────────────────────────────────────────────────────────┘
                          │
                          ▼ ✨ magic ✨
┌─────────────────────────────────────────────────────────┐
│  📊 Rich Dataset                                        │
│     ├── Titles, descriptions, word counts               │
│     ├── YouTube video IDs, Twitter authors              │
│     ├── Reddit posts, GitHub repos                      │
│     └── All the things you meant to read later          │
└─────────────────────────────────────────────────────────┘

Features

🗄️ Extracts your full Firefox browsing history (visit counts, timestamps, frecency)
🕷️ Crawls every URL to fetch fresh metadata and content (see the sketch below)
🎬 Parses social media URLs (YouTube, Twitter, Reddit, TikTok, and more!)
💾 Checkpoints progress so you can stop and resume anytime
🐢 Polite crawling with rate limiting (we say thank you in the user agent!)
📈 Progress bar, because watching numbers go up is satisfying
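Under the hood, "crawl and enrich" is a small amount of work per URL. A minimal sketch of the idea with aiohttp and BeautifulSoup (the function name, fields, and user-agent string here are illustrative, not the script's internals):

import asyncio
import aiohttp
from bs4 import BeautifulSoup

async def enrich_url(session: aiohttp.ClientSession, url: str) -> dict:
    # Fetch one page and pull out the basics: title, description, word count
    async with session.get(url, timeout=aiohttp.ClientTimeout(total=5)) as resp:
        status = resp.status
        html = await resp.text(errors="replace")
    soup = BeautifulSoup(html, "lxml")
    desc = soup.find("meta", attrs={"name": "description"})
    return {
        "url": url,
        "status": status,
        "title": soup.title.get_text(strip=True) if soup.title else None,
        "description": desc.get("content") if desc else None,
        "word_count": len(soup.get_text(" ", strip=True).split()),
    }

async def main():
    headers = {"User-Agent": "firefox-history-crawler (thank you!)"}
    async with aiohttp.ClientSession(headers=headers) as session:
        print(await enrich_url(session, "https://example.com"))

asyncio.run(main())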

Quick Start

📥 Download the script (firefox_history_crawler.py). It's a single file: opens in any text editor, requires only Python to run.

Prerequisites

# You'll need these friends
pip install aiohttp pandas beautifulsoup4 lxml tqdm

Run it!

# Safe to run even while Firefox is open (see FAQ)
python firefox_history_crawler.py

That's it. Go meditate or make some tea. ☕

What happens next

============================================================
FIREFOX HISTORY EXTRACTION PIPELINE
============================================================
Output folder: crawl_output/

Loading Firefox history...
  Loaded 56,847 URLs from Firefox history
  Combined 1,234 duplicate URLs (visit counts summed)
  Final URL count: 55,613 unique URLs

============================================================
ASYNC WEB CRAWLER
============================================================
Crawling URLs: 100%|████████████████| 55613/55613 [1:42:30<00:00]

[Checkpoint saved every 500 URLs, don't worry!]
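Resuming works the way you'd hope: partial results get persisted, and the next run simply skips anything already finished. A minimal sketch of the pattern, assuming checkpoints are plain pickles keyed by URL (the script's exact format may differ):

import pickle
from pathlib import Path

CHECKPOINT = Path("crawl_output/firefox_crawl_checkpoint.pkl")
all_urls = ["https://example.com", "https://example.org"]  # stand-in for your history

def load_checkpoint() -> dict:
    # Previously crawled results keyed by URL, or empty on a fresh run
    if CHECKPOINT.exists():
        return pickle.loads(CHECKPOINT.read_bytes())
    return {}

def save_checkpoint(results: dict) -> None:
    CHECKPOINT.parent.mkdir(parents=True, exist_ok=True)
    CHECKPOINT.write_bytes(pickle.dumps(results))

results = load_checkpoint()
todo = [u for u in all_urls if u not in results]   # resume = skip finished URLs
for i, url in enumerate(todo, start=1):
    results[url] = {"crawled": True}               # placeholder for the real crawl
    if i % 500 == 0:                               # CONFIG['checkpoint_interval']
        save_checkpoint(results)
save_checkpoint(results)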

Configuration

Tweak these settings at the top of the script:

CONFIG = {
  'delay_per_domain': 0.5,      # Seconds between requests to same domain
  'max_concurrent': 100,        # Parallel connections (be nice!)
  'max_retries': 3,             # Attempts before giving up
  'timeout': 5,                 # Seconds to wait for slow servers
  'min_visit_count': 1,         # Skip URLs you only visited once? Set to 2+
  'max_content_length': 15000,  # Truncate War and Peace
  'checkpoint_interval': 500,   # Save progress every N URLs
}
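delay_per_domain is the setting to treat with the most respect. The usual way to enforce it is to remember the last request time per domain and sleep off the difference; here's a sketch of that pattern (not the script's exact code):

import asyncio
import time
from urllib.parse import urlparse

class DomainThrottle:
    # Guarantees at least `delay` seconds between requests to the same domain
    def __init__(self, delay: float = 0.5):
        self.delay = delay
        self.last_hit: dict[str, float] = {}
        self.locks: dict[str, asyncio.Lock] = {}

    async def wait(self, url: str) -> None:
        domain = urlparse(url).netloc
        lock = self.locks.setdefault(domain, asyncio.Lock())
        async with lock:
            elapsed = time.monotonic() - self.last_hit.get(domain, 0.0)
            if elapsed < self.delay:
                await asyncio.sleep(self.delay - elapsed)
            self.last_hit[domain] = time.monotonic()

Calling await throttle.wait(url) right before each request enforces the gap per domain without slowing down unrelated hosts.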

🐢 vs 🐇 Mode

Setting             Polite 🐢    Speedy 🐇    Chaotic 🐆
delay_per_domain    1.0          0.5          0.1
max_concurrent      50           100          200
timeout             10           5            3
~56K URLs           ~3 hours     ~1.5 hours   ~45 min*

*Some sites may get grumpy

Output Files

crawl_output/
├── firefox_history.pkl # Full data, fast to reload
├── firefox_history.csv # Full data, human-readable
├── firefox_history_clean.csv # Only successful crawls
├── firefox_history_errors.csv # The ones that got away
└── firefox_crawl_checkpoint.pkl # Temporary, deleted when done
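Once the crawl finishes, the .pkl is the fastest way back into the data:

import pandas as pd

df = pd.read_pickle("crawl_output/firefox_history.pkl")
print(df.shape)              # rows x columns
print(df.columns.tolist())   # see exactly which fields your run produced
print(df.head())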

Supported Platforms

The script recognizes and extracts IDs from:

📺 YouTube: video ID, channel, playlist, shorts
🐦 Twitter/X: tweet ID, username
🤖 Reddit: post ID, subreddit, username
🐙 GitHub: repo, issue/PR number, file path
📸 Instagram: post, reel, story, profile
🎵 TikTok: video ID, username
💼 LinkedIn: profile, company, job
🎮 Twitch: channel, video, clip
🎧 Spotify: track, album, artist, playlist
📘 Facebook: post, profile, group, event
🍿 Netflix: title ID (watch/browse)
🎬 Vimeo: video ID, channel

Plus basic support for Medium, Quora, Pinterest, Tumblr, and SoundCloud!
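Platform parsing is pure pattern matching on the URL string, so it costs nothing extra at crawl time. An illustrative sketch for three platforms (these regexes are simplified; the script's own patterns cover many more URL shapes):

import re

PATTERNS = {
    "youtube_video_id": re.compile(r"(?:youtube\.com/watch\?v=|youtu\.be/)([\w-]{11})"),
    "reddit_subreddit": re.compile(r"reddit\.com/r/([^/]+)"),
    "github_repo":      re.compile(r"github\.com/([^/]+/[^/#?]+)"),
}

def parse_platform_ids(url: str) -> dict:
    # Return whichever platform IDs match this URL
    return {name: m.group(1) for name, rx in PATTERNS.items() if (m := rx.search(url))}

print(parse_platform_ids("https://www.youtube.com/watch?v=dQw4w9WgXcQ"))
# {'youtube_video_id': 'dQw4w9WgXcQ'}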

How It Works

Firefox DB → Load & Clean → Async Crawl → Save CSV

100 concurrent connections overall • max 3 connections per domain • 3 retries per URL • 0.5s delay between requests to the same domain
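In aiohttp terms those numbers map directly onto a connector plus a retry loop. A sketch under the CONFIG defaults above (function names are illustrative):

import asyncio
import aiohttp

async def fetch_with_retries(session, url, max_retries=3, timeout=5):
    # Try a URL up to max_retries times before recording it as an error
    for attempt in range(1, max_retries + 1):
        try:
            async with session.get(url, timeout=aiohttp.ClientTimeout(total=timeout)) as resp:
                return {"url": url, "status": resp.status,
                        "html": await resp.text(errors="replace")}
        except (aiohttp.ClientError, asyncio.TimeoutError) as exc:
            if attempt == max_retries:
                return {"url": url, "error": repr(exc)}
            await asyncio.sleep(attempt)  # simple backoff between attempts

async def crawl(urls):
    # limit = overall concurrency, limit_per_host = per-domain connection cap
    connector = aiohttp.TCPConnector(limit=100, limit_per_host=3)
    async with aiohttp.ClientSession(connector=connector) as session:
        return await asyncio.gather(*(fetch_with_retries(session, u) for u in urls))

results = asyncio.run(crawl(["https://example.com", "https://example.org"]))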

Limitations

Some sites fight back — Cloudflare, bot detection, login walls. We try 3 times then move on.
Social media is shy — Many platforms require authentication to see content. We get what we can from public pages.
Memory grows — 56K URLs × up to 15,000 characters of content each works out to roughly 500MB-1GB of RAM. Your laptop can handle it.
macOS only (for now) — auto-detection works on macOS. Linux/Windows users need to set firefox_profile_path manually (see the sketch after this list).
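If you need to point the script at a profile yourself, finding places.sqlite by hand is quick. These glob patterns are the standard profile locations (verify on your machine):

import glob
from pathlib import Path

CANDIDATES = [
    "~/Library/Application Support/Firefox/Profiles/*/places.sqlite",  # macOS
    "~/.mozilla/firefox/*/places.sqlite",                              # Linux
    "~/AppData/Roaming/Mozilla/Firefox/Profiles/*/places.sqlite",      # Windows
]

for pattern in CANDIDATES:
    for hit in glob.glob(str(Path(pattern).expanduser())):
        print(hit)  # point firefox_profile_path at this file's folder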

FAQ

Q: Will this get me banned/rate-limited?
A: We're polite! 0.5s delay per domain, honest user agent, max 3 connections per host. Most sites won't even notice.
Q: Can I run this while Firefox is open?
A: Yes! We use SQLite's backup API to safely copy the database (see the sketch after this FAQ).
Q: What if my internet dies mid-crawl?
A: Checkpoints save every 500 URLs. Just run it again and it'll resume.
Q: Why is X website always erroring?
A: Some sites block crawlers aggressively. Check firefox_history_errors.csv for details.
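The trick behind that Firefox-can-stay-open answer is Python's built-in sqlite3 backup API, which can snapshot a live database consistently. A sketch of the copy step (paths and the function name are illustrative):

import sqlite3

def snapshot_places(profile_db: str, copy_path: str = "places_copy.sqlite") -> str:
    # Copy a live places.sqlite safely, even while Firefox holds it open
    src = sqlite3.connect(f"file:{profile_db}?mode=ro", uri=True)  # read-only open
    dst = sqlite3.connect(copy_path)
    with dst:
        src.backup(dst)  # consistent snapshot via SQLite's Online Backup API
    src.close()
    dst.close()
    return copy_path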

License

MIT — Do whatever you want with it! If you build something cool, I'd love to hear about it.