🦊

Firefox History Crawler

(for macOS)

"What did I even do on the internet last year?"

A cozy little Python script that digs through your Firefox browsing history, crawls all those URLs you visited, and enriches them with titles, descriptions, content, and social media metadata. Perfect for digital archaeologists, data hoarders, and anyone who wants to remember that one article they read at 3am.


┌─────────────────────────────────────────────────────────┐
│  🦊 Firefox Database                                    │
│     └── 56,000 URLs you forgot about                    │
└─────────────────────────────────────────────────────────┘
                          │
                          ▼ ✨ magic ✨
┌─────────────────────────────────────────────────────────┐
│  📊 Rich Dataset                                        │
│     ├── Titles, descriptions, word counts               │
│     ├── YouTube video IDs, Twitter authors              │
│     ├── Reddit posts, GitHub repos                      │
│     └── All the things you meant to read later          │
└─────────────────────────────────────────────────────────┘

Features

🗄️ Extracts your full Firefox browsing history (visit counts, timestamps, frecency)
🕷️ Crawls every URL to fetch fresh metadata and content (see the sketch below)
🎬 Parses social media URLs (YouTube, Twitter, Reddit, TikTok, and more!)
💾 Checkpoints progress so you can stop and resume anytime
🐢 Polite crawling with rate limiting (we say thank you in the user agent!)
📈 Progress bar, because watching numbers go up is satisfying
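Under the hood, "crawl and enrich" is a small amount of work per URL. A minimal sketch of the idea with aiohttp and BeautifulSoup (the function name, fields, and user-agent string here are illustrative, not the script's internals):

import asyncio
import aiohttp
from bs4 import BeautifulSoup

async def enrich_url(session: aiohttp.ClientSession, url: str) -> dict:
    # Fetch one page and pull out the basics: title, description, word count
    async with session.get(url, timeout=aiohttp.ClientTimeout(total=5)) as resp:
        status = resp.status
        html = await resp.text(errors="replace")
    soup = BeautifulSoup(html, "lxml")
    desc = soup.find("meta", attrs={"name": "description"})
    return {
        "url": url,
        "status": status,
        "title": soup.title.get_text(strip=True) if soup.title else None,
        "description": desc.get("content") if desc else None,
        "word_count": len(soup.get_text(" ", strip=True).split()),
    }

async def main():
    headers = {"User-Agent": "firefox-history-crawler (thank you!)"}
    async with aiohttp.ClientSession(headers=headers) as session:
        print(await enrich_url(session, "https://example.com"))

asyncio.run(main())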

Quick Start

📥 Download the script (firefox_history_crawler.py). It's a single file: opens in any text editor, requires only Python to run.

Prerequisites

# You'll need these friends
pip install aiohttp pandas beautifulsoup4 lxml tqdm

Run it!

# Safe to run even while Firefox is open (see FAQ)
python firefox_history_crawler.py

That's it. Go meditate or make some tea. ☕

What happens next

============================================================
FIREFOX HISTORY EXTRACTION PIPELINE
============================================================
Output folder: crawl_output/

Loading Firefox history...
  Loaded 56,847 URLs from Firefox history
  Combined 1,234 duplicate URLs (visit counts summed)
  Final URL count: 55,613 unique URLs

============================================================
ASYNC WEB CRAWLER
============================================================
Crawling URLs: 100%|████████████████| 55613/55613 [1:42:30<00:00]

[Checkpoint saved every 500 URLs, don't worry!]
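Resuming works the way you'd hope: partial results get persisted, and the next run simply skips anything already finished. A minimal sketch of the pattern, assuming checkpoints are plain pickles keyed by URL (the script's exact format may differ):

import pickle
from pathlib import Path

CHECKPOINT = Path("crawl_output/firefox_crawl_checkpoint.pkl")
all_urls = ["https://example.com", "https://example.org"]  # stand-in for your history

def load_checkpoint() -> dict:
    # Previously crawled results keyed by URL, or empty on a fresh run
    if CHECKPOINT.exists():
        return pickle.loads(CHECKPOINT.read_bytes())
    return {}

def save_checkpoint(results: dict) -> None:
    CHECKPOINT.parent.mkdir(parents=True, exist_ok=True)
    CHECKPOINT.write_bytes(pickle.dumps(results))

results = load_checkpoint()
todo = [u for u in all_urls if u not in results]   # resume = skip finished URLs
for i, url in enumerate(todo, start=1):
    results[url] = {"crawled": True}               # placeholder for the real crawl
    if i % 500 == 0:                               # CONFIG['checkpoint_interval']
        save_checkpoint(results)
save_checkpoint(results)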

Configuration

Tweak these settings at the top of the script:

CONFIG = {
  'delay_per_domain': 0.5,      # Seconds between requests to same domain
  'max_concurrent': 100,        # Parallel connections (be nice!)
  'max_retries': 3,             # Attempts before giving up
  'timeout': 5,                 # Seconds to wait for slow servers
  'min_visit_count': 1,         # Skip URLs you only visited once? Set to 2+
  'max_content_length': 15000,  # Truncate War and Peace
  'checkpoint_interval': 500,   # Save progress every N URLs
}
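delay_per_domain is the setting to treat with the most respect. The usual way to enforce it is to remember the last request time per domain and sleep off the difference; here's a sketch of that pattern (not the script's exact code):

import asyncio
import time
from urllib.parse import urlparse

class DomainThrottle:
    # Guarantees at least `delay` seconds between requests to the same domain
    def __init__(self, delay: float = 0.5):
        self.delay = delay
        self.last_hit: dict[str, float] = {}
        self.locks: dict[str, asyncio.Lock] = {}

    async def wait(self, url: str) -> None:
        domain = urlparse(url).netloc
        lock = self.locks.setdefault(domain, asyncio.Lock())
        async with lock:
            elapsed = time.monotonic() - self.last_hit.get(domain, 0.0)
            if elapsed < self.delay:
                await asyncio.sleep(self.delay - elapsed)
            self.last_hit[domain] = time.monotonic()

Calling await throttle.wait(url) right before each request enforces the gap per domain without slowing down unrelated hosts.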

🐢 vs 🐇 Mode

Setting             Polite 🐢    Speedy 🐇    Chaotic 🐆
delay_per_domain    1.0          0.5          0.1
max_concurrent      50           100          200
timeout             10           5            3
~56K URLs           ~3 hours     ~1.5 hours   ~45 min*

*Some sites may get grumpy

Output Files

crawl_output/
├── firefox_history.pkl # Full data, fast to reload
├── firefox_history.csv # Full data, human-readable
├── firefox_history_clean.csv # Only successful crawls
├── firefox_history_errors.csv # The ones that got away
└── firefox_crawl_checkpoint.pkl # Temporary, deleted when done
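Once the crawl finishes, the .pkl is the fastest way back into the data:

import pandas as pd

df = pd.read_pickle("crawl_output/firefox_history.pkl")
print(df.shape)              # rows x columns
print(df.columns.tolist())   # see exactly which fields your run produced
print(df.head())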

Supported Platforms

The script recognizes and extracts IDs from:

📺 YouTube: video ID, channel, playlist, shorts
🐦 Twitter/X: tweet ID, username
🤖 Reddit: post ID, subreddit, username
🐙 GitHub: repo, issue/PR number, file path
📸 Instagram: post, reel, story, profile
🎵 TikTok: video ID, username
💼 LinkedIn: profile, company, job
🎮 Twitch: channel, video, clip
🎧 Spotify: track, album, artist, playlist
📘 Facebook: post, profile, group, event
🍿 Netflix: title ID (watch/browse)
🎬 Vimeo: video ID, channel

Plus basic support for Medium, Quora, Pinterest, Tumblr, and SoundCloud!
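Platform parsing is pure pattern matching on the URL string, so it costs nothing extra at crawl time. An illustrative sketch for three platforms (these regexes are simplified; the script's own patterns cover many more URL shapes):

import re

PATTERNS = {
    "youtube_video_id": re.compile(r"(?:youtube\.com/watch\?v=|youtu\.be/)([\w-]{11})"),
    "reddit_subreddit": re.compile(r"reddit\.com/r/([^/]+)"),
    "github_repo":      re.compile(r"github\.com/([^/]+/[^/#?]+)"),
}

def parse_platform_ids(url: str) -> dict:
    # Return whichever platform IDs match this URL
    return {name: m.group(1) for name, rx in PATTERNS.items() if (m := rx.search(url))}

print(parse_platform_ids("https://www.youtube.com/watch?v=dQw4w9WgXcQ"))
# {'youtube_video_id': 'dQw4w9WgXcQ'}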

How It Works

Firefox DB → Load & Clean → Async Crawl → Save CSV

100 concurrent connections overall • max 3 connections per domain • 3 retries per URL • 0.5s delay between requests to the same domain
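In aiohttp terms those numbers map directly onto a connector plus a retry loop. A sketch under the CONFIG defaults above (function names are illustrative):

import asyncio
import aiohttp

async def fetch_with_retries(session, url, max_retries=3, timeout=5):
    # Try a URL up to max_retries times before recording it as an error
    for attempt in range(1, max_retries + 1):
        try:
            async with session.get(url, timeout=aiohttp.ClientTimeout(total=timeout)) as resp:
                return {"url": url, "status": resp.status,
                        "html": await resp.text(errors="replace")}
        except (aiohttp.ClientError, asyncio.TimeoutError) as exc:
            if attempt == max_retries:
                return {"url": url, "error": repr(exc)}
            await asyncio.sleep(attempt)  # simple backoff between attempts

async def crawl(urls):
    # limit = overall concurrency, limit_per_host = per-domain connection cap
    connector = aiohttp.TCPConnector(limit=100, limit_per_host=3)
    async with aiohttp.ClientSession(connector=connector) as session:
        return await asyncio.gather(*(fetch_with_retries(session, u) for u in urls))

results = asyncio.run(crawl(["https://example.com", "https://example.org"]))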

Limitations

Some sites fight back — Cloudflare, bot detection, login walls. We try 3 times then move on.
Social media is shy — Many platforms require authentication to see content. We get what we can from public pages.
Memory grows — 56K URLs × up to 15,000 characters of content each works out to roughly 500MB-1GB of RAM. Your laptop can handle it.
macOS only (for now) — auto-detection works on macOS. Linux/Windows users need to set firefox_profile_path manually (see the sketch after this list).
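If you need to point the script at a profile yourself, finding places.sqlite by hand is quick. These glob patterns are the standard profile locations (verify on your machine):

import glob
from pathlib import Path

CANDIDATES = [
    "~/Library/Application Support/Firefox/Profiles/*/places.sqlite",  # macOS
    "~/.mozilla/firefox/*/places.sqlite",                              # Linux
    "~/AppData/Roaming/Mozilla/Firefox/Profiles/*/places.sqlite",      # Windows
]

for pattern in CANDIDATES:
    for hit in glob.glob(str(Path(pattern).expanduser())):
        print(hit)  # point firefox_profile_path at this file's folder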

FAQ

Q: Will this get me banned/rate-limited?
A: We're polite! 0.5s delay per domain, honest user agent, max 3 connections per host. Most sites won't even notice.
Q: Can I run this while Firefox is open?
A: Yes! We use SQLite's backup API to safely copy the database (see the sketch after this FAQ).
Q: What if my internet dies mid-crawl?
A: Checkpoints save every 500 URLs. Just run it again and it'll resume.
Q: Why is X website always erroring?
A: Some sites block crawlers aggressively. Check firefox_history_errors.csv for details.
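The trick behind that Firefox-can-stay-open answer is Python's built-in sqlite3 backup API, which can snapshot a live database consistently. A sketch of the copy step (paths and the function name are illustrative):

import sqlite3

def snapshot_places(profile_db: str, copy_path: str = "places_copy.sqlite") -> str:
    # Copy a live places.sqlite safely, even while Firefox holds it open
    src = sqlite3.connect(f"file:{profile_db}?mode=ro", uri=True)  # read-only open
    dst = sqlite3.connect(copy_path)
    with dst:
        src.backup(dst)  # consistent snapshot via SQLite's Online Backup API
    src.close()
    dst.close()
    return copy_path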

License

MIT — Do whatever you want with it! If you build something cool, I'd love to hear about it.