(for macOS)
"What did I even do on the internet last year?"
A cozy little Python script that digs through your Firefox browsing history, crawls all those URLs you visited, and enriches them with titles, descriptions, content, and social media metadata. Perfect for digital archaeologists, data hoarders, and anyone who wants to remember that one article they read at 3am.
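Under the hood, step one is refreshingly mundane: copy your profile's `places.sqlite` and query it with plain SQLite. A minimal sketch, assuming the standard `moz_places` schema (the script's actual query may differ):

```python
import shutil
import sqlite3
import pandas as pd

def load_history(places_path: str) -> pd.DataFrame:
    """Pull URLs + visit counts out of a Firefox places.sqlite file."""
    # Work on a copy: a running Firefox holds a lock on the original.
    snapshot = "/tmp/places_snapshot.sqlite"
    shutil.copy2(places_path, snapshot)
    conn = sqlite3.connect(snapshot)
    df = pd.read_sql_query(
        """
        SELECT url, title, visit_count, last_visit_date
        FROM moz_places
        WHERE visit_count > 0
        ORDER BY visit_count DESC
        """,
        conn,
    )
    conn.close()
    # Firefox stores last_visit_date as microseconds since the Unix epoch
    df["last_visit"] = pd.to_datetime(df["last_visit_date"], unit="us")
    return df
```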
```
┌─────────────────────────────────────────────────────────┐
│  🦊 Firefox Database                                    │
│  └── 56,000 URLs you forgot about                       │
└─────────────────────────────────────────────────────────┘
                            │
                            ▼  ✨ magic ✨
┌─────────────────────────────────────────────────────────┐
│  📊 Rich Dataset                                        │
│  ├── Titles, descriptions, word counts                  │
│  ├── YouTube video IDs, Twitter authors                 │
│  ├── Reddit posts, GitHub repos                         │
│  └── All the things you meant to read later             │
└─────────────────────────────────────────────────────────┘
```
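The titles, descriptions, and word counts in the bottom box come from parsing each fetched page with BeautifulSoup. A rough sketch of that step (these are the usual meta tags, not necessarily the script's exact selectors):

```python
from bs4 import BeautifulSoup

def extract_metadata(html: str) -> dict:
    """Pull the common title/description fields out of a page."""
    soup = BeautifulSoup(html, "lxml")

    def meta(**attrs) -> str:
        tag = soup.find("meta", attrs=attrs)
        return (tag.get("content") or "").strip() if tag else ""

    text = soup.get_text(" ", strip=True)
    return {
        "title": soup.title.get_text(strip=True) if soup.title else "",
        "description": meta(name="description") or meta(property="og:description"),
        "og_title": meta(property="og:title"),
        "word_count": len(text.split()),
    }
```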
Opens in any text editor • Requires Python to run
```bash
# You'll need these friends
pip install aiohttp pandas beautifulsoup4 lxml tqdm
```
```bash
# Close Firefox before running (it holds a lock on the history database)
python firefox_history_crawler.py
```
That's it. Go meditate or make some tea. ☕
```
============================================================
FIREFOX HISTORY EXTRACTION PIPELINE
============================================================
Output folder: crawl_output/
Loading Firefox history...
Loaded 56,847 URLs from Firefox history
Combined 1,234 duplicate URLs (visit counts summed)
Final URL count: 55,613 unique URLs

============================================================
ASYNC WEB CRAWLER
============================================================
Crawling URLs: 100%|████████████████| 55613/55613 [1:42:30<00:00]
```
[Checkpoint saved every 500 URLs, don't worry!]
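That "Combined 1,234 duplicate URLs" line is a normalize-then-groupby over the history DataFrame. Something along these lines (the normalization rules here are an assumption, not the script's exact logic):

```python
import pandas as pd

def combine_duplicates(df: pd.DataFrame) -> pd.DataFrame:
    """Collapse near-identical URLs, summing their visit counts."""
    df = df.copy()
    # Drop #fragments and trailing slashes so trivial variants merge
    df["url"] = df["url"].str.split("#").str[0].str.rstrip("/")
    before = len(df)
    df = df.groupby("url", as_index=False).agg(
        title=("title", "first"),
        visit_count=("visit_count", "sum"),
        last_visit_date=("last_visit_date", "max"),
    )
    print(f"Combined {before - len(df):,} duplicate URLs (visit counts summed)")
    return df
```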
Tweak these settings at the top of the script:
```python
CONFIG = {
    'delay_per_domain': 0.5,      # Seconds between requests to same domain
    'max_concurrent': 100,        # Parallel connections (be nice!)
    'max_retries': 3,             # Attempts before giving up
    'timeout': 5,                 # Seconds to wait for slow servers
    'min_visit_count': 1,         # Skip URLs you only visited once? Set to 2+
    'max_content_length': 15000,  # Truncate War and Peace
    'checkpoint_interval': 500,   # Save progress every N URLs
}
```
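For the curious, here's roughly how those knobs plug into the crawl loop, using the `CONFIG` dict above. Function names are illustrative and the per-domain throttle is simplified compared to whatever the script actually does; the `checkpoint_interval` part (flushing results to disk every N URLs) is omitted for brevity:

```python
import asyncio
import time
from urllib.parse import urlsplit

import aiohttp

async def fetch(session, sem, last_hit, url):
    domain = urlsplit(url).netloc
    async with sem:  # caps us at CONFIG['max_concurrent'] in-flight requests
        # delay_per_domain: back off if we hit this domain too recently
        wait = CONFIG["delay_per_domain"] - (time.monotonic() - last_hit.get(domain, 0.0))
        if wait > 0:
            await asyncio.sleep(wait)
        last_hit[domain] = time.monotonic()
        for attempt in range(CONFIG["max_retries"]):
            try:
                async with session.get(url) as resp:
                    html = await resp.text(errors="replace")
                    return html[: CONFIG["max_content_length"]]
            except (aiohttp.ClientError, asyncio.TimeoutError):
                await asyncio.sleep(2 ** attempt)  # exponential-ish backoff
        return None  # gave up after max_retries

async def crawl(urls):
    sem = asyncio.Semaphore(CONFIG["max_concurrent"])
    last_hit: dict[str, float] = {}
    timeout = aiohttp.ClientTimeout(total=CONFIG["timeout"])
    async with aiohttp.ClientSession(timeout=timeout) as session:
        return await asyncio.gather(*(fetch(session, sem, last_hit, u) for u in urls))
```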
| Setting | Polite 🐢 | Speedy 🐇 | Chaotic 🐆 |
|---|---|---|---|
| `delay_per_domain` | 1.0 | 0.5 | 0.1 |
| `max_concurrent` | 50 | 100 | 200 |
| `timeout` | 10 | 5 | 3 |
| Est. runtime (~56K URLs) | ~3 hours | ~1.5 hours | ~45 min* |
*Some sites may get grumpy
The script recognizes and extracts IDs from:

- **YouTube** (video IDs)
- **Twitter/X** (tweet authors)
- **Reddit** (posts)
- **GitHub** (repos)

Plus basic support for Medium, Quora, Pinterest, Tumblr, and SoundCloud! A sketch of the kind of URL patterns involved follows below.
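These patterns are illustrative rather than the script's actual regexes, but they match the common URL shapes:

```python
import re

PATTERNS = {
    "youtube_video_id": re.compile(r"(?:youtube\.com/watch\?v=|youtu\.be/)([\w-]{11})"),
    "twitter_author": re.compile(r"(?:twitter|x)\.com/(\w+)/status/"),
    "reddit_post": re.compile(r"reddit\.com/r/\w+/comments/(\w+)"),
    "github_repo": re.compile(r"github\.com/([\w.-]+/[\w.-]+)"),
}

def extract_ids(url: str) -> dict:
    """Return whichever site-specific IDs match this URL."""
    return {
        name: m.group(1)
        for name, pattern in PATTERNS.items()
        if (m := pattern.search(url))
    }

# extract_ids("https://youtu.be/dQw4w9WgXcQ")
#   -> {'youtube_video_id': 'dQw4w9WgXcQ'}
```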
If the script can't find your Firefox profile automatically, set `firefox_profile_path` manually.
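If you'd rather auto-detect it, something like this works on macOS (an assumption about the standard profile layout; the most recently touched profile is usually the active one):

```python
from pathlib import Path

def find_places_sqlite() -> Path | None:
    """Locate the active profile's places.sqlite on macOS."""
    profiles = Path.home() / "Library/Application Support/Firefox/Profiles"
    candidates = sorted(
        profiles.glob("*/places.sqlite"),
        key=lambda p: p.stat().st_mtime,  # most recently used profile first
        reverse=True,
    )
    return candidates[0] if candidates else None
```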
Check `firefox_history_errors.csv` for details.

MIT. Do whatever you want with it! If you build something cool, I'd love to hear about it.