01 May, 2024

What is ArchiveBox

ArchiveBox is a powerful, self-hosted internet archiving solution to collect, save, and view websites offline.

Without active preservation effort, everything on the internet eventually disappears or degrades. Archive.org does a great job as a centralized service, but saved URLs have to be public, and it can’t save every type of content.

ArchiveBox is an open-source tool that lets organizations & individuals archive both public & private web content while retaining control over their data. It can be used to save copies of bookmarks, preserve evidence for legal cases, back up photos from FB/Insta/Flickr or media from YT/Soundcloud/etc., save research papers, and more…

ArchiveBox Features

Install ArchiveBox

Get ArchiveBox with pip install archivebox on Linux, macOS, and Windows (WSL2), or via Docker ⭐️.

Once installed, it can be used as a CLI tool, self-hosted Web App, Python library, or one-off command.

📥 You can feed ArchiveBox URLs one at a time, or schedule regular imports from your bookmarks or history, social media feeds or RSS, link-saving services like Pocket/Pinboard, our Browser Extension, and more.
See Input Formats for a full list of supported input formats…
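
As a rough illustration of preparing input, the sketch below pulls URLs out of free-form text (for example, an exported notes file) and de-duplicates them into a one-per-line list. The helper name `extract_urls` and its regular expression are assumptions made for this example and are not part of ArchiveBox, which ships its own parsers for the formats it supports.

```python
import re

# Hypothetical helper, not part of ArchiveBox: a deliberately simple URL matcher.
URL_PATTERN = re.compile(r"https?://[^\s<>\"')\]]+")


def extract_urls(text: str) -> list[str]:
    """Return the unique http(s) URLs found in text, in first-seen order."""
    seen = set()
    urls = []
    for match in URL_PATTERN.findall(text):
        url = match.rstrip(".,;:!?")  # drop trailing sentence punctuation
        if url not in seen:
            seen.add(url)
            urls.append(url)
    return urls


if __name__ == "__main__":
    sample = (
        "Read https://example.com/post, then (https://example.org/a) "
        "and https://example.com/post."
    )
    # Print one URL per line, ready to be saved to a text file.
    print("\n".join(extract_urls(sample)))
```

The resulting list can be saved to a file such as urls.txt and then passed to ArchiveBox, for instance with archivebox add < urls.txt.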


It saves snapshots of the URLs you feed it in several redundant formats.
It also detects content embedded in pages & extracts it into a folder:

  • 🌐 HTML/Any websites ➡️ original HTML+CSS+JS, singlefile HTML, screenshot PNG, PDF, WARC, title, article text, favicon, headers, …
  • 🎥 Social Media/News ➡️ post content TXT, comments, title, author, images, …
  • 🎬 YouTube/SoundCloud/etc. ➡️ MP3/MP4s, subtitles, metadata, thumbnail, …
  • 💾 GitHub/GitLab/etc. links ➡️ clone of Git source code, README, images, …
  • and more, see Output Formats below…

🛠️ ArchiveBox uses standard tools like Chrome, wget, & yt-dlp, and stores data in ordinary files & folders.
(no complex proprietary formats, all data is readable without needing to run ArchiveBox)
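
Because the output is ordinary files and folders, an archive can be inspected with plain scripts. The sketch below lists the files in each snapshot folder using only the Python standard library. The archive/<snapshot>/ layout and the function name `list_snapshot_files` are assumptions for illustration; check your own data folder for the actual structure.

```python
from pathlib import Path


def list_snapshot_files(data_dir: str) -> dict[str, list[str]]:
    """Map each snapshot folder name to a sorted list of its files (relative paths)."""
    # Assumed layout: <data_dir>/archive/<snapshot>/...; adjust to your setup.
    archive_root = Path(data_dir) / "archive"
    snapshots: dict[str, list[str]] = {}
    if not archive_root.is_dir():
        return snapshots
    for snapshot_dir in sorted(p for p in archive_root.iterdir() if p.is_dir()):
        # Walk the snapshot folder recursively and keep regular files only.
        files = sorted(
            f.relative_to(snapshot_dir).as_posix()
            for f in snapshot_dir.rglob("*")
            if f.is_file()
        )
        snapshots[snapshot_dir.name] = files
    return snapshots


if __name__ == "__main__":
    # Run from inside a data folder to see what each snapshot contains.
    for name, files in list_snapshot_files(".").items():
        print(name, files)
```

Nothing here imports or runs ArchiveBox, which is the point: the saved data stays readable with generic tools.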

The goal is to sleep soundly knowing the part of the internet you care about will be automatically preserved in durable, easily accessible formats for decades after it goes down.


Getting Started