Workflows for capturing individual social media accounts
This is the first deliverable (D1) of the project Best practices for social media archiving in Flanders and Brussels, carried out by KADOC and meemoo with the support of the Flemish Government. It is the result of WP1 (preparation for the capture of individual accounts), in which different tools and APIs were tested for capturing social media accounts.
In this document you will find a number of workflows for archiving social media. Note! The code of social media platforms changes regularly. It is therefore possible that at some point certain workflows will no longer work.
Authors
Nastasia Vanderperren and Lode Scheers with contributions from Rony Vissers (meemoo)
Step 1: Create an account
It is recommended to create an institutional account that you only use to archive social media. This prevents personal information, such as notifications, your name and profile picture, and the names of friends who follow the account, from ending up in the web archive.

Also disable two-factor authentication (2FA) so that you can log in to the account with archiving tools.
Step 2: Capture the social media account
There are four strategies for archiving individual social media accounts:
- Self-archiving: These workflows use the export functions within the social media platforms. This is the easiest way to save all content from the account; however, shared messages and comments from third parties will not be exported and will be missing from the archive. Export functions usually result in a ZIP file with HTML files, and the content of that ZIP file can be converted to the WARC format. Self-archiving can only be performed by the owner of the account.
- Crawling with look-and-feel preservation: These workflows use external tools to archive all content from an account, including messages and comments from third parties. In many cases, however, there is missing content because the social media platform hides comments or messages that it deems irrelevant or because the page of the social media platform will not load. The web archive is stored in a WARC file.
- Data scraping: Extracting content from social media platforms in the form of structured data. These tools are more stable and better at extracting all textual content from a social media platform. The data is usually stored in a JSON file.
- Download: This strategy consists of downloading videos and images from the social media platform. Comments and captions are saved in the form of structured data or in a text file.
Strategy 1: self-archiving
Self-archiving can only be performed by the owner of the account and is explained in D2 Guides for self-archiving individual accounts. If an archive institution receives such a self-archived web archive, it is advised to use the tool warcit to convert the web archive into a WARC file. To use warcit, you need Python 3.
- open a terminal window
- install warcit:
pip3 install warcit
- decompress the web archive if it is still a ZIP file
- find the base URL of the social media account. This is https://www.facebook.com/, https://twitter.com/ or https://www.instagram.com/ with the name of the account after the '/'. Some examples:
- the base URL for the Facebook page of meemoo is https://www.facebook.com/meemoo.be
- the base URL for the Twitter account of meemoo is https://twitter.com/meemoo_be
- the base URL for the Instagram account of the Flemish Parliament is https://www.instagram.com/vlaparl
- use the command `warcit basis_URL/ map_met_webarchief` to convert the web archive into a WARC file. Replace `basis_URL` with the base URL and `map_met_webarchief` with the path of the folder containing the web archive. Assuming it is the Twitter account of Nastasia, whose decompressed ZIP is on the Desktop, the command is:
- for Windows: `warcit https://twitter.com/nvanderperren/ c:\Users\(username)\Desktop\nastasia-twitter` (change '(username)' to your username)
- for macOS: `warcit https://twitter.com/nvanderperren/ ~/Desktop/nastasia-twitter`

- The file can now be opened in a WARC player, such as ReplayWeb.page or pywb (see Step 3: Quality Control).

Strategy 2: crawling with look-and-feel preservation
For crawling with preservation of the look and feel, two options are available:
- Archiveweb.page;
- snscrape and Browsertrix.
These options differ in, among other things, difficulty, requirements, the social media platforms for which they are suitable, limitations and workflow.
1. ArchiveWeb.page
ArchiveWeb.page is the successor to Webrecorder. It is a Chrome extension that allows you to archive websites and social media accounts via the browser.
Difficulty: none
Requirements:
- an account for the social media platform
- Chrome as a browser
Limitations:
- Social media platforms (especially Facebook) determine what is relevant for your account, so not all comments are visible. If you want to archive everything, you will have to click on each post individually.
- Facebook screens out certain elements, such as opening images; this functionality is therefore missing from the web archive.
2. Snscrape and Browsertrix
Snscrape is software that allows you to retrieve the URL of every resource (post, video, image) on a social media platform and save those URLs in a text file. Browsertrix is a web crawler that can archive webpages from a list of URLs. The tool is mainly suitable for capturing dynamic websites or pages that require a login, such as social media. An example of how the two tools are combined is sketched below the overview.
Difficulty: familiarity with the command line
Requirements:
- an account for the social media platform
- Docker
- Docker-Compose
- Python 3.8 or higher
Suitable for:
- Twitter: public account and hashtag
- Instagram: public account and hashtag
Limitations:
- Only suitable for public accounts
- Facebook blocks this tool; the developers are still looking for a solution.
Output:
- WARC
Workflow: Archiving individual social media posts with snscrape and browsertrix
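To illustrate how the two tools fit together, here is a minimal sketch that assumes the snscrape command line interface as it worked at the time of writing; the account meemoo_be is the example account used earlier in this document and the file name urls.txt is only an example. Snscrape first writes the URL of every post to a text file; that list of URLs is then given to Browsertrix as the pages to crawl.
# collect the URL of every tweet of the public account meemoo_be in a text file
snscrape twitter-user meemoo_be > urls.txt
# while testing, limit the run to the 100 most recent tweets
snscrape --max-results 100 twitter-user meemoo_be > urls-test.txt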
Strategy 3: data scraping
Note: snscrape and facebook-scraper no longer work, and Twarc requires payment/approval to use X/Twitter's API, making it effectively unusable.
There are several options available for data scraping:
- snscrape;
- facebook-scraper;
- Twarc.
They differ in, among other things, difficulty, requirements, the social media platforms for which they are suitable, limitations and workflow.
1. Snscrape
Snscrape is open source software for retrieving the textual content of social media in a structured text format. An example command is sketched below the overview.
Difficulty: use of the command line
Requirements:
- Python 3.8 or higher
Suitable for:
- Twitter: public account and hashtag
- Instagram: public account and hashtag
Limitations:
- Only suitable for public accounts;
- Only textual content is retrieved, no images or other media.
Workflow: Saving social media account content as structured text with snscrape
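A minimal sketch, assuming the snscrape command line interface as it worked when this workflow was written; the account and file names are only examples. The --jsonl option writes one JSON object per post (JSON Lines), which is the structured text format referred to above.
# save all tweets of the public account meemoo_be as JSON Lines (one JSON object per line)
snscrape --jsonl twitter-user meemoo_be > meemoo_tweets.jsonl
# do the same for all posts with a particular hashtag
snscrape --jsonl twitter-hashtag meemoo > hashtag_meemoo.jsonl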
2. Facebook-scraper
Facebook-scraper is open source software used to retrieve the textual content of Facebook posts in a structured text format. An example command is sketched below the overview.
Difficulty: use of the command line
Requirements:
- Python 3
Suitable for:
- Facebook: both public and private groups, pages and personal accounts
Limitations:
- Difficult to scrape entire accounts
- Only textual content is retrieved, no images or other media
Workflow: Saving Facebook accounts as structured text via Facebook-scraper
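A minimal sketch, assuming the facebook-scraper command line interface as it worked when this workflow was written; the page name meemoo.be (taken from the earlier base URL example) and the output file name are only examples.
# scrape the first 10 result pages of the public Facebook page meemoo.be and save the posts as structured text (CSV)
facebook-scraper --filename meemoo_posts.csv --pages 10 meemoo.be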
3. Twarc
Twarc is an open source tool and Python library that was developed as part of the Documenting The Now project. The tool makes it possible to archive tweets and trends via the Twitter API, as well as to convert enriched Twitter data into stripped data for publication. Example commands are sketched below the overview.
Difficulty: use of the command line
Requirements:
- A Twitter developer account
- Python 3
Suitable for:
- Archiving tweets
- Archiving Twitter profiles
- Enriching and stripping Twitter data
Limitations:
- Subject to the Twitter API rules.
Workflow: Archiving Twitter with Twarc
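A minimal sketch, assuming the twarc command line interface and a configured Twitter developer account (as noted above, the Twitter/X API is no longer freely available); the account and file names are only examples.
# store the keys of your Twitter developer account (twarc asks for them interactively)
twarc configure
# archive the most recent tweets of the meemoo_be timeline as line-oriented JSON
twarc timeline meemoo_be > meemoo_timeline.jsonl
# strip the enriched data down to tweet IDs for publication
twarc dehydrate meemoo_timeline.jsonl > meemoo_tweet_ids.txt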
Strategy 4: Download
Several options are also available for downloading:
- Instaloader
- Youtube-dl
- Tartube
- Youtube-comment-downloader
These options differ in, among other things, difficulty, requirements, the social media platforms for which they are suitable, output format, limitations and workflow.
1. Instaloader
Instaloader is used to archive content from Instagram, such as private profiles, hashtags, user stories, feeds and saved media, and the comments, geotags and captions of each post. An example command is sketched below the overview.
Difficulty: use of the command line
Requirements:
- Python 3
Suitable for:
- Downloading public and private profiles, hashtags, user stories, feeds and saved media.
- Downloading comments, geotags and captions of each post.
Limitations:
- No possibilities for archiving live content
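A minimal sketch, assuming the Instaloader command line interface; archive_account stands for the institutional account created in Step 1 and vlaparl is the example profile used earlier in this document.
# download all posts of the profile vlaparl, including comments, geotags, captions and the current stories
instaloader --login archive_account --comments --geotags --stories vlaparl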
2. Youtube-dl
Youtube-dl is an open source command line program that is mainly used to download videos and data from YouTube. An example command is sketched below the overview.
Difficulty: use of the command line
Requirements:
- Python 3.2 or higher
Suitable for:
- Downloading videos, YouTube channels, playlists and livestreams.
- Downloading videos in different formats
- Downloading data such as thumbnails and subtitles
- Downloading video metadata
- Automating archiving actions
Limitations:
- No possibility to download comments
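A minimal sketch, assuming the youtube-dl command line interface; CHANNEL_ID and VIDEO_ID are placeholders for a real channel and video.
# download all videos of a channel, together with thumbnails, subtitles (if available) and a metadata file per video
youtube-dl --write-thumbnail --write-sub --write-info-json "https://www.youtube.com/channel/CHANNEL_ID"
# download a single video in the best available quality
youtube-dl -f best "https://www.youtube.com/watch?v=VIDEO_ID"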
3. Tartube
Tartube is an open source tool with a graphical user interface that is mainly used to download videos and data from YouTube.
Difficulty:
- Installation on macOS is complex
Requirements:
- Python 3.2 or higher
Suitable for:
- Downloading videos, YouTube channels, playlists and livestreams.
- Downloading videos in different formats
- Downloading data such as thumbnails and subtitles
- Downloading video metadata
- Automating archiving actions
Limitations:
- No possibility to download comments
4. Youtube-comment-downloader
Youtube-comment-downloader is an open source command line program that is used to download YouTube comments. An example command is sketched below the overview.
Difficulty: use of the command line
Requirements:
- Python 2.7 or higher
Suitable for:
- Downloading comments on YouTube videos
Limitations:
- None
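A minimal sketch, assuming the youtube-comment-downloader command line interface; VIDEO_ID is a placeholder for the ID of the YouTube video.
# download all comments of one video as JSON Lines (one comment per line)
youtube-comment-downloader --youtubeid VIDEO_ID --output VIDEO_ID_comments.json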
Step 3: Quality Control
Visual check of the web archive in ReplayWeb.page
ReplayWeb.page is a very simple open source tool that allows you to view web archives in your browser without having to install software. It is the successor to Webrecorder Player. You can use it to open WARC files that are stored locally on your computer, on Google Drive, on Amazon S3 or on a web server (via HTTP or HTTPS). Unless the WARC file is located locally on your computer, also read the documentation on sharing WARC files that are loaded in ReplayWeb.page. In the example below we explain how to open a local WARC file; for opening an online WARC file (via S3, Google Drive or a web server) we refer to the documentation.
- Go to https://replayweb.page and select the WARC file you want to open.

- Then click Load.

- The WARC file will now be loaded.

- You can choose which page you want to open via a list of URLs.

- And then view the archived page in the browser.

Visual check of the web archive in pywb with Chrome extensions
Thoroughly visually checking a web archive takes a lot of time. There are Chrome extensions that can automate some actions, such as opening several URLs at once, switching tabs and checking if links work on a webpage.
These extensions do not work with web archives opened with ReplayWeb.page, which is why we use pywb as a tool to open web archives. Pywb stands for Python Wayback and was chosen by the IIPC (International Internet Preservation Consortium) at the end of 2020 as the best software for playing back web archives. Pywb displays web archives directly in the browser.
Pywb is controlled via the command line.
Step 1: install the software
- make sure Python is installed on your computer
- open a terminal window
- use the command `pip install pywb` to install pywb
- pywb is now installed
Step 2: create a collection
In this example we create a web collection with the name my-archive. You can choose your own name for the web archive.
- open a terminal window
- use the command `wb-manager init my-archive` to create the my-archive collection
- add your WARC file to the collection via the command `wb-manager add my-archive path/to/warc.gz`. Replace `path/to/warc.gz` with the path of your WARC file. For example, if the WARC file archief.warc.gz is on the Desktop, then the command is:
- for Windows: `wb-manager add my-archive c:\Users\(username)\Desktop\archief.warc.gz` (replace (username) with your username)
- for macOS: `wb-manager add my-archive ~/Desktop/archief.warc.gz`
Step 3: open your web archive
Continue in the terminal window from the previous step.
- enter the command `wayback` in the terminal to start pywb
- open a window in Google Chrome and navigate to http://localhost:8080/my-archive/url, where `my-archive` is the name of your collection (see step 2) and `url` is the URL of the webpage you archived
- if everything went well, you will now see an archived version of the web page

Step 4: install Chrome extensions
There are several extensions in Chrome that can help speed up quality control. This list will be further supplemented during the project Best practices for archiving social media in Flanders and Brussels.
- Open Multiple URLs: opens several URLs simultaneously. This extension is useful if you have used the archiving method with snscrape and browsertrix: you can use it to check whether the URLs in the list that you received via snscrape are present in the WARC file. However, you must first adjust the URLs: http://localhost:8080/collection/ must be placed before each URL, where collection corresponds to the name of your collection (e.g. my-archive); see the example command after this list.
- Revolver - Tabs: this extension automatically switches between tabs in Chrome.
- Check My Links: checks whether the links on a page work.
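Adjusting the URLs can be done in one go on the command line; a sketch, assuming the list of URLs from snscrape is in urls.txt and your collection is called my-archive (see step 2):
# place http://localhost:8080/my-archive/ before every URL in urls.txt
sed 's|^|http://localhost:8080/my-archive/|' urls.txt > urls-for-pywb.txt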

Quality control for data scraping
Open the JSON file in a text editor such as TextEdit, Notepad or Notepad++ and look at the date of the first line (most recent post or pinned post) and the date of the last line (oldest post). If the oldest message in the JSON file corresponds to the oldest post on the social media platform, then there is a good chance that all posts have been included.
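The same check can be done on the command line if you have the tool jq installed; a sketch, assuming a JSON Lines file in which each line contains a date field, as in snscrape's output (the file name is only an example):
# show the date of the first line (most recent or pinned post)
head -n 1 meemoo_tweets.jsonl | jq -r '.date'
# show the date of the last line (oldest post)
tail -n 1 meemoo_tweets.jsonl | jq -r '.date'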
Do you find a JSON Lines file difficult to read? Convert it to CSV or Excel format using this online tool: https://json-csv.com/