
Workflows for capturing individual social media accounts

This is the first deliverable (D1) of the project Best practices for social media archiving in Flanders and Brussels by KADOC and meemoo, with the support of the Flemish Government. It is the result of WP1: preparation for the capture of individual accounts, in which different tools and APIs for capturing social media accounts were tested.

In this document you will find a number of workflows for archiving social media. Note! The code of social media platforms changes regularly. It is therefore possible that at some point certain workflows will no longer work.

Authors

Nastasia Vanderperren and Lode Scheers with contributions from Rony Vissers (meemoo)

Step 1: Create an account

It is recommended to create an institutional account that you will only use to archive social media. This prevents personal information, for example notifications, your name and profile picture, and names of friends who follow the account, from ending up in the web archive.

Personal information in the web archive

Also disable two-factor authentication (2FA) so that you can log in to the account with archiving tools.

Step 2: Capture the social media account

There are four strategies for archiving individual social media accounts:

  1. Self-archiving: These workflows use the export functions within the social media platforms. This is the easiest way to save all content from the account. However, shared messages and comments from third parties will not be exported and will be missing from the archive. Export functions usually result in a ZIP file with HTML files. This can only be performed by the owner of the account. The content of the ZIP file can be converted to the WARC format.
  2. Crawling with look-and-feel preservation: These workflows use external tools to archive all content from an account, including messages and comments from third parties. In many cases, however, there is missing content because the social media platform hides comments or messages that it deems irrelevant or because the page of the social media platform will not load. The web archive is stored in a WARC file.
  3. Data scraping: Extracting content from social media platforms in the form of structured data. These tools are more stable and better at extracting all textual content from a social media platform. The data is usually stored in a JSON file.
  4. Download: This strategy consists of downloading videos and images from the social media platform. Comments and captions are saved in the form of structured data or in a text file.

Strategy 2 - Crawling with look-and-feel preservation is the most error-prone strategy. It may therefore be interesting to supplement this strategy with Strategy 3 - Data scraping. It is also much easier to search in the structured text files than in a WARC file. The structured text can be used as an index for the web archive. With a WARC file, you usually have to scroll endlessly to find the correct message.

Strategy 1: self-archiving

Self-archiving can only be performed by the account owner and is explained in D2 Guides for self-archiving individual accounts. If an archival institution receives such a self-archived web archive, it is advised to use the tool warcit to convert the web archive into a WARC file. To use warcit, you need Python 3.
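
A minimal sketch of such a warcit conversion, assuming the export ZIP has been unpacked into a local folder named facebook-export (the folder name and the prefix URL are only examples):

  • `pip install warcit`
  • `warcit https://www.facebook.com/exampleaccount/ ./facebook-export/`

The prefix URL determines under which URLs the exported files are recorded in the resulting WARC file.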

  • The file can now be opened in a WARC player, such as ReplayWeb.page or pywb (see Step 3: Quality Control).

Strategy 2: crawling with look-and-feel preservation

For crawling with look-and-feel preservation, two options are available:

  1. Archiveweb.page;
  2. snscrape and Browsertrix.

These options differ in, among other things, difficulty, requirements, the social media platforms for which they are suitable, limitations and workflow.

1. ArchiveWeb.page

ArchiveWeb.page is the successor to Webrecorder. It is a Chrome extension that allows you to archive websites and social media accounts via the browser.

Difficulty

None

Requirements

  • an account for the social media platform
  • Chrome as a browser

Suitable for:

  • Facebook
  • Twitter
  • Instagram

Output format

Limitations:

  • Social media platforms (especially Facebook) determine what is relevant for your account, so not all comments are visible. If you want to archive everything, you will have to click on each post individually.
  • Facebook shields certain elements, such as opening images; this functionality is therefore missing from the web archive.

Workflow

Archiving social media accounts with ArchiveWeb.page

2. Snscrape and Browsertrix

Snscrape is software that allows you to retrieve the URL of every resource (post, video, image) on a social media platform in order to save them in a text file.

Browsertrix is a web crawler that can archive webpages via a list of URLs. The tool is mainly suitable for capturing dynamic websites or pages where you need to log in, such as social media.

Difficulty

Familiarity with the command line

Requirements

Suitable for:

  • Twitter: public account and hashtag
  • Instagram: public account and hashtag

Limitations:

  • Only suitable for public accounts
  • Facebook blocks this tool; the developers are still looking for a solution.

Output format

  • WARC

Workflow

Archiving individual social media posts with snscrape and browsertrix
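
A minimal sketch of this workflow, assuming a hypothetical public Twitter account exampleuser and a working Docker installation (keep in mind that snscrape's Twitter support no longer works, see the note under Strategy 3):

  • `snscrape twitter-user exampleuser > urls.txt` (writes one post URL per line)
  • `docker run -v $PWD:/crawls/ webrecorder/browsertrix-crawler crawl --urlFile /crawls/urls.txt --scopeType page --collection twitter-exampleuser`

The resulting WARC files end up in the collections/twitter-exampleuser/archive/ subfolder of the working directory.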

Strategy 3: data scraping

Note: snscrape and facebook-scraper no longer work, and Twarc requires paid access to or approval for the X/Twitter API, making it effectively unusable.

There are several options available for data scraping:

  1. snscrape;
  2. facebook-scraper;
  3. Twarc.

They differ in, among other things, difficulty, requirements, the social media platforms for which they are suitable, limitations and workflow.

1. Snscrape

Snscrape is open-source software for retrieving the textual content of social media in a structured text format.

Difficulty

Use of the command line

Requirements

Suitable for:

  • Twitter: public account and hashtag
  • Instagram: public account and hashtag

Limitations:

  • Only suitable for public accounts;
  • Only textual content is retrieved, no images or other media.

Output format

Workflow

Saving social media account content as structured text with snscrape
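
A minimal sketch, again with the hypothetical public account exampleuser (keep in mind that snscrape's Twitter support no longer works, see the note above):

  • `snscrape --jsonl --max-results 500 twitter-user exampleuser > exampleuser.jsonl`

The --jsonl option stores one post per line as JSON (JSON Lines); without it, snscrape only outputs the URL of each post.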

2. Facebook-scraper

Facebook-scraper is open-source software for retrieving the textual content of Facebook posts in a structured text format.

Difficulty

Use of the command line

Requirements

Suitable for:

  • Facebook: both public and private groups, pages and personal accounts

Limitations:

  • Difficult to scrape entire accounts
  • Only textual content is retrieved, no images or other media

Output format

Workflow

Saving Facebook accounts as structured text via Facebook-scraper
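
A minimal sketch using the command-line interface of facebook-scraper, with a hypothetical public page name examplepage (keep in mind that facebook-scraper no longer works, see the note above):

  • `pip install facebook-scraper`
  • `facebook-scraper --filename examplepage_posts.csv --pages 10 examplepage`

This writes the text of the scraped posts to a CSV file; the library can also be called from Python for more control.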

3. Twarc

Twarc is an open-source tool and Python library that was developed as part of the Documenting the Now project. The tool makes it possible to archive tweets and trends via the Twitter API, as well as to convert enriched Twitter data into stripped-down data for publication.

Difficulty

Use of the command line

Requirements

Suitable for:

  • Archiving tweets
  • Archiving Twitter profiles
  • Enriching and stripping Twitter data

Limitations:

  • Subject to the Twitter API rules.

Output format

Workflow

Archiving Twitter with Twarc
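
A minimal sketch with twarc2 and the hypothetical account exampleuser, assuming you have API credentials (see the note at the top of this strategy about the current X/Twitter API restrictions):

  • `pip install twarc`
  • `twarc2 configure` (enter your API bearer token)
  • `twarc2 timeline exampleuser > exampleuser.jsonl`
  • `twarc2 dehydrate exampleuser.jsonl exampleuser-ids.txt` (strips the data down to tweet IDs for publication)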

Strategy 4: Download

Several options are also available for downloading:

  1. Instaloader
  2. Youtube-dl
  3. Tartube
  4. Youtube-comment-downloader

These options differ in, among other things, difficulty, requirements, the social media platforms for which they are suitable, output format, limitations and workflow.

1. Instaloader

Instaloader is used to archive content from Instagram, such as public and private profiles, hashtags, user stories, feeds and saved media, as well as the comments, geotags and captions of each post.

Difficulty

Use of the command line

Requirements

Suitable for:

  • Downloading public and private profiles, hashtags, user stories, feeds and saved media.
  • Downloading comments, geotags and captions of each post.

Limitations:

  • No possibilities for archiving live content

Workflow

Archiving Instagram with Instaloader
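
A minimal sketch, assuming a hypothetical public profile exampleprofile and an institutional archiving account my_archive_account to log in with:

  • `pip3 install instaloader`
  • `instaloader --login=my_archive_account --comments --geotags exampleprofile`

The --comments and --geotags options also save the comments and geotags of each post; omit --login for public profiles that you want to download without logging in.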

2. Youtube-dl

Youtube-dl is an open source command line program that is mainly used to download videos and data from YouTube.

Difficulty

Use of the command line

Requirements

  • Python 3.2 or higher

Suitable for:

  • Downloading videos, YouTube channels, playlists and livestreams.
  • Downloading videos in different formats
  • Downloading data such as thumbnails, subtitles
  • Downloading video metadata
  • Suitable for automating archiving actions

Limitations:

  • No possibility to download comments

Workflow

Archiving YouTube videos with youtube-dl
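
A minimal sketch for a single channel, where the channel URL is only an example:

  • `youtube-dl --write-info-json --write-thumbnail --write-sub -o "%(upload_date)s - %(title)s.%(ext)s" https://www.youtube.com/c/examplechannel`

The --write-info-json option saves the video metadata as JSON, --write-thumbnail and --write-sub save the thumbnail and subtitles, and -o determines the file name pattern.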

3. Tartube

Tartube is an open source graphical user interface tool that is mainly used to download videos and data from YouTube.

Difficulty

  • Installation on macOS is complex

Requirements

  • Python 3.2 or higher

Suitable for:

  • Downloading videos, YouTube channels, playlists and livestreams.
  • Downloading videos in different formats
  • Downloading data such as thumbnails, subtitles
  • Downloading video metadata
  • Suitable for automating archiving actions

Limitations:

  • No possibility to download comments

Workflow

Archiving YouTube videos with Tartube

4. Youtube-comment-downloader

Youtube-comment-downloader is an open source command line program that is used to download YouTube comments.

Difficulty

Use of the command line

Requirements

  • Python 2.7 or higher

Suitable for:

  • Downloading comments on YouTube videos

Limitations:

  • None

Workflow

Archiving YouTube comments with youtube-comment-downloader
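
A minimal sketch, where VIDEO_ID stands for the ID of the YouTube video (the part after watch?v= in the URL); the exact option names may differ per version of the tool:

  • `pip install youtube-comment-downloader`
  • `youtube-comment-downloader --youtubeid VIDEO_ID --output comments.json`

The comments are saved as structured text, one JSON object per line.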

Step 3: Quality Control

Visual check of the web archive in ReplayWeb.page

ReplayWeb.page is a very simple open-source tool that allows you to view web archives in your browser without having to install software. It is the successor to Webrecorder Player. You can use it to open WARC files that are located locally on your computer, on Google Drive, on Amazon S3 or on a web server (via HTTP or HTTPS). Also read the documentation on sharing WARC files that are loaded in ReplayWeb.page, unless it is a WARC file that is located locally on your computer.

In the example below we explain how to open a local WARC file. For opening an online WARC file (via S3, Google Drive or webserver) we refer to the documentation.

  • Go to https://replayweb.page/ and select the WARC file on your computer. Then click Load.

  • The WARC file will now be loaded.

  • You can choose which page you want to open via a list of URLs.

  • And then view the archived page in the browser.

Visual check of the web archive in pywb with Chrome extensions

Thoroughly visually checking a web archive takes a lot of time. There are Chrome extensions that can automate some actions, such as opening several URLs at once, switching tabs and checking if links work on a webpage.

These extensions do not work with web archives opened with ReplayWeb.page, which is why we use pywb as the tool to open web archives. Pywb stands for Python Wayback and was chosen by the IIPC (International Internet Preservation Consortium) as the best software for playing back web archives at the end of 2020. Pywb displays web archives directly in the browser.

Pywb is controlled via the command line.

Step 1: install the software

  • make sure Python is installed on your computer
  • open a terminal window
  • use the command `pip install pywb` to install pywb
  • pywb is now installed

Step 2: create a collection

In this example we create a web collection with the name my-archive. You can choose your own name for the web archive.

  • open a terminal window
  • use the command `wb-manager init my-archive` to create the my-archive collection
  • add your WARC file to the collection via the command `wb-manager add my-archive path/to/warc.gz`. Replace `path/to/warc.gz` with the path of your WARC file. For example, if the WARC file archief.warc.gz is on the Desktop, the command is:
    • for Windows: `wb-manager add my-archive c:\Users\(username)\Desktop\archief.warc.gz` (replace (username) with your username)
    • for macOS: `wb-manager add my-archive ~/Desktop/archief.warc.gz`

Step 3: open your web archive

Continue in the terminal window from the previous step.

  • enter the command `wayback` in the terminal to start pywb
  • open a window in Google Chrome and navigate to http://localhost:8080/my-archive/url where `my-archive` is the name of your collection (see step 2) and `url` is the URL of the webpage you archived
  • if everything went well, you will now see an archived version of the web page

Step 4: install Chrome extensions

There are several extensions in Chrome that can help speed up quality control. This list will be further supplemented during the project Best practices for archiving social media in Flanders and Brussels.

  • Open Multiple URLs: can open several URLs simultaneously. This extension is useful if you have used the archiving method with snscrape and Browsertrix: you can use it to check whether the URLs in the list you obtained via snscrape are present in the WARC file. However, you must adjust the URLs first: http://localhost:8080/collection/ must be placed before each URL, where collection corresponds to the name of your collection (e.g. my-archive), as shown in the example after this list.
  • Revolver - Tabs: this extension automatically switches between tabs in Chrome.
  • Check My Links: checks if links work on a page.
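
A minimal sketch for prefixing the URLs on the command line, assuming the collection is called my-archive and the snscrape output is in urls.txt (both names are only examples):

  • `sed 's|^|http://localhost:8080/my-archive/|' urls.txt > urls-pywb.txt`

The file urls-pywb.txt can then be pasted into the Open Multiple URLs extension.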

Quality control for data scraping

Open the JSON file in Text Editor, Notepad or Notepad++ and look at the date of the first line (most recent post or pinned post) and the date of the last line (oldest post). If the oldest message in the JSON file corresponds to the oldest post on the social media platform, then there is a good chance that all posts have been included.
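
On the command line you can quickly check these two lines without opening the whole file; posts.jsonl is a hypothetical file name:

  • `head -n 1 posts.jsonl` (most recent or pinned post)
  • `tail -n 1 posts.jsonl` (oldest post)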

Do you find a JSON Lines file difficult to read? Convert it to CSV or Excel format using an online tool such as https://json-csv.com/

License

  • CC-BY-SA
