Archiving individual social media posts with snscrape and Browsertrix

As part of the project Best practices for social media archiving in Flanders and Brussels, several tools were tested to archive social media platforms. This guide describes how the tools Browsertrix and snscrape can be used to archive public accounts on Facebook, Twitter and Instagram.

Disclaimer: This guide was created in February 2021. If you notice that something isn't working, please email Nastasia.

Snscrape is software which allows the URL of every resource (post, video, image) on a social media platform to be retrieved and saved in a text file. Browsertrix is a web crawler suitable for archiving dynamic webpages requiring login, such as social media.

Requirements

Docker;
Docker-Compose;
Python 3.8 or higher: you can check if you have Python, and the version, via the command python -V;
git: you can check if you have git, via the command git --version;
no fear of the command line.
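
The checks above can also be scripted. The following is a minimal sketch that assumes the tools are available under the executable names docker, docker-compose and git; the function name check_requirements is made up for this illustration.

```python
import shutil
import sys

# Command-line tools the guide lists as requirements
# (executable names are an assumption of this sketch).
REQUIRED_TOOLS = ["docker", "docker-compose", "git"]


def check_requirements(tools=REQUIRED_TOOLS, min_python=(3, 8)):
    """Return a dict mapping each requirement to True (found) or False (missing)."""
    results = {"python": sys.version_info[:2] >= min_python}
    for tool in tools:
        # shutil.which returns None when the executable is not on the PATH.
        results[tool] = shutil.which(tool) is not None
    return results


for requirement, ok in check_requirements().items():
    print(f"{requirement}: {'found' if ok else 'MISSING'}")
```

A tool reported as MISSING may still be installed but not on the PATH of the terminal you are using.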

Advantages

works on Windows, macOS and Linux;
suitable for Twitter and Instagram;
automated;
by archiving individual posts and resources you increase the chance that all content is archived, as endless scrolling on social media platforms causes problems for crawlers (see also the guide Social media archiving with Browsertrix (NL));
searching for individual posts is easier than scrolling to the correct post (built-in search functions don't work in archived websites).

Disadvantages

only works for public accounts;
since March 2021, snscrape has been blocked by Facebook, meaning it cannot be used for this platform.

Workflow

Step 1: install snscrape

Snscrape is installed via pip, the Python package manager.

open a terminal window;
type in the window: pip3 install snscrape;
upgrade snscrape to the latest version: pip3 install --upgrade git+https://github.com/JustAnotherArchivist/snscrape@master.

Snscrape is now properly installed.

Step 2: request URLs from every post, photo and video

Use snscrape to retrieve a URL for each post that can later be used by Browsertrix to archive the content.

Open a terminal window.

Find the name of the account you want to archive. This is usually located behind the base URL of the social media platform, such as https://www.facebook.com/meemoo.be, https://twitter.com/meemoo_be and https://www.instagram.com/vlaparl. meemoo.be, meemoo_be and vlaparl are the names of the accounts respectively.
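
If you have many accounts to process, the name can be taken from the URL automatically. This is a small sketch, assuming the account name is always the first part of the path after the base URL, as in the three examples above; the function name account_name is hypothetical.

```python
from urllib.parse import urlparse


def account_name(profile_url):
    """Return the first path segment of a profile URL, i.e. the account name."""
    path = urlparse(profile_url).path
    segments = [segment for segment in path.split("/") if segment]
    if not segments:
        raise ValueError(f"No account name found in {profile_url!r}")
    return segments[0]


for url in (
    "https://www.facebook.com/meemoo.be",
    "https://twitter.com/meemoo_be",
    "https://www.instagram.com/vlaparl",
):
    print(account_name(url))
```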

Run one of the following commands for the social media platform of the account. Replace 'name' (also in name.txt) each time with the name of the account.

Facebook page: snscrape facebook-user name > name.txt
Instagram user: snscrape instagram-user name > name.txt
Twitter account: snscrape twitter-user name > name.txt

The command ensures that the data from the platform is downloaded and then saved in the file name.txt. In the following steps, we will continue to refer to this file as name.txt.
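
The same step can be run from a Python script, for example when you want to retrieve several accounts in a row. The sketch below only wraps the commands shown above; it calls snscrape as python -m snscrape (the form also used under Troubleshooting), and the function names snscrape_command and save_urls are made up for this illustration.

```python
import subprocess
import sys

# Scraper names as used in the commands of this guide.
SCRAPERS = {
    "facebook": "facebook-user",
    "instagram": "instagram-user",
    "twitter": "twitter-user",
}


def snscrape_command(platform, account):
    """Build the snscrape call for one account, run through the current Python."""
    return [sys.executable, "-m", "snscrape", SCRAPERS[platform], account]


def save_urls(platform, account, output_path=None):
    """Run snscrape and write its output (one URL per line) to a text file."""
    output_path = output_path or f"{account}.txt"
    with open(output_path, "w", encoding="utf-8") as handle:
        # check=True raises an error if snscrape exits with a failure code.
        subprocess.run(snscrape_command(platform, account), stdout=handle, check=True)
    return output_path


# Example (requires snscrape to be installed and network access):
# save_urls("twitter", "meemoo_be")
```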

Step 3: create a configuration file for Browsertrix

Crawls in Browsertrix are created via a configuration file. Adapt the configuration file for your chosen social media platform and paste the URLs from step 2 into it.

Download one of the following configuration files

Open the configuration file with a text editor, such as Notepad or Notepad++, and replace the following data:

name: the name for your crawl, e.g. 20210115_facebook_meemoo_be (note: spaces are not allowed);
coll: also change this to the name of the crawl (note: spaces are not allowed);
seed_urls: replace facebook-page, twitter-user or instagram-user with the name of the account you are going to archive, e.g. meemoo.be or vlaparl (see Step 2 to find the name of the account).
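
A crawl name without spaces can be generated rather than typed. The sketch below follows the pattern of the example 20210115_facebook_meemoo_be (date, platform, account); that every character other than letters, digits, hyphens and underscores should become an underscore is an assumption, and crawl_name is a hypothetical helper.

```python
import re
from datetime import date


def crawl_name(platform, account, day=None):
    """Build a crawl name such as 20210115_facebook_meemoo_be."""
    day = day or date.today()
    # Replace spaces, dots and other special characters with underscores.
    safe_account = re.sub(r"[^A-Za-z0-9_-]+", "_", account)
    return f"{day:%Y%m%d}_{platform}_{safe_account}"


print(crawl_name("facebook", "meemoo.be", date(2021, 1, 15)))
```

The same value can be used for both name and coll in the configuration file.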

Then download the script snscrape_to_browsertrix.py to transform the URLs in the text file name.txt, which was created in the previous step, into a form suitable for the Browsertrix configuration file.

Run the script in a terminal window with the command: python3 snscrape_to_browsertrix.py name.txt (change name.txt to the filename of the file you created in step 2).

Open name.txt and paste the contents under the list of URLs that are already under seed_urls (see image).
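
To give an idea of what such a transformation involves, here is a rough sketch. It is not the snscrape_to_browsertrix.py script itself: it assumes that seed_urls is a YAML list in which every URL is written as an indented "- URL" item, and the indentation of four spaces and the example URLs are made up for this illustration.

```python
def to_seed_lines(urls, indent=4):
    """Turn an iterable of URLs into YAML list items, skipping blanks and duplicates."""
    seen = set()
    lines = []
    for raw in urls:
        url = raw.strip()
        if not url or url in seen:
            continue
        seen.add(url)
        lines.append(f"{' ' * indent}- {url}")
    return lines


# Placeholder URLs, in the one-per-line form that name.txt contains.
example = [
    "https://twitter.com/meemoo_be/status/1\n",
    "\n",
    "https://twitter.com/meemoo_be/status/1\n",
    "https://twitter.com/meemoo_be/status/2\n",
]
print("\n".join(to_seed_lines(example)))
```

Check the indentation of the existing entries under seed_urls in your configuration file and adjust indent to match, since YAML is sensitive to indentation.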

Step 4: use Browsertrix to capture all URLs

Browsertrix will be used to crawl the URLs from name.txt.

Ensure that Docker and Docker-Compose are installed on your computer and that Docker is started before installing and using Browsertrix.

Install the software

For illustration purposes, Browsertrix is installed here on the Desktop.

Open a terminal window and navigate to the Desktop:

for Windows: cd c:\Users\(username)\Desktop (replace (username) with your username);
for macOS: cd ~/Desktop.

Enter the following command in the terminal, or download the code from the same address: git clone https://github.com/webrecorder/browsertrix. Unpack the folder if you downloaded the code.

Open a terminal window and navigate to the Browsertrix folder via the command cd path/to/browsertrix (replace path/to/browsertrix with the correct path for the Browsertrix folder). If the folder is on your Desktop, the command is:

for Windows: cd c:\Users\(username)\Desktop\browsertrix (replace (username) with your username);
for macOS: cd ~/Desktop/browsertrix.

Install the Browsertrix command line interface by entering the command python3 setup.py install in the terminal.

Then you can install the extra virtual browsers that Browsertrix uses during crawling via the command .\install-browsers.sh (Windows) or ./install-browsers.sh (macOS, Linux).

Let Docker build the Browsertrix environment with the command docker-compose build.

Start the software

Continue in the terminal window from the installation. Enter the command docker-compose up -d to start Browsertrix.

From the moment Browsertrix has been started, it can be used to archive websites.

The Browsertrix web interface can be consulted at http://localhost:8000. This can be used to monitor the progress of your crawls. The interface is still under development and has some bugs.
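
If you are not sure whether Browsertrix has finished starting, you can test whether the web interface responds. This is a minimal sketch using only the Python standard library; is_reachable is a made-up name, and the check only tells you that something answers at the address, not that Browsertrix works correctly.

```python
import urllib.error
import urllib.request


def is_reachable(url, timeout=5):
    """Return True if a web server answers at the given URL."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # The server answered, even if with an error status.
        return True
    except OSError:
        # Connection refused, timeout or another network problem.
        return False


print(is_reachable("http://localhost:8000"))
```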

Create a profile

To archive websites that are secured with a password (such as social media) you can create a profile. This also helps to avoid privacy-sensitive data being included in a WARC file. For Facebook and Instagram, it is necessary to log in to properly archive the layout.

To create a profile:

Enter the command browsertrix profile create in a terminal window.
The browser will then open. You can use this to navigate to the websites you want to archive and log in.

Logging into Facebook can sometimes cause problems. Follow these steps to successfully do so:

Go to https://www.facebook.com and log in.

You will receive a message that it has not succeeded.

Go to https://m.facebook.com and log in.

It may still happen that you receive a message that it has not succeeded. Refresh the page and log in again.

You are logged into the mobile version of Facebook. If you now return to https://www.facebook.com, you will notice that you are still logged in. The login has been successful!

It is important to clear the browser cache after this. If you don't, Browsertrix will not include content that is already cached in the archive.

Click on the three dots in the top right corner of the Chrome browser and choose Settings.

Choose Advanced.

Scroll down to find Clear browsing data.

Tick Browsing history and Cached images and files. It is important that Cookies and other site data is not ticked!

Then go back to your terminal window. Give the profile a name and press Enter.

The profile is now created. You can choose to create a new profile by entering ‘Y’ or stop via ‘N’. If you stop, you can now close the browser.

You are now ready to use this profile to archive websites. With this profile, you can now log in each time to archive one or more accounts of your choice.

Keep the terminal window open for the next step.

Start a crawl

The next step is to start the crawl.

Use the command browsertrix crawl create configuration.yaml --profile profile --watch. Replace configuration.yaml with the path or name of the configuration file that you created in Step 3: create a configuration file for Browsertrix, and profile with the name of the profile. The option --watch ensures that a browser window is opened where you can monitor the automated crawl.
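
When you start crawls regularly, it can help to build the command in a script so that a wrong file name is noticed before Browsertrix starts. The sketch below only uses the command and options shown above; crawl_command and start_crawl are hypothetical helper names.

```python
import subprocess
from pathlib import Path


def crawl_command(config_file, profile, watch=True):
    """Build the browsertrix call from this guide as a list of arguments."""
    command = ["browsertrix", "crawl", "create", str(config_file), "--profile", profile]
    if watch:
        command.append("--watch")
    return command


def start_crawl(config_file, profile, watch=True):
    """Check that the configuration file exists, then start the crawl."""
    if not Path(config_file).is_file():
        raise FileNotFoundError(f"Configuration file not found: {config_file}")
    subprocess.run(crawl_command(config_file, profile, watch), check=True)


print(" ".join(crawl_command("configuration.yaml", "profile")))
```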

You will see the browser going from post to post or scrolling through the page and opening media.

If Browsertrix freezes or doesn't capture certain parts, you can intervene yourself. With Instagram, Browsertrix sometimes forgets to open individual posts. By manually opening a post in the browser, you make Browsertrix aware of the individual posts and it starts opening them itself.

Combine the different WARC files (optional)

Browsertrix creates several WARC files during a crawl by default. To open the entire web archive in a WARC player outside of Browsertrix, you need to combine the different files into one WARC file. You can do this with the warcat tool.

Open a terminal window and install warcat with the command pip3 install warcat.
Find the WARC files in the Browsertrix folder. You will find them in the folder webarchive > collections > name of collection > archive.

Create a folder named WARC on your desktop and copy the different WARC files from Browsertrix into it.
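
Finding and copying the files can also be done with a short script. This sketch assumes the folder structure described above and that the WARC files have names ending in .warc or .warc.gz; copy_warcs is a made-up function name.

```python
import shutil
from pathlib import Path


def copy_warcs(browsertrix_dir, collection, target_dir):
    """Copy all WARC files of one collection into target_dir and return the copies."""
    archive_dir = (
        Path(browsertrix_dir) / "webarchive" / "collections" / collection / "archive"
    )
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    copied = []
    for warc in sorted(archive_dir.glob("*.warc*")):
        # copy2 keeps the original timestamps of the files.
        copied.append(Path(shutil.copy2(warc, target)))
    return copied


# Example (paths are placeholders, adapt them to your own computer):
# copy_warcs(Path.home() / "Desktop" / "browsertrix",
#            "20210115_facebook_meemoo_be",
#            Path.home() / "Desktop" / "WARC")
```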

Open a terminal window and navigate to your desktop via the terminal:

in Windows use the command: cd c:\Users\(username)\Desktop (replace (username) with your username);
on macOS this is via the command: cd ~/Desktop.

Then type in the terminal the command to combine the different WARC files into one WARC file:

on Windows: python3 -m warcat --output output.warc.gz --force-read-gzip --gzip --progress concat WARC\*;
on macOS: python3 -m warcat --output output.warc.gz --force-read-gzip --gzip --progress concat WARC/*.

Once warcat is finished, a file named output.warc.gz should appear on your desktop. You can give this file a better name and then delete the WARC folder.
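
The warcat call can be wrapped in a script that lists the files in the WARC folder itself, so that the same code works on Windows and macOS. This is a sketch using only the warcat options from the commands above; warcat_command is a hypothetical helper, and it assumes the folder contains nothing but the WARC files to combine.

```python
import subprocess
import sys
from pathlib import Path


def warcat_command(warc_dir, output="output.warc.gz"):
    """Build the warcat call that combines all files in warc_dir into one WARC file."""
    inputs = sorted(str(path) for path in Path(warc_dir).iterdir() if path.is_file())
    if not inputs:
        raise FileNotFoundError(f"No files found in {warc_dir}")
    return [
        sys.executable, "-m", "warcat",
        "--output", output,
        "--force-read-gzip", "--gzip", "--progress",
        "concat", *inputs,
    ]


# Example (requires warcat to be installed and a folder named WARC):
# subprocess.run(warcat_command("WARC"), check=True)
```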

Result

You now have a WARC file containing all individual posts. This means that you cannot endlessly scroll through the wall or timeline but must use the URL of the posts to view them separately.

The web interface of Browsertrix even has full text search which allows you to search for terms in social media posts.

Open a browser and go to http://localhost:8180.

Click on the name of your collection.

Type a word in the search bar. After pressing Enter, all posts containing this term will appear.

Troubleshooting

Snscrape gives an error when retrieving URLs from Facebook

As of March 2021, Facebook blocks snscrape (issue #208). Until the developers behind snscrape fix this, this guide cannot be used for Facebook. You can still try archiving Facebook with just Browsertrix. See the guide Archiving social media with Browsertrix (NL) for this.

SyntaxError: invalid syntax or NameError: name `snscrape` is not defined on Windows

This error means that Windows cannot find the snscrape program on your computer. You can fix this by modifying the command. Instead of snscrape [rest of command] use python -m snscrape [rest of command]

example: python -m snscrape twitter-user meemoo_be > name.txt

snscrape: the term `snscrape` is not recognized as the name of a cmdlet, script file, or operable program in Windows PowerShell or command prompt

This error means that PowerShell/Command prompt cannot find the snscrape program on your computer. You can fix this by modifying the command. Instead of snscrape [rest of command] use python -m snscrape [rest of command]

example: python -m snscrape twitter-user meemoo_be > name.txt

browsertrix: the term `browsertrix` is not recognized as the name of a cmdlet, script file, or operable program in Windows PowerShell or command prompt

This error means that PowerShell/Command prompt cannot find the Browsertrix program on your computer. You can fix this by modifying the command. Instead of browsertrix [rest of command] use python -m browsertrix [rest of command]

example: python -m browsertrix crawl create configuration.yaml --profile profile --watch
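
To find out which of the two situations applies on your computer, you can check whether a program is on the PATH of the terminal and whether Python can find the package of the same name. This is only a diagnostic sketch; diagnose is a made-up function name, and it assumes the Python package has the same name as the command.

```python
import importlib.util
import shutil


def diagnose(tool):
    """Report whether a tool is on the PATH and whether Python can find its package."""
    return {
        "on_path": shutil.which(tool) is not None,
        "importable": importlib.util.find_spec(tool) is not None,
    }


for tool in ("snscrape", "browsertrix"):
    print(tool, diagnose(tool))
```

If importable is True but on_path is False, the package is installed and calling it through python -m is worth trying. If both are False, the package is probably not installed for the Python version you are using.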

Browsertrix has frozen

A common problem with Browsertrix is that it freezes. This is often due to network issues or limitations of the platform. Don't worry. A web archive has still been created, but it will not be complete (see the section Combine the different WARC files for its location).

There are a few more things you can try:

close the tab where the crawl is running; sometimes Browsertrix opens a new tab and continues there;
stop the current crawl;
start a new crawl.

Twitter notifies me that my access is limited due to rate limits

Rate limits are restrictions imposed by a website (Twitter, in this case) on how the platform can be used. If you open too many tweets in a short period of time, you will receive an error message and pages will stop loading. This is partly intended to protect the site against cyberattacks. Unfortunately, it is not possible to configure Browsertrix to wait longer between opening different tweets, which means this issue cannot be avoided when dealing with larger accounts. However, we have noticed that the restriction is lifted after some time, and tweets can then be accessed again.

Persistent URI:

https://id.kbde.be/019680ea-e9de-7158-8439-baacb8e78bf8

Organisation

meemoo - Vlaams instituut voor het archief

Licence

CC-BY-SA

Type

guide

Medium

This page was last modified on 29 May 2026