Twitter archiving with Twarc

As part of the project Best practices for social media archiving in Flanders and Brussels, various tools were tested to archive different social media platforms. This publication describes the open-source tool twarc. Because the code of websites and of the tools changes constantly, some workflows may stop working at some point.

Contents

Preparation
- Installation
- Twarc Configuration
Usage

Disclaimer: Since Twitter was bought by Elon Musk, the API access on which this tool relies has become paid access, and the functionality of the cheapest API tier has been limited.

This means that archiving Twitter / X with Twarc may still be technically possible, but is realistically out of reach for archives and other cultural heritage organisations without significant budgets.

Twarc is an open-source tool and Python library. Developed as part of the project "Documenting The Now", the tool allows for archiving tweets and trends via the Twitter API, as well as converting enriched Twitter data to stripped data for publication.

Twarc is limited to archiving tweets that are a maximum of seven days old. It is possible to gain access to the Twitter Premium Search API by paying and extend this maximum number of days to thirty. Snscrape can, for example, go back further in time but does not use the Twitter API; therefore, the data is less comprehensive.

Preparation

Requirements:

Python 3 installed.

Basic knowledge of the terminal is required to use the tool.

Gaining access to the X API.

Go to https://developer.x.com/en/docs/x-api

Installation

pip install twarc
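After installation, it can help to confirm that the requirements above are met. The following is a minimal sketch that is not part of twarc itself; the function name check_environment is hypothetical. It uses only the Python standard library to report whether Python 3 is running and whether a twarc package and a twarc command can be found.

```python
import importlib.util
import shutil
import sys


def check_environment():
    """Report whether Python 3 and twarc appear to be available."""
    return {
        # True when the running interpreter is Python 3 or newer.
        "python3": sys.version_info.major >= 3,
        # True when a package named "twarc" can be located for import.
        "twarc_package": importlib.util.find_spec("twarc") is not None,
        # True when an executable named "twarc" is found on the PATH.
        "twarc_command": shutil.which("twarc") is not None,
    }


if __name__ == "__main__":
    for name, found in check_environment().items():
        print(f"{name}: {'found' if found else 'missing'}")
```

If twarc_command is reported as missing while twarc_package is found, the directory where pip places executables is probably not on the PATH.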

Twarc Configuration

Twarc requires two keys: the API key / Consumer key and the API key secret / Consumer secret.

Open a new terminal window.
Type twarc configure followed by ENTER.
Twarc asks for a "Consumer key" here; this is where the Twitter API key is expected. The keys can be found on the "Keys and Tokens" page under Projects.
Paste the API key into the terminal using ctrl + shift + v or cmd + v on Mac.

Do the same with the API key secret, which Twarc calls the "consumer secret", followed by ENTER.

In the next step, select option 1 and press ENTER again.

Follow the instructions: ctrl + click on the link, or copy and paste the link into a browser where you are logged in to Twitter. Use the Twitter account for which you created the keys.

Click on "Authorize App".

As a final step, enter the eight-digit PIN in the terminal, followed by ENTER.

Happy Twarcing! Twarc is now configured to start archiving.
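To confirm that the configuration was saved, the stored credentials can be inspected. The sketch below assumes that twarc configure writes an INI-style file to ~/.twarc with the fields consumer_key, consumer_secret, access_token and access_token_secret. Both the location and the field names are assumptions about twarc's behaviour and should be checked against the installed version. The function name missing_credentials is hypothetical. The code reports only which fields are missing and never prints the secrets themselves.

```python
import configparser
from pathlib import Path

# Assumed field names; verify against the file twarc actually wrote.
EXPECTED_FIELDS = (
    "consumer_key",
    "consumer_secret",
    "access_token",
    "access_token_secret",
)


def missing_credentials(config_text):
    """Return, per profile, the expected fields that are absent or empty."""
    # Interpolation is disabled so that "%" characters in keys are harmless.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(config_text)
    report = {}
    for profile in parser.sections():
        report[profile] = [
            field
            for field in EXPECTED_FIELDS
            if not parser.get(profile, field, fallback="").strip()
        ]
    return report


if __name__ == "__main__":
    path = Path.home() / ".twarc"  # assumed default location
    if path.exists():
        print(missing_credentials(path.read_text()))
    else:
        print(f"No configuration found at {path}")
```

An empty list for a profile means all four assumed fields are present.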

Usage

The Twarc tool is used via the terminal; the command always starts with twarc followed by one of the following options.

Users / users

The users option retrieves metadata from Twitter accounts based on ID or screen name. It is also possible to have twarc read a text file of IDs and retrieve multiple user metadata in one command.

Examples:

twarc users meemoo_be > meemoo_metadata.jsonl

> : "Redirect", send output from command to file.
meemoo_metadata.jsonl: Write the metadata to a file named "meemoo_metadata.jsonl".

Retrieving metadata based on user IDs.

twarc users 3147722223 > meemoo_metadata.jsonl

Retrieving metadata based on multiple user IDs.

twarc users 3147722223,1652541,759251,428434894 > meemoo_metadata.jsonl

3147722223,1652541,759251,...: IDs must be separated by a ,.

Retrieving metadata based on a text file

twarc users id.txt > metadata_gebruikers.jsonl

Saving only the ID of a specific Twitter account.

For this command, an installation of jq is required.

twarc users meemoo_be | jq '.id' > ids.txt

| : "pipe", output from command 1 (twarc) to command 2 (jq).
jq '.id': instructs jq to retrieve only the content of the "id" field from the JSON data.
ids.txt: Write the IDs to a file named "ids.txt".

Saving IDs of multiple Twitter accounts.

twarc users meemoo_be,amsabisg,felixarchief,faronet,beeldengeluid | jq '.id' > ids.txt

meemoo_be,amsabisg,felixarchief,...: Account names must be separated by a ,.
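If jq is not available, the same extraction can be approximated in Python. This is a sketch under the assumption that each line of twarc's output is one JSON object with an "id" field, as the jq examples above imply. The helper name extract_ids and the sample records are hypothetical.

```python
import json


def extract_ids(lines, field="id"):
    """Return the value of `field` from every non-empty JSON line."""
    ids = []
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)
        if field in record:
            ids.append(record[field])
    return ids


if __name__ == "__main__":
    # Hypothetical sample records, not real account data.
    sample = [
        '{"id": 111, "screen_name": "example_one"}',
        '{"id": 222, "screen_name": "example_two"}',
    ]
    for value in extract_ids(sample):
        print(value)
```

To process a saved file, an open file object can be passed instead of the sample list, for example extract_ids(open("meemoo_metadata.jsonl", encoding="utf-8")).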

Search / search

The search function uses the Twitter API to search for existing tweets, up to a maximum of seven days back, based on the entered search term. The search term can be a word or hashtag; searches are not case-sensitive. When the search term consists of a sentence or a combination of words separated by spaces, enclose the search term in quotation marks.

Examples:
Searching for tweets around the term "erfgoed" (Dutch for "heritage"):

twarc search erfgoed > erfgoedtweeets.jsonl

twarc: Start twarc
search: The option to search for the specified terms in tweets from a maximum of seven days back.
erfgoed: The search term that Twarc should use
> : Redirect operator, output of the twarc command will be written to a file.
erfgoedtweeets.jsonl : Specify the location and filename.jsonl.
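Because search only reaches back about seven days, it can be useful to check which days a result file actually covers. The sketch below counts tweets per calendar day. It assumes that each line is a tweet object with a created_at value in the layout used by Twitter API v1.1, for example "Wed Oct 10 20:19:24 +0000 2018". The function name tweets_per_day is hypothetical, and parsing the English day and month names assumes an English or default locale.

```python
import json
from collections import Counter
from datetime import datetime

# Assumed timestamp layout of the "created_at" field.
CREATED_AT_FORMAT = "%a %b %d %H:%M:%S %z %Y"


def tweets_per_day(lines):
    """Count tweets per calendar day (as given in the timestamp's own offset)."""
    counts = Counter()
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        tweet = json.loads(line)
        created = datetime.strptime(tweet["created_at"], CREATED_AT_FORMAT)
        counts[created.date().isoformat()] += 1
    # Sort by date so the covered period is easy to read.
    return dict(sorted(counts.items()))
```

Called as tweets_per_day(open("erfgoedtweeets.jsonl", encoding="utf-8")), this returns a dictionary that maps each date to the number of tweets collected for it.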

Searching for hashtags and filtering by language:

twarc search '#PID OR #PURL' --lang=nl > /twarc/search/pidpurltweetsnaarmeemo.jsonl

search: The option to search for the specified terms in tweets from a maximum of seven days back.
'#PID OR #PURL': Search for the hashtags "PID" or "PURL". Searching for both terms in one tweet can be done with the AND operator.
--lang=nl: Twitter attempts to determine the language of each tweet, so the search can be limited to tweets in a specific language. Specify the language according to the ISO 639-1 standard.
pidpurltweetsnaarmeemo.jsonl: Specify the location and filename.jsonl.
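When a query combines many terms, writing it by hand becomes error-prone. As a minimal sketch, the hypothetical helper below joins terms with OR and wraps any term that contains a space in quotation marks, following the quoting rule described earlier in this section. It only builds the query text and does not call twarc.

```python
def build_or_query(terms):
    """Join search terms with OR, quoting terms that contain spaces."""
    parts = []
    for term in terms:
        term = term.strip()
        if not term:
            continue  # ignore empty entries
        parts.append(f'"{term}"' if " " in term else term)
    return " OR ".join(parts)


if __name__ == "__main__":
    print(build_or_query(["#PID", "#PURL"]))
    print(build_or_query(["art nouveau", "art deco"]))
```

On the command line the resulting text still has to be enclosed in single quotes so that the shell passes it to twarc as one argument.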

Searching for Tweets based on location using coordinates:

Use a website such as mijncoordinaten.nl to find coordinates of a place.

twarc search '("art nouveau" OR "art deco")' --geocode 50.8465573,4.351697,60km > /home/lode/twarc/artnord.jsonl

search: The option to search for the specified terms in tweets from a maximum of seven days back.
'("art nouveau" OR "art deco")': Searches for either "art nouveau" or "art deco". Note that when the two phrases are not quoted correctly, the results will contain any one of the four words.
--geocode 50.8465573,4.351697,60km: --geocode is followed by the latitude and longitude of a place and then the radius around that point, in this case 60km.
artnord.jsonl: Specify the location and filename.jsonl.
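To get a feel for the area a --geocode value covers, the distance between its centre and another place can be computed with the haversine formula. The sketch below is only an illustration of the geometry. It does not reproduce how Twitter decides which tweets belong to a location, it handles radii in km only, and the function names are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an approximation


def distance_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points (haversine formula)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def within_geocode(geocode, lat, lon):
    """Check a point against a "latitude,longitude,radius" value in km."""
    center_lat, center_lon, radius = geocode.split(",")
    if not radius.endswith("km"):
        raise ValueError("this sketch only handles radii given in km")
    limit = float(radius[:-2])
    return distance_km(float(center_lat), float(center_lon), lat, lon) <= limit


if __name__ == "__main__":
    geocode = "50.8465573,4.351697,60km"
    print(within_geocode(geocode, 51.2194, 4.4025))  # approximate coordinates of Antwerp
    print(within_geocode(geocode, 50.6326, 5.5797))  # approximate coordinates of Liège
```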

Filter

With the filter option, Twarc will immediately start collecting tweets as they are published. This option will not collect tweets from before the moment the command was started. Twarc will continue to retrieve tweets as long as the process is running. Unlike search, the filter option is not limited to one week: the filter process can keep running for as long as necessary.

Examples:

Collect all future tweets around the topic "public domain" from the moment the command starts.

twarc filter 'public domain' > publicdomain_tweets.jsonl

filter: Start twarc with the filter option.
'public domain': Search term, multiple words should be enclosed in '.
publicdomain_tweets.jsonl: Specify the location and filename.jsonl.

Retrieve tweets based on multiple terms.

twarc filter parlement,brexit > /twarc/filter/poli-tweets.json

parlement,brexit: Multiple terms must be separated by a ,.
/twarc/filter/poli-tweets.json: Specify the location and filename.
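To illustrate how comma-separated terms behave, the sketch below mimics the matching locally. It assumes the behaviour documented for the track parameter of the Twitter streaming API, where a comma acts as OR and a space within a term acts as AND, ignoring case. This is a simplified approximation and not the logic Twitter actually runs. The function name matches_track is hypothetical.

```python
import re


def matches_track(track, text):
    """Approximate track matching: commas mean OR, spaces mean AND."""
    words = set(re.findall(r"\w+", text.lower()))
    for phrase in track.split(","):
        terms = re.findall(r"\w+", phrase.lower())
        # A phrase matches when every one of its words occurs in the text.
        if terms and all(term in words for term in terms):
            return True
    return False


if __name__ == "__main__":
    print(matches_track("parlement,brexit", "Brexit vote today"))
    print(matches_track("public domain", "public transport"))
```

Under this assumption, 'public domain' matches tweets that contain both words in any order, while parlement,brexit matches tweets that contain either word.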

Retrieve tweets based on multiple terms and filter by language

The language is identified using ISO 639-1 codes.

twarc filter parlement,brexit --lang fr > /twarc/filter/fr/poli-tweets.json

parlement,brexit: Multiple terms must be separated by a ,.
--lang: Option to filter by ISO 639-1 language code.

fr: ISO 639-1 code for French.

Collect tweets from a specific account

Use the --follow option to collect tweets from a specific Twitter account as they are published. This includes retweets.

twarc filter --follow idvantwitteraccount > tweets.jsonl

Go to the Users section for more information on how user IDs can be retrieved with twarc.

Collect all tweets from multiple accounts
twarc filter --follow idvantwitteraccount1,idvantwitteraccount2,idvantwitteraccount3 > tweets.jsonl

--follow: Option to collect tweets from the given account IDs as they are published.
idvantwitteraccount1,idvantwitteraccount2,idvantwitteraccount3: IDs must be separated by a ,.
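When the account IDs come from a file such as the ids.txt created in the Users section, the comma-separated value for --follow can be assembled in Python. This is a minimal sketch with a hypothetical helper name. It only builds the argument list and does not start twarc.

```python
def build_follow_args(user_ids):
    """Return the argument list for `twarc filter --follow` with the given IDs."""
    # Convert to text and drop surrounding whitespace such as newlines.
    cleaned = [str(uid).strip() for uid in user_ids]
    cleaned = [uid for uid in cleaned if uid]
    if not cleaned:
        raise ValueError("at least one user ID is required")
    return ["twarc", "filter", "--follow", ",".join(cleaned)]


if __name__ == "__main__":
    # Hypothetical IDs for illustration only.
    print(build_follow_args(["111\n", "222\n", "333\n"]))
```

The returned list could be handed to subprocess.run together with an open output file as stdout, which has the same effect as the > redirect in the terminal.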

Retrieve all future tweets based on location

Use boundingbox.klokantech.com to create a bounding box around a specific location and select the "CSV" option. In this example, a box is drawn around Belgium. When the coordinates start with a -, it must be escaped with a \, for example: twarc filter --locations "\-51,86,-53,62" > tweets.jsonl

twarc filter --locations "2.3897,49.432,6.4758,51.5589" > BE-tweets.json

This option can also be combined with the --lang option:

twarc filter --locations "2.3897,49.432,6.4758,51.5589" --lang fr > BE-tweets.json
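Before starting a long-running collection, it can be worth checking that the bounding box really contains the intended place. The sketch below assumes the four values are ordered as west longitude, south latitude, east longitude, north latitude, which is how the example around Belgium reads. The function names are hypothetical.

```python
def parse_bounding_box(locations):
    """Split "west,south,east,north" into four floats and sanity-check them."""
    west, south, east, north = (float(value) for value in locations.split(","))
    if west > east or south > north:
        raise ValueError("expected the order west,south,east,north")
    return west, south, east, north


def in_bounding_box(locations, lon, lat):
    """Return True when the point (lon, lat) lies inside the box."""
    west, south, east, north = parse_bounding_box(locations)
    return west <= lon <= east and south <= lat <= north


if __name__ == "__main__":
    belgium = "2.3897,49.432,6.4758,51.5589"
    # Coordinates of Brussels as used in the geocode example.
    print(in_bounding_box(belgium, 4.351697, 50.8465573))
```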

Enriching and stripping Twitter data

Enriching or stripping, hydrating or dehydrating data with Twarc.

The data obtained via Twarc and the Twitter API is initially very extensive. However, it is important to know that archived tweets may not be published without permission. According to the terms of service of the Twitter API, only the unique tweet IDs are allowed to be published. The retrieved Twitter data can be reduced (dehydrated) with Twarc to only the unique Tweet IDs. These datasets are usually published as a simple text file.

It is possible to rebuild (hydrate) the data with Twarc using these lists of IDs.

Stripping / Dehydrate:

twarc dehydrate /path/to/twitter-data.json > /path/to/twitter-ids.txt

dehydrate: Twarc option to remove data that is not allowed to be published from the Twitter dataset.
/path/to/twitter-data.json: Refers to the location of the JSONL file with the rich Twitter data.
> /path/to/twitter-ids.txt: Refers to the location where the text file containing only the unique IDs should be saved.

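What dehydration amounts to can be sketched in a few lines of Python: every tweet object is reduced to its ID. The sketch assumes that each line of the data file is one tweet object with an id_str field, as in Twitter API v1.1. It is an illustration and not twarc's own implementation, and the function name dehydrate_lines is hypothetical.

```python
import json


def dehydrate_lines(lines):
    """Yield only the tweet ID from each JSON line."""
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        yield json.loads(line)["id_str"]


if __name__ == "__main__":
    # Hypothetical sample tweets, not real data.
    sample = [
        '{"id_str": "1001", "full_text": "first example"}',
        '{"id_str": "1002", "full_text": "second example"}',
    ]
    print("\n".join(dehydrate_lines(sample)))
```
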
Enriching / hydrate:

twarc hydrate /path/to/twitter-ids.txt > hydrated-tweets.jsonl

hydrate: The Twarc option to retrieve the rest of the Twitter data based on the list of unique tweet IDs.
/path/to/twitter-ids.txt: Refers to the location of the tweet ID text file.
hydrated-tweets.jsonl: The location, or just the filename, where the enriched Twitter data should be written, for example /path/to/file.jsonl or filename.jsonl.
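Hydration can only return tweets that are still available through the API, so the rebuilt file may contain fewer tweets than the ID list. The sketch below compares the two files and lists the IDs that did not come back. It assumes one ID per line in the text file and an id_str field in each hydrated tweet. The function name missing_after_hydration is hypothetical.

```python
import json


def missing_after_hydration(id_lines, hydrated_lines):
    """Return the requested IDs that are absent from the hydrated data."""
    requested = [line.strip() for line in id_lines if line.strip()]
    returned = {
        json.loads(line)["id_str"] for line in hydrated_lines if line.strip()
    }
    # Keep the original order of the ID list.
    return [tweet_id for tweet_id in requested if tweet_id not in returned]


if __name__ == "__main__":
    # Hypothetical sample data for illustration.
    ids = ["1001\n", "1002\n", "1003\n"]
    hydrated = ['{"id_str": "1001"}', '{"id_str": "1003"}']
    print(missing_after_hydration(ids, hydrated))
```

Both arguments can be open file objects, for example the ID text file and the hydrated JSONL file from the commands above.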