Twitter archiving with Twarc
As part of the project Best practices for social media archiving in Flanders and Brussels, various tools were tested to archive different social media platforms. This publication describes the open-source tool twarc. The code of websites and of the tools changes constantly, so some workflows may stop working at some point.
Access to the X (formerly Twitter) API now largely requires a paid subscription. This means that archiving Twitter / X with Twarc might still be technically possible, but is realistically out of reach for archives and other cultural heritage organisations without a significant budget.
Twarc is an open-source tool and Python library. Developed as part of the project "Documenting the Now", it archives tweets and trends via the Twitter API and can convert enriched Twitter data to stripped data for publication.
Twarc is limited to archiving tweets that are a maximum of seven days old. It is possible to gain access to the Twitter Premium Search API by paying and extend this maximum number of days to thirty. Snscrape can, for example, go back further in time but does not use the Twitter API; therefore, the data is less comprehensive.
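The seven-day limit can be made concrete with a small date calculation. The sketch below is illustrative only and is not part of Twarc: the function name within_search_window is hypothetical, and it assumes the "created_at" timestamp format used in Twitter API v1.1 tweet objects.

```python
from datetime import datetime, timedelta, timezone

# Assumed format of the "created_at" field in Twitter API v1.1 tweet objects,
# for example "Wed Oct 10 20:19:24 +0000 2018".
CREATED_AT_FORMAT = "%a %b %d %H:%M:%S %z %Y"


def within_search_window(created_at, now, days=7):
    """Return True if the timestamp lies at most `days` days before `now`."""
    created = datetime.strptime(created_at, CREATED_AT_FORMAT)
    age = now - created
    # A negative age means the timestamp is in the future, so it is rejected.
    return timedelta(0) <= age <= timedelta(days=days)


if __name__ == "__main__":
    # A fixed reference moment keeps the example reproducible.
    now = datetime(2018, 10, 15, 12, 0, tzinfo=timezone.utc)
    print(within_search_window("Wed Oct 10 20:19:24 +0000 2018", now))
    print(within_search_window("Mon Oct 01 08:00:00 +0000 2018", now))
```

Passing days=30 gives the same check for the thirty-day window mentioned above.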
Preparation
Requirements:
- Python 3 installed.
- Basic knowledge of the terminal is required to use the tool.
  - Command line tutorial for beginners, Linux, Mac, Windows (English)
  - Command line tutorial for beginners, Linux (Dutch)
- Access to the X API.

Installation
- pip install twarc
Twarc Configuration
Twarc requires two keys: the API key / Consumer key and the API key secret / Consumer secret.
- Open a new terminal window.
- Type twarc configure followed by ENTER.
- Twarc refers to a "Consumer key" here, which is where the Twitter API key is expected. Copy the keys, which can be found on the "Keys and Tokens" page under Projects.
- Paste the API key into the terminal using ctrl + shift + v, or cmd + v on Mac.

- Do the same with the API key secret, which Twarc calls the "consumer secret", followed by ENTER.

- In the next step, select option 1 and press ENTER again.

- Follow the instructions: ctrl + click on the link, or copy and paste the link into a browser where you are logged in to Twitter. Use the Twitter account for which you created the keys.

- Click on "Authorize App".

- As a final step, enter the eight-digit PIN in the terminal, followed by ENTER.


- Happy Twarcing! Twarc is now configured to start archiving.
Usage
The Twarc tool is used via the terminal; the command always starts with twarc followed by one of the following options.
Users / users
The users option retrieves metadata from Twitter accounts based on ID or screen name. It is also possible to have twarc read a text file of IDs and retrieve multiple user metadata in one command.
twarc users meemoo_be > meemoo_metadata.jsonl
- >: "Redirect", sends the output of the command to a file.
- meemoo_metadata.jsonl: Writes the metadata to a file named "meemoo_metadata.jsonl".
twarc users 3147722223 > meemoo_metadata.jsonl
Retrieving metadata based on multiple user IDs.
twarc users 3147722223,1652541,759251,428434894 > meemoo_metadata.jsonl
- twarc users 3147722223,1652541,759251,...: IDs must be separated by a comma (,).
twarc users id.txt > metadata_gebruikers.jsonl
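The .jsonl files written by the users commands above hold one JSON object per line. As a rough sketch of how such a file could be inspected afterwards, the Python below lists a few fields per account. The function name summarise_users is hypothetical, the sample records are made up, and the field names "id_str", "screen_name" and "followers_count" are assumed to follow the Twitter API v1.1 user object.

```python
import json


def summarise_users(lines):
    """Yield (id_str, screen_name, followers_count) for each JSON line."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        user = json.loads(line)
        yield user["id_str"], user["screen_name"], user.get("followers_count")


if __name__ == "__main__":
    # Two made-up records standing in for the lines of a .jsonl file.
    sample = [
        '{"id_str": "1", "screen_name": "example_a", "followers_count": 10}',
        '{"id_str": "2", "screen_name": "example_b", "followers_count": 25}',
    ]
    for row in summarise_users(sample):
        print(*row, sep="\t")
```

To run it on real output, an open file can be passed instead of the sample list, for example summarise_users(open("meemoo_metadata.jsonl", encoding="utf-8")).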
Saving only the ID of a specific Twitter account.
For this command, an installation of jq is required.
twarc users meemoo_be | jq '.id' > ids.txt
- |: "Pipe", passes the output of command 1 (twarc) to command 2 (jq).
- jq '.id': Instructs jq to retrieve only the content of the "id" field from the JSON data.
- ids.txt: Writes the IDs to a file named "ids.txt".
twarc users meemoo_be,amsabisg,felixarchief,faronet,beeldengeluid | jq '.id' > ids.txt
- meemoo_be,amsabisg,felixarchief,...: Account names must be separated by a comma (,).
Search / search
The search function uses the Twitter API to search for existing tweets, up to a maximum of seven days back, based on the entered search term. The search term can be a word or hashtag; searches are not case-sensitive. When the search term consists of a sentence or a combination of words separated by spaces, enclose the search term in quotation marks.
Searching for tweets containing the term "erfgoed" (Dutch for "heritage"):
twarc search erfgoed > erfgoedtweeets.jsonl
- twarc: Start twarc.
- search: The option to search for the specified terms in tweets from a maximum of seven days back.
- erfgoed: The search term that Twarc should use.
- >: Redirect operator; the output of the twarc command is written to a file.
- erfgoedtweeets.jsonl: The location and name of the output file, ending in .jsonl.
twarc search '#PID OR #PURL' --lang=nl > /twarc/search/pidpurltweetsnaarmeemo.jsonl
- search: The option to search for the specified terms in tweets from a maximum of seven days back.
- '#PID OR #PURL': Search for tweets with the hashtag "PID" or "PURL". To find tweets that contain both terms, use the AND operator instead.
- --lang=nl: Limits the search to tweets in a specific language, given as an ISO 639-1 code. Twitter attempts to determine the language of each tweet itself.
- pidpurltweetsnaarmeemo.jsonl: The location and name of the output file, ending in .jsonl.
Use a website such as mijncoordinaten.nl to find coordinates of a place.
twarc search '("art nouveau" OR "art deco")' --geocode 50.8465573,4.351697,60km > /home/lode/twarc/artnord.jsonl
- search: The option to search for the specified terms in tweets from a maximum of seven days back.
- '("art nouveau" OR "art deco")': Searches for either "art nouveau" or "art deco". If the two phrases are not quoted and grouped correctly, the results will match any one of the four individual words.
- --geocode 50.8465573,4.351697,60km: --geocode followed by the latitude and longitude of a place, followed by the radius around that point, in this case 60km.
- artnord.jsonl: The location and name of the output file, ending in .jsonl.
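The quoting and the geocode format above are easy to get wrong when typed by hand. The following sketch builds both strings in Python. The helper names build_or_query and build_geocode are hypothetical and not part of Twarc; they only reproduce the pattern of the command shown above.

```python
def build_or_query(terms):
    """Join search terms with OR; multi-word terms are quoted as phrases."""
    quoted = [f'"{term}"' if " " in term else term for term in terms]
    return "(" + " OR ".join(quoted) + ")"


def build_geocode(lat, lon, radius_km):
    """Format latitude, longitude and radius as 'lat,lon,<radius>km'."""
    if not -90 <= lat <= 90 or not -180 <= lon <= 180:
        raise ValueError("latitude or longitude out of range")
    return f"{lat},{lon},{radius_km}km"


# The values below are the ones used in the example command above.
print(build_or_query(["art nouveau", "art deco"]))
print(build_geocode(50.8465573, 4.351697, 60))
```

On the command line, the query still has to be wrapped in single quotes so that the shell passes it to twarc as one argument.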
Filter
With the filter option, Twarc immediately starts collecting tweets as they are published. It does not collect tweets from before the moment the command was started, and it keeps retrieving tweets for as long as the process is running. Unlike search, filter is not limited to one week: the process can run for as long as necessary.
twarc filter 'public domain' > publicdomain_tweets.jsonl
- filter: Start twarc with the filter option.
- 'public domain': The search term; multiple words should be enclosed in single quotes (').
- publicdomain_tweets.jsonl: The location and name of the output file, ending in .jsonl.
twarc filter parlement,brexit > /twarc/filter/poli-tweets.json
- parlement,brexit: Multiple terms must be separated by a comma (,).
- poli-tweets.json: The location and name of the output file.
The language is specified using ISO 639-1 codes.
twarc filter parlement,brexit --lang fr > /twarc/filter/fr/poli-tweets.json
- parlement,brexit: Multiple terms must be separated by a comma (,).
- --lang: Option to filter by ISO 639-1 language code.
- fr: ISO 639-1 code for French.
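After a collection has run for a while, it can be useful to check which languages actually ended up in the file. The sketch below counts tweets per language code. It assumes each collected tweet carries a "lang" field, as in Twitter API v1.1 tweet objects, and uses "und" as a stand-in when the field is missing; the function name count_languages and the sample records are made up.

```python
import json
from collections import Counter


def count_languages(jsonl_lines):
    """Count tweets per value of the "lang" field, one JSON object per line."""
    counts = Counter()
    for line in jsonl_lines:
        if not line.strip():
            continue  # skip blank lines
        tweet = json.loads(line)
        # "und" (undetermined) is used here when no language is recorded.
        counts[tweet.get("lang", "und")] += 1
    return counts


if __name__ == "__main__":
    # Made-up records standing in for the lines of a collected file.
    sample = [
        '{"id_str": "1", "lang": "fr"}',
        '{"id_str": "2", "lang": "nl"}',
        '{"id_str": "3", "lang": "fr"}',
    ]
    for lang, total in count_languages(sample).most_common():
        print(lang, total)
```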
Collect tweets from a specific account
Use the --follow option to collect tweets from a specific Twitter account as they are published. This includes retweets.
twarc filter --follow idvantwitteraccount > tweets.jsonl
- idvantwitteraccount: Placeholder for the user ID of the Twitter account. See the Users section for how user IDs can be retrieved with twarc.
twarc filter --follow idvantwitteraccount1,idvantwitteraccount2,idvantwitteraccount3 > tweets.jsonl
- --follow: Option to collect tweets from the specified accounts.
- idvantwitteraccount1,idvantwitteraccount2,idvantwitteraccount3: IDs must be separated by a comma (,).
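When the IDs are stored one per line in a text file, such as the ids.txt file from the Users section, they first have to be joined with commas before they can be passed to --follow. A minimal sketch, with the hypothetical helper name follow_argument:

```python
def follow_argument(id_lines):
    """Turn the lines of an ID file into a comma-separated value for --follow."""
    # Strip whitespace and any surrounding double quotes left by other tools.
    ids = [line.strip().strip('"') for line in id_lines if line.strip()]
    invalid = [value for value in ids if not value.isdigit()]
    if invalid:
        raise ValueError(f"not numeric user IDs: {invalid}")
    return ",".join(ids)


if __name__ == "__main__":
    # Lines as they might appear in ids.txt, including a trailing blank line.
    lines = ["3147722223\n", "1652541\n", "\n"]
    print(follow_argument(lines))
```

The check for digits is a design choice of this sketch: it catches screen names that were pasted into the file by mistake, since --follow is described above as taking IDs.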
Use boundingbox.klokantech.com to create a bounding box around a specific location and select the "CSV" option. When the coordinates start with a -, this must be escaped with a \, for example:
twarc filter --locations "\-51,86,-53,62" > tweets.jsonl
In the example that follows, a frame is selected around Belgium.

twarc filter --locations "2.3897,49.432,6.4758,51.5589" > BE-tweets.json
This option can also be combined with the --lang option:
twarc filter --locations "2.3897,49.432,6.4758,51.5589" --lang fr > BE-tweets.json
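The order of the four numbers in a bounding box is a common source of mistakes. The sketch below parses the Belgian bounding box used above and tests whether a point falls inside it. It assumes the values are ordered west, south, east, north (longitude and latitude of the south-west corner, then of the north-east corner); the function names parse_bbox and contains are hypothetical.

```python
def parse_bbox(text):
    """Parse 'west,south,east,north' into a tuple of four floats."""
    west, south, east, north = (float(part) for part in text.split(","))
    if west > east or south > north:
        raise ValueError("expected the order west,south,east,north")
    return west, south, east, north


def contains(bbox, lon, lat):
    """Return True if the point (lon, lat) lies inside the bounding box."""
    west, south, east, north = bbox
    return west <= lon <= east and south <= lat <= north


if __name__ == "__main__":
    belgium = parse_bbox("2.3897,49.432,6.4758,51.5589")
    print(contains(belgium, 4.351697, 50.8465573))  # central Brussels
    print(contains(belgium, 2.3522, 48.8566))  # central Paris
```

Note that the point is given as longitude first, then latitude, which is the reverse of the --geocode option of the search command.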
Enriching and stripping Twitter data
Enriching or stripping, hydrating or dehydrating data with Twarc.
The data obtained via Twarc and the Twitter API is initially very extensive. However, it is important to know that archived tweets may not be published without permission. According to the terms of service of the Twitter API, only the unique tweet IDs are allowed to be published. The retrieved Twitter data can be reduced (dehydrated) with Twarc to only the unique Tweet IDs. These datasets are usually published as a simple text file.
It is possible to rebuild (hydrate) the data with Twarc using these lists of IDs.
Stripping / Dehydrate:
twarc dehydrate /path/to/twitter-data.json > /path/to/twitter-ids.txt
- dehydrate: Twarc option to remove data that is not allowed to be published from the Twitter dataset.
- /path/to/twitter-data.json: Refers to the location of the jsonl file with the rich Twitter data.
- > /path/to/twitter-ids.txt: Refers to the location where the .txt file containing only the unique IDs should be saved.
Enriching / Hydrate:
twarc hydrate /path/to/twitter-ids.txt > hydrated-tweets.jsonl
- hydrate: Twarc option to retrieve the rest of the Twitter data based on the list of unique tweet IDs.
- /path/to/twitter-ids.txt: Refers to the location of the tweet ID text file.
- hydrated-tweets.jsonl: A location or just a filename where the enriched Twitter data should be rebuilt, for example /path/to/file.jsonl or filename.jsonl.
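To illustrate what stripping a dataset amounts to, the sketch below reduces lines of tweet JSON to their IDs. It is not Twarc's own implementation: the function name dehydrate_lines is hypothetical, the sample records are made up, and the "id_str" field is assumed to follow the Twitter API v1.1 tweet object.

```python
import json


def dehydrate_lines(jsonl_lines):
    """Yield only the tweet ID from each line of tweet JSON."""
    for line in jsonl_lines:
        if not line.strip():
            continue  # skip blank lines
        # "id_str" holds the ID as text, so no numeric precision is lost.
        yield json.loads(line)["id_str"]


if __name__ == "__main__":
    # Made-up records standing in for the lines of a rich Twitter dataset.
    sample = [
        '{"id_str": "1001", "full_text": "first example"}',
        '{"id_str": "1002", "full_text": "second example"}',
    ]
    for tweet_id in dehydrate_lines(sample):
        print(tweet_id)
```

The resulting list, one ID per line, has the same shape as the text file that the hydrate command takes as input.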
Persistent URI: https://id.kbde.be/0196816f-ec4e-72c4-add4-f1b46305d304
Licence: CC-BY-SA
This page was last modified on 29 May 2026