Intro

Recently I had a task to take higher education organizations (e.g., universities and research institutes) and obtain their name, country, city, address and geographical coordinates so that they could be compared with Web of Science information. It was similar to what I did in Key Actors in Higher Education Research and Science Studies (HERSS), a Flexdashboard developed in R. To tackle the task, I downloaded and used a full data dump from Wikidata (see here for info on their database downloads and here for a description of their data model/structure).

I searched a lot online to find best practices for handling this relatively large compressed file (33 GB with close to 57 million items; each line is a valid JSON object for one item with all its properties, as of the dump of 27th March 2019) and for reading and using it in reasonable time. Here I am documenting my experience as a “pay back” to the online community I am always learning from. Hopefully someone will read and benefit from this (or future me will be able to replicate the steps, if needed).


You might ask, what are the use cases of Wikidata information?


I would suggest you have a look at their statistics page to see the type of information you can get. As one example, Wikidata has more than 18 million scientific publications indexed (18,771,018, equal to 42.4% of all content), with metadata such as title, publication date, author names, source, DOI, PubMed ID and the like (see here for two example articles published in 1986 and 2014).

They have a nicely prepared and documented SPARQL service with many example queries that you can see and use. But in case you need to query a large portion of their data or do it repeatedly, you would be better off reading on. As I did, you might prefer to download and parse a full dump of their data.

I have adapted and modified a short Python script which builds a connection to the .bz2 file without decompressing it, reads it line by line and parses it to valid json objects. You then need to subset it to the information of your interest.
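
As a minimal sketch of that idea, the snippet below writes a tiny stand-in file in the same layout as the dump (an opening `[`, one entity per line with a trailing comma, a closing `]`) and streams it back with `bz2.open`. The file name, the two toy entities and the helper name `stream_entities` are made up for illustration; only the layout is meant to mirror the real dump.

```python
import bz2
import json
import os
import tempfile

def stream_entities(path):
    """Yield one parsed entity per line; the file is never decompressed to disk."""
    with bz2.open(path, mode='rt', encoding='utf-8') as f:
        for line in f:
            line = line.rstrip(',\n')
            if line in ('[', ']', ''):  # array brackets or blank lines, not entities
                continue
            yield json.loads(line)

# Build a toy file that mimics the dump layout (hypothetical content).
toy_entities = [{'id': 'Q1', 'type': 'item'}, {'id': 'Q2', 'type': 'item'}]
toy_path = os.path.join(tempfile.mkdtemp(), 'toy-dump.json.bz2')
with bz2.open(toy_path, mode='wt', encoding='utf-8') as f:
    f.write('[\n')
    f.write(',\n'.join(json.dumps(e) for e in toy_entities))
    f.write('\n]\n')

ids = [entity['id'] for entity in stream_entities(toy_path)]
print(ids)
```

Because `stream_entities` is a generator, only one line is held in memory at a time, which is what makes the approach workable on a file of this size.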


Main requirements

  1. You need to download a full dump from here: https://dumps.wikimedia.org/wikidatawiki/entities/. The following guide uses the .bz2 JSON version, for which the latest dump is named “latest-all.json.bz2”.
  2. Following these steps doesn’t require you to know how to code in Python, but you need to have Python 3 installed on your machine, along with the pandas and pydash packages that the script imports
  3. You need to be familiar with the Wikidata data structure, specifically items and their properties


Steps to parse the full dump

In order to use the dump you downloaded and obtain the information you want, follow these steps:

  1. Copy the local path where you saved the full Wikidata dump (33 GB in size), e.g. \your_local_directory\wikidata\, where the file is named latest-all.json.bz2
  2. The script below builds a connection to the .bz2 file without decompressing it. It reads it line by line and extracts the requested information (based on the property names discussed above)
  3. Open my sample Python script (copied below) in an editor of your choice (if you code in Python, you don’t need the next steps; modify it the way you want and export your intended data). It is a script I have adapted and modified with others’ help (thanks to Roland, Arno and Otmane) from here
  4. Replace the property names (P followed by a number) with the ones you are interested in
    • In the line starting with “if pydash.has(record, 'claims.P625'):” I am specifying that if the item currently being read doesn’t have property P625 (geographical coordinates), it is not processed and the script skips to the next item
    • In the Wikidata structure of items and claims (which is where properties are stored), my property of interest sits at a nested path like “claims.P625”, and it can hold more than one value per item, saved as a list. So I am passing latitude = pydash.get(record, 'claims.P625[0].mainsnak.datavalue.value.latitude') to obtain only the first element (designated by [0])
    • For the main item information like English label and English description, I am passing english_label = pydash.get(record, 'labels.en.value'), which takes only the en label. If you are interested in labels in other languages, replace en with the two-letter language code used in Wikidata, e.g. de, es or it.
    • See here for an example of what the underlying data (one JSON object per line) looks like: https://www.mediawiki.org/wiki/Wikibase/DataModel/JSON#Example
    • You will need to modify the line df_record_all = pd.DataFrame(columns=['id', 'type', 'english_label', 'longitude', 'latitude', 'english_desc']) which is building an empty table to save the data. You need to provide/modify column names based on the data table you intend to build/gather
  5. To run the script, you need to open a command prompt (i.e. a terminal on Mac and Linux; on Windows I would suggest using Anaconda prompt, which is installed following step 2 in main requirements, or instead cmder, which is the only command prompt GUI on Windows that I have found to work the way I expect). You will need to call Python while giving it two arguments: where the Python script is located and where the .bz2 file is accessible, i.e. python.exe H:\Documents\wikidata.py "\your_local_directory\wikidata\latest-all.json.bz2"
    • In case you are not using Anaconda prompt, you will need to change your directory to where python.exe is installed and run the above command from there (on Mac and Linux, of course, you don’t need to change to Python’s installation directory; it will suffice to call python (without .exe) followed by the script path and the dump path)
  6. Let the script run (it might take from a few hours to a few days, since there are 57 million items in the dump, depending on the number of properties you extract and how frequently they occur in items)
  7. It will export a CSV file to the “extracted” folder (\wikidata\extracted\ as hard-coded in the script; create this folder beforehand, since the script does not create it) that you can use
    • While running, the script prints the ID of the item currently being processed, and once an output file is exported, it prints CSV exported
    • It will generate a CSV for every 5000 extracted items (so as not to lose progress in case something goes wrong, and to keep output files small/manageable). When the process finishes (and in case the number of items processed was not divisible by 5000) it exports a final CSV with the remaining results, named “final_csv_till_…”, and prints the message All items finished, final CSV exported
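
To make the lookups in step 4 concrete, here is a sketch using a heavily trimmed, hand-written record (the label, description and coordinates are made-up values, and a real item carries many more fields). The helper `get_path` is a hypothetical stand-in written with the standard library only; I am assuming it behaves like `pydash.get` for simple paths of this kind, and it is not part of pydash.

```python
import re

# Hand-written, trimmed record in the shape of one dump line (values are made up).
record = {
    'id': 'Q0',
    'type': 'item',
    'labels': {'en': {'language': 'en', 'value': 'Example University'},
               'de': {'language': 'de', 'value': 'Beispiel-Hochschule'}},
    'descriptions': {'en': {'language': 'en', 'value': 'a made-up university'}},
    'claims': {
        'P625': [  # a list: one item can hold several coordinate statements
            {'mainsnak': {'datavalue': {'value': {'latitude': 48.2, 'longitude': 16.4}}}}
        ]
    },
}

def get_path(obj, path, default=None):
    """Walk a dotted path such as 'claims.P625[0].mainsnak'; return default if a step is missing."""
    # Split the path into dictionary keys and integer list indexes.
    for key, index in re.findall(r'([^.\[\]]+)|\[(\d+)\]', path):
        try:
            obj = obj[key] if key else obj[int(index)]
        except (KeyError, IndexError, TypeError):
            return default
    return obj

latitude = get_path(record, 'claims.P625[0].mainsnak.datavalue.value.latitude')
german_label = get_path(record, 'labels.de.value')    # swap 'en' for another language code
missing = get_path(record, 'claims.P17[0].mainsnak')  # property absent on this record
print(latitude, german_label, missing)
```

A missing property yields `None` rather than an exception, which is why the extraction lines in the script do not need their own error handling.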


Sample Python script

Below is my sample Python script, which you can either use following the steps described above or modify as you wish and run on the dump file. It is a script I have adapted and modified with others’ help (thanks to Roland, Arno and Otmane) from here

#!/usr/bin/env python3

"""Get Wikidata dump records as a JSON stream (one JSON object per line)"""
# Modified script taken from this link: "https://www.reddit.com/r/LanguageTechnology/comments/7wc2oi/does_anyone_know_a_good_python_library_code/dtzsh2j/"

import bz2
import json
import pandas as pd
import pydash

i = 0
# an empty dataframe which will save items information
# you need to modify the columns in this data frame to save your modified data
df_record_all = pd.DataFrame(columns=['id', 'type', 'english_label', 'longitude', 'latitude', 'english_desc'])

def wikidata(filename):
    with bz2.open(filename, mode='rt') as f:
        f.read(2)  # skip the first two characters, "[\n": the dump is one JSON array with one entity per line
        for line in f:
            try:
                yield json.loads(line.rstrip(',\n'))
            except json.decoder.JSONDecodeError:
                continue

if __name__ == '__main__':
    import argparse
    parser = argparse.ArgumentParser(
        formatter_class=argparse.ArgumentDefaultsHelpFormatter,
        description=__doc__
    )
    parser.add_argument(
        'dumpfile',
        help=(
            'a Wikidata dumpfile from: '
            'https://dumps.wikimedia.org/wikidatawiki/entities/'
            'latest-all.json.bz2'
        )
    )
    args = parser.parse_args()
    for record in wikidata(args.dumpfile):
        # only extract items with geographical coordinates (P625)
        if pydash.has(record, 'claims.P625'):        
            print('i = '+str(i)+' item '+record['id']+'  started!'+'\n')
            latitude = pydash.get(record, 'claims.P625[0].mainsnak.datavalue.value.latitude')
            longitude = pydash.get(record, 'claims.P625[0].mainsnak.datavalue.value.longitude')
            english_label = pydash.get(record, 'labels.en.value')
            item_id = pydash.get(record, 'id')
            item_type = pydash.get(record, 'type')
            english_desc = pydash.get(record, 'descriptions.en.value')
            df_record = pd.DataFrame({'id': item_id, 'type': item_type, 'english_label': english_label, 'longitude': longitude, 'latitude': latitude, 'english_desc': english_desc}, index=[i])
            # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
            df_record_all = pd.concat([df_record_all, df_record], ignore_index=True)
            i += 1
            print(i)
            if (i % 5000 == 0):
                pd.DataFrame.to_csv(df_record_all, path_or_buf='\\wikidata\\extracted\\till_'+record['id']+'_item.csv')
                print('i = '+str(i)+' item '+record['id']+'  Done!')
                print('CSV exported')
                df_record_all = pd.DataFrame(columns=['id', 'type', 'english_label', 'longitude', 'latitude', 'english_desc'])
            else:
                continue
    pd.DataFrame.to_csv(df_record_all, path_or_buf='\\wikidata\\extracted\\final_csv_till_'+record['id']+'_item.csv')
    print('i = '+str(i)+' item '+record['id']+'  Done!')
    print('All items finished, final CSV exported!')
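
The save-every-5000-items logic of the script can also be sketched without pandas, using only the standard `csv` module. This is an illustrative variant rather than the script itself: the function name `write_batches`, the output file names, the batch size of 2 in the demo and the toy rows are all assumptions made for the example.

```python
import csv
import os
import tempfile

FIELDS = ['id', 'type', 'english_label', 'longitude', 'latitude', 'english_desc']

def write_batches(rows, out_dir, batch_size=5000):
    """Write rows (dicts keyed by FIELDS) to numbered CSV files of at most batch_size rows."""
    written = []  # paths of the files created so far
    batch = []    # rows waiting to be written

    def flush():
        path = os.path.join(out_dir, 'batch_{:05d}.csv'.format(len(written) + 1))
        with open(path, 'w', newline='', encoding='utf-8') as f:
            writer = csv.DictWriter(f, fieldnames=FIELDS)
            writer.writeheader()
            writer.writerows(batch)
        written.append(path)
        batch.clear()

    for row in rows:
        batch.append(row)
        if len(batch) == batch_size:
            flush()
    if batch:  # leftover rows when the total is not a multiple of batch_size
        flush()
    return written

# Toy rows standing in for extracted items (hypothetical values).
toy_rows = [{'id': 'Q%d' % n, 'type': 'item', 'english_label': 'toy %d' % n,
             'longitude': 0.0, 'latitude': 0.0, 'english_desc': ''} for n in range(5)]
files = write_batches(toy_rows, tempfile.mkdtemp(), batch_size=2)
print(len(files))
```

Building paths with `os.path.join` keeps the sketch usable on Windows, Mac and Linux alike, and the final `if batch:` check avoids writing an empty last file when the item count happens to be an exact multiple of the batch size.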


Extensions & other software platforms, R, Ruby, Perl

In case you modified the script above to be more efficient or less prone to errors, or if you can make R, Ruby, Perl or other languages work with the above .bz2 file without needing to decompress it, in a more efficient fashion, I will be very much interested to learn about it. Please do let me know. So far my efforts to replicate the above in R have not been successful.
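
One untested idea along these lines, offered only as a sketch: check each raw line for the property ID as plain text before parsing it, so that lines that cannot contain the property skip JSON parsing altogether. The text check can produce false positives (for example when P625 appears only as a qualifier), so the parsed record is checked again. The function name `filter_lines` and the toy lines are hypothetical, and no speed-up has been measured on the real dump.

```python
import json

def filter_lines(lines, property_id='P625'):
    """Yield parsed records that carry property_id as a claim."""
    needle = '"%s"' % property_id
    for line in lines:
        if needle not in line:  # cheap text check, skips JSON parsing entirely
            continue
        try:
            record = json.loads(line.rstrip(',\n'))
        except json.JSONDecodeError:
            continue
        if property_id in record.get('claims', {}):  # confirm after parsing
            yield record

# Toy lines in the dump layout (hypothetical content).
toy_lines = [
    '[\n',
    '{"id": "Q1", "claims": {"P625": []}},\n',
    '{"id": "Q2", "claims": {"P31": []}},\n',
    '{"id": "Q3", "claims": {"P31": [{"qualifiers": {"P625": []}}]}}\n',
    ']\n',
]
kept = [record['id'] for record in filter_lines(toy_lines)]
print(kept)
```

Here Q2 is dropped by the text check alone, while Q3 passes the text check but is dropped after parsing because P625 is not one of its claims.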