Data journalists don’t always find the data they need for their stories. This can happen when the data doesn’t exist because it was never collected, or when it is collected but not available to the public.
In the absence of data, one solution is to create our own dataset. To do this, it is key to define a strong methodology and establish clear criteria to avoid mixing apples and oranges when collecting the information.
In this post, I have listed three projects that had to create a methodology to collect their own data. All of them provide the dataset they used to find the stories, which are worth a look.
Fatal Force
The Washington Post has created its own database of every fatal shooting in the United States by a police officer since January 2015.
The Post brings context to each shooting with details such as the race of the deceased, the circumstances of the shooting, whether the person was armed and whether the person was experiencing a mental-health crisis.
As a base for the methodology, the Post follows the circumstances of Michael Brown’s death in Ferguson and excludes other kinds of shootings or incidents:
“The Post is not tracking deaths of people in police custody, fatal shootings by off-duty officers or non-shooting deaths.”
Naming the Dead
This project by the Bureau of Investigative Journalism (BIJ) tries to identify by name all the people reportedly killed by CIA drone strikes in Pakistan since 2004.
Each drone strike is confirmed through multiple sources: media sources as well as international organisations such as the New America Foundation, Al Akhbar, WikiLeaks, the UN or Amnesty International.
As drone strikes take place in remote regions, reports often disagree on the number of casualties. When this happens, the BIJ records reported deaths or injuries as a range, including both the lowest number of casualties and the highest. The exception to this is where early reports of casualties have been superseded by later ones, in which case the BIJ takes the latest report of casualties.
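That range-recording rule can be sketched in a few lines of Python. This is only an illustration of the logic described above; the field names (`casualties`, `superseded`) are my own assumptions, not the BIJ’s actual schema:

```python
def casualty_range(reports):
    """Combine disagreeing casualty reports into a (lowest, highest) range.

    `reports` is a list of dicts, each with a 'casualties' count and an
    optional 'superseded' flag (hypothetical fields for illustration).
    Reports superseded by later ones are excluded before taking the range.
    """
    current = [r for r in reports if not r.get("superseded", False)]
    counts = [r["casualties"] for r in current]
    return min(counts), max(counts)

# Three reports disagree; the earliest was later superseded, so it is ignored.
reports = [
    {"source": "wire agency", "casualties": 4, "superseded": True},
    {"source": "local press", "casualties": 6},
    {"source": "NGO", "casualties": 9},
]
print(casualty_range(reports))  # (6, 9)
```

If no report is superseded, the range simply spans the lowest and highest counts across all sources.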
News reports also disagree on the names of the casualties. The BIJ collects each name, indicating the original source where the name was found, and uses standardised conventions and spellings to write the names.
A remarkable aspect of the methodology of this project is the creation of different categories and definitions. There are definitions to classify the casualties (militant, child, civilian or unknown) and also a definition of what a drone strike is.
The Migrant Files
The Migrant Files compiles the number of men, women and children who have lost their lives on their journey to Europe. This information had never been collected before, nor visualised in the way The Migrant Files did.
The project compiles data publicly available from government publications and media reports. It also includes data from United for Intercultural Action, Fortress Europe, and Puls, a project run by the University of Helsinki and commissioned by the Joint Research Center of the European Commission.
Each of these organisations collects the data in a different way and The Migrant Files team had to clean, check and analyse the data using different tools and techniques.
As the amount of information was huge, the project relied on a group of journalism students to do the fact-checking.
The database includes each victim’s name, age, gender and nationality, and every fatal incident is recorded with its date, latitude, longitude, the number of dead and/or missing, and the cause.
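A record along those lines could be modelled as follows. The field names here are my guess at the structure the post describes, not The Migrant Files’ actual column names, and many fields are optional because victims often go unidentified:

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Victim:
    # Hypothetical fields; victims frequently cannot be identified.
    name: Optional[str] = None
    age: Optional[int] = None
    gender: Optional[str] = None
    nationality: Optional[str] = None

@dataclass
class Incident:
    # One fatal incident: when and where it happened, how many died
    # or went missing, and why.
    date: str            # e.g. "2015-04-18"
    latitude: float
    longitude: float
    dead: int
    missing: int
    cause: str
    victims: List[Victim] = field(default_factory=list)

# Illustrative record (invented values, not real data from the project).
incident = Incident("2015-04-18", 34.3, 12.6, 5, 2, "drowning",
                    victims=[Victim(name=None, gender="male")])
```

Keeping incidents and victims as separate record types mirrors the distinction in the post: one incident can involve many people, only some of whom are ever named.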
Because the collection process was complex and mixed several sources, The Migrant Files team acknowledges that it cannot fully eliminate the “inherent” bias of each dataset.
The team published their first results in 2014 and kept updating the dataset for two years, until June 2016, when they decided to stop.
“This decision comes as we have outspent the 17,000€ in grants that the project received. More importantly, it comes as the goal we set ourselves has been reached”.
You can check The Migrant Files database here.