Identifying Duplicate Mass Shooting Records in the Gun Violence Archive

2025-09-22

Some time ago, I was working on a project that merges multiple mass shooting databases into a single master dataset. It’s called MSMDB, or Mass Shootings Master Database. The most comprehensive source used in this project is the Gun Violence Archive (GVA), which has catalogued over 5,700 mass shooting incidents since 2013.

Note: GVA is the most comprehensive source partly because there is no clear consensus on what defines a mass shooting. GVA defines one as an incident in which four or more people are shot. Other sources add filters to this rule or use different definitions (for example, the FBI considers it to be 4+ killed).

Incident Deduplication

Because MSMDB merges data from multiple sources, duplicate incidents are bound to occur. For that reason, I implemented deduplication logic that merges records with identical or extremely similar fields.

The deduplication process was intended to merge duplicate records of the same incident from across multiple databases. When manually reviewing flagged duplicates, I found that some incidents being marked as “duplicates” were two distinct incidents from the GVA database (with separate incident IDs).

This shouldn’t have been happening in the first place considering these were two separate incidents from the same source (GVA), which I had assumed wouldn’t contain duplicates.

Incident deduplication relied largely on comparing coordinates and dates: two records would be merged if they were within one mile and one day of each other. As a result, GVA incidents that occurred close to each other on the same date were treated as duplicates and merged.

I lowered the geographical threshold to half a mile, but I was still getting false positives. For example:

| Incident ID | Date | Address | Killed | Injured | Latitude | Longitude |
|---|---|---|---|---|---|---|
| 2192046 | December 17, 2021 | 1300 block of W North Ave | 1 | 4 | 39.310198925452 | -76.639541672845 |
| 2191752 | December 17, 2021 | 600 block of Laurens St | 1 | 3 | 39.303712703668 | -76.636384414946 |

These two incidents were flagged as duplicates. The calculated distance between their coordinates was 0.479 miles, just under the half-mile threshold. That, combined with the fact that they occurred on the same day, caused them to be incorrectly merged.
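The distance-and-date check behind this false positive can be sketched roughly as follows. This is a minimal illustration, not the actual MSMDB code: it assumes great-circle (haversine) distance, which the post does not confirm, and the names `haversine_miles` and `flagged_as_duplicate` and the record layout are hypothetical.

```python
import math
from datetime import date

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius (assumed constant)


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two lat/lon points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def flagged_as_duplicate(a, b, max_miles=0.5, max_days=1):
    """Hypothetical rule: flag two records that are close in both time and space."""
    close_in_time = abs((a["date"] - b["date"]).days) <= max_days
    close_in_space = haversine_miles(a["lat"], a["lon"], b["lat"], b["lon"]) <= max_miles
    return close_in_time and close_in_space


# The two Baltimore records from the table above.
north_ave = {"id": 2192046, "date": date(2021, 12, 17),
             "lat": 39.310198925452, "lon": -76.639541672845}
laurens_st = {"id": 2191752, "date": date(2021, 12, 17),
              "lat": 39.303712703668, "lon": -76.636384414946}

distance = haversine_miles(north_ave["lat"], north_ave["lon"],
                           laurens_st["lat"], laurens_st["lon"])
print(round(distance, 3), flagged_as_duplicate(north_ave, laurens_st))
```

Under these assumptions the distance comes out at roughly 0.479 miles, so the pair is flagged at the half-mile threshold even though the records describe different shootings.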

Discovering Duplicate Incidents

I continued lowering the threshold and manually reviewing flagged duplicates until I found something interesting:

| Incident ID | Date | Address | Killed | Injured | Latitude | Longitude |
|---|---|---|---|---|---|---|
| 2255832 | March 16, 2022 | 800 block of 18th Ave | 1 | 3 | 40.73763619978 | -74.222439374877 |
| 2256080 | March 16, 2022 | 862 18th Ave | 1 | 3 | 40.73798250907 | -74.222869088046 |

These incidents were only 0.0328 miles apart. My first thought was that this might have been a geocoding inaccuracy from ArcGIS, which I used to generate coordinates for each incident (see this project for more info). Instead of continuing to make my logic stricter and stricter, I decided to manually check the GVA entries for both incidents:

Incident 2255832:

Type: Victim
Name: Faquan Davis
Age: 44
Age Group: Adult 18+
Gender: Male
Status: Killed

Incident 2256080:

Type: Victim
Name: Fuquan Davis
Age: 44
Age Group: Adult 18+
Gender: Male
Status: Killed

These are clearly the same victim (“Faquan” vs. “Fuquan”), yet the GVA catalogues them as two separate incidents. Both entries even cite the same news source: https://pix11.com/news/local-news/new-jersey/one-killed-in-irvington-quadruple-shooting-officials/
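A comparison like this can also be approximated in code. The sketch below is only an illustration of the idea, not something the original analysis used: it compares the participant fields shown above and tolerates small spelling differences in the name using `difflib.SequenceMatcher` from the Python standard library. The function names and the 0.85 cutoff are assumptions chosen for the example.

```python
from difflib import SequenceMatcher


def name_similarity(a, b):
    """Return a 0..1 similarity ratio between two names, ignoring case and outer spaces."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def likely_same_victim(v1, v2, min_ratio=0.85):
    """Hypothetical check: same age, gender and status, and nearly identical names.

    min_ratio is an illustrative cutoff, not a tuned value.
    """
    return (
        v1["age"] == v2["age"]
        and v1["gender"] == v2["gender"]
        and v1["status"] == v2["status"]
        and name_similarity(v1["name"], v2["name"]) >= min_ratio
    )


# Participant details from incidents 2255832 and 2256080.
victim_a = {"name": "Faquan Davis", "age": 44, "gender": "Male", "status": "Killed"}
victim_b = {"name": "Fuquan Davis", "age": 44, "gender": "Male", "status": "Killed"}

print(likely_same_victim(victim_a, victim_b))
```

The two spellings differ by a single character, so the name ratio lands well above the cutoff and the pair is reported as a likely match. A check like this only narrows down candidates; the entries still need to be read by hand.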

Additional Duplicate Cases

I wanted to see if there were any other duplicate incidents in the GVA database. Here’s what I found:

Case 2

| Incident ID | Date | Address | Killed | Injured | Latitude | Longitude |
|---|---|---|---|---|---|---|
| 2136132 | October 6, 2021 | 9200 block of Marshall Ave | 0 | 4 | 41.470064429739 | -81.621943822579 |
| 2135659 | October 6, 2021 | Marshall Ave and E 93rd St | 0 | 4 | 41.470068996525 | -81.621493968005 |

These are the same incident: both records describe a music studio shooting with four people injured, and the ages and genders of the victims match.

Case 3

| Incident ID | Date | Address | Killed | Injured | Latitude | Longitude |
|---|---|---|---|---|---|---|
| 1765624 | August 16, 2020 | 700 block of Chalfonte Pl | 1 | 3 | 39.150103442521 | -84.488763354682 |
| 1766423 | August 16, 2020 | 700 block of Chalfonte Pl | 1 | 3 | 39.150103442521 | -84.488763354682 |

Same incident: the date, address, casualty counts, and coordinates are identical.

Case 4

| Incident ID | Date | Address | Killed | Injured | Latitude | Longitude |
|---|---|---|---|---|---|---|
| 3150525 | September 6, 2014 | 576 Poplar St | 0 | 5 | 32.834771248901 | -83.629953285507 |
| 186488 | September 6, 2014 | 576 Poplar Street | 0 | 5 | 32.834771248901 | -83.629953285507 |

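Same incident: the date, casualty counts, and coordinates are identical, and the addresses differ only in “St” vs. “Street”.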
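One way to scan a single source for pairs like these is sketched below. The post does not describe how the scan was actually done, so this is only an assumption-laden illustration: it groups records by exact date, compares every pair within a day, and uses an equirectangular approximation for distance (a simpler stand-in for a great-circle formula that is adequate at sub-mile scale). The 0.05-mile cutoff, the function names, and the record layout are all hypothetical.

```python
import math
from collections import defaultdict
from itertools import combinations

MILES_PER_DEGREE = 69.09  # approximate length of one degree of latitude


def approx_miles(lat1, lon1, lat2, lon2):
    """Equirectangular distance approximation in miles; fine for very short distances."""
    mean_lat = math.radians((lat1 + lat2) / 2)
    dy = (lat2 - lat1) * MILES_PER_DEGREE
    dx = (lon2 - lon1) * MILES_PER_DEGREE * math.cos(mean_lat)
    return math.hypot(dx, dy)


def same_source_candidates(records, max_miles=0.05):
    """Return sorted ID pairs of records that share a date and lie within max_miles."""
    by_date = defaultdict(list)
    for rec in records:
        by_date[rec["date"]].append(rec)

    pairs = []
    for same_day in by_date.values():
        for a, b in combinations(same_day, 2):
            if approx_miles(a["lat"], a["lon"], b["lat"], b["lon"]) <= max_miles:
                pairs.append(tuple(sorted((a["id"], b["id"]))))
    return sorted(pairs)


# A few records from the tables in this post (dates as ISO strings).
records = [
    {"id": 2192046, "date": "2021-12-17", "lat": 39.310198925452, "lon": -76.639541672845},
    {"id": 2191752, "date": "2021-12-17", "lat": 39.303712703668, "lon": -76.636384414946},
    {"id": 2255832, "date": "2022-03-16", "lat": 40.73763619978, "lon": -74.222439374877},
    {"id": 2256080, "date": "2022-03-16", "lat": 40.73798250907, "lon": -74.222869088046},
    {"id": 2136132, "date": "2021-10-06", "lat": 41.470064429739, "lon": -81.621943822579},
    {"id": 2135659, "date": "2021-10-06", "lat": 41.470068996525, "lon": -81.621493968005},
]

print(same_source_candidates(records))
```

With this sample, the Irvington and Marshall Ave pairs are returned as candidates, while the two distinct Baltimore incidents are not. Any pair returned still needs manual review against the GVA entries before it is treated as a duplicate.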

Conclusion

In total, I found four pairs of mass shooting incidents (eight records in all) where the GVA catalogued the same event twice.

I don’t believe these findings indicate that the GVA is intentionally “inflating” its data or that it is an unreliable source. In fact, four duplicates out of over 5,700 catalogued mass shootings in more than a decade is a stellar track record and a very small margin of error. However, for a database cited as widely as the GVA, it’s important that these duplicate records are corrected as soon as possible to keep the data accurate.