
Create a suitable template and import into wikidata the list of major data breaches
Open, Medium, Public

Description

This is an additional (optional) task proposal for the Wikidata stream of wiki-techstorm-2019.

There is an actively maintained list of data breach events on Wikipedia. The objective of this task is to import this data into Wikidata. Once in Wikidata, the catalog of events can be queried in simple or more complex ways to derive insights about data security risks.

The steps required would be to

  • parse the wikitext of the list page
  • process the row entries and create one unique entity per event in Wikidata, using a data model adapted to the available data and the nature of these entries (events involving corporate or public entities, etc.)
  • do some data wrangling with OpenRefine or another tool, as the records are not consistent
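
The first two steps could be sketched roughly as below. This is only a minimal illustration using the Python standard library, not the approach actually used for the import. It assumes the table uses the common wikitable layout (`|-` between rows, `||` between cells on one line) and ignores cell attributes and multi-line cells. The function names `parse_wikitable` and `clean_cell` and the rows in `SAMPLE` are invented for the example and are not taken from the real list.

```python
import re

# Footnotes (<ref>...</ref> or <ref ... />) and wiki links ([[Target|Label]]).
REF = re.compile(r"<ref[^>]*/>|<ref[^>]*>.*?</ref>", re.DOTALL)
LINK = re.compile(r"\[\[(?:[^|\]]*\|)?([^\]]*)\]\]")


def clean_cell(text):
    """Drop footnotes and reduce wiki links to their visible label."""
    text = REF.sub("", text)
    text = LINK.sub(r"\1", text)
    return text.strip()


def parse_wikitable(wikitext):
    """Return one dict per row of the first table, keyed by header text."""
    headers, rows, cells = [], [], []
    in_table = False
    for raw in wikitext.splitlines():
        line = raw.strip()
        if line.startswith("{|"):
            in_table = True
            continue
        if not in_table or line.startswith("|+"):  # skip text outside table and captions
            continue
        if line.startswith("|-") or line.startswith("|}"):
            if cells:  # a row separator closes the row collected so far
                rows.append(cells)
            cells = []
            if line.startswith("|}"):
                break
            continue
        if line.startswith("!"):
            headers.extend(h.strip() for h in line[1:].split("!!"))
        elif line.startswith("|"):
            cells.extend(clean_cell(c) for c in line[1:].split("||"))
    return [dict(zip(headers, row)) for row in rows]


# Invented placeholder rows, only to show the expected input shape.
SAMPLE = """{| class="wikitable sortable"
! Entity !! Year !! Records !! Organization type !! Method
|-
| [[Example Corp|Example]] || 2019 || 1,500,000 || web || hacked<ref>placeholder</ref>
|-
| Sample Agency || 2018 || unknown || government || poor security
|}"""

rows = parse_wikitable(SAMPLE)
for row in rows:
    print(row)
```

Each resulting dict corresponds to one candidate event. Reconciling the entity names against Wikidata and creating the items would still have to happen in a separate step, for instance in OpenRefine.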

Event Timeline

Phofx created this task.Nov 21 2019, 5:47 PM
Ecritures triaged this task as Medium priority.Nov 21 2019, 7:33 PM
Pintoch moved this task from Backlog to Data imports on the OpenRefine board.Nov 21 2019, 9:04 PM

42 of these breaches have also been uploaded to Wikidata (the ones whose reference links I also checked manually).

Phofx added a comment.Nov 24 2019, 2:32 PM

A first iteration of inserting data from the wikipedia table is now complete. Some lessons learned in the process are collected here.

The quality of the input data could be improved. A summary assessment per field:

  • Entity: Used, but many of the entities involved in a data breach (~50%) cannot be reliably reconciled. Some don't exist in Wikidata, others are poorly formed
  • Organization type: Not used and likely not reliable. Ideally this should be pulled from the reconciled entity and belong to a standard sector classification (NACE etc.)
  • Year: No issues (but not verified), Used
  • Records: The majority of entries could be converted to a numeric estimate of the number of records leaked. The property "data size" was used with the unit "record". In some cases the impact is unknown or expressed otherwise (e.g. in GB).
  • Method: This has been used as-is in the Description (as a hack)
  • Sources: Used, but quality of reference varies considerably

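The conversion of the Records field described above could look something like the following. It is a hedged sketch rather than the exact procedure used for the import: the helper name `parse_record_count` and the set of accepted spellings ("thousand", "million", "billion") are assumptions. Values that are unknown or given in another unit such as GB yield `None` so they can be reviewed by hand.

```python
import re

# Assumed multiplier words; the real table may use other spellings.
MULTIPLIERS = {"thousand": 10**3, "million": 10**6, "billion": 10**9}

# A number with optional thousands separators and decimals, then an
# optional multiplier word and an optional trailing "records".
PATTERN = re.compile(
    r"^\s*(\d[\d,]*(?:\.\d+)?)\s*(thousand|million|billion)?\s*(?:records)?\s*$",
    re.IGNORECASE,
)


def parse_record_count(text):
    """Return the number of leaked records as an int, or None if unclear."""
    match = PATTERN.match(text)
    if match is None:
        return None  # e.g. "unknown" or a size given in GB
    number = float(match.group(1).replace(",", ""))
    factor = MULTIPLIERS.get((match.group(2) or "").lower(), 1)
    return int(round(number * factor))


for value in ["1,500,000", "1.5 million", "unknown", "10 GB"]:
    print(value, "->", parse_record_count(value))
```

Returning `None` instead of guessing keeps ambiguous entries out of the numeric property until someone has checked them.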
Some thoughts for future work:

  • It seems that the data collection process would improve significantly if people updating the Wikipedia table were to introduce events in Wikidata first, rather than the other way around
  • The method (attack vector) of the data breach is valuable information. Capturing it properly would require refining the data breach property
  • The holy grail of this effort would be to eventually be able to import the much larger Verizon community database of incidents