User story:
As an editor, I want to better understand which Items are currently being edited by a lot of people so that I can intervene if necessary (e.g. during current events).
Background: This is important if something big happens in the world (a famous person dies, the new leader of a country is elected, natural disaster strikes, …) as this often leads to a lot more people consuming the content. The increase in attention makes the Items related to the event a target for malicious edits. Making these Items easier to find for people who care about having accurate and up-to-date data about current events will enable them to better tackle the malicious edits there.
Solution:
- Table of the Items edited by the most people per timespan.
Details:
- Primary sorting by the number of distinct people who edited the Item within the timespan.
- Secondary sorting by the number of edits within the same timespan (see the sketch after this list).
- Available timespans: 6h and 24h; if possible, also 2d and 3d.
- Being able to track longer timespans is more important than getting exact numbers. If tracking 2 or 3 days requires too much memory, we could, for example, only track Items that have reached a threshold of 2 editors within 24h.
- The maximum delay (update frequency plus the time needed for analysis) should stay below one hour, if possible.
- Optional: cut off the long tail of Items with similar values.
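A minimal sketch of the ranking described above, assuming the edits within one timespan are available as (Item, editor) pairs; the function name, record format, and the default 2-editor threshold are illustrative assumptions, not a fixed design:

```python
from collections import defaultdict

def rank_items(edits, min_editors=2):
    """edits: iterable of (item_id, user_name) pairs observed within one timespan."""
    editors = defaultdict(set)   # Item -> distinct editors
    counts = defaultdict(int)    # Item -> number of edits
    for item_id, user in edits:
        editors[item_id].add(user)
        counts[item_id] += 1
    table = [
        (item_id, len(users), counts[item_id])
        for item_id, users in editors.items()
        if len(users) >= min_editors  # optional threshold, keeps memory bounded
    ]
    # primary sort: distinct editors; secondary sort: edit count (both descending)
    table.sort(key=lambda row: (row[1], row[2]), reverse=True)
    return table

# Q42 was touched by three people, Q64 by one person twice
print(rank_items([("Q42", "A"), ("Q42", "B"), ("Q42", "C"),
                  ("Q64", "D"), ("Q64", "D")], min_editors=1))
```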
Prototype:
https://wikidata-analytics.wmcloud.org/app/CurrentEvents
Original:
Problem Statement (problem to be solved):
When something big happens in the world (a famous person dies, a new leader of a country is elected, a natural disaster strikes, …) it very often leads to a surge of attention on the same topic in the Wikimedia projects. A lot more people consume the content and a lot more people contribute to it. This increase in attention also makes the Items related to the event a target for malicious edits. Making the Items easier to find for people who care about having accurate and up-to-date data about current events will enable them to better tackle the malicious edits there.
Plan for a prototype:
Phase 1. Fetch revisions (SQL)
- develop a simple cron-driven SQL job that extracts all Wikidata revisions every hour (see the sketch below);
- aggregate the revision frequency per Item and Property and check whether anything is spiking;
- separate human revisions from bot/spider revisions;
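A hedged sketch of what the hourly cron job could run against the Wiki Replicas, assuming the standard MediaWiki revision/page/actor/user_groups tables; the host and database names are assumptions, and filtering on the "bot" user group only removes registered bots, not spiders:

```python
import os
import pymysql

# Revisions of Items (main namespace) from the last hour, grouped per Item.
# rev_timestamp on the replicas is a UTC string in YYYYMMDDHHMMSS format.
LAST_HOUR_SQL = """
SELECT p.page_title                 AS item,
       COUNT(*)                     AS edits,
       COUNT(DISTINCT r.rev_actor)  AS editors
FROM revision r
JOIN page  p ON p.page_id  = r.rev_page
JOIN actor a ON a.actor_id = r.rev_actor
LEFT JOIN user_groups ug ON ug.ug_user = a.actor_user AND ug.ug_group = 'bot'
WHERE p.page_namespace = 0
  AND r.rev_timestamp >= DATE_FORMAT(NOW() - INTERVAL 1 HOUR, '%Y%m%d%H%i%s')
  AND ug.ug_user IS NULL            -- drop revisions made by registered bots
GROUP BY p.page_title
ORDER BY editors DESC, edits DESC
"""

def fetch_last_hour():
    conn = pymysql.connect(
        host="wikidatawiki.analytics.db.svc.wikimedia.cloud",  # assumed replica host
        database="wikidatawiki_p",
        read_default_file=os.path.expanduser("~/.my.cnf"),     # replica credentials
    )
    try:
        with conn.cursor() as cur:
            cur.execute(LAST_HOUR_SQL)
            return cur.fetchall()
    finally:
        conn.close()
```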
Phase 2. Describe what is going on on Wikidata (SPARQL/MediaWiki API/R)
- send a set of SPARQL queries to WDQS (or, better, MediaWiki API calls) to describe the Items and Properties that are currently in the community's focus (see the sketch below);
- report essential revision statistics on the top-edited Items, classes, and Properties per hour, per day, and maybe per week;
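A minimal sketch of the describe step, assuming the public WDQS SPARQL endpoint; only English labels and instance-of (P31) classes are fetched here, which is enough to say what kind of Items are being edited:

```python
import requests

WDQS = "https://query.wikidata.org/sparql"

def describe_items(qids):
    """Return labels and instance-of classes for a list of top-edited Q-ids."""
    values = " ".join(f"wd:{q}" for q in qids)
    query = f"""
    SELECT ?item ?itemLabel ?class ?classLabel WHERE {{
      VALUES ?item {{ {values} }}
      OPTIONAL {{ ?item wdt:P31 ?class . }}
      SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en". }}
    }}"""
    r = requests.get(WDQS, params={"query": query, "format": "json"},
                     headers={"User-Agent": "current-events-prototype/0.1 (example)"})
    r.raise_for_status()
    return r.json()["results"]["bindings"]

# e.g. describe the two most-edited Items from phase 1
for row in describe_items(["Q42", "Q64"]):
    print(row["itemLabel"]["value"], "-", row.get("classLabel", {}).get("value", "?"))
```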
Phase 3. See if something in the world is correlated with the changes in Wikidata (NEWSRIVER API/R)
- use the free NEWSRIVER API to fetch the latest headlines and news, using the top-edited Wikidata entities as search terms;
- perform a basic, quick and simple search through the collected news and headlines to see whether they frequently mention any of the Wikidata entities found under heavy revision (or anything related to them: their properties, some characteristic values, etc.); no "real", ML-driven entity linking in the prototype (see the sketch below);
- only if there is a clear, distinguishable match, generate a set of candidate news stories and report their URLs.
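A simple sketch of the matching step, assuming the headlines have already been fetched (the exact NEWSRIVER request and response format is not reproduced here); this is plain substring matching over labels, deliberately not ML-driven entity linking:

```python
from collections import defaultdict

def match_headlines(entity_labels, headlines):
    """entity_labels: {qid: label}; headlines: list of {"title": ..., "url": ...} dicts."""
    candidates = defaultdict(list)        # Q-id -> URLs of matching stories
    for qid, label in entity_labels.items():
        needle = label.lower()
        for story in headlines:
            if needle in story["title"].lower():
                candidates[qid].append(story["url"])
    # only entities with at least one clearly matching story are reported
    return {qid: urls for qid, urls in candidates.items() if urls}

print(match_headlines(
    {"Q42": "Douglas Adams"},
    [{"title": "Douglas Adams biography re-released", "url": "https://example.org/1"}],
))
```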
Phase 4. Dashboard
- serve a simple, client-side dashboard that checks for the fresh, hourly generated data on the WMF's servers.