
Qurator: Data about Current Events
Open, Needs Triage, Public

Description

Problem Statement (problem to be solved):

When something big happens in the world (a famous person dies, a new leader of a country is elected, a natural disaster strikes, …), it very often leads to a surge of attention on the same topic in the Wikimedia projects. Many more people consume the content and many more people contribute to it. This increase in attention also makes the Items related to the event a target for malicious edits. Making these Items easier to find for people who care about having accurate and up-to-date data about current events will enable them to better tackle the malicious edits there.

Plan for a prototype:

Phase 1. Fetch revisions (SQL)

  • develop a simple cron-scheduled SQL job that extracts all Wikidata revisions every hour;
  • aggregate the revision frequency per item and property and check whether anything is spiking;
  • separate human revisions from bot/spider revisions (a sketch of the hourly query follows this list);
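A minimal sketch of the hourly extraction, assuming it runs as an R script from cron against the wikidatawiki Wiki Replica (the host name and credential variables are placeholders, not the actual setup):

```r
# Hourly extraction sketch: connect to the wikidatawiki replica and pull
# per-item revision counts for the last hour, excluding flagged bot edits.
library(DBI)
library(RMariaDB)

con <- dbConnect(
  RMariaDB::MariaDB(),
  host     = "wikidatawiki.analytics.db.svc.wikimedia.cloud",  # placeholder host
  dbname   = "wikidatawiki_p",
  username = Sys.getenv("REPLICA_USER"),                       # placeholder credentials
  password = Sys.getenv("REPLICA_PASSWORD")
)

# MediaWiki timestamps are YYYYMMDDHHMMSS strings; build the 1-hour cutoff.
since <- format(Sys.time() - 3600, "%Y%m%d%H%M%S", tz = "UTC")

# rc_namespace = 0 keeps item pages; rc_bot = 0 drops edits flagged as bot
# edits (unflagged bots/spiders would need extra heuristics on actor names).
hourly <- dbGetQuery(con, paste0(
  "SELECT rc_title AS item, COUNT(*) AS revisions ",
  "FROM recentchanges ",
  "WHERE rc_namespace = 0 AND rc_bot = 0 AND rc_timestamp >= '", since, "' ",
  "GROUP BY rc_title ORDER BY revisions DESC"
))
dbDisconnect(con)
```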

Phase 2. Describe what is going on on Wikidata (SPARQL/MediaWiki API/R)

  • send a set of SPARQL queries to WDQS (or, better, MediaWiki API calls) to describe the items and properties that are currently in the community's focus;
  • report essential revision statistics on the top edited items, classes, and properties, per hour, day, and maybe week (see the sketch after this list);
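For the MediaWiki API route, a minimal sketch (the helper name is mine; it assumes the standard wbgetentities module and the httr/jsonlite packages) that fetches English labels for the top edited items:

```r
# Fetch English labels for a batch of up to 50 item IDs via wbgetentities.
library(httr)
library(jsonlite)

get_labels <- function(qids, lang = "en") {
  res <- GET(
    "https://www.wikidata.org/w/api.php",
    query = list(
      action    = "wbgetentities",
      ids       = paste(qids, collapse = "|"),  # max 50 IDs per call
      props     = "labels",
      languages = lang,
      format    = "json"
    )
  )
  ents <- fromJSON(content(res, as = "text", encoding = "UTF-8"))$entities
  vapply(ents, function(e) {
    lbl <- e$labels[[lang]]$value
    if (is.null(lbl)) NA_character_ else lbl   # NA for items with no label yet
  }, character(1))
}

get_labels(c("Q42", "Q64"))
```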

Phase 3. See if something in the World is correlated with the changes in Wikidata (NEWSRIVER API/R)

  • use the free NEWSRIVER API to fetch the latest headlines and news using the top edited Wikidata entities as search terms;
  • perform a basic, quick, and simple search through the collected news and headlines to see whether they frequently mention any of the Wikidata entities found under revision (or anything related to them: their properties, some characteristic values, etc.); no "real", ML-driven entity linking in the prototype;
  • only if there is a clear, distinguishable match, generate a set of candidate news stories and report their URLs (a hedged sketch follows this list).
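A hedged sketch of the news lookup, assuming the NEWSRIVER v2 search endpoint with its Lucene-style query parameter as documented at the time (the parameter names and the token header should be checked against the current API docs; the token is a placeholder):

```r
# Search NEWSRIVER for a Wikidata entity label and keep only stories whose
# titles actually mention it.
library(httr)
library(jsonlite)

search_news <- function(label, token = Sys.getenv("NEWSRIVER_TOKEN")) {
  res <- GET(
    "https://api.newsriver.io/v2/search",
    query = list(
      query     = sprintf('title:"%s" OR text:"%s"', label, label),
      sortBy    = "discoverDate",
      sortOrder = "DESC",
      limit     = 15
    ),
    add_headers(Authorization = token)
  )
  stories <- fromJSON(content(res, as = "text", encoding = "UTF-8"))
  if (!is.data.frame(stories) || nrow(stories) == 0) return(NULL)
  # "clear, distinguishable match" stand-in: label appears in the title
  stories[grepl(label, stories$title, fixed = TRUE), c("title", "url")]
}
```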

Phase 4.

  • serve a simple, client-side dashboard that polls for the fresh, hourly generated data on the WMF's servers (see the sketch below).
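As a sketch of what the client side could look like, assuming the generated data are published as a CSV at a fixed URL (the URL is a placeholder):

```r
# Minimal Shiny dashboard that re-reads the published dataset on a schedule.
library(shiny)

ui <- fluidPage(
  titlePanel("Wikidata: Current Events (sketch)"),
  tableOutput("topItems")
)

server <- function(input, output, session) {
  data <- reactivePoll(
    60 * 1000, session,
    checkFunc = function() Sys.time(),  # treat every tick as "changed"
    valueFunc = function() read.csv("https://example.org/qurator/hourly.csv")
  )
  output$topItems <- renderTable(head(data(), 20))
}

shinyApp(ui, server)
```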

Event Timeline

@Lydia_Pintscher

  • We have live revision updates from the API implemented, and
  • real-time aggregated revision frequencies for the last 10 minutes and the last hour are tracked (a sketch of the polling call follows this list).
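For reference, a sketch of what the live polling could look like, assuming the standard list=recentchanges API module (the actual module code is not attached to this task):

```r
# Poll the last batch of non-bot item revisions from the Wikidata API.
library(httr)
library(jsonlite)

res <- GET("https://www.wikidata.org/w/api.php", query = list(
  action      = "query",
  list        = "recentchanges",
  rcnamespace = 0,            # item pages only
  rcshow      = "!bot",       # exclude flagged bot edits
  rcprop      = "title|ids|timestamp|user",
  rclimit     = 500,
  format      = "json"
))
rc <- fromJSON(content(res, as = "text", encoding = "UTF-8"))$query$recentchanges
# rc is a data frame; aggregate it into 10-minute / 1-hour frequency tables.
```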

Now:

  • implement news search;
  • decoration: find out the most popular classes, and maybe use geo-coordinates, where present, to show a map on the dashboard, and similar;
  • develop a dashboard;
  • figure out deployment/production.

@Lydia_Pintscher

Status:

  • implement news search: DONE
    • proof-of-concept: fetch relevant news for recently edited Wikidata items;

Next steps:

  • decoration: find out the most popular classes, and maybe use geo-coordinates, where present, to show a map on the dashboard, and similar;
  • develop a dashboard;
  • figure out sync/deployment/production.

Found three ways in which the Q_CE_02-WDNews.R module (it fetches news articles from NEWSRIVER.io) can fail:

  • fixed all three;
  • continuing to monitor.

Another type of failure of the news module detected:

  • fixing now.

Another bug fixed with respect to T259105#6436197:

  • continuing monitoring and debugging.

@Lydia_Pintscher @WMDE-leszek Status:

  • the dashboard is currently moving from prototype to 0.0.1;
  • it is "devirtualized" and will be run directly from a Shiny Server instance on the test server;
  • this enables close monitoring of possible failures (occasional "freezes" of the update cycle).

As soon as the dashboard is ready for testing I will share its URL here.

@Lydia_Pintscher @WMDE-leszek

The dashboard is live: http://datakolektiv.org:3838/WD_CurrentEvents/

  • strict monitoring procedures are in place;
  • I will be reporting back in case of any errors/fixes;
  • please let me know if the dashboard reports an item as having no English label when it actually does have one, because I was not able to find such occurrences in my tests.

Thanks.

  • 18:59 CET bug fix - restarted the app.

@Lydia_Pintscher As for the English-labels problem (i.e. the dashboard reporting items without English labels while the same items do have English labels on Wikidata), I was able to detect only cases where the label was created very recently; then it is natural that the dashboard reports an item revision but still can't find its label via the API.

Please, if you encounter a case where the label was created some time ago and the dashboard still reports that the item doesn't have one, let me know. Thanks.

2020/10/18:

  • freeze again; problem detected:
  • the MediaWiki API did not return one field (old_revid);
  • action: potential bug fix, restart system, continue monitoring.

2020/10/20:

  • the updated module crashed again: no type field was received from the Wikidata API;
  • fixing now (a sketch of the kind of defensive check involved follows).
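A hedged sketch of the kind of defensive check such fixes involve (the helper and the rc_entry variable are hypothetical; the real module code is not attached to this task):

```r
# Guard against fields that the API occasionally omits (old_revid, type).
safe_field <- function(x, field, default = NA) {
  if (!is.null(x[[field]]) && length(x[[field]]) > 0) x[[field]] else default
}

old_revid <- safe_field(rc_entry, "old_revid")                  # NA when absent
edit_type <- safe_field(rc_entry, "type", default = "unknown")
```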

@Lydia_Pintscher @WMDE-leszek

This system is now running in production at: https://wikidata-analytics.wmcloud.org/app/QURATOR_CurrentEvents

There is a slight problem with the sync of the https://analytics.wikimedia.org/published/datasets/wmde-analytics-engineering/Qurator/ directory used by this dashboard:

  • we produce new data on stat1005 every minute,
  • but the stat100* machines update the public directories only once every 15 minutes or so (or at least that is what I remember).

I have asked on the analytics-engineering IRC channel whether it is possible to make the sync frequency one minute for this directory only, and I am still waiting for a response. If not, we need a server that can run R, publish the dataset, and refresh it every minute as the data are produced from an R script running on crontab (sketched below).
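For concreteness, the per-minute production step could be scheduled like this (a hypothetical crontab entry; the script path and log location are placeholders, not the real ones):

```
# Run the update script every minute and append its output to a log.
* * * * * Rscript /srv/qurator/Q_CE_Update.R >> /var/log/qurator_update.log 2>&1
```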

@WMDE-leszek This is not going to work the way it is in production now. We need a way to update in real time and still stay in the same production environment as the rest of the Wikidata Analytics. As I predicted, productionizing this (rather simple) software is going to cause us problems.

@WMDE-leszek Here is what I am going to do for starters:

  • the system will be brought back from production to the pre-production stage;
  • it will be served from an independent Shiny Server instance, only linked from the Wikidata Analytics portal, Qurator menu.

That means the system will not be vertically scaled until I work out how to serve it under ShinyProxy and still manage to have it updated in real time.

Also, this means that I will be phasing out any data production for this dashboard from the stat100* machines. The sync with the public data directories there is simply too slow to support the Current Events dashboard.

\o/
I did a quick review and it's looking good.
One small issue I saw is this contradiction between the message and what the table shows:

Screenshot.png (434×761 px, 26 KB)

Change 674853 had a related patch set uploaded (by GoranSMilovanovic; author: GoranSMilovanovic):
[analytics/wmde/WD/WikidataAnalytics@master] T259105

https://gerrit.wikimedia.org/r/674853

Change 674853 merged by GoranSMilovanovic:
[analytics/wmde/WD/WikidataAnalytics@master] T259105

https://gerrit.wikimedia.org/r/674853

@GoranSMilovanovic

  • I tested the dashboard a bit during the last 2 weeks, and the issue that Lydia was reporting happened to me as well (items were included in the list even though they were edited by fewer than three editors). Unfortunately I didn't take a screenshot, but I copied the items it showed me:

Item | Revisions | Timestamp | Editors | Link to look up Item on News Site
2021 Cholet-Pays de la Loire (Q104805713) | 64 | 2021-04-13T12:58:42Z | 2 | Search News
Salviniales (Q7175204) | 32 | 2021-04-13T12:47:21Z | 2 | Search News

The issue hasn't happened since April 14th.

  • Another issue we discussed and wanted to bring up: can you change the time frame from revisions in the previous 60 minutes to revisions in the previous 12 hours? We are hoping to get more data displayed with this change.

Thanks :)

Change 690692 had a related patch set uploaded (by GoranSMilovanovic; author: GoranSMilovanovic):

[analytics/wmde/WD/WikidataAnalytics@master] T259105

https://gerrit.wikimedia.org/r/690692

Change 690692 merged by GoranSMilovanovic:

[analytics/wmde/WD/WikidataAnalytics@master] T259105

https://gerrit.wikimedia.org/r/690692

@Maria_WMDE

... items were included in the list even though they were edited by less than three editors

Well, there were discussions about whether to display all recently edited items if it happens that none was observed to have been edited by >=3 editors...
However, this should be fixed now, and the dashboard should now display only those items (if any) that were edited by >=3 editors in the previous hour (a sketch of the filter follows).
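A minimal sketch of that filter, assuming a data frame revs of the previous hour's revisions with item and editor columns (the names are mine, not the module's):

```r
# Keep only items edited by at least 3 distinct editors in the last hour.
library(dplyr)

eligible <- revs %>%
  group_by(item) %>%
  summarise(
    editors   = n_distinct(editor),
    revisions = n(),
    .groups   = "drop"
  ) %>%
  filter(editors >= 3) %>%
  arrange(desc(revisions))
```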

Another issue we discussed and wanted to bring up: can you change the time frame from revisions in the previous 60 minutes to revisions in the previous 12 hours? We are hoping to get more data displayed with this change.

I am not sure that this would be possible from a practical point of view. Namely, in order to keep this system as close to real time as possible, we run the update procedure every minute from our Analytics Clients (the stat1005 machine in this particular case). Now, processing all Wikidata revisions from the previous 12 hours would take so much time (and computational resources too, especially memory) that it would be practically impossible to keep up with the current events. (A sketch of a cheaper rolling-window alternative follows; it is not what the deployed system does.)
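To illustrate the trade-off (again, not what the deployed system does): a longer window could be made cheap by persisting per-minute aggregates and summing the last N of them, at the price of exact distinct-editor counts; the minute_aggregates layout below is hypothetical:

```r
# Rebuild a 12-hour view from 720 small per-minute summary files instead of
# reprocessing all raw revisions every minute.
library(dplyr)

files  <- tail(sort(list.files("minute_aggregates", full.names = TRUE)), 12 * 60)
window <- bind_rows(lapply(files, read.csv)) %>%
  group_by(item) %>%
  summarise(
    revisions = sum(revisions),
    editors   = max(editors),  # only a lower bound: per-minute distinct-editor
                               # counts are not additive across minutes
    .groups   = "drop"
  ) %>%
  arrange(desc(revisions))
```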

Change 691579 had a related patch set uploaded (by GoranSMilovanovic; author: GoranSMilovanovic):

[analytics/wmde/WD/WikidataAnalytics@master] T259105

https://gerrit.wikimedia.org/r/691579

Change 691579 merged by GoranSMilovanovic:

[analytics/wmde/WD/WikidataAnalytics@master] T259105

https://gerrit.wikimedia.org/r/691579

@GoranSMilovanovic

Thanks for getting back, Goran!

Well, there were discussions about whether to display all recently edited items if it happens that none was observed to have been edited by >=3 editors...
However, this should be fixed now, and the dashboard should now display only those items (if any) that were edited by >=3 editors in the previous hour

--> I haven't been able to access the dashboard, as it gives me an error message; I wasn't sure whether this was related to your working on it:

grafik.png (946×1 px, 424 KB)

I am not sure that this would be possible from a practical point of view. Namely, in order to keep this system as close to real time as possible, we run the update procedure every minute from our Analytics Clients (the stat1005 machine in this particular case). Now, processing all Wikidata revisions from the previous 12 hours would take so much time (and computational resources too, especially memory) that it would be practically impossible to keep up with the current events.

--> What do you think could be a practical time frame? Would 6 hours work, for example?

Hi @GoranSMilovanovic!

Did you have a chance to have a look at my comment? Thanks!