
Qurator: Data about Current Events
Closed, ResolvedPublic

Description

User story:
As an editor, I want to better understand which Items are currently being edited by many people, so that I can intervene if necessary (e.g. during current events).

Background: This is important if something big happens in the world (a famous person dies, the new leader of a country is elected, natural disaster strikes, …) as this often leads to a lot more people consuming the content. The increase in attention makes the Items related to the event a target for malicious edits. Making these Items easier to find for people who care about having accurate and up-to-date data about current events will enable them to better tackle the malicious edits there.

Solution:

  • Table of the Items edited by the most people per timespan.

Details:

  • Primary sorting by the number of people per timespan.
  • Secondary sorting by the number of edits in the same timespan.
  • Available timespans: 6h, 24h, if possible: 2d, 3d
    • Being able to track longer time spans is more important than getting exact numbers. This means that if tracking 2 or 3 days requires too much memory, we could e.g. only track Items that have reached the threshold of 2 editors within 24h.
  • The maximum delay (from update frequency and the time needed for analysis) should stay below one hour, if possible.
  • Optional: cut off long-tail of similar values
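
The ranking described in the details above (primary sort by distinct editors, secondary by edit count) can be sketched as follows; the revision data and item IDs are made up for illustration:

```python
from collections import defaultdict

# Hypothetical revision records: (item_id, editor) pairs observed in a timespan.
revisions = [
    ("Q42", "alice"), ("Q42", "bob"), ("Q42", "bob"),
    ("Q64", "carol"), ("Q64", "dave"), ("Q64", "erin"),
    ("Q1", "frank"),
]

editors = defaultdict(set)   # item -> distinct editors
edits = defaultdict(int)     # item -> total edits

for item, editor in revisions:
    editors[item].add(editor)
    edits[item] += 1

# Primary: number of distinct editors; secondary: number of edits (both descending).
ranked = sorted(editors, key=lambda q: (len(editors[q]), edits[q]), reverse=True)
print(ranked)  # Q64 (3 editors) first, then Q42 (2 editors), then Q1 (1 editor)
```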

Prototype:
https://wikidata-analytics.wmcloud.org/app/CurrentEvents

Original:
Problem Statement (problem to be solved):

When something big happens in the world (a famous person dies, the new leader of a country is elected, a natural disaster strikes, …) it very often leads to a surge of attention on the same topic in the Wikimedia projects. A lot more people consume the content and a lot more people contribute to the content. This increase in attention also makes the Items related to the event a target for malicious edits. Making the Items easier to find for people who care about having accurate and up-to-date data about current events will enable them to better tackle the malicious edits there.

Plan for a prototype:

Phase 1. Fetch revisions (SQL)

  • develop a simple crontab SQL daemon that extracts all Wikidata revisions every hour;
  • sort out the revision frequency per item and property and check if anything is spiking;
  • separate human revisions from bot/spider revisions;
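
The bot-filtering step above can be sketched as a simple set lookup; in the real pipeline this would run against the SQL replicas, and bot detection would use the user-groups data. All records and the bot list here are made up for illustration:

```python
# Sketch of the Phase 1 filtering step, assuming the last hour of revisions
# has already been fetched. The bot list is hypothetical; a real one would
# come from the wiki's user-group data.
revisions = [
    {"item": "Q42", "user": "Alice"},
    {"item": "Q42", "user": "ExampleBot"},
    {"item": "Q64", "user": "Bob"},
]
known_bots = {"ExampleBot"}

# Keep only revisions by users not flagged as bots.
human_revisions = [r for r in revisions if r["user"] not in known_bots]
print(len(human_revisions))  # 2
```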

Phase 2. Describe what is going on on Wikidata (SPARQL/MediaWiki API/R)

  • send a set of SPARQL queries (better: MediaWiki API calls) to WDQS, to describe the items and properties that are currently in the community's focus;
  • report essential revision statistics on the top edited items, classes, properties, per hour, day, and maybe week;
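
A minimal sketch of building such a WDQS query, fetching English labels for a set of top edited items via the standard label service (the item IDs are placeholders; the real list comes from Phase 1, and the query would then be POSTed to the WDQS endpoint):

```python
# Build a SPARQL query for the labels of the top edited items.
top_items = ["Q42", "Q64", "Q1"]

values = " ".join(f"wd:{q}" for q in top_items)
query = f"""
SELECT ?item ?itemLabel WHERE {{
  VALUES ?item {{ {values} }}
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en". }}
}}
"""
print(query)
```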

Phase 3. See if something in the World is correlated with the changes in Wikidata (NEWSRIVER API/R)

  • use the free NEWSRIVER API to fetch the latest headlines and news using the top edited Wikidata entities as search terms;
  • perform a basic, quick and simple search through the collected news and headlines to see if they frequently mention any of the Wikidata entities that are found under revision (or anything in relation to them: their properties, some characteristic values, etc) - not "real", ML-driven entity-linking in the prototype;
  • only if there is a clear, distinguishable match, generate a set of candidate news stories and report their URLs.
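
The "basic, quick and simple search" described above can be a plain, case-insensitive substring match of entity labels against headlines, as opposed to ML-driven entity linking. A sketch, with made-up headlines and labels:

```python
# Naive headline matching for the prototype: case-insensitive substring
# search for entity labels - deliberately not real entity linking.
headlines = [
    "Famous Person dies at 90",
    "Markets calm after election",
]
entity_labels = {"Q100": "Famous Person", "Q200": "Natural Disaster"}

matches = {
    qid: [h for h in headlines if label.lower() in h.lower()]
    for qid, label in entity_labels.items()
}
# Report only entities with at least one clearly matching headline.
candidates = {qid: hs for qid, hs in matches.items() if hs}
print(candidates)  # only Q100 has a matching headline
```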

Phase 4.

  • serve a simple, client-side dashboard that checks for the fresh, hourly generated data on the WMF's servers.

Event Timeline

  • Another type of failure for the news module detected;
  • fixing now.
  • Another bug fixed in respect to T259105#6436197;
  • continuing monitoring & debugging.

@Lydia_Pintscher @WMDE-leszek Status:

  • the dashboard is currently moving from prototype to 0.0.1
  • it is "devirtualized" and will be run directly from a Shiny Server instance on the test server
  • to enable close monitoring of possible failures (occasional "freezes" of the update cycle).

As soon as the dashboard is ready for testing I will share its URL here.

@Lydia_Pintscher @WMDE-leszek

The dashboard is live: http://datakolektiv.org:3838/WD_CurrentEvents/

  • strict monitoring procedures are in place;
  • I will be reporting back in case of any errors/fixes;
  • please let me know if the dashboard again reports an item as having no English label when it actually does have one, because I was not able to find such occurrences in my tests.

Thanks.

  • 18:59 CET bug fix - restarted the app.

@Lydia_Pintscher As for the English-labels problem (i.e. the dashboard reporting items without English labels while the same items do have English labels on Wikidata), I was only able to detect cases where the label was created very recently - and then it is natural that the dashboard reports an item revision but still can't find its label via the API.

Please, if you encounter a case where the label was created some time ago and the dashboard still reports that the item doesn't have one - let me know. Thanks.

2020/10/18:

  • freeze again; problem detected:
  • MediaWiki API did not return one field (old_revid);
  • action: potential bug fix, restart system, continue monitoring.

2020/10/20:

  • the updated module crashed again: no type field was received from the Wikidata API;
  • fixing now.

@Lydia_Pintscher @WMDE-leszek

This system is running in production from: https://wikidata-analytics.wmcloud.org/app/QURATOR_CurrentEvents

There is a slight problem with the sync of the https://analytics.wikimedia.org/published/datasets/wmde-analytics-engineering/Qurator/ directory used by this dashboard:

  • we produce new data on stat1005 every minute,
  • but the stat100* machines update the public directories only once every 15 minutes or so (or at least that is what I remember).

I have asked on the analytics-engineering IRC channel whether it is possible to make the sync frequency one minute for this directory only, and I am still waiting for a response. If not, we need a server that can run R, publish the dataset, and refresh it every minute as the data are produced from an R script running on crontab.

@WMDE-leszek This is not going to work the way it is in production now. We need a way to update in real time and still stay in the same production environment as the rest of the Wikidata Analytics. As I predicted, productionizing this (rather simple) software is going to cause us problems.

@WMDE-leszek Here is what I am going to do for starters:

  • the system will be brought back from production to the pre-production stage
  • and served from an independent Shiny Server instance,
  • only linked from the Wikidata Analytics portal, Qurator menu.

That means the system will not be vertically scaled until I work out how to serve it under Shiny Proxy and still have it updated in real time.

Also, this means that I will be phasing out any data production for this dashboard from the stat100* machines. The sync with the public data directories there is simply too slow to support the Current Events dashboard.

\o/
I did a quick review and it's looking good.
One small issue I saw is this contradiction between the message and what the table shows:

Screenshot.png (434×761 px, 26 KB)

Change 674853 had a related patch set uploaded (by GoranSMilovanovic; author: GoranSMilovanovic):
[analytics/wmde/WD/WikidataAnalytics@master] T259105

https://gerrit.wikimedia.org/r/674853

Change 674853 merged by GoranSMilovanovic:
[analytics/wmde/WD/WikidataAnalytics@master] T259105

https://gerrit.wikimedia.org/r/674853

@GoranSMilovanovic

  • I tested the dashboard a bit during the last 2 weeks and the issue that Lydia was reporting happened to me as well (items were included in the list even though they were edited by fewer than three editors). Unfortunately I didn't take a screenshot, but I copied the items it showed me:

Item | Revisions | Timestamp | Editors | Link to look up Item on News Site
2021 Cholet-Pays de la Loire (Q104805713) | 64 | 2021-04-13T12:58:42Z | 2 | Search News
Salviniales (Q7175204) | 32 | 2021-04-13T12:47:21Z | 2 | Search News

The issue hasn't happened since April 14th.

  • Another issue we discussed and wanted to bring up: Can you change the time frame for revisions in the previous 60 minutes to revisions in the previous 12 hours? We are hoping to get more data displayed with this change.

Thanks :)

Change 690692 had a related patch set uploaded (by GoranSMilovanovic; author: GoranSMilovanovic):

[analytics/wmde/WD/WikidataAnalytics@master] T259105

https://gerrit.wikimedia.org/r/690692

Change 690692 merged by GoranSMilovanovic:

[analytics/wmde/WD/WikidataAnalytics@master] T259105

https://gerrit.wikimedia.org/r/690692

@Maria_WMDE

... items were included in the list even though they were edited by fewer than three editors

Well, there were discussions about whether to display all recently edited items if it happens that none was observed to have been edited by >=3 editors...
However, this should be fixed now, and the dashboard should now display only those items (if any) that were edited by >=3 editors in the previous hour.

Another issue we discussed and wanted to bring up: Can you change the time frame for revisions in the previous 60 minutes to revisions in the previous 12 hours? We are hoping to get more data displayed with this change.

I am not sure that this would be possible from a practical point of view. Namely, in order to keep this system as real time as possible, we run the update procedure every minute from our Analytics Clients (the stat1005 machine in this particular case). Now, processing all Wikidata revisions from previous 12h would take so much time (and computational resources too, especially memory) that it would make it practically impossible to keep up with the current events.
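
One way to cover a longer timespan without reprocessing the whole 12h of revisions each minute would be an incremental sliding window that appends only the new revisions and evicts the expired ones. This is a sketch of that idea, not the deployed code; timestamps are plain integer seconds here:

```python
from collections import deque

# Sliding window over revisions: each minute we append only the new
# revisions and evict entries older than the window, instead of
# reprocessing 12h of history on every update.
WINDOW = 12 * 3600  # seconds

window = deque()  # (timestamp, item, editor), oldest first

def add_revision(ts, item, editor):
    window.append((ts, item, editor))
    # Evict revisions that have fallen out of the window.
    while window and window[0][0] <= ts - WINDOW:
        window.popleft()

add_revision(0, "Q42", "alice")
add_revision(WINDOW + 1, "Q64", "bob")  # evicts the first revision
print(len(window))  # 1
```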

Change 691579 had a related patch set uploaded (by GoranSMilovanovic; author: GoranSMilovanovic):

[analytics/wmde/WD/WikidataAnalytics@master] T259105

https://gerrit.wikimedia.org/r/691579

Change 691579 merged by GoranSMilovanovic:

[analytics/wmde/WD/WikidataAnalytics@master] T259105

https://gerrit.wikimedia.org/r/691579

@GoranSMilovanovic

Thanks for getting back, Goran!

Well, there were discussions about whether to display all recently edited items if it happens that none was observed to have been edited by >=3 editors...
However, this should be fixed now, and the dashboard should now display only those items (if any) that were edited by >=3 editors in the previous hour.

--> I haven't been able to access the dashboard, as it gives me an error message; I wasn't sure whether this was related to your working on it:

grafik.png (946×1 px, 424 KB)

I am not sure that this would be possible from a practical point of view. Namely, in order to keep this system as real time as possible, we run the update procedure every minute from our Analytics Clients (the stat1005 machine in this particular case). Now, processing all Wikidata revisions from previous 12h would take so much time (and computational resources too, especially memory) that it would make it practically impossible to keep up with the current events.

--> What do you think could be a practical time frame? Would 6 hours work for example?

Hi @GoranSMilovanovic!

Did you have a chance to have a look at my comment? Thanks!

Hi @GoranSMilovanovic, I changed the description to reflect what we discussed. See you soon! :)

  • New update module tested in local (dev) environment now;
  • Testing now in CloudVPS.
  • Update module test in CloudVPS successful;
  • expanding now to 6h, 24h, if possible: 2d, 3d.

@Manuel

My estimates say that we should face no problems in our attempt to aggregate revisions across 6h, 24h, 2d and 3d on a virtual instance used for Wikidata Analytics in Cloud VPS.

  • I will deploy a full version of the update module today and let it run for a while, just to make sure everything is fine.
  • The test will continue on a test dashboard, and if everything is still fine,
  • I should be able to deploy in production w. Shiny Proxy until Monday.

This system is a bit specific and that is why "live" tests like this one are necessary. Anyways, all should be fine. Reporting back as soon as I have something.
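
The multi-window aggregation mentioned above (one revision stream feeding 6h/24h/48h/72h tables) can be sketched by filtering one timestamped stream per window. Timestamps are "hours ago" here for simplicity, and all data is made up:

```python
from collections import defaultdict

# Aggregate one timestamped revision stream into per-window tables of
# distinct-editor counts per item.
revisions = [  # (hours_ago, item, editor)
    (1, "Q42", "alice"), (5, "Q42", "bob"),
    (20, "Q64", "carol"), (60, "Q1", "dave"),
]

def editors_per_item(window_hours):
    editors = defaultdict(set)
    for hours_ago, item, editor in revisions:
        if hours_ago <= window_hours:
            editors[item].add(editor)
    return {item: len(e) for item, e in editors.items()}

tables = {w: editors_per_item(w) for w in (6, 24, 48, 72)}
print(tables[6])   # only Q42 falls inside the 6h window
print(tables[72])  # all three items appear
```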

@Manuel

  • The test dashboard is running here.
  • It will take some time before we begin to observe any differences between the 6h, 24h, 48h, and 72h tables;
  • As soon as this is evaluated, I will deploy to production on Wikidata Analytics.

Note to self: correct "Updated every minute".

+ typo "update the table below each ten minutes" -> "update the table below every ten minutes"?

@So9q

+ typo "update the table below each ten minutes" -> "update the table below every ten minutes"?

Yes, a typo, thank you... test server... it will run in production (hopefully w/o typos) elsewhere soon : )

@Manuel Test successful; deploying to production tomorrow.
@So9q Thank you for your insights; all of the typos and similar will be taken care of.

Reporting back here as soon as the system is deployed in CloudVPS.

@Manuel We're running in production: https://wikidata-analytics.wmcloud.org/app/QURATOR_CurrentEvents

Of course, it will take some time for the 6h, 24h, 48h, and 72h tables to begin to differ - the update engine was restarted.

The test dashboard goes offline to save my resources on DataKolektiv's server (and to spare the Wikibase API my tons of requests for recent changes).

@GoranSMilovanovic: Awesome! I knew that you could do it! \o/
I'll get back to you about the UI tweaks.

One more thing: The main URL of the tool is [deleted]. Right now this main URL is throwing an error. Could you please point it to the production tool?

@Manuel

The URL of the tool is: https://wikidata-analytics.wmcloud.org/app/QURATOR_CurrentEvents

Right now this main URL is throwing an error.

The one that you have listed in T259105#7436093 throws an error, of course.

Could you please point it to the production tool?

I will, but: point it from where? The Wikidata Analytics portal points to the correct one and runs the system.

[Edited] If we want this to be used by the community I would prefer to name the main URL https://wikidata-analytics.wmcloud.org/app/CurrentEvents.

@Manuel

That will take some time because of the specifics of running R/Shiny in production under Shiny Proxy.

The Curious Facts - I am debugging something there right now - will also have to change then, because both systems now run on:

The changes also imply changing the structure of the codebase a bit.

I suggest you inform the community to switch to the URLs that I have shared here, but if you insist I will change them - and that will take some time.

Uuups, I just realized that I mixed up these two tools!

I edited my comments accordingly.

@Manuel

No problem; the change will take place (probably by the end of the week).

Can you in the meantime make redirects to the current (but temporary) location? Otherwise, we will get more bug reports from people trying to use the tools.

No; as described in T259105#7436288

That will take some time because of the specifics of running R/Shiny in production under Shiny Proxy.

@Manuel Wait; maybe it can be done in an easier way than I thought. Give me a couple of hours or so to test.

Ok cool... and if it does not work, could you please elaborate on where the problem is? This should not take so many days. Maybe our devs can help?

@Manuel No, no... no worries, see T259105#7436472; I figured out it can be done easily:

the new URLs

https://wikidata-analytics.wmcloud.org/app/CuriousFacts
https://wikidata-analytics.wmcloud.org/app/CurrentEvents

are now working.

I also need to change the links in the menu of the Wikidata Analytics Portal now.

Change 731745 had a related patch set uploaded (by GoranSMilovanovic; author: GoranSMilovanovic):

[analytics/wmde/WD/WikidataAnalytics@master] T259105

https://gerrit.wikimedia.org/r/731745

Change 731745 merged by GoranSMilovanovic:

[analytics/wmde/WD/WikidataAnalytics@master] T259105

https://gerrit.wikimedia.org/r/731745

@Manuel By the way, T259105#7436480

This should not take so many days.

Of course it would not take days; it is typically not a question of how much time, but when, because I am currently working on:

  • A synthesis of our Strategy findings (several tickets --> synthesis);
  • The Wikidata User Retention thing for WikidataCon 2021;
  • debugging the Curious Facts thing (trivial, but still needs my attention);
  • Three (3) campaigns for the New Editors team.

And there is only one of me : ) - Have mercy, please.

Manuel moved this task from Incoming to Miscellaneous on the Wikidata Analytics board.

Closing this task as we now have the basic functionality ready! \o/
I will create new tickets for the UX tweaks.