
Wikidata Concepts Monitor: usage numbers have shrunk considerably within a week
Closed, ResolvedPublic

Description

My adminbot protects "highly used item pages" on Wikidata per policy, based on input from https://analytics.wikimedia.org/published/datasets/wmde-analytics-engineering/wdcm/etl/wdcm_topItems.csv. With the latest update of that report (2021-06-07, 18:55), usage numbers shrank considerably: the number of items with a usage count greater than 500 decreased from ~29,867 to ~18,896 within a week (−37%). Safety measures in my bot code have prevented the bot from removing ~11,000 existing page protections for now.

Is this some sort of bug that led to an incomplete report, or has something changed in the way "item usage" is determined?
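For context, the bot-side consumption of the report looks roughly like this (a minimal sketch, not the actual bot code; the dataset URL and the 500-usage threshold are taken from the description above, while the column names and the drop threshold of the safety check are assumptions):

```python
import pandas as pd

# Published WDCM dataset referenced above.
WDCM_TOP_ITEMS_URL = (
    "https://analytics.wikimedia.org/published/datasets/"
    "wmde-analytics-engineering/wdcm/etl/wdcm_topItems.csv"
)
USAGE_THRESHOLD = 500  # protection threshold per the bot's policy


def highly_used_items(url=WDCM_TOP_ITEMS_URL):
    """Return the set of item IDs whose usage count exceeds the threshold.

    The column names 'eu_entity_id' and 'eu_count' are assumptions for
    illustration; the real file may use different headers.
    """
    df = pd.read_csv(url)
    return set(df.loc[df["eu_count"] > USAGE_THRESHOLD, "eu_entity_id"])


def sanity_check(previous_items, current_items, max_drop=0.2):
    """Safety measure of the kind mentioned above: refuse to lift
    protections if the candidate set shrinks too sharply between runs."""
    if not previous_items:
        return True
    drop = 1 - len(current_items) / len(previous_items)
    return drop <= max_drop
```

With the figures reported above, a check like this sees a drop of roughly 37% between runs and refuses to touch the existing protections, which matches the bot's observed behaviour.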

Event Timeline

GoranSMilovanovic triaged this task as High priority.

@MisterSynergy

Please share the previous version of wdcm_topItems.csv here. I am on it. Highest priority. Thank you for catching this.

In fread(paste0("shardTables_", i, ".tsv"), sep = "\t") :
  File 'shardTables_4.tsv' has size 0. Returning a NULL data.table.

from the WDCM_Sqoop_Clients.R log; shard 4 (s4) is Commons?
But it seems that even more data are missing. I will parse the sqoop module log.
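The warning points at a zero-size shard export. Not the actual WDCM R module, but for illustration, a guard of this kind would surface an empty export before it silently propagates downstream (the shardTables_*.tsv naming follows the warning message above):

```python
import glob
import os

def find_empty_shard_tables(directory="."):
    """Return shardTables_*.tsv files that are empty (0 bytes)."""
    pattern = os.path.join(directory, "shardTables_*.tsv")
    return [p for p in sorted(glob.glob(pattern)) if os.path.getsize(p) == 0]

empty_tables = find_empty_shard_tables()
if empty_tables:
    # A 0-byte file here is exactly what makes fread() return a NULL data.table.
    raise RuntimeError(f"Empty shard table exports: {empty_tables}")
```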

Notes:

  • the latest sqoop run ended on 2021-06-07 06:19:55;
  • the next one is scheduled from stat1004's crontab to start on 2021-06-14 00:00:00.

The most probable next step, after analyzing this directly in SQL/MariaDB (see the query sketch after the list):

  • run a manual update of the WDCM sqoop module; monitor.
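For reference, the kind of direct check one can run against a client wiki's wbc_entity_usage table looks roughly like this (a sketch only; the host, credentials and example item are placeholders, not the actual queries that were run):

```python
import pymysql  # assumes access to a MariaDB replica of a client wiki

# Placeholders only: point these at an actual replica and credentials.
conn = pymysql.connect(
    host="REPLICA_HOST",
    user="USER",
    password="PASSWORD",
    database="commonswiki_p",
)

item = "Q42"  # hypothetical item to inspect

with conn.cursor() as cur:
    # wbc_entity_usage records which pages of this wiki use which entity
    # (and in which aspect: sitelink, label, statements, ...).
    cur.execute(
        "SELECT COUNT(DISTINCT eu_page_id) FROM wbc_entity_usage "
        "WHERE eu_entity_id = %s",
        (item,),
    )
    (page_count,) = cur.fetchone()
conn.close()

print(f"{item} is used on {page_count} pages of this wiki")
```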

> @MisterSynergy
> Please share the previous version of wdcm_topItems.csv here. I am on it. Highest priority. Thank you for catching this.

I don't have any previous versions available, just the current one. The Python bot script saves a local copy of the current toplist.csv file by overwriting the previous revision, then loads it into a pandas DataFrame which is subsequently evaluated. Pretty straightforward and in fact rather simple; notably, nothing went wrong in the latest run, and nothing has changed since the previous successful run.

@MisterSynergy Thank you. No worries, I will figure this out from the WDCM sqoop logs. Sooner or later.

@MisterSynergy

  • running a manual update of the WDCM sqoop module now;
  • monitoring.

Monitoring the Sqoop procedures from the core MediaWiki databases to goransm.wdcm_clients_wb_entity_usage in the Data Lake:

  • Shard 1 (enwiki only): completed;
  • Shard 2: still running, no problems detected so far;
  • Shard 3: has many wikis (> 800), to be run last;
  • Shard 4: Commons (possible problem);
  • then testing Shards 5–8 (Wikidata) consecutively;
  • Sqoop Shard 4 (Commons) is running now: compared to what was observed in the WDCM Sqoop Clients log in T284850#7152935, I no longer see any problem with Commons. The Commons database is found, and its wbc_entity_usage table is being sqooped without problems right now. But then, what went wrong during the previous update? (A spot check is sketched after this list.)
  • Monitoring.
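As a spot check once this shard finishes, the Commons partition can be verified from an analytics client, for example with PySpark (a sketch; it assumes a working Spark session on the cluster and that Commons appears under the wiki_db value commonswiki):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Row count of the freshly sqooped Commons partition; zero rows would
# reproduce the "size 0" symptom seen in the previous run.
commons_rows = (
    spark.table("goransm.wdcm_clients_wb_entity_usage")
    .filter(F.col("wiki_db") == "commonswiki")
    .count()
)
print(f"commonswiki rows in wdcm_clients_wb_entity_usage: {commons_rows}")
```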

@MisterSynergy

  • the full manual update of the WDCM Sqoop procedure is now completed;
  • 876 partitions (wiki_db) are present in the Data Lake, which means that everything should be fine (a quick partition check is sketched below),
  • unless something changed in the per-wiki wbc_entity_usage tables.
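For completeness, the partition count can be re-checked from an analytics client along these lines (a sketch, assuming a Spark session with access to the table in the Hive metastore):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# List the wiki_db partitions that actually made it into the Data Lake;
# the expectation after a complete sqoop run is 876.
partitions = spark.sql("SHOW PARTITIONS goransm.wdcm_clients_wb_entity_usage")
print(f"{partitions.count()} partitions present")
```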

Next steps:

  • a full WDCM system update is scheduled for today, 16:00 UTC;
  • it will take several hours, and then
  • we can take a look at the datasets and dashboards to make sure the data are now complete.

@MisterSynergy Could you please check the wdcm_topItems.csv dataset now and let me know if it looks alright?

> @MisterSynergy Could you please check the wdcm_topItems.csv dataset now and let me know if it looks alright?

Yes, right now it looks pretty much as it did previously, so I'm happy with it.