
Wikidata Concepts Monitor: usage numbers have shrunk considerably within a week
Closed, ResolvedPublic

Description

My adminbot protects "highly used item pages" on Wikidata per policy, based on input from https://analytics.wikimedia.org/published/datasets/wmde-analytics-engineering/wdcm/etl/wdcm_topItems.csv. With the latest update of that report (2021-06-07, 18:55), usage numbers shrank considerably: the number of items with a usage count greater than 500 decreased from ~29,867 to ~18,896 within a week (−37%). Safety measures in my bot code have prevented the bot from removing ~11,000 existing page protections for now.

Is this some sort of bug that led to an incomplete report, or has something changed in the way "item usage" is determined?
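For context, the bot-side consumption of the report looks roughly like this (a minimal sketch, not the actual bot code; the dataset URL and the 500-usage threshold are taken from the description above, while the column names and the drop threshold of the safety check are assumptions):

```python
import pandas as pd

# Published WDCM dataset referenced above.
WDCM_TOP_ITEMS_URL = (
    "https://analytics.wikimedia.org/published/datasets/"
    "wmde-analytics-engineering/wdcm/etl/wdcm_topItems.csv"
)
USAGE_THRESHOLD = 500  # protection threshold per the bot's policy


def highly_used_items(url=WDCM_TOP_ITEMS_URL):
    """Return the set of item IDs whose usage count exceeds the threshold.

    The column names 'eu_entity_id' and 'eu_count' are assumptions for
    illustration; the real file may use different headers.
    """
    df = pd.read_csv(url)
    return set(df.loc[df["eu_count"] > USAGE_THRESHOLD, "eu_entity_id"])


def sanity_check(previous_items, current_items, max_drop=0.2):
    """Safety measure of the kind mentioned above: refuse to lift
    protections if the candidate set shrinks too sharply between runs."""
    if not previous_items:
        return True
    drop = 1 - len(current_items) / len(previous_items)
    return drop <= max_drop
```

With the figures reported above, a check like this sees a drop of roughly 37% between runs and refuses to touch the existing protections, which matches the bot's observed behaviour.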

Event Timeline

GoranSMilovanovic triaged this task as High priority.

@MisterSynergy

Please share the previous version of wdcm_topItems.csv here. I am on it. Highest priority. Thank you for catching this.

In fread(paste0("shardTables_", i, ".tsv"), sep = "\t") :
  File 'shardTables_4.tsv' has size 0. Returning a NULL data.table.

from the WDCM_Sqoop_Clients.R log; shard 4 (s4) is Commons?
But it seems that even more data are missing. I will parse the sqoop module log.
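The warning points at a zero-size shard export. Not the actual WDCM R module, but for illustration, a guard of this kind would surface an empty export before it silently propagates downstream (the shardTables_*.tsv naming follows the warning message above):

```python
import glob
import os

def find_empty_shard_tables(directory="."):
    """Return shardTables_*.tsv files that are empty (0 bytes)."""
    pattern = os.path.join(directory, "shardTables_*.tsv")
    return [p for p in sorted(glob.glob(pattern)) if os.path.getsize(p) == 0]

empty_tables = find_empty_shard_tables()
if empty_tables:
    # A 0-byte file here is exactly what makes fread() return a NULL data.table.
    raise RuntimeError(f"Empty shard table exports: {empty_tables}")
```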

Notes:

  • the latest sqoop run ended on 2021-06-07 06:19:55;
  • the next one is scheduled from stat1004's crontab to start on 2021-06-14 00:00:00.

The most probable next step, after analyzing this directly in SQL/MariaDB (see the query sketch after the list):

  • run a manual update of the WDCM sqoop module; monitor.
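For reference, the kind of direct check one can run against a client wiki's wbc_entity_usage table looks roughly like this (a sketch only; the host, credentials and example item are placeholders, not the actual queries that were run):

```python
import pymysql  # assumes access to a MariaDB replica of a client wiki

# Placeholders only: point these at an actual replica and credentials.
conn = pymysql.connect(
    host="REPLICA_HOST",
    user="USER",
    password="PASSWORD",
    database="commonswiki_p",
)

item = "Q42"  # hypothetical item to inspect

with conn.cursor() as cur:
    # wbc_entity_usage records which pages of this wiki use which entity
    # (and in which aspect: sitelink, label, statements, ...).
    cur.execute(
        "SELECT COUNT(DISTINCT eu_page_id) FROM wbc_entity_usage "
        "WHERE eu_entity_id = %s",
        (item,),
    )
    (page_count,) = cur.fetchone()
conn.close()

print(f"{item} is used on {page_count} pages of this wiki")
```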

> @MisterSynergy
> Please share the previous version of wdcm_topItems.csv here. I am on it. Highest priority. Thank you for catching this.

I don't have any previous versions available, just the current one. The Python bot script saves a local copy of the current toplist.csv file by overwriting the previous revision, then loads it into a pandas DataFrame which is subsequently evaluated. Pretty straightforward and in fact rather simple; notably, nothing went wrong in the latest run, and nothing has changed since the previous successful run.

@MisterSynergy Thank you. No worries, I will figure this out from the WDCM sqoop logs. Sooner or later.

@MisterSynergy

  • running a manual update of the WDCM sqoop module now;
  • monitoring.

Monitoring the Sqoop procedures from the core MediaWiki databases to goransm.wdcm_clients_wb_entity_usage in the Data Lake:

  • Shard 1 (enwiki only): completed;
  • Shard 2: still running, no problems detected so far;
  • Shard 3: has many wikis (> 800), to be run last;
  • Shard 4: Commons (possible problem);
  • then testing Shards 5–8 (Wikidata) consecutively;
  • Sqoop Shard 4 (Commons) is running now: compared to what was observed in the WDCM Sqoop Clients log in T284850#7152935, I no longer see any problem with Commons. The Commons database is found, and its wbc_entity_usage table is being sqooped without problems right now. But then, what went wrong during the previous update? (A spot check is sketched after this list.)
  • Monitoring.
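As a spot check once this shard finishes, the Commons partition can be verified from an analytics client, for example with PySpark (a sketch; it assumes a working Spark session on the cluster and that Commons appears under the wiki_db value commonswiki):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Row count of the freshly sqooped Commons partition; zero rows would
# reproduce the "size 0" symptom seen in the previous run.
commons_rows = (
    spark.table("goransm.wdcm_clients_wb_entity_usage")
    .filter(F.col("wiki_db") == "commonswiki")
    .count()
)
print(f"commonswiki rows in wdcm_clients_wb_entity_usage: {commons_rows}")
```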

@MisterSynergy

  • the full manual update of the WDCM Sqoop procedure is now completed;
  • 876 partitions (wiki_db) are present in the Data Lake, which means that everything should be fine (a quick partition check is sketched below),
  • unless something changed in the per-wiki wbc_entity_usage tables.
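For completeness, the partition count can be re-checked from an analytics client along these lines (a sketch, assuming a Spark session with access to the table in the Hive metastore):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# List the wiki_db partitions that actually made it into the Data Lake;
# the expectation after a complete sqoop run is 876.
partitions = spark.sql("SHOW PARTITIONS goransm.wdcm_clients_wb_entity_usage")
print(f"{partitions.count()} partitions present")
```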

Next steps:

  • a full WDCM system update is scheduled for today, 16:00 UTC;
  • it will take several hours, and then
  • we can take a look at the datasets and dashboards to make sure the data are now complete.

@MisterSynergy Could you please check the wdcm_topItems.csv dataset now and let me know if it looks alright?

> @MisterSynergy Could you please check the wdcm_topItems.csv dataset now and let me know if it looks alright?

Yes, right now it looks pretty much as it did previously, so I'm happy with it.