Wikidata Concepts Monitor: some datasets are empty
Closed, Resolved (Public)

Description

Some of the datasets provided via https://wikidata-analytics.wmcloud.org/app_direct/WikidataAnalytics/datasets.html seem to be empty since an update a couple of days ago. I am particularly looking at https://analytics.wikimedia.org/published/datasets/wmde-analytics-engineering/wdcm/etl/wdcm_topItems.csv, which I use to protect "highly used items" in Wikidata using a bot. Based on the raw file list at https://analytics.wikimedia.org/published/datasets/wmde-analytics-engineering/wdcm/etl/, I think there may be some other files affected by this as well (wdcm_category.csv, wdcm_project.csv, wdcm_project_category_item100.csv, and wdcm_topItems.csv).

Proposed solution:

  • re-run the script for fresh datasets
  • ensure that it won't put empty datasets there in the future again
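The second point could be handled on the producer side by refusing to overwrite a published file with an empty one. The following is a minimal sketch, not the actual WDCM code: the function name `publish_csv`, the `min_rows` threshold, and the column names in the usage note are assumptions made for illustration.

```python
import csv
import os
import tempfile


def publish_csv(rows, header, target_path, min_rows=1):
    """Write rows to target_path only if there are at least min_rows data rows.

    Returns True if the file was published, False if the update was skipped.
    (Hypothetical helper; min_rows is an assumed threshold.)
    """
    rows = list(rows)
    if len(rows) < min_rows:
        # Keep the previously published file instead of replacing it with nothing.
        return False

    target_dir = os.path.dirname(os.path.abspath(target_path))
    # Write to a temporary file in the same directory and rename it into place,
    # so that readers never see a half-written file.
    fd, tmp_path = tempfile.mkstemp(dir=target_dir, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", newline="", encoding="utf-8") as handle:
            writer = csv.writer(handle)
            writer.writerow(header)
            writer.writerows(rows)
        # mkstemp creates the file as owner-only; make it world-readable.
        os.chmod(tmp_path, 0o644)
        os.replace(tmp_path, target_path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
    return True
```

For example, `publish_csv([], ["item", "usage"], "wdcm_topItems.csv")` would return `False` and leave the existing file untouched; the column names here are placeholders, not the real schema.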

Event Timeline

@MisterSynergy Thank you for catching this.

re-run the script for fresh datasets
ensure that it won't put empty datasets there in the future again

Sure, but first we need to find out about the cause of this failure. I suspect that it might be related to some recent updates and changes in our Analytics infrastructure.

I am on it.

@WMDE-leszek This is a top priority as of now.

The problem will be handled here and in T281316 (where we already have the solution thanks to @elukey).

The following statement from T281063#7037642:

I suspect that it might be related to some recent updates and changes in our Analytics infrastructure.

turned out to be correct, see: T231067.

Current status:

  • running WDCM_Sqoop_Clients.R to produce the entity usage data tables in Hadoop (manual update);
  • next step: manual update of the whole WDCM system.

Estimates:

  • we should have our WDCM analytics back online in the next 24 - 36 hours.

Current status:

  • monitoring the WDCM_Sqoop_Clients.R update,
  • all looking good;
  • it might take 10 - 12 hours for this procedure to complete.

Current status:

  • WDCM_Sqoop_Clients.R update is completed;
  • running a manual update of the WDCM system now.

WDCM system update, current status

  • Collect module (SPARQL/GAS) completed;
  • ETL module (Spark) completed;
  • ML module is now running.

Current status:

  • WDCM system update: complete;
  • monitoring WDCM datasets and dashboards now.

Current status:

  • WDCM system update completed;
  • monitoring Wikidata Analytics now.

@MisterSynergy

The WDCM system update should be in place now.

Please let me know if the datasets that you need are now complete.

I apologize for any inconvenience. Please take into consideration that we are operating Data Analytics in an extremely complex environment here: many things can go wrong for many different reasons, each of which depends on someone with a different (and probably unique) expertise. Thank you.

@MisterSynergy I have checked the datasets in Wikidata Analytics and everything seems to be in place now.

However, I would like to keep this ticket and T281316 open until I manage to learn more from @elukey and the Analytics-Engineering team (when they find some time, of course) about the possible reasons for the recent failures in Apache Sqoop. If I understand the issue better, I may be able to figure out a more robust solution (e.g. a module that checks for the relevant drivers, or something similar, and decides how exactly to make a call to /usr/bin/sqoop).
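As a rough illustration of the kind of module described above, the sketch below checks whether a JDBC driver jar is present before assembling the call to /usr/bin/sqoop. It is only a sketch under assumptions: the actual cause of the Sqoop failures is not established in this ticket, the function name and the candidate jar paths are hypothetical, and passing the jar via Hadoop's generic `-libjars` option is one possible approach, not a confirmed fix.

```python
import os

SQOOP_BIN = "/usr/bin/sqoop"  # path mentioned in the ticket


def build_sqoop_import(connect_url, table, driver_class, candidate_jars):
    """Return an argument list for `sqoop import`, or raise if no driver jar exists.

    candidate_jars is a list of paths to try, in order (hypothetical locations).
    """
    available = [jar for jar in candidate_jars if os.path.isfile(jar)]
    if not available:
        # Fail early with a clear message instead of letting the job
        # run and produce empty tables.
        raise FileNotFoundError(
            "no JDBC driver jar found among: " + ", ".join(candidate_jars)
        )
    return [
        SQOOP_BIN, "import",
        # Generic Hadoop options must precede the tool-specific ones.
        "-libjars", available[0],
        "--connect", connect_url,
        "--driver", driver_class,
        "--table", table,
    ]
```

The returned list can be handed to `subprocess.run(cmd, check=True)`, so that a non-zero exit status from Sqoop stops the update instead of passing silently.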

GoranSMilovanovic lowered the priority of this task from High to Low. Apr 30 2021, 7:19 AM

@MisterSynergy

The WDCM system update should be in place now.

Please let me know if the datasets that you need are now complete.

I apologize for any inconvenience. Please take into consideration that we are operating Data Analytics in an extremely complex environment here: many things can go wrong for many different reasons, each of which depends on someone with a different (and probably unique) expertise. Thank you.

Thanks, I already saw it last night. My protection bot will run again tonight and process the newest output of the topItems.csv file. In some sense I had been anticipating a situation like this, which is why my protection bot does not do anything if it detects that something went wrong.
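A consumer-side check of this kind might look like the sketch below. It is not the bot's actual code: the function name, the `previous_count` comparison, and the 50% `max_drop` threshold are assumptions chosen for illustration.

```python
import csv
import io


def parse_top_items(csv_text, previous_count=None, max_drop=0.5):
    """Return the data rows of the CSV, or None if the file looks broken.

    The file is rejected when it has no data rows, or when it has shrunk
    by more than max_drop relative to the previous good run.
    """
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, None)
    if header is None:
        return None  # completely empty file
    # Ignore blank lines so that a header followed by whitespace counts as empty.
    rows = [row for row in reader if any(cell.strip() for cell in row)]
    if not rows:
        return None  # header only
    if previous_count and len(rows) < previous_count * (1 - max_drop):
        return None  # suspiciously small compared with the last good run
    return rows
```

A caller would skip the whole run when the result is `None`, leaving existing page protections as they are until a plausible file is available again.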

Side notes:

  • your wikidata-analytics.wmcloud.org seems to be down currently, but I am accessing the topItems.csv file directly anyways
  • somewhat later I will likely report some minor problems with the topItems.csv file that I have observed since I started to use it roughly two months ago; nothing serious to worry about, though

@MisterSynergy

your wikidata-analytics.wmcloud.org seems to be down currently, but I am accessing the topItems.csv file directly anyways

Yes, some changes in production were just deployed (in relation to T277554) that called for a short service interruption.
We're back online.

@MisterSynergy I will close this task now. Please re-open or file a new ticket altogether if you encounter any similar problems. Again, thanks for catching this.