
WDCM Dashboards Maintenance
Open, Normal · Public

Description

This is an umbrella ticket for all WDCM Dashboards

  • improvements,
  • bug fixes,
  • new features.

Event Timeline

Restricted Application added a subscriber: Aklapper. · Mar 10 2019, 10:43 PM
  • Fix the update back-end engine for the [[ URL | WDCM Biases Dashboard ]]. @RazShuty
  • ! - Advanced Search Extension Dashboard update is broken: inspect, fix.
GoranSMilovanovic triaged this task as High priority. · Mar 29 2019, 7:37 PM
  • Status: ! - the Advanced Search Extension Dashboard update has *probably* failed on one day only,
  • leading to a gap in the time series;
  • running updates manually for now to prevent data loss;
  • inspect the issue and then put the engine back on automation.
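
A minimal sketch of how a one-day gap in a daily time series could be detected. The function name `missing_days` and the sample dates are hypothetical and are not taken from the dashboard code:

```python
from datetime import date, timedelta

def missing_days(observed):
    """Return the dates absent between the first and last observed day."""
    present = set(observed)
    if not present:
        return []
    first, last = min(present), max(present)
    n_days = (last - first).days + 1
    candidates = (first + timedelta(days=i) for i in range(n_days))
    return [d for d in candidates if d not in present]

# Hypothetical update dates with one failed day in between.
updates = [date(2019, 3, 25), date(2019, 3, 26), date(2019, 3, 28)]
print(missing_days(updates))
```

With these sample dates the only missing day is Mar 27; an empty result would mean the series is continuous.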
GoranSMilovanovic lowered the priority of this task from High to Normal. · Apr 2 2019, 9:50 AM
GoranSMilovanovic lowered the priority of this task from Normal to Low. · Apr 2 2019, 7:02 PM
  • Advanced Search Extension dashboard update fixed.
GoranSMilovanovic raised the priority of this task from Low to High. · May 27 2019, 8:05 AM
GoranSMilovanovic moved this task from WDCM to Prioritized on the User-GoranSMilovanovic board.
  • the WDCM_Sqoop_Clients.R procedure fails for some databases;
  • however, this seems to happen only occasionally;
  • inspect and solve.

@Lydia_Pintscher

  • Correction. It is not the Apache Sqoop procedure in WDCM_Sqoop_Clients.R that fails (luckily!);
  • explanation: I stumbled upon an unfinished log file while checking this... and some databases, naturally, were missing, because the procedure had not completed all of its passes yet.
  • Continue with: inspect why the usage number reported on WD_percentUsageDashboard 'oscillates'.

@Lydia_Pintscher I got it:

  • the dashboard updated before the Apache Sqoop run - which produces the essential data for the WDCM and this dashboard as well - was completed;
  • since we need the product of the Apache Sqoop procedure - the one and the only re-use Hive table that we use for everything - to be available in an asynchronous manner to various statistical systems,
  • the only solution is to move the crontab run of the dashboard update (from stat1004) to a later time (currently it is 08:00 UTC; the Apache Sqoop run starts at midnight and seems to take more than nine hours to complete - and that is the root of the problem).

Implementing change; testing as of tomorrow.
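
The timing conflict described above can be sketched as follows. This is only an illustration of the arithmetic: the function names are hypothetical, and the 9.5-hour duration and one-hour margin are assumptions (the only known facts are a midnight start, "more than nine hours", and the 08:00 UTC update).

```python
from datetime import datetime, timedelta

def update_is_safe(etl_start, etl_duration, update_time):
    """True if the dashboard update starts after the ETL run has finished."""
    return update_time >= etl_start + etl_duration

def earliest_safe_update(etl_start, etl_duration, margin):
    """Earliest update time that leaves `margin` after the ETL run ends."""
    return etl_start + etl_duration + margin

sqoop_start = datetime(2019, 5, 27, 0, 0)        # Sqoop run starts at midnight UTC
sqoop_duration = timedelta(hours=9, minutes=30)  # assumed; "more than nine hours"
old_update = datetime(2019, 5, 27, 8, 0)         # current crontab time, 08:00 UTC

print(update_is_safe(sqoop_start, sqoop_duration, old_update))  # False
print(earliest_safe_update(sqoop_start, sqoop_duration, timedelta(hours=1)))
```

Under these assumptions the 08:00 UTC run starts while Sqoop is still writing, which matches the oscillating usage numbers reported above.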

GoranSMilovanovic lowered the priority of this task from High to Normal. · May 27 2019, 11:10 AM
GoranSMilovanovic added a comment. · Edited · May 31 2019, 11:12 AM
  • WDCM Geo is back on updates, re-designed, fully client-side dependent, and does not rely on the wdcm_maintable anymore (see T214586):

http://wmdeanalytics.wmflabs.org/WDCM_GeoDashboard/

  • Next step: do the same for the WDCM Biases dashboard.
GoranSMilovanovic added a comment. · Edited · Jun 7 2019, 6:37 PM

@RazShuty @Lea_Lacroix_WMDE

  • The WDCM Biases Dashboard's back-end has to switch to Apache Spark;
  • it maintains large data sets and depends heavily on WDQS, sending queries that time out every now and then; beyond that,
  • it relies on geo-coordinates which, when fetched together with items from WDQS, produce long vectors that cannot be converted from raw to character() in R (and which then need to be parsed by {jsonlite} into readable data.frame types for processing).

We will deal with this dashboard once and for all.
@Lea_Lacroix_WMDE I know we said this dashboard is on maintenance only, but this is exactly it: if the changes do not take place, it will never update again.

  • PySpark ETL procedures for WDCM Biases completed;
  • next step: refactor the R code to work with the Spark outputs.

@Lea_WMDE

  • archive the Advanced Search Extension dashboard (tracking).
  • Goodbye Advanced Search Extension dashboard :(

@RazShuty @Lea_Lacroix_WMDE

The WDCM Biases dashboard is now back on updates:

  • it will be updated once a month, and
  • its update depends on the most recent version of the WD dump copy in HDFS (see T209655);
  • the dashboard is now fully client-side dependent,
  • and because it now uses Spark as its ETL back-end we can finally remove the huge wdcm_maintable from HDFS that previously supported our ETL procedures for the WDCM system (Hive; see T214586).
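
A minimal sketch of picking the most recent dump copy before a monthly update. The `snapshot=YYYY-MM-DD` partition naming and the function `latest_snapshot` are assumptions made for illustration only; the actual HDFS layout is described in T209655 and may differ.

```python
def latest_snapshot(partitions):
    """Return the newest snapshot label, or None if there is none.

    Assumes ISO-formatted dates, which sort correctly as plain strings.
    """
    prefix = "snapshot="
    labels = [p[len(prefix):] for p in partitions if p.startswith(prefix)]
    return max(labels) if labels else None

# Hypothetical directory listing of the dump copy.
listing = ["snapshot=2019-05-27", "snapshot=2019-06-10", "snapshot=2019-06-03"]
print(latest_snapshot(listing))
```

Returning None for an empty listing lets the caller skip the update instead of running against stale data.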

@Lea_Lacroix_WMDE This is the only WDCM dashboard whose design is not consistent with the others; take a look under "Wikidata Concepts Monitor" on the WMDE Analytics Portal.
I did not invest any work into its re-design since the decision was to keep the dashboard on maintenance only. However, for reasons of consistency, I think it should be done. Please let me know if you agree.