Refactor the Wikidata Data Quality Report analytics procedures
Closed, Resolved · Public

Description

Refactor the Wikidata Data Quality Report analytics procedures:

  • refactor (most of) the data engineering code to run in the analytics cluster;
  • the work is currently done in R, on a single server, across the data sets produced by Pyspark,
  • by a process that consumes up to 50 GB of RAM on stat1007 - Analytics Engineering keeps killing it, and for good reason;
  • everything must migrate to Pyspark and run in the analytics cluster.

Also:

  • inspect what exactly was straining the stat1007 resources: (a) the joins, or (b) rendering the {ggplot2} visualizations;
  • if (a), we fix it simply by moving everything to the cluster; if (b), we find a way to render the visualizations efficiently, or visualize aggregated data sets only (see the sketch below).
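
A minimal sketch of the target setup in Pyspark, assuming a Spark session on the cluster; the paths, data sets, and column names below are illustrative, not the actual pipeline:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("wd-quality-report").getOrCreate()

# Illustrative inputs; the real paths and schemas differ
scores = spark.read.parquet("hdfs:///path/to/ores_scores")   # item, prediction
usage = spark.read.parquet("hdfs:///path/to/wdcm_usage")     # item, usage_count

# (a) the heavy join runs in the cluster, not in R on stat1007
joined = scores.join(usage, on="item", how="left")

# (b) aggregate before handing anything to R: a handful of rows
# instead of ~57M items, so {ggplot2} renders cheaply
agg = joined.groupBy("prediction").agg(
    F.count("*").alias("n_items"),
    F.avg("usage_count").alias("mean_usage"),
)
agg.coalesce(1).write.mode("overwrite").option("header", True).csv("hdfs:///path/to/report_agg")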

Event Timeline

  • ORES score predictions moved to hdfs, loaded to Spark;
  • all join operations will be performed in the cluster (see the loading sketch below).
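
For reference, a sketch of the loading step, assuming the run_*.out files are tab-separated with an item ID, a revision timestamp, and a predicted class per line (verify against the actual file layout):

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.appName("load-ores").getOrCreate()

# Assumed layout of run_*.out; adjust to the real columns
schema = StructType([
    StructField("item", StringType()),
    StructField("timestamp", StringType()),
    StructField("prediction", StringType()),
])
ores = (spark.read
        .option("sep", "\t")
        .schema(schema)
        .csv("hdfs:///path/to/run_201910.out"))
ores.createOrReplaceTempView("ores_scores")  # joins can now run in the cluster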

Next:

  • produce the final analytics dataset with Pyspark;
  • see if R can cope with it (rendering {ggplot2}, maybe aggregations) without putting too much stress on stat1007.

Status:

  • produce the final analytics dataset with Pyspark: DONE.

Next:

  • see if R can cope with it (rendering {ggplot2}, maybe aggregations) without putting too much stress on stat1007;
  • current RAM usage on stat1007: < 30 GB (compared to the initial ~50 GB).

Status:

  • incorrect data.frame produced;
  • fixing now.

@Ladsgroup We might have a problem with your most recent version of ORES quality score predictions for Wikidata.

In our Wikidata Quality Report, based on the first data set that you provided, we had ORES score predictions for 57,014,687 items.

However, your most recent data set - run_201910.out - (ask me in an email if you need to be reminded of the path under which it is found on stat1000*), when filtered for the most recent item revision scored, has 56,315,883 items.

It seems that we are missing most of the A (best) class items and many B class items from the new data set. Let's compare the counts:

New data set: run_201910.out

A        B         C          D          E
381      442,998   21,809,754 20,974,324 13,088,426

Previous data set (used for the Wikidata Quality Report)

A        B           C            D            E
10,730   1,007,744   23,997,946   19,136,231   14,405,722

Do you have any idea on what could have gone wrong here? Thanks.
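
(For what it's worth, per-class counts like the tables above are a one-line aggregation once the data set is in Spark; ores here is the hypothetical loaded prediction data frame from the loading sketch, filtered to the most recent scored revision per item:)

# Per-class item counts, analogous to the tables above
ores.groupBy("prediction").count().orderBy("prediction").show()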

@Ladsgroup We might have a problem with your most recent version of ORES quality score predictions for Wikidata.

In our Wikidata Quality Report, based on the first data set that you provided, we had ORES score predictions for 57,014,687 items.

However, your most recent data set - run_201910.out - (ask me in an email if you need to be reminded of the path under which it is found on stat1000*), when filtered for the most recent item revision scored, has 56,315,883 items.

The result is based on the dump of 2019-10-01; any edit after that date is not counted in the system. This is by design, so that we have snapshots of a wiki at given points in time.

@Ladsgroup

The result is based on the dump of 2019-10-01; any edit after that date is not counted in the system. This is by design, so that we have snapshots of a wiki at given points in time.

Thank you. Correct me if I am wrong, please:

  • that means that I need to (1) keep the initial ORES score predictions (the predictions that you initially provided), and then (2) use your subsequent ORES predictions to update them?

I have trouble understanding you. Can you elaborate more?

@Ladsgroup

My attempt at a logico-historical approach:

Step 1. You have produced a first data set of ORES score predictions that included all items.
Step 2. I have produced the Wikidata Quality Report based on that data set.
Step 3. You have produced an update of the ORES score predictions.
Claim A. The update produced in Step 3 does not encompass the ORES score predictions for all items, but only for those items that have received a revision between Steps 1 and 3.

The central question is: is my statement in Claim A correct?

@Ladsgroup

My attempt at a logico-historical approach:

Step 1. You have produced a first data set of ORES score predictions that included all items.
Step 2. I have produced the Wikidata Quality Report based on that data set.
Step 3. You have produced an update of the ORES score predictions.
Claim A. The update produced in Step 3 does not encompass the ORES score predictions for all items, but only for those items that have received a revision between Steps 1 and 3.

The central question is: is my statement in Claim A correct?

No, it's not correct. The update is based on a snapshot of the new month: items that have not changed are repeated, and items that have changed get the new version. If you look at the file, it has around 50M lines; not all Wikidata items change every month.

@Ladsgroup

If you look at the file, it has around 50M lines; not all Wikidata items change every month.

Well,

wc -l run_201910.out
407193775 run_201910.out

so it really has more than 407M lines?

No no, it should have only ~50M lines, no more.
Looking at the file, it has lots of old data. I should drop it; it's too much, and only these lines should stay:

ladsgroup@stat1007:~/articlequality$ grep 20190901000000 run_20190901.out | wc -l
54049707

Does it make sense to you?

Does it make sense to you?

Ping :)

@Ladsgroup Pong. I am on it as soon as I figure out what is wrong with this thing in WDCM: T239196.

@Ladsgroup Got it.

Ok: please do not delete any of your ORES predictions from stat1007 until I let you know that the update pipeline for the Data Quality Report is in place. That should happen soon. Thanks!

@Ladsgroup @WMDE-leszek

I have started a gradual update procedure (from the initial 111 GB run.out -> run_20190901.out -> run_201910.out -> run_201911.out).

As soon as this is completed, I will let @Ladsgroup know that these files can be removed from his directories on stat1007.
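
A sketch of how such a gradual update can be applied, reusing the assumed schema from the loading sketch above: union the successive files and keep, per item, the score attached to the most recent revision timestamp (column names are again illustrative):

from pyspark.sql import Window, functions as F

# Assumed: all run files share the schema from the loading sketch
files = ["run.out", "run_20190901.out", "run_201910.out", "run_201911.out"]
updates = None
for f in files:
    df = spark.read.option("sep", "\t").schema(schema).csv(f"hdfs:///path/to/{f}")
    updates = df if updates is None else updates.unionByName(df)

# Keep only the most recently scored revision per item
w = Window.partitionBy("item").orderBy(F.col("timestamp").desc())
latest = (updates
          .withColumn("rn", F.row_number().over(w))
          .filter(F.col("rn") == 1)
          .drop("rn"))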

@Ladsgroup Following this update run, we need to (a) standardize the update filename, and (b) decide how frequently we want to have it.

@Lydia_Pintscher and I have agreed that the Data Quality Report should be produced quarterly. However, just in case we need a fresh update immediately, maybe @Ladsgroup should put his ORES routine on a monthly update schedule. We'll see.

@Ladsgroup As far as I am concerned, you no longer need to keep the following ORES updates on stat1007:

  • run.out
  • run_20190901.out
  • run_201910.out
  • run_201911.out

From the filenames of your most recent updates: run_201910.out, run_201911.out, I infer that we will have run_201912.out, run_202001.out, then run_202002.out, etc. in the future. Please confirm.
Also, you do not need to worry about the following (see T237013#5666639):

Looking at the file, it has lots of old data. I should drop it; it's too much, and only these lines should stay.

because the filtering of the updates (like your grep 20190901000000 run_20190901.out) now happens in the Wikidata Quality Report code.
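
That in-code filter is simply the Pyspark analogue of the grep above (a sketch; the timestamp column name follows the assumed schema from the loading sketch):

# Keep only the snapshot lines, as grep 20190901000000 run_20190901.out did
snapshot = ores.filter(ores.timestamp == "20190901000000")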

@Lydia_Pintscher Now that the data sets are updated, we will certainly have the Wikidata Quality Report updated on or before December 15 as planned.

@Ladsgroup As far as I am concerned, you no longer need to keep the following ORES updates on stat1007:

  • run.out
  • run_20190901.out
  • run_201910.out
  • run_201911.out

From the filenames of your most recent updates: run_201910.out, run_201911.out, I infer that we will have run_201912.out, run_202001.out, then run_202002.out, etc. in the future. Please confirm.

I confirm

Also, you do not need to worry about the following (see T237013#5666639):

Looking at the file, it has lots of old data. I should drop it; it's too much, and only these lines should stay.

because the filtering of the updates (like your grep 20190901000000 run_20190901.out) now happens in the Wikidata Quality Report code.

I care about the storage issues. I dropped the unneeded lines and freed 70 GB on stat1007.

@Ladsgroup

I care about the storage issues. I dropped the unneeded lines and freed 70 GB on stat1007.

Even better. Thank you for all your efforts to support our quality assessment process!

@Lydia_Pintscher @WMDE-leszek

Since @JAllemandou has kindly provided a fresh copy of the WD JSON dump in hdfs (T209655#5713452), our next WD Quality Report update will be constrained only by the timestamp of the most recent snapshot of the wmf.mediawiki_history table available to us.

Let me elaborate. To produce the report, we need to coordinate the data from:

  • the update of ORES scores (available monthly),
  • the update of the wmf.mediawiki_history table (available monthly),
  • the WD JSON dump data in hdfs (available whenever @JAllemandou produces a new data set, until the process is productionized), and
  • the WDCM reuse statistics update (available weekly).

At this point we have the ORES scores update for November, the WDCM reuse statistics are updated weekly, the dump was copied to hdfs just now, and the current snapshot of wmf.mediawiki_history is snapshot=2019-11 (November).
In effect, that means that our December 2019 version of the quality report will be based on November 2019 data.
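
To check which mediawiki_history snapshot is available before kicking off an update, one can list the table's partitions from Spark (a minimal sketch, assuming the table is partitioned by snapshot only):

# Latest available snapshot of wmf.mediawiki_history, e.g. '2019-11'
parts = spark.sql("SHOW PARTITIONS wmf.mediawiki_history").collect()
latest_snapshot = max(r.partition.split("=")[1] for r in parts)
print(latest_snapshot)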

@Lydia_Pintscher The WD Data Quality Report is now updated.

As for future updates, they will be prepared monthly, as new ORES data sets are expected from @Ladsgroup on a monthly basis.

Closing the ticket as resolved, and focusing now on T234161 (WD Data Quality: compare quality vs. usage on Commons vs. everything else).