
Grafana: "wikidata-datamodel-terms" doesn't update anymore
Closed, ResolvedPublic

Description

Problem:
We want to understand how the terms (labels, descriptions, aliases) in different languages develop over time. We have a dashboard at https://grafana.wikimedia.org/dashboard/db/wikidata-datamodel-terms?var-lang=All that is broken. It should be fixed to show data for all graphs there again. It would be ideal to backfill the data for the past year at least.

Acceptance criteria:

  • all graphs show data again.

Tech note:

Event Timeline


@GoranSMilovanovic: it would be good if you could put this on your to-do list. People are asking for current data.

@Lydia_Pintscher @Addshore

  • I've seen a lot of data re-engineering going on around the wb_terms SQL table recently.
  • The WDCM system also experiences some problems when fetching from this table (check the WDCM dashboards and you will spot too many missing labels for quite frequently used items, while the WDCM R code itself hasn't changed).
  • As soon as I'm done with Cognate/Wiktionary (T166487), I will take a look at the state in which the wb_terms table is found at the present moment, and then suggest a solution for this.
  • This is not the first time that I am about to state the following (cf. T193969#4185028): we are obviously in the Big Data segment here, which implies that SQL must go. It was not meant to be.
  • Stay in touch.
  • Cheers.

I will take a look at the state in which the wb_terms table is found at the present moment, and then suggest a solution for this.

We probably shouldn't look at getting this from the wb_terms table, as it will be removed in the hopefully not too distant future.
The best bet is probably to get this information from the JSON dumps.

@Addshore After reconsidering this, I have to state openly that I am against relying on JSON dumps as the only source of data.

@WMDE-leszek @Aleksey_WMDE will also be interested to hear, I guess. Well:

  • Given the context of contemporary data science, and the really powerful infrastructure that we have at our disposal, it simply doesn't seem right to have to process the Wikidata dumps in order to fetch any statistics on the data model or provide (simple or otherwise) aggregates.
  • I think we need a solution that lives properly in the Big Data segment. That most probably means Hadoop. Maybe sqooping the whole wb_terms table - or whatever re-engineered SQL version of it - into Hadoop, and then redirecting any relevant changes that we would previously have recorded in SQL to Hadoop.
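For illustration of the "copy wb_terms into Hadoop" idea above (a sketch only, not an agreed implementation): the same copy could be done with Spark's JDBC reader rather than Sqoop itself. The replica host, credentials, and target table below are placeholders.

# Sketch: copy wb_terms from a MariaDB replica into a Hive table so later
# aggregations run in the cluster. Host, credentials and table names are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wb_terms_to_hadoop").enableHiveSupport().getOrCreate()

wb_terms = (
    spark.read.format("jdbc")
    .option("url", "jdbc:mysql://db-replica.example.org:3306/wikidatawiki")  # placeholder
    .option("dbtable", "wb_terms")
    .option("user", "research")        # placeholder
    .option("password", "********")    # placeholder
    .option("fetchsize", "10000")
    .load()
)

wb_terms.write.mode("overwrite").saveAsTable("wmde_analytics.wb_terms_snapshot")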

@Lydia_Pintscher @Addshore @WMDE-leszek @Aleksey_WMDE

Putting aside the question of the afterlife of the SQL wb_terms table for now: is there anything that can be done to fix this dashboard anytime soon, or do we wait for a new data engineering solution for the labels first?

@Addshore After reconsidering this, I have to state openly that I am against relying on JSON dumps as the only source of data.

Why?

@WMDE-leszek @Aleksey_WMDE will also be interested to hear, I guess. Well:

  • Given the context of contemporary data science, and the really powerful infrastructure that we have at our disposal, it simply doesn't seem right to have to process the Wikidata dumps in order to fetch any statistics on the data model or provide (simple or otherwise) aggregates.
  • I think we need a solution that lives properly in the Big Data segment. That most probably means Hadoop. Maybe sqooping the whole wb_terms table - or whatever re-engineered SQL version of it - into Hadoop, and then redirecting any relevant changes that we would previously have recorded in SQL to Hadoop.

I'm not sure if there is priority to get this done.

@Lydia_Pintscher @Addshore @WMDE-leszek @Aleksey_WMDE

Putting aside the question of the afterlife of the SQL wb_terms table for now: is there anything that can be done to fix this dashboard anytime soon, or do we wait for a new data engineering solution for the labels first?

The way to get this dashboard fixed in the short term would be to use the dumps.
Another alternative is to wait for the changes to wb_terms, but that could take a while.

@WMDE-leszek recently altered the dump script to count the total number of labels; it would be trivial to split this per language for labels and to also include the other types of terms.
https://gerrit.wikimedia.org/r/#/c/analytics/wmde/toolkit-analyzer/+/440133/2/analyzer/src/main/java/org/wikidata/analyzer/Processor/MetricProcessor.java
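(For illustration only - the toolkit-analyzer itself is Java - the per-language counting over the JSON dump amounts to something like the following Python sketch; the dump path is a placeholder.)

# Sketch: stream the Wikidata JSON dump and count labels, descriptions and aliases
# per language. The dump is one large JSON array with one entity per line.
import bz2
import json
from collections import Counter

counts = {"labels": Counter(), "descriptions": Counter(), "aliases": Counter()}

with bz2.open("wikidata-all.json.bz2", "rt") as dump:  # placeholder path
    for line in dump:
        line = line.strip().rstrip(",")
        if not line or line in ("[", "]"):
            continue
        entity = json.loads(line)
        for lang in entity.get("labels", {}):
            counts["labels"][lang] += 1
        for lang in entity.get("descriptions", {}):
            counts["descriptions"][lang] += 1
        for lang, aliases in entity.get("aliases", {}).items():
            counts["aliases"][lang] += len(aliases)  # each alias counted individually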

Ok. Someone please ping me when the relevant metrics are instantiated in Graphite.

@Addshore Regarding the comments on T154601#4275121 (wb_terms -> Big Data): well, obviously because it makes more sense to work with a proper Big Data solution than to process dumps (similar to the efforts made to migrate from Wikistats to Wikistats 2.0). In the meantime I have learned from @WMDE-leszek and @Aleksey_WMDE that the choice of technology for this is already being debated.

Addshore changed the task status from Open to Stalled.Jun 25 2018, 9:46 AM

What's the status of this after wb_terms is removed?

What's the status of this after wb_terms is removed?

It still will not update, and the data collection will still need to be rewritten.

A tech task, but I think this mainly needs to be on the product prioritization list.

Lydia_Pintscher changed the task status from Stalled to Open.May 12 2020, 12:20 PM

No longer stalled.

Moving back to TODO and removing my face as I haven't started writing any code yet and won't be on camp from next week.

I won't be around, but I also want to mention that I'm not sure reportupdater would be a great idea here. We are not using that codebase, reportupdater is not under active development (44 commits according to GitHub), its YAML files are scattered everywhere, and we have our own system of refinery scripts that we can use.

My suggestion would be to just try making this work with the new term store; it should be faster, as the tables are way smaller and we don't need to scan large rows (the ones that hold the term text).
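(For illustration of what "use the new term store" means in practice - a sketch only, assuming the normalized wbt_* schema (wbt_item_terms -> wbt_term_in_lang -> wbt_text_in_lang) and a placeholder replica host; the actual patch below is PHP.)

# Sketch: count item labels per language from the new term store. For item terms this
# never touches wbt_text, which is the point about avoiding the large text rows.
import pymysql

QUERY = """
SELECT wbxl_language AS lang, COUNT(*) AS label_count
FROM wbt_item_terms
JOIN wbt_term_in_lang ON wbit_term_in_lang_id = wbtl_id
JOIN wbt_type ON wbtl_type_id = wby_id AND wby_name = 'label'
JOIN wbt_text_in_lang ON wbtl_text_in_lang_id = wbxl_id
GROUP BY wbxl_language
"""

conn = pymysql.connect(
    host="wikidatawiki.analytics.db.example",  # placeholder replica host
    database="wikidatawiki",
    read_default_file="~/.my.cnf",             # credentials from a local config file
)
with conn.cursor() as cur:
    cur.execute(QUERY)
    for lang, label_count in cur.fetchall():
        print(lang, label_count)
conn.close()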

Change 603557 had a related patch set uploaded (by Lucas Werkmeister (WMDE); owner: Lucas Werkmeister (WMDE)):
[analytics/wmde/scripts@master] Fix and reenable terms_by_language script

https://gerrit.wikimedia.org/r/603557

The above change implements Amir’s suggestion, I think – stick to the usual analytics-wmde-scripts methods, just with a different SQL query. The PHP code isn’t tested at all yet because I don’t know how to try it out.

That said, I tried to run the queries against stat1007. The properties query completes in 18 seconds, that’s fine. But the items query just kept running, so I killed it after one hour or so (I don’t know how long exactly). This might need to be optimized further.

That said, I tried to run the queries against stat1007. The properties query completes in 18 seconds, that’s fine. But the items query just kept running, so I killed it after one hour or so (I don’t know how long exactly). This might need to be optimized further.

Yup, with the old wb_terms storage this script was disabled due to the query never really finishing/causing too much load.
Hence the desire to switch to Hadoop rather than using SQL.

PHP Fatal error:  Uncaught exception 'RuntimeException' with message 'Something went wrong with the db query' in /a/analytics-wmde/src/scripts/src/wikidata/datamodel/terms_by_language.php:24
Stack trace:
#0 /a/analytics-wmde/src/scripts/src/wikidata/datamodel/terms_by_language.php(12): WikidataTermsByLanguage->execute()
#1 {main}
  thrown in /a/analytics-wmde/src/scripts/src/wikidata/datamodel/terms_by_language.php on line 24
+ date '+%F %T daily.03.sh Ended!'
2017-01-05 05:00:07 daily.03.sh Ended!

Probably due to the query taking too long now as reported in https://lists.wikimedia.org/pipermail/analytics/2016-December/005631.html

I'll disable the script to stop it running repeatedly while failing and work on a dump based replacement.

So we either need to use the dump, or use Hadoop.
AFAIK we can't read from Hadoop using the wmde analytics scripts setup.
If we were to do this with the dump, then we could do it as part of the wmde scripts,
although that would be massively inefficient compared with Hadoop.

That said, I tried to run the queries against stat1007. The properties query completes in 18 seconds, that’s fine. But the items query just kept running, so I killed it after one hour or so (I don’t know how long exactly). This might need to be optimized further.

Yup, with the old wb_terms storage this script was disabled due to the query never really finishing/causing too much load.
Hence the desire to switch to Hadoop rather than using SQL.

I might be missing something obvious, but that was the old horrible term store, and I assume that since the new tables are ten times smaller, this would be possible (especially since we don't need to go through wbt_text, for example). Can we try with the new term store SQL before trying Hadoop?

We need queries like this:

SELECT count(lang), lang
FROM wmf.wikidata_entity
LATERAL VIEW explode(labels) t AS lang, text
WHERE snapshot='2020-03-02'
GROUP BY lang
LIMIT 100

So, from briefly chatting with WMF Analytics:

7:33 PM <ottomata> addshore: just fyi, SRE is trying to deprecate graphite
7:33 PM <ottomata> so you might not want to add anything new there
9:37 PM <ottomata> addshore: the tricky thing is that grafana's datasources (that we currently use) can't accept historical data
9:37 PM <ottomata> graphite does
9:37 PM <ottomata> but prometheus does not
9:37 PM <ottomata> there is a unmaintained druid datasource for grafana that would work
9:38 PM <ottomata> but until someone fixes it up we can't use it
9:38 PM <ottomata> could you make a dashboard in superset instead?
9:39 PM <addshore> Now thats a valid question! I imagine the answer is yes? all of the data could just live in another hadoop table after all. the fact said dashboard wouldnt be public could be annoying
9:39 PM <addshore> we have a whole bunch of this sort of data that we currently put in graphite

@Lydia_Pintscher would it be okay for these dashboards to move to superset? (not public)

I don't like this solution, as the community also wants this data (T208425#5994513). I even think we should have a Hadoop installation in Labs.

@Addshore you can query wikidata_entity from the SQL Lab tab in Superset. Superset is not public, however, so to access it you need LDAP authentication: https://superset.wikimedia.org

@Lydia_Pintscher would it be okay for these dashboards to move to superset? (not public)

As @Bugreporter said, the editors should have access to this information as well, as it's pretty vital for better understanding our biases and gaps in language coverage and how they develop.

Then I suggest something like https://wmcs-edits.wmflabs.org/#wmcs-edits: this dashboard is public and powered by data extracted from Hive (using reportupdater; see the queries: https://github.com/wikimedia/analytics-reportupdater-queries/tree/master/wmcs).

@Lydia_Pintscher with all of the changes, this is starting to turn into something that it might make sense to have @GoranSMilovanovic tackle.

@Lydia_Pintscher

... the editors should have access to this information as well as it's pretty vital to better understand our biases and gaps in language coverage.

I would suggest expanding our existing Wikidata Languages Landscape system with

  • tabular data, since it currently relies solely on data visualizations,
  • a split by labels, descriptions, and aliases, as desired, and
  • historical data to match the representation previously used on the Grafana dashboard under discussion.

This development would also match my current workflow nicely, since I am focused on the optimization of the Wikidata Languages Landscape back-end (see: T253345) right now: it processes a huge amount of data and needs to be optimized to run faster and reduce the resource consumption on the stat100* servers.

We can discuss this option in our 1:1 today. Otherwise, we're talking about a new Wikidata dashboard with an Apache Spark back-end, which would update every week as the new snapshot of the dump becomes available.

@Lydia_Pintscher @GoranSMilovanovic

Following our 2020/07/27 1:1 meeting:

  • the T154601#6337065 approach to incorporate the data from the wikidata-datamodel-terms dashboard into the Wikidata Languages Landscape dashboard will be taken;
  • the dashboard's existing features will be clearly separated from the new data;
  • this ticket is related to the WMDE Wikidata (non-WDCM) Analytical Systems Optimization (T253345) as a parent task now, since the Wikidata Languages Landscape analytical and ETL back-end needs serious optimization before any changes are introduced to produce the new datasets for the terms datamodel.

@Addshore @Lydia_Pintscher

Since the wikidata-datamodel-terms Grafana dashboard does not show any data, I find it difficult to understand the y-axis (%) in the charts in the second row (Label Languages, Description Languages, and Alias Languages):

  • is that the % of Wikidata entities that currently have a label/description/alias in a given language, or
  • is it something else?

The ETL code for this is (almost) completed in PySpark.
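(Not the actual ETL code; a minimal PySpark sketch of the kind of per-language term counting against wmf.wikidata_entity described here. The snapshot and output path are placeholders, and labels/descriptions are assumed to be maps of language -> text, with aliases a map of language -> array of strings, consistent with the explode() query earlier in this task.)

# Sketch: per-language counts of labels, descriptions and aliases for one snapshot.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("terms_by_language").enableHiveSupport().getOrCreate()

snapshot = "2020-08-03"  # placeholder; in practice the latest available partition
entities = spark.table("wmf.wikidata_entity").where(F.col("snapshot") == snapshot)

labels = (
    entities.select(F.explode("labels").alias("lang", "text"))
    .groupBy("lang").count()
    .withColumn("term_type", F.lit("label"))
)
descriptions = (
    entities.select(F.explode("descriptions").alias("lang", "text"))
    .groupBy("lang").count()
    .withColumn("term_type", F.lit("description"))
)
aliases = (
    entities.select(F.explode("aliases").alias("lang", "alias_list"))
    .select("lang", F.explode("alias_list").alias("alias"))
    .groupBy("lang").count()
    .withColumn("term_type", F.lit("alias"))
)

result = (
    labels.unionByName(descriptions).unionByName(aliases)
    .withColumn("snapshot", F.lit(snapshot))
)
result.coalesce(1).write.mode("append").csv("/tmp/terms_by_language", header=True)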

@Lydia_Pintscher Hey, thank you, I see now.

The languages on the three above-mentioned charts seem to sum up to 100%, which probably means

  • that the charts represent the proportion of labels/aliases/descriptions in a given language
  • relative to the total number of labels/aliases/descriptions, respectively.

I was wondering how useful this information really is, and whether we should maybe consider presenting something different:

  • for a given language, for how many Wikidata items (in % of the total number of items) does it provide a label/description/alias?
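(A sketch of how that per-language coverage could be computed from wmf.wikidata_entity; the snapshot date is a placeholder, and the id/typ column names plus the map-typed labels field are assumptions about the table schema.)

# Sketch: for each language, the share of items (in %) that have a label in it.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("label_coverage").enableHiveSupport().getOrCreate()

items = (
    spark.table("wmf.wikidata_entity")
    .where((F.col("snapshot") == "2020-08-03") & (F.col("typ") == "item"))
)
total_items = items.count()

coverage = (
    items.select("id", F.explode("labels").alias("lang", "text"))
    .groupBy("lang")
    .agg(F.countDistinct("id").alias("items_with_label"))
    .withColumn("pct_of_items", F.col("items_with_label") / F.lit(total_items) * 100)
    .orderBy(F.desc("pct_of_items"))
)
coverage.show(20, truncate=False)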

@Lydia_Pintscher Please check out the dashboard: http://wmdeanalytics.wmflabs.org/WD_LanguagesLandscape/ - the Datamodel:Terms (landing) tab.
I am slightly unsure about the numbers reported on item descriptions. What do you think?

Note: if the global descriptions chart is broken (shows nothing), please use the test server: http://datakolektiv.org/WD_LanguagesLandscape/ - it is a consequence of a minor incompatibility that will be resolved once the analytics front-end is fully containerized (as it is on the test server).

GoranSMilovanovic lowered the priority of this task from High to Medium.Aug 12 2020, 7:24 PM

I took a look. Here's what I noticed:

  • for English, for example, the number of descriptions and the number of labels seem identical for some of the last values I checked, and then there is a very, very steep decline for the last value. That seems suspicious.
  • for Aliases: I assume it's counting two aliases on the same Item in the same language as two and not one? Just making sure.
  • We probably want to show graphs in the order "labels, descriptions, aliases" to follow what we have in the UI on Wikidata, as well as to have the more important things first.
  • I like that the language list seems to be ordered by amount of content.
  • For labels and descriptions it might be useful to see the global change in numbers relative to the growth of the number of Items as well. Basically to answer if we are getting better or worse when it comes to the average number of labels or descriptions per Item.

I like it :)

@Lydia_Pintscher

for English, for example, the number of descriptions and the number of labels seem identical for some of the last values I checked, and then there is a very, very steep decline for the last value. That seems suspicious.

That is true; I need to check this. What might be the case is that my dashboard update used the wmf.wikidata_entity Hive table - which is really an HDFS version of our JSON dumps - while the table update itself had not completed yet. But I need to check this empirically.

for Aliases: I assume it's counting two aliases on the same Item in the same language as two and not one? Just making sure.

Hm, hopefully yes. That depends upon the structure of the aliases field in the wmf.wikidata_entity table: I will check on a particular item that has two aliases in the same language.
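(A quick way to do that check in PySpark - Q42 below is only an example item, and the same map-typed aliases assumption applies.)

# Sketch: inspect the aliases field for one item and confirm that two aliases in the
# same language are counted as two (a second explode over the per-language list).
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("alias_check").enableHiveSupport().getOrCreate()

item = (
    spark.table("wmf.wikidata_entity")
    .where((F.col("snapshot") == "2020-08-03") & (F.col("id") == "Q42"))  # placeholders
)

# One row per language with its full list of aliases.
item.select(F.explode("aliases").alias("lang", "alias_list")).show(truncate=False)

# Counting each alias individually.
(
    item.select(F.explode("aliases").alias("lang", "alias_list"))
    .select("lang", F.explode("alias_list").alias("alias"))
    .groupBy("lang")
    .count()
    .show()
)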

We probably want to show graphs in the order "labels, descriptions, aliases" to follow what we have in the UI on Wikidata, as well as to have the more important things first.

Ok. That can be taken care of quickly - let me please just make sure the data-related problems are solved first.

For labels and descriptions it might be useful to see the global change in numbers relative to the growth of the number of Items as well. Basically to answer if we are getting better or worse when it comes to the average number of labels or descriptions per Item.

I would suggest calculating the following proportions and presenting them all on the dashboard:

  • number of labels in a particular language / number of items in Wikidata
  • number of aliases in a particular language / number of items in Wikidata
  • number of descriptions in a particular language / number of items in Wikidata

and then also include (a) the average number of labels per item, (b) the average number of aliases per item, and (c) the average number of descriptions per item - but I am not sure whether we also need (a), (b), and (c) calculated per language. Let me know what you think.

@Lydia_Pintscher

For labels and descriptions it might be useful to see the global change in numbers relative to the growth of the number of Items as well. Basically to answer if we are getting better or worse when it comes to the average number of labels or descriptions per Item.

I would suggest calculating the following proportions and presenting them all on the dashboard:

  • number of labels in a particular language / number of items in Wikidata
  • number of aliases in a particular language / number of items in Wikidata
  • number of descriptions in a particular language / number of items in Wikidata

and then also include (a) the average number of labels per item, (b) the average number of aliases per item, and (c) the average number of descriptions per item - but I am not sure whether we also need (a), (b), and (c) calculated per language. Let me know what you think.

I think these are the most important:

  • number of labels in all languages combined / number of items in Wikidata
  • number of descriptions in all languages combined / number of items in Wikidata
  • number of aliases in all languages combined / number of items in Wikidata
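(All three of these can come out of a single scan of one snapshot; a sketch with the same schema assumptions and placeholder snapshot as above.)

# Sketch: average number of labels, descriptions and aliases per item, all languages
# combined, for one snapshot of wmf.wikidata_entity.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("terms_per_item").enableHiveSupport().getOrCreate()

items = (
    spark.table("wmf.wikidata_entity")
    .where((F.col("snapshot") == "2020-08-03") & (F.col("typ") == "item"))
)

per_item = items.select(
    F.size("labels").alias("n_labels"),
    F.size("descriptions").alias("n_descriptions"),
    # aliases is assumed to be a map of language -> array, so sum the list sizes
    F.expr("aggregate(map_values(aliases), 0, (acc, x) -> acc + size(x))").alias("n_aliases"),
)

per_item.agg(
    F.avg("n_labels").alias("labels_per_item"),
    F.avg("n_descriptions").alias("descriptions_per_item"),
    F.avg("n_aliases").alias("aliases_per_item"),
).show()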

@Lydia_Pintscher

for Aliases: I assume it's counting two aliases on the same Item in the same language as two and not one? Just making sure.

No (good catch!): it is now fixed to do that.

for English, for example, the number of descriptions and the number of labels seem identical for some of the last values I checked.

I am currently inspecting this problem.

And then there is a very very steep decline for the last value.

I introduced a ten-hour pause between (a) the first detection that the wmf.wikidata_entity table has a new snapshot and (b) our update, to make sure that we work with a fully populated Hive table. Let's see what happens on the next update.
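(The guard itself is simple; purely as an illustration, with the "last processed" bookkeeping as a placeholder local file and run_terms_update standing in for the actual ETL entry point.)

# Sketch: detect a new wmf.wikidata_entity snapshot and wait ten hours before running
# the update, to let the new partition finish populating.
import time
from pathlib import Path
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("snapshot_guard").enableHiveSupport().getOrCreate()

def run_terms_update(snapshot):
    # placeholder for the actual dashboard update / ETL entry point
    print("updating terms data for snapshot", snapshot)

partitions = [row[0] for row in spark.sql("SHOW PARTITIONS wmf.wikidata_entity").collect()]
latest = max(p.split("=", 1)[1] for p in partitions)  # rows look like 'snapshot=2020-08-03'

state_file = Path("last_processed_snapshot.txt")
last_processed = state_file.read_text().strip() if state_file.exists() else ""

if latest > last_processed:
    time.sleep(10 * 60 * 60)   # the ten-hour pause
    run_terms_update(latest)
    state_file.write_text(latest)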

I think these are the most important:

number of labels in all languages combined / number of items in Wikidata
number of descriptions in all languages combined / number of items in Wikidata
number of aliases in all languages combined / number of items in Wikidata

Working on these aggregates too.

@Lydia_Pintscher

for English, for example, the number of descriptions and the number of labels seem identical for some of the last values I checked.

This should be fixed now. While it is impossible to reconstruct the exact context in which this error occurred, I think we can assume it was due merely to my erroneous handling of the initial data intake (historical data from wmf.wikidata_entity since the beginning of 2020). The correct data should be available on the dashboard soon (test server: http://datakolektiv.org/WD_LanguagesLandscape/).

  • To Do: aggregates
    • number of labels in all languages combined / number of items in Wikidata
    • number of descriptions in all languages combined / number of items in Wikidata
    • number of aliases in all languages combined / number of items in Wikidata; and
    • show graphs in the order "labels, descriptions, aliases" to follow what we have in the UI on Wikidata as well as have the more important things first.

That looks good! The huge difference between average labels/item and average descriptions/item is quite astonishing... But then again, a lot of the descriptions are probably mass-added translations of "scientific article" etc., so not unrealistic.

@Lydia_Pintscher

The huge difference between average labels/item and average descriptions/item is quite astonishing...

I was just about to ask: do we need to worry about this? Are the numbers plausible, whatever the reason for the huge difference might be?

Change 603557 abandoned by Lucas Werkmeister (WMDE):
[analytics/wmde/scripts@master] Fix and reenable terms_by_language script

Reason:
never went anywhere, task was resolved in the meantime

https://gerrit.wikimedia.org/r/603557