Interlanguage links dashboard is broken since November 2018
Closed, Resolved, Public

Description

The interlanguage links dashboard at https://language-reportcard.wmflabs.org/interlanguage/#desktop has been broken since November 4 2018. It shows no data for particular languages, but if you select all the languages, it shows a chart for a "null" language. We haven't looked at this dashboard for some time, but we may make changes in the products around interlanguage links in the foreseeable future, so it must be fixed.

I thought that it happened because of some change that we didn't notice in the wmf.projectview_hourly table. However, I cannot find anything broken in how the query runs at the moment. I tried running it and it seems to give sensible data.

Event Timeline

Restricted Application added a subscriber: Aklapper.
Amire80 renamed this task from Fix the interlanguage links dashboard to Interlanguage links dashboard is broken since November 2018. May 27 2019, 7:41 AM
Amire80 updated the task description.

Tagging as Analytics-Dashiki, but please move it to the right product if needed.

Screen Shot 2019-05-27 at 9.45.07 AM.png (1×2 px, 331 KB)

Please see the screenshot: the dashboard seems to be rendering data. Can you be more specific about what does not work?

I'll clarify. It suddenly drops for all languages in November 2018. This isn't supposed to happen: all languages must be shown continuously, to the current week. However, if you select "null" in the list of projects in the sidebar, you'll see that it starts working in November 2018. So something happened then that made all languages collapse into "null". "null" is invalid, and it isn't supposed to be there at all. The list at the sidebar is supposed to include only Wikipedia, Wikisource, and Wikivoyage projects in all the valid languages. The Wikipedia projects are supposed to have non-zero data from mid-2016 until today (for Wikisource and Wikivoyage, the data collection began in December 2017).

If I run this query in hive on stat2007, I see sensible data:

 select '2018-05-20' AS date,
        weekly_navigation_count.project,
        navigation_count_total / view_count_total as percent_interlanguage_navigation

   from (select concat(current_project, '.', project_family) as project,
                sum(navigation_count) as navigation_count_total
           from wmf.interlanguage_navigation
          where date between '2018-05-20' and '2018-05-23'
            and project_family in ('wikipedia', 'wikivoyage', 'wikisource')
          group by project_family, current_project
        ) weekly_navigation_count

            inner join
        (select project,
                sum(view_count) as view_count_total

           from wmf.projectview_hourly
          where concat(year, '-', lpad(month, 2, '0'), '-', lpad(day, 2, '0')) between '2018-05-20' and '2018-05-23'
            and access_method = 'desktop'
            and agent_type = 'user'
            and (   project like '%wikipedia'
                or  project like '%wikivoyage'
                or  project like '%wikisource'
                )
          group by project
        ) weekly_projectview                    on weekly_navigation_count.project = weekly_projectview.project
;

So I'm really not sure where it is broken.

I took this query directly from the code that collects the data: https://phabricator.wikimedia.org/diffusion/ANLL/browse/master/interlanguage/percent_interlanguage_navigation_curr and added dates of my own instead of $1, $2.

fdans moved this task from Incoming to Ops Week on the Analytics board.

Change 514168 had a related patch set uploaded (by Mforns; owner: Mforns):
[analytics/limn-language-data@master] Make interlanguage queries column names explicit

https://gerrit.wikimedia.org/r/514168

Change 514168 merged by Mforns:
[analytics/limn-language-data@master] Make interlanguage queries column names explicit

https://gerrit.wikimedia.org/r/514168

Looking at the generated reports, the header was the following:

date, project, percent_interlanguage_navigation, weekly_navigation_count.project

But the data had only 3 columns:
From start of data until 2018-10 the second column was empty and the last column contained the project dimension,
and then starting from 2018-11 the last column was empty and the second column contained the project dimension.
Dashiki was configured to use weekly_navigation_count.project as the project dimension,
so from 2018-11 on there were no counts for projects, only null.
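To make the mismatch concrete, here is a minimal Python sketch. It assumes the report is a tab-separated file with the four-column header quoted above. The sample rows, their values, and the helper name `project_values` are made up for illustration and are not taken from the real report.

```python
import csv
import io

# Hypothetical excerpt of a report file; the values are invented.
# The header has four columns, but each row fills only one of the
# two project columns.
REPORT = (
    "date\tproject\tpercent_interlanguage_navigation\t"
    "weekly_navigation_count.project\n"
    "2018-10-21\t\t0.012\ten.wikipedia\n"   # before 2018-11: project is in the last column
    "2018-11-04\ten.wikipedia\t0.011\t\n"   # from 2018-11: project is in the second column
)


def project_values(report_text, column):
    """Return the value of `column` for every row, with empty cells as None."""
    reader = csv.DictReader(io.StringIO(report_text), delimiter="\t")
    return [row[column] or None for row in reader]


# Reading the dimension the dashboard was configured to use:
old_dimension = project_values(REPORT, "weekly_navigation_count.project")
# Reading the plain `project` column instead:
new_dimension = project_values(REPORT, "project")

print(old_dimension)  # ['en.wikipedia', None]
print(new_dimension)  # [None, 'en.wikipedia']
```

Neither column alone covers the whole time range, which is consistent with every project collapsing into "null" from 2018-11 on.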

Why did that happen?
Reportupdater allows adding new columns to an existing report; it appends them to the end of the header.
I believe a column was renamed in the queries at some point, so RU added a 4th column.
Later, it seems the queries changed the column name back to project, and the data was stored in the original field again.

What I did.
I reordered the report files so that they always have the columns (date, project, percent_interlanguage_navigation).
I modified the Dashiki config to use the project dimension instead of the weekly_navigation_count.project one.
I created and merged a limn-language-data patch to make the project column naming explicit.
The split works now in the Dashboard.
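The reordering step can be sketched roughly as follows. This is not the script that was actually run; it is a minimal Python illustration that assumes tab-separated report files, and the function names and sample values are hypothetical.

```python
import csv
import io

TARGET_HEADER = ["date", "project", "percent_interlanguage_navigation"]


def normalize_row(row):
    """Map a row keyed by the four-column header to (date, project, percent)."""
    # Take the project from whichever of the two columns is filled.
    project = row.get("project") or row.get("weekly_navigation_count.project") or ""
    return [row["date"], project, row["percent_interlanguage_navigation"]]


def normalize_report(report_text):
    """Rewrite a report so that every row has exactly the three target columns."""
    reader = csv.DictReader(io.StringIO(report_text), delimiter="\t")
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(TARGET_HEADER)
    for row in reader:
        writer.writerow(normalize_row(row))
    return out.getvalue()


# Hypothetical input with invented values, in the broken four-column layout.
SAMPLE = (
    "date\tproject\tpercent_interlanguage_navigation\t"
    "weekly_navigation_count.project\n"
    "2018-10-21\t\t0.012\ten.wikipedia\n"
    "2018-11-04\ten.wikipedia\t0.011\t\n"
)

normalized = normalize_report(SAMPLE)
print(normalized)
```

After this kind of rewrite, a single project column covers the whole time range, so the dashboard only needs to read that one dimension.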

However!
There's another problem, unrelated to this one.
The RU files still belong to the HDFS user, and RU has not been able to execute since 2019-05-20.
This happens with all RU jobs in Hive.
I will open another task for that. Once the permissions are fixed, RU will automatically backfill.
So, I will move this task to DONE.

mforns added a project: Analytics-Kanban.
mforns moved this task from Next Up to Done on the Analytics-Kanban board.

As far as I can see, the charts look sensible now, running continuously from 2016 until 2019-05-12. This is already very good. I was a bit concerned that some data after November 2018 was lost, but it appears to be intact. Thanks!

> The RU files still belong to the HDFS user, and RU has not been able to execute since 2019-05-20.

So I guess this is the reason why the chart ends on 2019-05-12 at the moment?

> So I guess this is the reason why the chart ends on 2019-05-12 at the moment?

Yes, you can follow progress to fix that in T224948.

@Amire80 the dashboard should now be showing up-to-date data; please check that everything looks good.
Cheers