Investigate issues with mediawiki-fileimporter grafana dashboard panels
Closed, Invalid · Public

Description

There appear to be some issues with the dashboard, the easy ones to spot are the following:

  • When the failure type "All" is selected, the top right single stat panel does not work. This probably needs a sumSeries call to add the series together when multiple are selected.
  • Max import time reported lower than the median import time in single stat panels. Currently the max is reported as 6k but the median is 9.7k (see in F25749476)

We should probably have a look at the dashboard as a whole again and see what is happening.

Event Timeline

Restricted Application added a project: TCB-Team. · Sep 10 2018, 10:00 AM
  • These and other similar problems on this dashboard have been waiting for some time because of my focus on Wikidata/WDCM (primarily T203389, T187396);
  • however, I can take some time away from those tickets this week (a) if the cluster reboot is finished (T203165) and (b) if I manage to start the Hive jobs that need to run in order to fix the WDCM update.

I also think that we need to meet (Hangouts) and try to work on this dashboard together for a while.
Namely, the problems there are rather strange in the following respect:

  • the Graphite queries used are (for the most part, if not always) simply the product of a CTRL-C/CTRL-V procedure followed by a variable change from our other dashboards, where the queries deliver;
  • I have demonstrated some of the odd behavior to @Lea_WMDE before Wikimania 2018.

In my understanding,

(1) if the same queries are used to produce the same functionality,
(2) while the only change present is in the choice of variables,
(3) and the queries deliver non-desired results on one and desired results on another dashboard,
(4) then the source of the problem is a difference in the structure of the data used by the two dashboards.

So, we need to take a close look at this dashboard, think through what is happening, and see if we can standardize our approach to producing and maintaining Grafana dashboards once and for all.

Please suggest a convenient time later this week for a Hangouts session when we can inspect this dashboard closely.

I will not be able to take a look at this dashboard before Saturday (09/15) or Sunday (09/16).

GoranSMilovanovic added a comment. · Edited · Sep 16 2018, 2:38 PM

@Addshore @Lea_WMDE @Franziska_Heine

The following:

When the failure type "All" is selected, the top right single stat panel does not work. This probably needs a sumSeries call to add the series together when multiple are selected.

is now hopefully fixed by using

summarize(sumSeries(MediaWiki.FileImporter.specialPage.execute.fail.$FAILTYPE.sum), '1d', 'sum', false)

in place of

summarize(MediaWiki.FileImporter.specialPage.execute.fail.$FAILTYPE.sum, '1d', 'sum', false)

as suggested by @Addshore.

I have to admit that I am still experiencing confusion when I need to use sumSeries() across metrics whose names end in .sum.
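A minimal sketch of the distinction, assuming the behaviour described in the Graphite functions overview: sumSeries adds several series together point by point and returns one series, while summarize(..., 'sum') adds up the points of a single series within each time bucket. The .sum suffix is simply part of the metric name as far as these two functions are concerned. The failure-type names and numbers below are invented for illustration, and None stands for a missing datapoint.

```python
from typing import Optional

Series = list[Optional[float]]


def sum_series(series_list: list[Series]) -> Series:
    """Pointwise sum across series, in the spirit of sumSeries (None = no data)."""
    out: Series = []
    for points in zip(*series_list):
        present = [p for p in points if p is not None]
        out.append(sum(present) if present else None)
    return out


def summarize_sum(series: Series, bucket_size: int) -> Series:
    """Sum consecutive groups of bucket_size points, like summarize(..., 'sum')."""
    out: Series = []
    for start in range(0, len(series), bucket_size):
        present = [p for p in series[start:start + bucket_size] if p is not None]
        out.append(sum(present) if present else None)
    return out


# Hypothetical failure types matched when $FAILTYPE is "All" (names are made up).
fail_series = {
    "typeA": [1, None, 2, 0],
    "typeB": [0, 3, None, 1],
}

combined = sum_series(list(fail_series.values()))    # one series instead of two
per_bucket = summarize_sum(combined, bucket_size=2)  # one value per bucket
print(combined)    # [1, 3, 2, 1]
print(per_bucket)  # [4, 3]
```

Without the sum_series step, the wildcard would hand the panel two separate series, which a single stat panel cannot reduce to one number.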

GoranSMilovanovic added a comment. · Edited · Sep 16 2018, 3:31 PM

@Addshore @Lea_WMDE @Franziska_Heine

As for the following:

Max import time reported lower than the median import time in single stat panels. Currently the max is reported as 6k but the median is 9.7k (see in F25749476)

At the moment (16:41 CEST, 09/16/2018), the dashboard reports the same value for all four Import Time statistics (min, max, median, and mean).

Now, here are the queries that I've used in the respective singlestats; as you may observe, the only difference between the queries is in the usage of the respective aggregation (min, max, median, average).

MINIMUM IMPORT TIME

summarize(MediaWiki.FileImporter.import.timing.wholeImport.sum, '1d', 'min', false)

MAXIMUM IMPORT TIME

summarize(MediaWiki.FileImporter.import.timing.wholeImport.sum, '1d', 'max', false)

MEDIAN IMPORT TIME

summarize(MediaWiki.FileImporter.import.timing.wholeImport.sum, '1d', 'median', false)

MEAN IMPORT TIME

summarize(MediaWiki.FileImporter.import.timing.wholeImport.sum, '1d', 'average', false)

Now, in the Options (metrics tab), I've switched the Stat value back from Average to Current; I would say that should be the correct setting here. However, I am still getting the same value in all four panels. NOTE: Changing the Stat in Options (metrics tab) does yield different numbers that make sense (min, max, and average, for example); however, (a) it is unclear how this parameter interacts with queries that already summarize by min, max, mean, and median, and (b) this value cannot be set to median, and we need to show the median import time.
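One way to picture the interaction between the query and the Stat option is as two reductions applied one after the other: the query reduces the raw datapoints to one value per bucket, and the panel's Stat setting then reduces those bucket values to the single number shown. The sketch below is a simplified model with invented numbers, not Grafana's actual code; summarize_buckets and singlestat are hypothetical helper names, and only the Stat choices mentioned above are modelled.

```python
from statistics import mean, median

QUERY_AGGREGATIONS = {"min": min, "max": max, "median": median, "average": mean}
PANEL_STATS = {"min": min, "max": max, "average": mean}  # no median, as noted above


def summarize_buckets(points, bucket_size, func):
    """Stage 1 (query): reduce each bucket of raw points to one value."""
    agg = QUERY_AGGREGATIONS[func]
    return [agg(points[i:i + bucket_size]) for i in range(0, len(points), bucket_size)]


def singlestat(bucket_values, stat):
    """Stage 2 (panel): reduce the bucket values to the one displayed number."""
    if stat == "current":
        return bucket_values[-1]  # last value of the series
    return PANEL_STATS[stat](bucket_values)


# Invented import timings: three "days" with three datapoints each.
raw = [4, 9, 5, 2, 8, 3, 7, 7, 1]

daily_max = summarize_buckets(raw, 3, "max")        # [9, 8, 7]
daily_median = summarize_buckets(raw, 3, "median")  # [5, 3, 7]

print(singlestat(daily_max, "current"))     # 7
print(singlestat(daily_max, "average"))     # 8
print(singlestat(daily_median, "current"))  # 7
```

In this toy data the "current" value of the max panel and of the median panel happen to coincide, because both only look at the last bucket. Whether that is what happens on the real dashboard is not established here.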

I have then changed the value of the alignToFrom function parameter from false to true: no change at all. This change was made in accordance with the following observation from the Graphite functions overview (see under summarize):

By default, buckets are calculated by rounding to the nearest interval. This works well for intervals smaller than a day. For example, 22:32 will end up in the bucket 22:00-23:00 when the interval=1hour. Passing alignToFrom=true will instead create buckets starting at the from time. In this case, the bucket for 22:32 depends on the from time. If from=6:30 then the 1hour bucket for 22:32 is 22:30-23:30.
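The example in the quoted passage can be reproduced with a small sketch. It assumes that the default buckets are aligned to whole multiples of the interval (rounding down, which matches 22:32 landing in 22:00-23:00), and that alignToFrom=true starts counting buckets at the from time instead. bucket_start is a hypothetical helper, and the date is arbitrary.

```python
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)


def bucket_start(ts, interval, from_time=None):
    """Start of the bucket containing ts.

    Without from_time, buckets are aligned to multiples of the interval
    (the default); with from_time they start at that time (alignToFrom=true).
    """
    origin = from_time if from_time is not None else EPOCH
    whole_intervals = (ts - origin) // interval  # timedelta // timedelta -> int
    return origin + whole_intervals * interval


day = datetime(2018, 9, 16)  # arbitrary date chosen for the example
ts = day.replace(hour=22, minute=32)
hour = timedelta(hours=1)

default_start = bucket_start(ts, hour)
aligned_start = bucket_start(ts, hour, from_time=day.replace(hour=6, minute=30))

print(default_start.strftime("%H:%M"))  # 22:00
print(aligned_start.strftime("%H:%M"))  # 22:30
```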

Then I have experimented with the MAXIMUM IMPORT TIME singlestat and changed the complete query to:

consolidateBy(MediaWiki.FileImporter.import.timing.wholeImport.sum, 'max')

(inspired by @Addshore's recent blog post on Grafana),
getting exactly the same number as when

summarize(MediaWiki.FileImporter.import.timing.wholeImport.sum, '1d', 'max', false)

was used. I have left this singlestat with consolidateBy so that we can observe the difference in its behavior (if any) with respect to the remaining three singlestats that summarize Import Time.

In total despair I have also set Max.Data Points to 9999999 (as @Addshore has already observed that might be useful in some cases, although I don't see how it would help us here). No effect; same numbers.
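For context, as the Graphite documentation describes it, consolidateBy only selects the function used when a series has more datapoints than the requested maximum and has to be thinned out; it does not reduce a series on its own. The sketch below is a simplified model of that thinning step with invented numbers; consolidate is a hypothetical helper, not Graphite's implementation.

```python
import math


def consolidate(points, max_data_points, func=max):
    """Thin a series to at most max_data_points values.

    Simplified model: a series that already fits is returned unchanged;
    otherwise consecutive groups of points are reduced with func.
    """
    if len(points) <= max_data_points:
        return list(points)
    group = math.ceil(len(points) / max_data_points)
    return [func(points[i:i + group]) for i in range(0, len(points), group)]


raw = [4, 9, 5, 2, 8, 3, 7, 7, 1]  # invented datapoints

print(consolidate(raw, max_data_points=3, func=max))  # [9, 8, 7]
print(consolidate(raw, max_data_points=3, func=min))  # [4, 2, 1]
print(consolidate(raw, max_data_points=9999999, func=max) == raw)  # True
```

Under this model, a very large maximum means no thinning happens at all, so the choice of consolidation function makes no difference to the values the panel receives.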

In my opinion: something is rotten in the dashboard of Grafana; or, to cite @Addshore

"I have no doubt that Grafana will continue to trip me and others up with little quirks like this."

@Lea_WMDE @Addshore

I wonder what the status of this task is, following our joint inquiry into the inner depths and miracles of Grafana?

Lea_WMDE changed the task status from Open to Stalled. · Jan 25 2019, 1:03 PM

The general look through the dashboard still needs to happen, but until it does, there is nothing that can be done.

Addshore removed GoranSMilovanovic as the assignee of this task. · Jan 29 2019, 9:25 AM
Addshore moved this task from Incoming to Engineering Teams on the WMDE-Analytics-Engineering board.
awight closed this task as Invalid. · Sep 11 2019, 1:51 PM
awight added a subscriber: awight.

These bugs are no longer relevant to our dashboard.