Differentiate between main, talk and non-talk/main pages
Open, Normal, Public

Description

The current grafana board shows page views, resolved conflicts and the resulting percentage for both TwoColConflict and the current resolution page.

Add another row that splits all of these numbers for main, talk and non-talk pages or, even better, allow for the selection of a certain namespace that is then applied to all numbers of all graphs and numbers of the board.

Info
We can use the templating part for solving the ticket

NB: The namespaces are only the default namespaces, no wiki specific ones (except for the total)

Lea_WMDE created this task. Dec 4 2017, 5:17 PM
Lea_WMDE moved this task from Proposed to Todo on the WMDE-QWERTY-Team board.
Addshore claimed this task.
Restricted Application added a project: User-Addshore. Dec 6 2017, 12:03 PM

Change 395724 had a related patch set uploaded (by Addshore; owner: Addshore):
[mediawiki/extensions/TwoColConflict@master] Track edit conflicts by namespace

https://gerrit.wikimedia.org/r/395724

Change 396027 had a related patch set uploaded (by Addshore; owner: Addshore):
[mediawiki/core@master] Track which namespaces edit conflicts are resolved in

https://gerrit.wikimedia.org/r/396027

Addshore moved this task from Backlog to Needs Review on the User-Addshore board. Dec 11 2017, 4:06 PM

Change 395724 merged by jenkins-bot:
[mediawiki/extensions/TwoColConflict@master] Track edit conflicts & resolutions by namespace

https://gerrit.wikimedia.org/r/395724

Change 396027 merged by jenkins-bot:
[mediawiki/core@master] Track which namespaces edit conflicts are resolved in

https://gerrit.wikimedia.org/r/396027

Addshore removed Addshore as the assignee of this task. Jan 12 2018, 10:17 AM
Addshore added a subscriber: Addshore.

Data should now be available, so unassigning myself

New metrics in graphite will be:

  • TwoColConflict.conflict.byNamespaceId.$NSID
  • MediaWiki.edit.failures.conflict.resolved.byNamespaceId.$NSID
  • TwoColConflict.conflict.resolved.byNamespaceId.$NSID

Where $NSID is an integer representing the namespace id.

MediaWiki.edit.failures.conflict.byNamespaceId.$NSID should already exist and is not added by the code in the changes linked above.

NSID will only be one of the default mediawiki namespaces, so no custom namespaces will be tracked here.

@Addshore @Lea_WMDE @Tobi_WMDE_SW

While I currently see no problem in adding new singlestats to include the desired indicators, I need to remind you that some numbers on this dashboard are still incorrect.

In fact, none of the percentage indicators in the top three rows show correct numbers. It is obvious, just go and take a look at them. @Addshore I remember us having some discussion on the divideSeries() Graphite function in relation to this, but I can't remember whether we've ever found out about the origin of the problem. By looking at the Graphite documentation, I can't see that either @Addshore or @GoranSMilovanovic did anything wrong in the usage of the divideSeries() function. This is still a mystery to me.

Also, the Total Edit Resolution Page Views value is lower than the Total Conflicts Resolved value, which I don't think should be possible (the third singlestat in the first and in the second row, please check out).

A thorough analysis of the problem with divideSeries() in the calculation of percents, with several experiments provided in order to determine what factors influence the behavior of this function, is attached.

I have ditched the percentages as these just seem to be terrible. I'm sure that we can manage to calculate this ourselves (at least roughly) on the fly.
It looks like the latest version of grafana now has a "Total" option for single metrics which we now make use of and I have removed the use of summarize.

GoranSMilovanovic added a comment. Edited Jan 15 2018, 12:39 PM

@Addshore Please, let's not do this. I would say that removing a feature because we do not understand the way it works (or the way it errs) is not something that we should allow as a practice, if you agree. We've already talked about divideSeries() problems in Graphite/Grafana. I've read the full Graphite functions documentation, searched Stack Overflow extensively, and still wasn't able to understand what is wrong. However, that does not mean that we should simply give up. It will come back to us sooner or later just to cause additional frustration.

This is either a problem (a) in the implementation of the function itself, (b) in the way Grafana communicates with Graphite, (c) in the structure of data that we store in Graphite, or (d) in our understanding of the systems that we need to work with. Conservatively, when I face similar problems, I tend to assume (d) is correct.

Please suggest how you would like to handle this. Shall we have a Google Hangouts session dedicated to Graphite/Grafana only? Why not. I can prepare a document on Google Drive, list all the problems and sources (documentation, important Stack Overflow discussions, problems that we've encountered) for us to prepare.

@Lea_WMDE Lea, I can put the new numbers on the Grafana dashboard for this. However, I do not suggest taking additional steps in that respect before @Addshore and I figure out once and for all why the functions that we need on this dashboard do not behave as we expect them to. Your call: if you want the row that splits all of these numbers for talk and non-talk pages implemented ASAP, you can have it very soon.

@Addshore The reason I am taking this approach: the Graphite/Grafana stack seems to be the only thing that we're using and that still causes trouble since you and I met and I started working for WMDE. I mean: no way we cannot solve this. After all, once we know precisely how to deal with this, these tasks will become a routine. I think we need an additional push in that respect. Please correct me if you don't find this approach adequate.

I removed the row as the data is bad, there is no point in displaying bad
or confusing data while we try to solve this.
Removing it and adding it when it is correct is the best path.

I don't really have any time to sink into solving this problem, unless the
specific task / issue is picked up by QWERTY in a future sprint.

@Addshore @Lea_WMDE @Tobi_WMDE_SW Let's see:

However, I do not know how to split the number of conflicts resolved per namespace.

Lea_WMDE renamed this task from Differentiate between talk and non-talk pages to Differentiate between main, talk and non-talk/main pages. Feb 12 2018, 11:00 AM
Lea_WMDE updated the task description.

So, we could do this using templating in grafana using the 'byNamespaceId' metrics.
You can find 'Templating' in the cog / settings menu at the top of the dashboard.
The query that can be used to select the namespaces can be something like "MediaWiki.TwoColConflict.conflict.byNamespaceId.*"
The "Name" given to the templating can be something like "namespace" which will provide the "$namespace" variable for use in all of the graphite queries in the individual panels.
All of the queries in the individual panels will need to be updated from using the metrics that represent ALL namespaces to the metrics which are "byNamespaceId"
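To make the substitution concrete, here is a minimal Python sketch of what the "$namespace" variable does to a panel query. It only imitates Grafana's interpolation with `string.Template`; the helper name `render_query` is hypothetical, and the query text is the per-namespace page-view query quoted later in this thread.

```python
from string import Template

# Panel query containing a Grafana-style "$namespace" placeholder.
# The variable name follows the "Name" suggested above; the interpolation
# itself is only imitated here, it is not Grafana's own code.
PANEL_QUERY = Template(
    "summarize(MediaWiki.TwoColConflict.conflict.byNamespaceId.$namespace.sum, "
    "'1d', 'sum', false)"
)


def render_query(namespace_id: int) -> str:
    """Return the panel query with the placeholder replaced by a namespace id."""
    return PANEL_QUERY.substitute(namespace=str(namespace_id))


if __name__ == "__main__":
    # Namespace 0 is the main namespace.
    print(render_query(0))
```

With a single templated query per panel, switching the dashboard selector would change every panel at once instead of requiring one hand-written query per namespace.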

https://github.com/wikimedia/mediawiki-extensions-TwoColConflict/blob/master/docs/metrics.md

GoranSMilovanovic added a comment. Edited Feb 19 2018, 7:32 AM

@Addshore @Lea_WMDE

I will be referring to the following three singlestats (in the bottom three rows of the dashboard, in the leftmost column): (1) TwoColConflict page views NS:Main, (2) TwoColConflict resolved conflicts NS:Main, and (3) Percent resolved in NS:Main w. TwoColConflict.

The respective queries are:

(1) TwoColConflict page views NS:Main: summarize(MediaWiki.TwoColConflict.conflict.byNamespaceId.0.sum, '1d', 'sum', false)
(2) TwoColConflict resolved conflicts NS:Main: summarize(MediaWiki.TwoColConflict.conflict.resolved.byNamespaceId.0.sum, '1d', 'sum', false)
(3) Percent resolved in NS:Main w. TwoColConflict: divideSeries(#B, #A), where the #B is the query (2), and #A is the query (1).

The resulting percent is incorrect. All other singlestat parameters are copied exactly from the parameters of the working singlestats in the top rows of the dashboard.
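One thing that may be worth checking (not confirmed as the cause here) is that divideSeries() divides the two series point by point, so the singlestat then reduces a series of per-bucket ratios rather than a ratio of totals. The sketch below models that in plain Python; the bucket values are hypothetical and `divide_series` is a simplified stand-in, not Graphite's implementation.

```python
from typing import List, Optional


def divide_series(dividend: List[Optional[float]],
                  divisor: List[Optional[float]]) -> List[Optional[float]]:
    """Pointwise division; None where a value is missing or the divisor is zero."""
    return [
        (b / a) if (a not in (None, 0) and b is not None) else None
        for b, a in zip(dividend, divisor)
    ]


# Hypothetical daily buckets returned for a time range spanning two buckets.
views = [40, 10]      # query #A: page views per bucket
resolved = [12, 9]    # query #B: resolved conflicts per bucket

ratios = divide_series(resolved, views)          # per-bucket ratios
current = ratios[-1]                             # what a "Current" stat would show
average = sum(ratios) / len(ratios)              # what an "Average" stat would show
ratio_of_totals = sum(resolved) / sum(views)     # what one might expect to see

if __name__ == "__main__":
    print(ratios, current, average, ratio_of_totals)
```

In this made-up case the three candidate "percentages" all differ, which is why the number of buckets in the selected range and the singlestat's stat setting both matter when reading the result.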

@Addshore @Lea_WMDE Check this out please:

  • summarize(MediaWiki.TwoColConflict.conflict.byNamespaceId.0.sum, '1d', 'sum', false) is the query used in the TwoColConflict page views NS:Main singlestat;
  • summarize(MediaWiki.TwoColConflict.conflict.resolved.byNamespaceId.0.sum, '1d', 'sum', false) is the query used in the TwoColConflict resolved conflicts NS:Main singlestat;
  • Currently (09:21 CET), the first query (page views) reports 1, while the second (number of conflicts resolved) reports 18, which is clearly impossible (more conflict resolutions than page views in the same NS).

All other query parameters are absolutely identical.

This is representative of the problems that I am facing in managing Grafana dashboards. I am using the variable names exactly as specified by @Addshore on GitHub.

So when going to the dashboard I spotted that the date range that has been saved with the dashboard was once again very odd.

Once I reset the dashboard to 7 days the % once again makes sense:

We might want to look at the alignToFrom param for summarize as it might solve the issue with alternative and older date ranges.

https://graphite.readthedocs.io/en/latest/functions.html#graphite.render.functions.summarize
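As a rough model of what alignToFrom changes: by default summarize() places points into buckets aligned to multiples of the interval, while alignToFrom starts the buckets at the beginning of the requested range. The Python sketch below imitates that with hypothetical hourly data; `summarize_sum` is an illustrative helper, not Graphite code.

```python
DAY = 86400  # seconds in one '1d' bucket


def summarize_sum(points, interval, start, align_to_from=False):
    """Sum (timestamp, value) points into buckets and return the bucket sums in order.

    Default: buckets start at multiples of `interval` (e.g. midnight UTC for one day).
    align_to_from=True: buckets start at `start`, the beginning of the time range.
    """
    buckets = {}
    for ts, value in points:
        if align_to_from:
            bucket = start + ((ts - start) // interval) * interval
        else:
            bucket = ts - (ts % interval)
        buckets[bucket] = buckets.get(bucket, 0) + value
    return [buckets[key] for key in sorted(buckets)]


# Hypothetical data: one event per hour for 24 hours, range starting at 09:00 UTC.
start = 9 * 3600
points = [(start + hour * 3600, 1) for hour in range(24)]

default_buckets = summarize_sum(points, DAY, start)
aligned_buckets = summarize_sum(points, DAY, start, align_to_from=True)

if __name__ == "__main__":
    print(default_buckets)   # two partial buckets split at midnight
    print(aligned_buckets)   # one bucket covering the whole range
```

Under this model a 24-hour range that does not start at midnight yields two partial buckets by default, so a stat that reads only the last bucket would see just part of the day.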

  • We're clear on how to proceed here; technically it takes some copy and pasting from the existing (new) singlestats to develop the missing ones. This shouldn't take long.

@Lea_WMDE @Addshore Completed. Please review. Thanks.

@GoranSMilovanovic what you reported on Monday is not solved yet, is it? (Just making sure you are not waiting for me to approve the dashboard here, since it right now doesn't change anything when you enter something for the namespaces)

Addshore moved this task from Done ✔️ to Watching on the User-Addshore board.

@Lea_WMDE Yes, I am waiting for you to close this task. BTW I have no idea what needs to happen when you 'enter something for the namespaces'.

@Tobi_WMDE_SW @Addshore Please check this out and let me know if there's anything missing.

As in the task description:

The current grafana board shows page views, resolved conflicts and the resulting percentage for both TwoColConflict and the current resolution page.
Add another row that splits all of these numbers for main, talk and non-talk pages.

It turned out to be three new rows, actually, and they are found at the bottom of the dashboard.

@GoranSMilovanovic ah I didn't see that, because at the top of the page, there is a selector for namespaceIDs that doesn't do anything. If the selector did something, and was applied to ALL graphs and numbers we have on the board, that would be my no 1 choice (and I realize I did not update the ticket since we discussed this in the meeting with @Addshore, sorry about that).
If this is not possible, and we use the rows below, I would also want the percentages of the non-two-col conflict behavior for comparison (i.e. the current resolution page of the ticket description)

Lea_WMDE updated the task description. Feb 22 2018, 8:19 AM

@Lea_WMDE I think I understand what you mean, similar to the demo that @Addshore showed us once. Let me complete the Dashboard by adding the singlestats for the non-TwoColConflict percentages first, and then we will see what to do with the template switch.

GoranSMilovanovic added a comment. Edited Feb 26 2018, 12:39 AM

@Addshore @Lea_WMDE This dashboard is broken again (screenshot attached). 26 February, approx. 01:35:

  • TwoColConflict page views (first row, the leftmost singlestat) = 2
  • Conflicts Resolved w. TwoColConflict (second row, the leftmost singlestat) = 45
  • Percent Resolved w. TwoColConflict (third row, the leftmost singlestat) = 39.13%

Absolutely no changes (well, at least no changes that I've made) were introduced in the meantime.

Other singlestats are also affected, for example:

  • TwoColConflict page views Talk Pages
  • TwoColConflict conflict resolution: Talk Pages
  • Percent resolved in Talk Pages w. TwoColConflict

(the previous three are the rightmost singlestats of the first row below the User choose their version/User choose conflicting version row; these singlestats were showing perfectly sensible information until now).

I will pause the introduction of new singlestats (Default view by Namespaces) until this problem is resolved. This is really beginning to make no sense at all: we don't touch the Dashboard at all, and something that was previously showing one information begins to show some other information. Could it be that the problem is somewhere in Graphite?

@Addshore could you have a look at it?

@Addshore could you have a look at it?

The board (at least the top section) currently looks totally fine for me.

I imagine the bottom section is still a work in progress.

@Addshore @Lea_WMDE This dashboard is broken again (screenshot attached). 26 February, approx. 01:35:

  • TwoColConflict page views (first row, the leftmost singlestat) = 2
  • Conflicts Resolved w. TwoColConflict (second row, the leftmost singlestat) = 45
  • Percent Resolved w. TwoColConflict (third row, the leftmost singlestat) = 39.13%

    Absolutely no changes (well, at least no changes that I've made) were introduced in the meantime.

Well, not strictly true, as you have saved the dashboard 10 times on the 26th since my last saved version on the 21st!

In your screenshot note the loading symbol on the second single stat down in the first column, the data hasn't yet updated from whatever time range you had selected before / whatever data was being read before.

I noticed that again the date range for the dashboard has been switched to 24 hours; this probably wants to remain at 7 days to give the graphs some meaningful default length.

Other singlestats are also affected, for example:

  • TwoColConflict page views Talk Pages
  • TwoColConflict conflict resolution: Talk Pages
  • Percent resolved in Talk Pages w. TwoColConflict (the previous three are the rightmost singlestats of the first row below the User choose their version/User choose conflicting version row; these singlestats were showing perfectly sensible information until now).

These also look fine for me right now

I will pause the introduction of new singlestats (Default view by Namespaces) until this problem is resolved. This is really beginning to make no sense at all: we don't touch the Dashboard at all, and something that was previously showing one information begins to show some other information. Could it be that the problem is somewhere in Graphite?

You just need to always make sure you have a valid date range, just leaving the dashboard on the last 7 days will be fine.
Also make sure all panels on the dashboard have updated when comparing data / doing maths, as if they haven't nothing will make sense.

@Addshore

Well, not strictly true, as you have saved the dashboard 10 times on the 26th since my last saved version on the 21st!

Well, Ok, I did some experimenting, but I didn't touch any of the previously settled features.

I noticed that again the date range for the dashboard has been switched to 24 hours; this probably wants to remain at 7 days to give the graphs some meaningful default length.

But in Berlin we've said last 24h would be the default, correct me if I am wrong? I have intentionally switched the time range from 7 days to last 24h on this occasion.

You just need to always make sure you have a valid date range, just leaving the dashboard on the last 7 days will be fine.
Also make sure all panels on the dashboard have updated when comparing data / doing maths, as if they haven't nothing will make sense.

Will do, thanks for feedback. I'm back to the drawing board to introduce several new singlestats and then see about global templating for the rest (if that doesn't interact with what I've already placed there).

Long live the RStudio Shiny framework, by the way :)

I noticed that again the date range for the dashboard has been switched to 24 hours; this probably wants to remain at 7 days to give the graphs some meaningful default length.

But in Berlin we've said last 24h would be the default, correct me if I am wrong? I have intentionally switched the time range from 7 days to last 24h on this occasion.

The single stats will always only show data for the last 1 day, as they have a per panel setting.
The last 7 days / dashboard default is used for the other charts.
As long as this whole dashboard range includes the data for the last 1 day / 24 hours then all of the single stats should function correctly.

Ok, got it: 7 days default for the dashboard globally, everything else gets an appropriate override setting.

GoranSMilovanovic added a comment. Edited Feb 26 2018, 11:34 AM

@Addshore Now,

  • all the settings are double-checked and certainly fall in line with all the previous suggestions made; comparisons were made with the singlestats that already compute a correct percent by divideSeries() to ensure that the settings are correct;
  • a new singlestat is introduced (the bottom row, Percent resolved in NS:Main w. Default, which is produced by divideSeries() from the two singlestats immediately above it; all of the relevant Graphite queries were simply copy-pasted and the pattern of differences and divisions re-arranged accordingly; double-checked);
  • the obtained percent is incorrect (see attached; it should be about 42%, while Grafana reports 34.04%).

Plus with any settings unchanged, all the singlestats in the BF Disables/Enables/Users column in the top row show NAs.

So, when looking at the 3 pink single stat panels, the first and the last both have "Current" set as their stat value, which will correctly give you the current stat value.
You can find this value in the panel settings under "Options" > "Value:" > "Stat"

The middle pink panel has the "Average" stat selected. This will provide the value of the average of all points selected.

The points returned by graphite are 163 and 146 (146 being correct for what I am currently looking at)

The average of these is 154.5 and the current displayed value is 155

So the bottom value is correct, but the middle panel "Default resolved conflicts NS:Main" is wrong.
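The difference between the stat settings can be reproduced with the two points quoted above. This is a simplified Python model of the "Current", "Average" and "Total" options (the function name `stat_value` is hypothetical, and Grafana's real handling of nulls and rounding may differ).

```python
def stat_value(points, stat):
    """Reduce a list of datapoints to one number, roughly like a singlestat panel."""
    values = [p for p in points if p is not None]
    if stat == "current":
        return values[-1]                 # last non-null point
    if stat == "average":
        return sum(values) / len(values)  # mean of all points in the range
    if stat == "total":
        return sum(values)                # sum of all points in the range
    raise ValueError(f"unknown stat: {stat}")


# The two points returned by graphite in the comment above.
points = [163, 146]

if __name__ == "__main__":
    print(stat_value(points, "current"))   # 146
    print(stat_value(points, "average"))   # 154.5
```

A panel set to "Average" therefore lands between the two buckets, while "Current" reports only the latest one, so panels that feed the same percentage need the same stat setting.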

@Addshore My bad. Grafana doesn't like me. Thank you for your time and the help you have provided today.

@Lea_WMDE The bottom six rows of the Dashboard should now present all information on NS:Main, Talk and Non-Talk pages. Please review and provide feedback. Thanks.

@GoranSMilovanovic If you add up the numbers of the split up rows, you don't get to the total numbers further up, e.g. for TwoColConflict the total field says that yesterday there were 19 conflicts, but main space + non-talk + talk = 11 + 17 + 5
The same goes for the currently deployed solution.
Or am I understanding something wrong?

TwoColConflict the total field says that yesterday there were 19 conflicts, but main space + non-talk + talk = 11 + 17 + 5

Main namespace is included in non-talk.

So the comparison would actually be 17+5 = 22 I guess, which still is not equal to 19.

Currently I see 30 non talk, 10 talk (so 40) and 38 reported in the total two col conflict page views. So the numbers still seem slightly off. I'm not sure where this small error might be occurring, it could be that some data is getting dropped from the system before it reaches graphite, as the data is recorded as best effort, so sometimes things can be missed.
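The two comparisons above can be written down as a small check. This Python sketch uses only the numbers quoted in the thread and assumes, as stated above, that the main namespace is already counted in non-talk; the helper name `discrepancy` is hypothetical.

```python
def discrepancy(total: int, non_talk: int, talk: int) -> int:
    """Return how far non-talk + talk is from the reported total.

    Main is not added separately because it is already part of non-talk.
    """
    return (non_talk + talk) - total


if __name__ == "__main__":
    # First observation: total 19, non-talk 17, talk 5.
    print(discrepancy(19, 17, 5))    # 3
    # Second observation: total 38, non-talk 30, talk 10.
    print(discrepancy(38, 30, 10))   # 2
```

In both observations the split rows add up to slightly more than the total, which is the small remaining gap discussed here.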

@GoranSMilovanovic Rather than writing each query for those single stats separately you can combine them with brackets to:

summarize(sumSeries(MediaWiki.TwoColConflict.conflict.byNamespaceId.{0,2,4,6,8,10,12,14}.sum), '1d', 'sum', false)
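The brace list in that query can be generated rather than typed. The Python sketch below assumes the default MediaWiki namespace ids 0 to 15, where even ids are subject (non-talk) namespaces and odd ids are the matching talk namespaces; the helper name `namespace_query` is hypothetical.

```python
# Assumption: only the default MediaWiki namespaces (ids 0..15) are tracked.
DEFAULT_NAMESPACE_IDS = range(0, 16)


def namespace_query(metric_prefix: str, talk: bool) -> str:
    """Build one summarize(sumSeries(...)) query for all talk or all non-talk namespaces."""
    ids = [n for n in DEFAULT_NAMESPACE_IDS if (n % 2 == 1) == talk]
    group = ",".join(str(n) for n in ids)
    # Graphite expands {a,b,c} into one path per listed value.
    return f"summarize(sumSeries({metric_prefix}.{{{group}}}.sum), '1d', 'sum', false)"


if __name__ == "__main__":
    prefix = "MediaWiki.TwoColConflict.conflict.byNamespaceId"
    print(namespace_query(prefix, talk=False))
    print(namespace_query(prefix, talk=True))
```

Generating both variants from one helper keeps the talk and non-talk panels structurally identical, which makes copy-and-paste differences between them easier to rule out.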

@Addshore thanks for a useful hint.

@Lea_WMDE To my best knowledge, the queries are correct; the differences that you are referring to seem odd from another perspective, and that is the fact that the queries for the split rows are simple derivatives (e.g. copy + paste) of the respective queries for the non-split rows with just a namespace parameter added.

Maybe we should consider studying the Graphite back-end in more depth.

@GoranSMilovanovic so it looks like there is still something wrong with the numbers in the new six rows. Since being able to just apply a restriction to any namespace (as it would be possible with the selection box on top of the page) is the preferred solution anyways, how about we stall work on the new rows, and instead try to get the selection box on top working?

@Lea_WMDE That will most probably not change anything, but we can give it a try. The problem is related to us not understanding what is happening between the Grafana and the Graphite feedback, and similar problems will persist in the future if we don't change that. That is the essential problem.

We are doing everything the way it's supposed to be done, by the book, and the dashboard still looks broken. And we do not understand why. I do not advise proceeding with any attempts at new solutions before making sure we understand why the current solution does not work. Because if we don't understand why it does not work, and still change our approach, then we are simply guessing.

Currently my Wikidata tasks are prioritized. As soon as I can prioritize the Technical Wishlist tasks again, I will introduce the change on this dashboard. But again, that is going to be more of an exercise in Grafana usage for me than a real solution to the problems that we are facing here.

Are we actually sure that the numbers for each of the rows are generated for exactly the same timeframe? Or could it be that the timeframes differ by a few minutes which would explain the slightly off numbers if you sum up the numbers of individual rows and compare this to the overall number?

@Tobi_WMDE_SW A similar concern is expressed by @Addshore

...it could be that some data is getting dropped from the system before it reaches graphite, as the data is recorded as best effort, so sometimes things can be missed.

If something like this is happening, it wouldn't be the first NoSQL system where similar things can happen; I've heard people explaining before how data can be dropped here and there from Elasticsearch for similar reasons, for example.
However, I am advocating the option to learn and make sure that we understand why this is happening, if it is happening, of course.

As for your question, all the singlestats on this dashboard (the ones in the rows under discussion, for example) have exactly the same time frame: they override the default last 7 days time frame by 1d. The parameter is set in each singlestat manually, double-checked, and should be visible in the upper left corner of each singlestat on the dashboard.

It's quite hard to spot possible errors in these new single stats due to their verbosity.

Using the query that I mentioned above:

summarize(sumSeries(MediaWiki.TwoColConflict.conflict.byNamespaceId.{0,2,4,6,8,10,12,14}.sum), '1d', 'sum', false)

all of the queries in the single stats could be changed to 3 easy-to-read queries (similar to the single stats at the top of the dashboard) that might eliminate any errors.

@Addshore I agree, but have you actually found any errors? Because when I wrote out those queries, I did it so carefully - given the previous history of failures on this dashboard I wanted to be extra-extra-careful - and there were certainly no errors there.

BTW, the singlestats in the upper right corner of the dashboard - BF Disables and BF Enables - both show N/A values for days already. They haven't been touched by me while I was introducing the new singlestats in the bottom rows, and I also don't think @Addshore would have a reason to introduce any change there simply because I remember them reporting sensible numbers before.

BTW, the singlestats in the upper right corner of the dashboard - BF Disables and BF Enables - both show N/A values for days already. They haven't been touched by me while I was introducing the new singlestats in the bottom rows, and I also don't think @Addshore would have a reason to introduce any change there simply because I remember them reporting sensible numbers before.

That issue is tracked by T175790

thiemowmde moved this task from Edit Conflict Handling to Analytics Tasks on the TCB-Team board.