Add additional dimensions to edits_hourly in Turnilo and Superset
Closed, Resolved | Public | 5 Estimated Story Points

Description

In T211173, we created an edits_hourly table, a simplified version of mediawiki_history loaded into Druid, so that it is available in Turnilo and Superset.

We'd like to add the following dimensions in the next iteration:

  • platform (mobile web, mobile app, desktop) and editing_interface (visual edit, wikitext, other). These are pending addition of change tags to the mediawiki_history table (T161149)
  • user_tenure_bucket.

edits_hourly schema (updated as of May 13th to include additional dimensions)

Event Timeline

MNeisler renamed this task from Add additional dimensions to edits table in Turnilo and Superset pending changes to mediawiki_history table to Add additional dimensions to edits_hourly in Turnilo and Superset. Apr 17 2019, 10:22 PM
MNeisler updated the task description.
MNeisler updated the task description.
fdans moved this task from Incoming to Smart Tools for Better Data on the Analytics board.

The new datasource is available in Turnilo!
Please, have a look :]

Awesome, thanks @mforns! Looks good. Just a few questions/comments:

  • You mentioned that the Count metric in Turnilo is added by default. Will that still be removed?
  • I noticed some display issues in Turnilo when you try splitting the data by either the User Groups or Revision Tags dimensions. I’m assuming this might be due to the high cardinality of those dimensions. For example, when I tried to create a line chart with a split on revision-tags showing only 1 value, no lines were displayed, and when I selected two values, only one line was displayed. See screenshots below:

[Screenshot: user_groups_split_1.png]

[Screenshot: user_groups_split_2.png]

  • Also, I reviewed the data with @kzimmerman and we found that the current “Other” value for the new interface and platform dimensions was slightly confusing and had limited value because it encompasses many different revision tag types resulting in it being much larger proportionally to the other values. I'm not sure the best way to fix this right now but just making a note here. Future improvements to our tagging infrastructure will likely help so we can add additional values to those dimensions such as desktop edits. @Neil_P._Quinn_WMF - Any thoughts or suggestions?
Nothing particular. I definitely agree "other" has limited value, so we could, say, make it null instead if that would lead to more sensible behavior in Turnilo. But other than that, yeah, there's not much we can do besides the longer-term work of improving our tagging.

Hi @MNeisler and @Neil_P._Quinn_WMF, thank you for all the feedback.

  • You mentioned that the Count metric in Turnilo is added by default. Will that still be removed?

Yes, totally, haven't done this yet, but will do before we close this task!

  • I noticed some display issues in Turnilo when you try splitting the data by either the User Groups or Revision Tags dimensions.

Oh, yes that's weird.
I could not reproduce the errors that you attached, or any other errors. But filtering by those fields is a bit slow...
Maybe the on-and-off errors and the slowness are a sign we should move to daily granularity?

In T219323#5201926, @MNeisler wrote:

  • Also, I reviewed the data with @kzimmerman and we found that the current “Other” value for the new interface and platform dimensions was slightly confusing and had limited value because it encompasses many different revision tag types resulting in it being much larger proportionally to the other values. I'm not sure the best way to fix this right now but just making a note here. Future improvements to our tagging infrastructure will likely help so we can add additional values to those dimensions such as desktop edits. @Neil_P._Quinn_WMF - Any thoughts or suggestions?

Nothing particular. I definitely agree "other" has limited value, so we could, say, make it null instead if that would lead to more sensible behavior in Turnilo. But other than that, yeah, there's not much we can do besides the longer-term work of improving our tagging.

Yes, I agree with the low "actionability" of the Other bucket in the fields derived from revision_tags, and I also think that we should fix this problem at the tagging level, e.g. by adding a desktop edit tag. Regarding changing the Other to NULL, I would prefer not to. For Turnilo dimensions NULL behaves just like any other value, but we still might want to use NULL in the future to indicate, e.g., that there is no value. Would that be OK?

Regarding changing the Other to NULL, I would prefer not to. For Turnilo dimensions NULL behaves just like any other value, but we still might want to use NULL in the future to indicate, e.g., that there is no value. Would that be OK?

Well, in this case, there really is no value: we have no tag or other information that indicates what platform was used. It could be the 2010 wikitext editor, it could be Huggle, it could be some ad-hoc use of the API, and so forth. But I think "other" is fine; I was just thinking that maybe Turnilo might automatically omit nulls from the graph, which would probably be a more useful behavior.

I understand. Yeah, Turnilo treats NULLs like any other value when it comes to regular dimensions; it does not hide them. For the time dimension or metrics, NULLs are indeed treated differently from other values, but in this case they are the same.
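
To make the "Other" discussion above concrete, here is a minimal Python sketch of how a platform dimension could be derived from revision tags, with every edit that carries no recognized tag falling into Other. The tag strings, the mapping, and the function name are assumptions for illustration only, not the actual edit_hourly logic.

```python
from typing import Dict, Iterable

# Hypothetical tag-to-platform mapping; the tags actually used may differ.
PLATFORM_TAGS: Dict[str, str] = {
    "mobile web edit": "Mobile web",
    "mobile app edit": "Mobile app",
}


def platform_from_tags(revision_tags: Iterable[str]) -> str:
    """Return the platform for the first recognized tag, else "Other"."""
    for tag in revision_tags:
        if tag in PLATFORM_TAGS:
            return PLATFORM_TAGS[tag]
    # No platform tag at all: desktop edits, tools, ad-hoc API use, and so on
    # all end up here, which is why this bucket is so large.
    return "Other"
```

Under these assumptions, an edit with no tags and an edit tagged only by some tool both land in Other, which matches the concern raised in the thread.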

@MNeisler @Neil_P._Quinn_WMF

Quick questions regarding the user tenure bucket.

  • Is "undefined" anons?
  • Similarly, do IPs accumulate edits when it comes to edit_count bucket?

@JKatzWMF I think that is correct: all anonymous users have undefined tenure, as you can see here: https://bit.ly/2HI2JeL (the opposite might not be true; undefined tenure does not imply anon).

I think there might be a bug when it comes to edit_count and anons, as they all show up with +10000 as the default value: https://bit.ly/2wn7xQ4

cc @mforns https://gerrit.wikimedia.org/r/#/c/analytics/refinery/+/510188/2..2/oozie/edit/hourly/edit_hourly.hql

@Nuria Thanks! Please let me know if/when the +10000 edits bug gets scheduled or fixed.

The code changes to fix the 10,000 bug are in. I think we need to reload the whole dataset with the new code to correct the data. cc @mforns https://bit.ly/2MDDM8N

I launched the Oozie coordinator to precompute the edit_hourly table in Hive last Thursday.
But I forgot to launch the other Oozie coordinator, the one that loads it to Druid.
It's running now; if there are no issues, the data should be live within the next hour.

Data looks good now. Note that anonymous editors do not have an edit count, because at this time we do not have it in the data lake. For anonymous editors, edit_bucket shows as undefined.
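
As an illustration of the behavior described above, here is a minimal Python sketch of an edit-count bucketing function that handles a missing count explicitly. The bucket boundaries, labels, and function name are assumptions for illustration; the real logic lives in the edit_hourly.hql change linked earlier.

```python
from typing import Optional


def edit_count_bucket(edit_count: Optional[int]) -> str:
    """Map a cumulative edit count to a bucket label (hypothetical buckets)."""
    # Anonymous editors have no edit count in the data lake, so the
    # missing case is handled first instead of reaching a numeric bucket.
    if edit_count is None:
        return "Undefined"
    if edit_count < 5:
        return "0-4"
    if edit_count < 100:
        return "5-99"
    if edit_count < 1000:
        return "100-999"
    if edit_count < 10000:
        return "1000-9999"
    return "10000+"
```

With this sketch, `edit_count_bucket(None)` yields the Undefined bucket rather than any numeric one.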

Nuria set the point value for this task to 5.