
Ingest data from PageIssues EventLogging schema into Druid
Closed, Resolved · Public · 3 Story Points

Description

Once the PageIssues schema is live, we would like to ingest some of its data into Druid, so that it can be viewed in Superset.

Per a review of the draft schema guidelines and subsequent discussion with the AE team on IRC, this should be possible, with the following fields as dimensions:

  • isAnon
  • action
  • issuesVersion
  • issuesSeverity: this field is an array; as discussed on IRC, it should be flattened into a string (see T201873)
  • sectionNumbers: this field is also an array and should be treated in the same way
  • editCountBucket
  • namespaceId

The sole measure would be the number of actions (events), aggregated by (say) hour.

The following fields should be left out:

  • pageTitle
  • pageToken
  • sessionToken
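As a rough sketch of the intended field handling (dimension names taken from the lists above; the flattening approach follows the T201873 discussion, and this is not the actual refinery/ingestion code):

```python
# Hypothetical sketch, not the actual ingestion job: keep the dimension
# fields listed above, flatten array values into strings, and drop the
# high-cardinality fields from ingestion.
DIMENSIONS = ["isAnon", "action", "issuesVersion", "issuesSeverity",
              "sectionNumbers", "editCountBucket", "namespaceId"]
EXCLUDED = ["pageTitle", "pageToken", "sessionToken"]  # high cardinality

def flatten_event(event):
    out = {}
    for field in DIMENSIONS:
        value = event.get(field)
        if isinstance(value, list):
            # e.g. [0, 2] -> "[0,2]", per the "flatten into a string" idea
            value = "[" + ",".join(str(v) for v in value) + "]"
        out[field] = value
    return out

event = {"isAnon": True, "action": "pageLoaded", "issuesVersion": "new2018",
         "issuesSeverity": ["low", "medium"], "sectionNumbers": [0, 2],
         "editCountBucket": "5-99 edits", "namespaceId": 0,
         "pageToken": "abc123"}  # excluded field, silently dropped
print(flatten_event(event))
```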

Details

Related Gerrit Patches:
operations/puppet@production: Add druid_load.pp to refinery jobs

Related Objects

Event Timeline

Restricted Application added a subscriber: Aklapper. · Aug 24 2018, 3:52 PM
fdans triaged this task as Medium priority. · Aug 27 2018, 3:46 PM
fdans moved this task from Incoming to Smart Tools for Better Data on the Analytics board.
Restricted Application added a project: Product-Analytics. · Aug 27 2018, 6:12 PM

@mforns Just a last ping - I know we said earlier that no field renames or other schema changes would be needed, but if you could double-check once more as this schema is about to go live, that would be great.

@Tbayer
Yes, no renames will be needed! We'll find a solution to the array field and implement it soon.
The only caveat is that fields (dimensions) with high cardinality, like pageToken, sessionToken, pageTitle and pageIdSource, perform very badly in Druid, so I would blacklist them from Druid ingestion if possible.

@Tbayer
Yes, no renames will be needed! We'll find a solution to the array field and implement it soon.

Thanks for confirming!

The only caveat is that fields (dimensions) with high cardinality, like pageToken, sessionToken, pageTitle and pageIdSource, perform very badly in Druid, so I would blacklist them from Druid ingestion if possible.

Yes, that's been the idea (see task description).

For the record: @Nuria and I discussed this task earlier this week, and I understand that the AE team feels that due to the workload from other projects it might not be possible to implement this (specifically, the subtask T201873) before the end of Q2. The team has suggested instead tackling T205562: Ingest aggregate ReadingDepth data into Druid first, as it might be an easier case.

mforns claimed this task. · Oct 2 2018, 1:34 PM
mforns added a project: Analytics-Kanban.
mforns moved this task from Next Up to In Progress on the Analytics-Kanban board. · Oct 2 2018, 3:01 PM

Change 464833 had a related patch set uploaded (by Mforns; owner: Mforns):
[operations/puppet@production] Add druid_load.pp to refinery jobs

https://gerrit.wikimedia.org/r/464833

mforns set the point value for this task to 5. · Oct 8 2018, 4:08 PM
mforns changed the point value for this task from 5 to 3.
Tbayer added a comment (edited). · Oct 9 2018, 3:27 AM

@mforns Great to hear that Druid already allows ingestion of array types! But just to clarify, it seems that this involves information reduction of some kind? At least I'm only seeing scalar values in the selection dropdown in Turnilo (below).
If that's the case, could we document how that works - does it always pick the first element of the array? (i.e. [0,2,5] --> 0, etc.)

[edited to add Turnilo link]

Another question: It seems that the dimensions lack e.g. Ua Browser Major and other user agent derived fields (that we have and use in e.g. https://turnilo.wikimedia.org/#pageviews_daily/ ). In the web team we often need these when evaluating EL data, see e.g. this example from earlier today: T204143#4650771 . Could they be added, analogously to the pageviews data?

BTW, I understand we are focusing on use in Turnilo for now, but out of curiosity (and considering the task description) I checked Superset too and didn't see this data there yet. I clicked "scan new datasources", which appears to have imported it, alongside data from some other schemas:

Refreshed metadata from cluster [public-eqiad]
Adding new datasource [event_NavigationTiming]
Adding new datasource [event_ReadingDepth]
Adding new datasource [event_PageIssues]
Refreshed metadata from cluster [analytics-eqiad]

Sorry if that was not intended; feel free to remove it again temporarily if the import was meant to happen later.

Back to the view in Turnilo: This looks very exciting indeed!

I have to mention that @ovasileva and I spent quite a bit of time earlier today investigating what looked like inconsistent or wrong data in Turnilo. E.g. in this view, the number of events should be roughly the same for the old and new design, as the test and control group have the same sample size in the A/B test. But they are not. (E.g. enwiki has 691.5 k pageLoaded events in test vs. only 251.8 k in control.)

It occurred to me afterwards though that this might be because we were looking at the Count measure instead of the Event Count measure. What is the meaning of the former? Is this documented somewhere? Is it necessary to include in the Turnilo options? This will likely not be the last time that it causes that kind of confusion.

When switching to Event Count, the numbers look plausible so far. I ran a Hive query to double-check the numbers from this view, and they check out (including showing roughly the same pageLoaded numbers for test and control):

wiki    version  pageloaded_events
enwiki  new2018  1305905
enwiki  old      1327275
fawiki  new2018  33513
fawiki  old      33509
jawiki  new2018  482114
jawiki  old      488352
lvwiki  new2018  3452
lvwiki  old      3424
ruwiki  new2018  90951
ruwiki  old      92328
SELECT wiki, event.issuesVersion AS version,
  SUM(IF(event.action = 'pageLoaded', 1, 0)) AS pageloaded_events
FROM event.pageissues
WHERE year = 2018 AND month = 10 AND day = 3
  AND event.sectionnumbers[0] = 0
GROUP BY wiki, event.issuesVersion
ORDER BY wiki, version
LIMIT 10000;

@Tbayer

@mforns Great to hear that Druid already allows ingestion of array types! But just to clarify, it seems that this involves information reduction of some kind? At least I'm only seeing scalar values in the selection dropdown in Turnilo (below).

I think it's working OK, no? There are 2 fields that are arrays right now. One of them, sectionNumbers, is an array of integers (I think that's the one in the screenshot, no?). The other one, issuesSeverity is an array of strings, and seems to be working fine on my side.

does it always pick the first element of the array? (i.e. [0,2,5] --> 0, etc.)

Yea, good point. The other day I was wondering about that as well...
When you activate a split over an array-typed dimension, each split value is the count of all records that contain that value in the array. So, at any point in time the split values may not add up to the total, but might actually be bigger than the total (if the arrays contain more than one element). @JAllemandou please correct me if I'm wrong here.

If that's the case, could we document how that works

Yes, we definitely have to document this and the other gotchas that arose with these new datasources.
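A small illustration of the split behavior described above (plain Python rather than Druid itself; the sample records are made up):

```python
# Each record is counted once per array element it contains, so split
# values can sum to more than the number of records.
from collections import Counter

records = [
    {"sectionNumbers": [0, 2]},
    {"sectionNumbers": [0]},
    {"sectionNumbers": [2, 5]},
]

split = Counter()
for r in records:
    for value in r["sectionNumbers"]:
        split[value] += 1

total = len(records)
print(split)   # Counter({0: 2, 2: 2, 5: 1})
print(sum(split.values()), ">", total)  # 5 split-rows vs 3 records
```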

Change 464833 abandoned by Mforns:
Add druid_load.pp to refinery jobs

Reason:
Already taken care of by https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/465692/

https://gerrit.wikimedia.org/r/464833

Nuria added a comment. · Oct 10 2018, 8:52 PM

BTW, I understand we are focusing on use in Turnilo for now, but out of curiosity (and considering the task description) I checked Superset too and didn't see this data there yet.

Turnilo and Superset both read from the same storage, Druid. Any dataset available in Turnilo is also available in Superset.

@Tbayer

Another question: It seems that the dimensions lack e.g. Ua Browser Major and other user agent derived fields (that we have and use in e.g. https://turnilo.wikimedia.org/#pageviews_daily/ ). In the web team we often need these when evaluating EL data, see e.g. this example from earlier today: T204143#4650771 . Could they be added, analogously to the pageviews data?

Sure! Will do. I will add them also to ReadingDepth.
As segments older than 90 days will be deleted from Druid, and as the data set is not really big, I think it will be fine to add those.

@Tbayer

It occurred to me afterwards though that this might be because we were looking at the Count measure instead of the Event Count measure. What is the meaning of the former? Is this documented somewhere? Is it necessary to include in the Turnilo options? This will likely not be the last time that it causes that kind of confusion.

Yes, this has already caused confusion for other people. The EventCount metric is generated at ingestion time by our data-crunching job; it represents the number of EventLogging events that fall inside the given slice/dice. The Count metric is added automatically at some point in the pipeline; it corresponds to the number of aggregated (rolled-up) rows of the Druid datasource, and it will differ from EventCount in most cases. The addition of the Count metric was not intended; I'm trying to see whether we can drop it. If not, I will add it as a gotcha to the documentation.
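A toy sketch of that roll-up distinction (illustrative field names and data, not the real pipeline):

```python
# Roll-up groups raw events by (time bucket, dimensions). EventCount sums
# the original events; Count is just the number of rolled-up rows.
from collections import defaultdict

raw_events = [
    {"hour": 0, "action": "pageLoaded"},
    {"hour": 0, "action": "pageLoaded"},
    {"hour": 0, "action": "issueClicked"},
    {"hour": 1, "action": "pageLoaded"},
]

rollup = defaultdict(int)
for e in raw_events:
    rollup[(e["hour"], e["action"])] += 1  # per-row event count

event_count = sum(rollup.values())  # number of EventLogging events: 4
count = len(rollup)                 # number of rolled-up Druid rows: 3
print(event_count, count)
```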

BTW, I understand we are focusing on use in Turnilo for now, but out of curiosity (and considering the task description) I checked Superset too and didn't see this data there yet.

Turnilo and Superset both read from the same storage, Druid.

We all know that.

Any dataset available in Turnilo is also available in Superset.

That had been my assumption, hence the surprise about not yet seeing this data in the list of Druid datasources in Superset. Now I know that one needs to hit "scan" in such cases. (BTW, I did so again yesterday, resulting in another datasource being added: [mediawiki_history_reduced_2018_09].)
I had started a "usage notes" section in the documentation, containing a remark on that point.

@Tbayer

It occurred to me afterwards though that this might be because we were looking at the Count measure instead of the Event Count measure. What is the meaning of the former? Is this documented somewhere? Is it necessary to include in the Turnilo options? This will likely not be the last time that it causes that kind of confusion.

Yes, this has already caused confusion for other people. The EventCount metric is generated at ingestion time by our data-crunching job; it represents the number of EventLogging events that fall inside the given slice/dice. The Count metric is added automatically at some point in the pipeline; it corresponds to the number of aggregated (rolled-up) rows of the Druid datasource, and it will differ from EventCount in most cases. The addition of the Count metric was not intended; I'm trying to see whether we can drop it. If not, I will add it as a gotcha to the documentation.

Thanks for the explanation! Yes, removing Count would be great. I'm also aware of several other people who have already fallen victim to this confusion.

@Tbayer

@mforns Great to hear that Druid already allows ingestion of array types! But just to clarify, it seems that this involves information reduction of some kind? At least I'm only seeing scalar values in the selection dropdown in Turnilo (below).

I think it's working OK, no? There are 2 fields that are arrays right now. One of them, sectionNumbers, is an array of integers (I think that's the one in the screenshot, no?). The other one, issuesSeverity is an array of strings, and seems to be working fine on my side.

By "information reduction" (in both of these fields), I meant that several possible values will be mapped to the same value.
E.g. the arrays [0,1,2] and [0] in the EL data will both result in the integer 0 in the Druid data. In the task description and T201873 we had, IIRC, understood the term "flattened into a string" as mapping e.g. the array [0,1,2] into the string '[0,1,2]'.

So this was not quite what I expected based on the earlier discussion. But if it is what Druid gives us out of the box, I can see the pragmatic reason to go with it instead; it has both advantages and disadvantages from the data analysis perspective.
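The two interpretations being contrasted here can be spelled out in a couple of lines (illustrative only, not the actual ingestion code):

```python
# Two ways an array field can end up as a scalar dimension:
def first_element(arr):
    """Information-reducing: [0,1,2] and [0] both map to 0."""
    return arr[0] if arr else None

def flatten_to_string(arr):
    """Lossless: [0,1,2] -> '[0,1,2]', as originally envisioned in T201873."""
    return "[" + ",".join(str(v) for v in arr) + "]"

assert first_element([0, 1, 2]) == first_element([0]) == 0
assert flatten_to_string([0, 1, 2]) != flatten_to_string([0])
```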

does it always pick the first element of the array? (i.e. [0,2,5] --> 0, etc.)

Yea, good point. The other day I was wondering about that as well...
When you activate a split over an array-typed dimension, each split value is the count of all records that contain that value in the array. So, at any point in time the split values may not add up to the total, but might actually be bigger than the total (if the arrays contain more than one element). @JAllemandou please correct me if I'm wrong here.

OK, for the present use case that is not a huge problem. (The most important application in the context of this A/B test is to distinguish values where the array contains 0, corresponding to pages where the entire article is marked as having issues, as opposed to just individual sections.) We need to be aware of the specific behavior though, so thanks for figuring this out. BTW, it might be worth documenting what we learned here about array ingestion at https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid .

Just a note that I'm deleting the Druid dataset temporarily, to apply some renames and productionize the final job.
Will be back up within 1 day hopefully.

Milimetric raised the priority of this task from Medium to High. · Oct 18 2018, 5:17 PM
Nuria added a comment. · Oct 22 2018, 9:32 PM

Data is in druid, please let us know if this ticket can be closed: https://turnilo.wikimedia.org/#event_pageissues

Nuria moved this task from In Code Review to Done on the Analytics-Kanban board. · Oct 22 2018, 9:33 PM
Nuria moved this task from Done to Ready to Deploy on the Analytics-Kanban board. · Oct 22 2018, 9:39 PM

I backfilled the last 3 months of data. This is now productionized!
Data will continue to be imported automatically every hour
(with a 5 hour lag to allow for previous collection and refinement of EL events into Hive).
Next steps are:

  • Write a comprehensive documentation about EventLoggingToDruid ingestion.
  • Remove the confusing Count metric from the datasource in Turnilo, or at least uncheck it by default (and make the actual eventCount the default).
  • Try to add a new metric to the datasource, eventCountPercentage, that normalizes eventCount splits by the total aggregate, so that time measure buckets become percentage-of-total values, instead of frequencies. This way they will not vary with throughput changes or seasonality, and will be a lot easier to follow. (not sure if this will be possible, though)
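The proposed eventCountPercentage could work roughly like this (a hypothetical sketch; names and data are made up):

```python
# Normalize each split's eventCount by the per-bucket total, so values
# read as percentages of the bucket rather than raw frequencies.
buckets = {
    "2018-10-03T00": {"new2018": 120, "old": 80},
    "2018-10-03T01": {"new2018": 300, "old": 200},
}

def to_percentages(bucket):
    total = sum(bucket.values())
    return {k: 100.0 * v / total for k, v in bucket.items()}

for hour, counts in buckets.items():
    print(hour, to_percentages(counts))
# Both hours show new2018 at 60% and old at 40%, even though raw
# throughput changed between them.
```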

In any case these items will not be part of this task, I will tackle them as part of T206342.
Will move this task to Done in Analytics-Kanban.
Cheers!

mforns moved this task from Ready to Deploy to Done on the Analytics-Kanban board. · Oct 23 2018, 2:02 PM
Nuria closed this task as Resolved. · Oct 25 2018, 4:25 PM