
Ingest data from PageIssues EventLogging schema into Druid
Closed, Resolved · Public · 3 Story Points

Description

Once the PageIssues schema is live, we would like to ingest some of its data into Druid, so that it can be viewed in Superset.

Per a review of the draft schema guidelines and subsequent discussion with the AE team on IRC, this should be possible, with the following fields as dimensions:

  • isAnon
  • action
  • issuesVersion
  • issuesSeverity: this field is an array; as discussed on IRC, it should be flattened into a string (see T201873)
  • sectionNumbers: this field is also an array and should be treated in the same way
  • editCountBucket
  • namespaceId

The sole measure would be the number of actions (events), aggregated by (say) hour.

The following fields should be left out:

  • pageTitle
  • pageToken
  • sessionToken
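As a rough sketch of the intended field handling (dimension names taken from the lists above; the flattening approach follows the T201873 discussion, and this is not the actual refinery/ingestion code):

```python
# Hypothetical sketch, not the actual ingestion job: keep the dimension
# fields listed above, flatten array values into strings, and drop the
# high-cardinality fields from ingestion.
DIMENSIONS = ["isAnon", "action", "issuesVersion", "issuesSeverity",
              "sectionNumbers", "editCountBucket", "namespaceId"]
EXCLUDED = ["pageTitle", "pageToken", "sessionToken"]  # high cardinality

def flatten_event(event):
    out = {}
    for field in DIMENSIONS:
        value = event.get(field)
        if isinstance(value, list):
            # e.g. [0, 2] -> "[0,2]", per the "flatten into a string" idea
            value = "[" + ",".join(str(v) for v in value) + "]"
        out[field] = value
    return out

event = {"isAnon": True, "action": "pageLoaded", "issuesVersion": "new2018",
         "issuesSeverity": ["low", "medium"], "sectionNumbers": [0, 2],
         "editCountBucket": "5-99 edits", "namespaceId": 0,
         "pageToken": "abc123"}  # excluded field, silently dropped
print(flatten_event(event))
```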

Details

Related Gerrit Patches:
operations/puppet@production: Add druid_load.pp to refinery jobs

Related Objects

Event Timeline

Restricted Application added a subscriber: Aklapper. · Aug 24 2018, 3:52 PM
fdans triaged this task as Medium priority. · Aug 27 2018, 3:46 PM
fdans moved this task from Incoming to Smart Tools for Better Data on the Analytics board.
Restricted Application added a project: Product-Analytics. · Aug 27 2018, 6:12 PM

@mforns Just a last ping - I know we said earlier that no field renames or other schema changes would be needed, but if you could double-check once more as this schema is about to go live, that would be great.

@Tbayer
Yes, no renames will be needed! We'll find a solution to the array field and implement it soon.
The only caveat is that fields (dimensions) with high cardinality, like pageToken, sessionToken, pageTitle and pageIdSource, perform very badly in Druid, so I would blacklist them from Druid ingestion if possible.

@Tbayer
Yes, no renames will be needed! We'll find a solution to the array field and implement it soon.

Thanks for confirming!

The only caveat is that fields (dimensions) with high cardinality, like pageToken, sessionToken, pageTitle and pageIdSource, perform very badly in Druid, so I would blacklist them from Druid ingestion if possible.

Yes, that's been the idea (see task description).

For the record: @Nuria and I discussed this task earlier this week, and I understand that the AE team feels that due to the workload from other projects it might not be possible to implement this (specifically, the subtask T201873) before the end of Q2. The team has suggested instead tackling T205562: Ingest aggregate ReadingDepth data into Druid first, as it might be an easier case.

mforns claimed this task. · Oct 2 2018, 1:34 PM
mforns added a project: Analytics-Kanban.
mforns moved this task from Next Up to In Progress on the Analytics-Kanban board. · Oct 2 2018, 3:01 PM

Change 464833 had a related patch set uploaded (by Mforns; owner: Mforns):
[operations/puppet@production] Add druid_load.pp to refinery jobs

https://gerrit.wikimedia.org/r/464833

mforns set the point value for this task to 5. · Oct 8 2018, 4:08 PM
mforns changed the point value for this task from 5 to 3.
Tbayer added a comment (edited). · Oct 9 2018, 3:27 AM

@mforns Great to hear that Druid already allows ingestion of array types! But just to clarify, it seems that this involves information reduction of some kind? At least I'm only seeing scalar values in the selection dropdown in Turnilo (below).
If that's the case, could we document how that works - does it always pick the first element of the array? (i.e. [0,2,5] --> 0, etc.)

[edited to add Turnilo link]

Another question: It seems that the dimensions lack e.g. Ua Browser Major and other user agent derived fields (that we have and use in e.g. https://turnilo.wikimedia.org/#pageviews_daily/ ). In the web team we often need these when evaluating EL data, see e.g. this example from earlier today: T204143#4650771 . Could they be added, analogously to the pageviews data?

BTW, I understand we are focusing on use in Turnilo for now, but out of curiosity (and considering the task description) I checked Superset too and didn't see this data there yet. I clicked "scan new datasources", which appears to have imported it, alongside data from some other schemas:

Refreshed metadata from cluster [public-eqiad]
Adding new datasource [event_NavigationTiming]
Adding new datasource [event_ReadingDepth]
Adding new datasource [event_PageIssues]
Refreshed metadata from cluster [analytics-eqiad]

Sorry if that was not intended; feel free to remove it again temporarily if the import was meant to happen later.

Back to the view in Turnilo: This looks very exciting indeed!

I have to mention that @ovasileva and I spent quite a bit of time earlier today investigating what looked like inconsistent or wrong data in Turnilo. E.g. in this view, the number of events should be roughly the same for the old and new design, as the test and control group have the same sample size in the A/B test. But they are not. (E.g. enwiki has 691.5 k pageLoaded events in test vs. only 251.8 k in control.)

It occurred to me afterwards though that this might be because we were looking at the Count measure instead of the Event Count measure. What is the meaning of the former? Is this documented somewhere? Is it necessary to include in the Turnilo options? This will likely not be the last time that it causes that kind of confusion.

When switching to Event Count, the numbers look plausible so far. I ran a Hive query to double-check the numbers from this view, and they check out (including showing roughly the same pageLoaded numbers for test and control):

wiki    version  pageloaded_events
enwiki  new2018  1305905
enwiki  old      1327275
fawiki  new2018  33513
fawiki  old      33509
jawiki  new2018  482114
jawiki  old      488352
lvwiki  new2018  3452
lvwiki  old      3424
ruwiki  new2018  90951
ruwiki  old      92328
SELECT wiki, event.issuesVersion AS version,
  SUM(IF(event.action = 'pageLoaded', 1, 0)) AS pageloaded_events
FROM event.pageissues
WHERE year = 2018 AND month = 10 AND day = 3
  AND event.sectionnumbers[0] = 0
GROUP BY wiki, event.issuesVersion
ORDER BY wiki, version
LIMIT 10000;

@Tbayer

@mforns Great to hear that Druid already allows ingestion of array types! But just to clarify, it seems that this involves information reduction of some kind? At least I'm only seeing scalar values in the selection dropdown in Turnilo (below).

I think it's working OK, no? There are 2 fields that are arrays right now. One of them, sectionNumbers, is an array of integers (I think that's the one in the screenshot, no?). The other one, issuesSeverity is an array of strings, and seems to be working fine on my side.

does it always pick the first element of the array? (i.e. [0,2,5] --> 0, etc.)

Yea, good point. The other day I was wondering about that as well...
When you activate a split over an array-typed dimension, each split value is the count of all records that contain that value in the array. So, at any point in time the split values may not add up to the total, but might actually be bigger than the total (if the arrays contain more than one element). @JAllemandou please correct me if I'm wrong here.

If that's the case, could we document how that works

Yes, we definitely have to document this and the other gotchas that arose with these new datasources.
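A small illustration of the split behavior described above (plain Python rather than Druid itself; the sample records are made up):

```python
# Each record is counted once per array element it contains, so split
# values can sum to more than the number of records.
from collections import Counter

records = [
    {"sectionNumbers": [0, 2]},
    {"sectionNumbers": [0]},
    {"sectionNumbers": [2, 5]},
]

split = Counter()
for r in records:
    for value in r["sectionNumbers"]:
        split[value] += 1

total = len(records)
print(split)   # Counter({0: 2, 2: 2, 5: 1})
print(sum(split.values()), ">", total)  # 5 split-rows vs 3 records
```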

Change 464833 abandoned by Mforns:
Add druid_load.pp to refinery jobs

Reason:
Already taken care of by https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/465692/

https://gerrit.wikimedia.org/r/464833

Nuria added a comment. · Oct 10 2018, 8:52 PM

BTW, I understand we are focusing on use in Turnilo for now, but out of curiosity (and considering the task description) I checked Superset too and didn't see this data there yet.

Turnilo and Superset both read from the same storage, Druid. Any dataset available in Turnilo is also available in Superset.

@Tbayer

Another question: It seems that the dimensions lack e.g. Ua Browser Major and other user agent derived fields (that we have and use in e.g. https://turnilo.wikimedia.org/#pageviews_daily/ ). In the web team we often need these when evaluating EL data, see e.g. this example from earlier today: T204143#4650771 . Could they be added, analogously to the pageviews data?

Sure! Will do. I will add them also to ReadingDepth.
As segments older than 90 days will be deleted from Druid, and as the data set is not really big, I think it will be fine to add those.

@Tbayer

It occurred to me afterwards though that this might be because we were looking at the Count measure instead of the Event Count measure. What is the meaning of the former? Is this documented somewhere? Is it necessary to include in the Turnilo options? This will likely not be the last time that it causes that kind of confusion.

Yes, this has already caused confusion for other people. The EventCount metric is generated at ingestion time by our data-crunching job; it represents the number of EventLogging events that fall inside the given slice/dice. The Count metric is added automatically at some point in the pipeline; it corresponds to the number of aggregated (rolled-up) rows of the Druid datasource, and it will differ from EventCount in most cases. The addition of the Count metric was not intended; I'm trying to see whether we can drop it. If not, I will add it as a gotcha to the documentation.
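A toy sketch of that roll-up distinction (illustrative field names and data, not the real pipeline):

```python
# Roll-up groups raw events by (time bucket, dimensions). EventCount sums
# the original events; Count is just the number of rolled-up rows.
from collections import defaultdict

raw_events = [
    {"hour": 0, "action": "pageLoaded"},
    {"hour": 0, "action": "pageLoaded"},
    {"hour": 0, "action": "issueClicked"},
    {"hour": 1, "action": "pageLoaded"},
]

rollup = defaultdict(int)
for e in raw_events:
    rollup[(e["hour"], e["action"])] += 1  # per-row event count

event_count = sum(rollup.values())  # number of EventLogging events: 4
count = len(rollup)                 # number of rolled-up Druid rows: 3
print(event_count, count)
```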

BTW, I understand we are focusing on use in Turnilo for now, but out of curiosity (and considering the task description) I checked Superset too and didn't see this data there yet.

Turnilo and Superset both read from the same storage, Druid.

We all know that.

Any dataset available in Turnilo is also available in Superset.

That had been my assumption, hence the surprise about not yet seeing this data in the list of Druid datasources in Superset. Now I know that one needs to hit "scan" in such cases. (BTW, I did so again yesterday, resulting in another datasource being added: [mediawiki_history_reduced_2018_09].)
I had started a "usage notes" section in the documentation, containing a remark on that point.

@Tbayer

It occurred to me afterwards though that this might be because we were looking at the Count measure instead of the Event Count measure. What is the meaning of the former? Is this documented somewhere? Is it necessary to include in the Turnilo options? This will likely not be the last time that it causes that kind of confusion.

Yes, this has already caused confusion for other people. The EventCount metric is generated at ingestion time by our data-crunching job; it represents the number of EventLogging events that fall inside the given slice/dice. The Count metric is added automatically at some point in the pipeline; it corresponds to the number of aggregated (rolled-up) rows of the Druid datasource, and it will differ from EventCount in most cases. The addition of the Count metric was not intended; I'm trying to see whether we can drop it. If not, I will add it as a gotcha to the documentation.

Thanks for the explanation! Yes, removing Count would be great. I'm also aware of several other people who have already fallen victim to this confusion.

@Tbayer

@mforns Great to hear that Druid already allows ingestion of array types! But just to clarify, it seems that this involves information reduction of some kind? At least I'm only seeing scalar values in the selection dropdown in Turnilo (below).

I think it's working OK, no? There are 2 fields that are arrays right now. One of them, sectionNumbers, is an array of integers (I think that's the one in the screenshot, no?). The other one, issuesSeverity is an array of strings, and seems to be working fine on my side.

By "information reduction" (in both of these fields), I meant that several possible values will be mapped to the same value.
E.g. the arrays [0,1,2] and [0] in the EL data will both result in the integer 0 in the Druid data. In the task description and T201873 we had, IIRC, understood the term "flattened into a string" as mapping e.g. the array [0,1,2] into the string '[0,1,2]'.

So this was not quite what I expected based on the earlier discussion. But if it is what Druid gives us out of the box, I can see the pragmatic reason to go with it instead; it has both advantages and disadvantages from the data analysis perspective.
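The two interpretations being contrasted here can be spelled out in a couple of lines (illustrative only, not the actual ingestion code):

```python
# Two ways an array field can end up as a scalar dimension:
def first_element(arr):
    """Information-reducing: [0,1,2] and [0] both map to 0."""
    return arr[0] if arr else None

def flatten_to_string(arr):
    """Lossless: [0,1,2] -> '[0,1,2]', as originally envisioned in T201873."""
    return "[" + ",".join(str(v) for v in arr) + "]"

assert first_element([0, 1, 2]) == first_element([0]) == 0
assert flatten_to_string([0, 1, 2]) != flatten_to_string([0])
```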

does it always pick the first element of the array? (i.e. [0,2,5] --> 0, etc.)

Yea, good point. The other day I was wondering about that as well...
When you activate a split over an array-typed dimension, each split value is the count of all records that contain that value in the array. So, at any point in time the split values may not add up to the total, but might actually be bigger than the total (if the arrays contain more than one element). @JAllemandou please correct me if I'm wrong here.

OK, for the present use case that is not a huge problem. (The most important application in the context of this A/B test is to distinguish values where the array contains 0, corresponding to pages where the entire article is marked as having issues, as opposed to just individual sections.) We need to be aware of the specific behavior though, so thanks for figuring this out. BTW, it might be worth documenting what we learned here about array ingestion at https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid .

Just a note that I'm deleting the Druid dataset temporarily, to apply some renames and productionize the final job.
Will be back up within 1 day hopefully.

Milimetric raised the priority of this task from Medium to High. · Oct 18 2018, 5:17 PM
Nuria added a comment. · Oct 22 2018, 9:32 PM

Data is in druid, please let us know if this ticket can be closed: https://turnilo.wikimedia.org/#event_pageissues

Nuria moved this task from In Code Review to Done on the Analytics-Kanban board. · Oct 22 2018, 9:33 PM
Nuria moved this task from Done to Ready to Deploy on the Analytics-Kanban board. · Oct 22 2018, 9:39 PM

I backfilled the last 3 months of data. This is now productionized!
Data will continue to be imported automatically every hour
(with a 5 hour lag to allow for previous collection and refinement of EL events into Hive).
Next steps are:

  • Write a comprehensive documentation about EventLoggingToDruid ingestion.
  • Remove the confusing Count metric from the datasource in Turnilo, or at least uncheck it by default (and make the actual eventCount the default).
  • Try to add a new metric to the datasource, eventCountPercentage, that normalizes eventCount splits by the total aggregate, so that time measure buckets become percentage-of-total values, instead of frequencies. This way they will not vary with throughput changes or seasonality, and will be a lot easier to follow. (not sure if this will be possible, though)
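The proposed eventCountPercentage could work roughly like this (a hypothetical sketch; names and data are made up):

```python
# Normalize each split's eventCount by the per-bucket total, so values
# read as percentages of the bucket rather than raw frequencies.
buckets = {
    "2018-10-03T00": {"new2018": 120, "old": 80},
    "2018-10-03T01": {"new2018": 300, "old": 200},
}

def to_percentages(bucket):
    total = sum(bucket.values())
    return {k: 100.0 * v / total for k, v in bucket.items()}

for hour, counts in buckets.items():
    print(hour, to_percentages(counts))
# Both hours show new2018 at 60% and old at 40%, even though raw
# throughput changed between them.
```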

In any case these items will not be part of this task, I will tackle them as part of T206342.
Will move this task to Done in Analytics-Kanban.
Cheers!

mforns moved this task from Ready to Deploy to Done on the Analytics-Kanban board. · Oct 23 2018, 2:02 PM
Nuria closed this task as Resolved. · Oct 25 2018, 4:25 PM