
Enable Page Previews EventLogging instrumentation
Closed, Resolved · Public · 1 Story Points

Description

Background

In T180036: Instrument time to first user link interaction, we added an additional piece of data to every event sent by the EventLogging instrumentation that would allow us to identify the typical time to first link interaction on a page, accurate to less than 1 ms. We did this because the timestamps available to us from the server-side were only accurate to the second (see T179426#3741809 onwards).

In T178500: Stop sending data for Page Previews enwiki and dewiki A/B test (again), the EventLogging instrumentation was disabled. We need to re-enable it in order to collect data with the more accurate timestamps.

Developer Notes

  1. This is as simple as setting $wgPopupsEventLogging = true in the config for English and German Wikipedia only.
  2. The logging should be disabled eight days after enabling it. (We want to collect a full week's worth of data, allowing, as last time, an additional day for the initial change to propagate.)

Post-deploy Actions

Done in T181493#3830489.

Done in https://www.mediawiki.org/w/index.php?title=Reading%2FWeb%2FRelease_timeline&type=revision&diff=2644441&oldid=2620937.

Event Timeline

phuedx created this task. · Nov 28 2017, 11:23 AM
Restricted Application added a subscriber: Aklapper. · Nov 28 2017, 11:23 AM
phuedx set the point value for this task to 1. · Nov 28 2017, 11:24 AM
phuedx added a subscriber: ovasileva.

^ Pre-emptively estimating this as a 1 (per Readers Web norms) as it's a config change.

ovasileva triaged this task as High priority. · Nov 28 2017, 2:58 PM
Restricted Application added a subscriber: Dereckson. · Nov 28 2017, 5:41 PM

So I understand our main purpose is just to reproduce the previous histograms with more accurate timestamps (without drilling down by OS, browser or other dimension). One full week of data at the previous rates should make these charts sufficiently smooth already.

ovasileva added a subscriber: MBinder_WMF.

adding the tag back to help @MBinder_WMF with plog tracking

Jdlrobson updated the task description. · Nov 29 2017, 6:35 PM

@ovasileva @Tbayer It's my understanding that we need to get this done in the next 4 days if you want to make the deploy deadline, but because of the deadline we can't UNdeploy in the following 2 weeks. Wouldn't it be safer to wait until January? Am I missing something? :)

> @ovasileva @Tbayer It's my understanding that we need to get this done in the next 4 days if you want to make the deploy deadline, but because of the deadline we can't UNdeploy in the following 2 weeks. Wouldn't it be safer to wait until January? Am I missing something? :)

What exactly is meant by "get this done", and what are the safety concerns? The remaining action item is the deploy itself, which (as discussed in kickoff) basically consists in flipping a switch that we have flipped before.
Having two weeks instead of one week of data won't harm the data analysis, it would just be a bit inefficient (we could reduce the sampling rate by half if that's a concern).

And for context regarding prioritization: I understand that T176211: Page Previews could load less JS on pageload is still the main application of this data, so perhaps @Gilles could quickly weigh in on how timely that work is, i.e. how important it would be to have these histograms in two weeks from now instead of in the second half of January.

A concrete suggestion for a timeframe would be to activate it tomorrow (Nov 30) and deactivate it after eight days, on Friday Dec 8 - or whatever the first opportunity is.
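The proposed window can be checked with a few lines of date arithmetic. This is only a sketch: the function name `collection_window` and its parameters are made up for illustration, and the seven-plus-one-day split simply follows the task description (a full week of data plus one day for the change to propagate).

```python
from datetime import date, timedelta


def collection_window(enable_on, collection_days=7, propagation_days=1):
    """Return (enable_on, disable_on) for a data collection run.

    collection_days: full days of data wanted (one week per the task).
    propagation_days: extra allowance for the config change to propagate.
    Both defaults are taken from the task description; the helper itself
    is hypothetical.
    """
    disable_on = enable_on + timedelta(days=collection_days + propagation_days)
    return enable_on, disable_on


start, end = collection_window(date(2017, 11, 30))
print(start.isoformat(), "->", end.isoformat(), end.strftime("(%A)"))
```

Enabling on Nov 30 and adding eight days lands on Dec 8, which matches the Friday suggested above.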

> @ovasileva @Tbayer It's my understanding that we need to get this done in the next 4 days if you want to make the deploy deadline, but because of the deadline we can't UNdeploy in the following 2 weeks.

Also, could you elaborate more on the exact deployment deadline? https://www.mediawiki.org/wiki/Scrum_of_scrums/2017-11-29 says "Deployment freeze starts the week of December 18th."

I think it'd be nice to do the data collection before the deployment freeze and then that gives us time to analyze results and discuss next steps for the next quarter?

@Tbayer If there is actually no risk, simply more data collection, then it's likely fine. Up to @ovasileva . I'm just the ignorant one so I want to make sure. :)

I think we can go ahead and do this. I am similarly confused about the schedule - the deployment freeze doesn't begin until December 18th.

Jdlrobson added a subscriber: Jdlrobson. · Edited · Nov 30 2017, 7:16 PM

> I think we can go ahead and do this. I am similarly confused about the schedule - the deployment freeze doesn't begin until December 18th.

The deployment schedule is here: https://wikitech.wikimedia.org/wiki/Deployments#Upcoming
There is no problem turning this on. The problem developers were flagging in kick off was knowing how much data we needed and checking we were able to turn it off in time.

The last deploy this year is the 15th December. Then the next deploy is 2nd January.

For instance if we turn this on December 5th, we will not be able to turn it off until after the deployment freeze - at earliest January 1st - so that's almost a month of data.
There is no SWAT tomorrow, so unless I turn it off later today, the next SWAT would be the 4th, which means we can only get 11 days or a month of data.
We can of course enable it on the 15th, but that's very risky if we hit any EventLogging issues.
We should not be running this data collection during a period we cannot respond to issues.

Hope this clears up confusion... please provide guidance on the precise amount of data we need. If it's a week and we quickly decide this, we can get this done. This is trivial from an engineering side, but needs clearer idea of time frame.
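The "11 days or a month" figure follows from the dates quoted in this comment, and can be reproduced as below. This is a rough sketch, assuming the instrumentation is enabled at the SWAT on the 4th and can only be disabled at one of the two deploy dates mentioned (15th December or 2nd January); the constant and function names are illustrative only.

```python
from datetime import date


def days_of_data(enable_on, disable_on):
    """Number of days the instrumentation would stay enabled."""
    return (disable_on - enable_on).days


# Dates taken from the comment above; the names are hypothetical.
NEXT_SWAT = date(2017, 12, 4)
LAST_DEPLOY_2017 = date(2017, 12, 15)
FIRST_DEPLOY_2018 = date(2018, 1, 2)

print(days_of_data(NEXT_SWAT, LAST_DEPLOY_2017))   # disable before the freeze
print(days_of_data(NEXT_SWAT, FIRST_DEPLOY_2018))  # disable after the freeze
```

Disabling before the freeze yields 11 days of data; waiting until the first deploy in January yields 29 days, i.e. roughly a month.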

Thanks, @Jdlrobson . That was super clear and helpful. :-D

> I think we can go ahead and do this. I am similarly confused about the schedule - the deployment freeze doesn't begin until December 18th.

> The deployment schedule is here: https://wikitech.wikimedia.org/wiki/Deployments#Upcoming
> There is no problem turning this on. The problem developers were flagging in kick off was knowing how much data we needed and checking we were able to turn it off in time.

As already mentioned during kickoff (and posted above shortly beforehand: T181493#3793413 ), one full week of data is enough, adding in a day or so for caching (conservatively assuming that the sampling rate change doesn't propagate faster than the launch of the full experiment last time).

IIRC the problem in kickoff was rather that there was some confusion about the timing of the deployment freeze, with someone mentioning an earlier date. This confusion has since been resolved.

> The last deploy this year is the 15th December. Then the next deploy is 2nd January.
> For instance if we turn this on December 5th, we will not be able to turn it off until after the deployment freeze - at earliest January 1st - so that's almost a month of data.
> There is no SWAT tomorrow, so unless I turn it off later today, the next SWAT would be the 4th, which means we can only get 11 days or a month of data.
> We can of course enable it on the 15th, but that's very risky if we hit any EventLogging issues.
> We should not be running this data collection during a period we cannot respond to issues.
> Hope this clears up confusion... please provide guidance on the precise amount of data we need. If it's a week and we quickly decide this, we can get this done. This is trivial from an engineering side, but needs clearer idea of time frame.

A precise timeframe was already proposed yesterday at T181493#3797982 .

Tbayer updated the task description. · Nov 30 2017, 9:34 PM

Thanks for the description update! Looks clear now.

Olga, should this be moved to todo?

Change 395053 had a related patch set uploaded (by Jdlrobson; owner: Jdlrobson):
[operations/mediawiki-config@master] Enable Page Previews EventLogging instrumentation

https://gerrit.wikimedia.org/r/395053

Per review from @phuedx, this is apparently blocked (not sure why), so I've removed it from the SWAT calendar.

Sorry. This is blocked on T180036. I think I should've moved T181493 to Blocked.

phuedx added a comment. · Dec 4 2017, 7:07 PM

Sorry. I think I should've moved this to Blocked – that being said, a direct subtask of this task is unresolved.

This is blocked on T180036: Instrument time to first user link interaction, which itself is blocked on T182000: Popups timestamp field contains multiple types.

phuedx updated the task description. · Dec 5 2017, 6:25 PM

@phuedx to talk to analytics engineering to see if there are space constraints.

> @phuedx to talk to analytics engineering to see if there are space constraints.

Why? They have told us earlier that space is not an issue at this point (now that we have switched this instrumentation to use Hive, and blacklisted it from MySQL.) Also, I already gave them a heads-up on November 30.
We do need to notify them after deploy though about the data cleanup, as specified in the task description.

phuedx added a comment. · Edited · Dec 6 2017, 1:19 PM

^ @Tbayer's informed AE that we'll be enabling the instrumentation. @ovasileva and @Tbayer are OK with keeping the instrumentation enabled until January, which removes the time pressure we imposed by trying to collect the requisite amount of data before the December Deployment Freeze (week of Monday the 18th).

Jdlrobson updated the task description. · Dec 11 2017, 6:16 PM

Change 395053 merged by jenkins-bot:
[operations/mediawiki-config@master] Enable Page Previews EventLogging instrumentation

https://gerrit.wikimedia.org/r/395053

Mentioned in SAL (#wikimedia-operations) [2017-12-11T19:47:23Z] <ebernhardson@tin> Synchronized wmf-config/InitialiseSettings.php: SWAT: T181493: Enable Page Previews EventLogging instrumentation (duration: 00m 56s)

The change has been SWAT deployed and I see events coming in:

phuedx updated the task description. · Dec 11 2017, 8:06 PM

Thanks @bmansurov ! Now what? Does this need QA or can it be signed off? (cc @Tbayer @ovasileva )

phuedx updated the task description. · Dec 12 2017, 6:09 AM

@Tbayer, @Ottomata: We stopped sending Popups events at ~7:40 PM UTC on Wednesday, 15th November and started sending them again at ~7:47 PM UTC yesterday (Monday, 11th December). AFAIK all events that appeared between those two times can be dropped. Is this correct, @Tbayer?

phuedx updated the task description. · Dec 12 2017, 11:18 AM

Let's leave this open until @Tbayer and @Ottomata have acknowledged/confirmed the above.

phuedx claimed this task. · Dec 12 2017, 11:19 AM

Ok, great. I’ve actually already been pruning the event.popups table over
the last week or so to avoid getting alerts about that timestamp int vs.
float problem. I haven’t seen any of those errors since Monday the 11th.
There is some data for a few days before Monday the 11th, but consider it
incomplete (I deleted the offending hours :) ).

The table has a long (bigint) timestamp field. So, you should be good to
go! Just keep your analysis focused on the 11th and beyond.
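Restricting an analysis to "the 11th and beyond" amounts to a simple cutoff filter. The sketch below is only an illustration under stated assumptions: the cutoff is the SAL sync time quoted earlier in this task (2017-12-11T19:47:23Z), while the event dictionaries and their `dt` field are hypothetical stand-ins, not the actual layout of the event.popups table.

```python
from datetime import datetime, timezone

# Cutoff assumed from the SAL entry for the SWAT deploy.
RESUME_AT = datetime(2017, 12, 11, 19, 47, 23, tzinfo=timezone.utc)


def is_usable(event, cutoff=RESUME_AT):
    """Keep only events received at or after the re-enable time.

    `event["dt"]` is a hypothetical ISO 8601 receive time with an
    explicit UTC offset.
    """
    received = datetime.fromisoformat(event["dt"])
    return received >= cutoff


# Made-up sample events for illustration only.
events = [
    {"dt": "2017-12-09T10:00:00+00:00", "action": "pageLoaded"},
    {"dt": "2017-12-11T19:50:00+00:00", "action": "pageLoaded"},
    {"dt": "2017-12-12T08:15:00+00:00", "action": "opened"},
]

usable = [e for e in events if is_usable(e)]
print(len(usable), "of", len(events), "events kept")
```

Events from the partially pruned days before the 11th fall below the cutoff and are dropped, in line with the advice above to treat that data as incomplete.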

Thanks y’all!

Tbayer closed this task as Resolved. · Dec 12 2017, 6:08 PM

> @Tbayer, @Ottomata: We stopped sending Popups events at ~7:40 PM UTC on Wednesday, 15th November

I guess this was meant to link to https://phabricator.wikimedia.org/T178500#3764072 instead...

> and started sending them again at ~7:47 PM UTC yesterday (Monday, 11th December). AFAIK all events that appeared between those two times can be dropped. Is this correct, @Tbayer?

...but yes, these dates look correct. Closing the task now as Andrew has already confirmed too.

PS: the task description should have included updating the schema talk page, I just did that myself (based on my understanding that sampling rates etc. remain the same as in October/November): https://meta.wikimedia.org/wiki/Schema_talk:Popups#enwiki_and_dewiki_A/B_test_v3_(December_12-)