
Roll out VirtualPageViews to all Wikipedia wikis
Closed, Resolved | Public | 2 Story Points

Description

Over several stages we'll want to turn on the VirtualPageView schema for all wikis. We'll want to leave at least a day between stages.

Currently it is only live on Hungarian Wikipedia, where we are seeing 4 events per second.
We estimated that recording a page interaction when a preview has been open for > 1000 ms will correspond to an increase in webrequests per pageview of 0.13%, which corresponds to ~700-800 events/sec (or, roughly, 2x the peak rate from the Page Previews instrumentation). AIUI the Hive EventLogging backend can handle this event rate 💪 but the processors need to be monitored to see if more need to be added.

Roll out plan

  • s6.dblist (fr, ja and ru wiki)
  • wikipedia.dblist - top6-wikipedia.dblist (all but the top 6 wikipedias)
  • All wikis

For each roll out:

  • Ping @Ottomata and @Nuria on the ticket to let them know it's happening.
  • Enable the wikis ($wgPopupsVirtualPageViews)
  • Check rate per second graph, ensuring we are below the estimated 700-800 events per second and report in ticket.
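The last step of each roll out could be sketched roughly as follows. This is only an illustration, assuming the rate graph can be read off as a list of events-per-second samples; `checkRollout`, the sample values and the choice of 700 (the lower bound of the estimate) as the ceiling are all hypothetical.

```javascript
// Hypothetical helper: summarise events-per-second samples and compare
// the peak against the estimated ceiling from the task description.
const ESTIMATED_CEILING = 700; // lower bound of the 700-800 events/sec estimate

function checkRollout(samples, ceiling = ESTIMATED_CEILING) {
  if (samples.length === 0) {
    throw new Error('no samples to check');
  }
  const peak = Math.max(...samples);
  return { peak, ceiling, belowCeiling: peak < ceiling };
}

// Made-up samples, not real measurements.
const report = checkRollout([85, 120, 182, 160]);
console.log(report);
```

The returned object is the kind of summary that could be pasted into the ticket after each stage.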

Details

Related Gerrit Patches:
operations/mediawiki-config : master | Enable Page Previews for 10% enwiki anon users
operations/mediawiki-config : master | Rollout VirtualPageViews (final stage)
operations/mediawiki-config : master | Rollout VirtualPageViews (stage 3)
operations/mediawiki-config : master | Enable VirtualPageViews on s6 (ja,ru,fr) wikis
operations/mediawiki-config : master | Enable VirtualPageViews on s6 wikis

Event Timeline


@pmiazga any chance you can deploy this in the European Mid-day SWAT tomorrow (13:00–14:00 UTC)?
Otherwise I'll aim for the 11am (my morning) slot.
The patch will turn on VirtualPageViews for Russian, Japanese and French. To verify we just need to make sure that hovering a link sends a VirtualPageView event.

pmiazga claimed this task. Mar 22 2018, 5:18 PM

Change 421134 merged by jenkins-bot:
[operations/mediawiki-config@master] Enable VirtualPageViews on s6 wikis

https://gerrit.wikimedia.org/r/421134

Change 421352 had a related patch set uploaded (by Pmiazga; owner: Pmiazga):
[operations/mediawiki-config@master] Enable VirtualPageViews on s6 (ja,ru,fr) wikis

https://gerrit.wikimedia.org/r/421352

Change 421352 merged by jenkins-bot:
[operations/mediawiki-config@master] Enable VirtualPageViews on s6 (ja,ru,fr) wikis

https://gerrit.wikimedia.org/r/421352

Mentioned in SAL (#wikimedia-operations) [2018-03-22T18:41:00Z] <ladsgroup@tin> Synchronized wmf-config/InitialiseSettings.php: [[gerrit:421352|Enable VirtualPageViews on s6 (ja,ru,fr) wikis (T189906)]] (duration: 01m 16s)

pmiazga updated the task description. Mar 22 2018, 10:16 PM
pmiazga added a comment. Edited Mar 22 2018, 10:20 PM

After deploying to s6, Kafka MSG/s jumped to ~170 just after deployment; now it's a bit lower (~100 messages per second).

I'll keep an eye on this graph to make sure it doesn't cross events per second

The peak rate was ~230 events/s, which is way below 700-800 events per second.

150 events per second seems OK for just the fr, ja, hu and ru wikis, as that does include 3 of the top 6 wikis.
Do we feel comfortable rolling out to all Wikipedias except dewiki, enwiki and eswiki as the next stage, or should we be a bit more cautious?
Ping @Tbayer @Nuria

+1 in general.

https://grafana-admin.wikimedia.org/dashboard/db/eventlogging-schema?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-schema=VirtualPageView

Note that there are some errors happening. (I'm actually not sure where this client_errors metric comes from, will have to look).
I looked at the processor logs, and I see both "Extra data: line 1 column 408 - line 1 column 410 (char 407 - 409)" and "No JSON object could be decoded" type errors. Perhaps the custom client-side code y'all are using has a bug somewhere?

@Ottomata we're actually not using any custom client side code and we're using $.extend so event.VirtualPageView should always be passed a JSON.

pageviewTracker( 'event.VirtualPageView', $.extend( {},

How can I dig into that data (in particular what it's sending)? How can I access those error logs?

+1 in general.

Is that a +1 for a more cautious approach, or for all Wikipedias except dewiki, enwiki and eswiki as the next stage?

Nuria added a comment. Mar 26 2018, 6:46 PM

@Jdlrobson some errors are being logged from the client (url too large). There are not many of these, about 1 every 4 secs, which is small against a rate of 150 events per sec, but maybe you wish to look into that:

https://grafana.wikimedia.org/dashboard/db/eventlogging-schema?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-schema=VirtualPageView&from=now-7d&to=now

url too large means that url is bigger than 2000 chars
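A client could check for this condition before sending, along the lines of the sketch below. It is not the EventLogging implementation: the endpoint, the `page_title` field and the helper names are hypothetical, and whether the 2000-character limit applies to the whole URL or only to the query string is an assumption (the whole URL is measured here).

```javascript
const MAX_URL_LENGTH = 2000; // figure from the comment above; scope is an assumption

// Build a beacon URL by URI-encoding the JSON payload.
function buildEventUrl(baseUrl, data) {
  return baseUrl + '?' + encodeURIComponent(JSON.stringify(data)) + ';';
}

function isUrlTooLarge(url, limit = MAX_URL_LENGTH) {
  return url.length > limit;
}

const baseUrl = 'https://example.org/beacon/event'; // hypothetical endpoint
const small = buildEventUrl(baseUrl, {
  schema: 'VirtualPageView',
  event: { page_title: 'Foo' } // hypothetical field
});
const large = buildEventUrl(baseUrl, {
  schema: 'VirtualPageView',
  event: { page_title: 'x'.repeat(3000) }
});
console.log(isUrlTooLarge(small), isUrlTooLarge(large));
```

Percent-encoding inflates non-ASCII text: a character that takes three bytes in UTF-8 becomes nine characters once encoded, so long titles in non-Latin scripts reach the limit much sooner than their visible length suggests.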

Nuria added a comment. Mar 26 2018, 6:59 PM

You can do so in stats machine by consuming from error topic:

kafkacat -C -b kafka-jumbo1001.eqiad.wmnet -t eventlogging_EventError | grep irtual

The number of errors that do not validate is quite a bit higher than the number with url too large, so you might want to start there.

Nuria added a comment. Mar 26 2018, 7:00 PM

I would take a look at those before opening to more wikis

Nuria added a comment. Mar 26 2018, 9:33 PM

Left a file in stat1005 for ya: /home/nuria/errors-for-jrobson.txt

Remember it is raw data so it should not leave stats machines.

Doing some back-of-the-napkin math here, I think enabling this feature for en wiki plus de wiki is going to be a 4 times increase on the throughput we are seeing here (estimated from looking briefly at desktop pageview ratios), so at peak I think we are going to be looking at about 1000 reqs/sec.
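Spelled out, the napkin math might look like the sketch below. It assumes the baseline is the ~230 events/s peak reported earlier in this thread; which baseline was actually used for the estimate is not stated, so that choice is an assumption.

```javascript
// Back-of-the-napkin projection; inputs are read off this thread, not measured here.
const observedPeak = 230; // ~230 events/s peak reported after the s6 rollout (assumed baseline)
const growthFactor = 4;   // the "4 times increase" estimated above

const projectedPeak = observedPeak * growthFactor;
console.log(projectedPeak); // 920
```

That lands in the same ballpark as the "about 1000 reqs/sec" figure.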

Thanks for the sample!
So in that sample the errors are { 'No JSON object could be decoded': 61, 'Extra data': 91 }

In Node.js it seems like all these JSONs should be parseable (provided you deal with the ?, ; and = characters padding them).

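// Take the first tab- and newline-delimited field of the raw event.
const rV = ev.rawEvent.split('\t')[0].split('\n')[0];
// Decode it, then strip a trailing ';' or ';=' and a leading '?' or '?='.
const decoded = decodeURIComponent(rV).replace(/;$|;=$/, '').replace(/^\?=?/, '');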
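A self-contained version of that cleanup might look like the sketch below. `parseRawEvent` and the sample strings are hypothetical; only the split and replace steps come from the snippet above.

```javascript
// Strip the padding described above from a raw event string, then try to parse it.
function parseRawEvent(rawEvent) {
  try {
    const firstField = rawEvent.split('\t')[0].split('\n')[0];
    const decoded = decodeURIComponent(firstField)
      .replace(/;$|;=$/, '')  // trailing ';' or ';='
      .replace(/^\?=?/, '');  // leading '?' or '?='
    return { ok: true, value: JSON.parse(decoded) };
  } catch (err) {
    // Covers both malformed percent-encoding (URIError) and invalid JSON (SyntaxError).
    return { ok: false, error: err.message };
  }
}

// Made-up raw events in the shapes discussed in this thread.
const payload = encodeURIComponent(JSON.stringify({ schema: 'VirtualPageView' }));
console.log(parseRawEvent('?' + payload + ';\t-').ok);   // clean event
console.log(parseRawEvent('?=' + payload + ';=\t-').ok); // event padded with '='
```

Both sample events parse once the padding is removed, which matches the observation that the payloads themselves are valid JSON.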

What's weird is why these are appearing.
Our client code does the following:

var queryString = encodeURIComponent( JSON.stringify( data ) );
return baseUrl + '?' + queryString + ';';

How an '=' can end up after the '?' or after the ';' at the end is not clear to me. It may be that something intercepts the network requests, e.g. a proxy. Given the low number of errors relative to the number of events we're sending, this seems plausible.
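To illustrate why a stray '=' matters, here is a sketch of a strict decoder that expects exactly '?' + payload + ';'. `strictDecode` is hypothetical and is not the EventLogging processor; it only shows that either placement of '=' is enough to break parsing.

```javascript
// Client side, as quoted above: JSON payload, URI-encoded, between '?' and ';'.
function buildQuery(data) {
  return '?' + encodeURIComponent(JSON.stringify(data)) + ';';
}

// Hypothetical strict decoding: drop the first and last character, then parse.
function strictDecode(query) {
  return JSON.parse(decodeURIComponent(query.slice(1, -1)));
}

function tryDecode(query) {
  try {
    strictDecode(query);
    return 'ok';
  } catch (err) {
    return 'error';
  }
}

const clean = buildQuery({ schema: 'VirtualPageView' });
console.log(tryDecode(clean));                 // ok
console.log(tryDecode('?=' + clean.slice(1))); // '=' after '?': error
console.log(tryDecode(clean + '='));           // '=' after ';': error
```

The two messages in the processor logs look like the ones Python 2's `json` module raises for input that does not start with valid JSON ("No JSON object could be decoded") and for trailing characters after a valid document ("Extra data"). If that is where they come from, a leading '=' would plausibly produce the former and trailing padding the latter, but that mapping is an inference rather than something confirmed in this thread.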

Looking at the IP addresses of the issues, the 152 errors come from 2 IPs.
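The two breakdowns used here (errors by message and errors by source IP) amount to a simple tally. A minimal sketch, assuming the error records have already been parsed into objects; the `message` and `ip` field names and the sample records are hypothetical.

```javascript
// Count how often each value of `key` occurs in a list of records.
function tallyBy(records, key) {
  const counts = {};
  for (const record of records) {
    const value = record[key];
    counts[value] = (counts[value] || 0) + 1;
  }
  return counts;
}

// Made-up records using documentation-only IP ranges.
const errors = [
  { message: 'Extra data', ip: '192.0.2.1' },
  { message: 'Extra data', ip: '192.0.2.1' },
  { message: 'No JSON object could be decoded', ip: '198.51.100.7' }
];
console.log(tallyBy(errors, 'message'));
console.log(Object.keys(tallyBy(errors, 'ip')).length); // number of distinct IPs
```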

I ran a query again on some newer data:

ssh stat1005.eqiad.wmnet
kafkacat -C -b kafka-jumbo1001.eqiad.wmnet -t eventlogging_EventError | grep Virtual > foo.txt &

The same 2 IP addresses came up. @Nuria I've pinged you privately with details of who they belong to. We may want to consider blacklisting.

I'm gonna spend time tomorrow looking at a larger dataset to test my hypothesis that this is a local issue.

Doing some back-of-the-napkin math here, I think enabling this feature for en wiki plus de wiki is going to be a 4 times increase on the throughput we are seeing here (estimated from looking briefly at desktop pageview ratios), so at peak I think we are going to be looking at about 1000 reqs/sec.

That may still be plausible - recall that the estimate from our A/B tests (whose summary in the task description here got a bit muddled) was 700-800 events per second on average, not at peak time. (Also, as noted at T184793#4081754 it looks like the rate from the new instrumentation is a bit higher on huwiki than it was with the old instrumentation on enwiki and dewiki, but that may just be a natural difference between these wikis.)

It seems that the site sending this invalid data is an unofficial mirror. Will send an email to legal and CC you Jon.

Nuria added a comment. Edited Mar 27 2018, 4:44 PM

ticket created for "unofficial mirror" https://phabricator.wikimedia.org/T190843

Change 422206 had a related patch set uploaded (by Jdlrobson; owner: Jdlrobson):
[operations/mediawiki-config@master] Rollout VirtualPageViews (stage 3)

https://gerrit.wikimedia.org/r/422206

Nuria added a comment. Mar 27 2018, 6:37 PM

I think something to watch out for is the disk size on the eventlogging machine (not an immediate concern), as every incoming event is written to disk.
https://grafana.wikimedia.org/dashboard/file/server-board.json?orgId=1&var-server=eventlog1002&var-network=eno1&from=1519583793644&to=1522175793644

Looks like we have a way to go before that matters: there is almost 1 TB of space.

Filesystem                         Size  Used  Avail  Use%  Mounted on
udev                                32G     0    32G    0%  /dev
tmpfs                              6.3G  709M   5.6G   12%  /run
/dev/md0                            46G  2.8G    41G    7%  /
tmpfs                               32G     0    32G    0%  /dev/shm
tmpfs                              5.0M     0   5.0M    0%  /run/lock
tmpfs                               32G     0    32G    0%  /sys/fs/cgroup
/dev/mapper/eventlog1002--vg-data  870G   90G   737G   11%  /srv
tmpfs                              6.3G     0   6.3G    0%  /run/user/4193

Tbayer added a subscriber: mforns. Edited Mar 27 2018, 10:50 PM

Let's keep in mind that this EventLogging data is just an intermediate step, with the aggregate table that @mforns and @Ottomata are building in T186728 being the end product. The purging policy for this schema (which I wrote up yesterday) currently is to discard all data after the minimum time; if disk space still is a concern even for 90 days' worth of data, we could think about a custom shorter time limit for this schema.

Change 422206 merged by jenkins-bot:
[operations/mediawiki-config@master] Rollout VirtualPageViews (stage 3)

https://gerrit.wikimedia.org/r/422206

Mentioned in SAL (#wikimedia-operations) [2018-03-27T23:18:28Z] <ebernhardson@deploy1001> Synchronized wmf-config/InitialiseSettings.php: SWAT: T189906: (duration: 00m 55s)

Jdlrobson removed Jdlrobson as the assignee of this task. Mar 27 2018, 11:47 PM
Jdlrobson removed a project: Patch-For-Review.

I enabled VirtualPageViews today for more wikis. There's only a few to go: German, English, Spanish, Italian, Portuguese, Polish and Chinese.
There's a tiny spike that suggests this worked, but we'll need more data/time to see how much it peaks at. @pmiazga could I ask you to check in on this graph first thing your morning?

If disk space still is a concern even for 90 days' worth of data, we could think about a custom shorter time limit for this schema.

My comments were not about hadoop, sorry. I was talking about something else.

Graph is ok, it's less than 300 events per second

Yeh there's a definite increase in rate but still healthy peaking at 320. Given English and German are disabled by default for anons I think we should roll this out to all the remaining wikis. Does that sound good @Nuria? This would add traffic from Spanish, Italian, Portuguese, Polish and Chinese.

Nuria added a comment. Mar 28 2018, 7:35 PM

@Jdlrobson let's let it bake for a day and let's look at the errors we see, just like we did before.

Jdlrobson moved this task from To Do to Doing on the Readers-Web-Kanbanana-Board-Old board.

Hardly any errors other than bots. I will SWAT this either tomorrow or Monday (likely Monday at this rate since Friday is a holiday)

Nuria added a comment. Mar 29 2018, 5:44 AM

I looked at logs and graphs and I think we should be ok to launch for all but DE and EN wiki.

The patch above does that, but it will also turn it on for DE and EN wiki. Given the feature is not live there, that will not generate high volumes of traffic. Is that okay with you @Nuria?

Change 423047 had a related patch set uploaded (by Jdlrobson; owner: Jdlrobson):
[operations/mediawiki-config@master] Rollout VirtualPageViews (final stage)

https://gerrit.wikimedia.org/r/423047

Scheduled for 11am Monday.

@Jdlrobson : let's please not change anything on Monday as it is a major EU holiday, rather let's launch further on Tuesday.

@Nuria I've moved this to tomorrow 4pm PDT. If @pmiazga can deploy it earlier in the day however that would be better. Let's keep discussion here about the roll out not T184793.

Change 423047 merged by jenkins-bot:
[operations/mediawiki-config@master] Rollout VirtualPageViews (final stage)

https://gerrit.wikimedia.org/r/423047

Mentioned in SAL (#wikimedia-operations) [2018-04-03T23:10:15Z] <dereckson@tin> Synchronized wmf-config/InitialiseSettings.php: Rollout VirtualPageViews (final stage) (T189906) (duration: 01m 19s)

Okay, so the SWAT is done. That means it's enabled everywhere, but it only reaches a limited audience on German and English, given that the feature is more or less disabled there.

I'm seeing a spike in events per second from 85 to 182 that suggests this is working.


https://grafana.wikimedia.org/dashboard/db/eventlogging-schema?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-schema=VirtualPageView&from=now-2d&to=now

The graph has been a pretty steady curve so far, so I hope at peak we'll be seeing around 550 events tops.
Errors have also gone up. I'll have a look at those tomorrow.

@pmiazga just like last time can you check in on these graphs first thing tomorrow?

At 7:40am UTC it reached ~350 events per second (it's ~100 events per second more than the day before)

Yesterday at that time graph reached the max value - getting ~330 events per second.
Now we're getting ~490 events per second which is still below 700 events per second limit.

cc @Nuria

Jdlrobson updated the task description. Apr 4 2018, 5:43 PM
Nuria added a comment. Apr 4 2018, 6:38 PM

@Jdlrobson Please take a look at errors and let me know

I'm one step ahead of you ;)

Jdlrobson reassigned this task from Jdlrobson to Nuria. Apr 4 2018, 7:09 PM

hey @Nuria @Ottomata the schema is enabled on all wikis now.
The traffic to this schema will increase tomorrow when T190188 rolls out on the train (our expectation is 10%) and again when T191101 happens, as this will increase the traffic to German and English. I've explicitly documented this in that task as part of the deploy process.

Given the current rate of events, it seems like we're on course for below 700 events per second. Are you feeling comfortable about that? We do have the option of a progressive roll out of T191101 so let me know ASAP if that will be needed.

I've looked into the bugs and they are all coming from the exact same 2 hosts we identified earlier.

I think we can resolve this task with your blessing. T184793 will remain open to capture any follow up work.

Nuria added a comment. Apr 4 2018, 9:39 PM

I thought previews were enabled for all users? I cannot see them when I am signed in on es.wikipedia... I guess they are disabled for signed-in users, correct?

Jdlrobson added a comment. Edited Apr 4 2018, 9:41 PM

@Nuria logged in users must opt in via Special:Preferences

Jdlrobson changed the task status from Open to Stalled. Apr 11 2018, 6:12 PM

I chatted with @Nuria and she points out that although it's available everywhere, this is not done until T191101 happens. Thus stalling.

Per recommendation from @Nuria I'm going to bump the anonymous opt in rate from 3% to 10% to get a better feel for how deploying to 100% of anons will impact VirtualPageView traffic. I'll do that either later today or tomorrow.
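The reasoning behind the bump can be sketched as a linear extrapolation. This assumes event volume scales in proportion to the share of anonymous users with previews enabled, which is an assumption rather than something measured in this thread; `extrapolateRate` and the 40 events/s figure are hypothetical.

```javascript
// Linear extrapolation: assumes events scale in proportion to the percentage
// of anonymous users who have previews enabled (an assumption, not a measurement).
function extrapolateRate(observedRate, observedPercent, targetPercent) {
  if (observedPercent <= 0) {
    throw new RangeError('observedPercent must be positive');
  }
  return observedRate * (targetPercent / observedPercent);
}

// Hypothetical: 40 events/s attributable to enwiki anons at a 10% rollout.
console.log(extrapolateRate(40, 10, 100)); // 400
```

A larger sample share makes the observed rate less noisy, which is presumably why 10% gives a better feel than 3% before going to 100%.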

Change 425588 had a related patch set uploaded (by Pmiazga; owner: Pmiazga):
[operations/mediawiki-config@master] Enable Page Previews for 10% enwiki anon users

https://gerrit.wikimedia.org/r/425588

Change 425588 merged by jenkins-bot:
[operations/mediawiki-config@master] Enable Page Previews for 10% enwiki anon users

https://gerrit.wikimedia.org/r/425588

Mentioned in SAL (#wikimedia-operations) [2018-04-12T13:15:12Z] <zfilipin@tin> Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:425588|Enable Page Previews for 10% enwiki anon users (T189906)]] (duration: 01m 18s)

@Nuria we just bumped the rate to 10% for anonymous users on English Wikipedia - you can expect a spike in the events-per-second graph.

Jdlrobson closed this task as Resolved. Apr 20 2018, 12:58 AM

This is rolled out everywhere now so I am resolving. Let me know if any problems with that. https://phabricator.wikimedia.org/T192622 captures the remaining work here!