Some VirtualPageView are too long and fail EventLogging processing
Closed, ResolvedPublic1 Story Points

Description

Under a 1% of VirtualPageViews are being dropped due to long URIs. We aim to shorten the URL length by removing source_url from the events.

This seems to happen because the encoded source_title is way too long, and exceeds the varnish URL size. You can see that the event data is truncted before it is finished.

Jun 11 14:05:05 eventlog1002 eventlogging-processor@client-side-06[11969]: 2018-06-11 14:05:05,376 [11969] (MainThread) Unable to process: ?%7B%22event%22%3A%7B%22source_page_id%22%3A346904%2C%22source_namespace%22%3A4%2C%22source_title%22%3A%22%E1%83%95%E1%83%98%E1%83%99%E1%83%98%E1%83%A1_%E1%83%A3%E1%83%A7%E1%83%95%E1%83%90%E1%83%A0%E1%83%A1_%E1%83%AB%E1%83%94%E1%83%92%E1%83%9A%E1%83%94%E1%83%91%E1%83%98%2F%E1%83%AB%E1%83%94%E1%83%92%E1%83%9A%E1%83%94%E1%83%91%E1%83%98%E1%83%A1_%E1%83%A1%E1%83%98%E1%83%90%2F%E1%83%99%E1%83%90%E1%83%AE%E1%83%94%E1%83%97%E1%83%98%2F%E1%83%A1%E1%83%90%E1%83%92%E1%83%90%E1%83%A0%E1%83%94%E1%83%AF%E1%83%9D%E1%83%A1_%E1%83%9B%E1%83%A3%E1%83%9C%E1%83%98%E1%83%AA%E1%83%98%E1%83%9E%E1%83%90%E1%83%9A%E1%83%98%E1%83%A2%E1%83%94%E1%83%A2%E1%83%98%22%2C%22source_url%22%3A%22https%3A%2F%2Fka.wikipedia.org%2Fwiki%2F%25E1%2583%2595%25E1%2583%2598%25E1%2583%2599%25E1%2583%2598%25E1%2583%259E%25E1%2583%2594%25E1%2583%2593%25E1%2583%2598%25E1%2583%2590%3A%25E1%2583%2595%25E1%2583%2598%25E1%2583%2599%25E1%2583%2598%25E1%2583%25A1_%25E1%2583%25A3%25E1%2583%25A7%25E1%2583%2595%25E1%2583%2590%25E1%2583%25A0%25E1%2583%25A1_%25E1%2583%25AB%25E1%2583%2594%25E1%2583%2592%25E1%2583%259A%25E1%2583%2594%25E1%2583%2591%25E1%2583%2598%2F%25E1%2583%25AB%25E1%2583%2594%25E1%2583%2592%25E1%2583%259A%25E1%2583%2594%25E1%2583%2591%25E1%2583%2598%25E1%2583%25A1_%25E1%2583%25A1%25E1%2583%2598%25E1%2583%2590%2F%25E1%2583%2599%25E1%2583%2590%25E1%2583%25AE%25E1%2583%2594%25E1%2583%2597%25E1%2583%2598%2F%25E1%2583%25A1%25E1%2583%2590%25E1%2583%2592%25E1%2583%2590%25E1%2583%25A0%25E1%2583%2594%25E1%2583%25AF%25E1%2583%259D%25E1%2583%25A1_%25E1%2583%259B%25E1%2583%25A3%25E1%2583%259C%25E1%2583%2598%25E1%2583%25AA%25E1%2583%2598%25E1%2583%259E%25E1%2583%2590%25E1%2583%259A%25E1%2583%2598%25E1%2583%25A2%25E1%2583%2594%25E1%2583%25A2%25E1%2583%2598%22%2C%22page_title%22%3A%22%E1%83%92%E1%83%98%E1%83%9D%E1%83%A0%E1%83%92%E1%83%98%E1%83%AC%E1%83%9B%E1%83%98%E1%83%9C%E1%83%93%E1%83%90%22%2C%22page_id%22%3A35638%2C%22page_namespace%22%3A0%7D%2C%22revision%22%3A17780078%2C%22schema%22%3A%22VirtualPageV

Acceptance criteria

Developer notes

The character limit is calculated in https://phabricator.wikimedia.org/T196904#4303878

Sign off steps

  • After deployment we'll want to make sure the error rate is negligable. If not, we may want to consider relying on page ids and dropping the title as well (will not be done as part of this task).
    • As @mforns says in T196904#4422959, the error rate will never be 0 as other potentially-long properties may cause the encoded event to overflow the limit.

QA steps

On beta cluster with page previews enabled please visit https://en.wikipedia.beta.wmflabs.org/wiki/Special:AllPages. When a popup is visible for more than 1 second the VirtualPageView event should be triggered in the network tab. (https://meta.wikimedia.org/wiki/Schema:VirtualPageView )

URL should be similar to this:
https://en.wikipedia.beta.wmflabs.org/beacon/event?%7B%22event%22%3A%7B%22source_page_id%22%3A0%2C%22source_namespace%22%3A-1%2C%22source_title%22%3A%22AllPages%22%2C%22source_url%22%3A%22https%3A%2F%2Fen.wikipedia.beta.wmflabs.org%2Fwiki%2FSpecial%3AAllPages%22%2C%22page_title%22%3A%220.18482905188547072-%C3%B6%C3%A4%C3%BC-%E2%99%A0%E2%99%A3%E2%99%A5%E2%99%A6%22%2C%22page_id%22%3A191083%2C%22page_namespace%22%3A0%7D%2C%22revision%22%3A17780078%2C%22schema%22%3A%22VirtualPageView%22%2C%22webHost%22%3A%22en.wikipedia.beta.wmflabs.org%22%2C%22wiki%22%3A%22enwiki%22%7D;

Verify that "en.wikipedia.beta.wmflabs.org%2Fwiki%2FSpecial%3AAllPages" appears in the string

Search for "1ვიკის უყვარს ძეგლები/ძეგლების სია/კახეთი/საგარეჯოს მუნიციპალიტეტი"
Hover over this one, check the above

When this is in production, we will look at the error logs to determine whether this is having the desired effect.

There are a very large number of changes, so older changes are hidden. Show Older Changes

I mentioned this to @mforns in chat, not sure if it works or not. Instead of truncating the url, could you just include the proto:://domain portion only? IIUC, that should be enough to get the project and language_variant.

Sure. That's a lot easier provided the rest of the information (including the long title) will still go under the required size. For completeness should we rename the field from "source_url" to "source_host" or will that cause too many headaches?

Would it help to sync for 15 minutes quickly next week to work out the best thing here? We have lots of options and happy to go forward with any of them!

Nuria added a comment.Jun 15 2018, 4:05 PM

that should be enough to get the project and language_variant.

The language variant is extracted from the uri like:

uriPath example with language variant /zh-hant/Wikipedia:首页
uriPath example with default language variant /zh/Wikipedia:首页

https://github.com/wikimedia/analytics-refinery-source/blob/master/refinery-core/src/main/java/org/wikimedia/analytics/refinery/core/PageviewDefinition.java

I think we can possible option is to send 1) language_variant from client (need to change schema) or 2) send url up to language variant, title is not needed

OHhh right 'variant'. Hm ok.

You could truncate everything after language variant?

I've scheduled a meeting for Wednesday to hammer out the details of this. I can also do tomorrow at same time if Wednesday doesn't work but I'm not sure if anyone is taking the WMF holiday off. Let me know if you are reading this and want an invite.

@Jdlrobson I'll be working both Tuesday and Wednesday if you want to change.

We had a chat about the different options here which were:

  1. Trim all URLs to under a certain length ensuring we keep variant.
  2. Identify after variant path in URL and trim after that e.g. https://zh.wikipedia.org/wiki/zh-hant/Wikipedia:首页 => https://zh.wikipedia.org/wiki/zh-hant/
  3. Drop source_url and send project and language variant as separate and new fields in the schema

4 Use webhost for project

  1. Encode all EventLogging requests as base64

Apparently 5 is not plausible - the base64 length is much greater than the existing length. 4 is tricky given all the alien webhosts we get and having to deal with "m." domains so we'd rather avid that. 2 and 3 require adding logic to the client or the aggregator that we'd rather avoid.

1 seems like the easiest option here so that will be the thing we try as the title will still be available in other fields - so source_url can be derived if necessary for trimmed URLs.
We're a little concerned that limiting the source_url might not be enough. @mhorns will be providing me a dump of the errors to investigate whether limiting the source_url is enough. If we also need to think about limiting titles we'll cycle back and talk through this again.

Ottomata added a comment.EditedJun 20 2018, 6:30 PM

We're a little concerned that limiting the source_url might not be enough. @mforns will be providing me a dump of the errors to investigate

This might be difficult to know, as the truncation is done by varnish, and we don't get anything beyond the truncated requests in the logs.

@Jdlrobson I put the virtualpageview error logs under stat1004.eqiad.wmnet:/home/mforns/virtualpageview_errors.log.
Let me know if you want me to copy it over somewhere else.

@Ottomata that's true.
Given the example error earlier, the EventLogging URLs I'm seeing are 2063-2176 code units long. The URL seems to be 1107 units long - so half of that. Trimming to 200 code units is more than enough to get just the base URI.

Looking at the errors in the log - it seems all the errors occur on strings of length 2123. This suggests this is the code length Varnish truncates at.

If this is correct the average length of all the erroring URLs is 2187 and the longest is 2241. ( If I look just at the length of the source_url, the longest the source_url could be is 1677 )

With that in mind, given the limit of 2123 and looking at the longest example - 2241 - 2123 = 188 characters too many, so truncating the longest source url by 200 characters = 1677 - 200 = 1477

Thus, I think given this sample we could probably get away with truncating the source_url to 1400 code points and salvage many of these errors. Sound good?

I used this code to check:

var fs = require('fs');
var txt = fs.readFileSync('/home/jdlrobson/virtualpageview_errors.log', 'utf-8' );
var avg = 0;
var lines = txt.split('\n');
var max = 0;
var maxSourceUrlLen = 0;
lines.forEach((line) => {
var newLine = line.slice(0, line.indexOf( 'VirtualPageView' ) ) + 'VirtualPageView%22%2C%22webHost%22%3A%22ka.wikipedia.org%22%2C%22wiki%22%3A%22kawiki%22%7D;';
var sourceUrl  = newLine.replace(/.*source_url%22(.+)%22page_title.*/, '$1')
if ( sourceUrl.length > maxSourceUrlLen ) maxSourceUrlLen = sourceUrl.length;
if ( newLine.length > max ) max = newLine.length;
        avg += newLine.length;
});
console.log(avg/lines.length, max, maxSourceUrlLen);
Jdlrobson changed the task status from Stalled to Open.Jun 20 2018, 8:47 PM
Jdlrobson added a project: Page-Previews.
Jdlrobson updated the task description. (Show Details)
Jdlrobson moved this task from Needs Analysis to Triaged but Future on the Readers-Web-Backlog board.
Jdlrobson updated the task description. (Show Details)

@Jdlrobson @Ottomata

Ha... I just thought that we might be ignoring lots of longer error logs...
When generating the error dumps, I grep'd for 'VirtualPageView'. The problem is that by design EL outputs the schema after all schema fields.
So, all events long enough to displace the schema outside of the varnish limit, will have not been caught by my grep.
:/

I guess I can identify VirtualPageView events by an earlier field that is specific to that schema? Maybe the same source_url will help.

mforns added a comment.EditedJun 20 2018, 9:22 PM

@Jdlrobson
Yea, many more errors when grepping for earlier fields...
My bad.
The new error dump is under stat1004.eqiad.wmnet:/home/mforns/virtualpageview_errors_corrected.log.

Jdlrobson updated the task description. (Show Details)Jun 21 2018, 5:44 PM
ovasileva set the point value for this task to 1.Jun 21 2018, 5:51 PM

@mforns running on the latest dump.

Looking at the errors in the log - it seems all the errors occur on strings of length 2123

This is still true.

Average error event length: 2208.861944312949
Longest error event length: 2213
Maximum length of source url: 1937

Thus, I think the proposal of limiting to 1400 characters looks fine to me.

@Jdlrobson
Looks good to me overall.
Now, could it be that the maximum length of the source_url is 1937 because longer source_urls are cut off by varnish?
If so, there's the possibility that we continue to have errors, even when reducing the source_url to 1400 chars in the client, no?

We could maybe apply some calculations from the client. We know that there's a practically static part of the event's payload: everything except page_title, source_title and source_url. We could approximate that statistically beforehand and get the upper bound constant, say we name it S. Also we know that varnish truncates events larger than 2000 chars. Then we can trim source_url to: L = 2000 - S - len(page_title) - len(source_title). This way we ensure the source_url is as long as possible, while we get the least amount of errors.

Now, could it be that the maximum length of the source_url is 1937 because longer source_urls are cut off by varnish?

Yep that's definitely possible.

We could maybe apply some calculations from the client.

I'm still a little cautious about adding additional logic to the client even a simple one. Am happy to be more aggressive with the cut off e.g. 1000 characters and then see what happens. How would we calculate S?

mforns added a comment.EditedJun 21 2018, 9:35 PM

I'm still a little cautious about adding additional logic to the client even a simple one. Am happy to be more aggressive with the cut off e.g. 1000 characters and then see what happens.

Yea, makes total sense. It's a very small proportion of errors to introduce much logic.

How would we calculate S?

I'd get a full encoded event and replace page_title, source_title and source_url by empty strings. This would give a good estimate. All other fields are either namespaces or page_ids. They should be of (practically) constant size. Namespaces are 4-5 digits long maximum, no? And page_ids should also be limited to a given length. So maybe I'd add a couple chars to account for these small variations.

Nuria added a comment.Jun 22 2018, 3:35 PM

Am happy to be more aggressive with the cut off e.g. 1000 characters and then see what happens.

This sounds totally fine.

Jdrewniak claimed this task.

Looking over the task description

Inside the Popups code, before sending a VirtualPageView event, ensure the source_url field is limited to 1400 characters.

I assume the length, 1400 characters (or 1000, from the conversation) refers to the uri encoded length of the source_url.

(i.e this %83%94%E1%83%A2%E1%83%98 not this ვიკის_უყვარს)

This encoding process actually takes place in the event logging codebase, not the Popups codebase.

Popups calls evLog.makeBeaconUrl here which actually invokes makeBeaconUrl here in eventLogging.

I don't think that we want to make this change in eventLogging, because that might have widespread consequences.

To do this in Popups, I think what we would have to do is first encode source_url with encodeURIComponent(), trim that value to be under 1000 char, then decodeURIComponent and send that value to eventLogging, to be encoded once more. If this is correct, then this is likely to fail, because we might chop off half of the utf8 encoding of a single character, resulting in an error.

decodeURIComponent( encodeURIComponent("ვიკის_უყვარს_ძეგლები/ძეგლების_სია/კახეთი/საგარეჯოს_მუნიციპალიტეტი").slice(0,50) ) 
> URIError: URI error

The alternative to that, I think, is to not call makeBeaconUrl from eventLogging, but replace it with our own function that does something similar, but chops off the encoded source_url.

@Jdlrobson, @phuedx I hope my assumptions here are totally wrong and this really is the 1 pointer it purports to be 🙏

@Jdrewniak, I think it'd be fine to modify the eventlogging codebase take a parameter to support chomping particular fields...although I see how that could get messy fast.

The alternative to that, I think, is to not call makeBeaconUrl from eventLogging, but replace it with our own function that does something similar, but chops off the encoded source_url.

Given how small and how independent of the EL internals the makeBeaconUrl method is, I can see an argument for writing our own in the Page Previews codebase.

We'd be coupling to the EventLogging protocol ("make an HTTP request to this endpoint with a URI encoded JSON-blob and you're good") rather than to the EventLogging JavaScript API, which means we'll have to put a little more effort into tracking upstream changes. However, I'm pretty sure that we're doing that anyway 😉

Ya fine with me too. A lot of these bits will likely get refactored as part of T185233, but the /beacon endpoint is unlikely to change.

What you say sounds right. That said the 1000 char limit does not need to be exact and consistent and the URIError can be avoided by searching for the last % character and trimming, no?

I think talk of rolling our own EventLogging code may be a little premature here.

URIError can be avoided by searching for the last % character and trimming, no?

I'm looking for a way to slice this cleanly, but it doesn't seem to be as simple as trimming the last %.
from MDN: "encoded characters can be represented by 1, 2, 3 or 4 escape sequences".

%E3%83%9A - %E3%83%87 - %E3%82%A3 - %E3%82%A212 => ペ - デ - ィ - ア
%D0%B5 - %D0%B4 - %D0%B8 - %D1%8E => е - д - и - ю
%2F - %3A - %2F - %26 => / - : - / - &

'http://url...ウィキペディア'.split('').reduce( (url, x) => {
    return (url.length < 1000) ? url += encodeURIComponent(x) : url;  
}, '' )

Do I get a gold star for using reduce?

Jdlrobson updated the task description. (Show Details)Jun 28 2018, 9:45 PM
Jdlrobson reassigned this task from Jdrewniak to ABorbaWMF.
Jdlrobson added subscribers: ABorbaWMF, Jdrewniak.

https://gerrit.wikimedia.org/r/442740 has been merged.

Let me know if you need any help testing this one @ABorbaWMF

Vvjjkkii renamed this task from Some VirtualPageView are too long and fail EventLogging processing to r9aaaaaaaa.Jul 1 2018, 1:04 AM
Vvjjkkii removed ABorbaWMF as the assignee of this task.
Vvjjkkii updated the task description. (Show Details)
Vvjjkkii removed the point value for this task.
Vvjjkkii removed a subscriber: Aklapper.
CommunityTechBot set the point value for this task to 1.Jul 2 2018, 8:23 AM
CommunityTechBot updated the task description. (Show Details)
CommunityTechBot assigned this task to ABorbaWMF.
CommunityTechBot renamed this task from r9aaaaaaaa to Some VirtualPageView are too long and fail EventLogging processing.
CommunityTechBot added a subscriber: Aklapper.

Heads up @mforns @Ottomata this is in 1.32.0-wmf.12. Deploys are a bit delayed but group0 will be live with this change on Tue July 10th.

Verified on Beta both through Special: AllPages and on regular articles that:

  1. When a page preview is visible for more than 1 second the VirtualPageView event is getting triggered in the network tab.
  1. Request URL is also showing up correctly, particularly for long source_title, it is not getting truncated.
  1. Source_url is also specified as "https://en.wikipedia.beta.wmflabs.org/wiki/Special:AllPages" properly.
ovasileva reassigned this task from ABorbaWMF to Ottomata.Jul 9 2018, 5:06 PM
ovasileva added a subscriber: ovasileva.

This looks good from our side. Over to you @Ottomata for any last checks and signoff.

This will be live tomorrow on mediawiki.org and then everywhere Thursday. Let's check in when this happens to check this is working as expected

Jdrewniak reassigned this task from Ottomata to phuedx.Jul 12 2018, 5:07 PM

Verified this issue on production. Looks good to me. However, found another issue with page previews while testing this, filed as T199508.

phuedx reassigned this task from phuedx to Ottomata.Jul 13 2018, 12:57 PM

I don't have access to eventlog1002 so I don't think that I can sign this off. Over to you for now, @Ottomata!

phuedx added a subscriber: elukey.Jul 13 2018, 1:18 PM

@elukey was good enough to check the logs on eventlog1002:

14:00:18 <elukey> do you need anything to check on eventlog1002?
14:01:09 <phuedx> yeah -- there was a low-frequency processing error occurring where a VirtualPageView event couldn't be processed because it was too long
14:01:37 <phuedx> we (readers web) deployed a fix for this as part of yesterday's train and i wanted to check if the error had disappeared
14:01:52 <phuedx> i checked the event.eventerror table but to no avail
14:06:00 <elukey> checking in the logs
14:06:05 <phuedx> <3
14:07:28 <elukey> so I am grepping grep -rni 'Unable to process.*VirtualPageView' eventlogging-processor@client-side-*
14:07:41 <elukey> and sort by time
14:07:45 <elukey> let's see the last occurrence
14:08:48 <elukey> so I see one at Jul 13 13:06
14:08:57 <elukey> UTC time, so basically now
14:09:31 <elukey> (checking the full entry in the logs)
14:09:54 <elukey> phuedx: yeah it seems like the one posted in the task
14:10:18 <phuedx> just gone over the ticket to find that the error was occurring ~2 times a minute
14:11:00 <phuedx> elukey: there's not much that we can do about clients that haven't refreshed and gotten the latest version of page previews. maybe i'll check back in a couple of days too?
14:11:41 <elukey> phuedx: yeah sure, I have the grep handy in my history so I can quickly report anomalies
14:11:52 <phuedx> ta <3
14:11:53 <elukey> let's recheck on monday
14:11:55 <elukey> :)

It looks like the error is still occurring. One explanation for this is that not all clients have updated yet and so will still be logging invalid events.

Let's grab a full day's error log on Monday and determine if the error rate has dropped or not.

@phuedx @elukey
I think a minimal amount of errors is expected.
AFAIK we only shorten the source_url, but there are other potentially long fields like source_title, that combined still can make the event overflow the max 2000 chars.
So, the errors should be a lot less frequent but not disappear, no?

phuedx added a comment.EditedJul 13 2018, 3:51 PM

So, the errors should be a lot less frequent but not disappear, no?

Agreed. We can test this hypothesis on Monday (or today, if you're so inclined!).

@elukey: As discussed, could you grab today's VirtualPageView-related processing errors from eventlog1002 and put them somewhere accessible like @mforns did in T196904#4303746?

Nuria added a comment.EditedJul 16 2018, 6:33 PM

@phuedx developers with access to stats machines (i think most (all?) of your team) can also pull errors, you can see those in real time

kafkacat -C -b kafka-jumbo1001.eqiad.wmnet -t eventlogging_EventError

https://wikitech.wikimedia.org/wiki/Analytics/Systems/EventLogging/Administration#Check_logs_and_errors

Let me know if that works, i cannot see any errors for virtualpageviews that have to do with url count

Excellent! Thanks, @Nuria. I'll try that out now.

I captured just over an hours worth of data between 12:22 and 13:29 UTC using [0] and found the following errors:

cat eventlogging_EventError_VirtualPageView.log | cut -d '"' -f 10 | sed 's/: .*//' | sort | uniq -c
      9 Expecting ',' delimiter
      1 Expecting ':' delimiter
      4 Expecting object
      1 Expecting property name enclosed in double quotes
     12 Extra data
      9 No JSON object could be decoded
     22 Unterminated string starting at
      2 end is out of bounds

That's 60 events that couldn't be processed from an hour in which we were receiving an average of 888 events/s, i.e. an error rate of 0.00188% vs the 0.00538% in T196904#4279274.

[0]
kafkacat -q -C -b kafka-jumbo1001.eqiad.wmnet -t eventlogging_EventError | grep VirtualPageView > eventlogging_EventError_VirtualPageView.log
phuedx claimed this task.Jul 17 2018, 2:18 PM
phuedx updated the task description. (Show Details)Jul 17 2018, 2:22 PM
phuedx updated the task description. (Show Details)Jul 17 2018, 2:41 PM
Nuria added a comment.Jul 17 2018, 4:17 PM

That error rate is, ahem, nothing , I bet is higher in any other schema. Let's close ticket.

phuedx closed this task as Resolved.Jul 17 2018, 5:00 PM

The error rate of ~0.00188% is borne out by another two hours of data collected between 2:00 and 3:58 UTC.

Restricted Application added a project: User-Ryasmeen. · View Herald TranscriptJul 17 2018, 5:00 PM