webrequest data loss 2018-11-05 on upload partition
|operations/puppet : production||profile::cache::kafka::webrequest: filter out Timestamp:Restart logs|
Looking at data loss :
hive (wmf_raw)> select * from webrequest_sequence_stats_hourly where webrequest_source="upload" and year=2018 and month=11 and day=5 and hour in (13,14,15) ;
returns data loss starting at hour 13
hive (wmf_raw)> select percent_lost from webrequest_sequence_stats_hourly where webrequest_source= "upload" and year=2018 and month=11 and day=5 and hour in (13,14,15) ;
More precise counts per host:
select hostname, count_different, percent_different from webrequest_sequence_stats where webrequest_source= "upload" and year=2018 and month=11 and day=5 and hour in (13) order by count_different limit 10000000 ;
Checking records on webrequest (wmf_raw) without dt we get:
select * from webrequest where year=2018 and month=11 and day=5 and hour=13 and dt="-" and webrequest_source='upload' limit 10;
These returns a bunch of records with accept header WebP
- webrequest_upload is getting more requests that it has for a while:
But some of those requests are defective, seems we are double logging some requests
@BBlack: could it be possible that the extra fetch for the .webp variant (when not there) adds a log in the Varnish shared memory log that gets picked up by Varnishkafka? In this case it might not have a timestamp field, generating errors for us (every time that a hot image is fetched and its .webp variant is not there).
So I confirmed with
sudo varnishlog -n frontend -q 'ReqUrl ~ .*webp' (on cp3035)
that the .webp fetch that fails logs Timestamp:Restart and not Timestamp:End.
Example that I followed:
- Request for /wikipedia/commons/thumb/b/b4/Flag_of_Turkey.svg/22px-Flag_of_Turkey.svg.png, that doesn't have the .webp variant. Ends with Timestamp:Restart
- Right after, same request for 22px-Flag_of_Turkey.svg.png but correctly ending with Timestamp:End.
So my solution would be to filter out requests with Timestamp:Restart at varnishkafka level.
Exactly, the way the webp support works if that it restarts the Varnish transaction after rewriting the request URL. This is a Varnish feature that hasn't been used much in production before. Sorry for the mess it created in the kafka pipeline, I had not anticipated that it could cause something like this. We can easily turn the feature off if you guys have some fixing to do, and turn it back on later.
We can easily turn the feature off if you guys have some fixing to do, and turn it back on later.
Thank you. We are going to try to see if our fix (which is ignoring some requests) fixes issue, we just deployed it and will monitor fleet for the next hour. If it were not to work we will ping you.
To keep archive happy: Failed refinement hours were 13->19 (included) on webrequest_source = 'upload', 2018-11-05.
I refined them manually 2018-11-06 and marked them as done manually as well.