Mon, May 20
Ian Marlier participated in the site maps project as a personal initiative, but it has always been out of scope for the Performance Team. And his knowledge of that project left with him, so we're no better equipped than anyone else to do something about this.
Is CPT going to look into this? It seems more within CPT's scope than Performance's.
Wed, May 15
gilles@stat1005:~$ apt-cache policy python-thumbor-wikimedia
python-thumbor-wikimedia:
  Installed: (none)
  Candidate: 2.5-1+deb10u1
  Version table:
     2.5-1+deb10u1 1001
       1001 http://apt.wikimedia.org/wikimedia buster-wikimedia/main amd64 Packages
Tue, May 14
The upward break is actually a lot more visible for onLoad, and there the timing is clearer, looking very much like it coincides with the backport:
Looking at the SAL around the time of the mobile spike, I noticed that @Krinkle backported a startup module change:
The week-to-week difference for Mobile is quite dramatic:
I'd like to understand this bug better before rolling back the package for coal. It's not a big deal per se if coal is a little behind events. While there is "consumer lag" (is that the number of uncommitted events more recent than the last event committed by the client?), it doesn't seem to be growing. In practice, when you go to https://performance.wikimedia.org/ the metrics (processed by coal) look quite up to date. If there is indeed real lag, it doesn't seem to be an impactful amount.
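For reference, consumer lag as usually defined is how far a group's committed offset trails the log-end offset, per partition. A minimal sketch (generic illustration with made-up offsets, not coal's actual code):

```python
# Consumer lag per partition: log-end offset minus the group's committed offset.
# Purely illustrative numbers; in practice these come from the Kafka broker.

def consumer_lag(end_offsets, committed_offsets):
    """Return lag per partition: log-end offset minus committed offset."""
    return {
        partition: end_offsets[partition] - committed_offsets.get(partition, 0)
        for partition in end_offsets
    }

# Hypothetical offsets for three partitions of a topic:
end = {0: 1200, 1: 980, 2: 1105}
committed = {0: 1200, 1: 950, 2: 1100}
print(consumer_lag(end, committed))  # {0: 0, 1: 30, 2: 5}
```

If that number stays flat rather than growing, the consumer is keeping pace, just at a fixed distance behind.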
I added a short blurb about backfilling data on the runbook.
When using start-timestamp while the primary instance is running, you simply need to use a different consumer group. Otherwise the command fails anyway, complaining that there's already a subscriber for that consumer group. So the manual backfill and the ongoing process don't interfere with each other.
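The backfill pattern above can be sketched as: join under a throwaway group id, then seek each partition to the first offset at or after the start timestamp. This mirrors kafka-python's offsets_for_times()/seek() API, but is shown with a stub consumer so the logic is self-contained; the partition name and offsets are made up.

```python
# Sketch: seek every assigned partition to the earliest offset at/after a
# start timestamp (milliseconds). Works with any client exposing the
# kafka-python-style assignment()/offsets_for_times()/seek() methods.

def seek_to_timestamp(consumer, start_ms):
    """Seek each assigned partition to the earliest offset >= start_ms."""
    partitions = consumer.assignment()
    offsets = consumer.offsets_for_times({tp: start_ms for tp in partitions})
    for tp, offset_ts in offsets.items():
        if offset_ts is not None:  # None: no message at/after start_ms
            consumer.seek(tp, offset_ts.offset)

class StubOffset:
    def __init__(self, offset):
        self.offset = offset

class StubConsumer:
    """Stands in for a KafkaConsumer created with a distinct group_id."""
    def __init__(self):
        self.positions = {}
    def assignment(self):
        return {"navtiming-0"}  # hypothetical topic-partition
    def offsets_for_times(self, timestamps):
        return {tp: StubOffset(42) for tp in timestamps}
    def seek(self, tp, offset):
        self.positions[tp] = offset

consumer = StubConsumer()
seek_to_timestamp(consumer, 1_557_800_000_000)
print(consumer.positions)  # {'navtiming-0': 42}
```

Because the backfill consumer commits under its own group, its offsets never collide with the primary instance's.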
I think the retention is worse than we thought, so it looks like it's too late.
It doesn't seem to be working: I've had it running for a while and it's still not filling the gap. This will require manual investigation. Or maybe it's already too late and the oldest data in Kafka is too recent?
Ah, the schema option is "append", so I'm bound to waste time processing the other schemas...
I don't see a gap anymore for coal.saveTiming, but I still do for coal.firstPaint, so I'm going to restart the command with it looking only at the NavigationTiming schema, so that it doesn't waste time reprocessing already-processed events from the other schemas:
Currently attempting to reprocess that timeframe with the following command:
Yes, having the thumbnails you wanted primed on a dedicated page visited once works just fine.
@Miriam I've shared Filippo's google doc that contains the data with you.
Mon, May 13
Thu, May 2
I've discovered yet another bug/big shortcoming, this time on desktop. The Multimedia Viewer bottom panel scroll animation is another source of a ton of small LayoutJank events. I've filed another bug about that: https://bugs.chromium.org/p/chromium/issues/detail?id=958828
As it stands, the bug experienced on the mobile site is creating so much noise that it's pretty much useless to look at that data, which is probably why all the top offenders in terms of summed fraction for a pageview were from the mobile site. I'll have to redo the investigation of worst offenders while looking only at the desktop site.
While investigating worst offenders in terms of summed fraction, I discovered 2 bugs (or at least very big shortcomings) in the API, filed upstream:
Wed, May 1
Indeed, the observer can work without the reporting endpoint. This will be sufficient for this origin trial. And we have a pretty good idea now of what kind of pipeline we'll need when we want to collect actual Reporting API reports.
We might even be able to disable the endpoint for now, I'll check if the observer gets the report without the report-to header being set.
Hah, actually a workaround is to set up a ReportingObserver, and then we could ship the reports with EventLogging. Obviously this only works for reports like sync-xhr where the page still works. But this allows us to look at sync-xhr without having to set up the whole infrastructure required to capture all report types via the reporting endpoint.
The API is definitely working, we're getting reports of sync XHRs. Unfortunately, since the reports are sent as POSTs, varnishlog alone can't let us inspect the contents, as it's unable to look at POST request bodies (because those aren't cacheable, by definition). In order to see what we're actually getting, we need to set up a Varnish backend for this.
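Whatever backend ends up behind Varnish, the parsing side is simple: Reporting API payloads arrive as a POST whose body is a JSON array of report objects, each with a "type", "url" and a type-specific "body". A minimal sketch of the endpoint-side parsing; the sample payload below is illustrative, not a captured report:

```python
# Sketch: extract (type, url) pairs from a Reporting API payload
# (Content-Type: application/reports+json). The sample report is made up.

import json

def parse_reports(raw_body):
    """Return (type, url) pairs for each report in a reports+json payload."""
    return [(r.get("type"), r.get("url")) for r in json.loads(raw_body)]

sample = json.dumps([{
    "type": "feature-policy-violation",
    "age": 10,
    "url": "https://ru.wikipedia.org/wiki/Example",
    "body": {"feature": "sync-xhr", "disposition": "report"},
}])
print(parse_reports(sample))
# [('feature-policy-violation', 'https://ru.wikipedia.org/wiki/Example')]
```

That's roughly what the eventual pipeline would have to do per request before handing reports off for aggregation.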
Submitted an upstream issue: https://github.com/w3c/webappsec-feature-policy/issues/305
Now looking at UC Browser on iOS.
Tue, Apr 30
Starting with Opera Mini, which requires enabling the "Mini" option (by default it's using "Turbo"), I see that:
- They disable JS and reprocess images (default).
- Their "one column" option completely messes with the layout.
This required adding to Beta some hiera values that were recently added to production.
You can find the data in /home/gilles/articlequality/datasets on stat1007
I'm swamped with my main work on the Performance team, so this has been on the backburner, sorry. If someone else is keen to pick up work on it, I'm more than happy to point to what's been done so far.
With coal logging fixed and the python-kafka library updated, I think the issue that caused the breakage should be fixed. If not, we'll be better prepared to understand it next time it happens.
@Ottomata all our services are good now, you can go ahead with upgrading EventLogging and Hadoop.
I've tracked down the root cause of the issue: https://github.com/dpkp/kafka-python/issues/1774
Fixed navtiming for now, I'll investigate further to make sure that this is a proper fix and not a hack. Right now I'm not sure that the metadata call happens before we call partitions_for_topic.
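The defensive pattern in question can be sketched like this: partitions_for_topic() can return None when cluster metadata hasn't been fetched yet, so retry after forcing a refresh rather than assuming metadata is populated. Shown with a stub client so it's self-contained; with kafka-python, a KafkaConsumer's topics() call is one way to trigger a metadata fetch, and the topic name here is hypothetical.

```python
# Sketch: retry partitions_for_topic(), forcing a metadata refresh between
# attempts, instead of trusting that metadata arrived before the first call.

import time

def partitions_with_retry(client, topic, attempts=5, delay=0.0):
    """Return partitions for topic, refreshing cluster metadata as needed."""
    for _ in range(attempts):
        partitions = client.partitions_for_topic(topic)
        if partitions is not None:
            return partitions
        client.topics()  # side effect: fetch/refresh cluster metadata
        time.sleep(delay)
    raise RuntimeError("no metadata for topic %r" % topic)

class StubClient:
    """Returns None until metadata has been 'fetched' once."""
    def __init__(self):
        self.metadata_ready = False
    def partitions_for_topic(self, topic):
        return {0, 1} if self.metadata_ready else None
    def topics(self):
        self.metadata_ready = True

print(partitions_with_retry(StubClient(), "eventlogging_NavigationTiming"))
# {0, 1}
```

If the real fix turns out to be ordering the metadata call correctly, this retry loop is just a belt-and-braces safeguard on top.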
It seems to be breaking navtiming (coal is fine, though):
Mon, Apr 29
Is private-dev problematic because it's a WMCS VM? @MoritzMuehlenhoff do you have any clue?
I've deployed the Stretch version of 3d2png on both hosts, and that file renders fine.
I'm going to guess that this host has an old version of 3d2png installed with node binaries compiled against Jessie instead of Stretch. libpng12-0 doesn't exist in Stretch: https://packages.debian.org/jessie/libpng12-0
Installed nodejs-legacy manually without issue on that host. Not sure why Puppet failed to do so. I see that the package was already installed properly on the -02 host.
Thumbor appears to be running fine on that host right now, and I see the private-dev rule in /etc/firejail/thumbor.profile
Fri, Apr 26
Yes, both 3D rendering of STL files and "smart cropping" (face & feature detection).
I think it's important for us to know how these work exactly; otherwise we can't serve them a specific banner. We also need to know whether they respect Cache-Control: no-transform.
Thu, Apr 25
Why not just upgrade to a newer version? Surely it's not the only bugfix in the past year.
I inspected the code carefully and couldn't find what might have gone wrong with our code.
Wed, Apr 24
Let's look at the distribution by number of layout jank events per pageview, for pageviews that get jank.
With the purging and the time that has gone by, we should be at a point where 100% of desktop ruwiki pageviews are getting the origin trial. It looks like 99.5% of our desktop pageviews get layout jank:
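A distribution like the one above can be sketched with a pair of Counters: one counting events per pageview, then one bucketing pageviews by that count. The event stream here is made up; in practice it would come from the origin-trial reports.

```python
# Sketch: bucket pageviews by how many LayoutJank events each one produced.
# Pageviews with zero events simply never appear in the stream, so they are
# naturally excluded. Pageview ids below are hypothetical.

from collections import Counter

def jank_distribution(events):
    """events: iterable of pageview ids, one entry per LayoutJank event."""
    per_pageview = Counter(events)          # pageview id -> event count
    return Counter(per_pageview.values())   # event count -> number of pageviews

events = ["pv1", "pv1", "pv2", "pv3", "pv3", "pv3"]
print(sorted(jank_distribution(events).items()))  # [(1, 1), (2, 1), (3, 1)]
```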
@jijiki this is all good to go, successfully built on buster.thumbor.eqiad.wmflabs. python-thumbor-community-core and thumbor need tiny patches (you'll find them in there), and python-thumbor-wikimedia needs the upgrade/patch above. Manhole builds as-is.