Sat, May 27
I can try writing that this summer; right now I'm swamped until the end of the quarter.
Fri, May 26
It's already collected by a cookie that makes it into EventLogging; it's just not exposed in Grafana.
@BBlack one way to verify that the performance improvement we're seeing is "real" would be to turn BBR off for a bit. That being said, it will still be a simulated slow connection and that alone doesn't tell us the effect in the real world, if any.
Could be, yes. The BBR improvement could be verified by turning it off. Let's discuss that on the BBR task.
Trying it out in my own browser, it does look like a deliberate change, as the first section is nicer to read at the top than the infobox on a width-constrained browser.
Same for Sweden. Before:
Something that might be noteworthy, looking at the Facebook article.
There is an apparent performance improvement that coincides in timing, but on a simulated slow internet connection:
No visible impact in RUM (looking at firstPaint in NavigationTiming). But if it only affects large articles, it's not that surprising, as we saw with the logo preload.
Looking at the code, it seems like an iframe for YouTube, and Commons does have a URL that can be put in an iframe, so that's perfect. I think we need the actual Commons video player because:
Fair enough, re: our information leaking to google. I think that's a clear deal breaker.
It does, thanks.
Wed, May 24
It still states:
We do host all our recorded meetings on YouTube in addition to Commons, so there would be some convenience in being able to link to that... We should ask the people who clean up/moderate Phabricator spam if they'd be OK with the risk of opening that up. @Aklapper what do you think?
refreshFileHeaders is still dreadfully slow even with batching and not needing to look at metadata, unfortunately. I think I'll focus on implementing the fallback in Thumbor first...
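To make the idea concrete, here's a minimal sketch of what that fallback could look like on the Thumbor side, in Python. The header lookup, the "w,h" serialization format, and the function name are all assumptions for illustration, not the actual implementation:

```python
from io import BytesIO

from PIL import Image  # Pillow; assumed to be available alongside Thumbor


def dimensions_for(swift_headers, original_bytes):
    """Prefer the pre-computed Swift header; fall back to reading the file.

    The "w,h" format for X-Content-Dimensions is a guess for illustration.
    """
    header = swift_headers.get('X-Content-Dimensions')
    if header:
        width, height = (int(part) for part in header.split(','))
        return width, height
    # Fallback for files the migration hasn't reached yet: open the
    # original and read its dimensions directly.
    with Image.open(BytesIO(original_bytes)) as image:
        return image.size
```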
Another option is allowing YouTube embedding...
Well, a higher limit (or unlimited in that case) could require special rights. But I imagine that requires upstream work, right?
Possibly a regression, then; both .ogv and .webm files I try show an error when the upload reaches 100%:
Tue, May 23
Sun, May 21
Assuming the above change works and we only need to run refreshFileHeaders for the migration, let's estimate how long it would take for commons if we simply ran the migration script on Terbium.
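As a back-of-envelope framing (the figures below are placeholders, not measurements; the real file count and per-file throughput on Terbium would have to be plugged in):

```python
# Placeholder figures, for illustration only.
total_files = 40_000_000     # hypothetical order of magnitude for Commons
files_per_second = 10        # hypothetical refreshFileHeaders throughput on Terbium

seconds = total_files / files_per_second
print(f"~{seconds / 86_400:.0f} days")  # ~46 days with these placeholder numbers
```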
Sat, May 20
I've just realized that the existing migration scripts probably don't care about oldimage... which means that those wouldn't get the X-Content-Dimensions header. The cost of the migration thus gets even worse, since thumbnails can be generated for those old revisions.
Since the current migration scripts are very slow on Terbium when applied to all file types, we probably have to start looking at alternatives:
Fri, May 19
Wed, May 17
I've checked and PdfHandler already handles rotation like a champ, all the way to X-Content-Dimensions. I think the change I've made in core for JPGs is all we need to do, as the remaining formats I haven't mentioned here don't have metadata rotation features.
TIFF support for orientation on Linux seems to be mostly broken, including in MediaWiki. Any orientation that should swap dimensions (e.g. 90 CW) is interpreted as a different orientation value that doesn't swap dimensions. In this sea of brokenness, this means that for X-Content-Dimensions we should ignore metadata orientation for TIFFs (single or multipage). I.e. we don't need to do anything more than what we're doing currently.
Tue, May 16
I didn't write similar integration tests for DjVu, because those depend on command-line tools that aren't present by default. However, I manually crafted a DjVu document with a rotated page, and the rotation is applied by djvutoxml, which means that MediaWiki doesn't need to do anything special.
I've just realized that X-Content-Dimensions doesn't apply EXIF rotation, which is necessary because the Swift headers alone won't tell us whether the current document/page is soft-rotated...
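For reference, applying the orientation is mostly a dimension swap for the four transposing values. A minimal sketch with Pillow, just to show the logic (the actual handling would live in MediaWiki/Thumbor, not this code):

```python
from PIL import Image

# EXIF orientation values 5-8 involve a 90 or 270 degree rotation, so the
# stored width/height have to be swapped when reporting dimensions.
TRANSPOSED_ORIENTATIONS = {5, 6, 7, 8}


def soft_rotated_dimensions(path):
    with Image.open(path) as image:
        width, height = image.size
        orientation = image.getexif().get(0x0112)  # 0x0112 = Orientation tag
    if orientation in TRANSPOSED_ORIENTATIONS:
        width, height = height, width
    return width, height
```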
Mon, May 15
Thu, May 11
Wed, May 10
FYI we usually link to the RAIL guidelines because they're easy to understand, but they're based on research that's been around for some time about what feels instantaneous, etc. Linking to research papers is less digestible, but I'll do some homework and dig up the actual research on the subject.

I personally treat RAIL as informative only, not as rules to follow. In most things our team deals with, for example time-to-content on pageload, lower is always better, so the quest to improve never stops and things like RAIL are irrelevant. It's only a useful thermometer when things are so slow they no longer feel instantaneous. But again, that's more a talking point than a goal or a rule.

The definition of what's instantaneous does vary between sources, and it evolves with time; people are more impatient with their devices now than they were a decade ago. It's always healthy to review research on the subject, though, so I'll do it since we haven't looked at that for some time.
Deployed on testwiki, works for most formats, except:
We could have RL report how much it found in local storage and treat it the same as cache hits in resource timing. But we won't get the gzipped size in that case. I'm not sure what you want to count, though: request count or actual size.
400m misses means 154 requests per second. It would at least triple the load on Thumbor. Might be possible if/once we've repurposed all existing image scalers to Thumbor.
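For reference, the arithmetic behind that figure, assuming the 400m count is over a month:

```python
misses = 400_000_000
seconds_in_a_month = 30 * 86_400     # 2,592,000

print(misses / seconds_in_a_month)   # ~154 requests per second

# If adding 154 req/s "at least triples" the load, that would put current
# Thumbor traffic at roughly 77 req/s or less, since
# current + 154 >= 3 * current only holds when current <= 77.
```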
Got confirmation that Varnish entries for originals normally expire in 24 hours, but it doesn't really matter anyway since Thumbor consumes directly from Swift. Still, it will be convenient for debugging to just check the headers of original links from random files on migrated wikis a day after their migration.
In production, the migration steps also require purging; otherwise the original can be served by Varnish and doesn't get cleared by those 2 jobs. Unfortunately, using the general purge means that thumbnails get purged as well, which is undesirable, and purging all thumbnails for Commons would be problematic in terms of purge load. We can afford to wait for the cached originals to expire organically, I think; I just have to confirm how long that is in the worst case.
Tue, May 9
The top 100 most requested sizes represent 91.21% of all requests. The remaining long tail (any size not in that top-100 whitelist) represents 68% of the storage size.
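Roughly the computation behind those two numbers, sketched in Python (the input shapes here are assumptions for illustration, not the actual analysis code):

```python
from collections import Counter


def long_tail_report(requested_widths, stored_thumbs):
    """requested_widths: one width per thumbnail request.
    stored_thumbs: one (width, size_bytes) pair per stored thumbnail.
    """
    counts = Counter(requested_widths)
    top_100 = {width for width, _ in counts.most_common(100)}

    request_share = sum(counts[w] for w in top_100) / sum(counts.values())
    total_bytes = sum(size for _, size in stored_thumbs)
    tail_bytes = sum(size for width, size in stored_thumbs if width not in top_100)

    return request_share, tail_bytes / total_bytes
```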
You mean when the page itself is a cache hit? Not worrying about the resources on the page? I'm not sure even that is possible without the usual guesswork about which times represent a cache hit vs a network call:
Right :) But I meant as RUM. The Obama article is quite an extreme case, and it would have been nice to know how much it improved time-to-logo for users in practice.
Some more food for thought.
Anyway, to close the discussion on this particular task, I think enough time has now passed since yesterday's SWAT to declare that the logo preloading has absolutely nothing to do with the fetchStart/firstPaint change. I don't think I'm going to bother backporting the second part with the patch that reactivates it; it'll just go out with the train this week.
Yeah I think that's just the global traffic really dipping, which has made things super confusing. This was really the perfect storm of: us deploying a huge number of changes after a week of deployments was skipped, a new Chrome version ramping up, traffic in Turkey getting blocked, and what looks like a global traffic dip that Turkey alone can't explain.
Reading the Navigation Timing v2 spec, I think the argument for using fetchStart as the origin is confirmed by the breakdown of what happens in case of a redirect. When a redirect happens, after recording the redirect duration, the browser goes back to updating fetchStart for the new page it's been redirected to and starts again from there ("step 9").
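Toy numbers to illustrate the effect (all marks hypothetical, in milliseconds):

```python
# Hypothetical timeline for a page reached through one redirect (ms).
navigationStart = 0
fetchStart = 300     # reset after the redirect is recorded, per the spec
firstPaint = 1100

print(firstPaint - navigationStart)  # 1100 ms: includes the redirect chain
print(firstPaint - fetchStart)       # 800 ms: only the final page's fetch
```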
Mon, May 8
I've been staring at Pivot for a while and I've noticed something interesting. The report rate for fetchStart seems to follow the pageview trend for Chrome. And right now we're experiencing a dip in traffic similar to the Christmas period, and that pattern is found in both datasets. What's most striking is the huge spike we had over last summer on Chrome specifically, which is found in both:
So far, with the logo preload turned off for a couple of hours, no change in the metrics. We'll see tomorrow, but I think this confirms that the logo preload had nothing to do with the fetchStart/firstPaint change.