Alright, I think that answers it; it's more straightforward than I anticipated. The bot's needs are actually much smaller than I first thought, and I'll probably be running it off of WMCS.
It worked with a different name. I guess the name "performanceteam" was problematic. It could be that there's a unix user group named that way or something.
I'll try again, calling it something different
As expected, here it is: https://analytics.wikimedia.org/datasets/performance/autonomoussystems/
For the desired shell name in the signup form, if that's what it is, I'm pretty sure I put "performanceteam". Since I'm stuck in this limbo of temp password, I can't log into the account to look anything up.
I've created a performance/autonomoussystems folder on /srv/published-datasets on stat1004, which should get picked up and published on https://analytics.wikimedia.org/datasets/ eventually.
Where do I look that up?
So I've figured out how I can have a bot push arbitrary file contents to a gerrit change on performance/docroot, but there is one big caveat: it doesn't create a git commit per se, and thus doesn't run the git hooks that get Jekyll to run... I don't think it's reasonable to have Node and all the potentially risky dependencies it would pull in run on the stat machines merely to run Jekyll and generate the change we need.
Here's how simple the code is for publishing the change to gerrit with that library:
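Something along these lines, using pygerrit2's REST wrapper and Gerrit's change edit endpoints (host, credentials and file name below are placeholders, and the exact payload handling for the file content may need tweaking):

```
from pygerrit2 import GerritRestAPI, HTTPBasicAuth

# Placeholder credentials; the real bot would use its own HTTP password.
auth = HTTPBasicAuth("reportbot", "http-password")
gerrit = GerritRestAPI(url="https://gerrit.wikimedia.org/r", auth=auth)

# Create an empty change against the target repo/branch.
change = gerrit.post("/changes/", json={
    "project": "performance/docroot",
    "branch": "master",
    "subject": "Update autonomous systems report",
})
change_id = change["id"]

# Attach the generated CSV as a change edit, then publish it for review.
gerrit.put("/changes/{}/edit/report.csv".format(change_id),
           data=open("report.csv", "rb").read())
gerrit.post("/changes/{}/edit:publish".format(change_id))
```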
I've figured out how to do that locally with the pygerrit2 python library and it's fairly straightforward. Unfortunately that library is only available in buster, dammit; I'll have to backport it.
User id 13687 on wikitech
@Nuria my plan is to have the report generated by a cron python script on a stat machine, and have the resulting CSV then git-pushed to the performance/docroot repo (static site where the report will be viewable). Does that sound sane to you? Is there any precedent to doing something like this?
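Roughly, something like this running from cron (repo URL, paths and file names are placeholders, not a final design):

```
#!/usr/bin/env python3
"""Sketch of a cron job: generate the CSV report and push it to performance/docroot."""
import subprocess
import tempfile

REPO = "ssh://reportbot@gerrit.wikimedia.org:29418/performance/docroot"  # placeholder remote
CSV_NAME = "isp-report.csv"  # placeholder file name


def generate_report(path):
    # Placeholder for the actual report-generation logic.
    with open(path, "w") as f:
        f.write("isp,median_transfer_size\n")


def main():
    workdir = tempfile.mkdtemp()
    subprocess.check_call(["git", "clone", "--depth", "1", REPO, workdir])
    generate_report("{}/{}".format(workdir, CSV_NAME))
    subprocess.check_call(["git", "-C", workdir, "add", CSV_NAME])
    subprocess.check_call(["git", "-C", workdir, "commit", "-m", "Update ISP report"])
    # Push for review rather than directly to the branch.
    subprocess.check_call(["git", "-C", workdir, "push", "origin", "HEAD:refs/for/master"])


if __name__ == "__main__":
    main()
```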
Tue, Jan 15
Looking at synthetic and RUM data, it doesn't look like this had a visible effect on global metrics. This was an expected possible outcome, as it's possible that the browser's heuristics were already doing the right thing.
The explanation is that I reset the sampling rate of navigation timing for ruwiki back to the default on 2019-01-10, after having made it 10x its normal rate on 2018-12-20. That change had an inverse effect on the metrics, which went unnoticed at the time. The effect is easier to see when looking at the desktop metrics on those dates.
Failing thumbnails tend to be costly to reattempt, which means repeated requests to those get rate-limited (429). Generally speaking, short of a software upgrade, a failing thumbnail isn't going to work the next time it's requested, hence the use of poolcounter or memcache-based throttling.
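Conceptually, the memcached-based variant is just a failure counter per thumbnail (key names and limits below are made up, not the actual Thumbor/poolcounter code):

```
import pylibmc  # assumption: any memcached client would do

mc = pylibmc.Client(["127.0.0.1"])

FAILURE_LIMIT = 4       # made-up threshold
FAILURE_WINDOW = 3600   # seconds to remember failures for


def should_throttle(thumb_key):
    """Return True (i.e. answer 429) if this thumbnail keeps failing."""
    key = "thumb-fail:" + thumb_key
    return (mc.get(key) or 0) >= FAILURE_LIMIT


def record_failure(thumb_key):
    key = "thumb-fail:" + thumb_key
    if not mc.add(key, 1, time=FAILURE_WINDOW):
        mc.incr(key)
```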
Mon, Jan 14
Fri, Jan 11
Following @ayounsi's request, I've put together per-DC real user monitoring performance metrics using the following Hive query:
@Krinkle's suggestions sound good to me
Thu, Jan 10
- We need to ask the eswiki community's permission before running the perceived performance microsurvey on that wiki.
On staff IRC there was a discussion of rural vs urban. However, while mobile is worth focusing on, there's no mapping between IP addresses and city location for mobile, and without asking users to share their location (which we won't do), there's no way to assess whether they're in a rural or urban area. This means that we'll have to stick with national rankings. Those shouldn't inform local decisions about which ISP is best, since that can vary greatly based on location, but if an ISP makes an effort to improve its service, it should surface in our updated rankings month-to-month or year-to-year.
Initial results show that this might work, but I need to wait until the extra data has been collected before I can claim success on devising a fair ranking. Right now for January I only have 12-151 samples per ISP for the US, for example (which is logical, since pretty much only ruwiki was contributing to the dataset so far), and with that amount I'm seeing big variations in median transferSize. The number of samples used to generate the ranking needs to be large enough that the median transferSize is in the same ballpark for all ranked ISPs. If we don't get there with more data, it would be a peculiar finding that users look at bigger articles depending on which ISP they're subscribed to (could be a rural vs urban thing, if that's the case).
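Roughly the kind of sanity check I have in mind (the threshold values and the tuple layout are placeholders, not the final methodology):

```
import statistics
from collections import defaultdict

MIN_SAMPLES = 500      # made-up threshold, to be tuned once more data comes in
SIZE_TOLERANCE = 0.2   # made-up: median transferSize must be within 20% of the overall median


def rank_isps(samples):
    """samples: iterable of (isp, transfer_size, load_metric) tuples."""
    by_isp = defaultdict(list)
    for isp, transfer_size, load_metric in samples:
        by_isp[isp].append((transfer_size, load_metric))

    # Drop ISPs with too few samples for a fair comparison.
    eligible = {isp: rows for isp, rows in by_isp.items() if len(rows) >= MIN_SAMPLES}

    # Only rank ISPs whose median transferSize is in the same ballpark,
    # otherwise we'd be comparing ISPs on different page weights.
    medians = {isp: statistics.median(size for size, _ in rows)
               for isp, rows in eligible.items()}
    if not medians:
        return []
    overall = statistics.median(medians.values())
    comparable = [isp for isp, m in medians.items()
                  if abs(m - overall) / overall <= SIZE_TOLERANCE]

    # Rank the remaining ISPs by their median load metric.
    return sorted(comparable,
                  key=lambda isp: statistics.median(m for _, m in eligible[isp]))
```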
Maybe some subfolders for the ones we have a lot of dashboards for, like NavTiming?
Mon, Jan 7
I thought about Parsoid, but first let's see if it's beneficial at all for merely reading articles rendered by MediaWiki in a web browser.
Thu, Dec 20
I wasn't aware that the latest plan was to use ATS for TLS termination.
Could be a puppet variable too, to make the filtering block conditional.
Is there an nginx "site" or config specific to varnish termination?
@BBlack would you miss x-cache, x-cache-status and x-varnish if those were completely removed at the TLS termination level? Some of that information is now exposed as part of the Server-Timing header (which unlike others, can be parsed by JS).
Plain nginx config has the ability to remove the headers, but it can't do so conditionally...
Expanding the number of WebP thumbnails we serve is blocked on T211661: Automatically clean up unused thumbnails in Swift
If the files are already on Beta, purge them, make sure your browser cache is cleared for these images, and you'll get thumbnails generated with librsvg 2.40.20-3
Dec 19 2018
After brainstorming this more: since Nginx TLS termination is going to remain for the foreseeable future, even after we move backend caches to ATS, it makes this effort simpler to blacklist specific debugging headers at the Nginx level and have the debug "gate" there. This way the list of filtered headers is defined in one central location and is guaranteed not to mess with Varnish log data collection.
The data is making it into the ServerTiming schema as expected now, as verified on Hive.
Will post a blog post soon, either on the perf calendar or on our own blog if it doesn't get picked up for that.
Dec 18 2018
It sounds like, by default and depending on some heuristics, the browser might block text rendering on image decoding, presumably to flush both in the same paint. Imho async decoding sounds like a win, as it would allow text or other things to get flushed in situations where they would otherwise have been blocked by image decoding.
Dec 17 2018
If they're happening in production, yes.
Dec 14 2018
Emailed the 2 Google engineers who wrote most of the WebP container spec about the idea of extending WebP to allow AV1 as a codec to encode frames. One of them seems to be working on AV1 anyway. Based on the spec, it seems like a very straightforward thing to do, and Chrome already has AV1 decoding for video.
I've figured out a way to test this with the latest ffmpeg and libaom, and holy crap the hype is real. When I hit the same DSSIM fidelity for the AV1 "thumbnail" as the JPG and WEBP ones from production, I get the following. Note that the embedded images in this comment are lossless PNG conversions of the actual thumbnails, provided here to eyeball visual differences.
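For reference, a single-frame AV1 encode like that can be driven roughly as follows (flags, CRF value and file names are illustrative, not necessarily the exact command used; the CRF gets swept until DSSIM matches the production thumbnails):

```
import subprocess

# Assumption: an ffmpeg build with libaom-av1 support is on the PATH.
def encode_av1_still(src, dst, crf):
    subprocess.check_call([
        "ffmpeg", "-y",
        "-i", src,
        "-frames:v", "1",          # encode a single frame ("thumbnail")
        "-c:v", "libaom-av1",
        "-crf", str(crf),
        "-b:v", "0",               # constant-quality mode
        "-strict", "experimental",
        dst,
    ])

encode_av1_still("source.png", "thumb.ivf", crf=40)
```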
Dec 13 2018
If there's already a cookie planned to enable that, I think that's the easiest thing to use here. Server-Timing has limited cross-browser support at this point: https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Server-Timing
Dec 11 2018
NavigationTiming doesn't store article id, but it stores the revision id. For a popular article like https://ru.wikipedia.org/wiki/Россия which is part of the oversampled articles, it gives a recognisable revision id. Last edit was on the 8th, with revision id 96715436. Let's look at how often navtiming was recorded for it on the 9th and on the 10th once per-page oversampling was enabled (mid-day):
I'm not sure that it worked or that the 5 articles picked made a big difference, there's no visible uptick in ruwiki survey responses:
@Anomie very good point, I think it will be very hard for someone to find out about such a whitelist. Things will work for them on Vagrant and possibly on Beta, leading to unexpected breakage upon production deployment when their new header is just missing.
Dec 10 2018
The object isn't sealed, isn't frozen, is extensible, and if I output it after modifying it, it has the new property. But the new property doesn't survive JSON.stringify, because the new property isn't included by the object's toJSON():
Dec 5 2018
Nov 30 2018
Please reopen if it's seen again.
Doesn't work, I see nothing in hive and this happens a bunch:
Nov 29 2018
Sure, 'till next time ;)
Nov 27 2018
In order to allow future work on this based on the CPU benchmark results, I need to expand the scope of the CPU benchmark beyond the perception survey. I think a sub-sampling ratio of NavTiming samples would make sense.
Unless I'm querying logstash incorrectly, I don't see any occurrence of the original fatal in the last 7 days.
Alright, I think this is the closest we can get to running something like the future ATS setup:
Uploaded the test images to mediawiki.org, so that we can have them served from upload.wikimedia.org on a separate connection:
There's already an ATS cluster we can hit internally. The config in ATS is unified for text and upload.
Nov 26 2018
Looking at potential callers, it seems to be coming from the "Mark all changes as seen" button on Special:Watchlist, which, when clicked, fires the API call we're seeing and then triggers an update of the UI by fetching the changes list from the server. I'm guessing that this is probably the reason why the API call tries to do all this work synchronously. I think this calls for the API call doing chunking while still being synchronous, if we want to avoid drastic UI+API changes.
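Roughly what I mean by chunking while staying synchronous (a sketch against a generic DB-API cursor; batch size and exact SQL are illustrative, not tested code):

```
BATCH_SIZE = 1000  # made-up batch size


def reset_notification_timestamps(cursor, user_id):
    """Null out wl_notificationtimestamp in batches instead of one huge UPDATE."""
    last_id = 0
    while True:
        cursor.execute(
            "SELECT wl_id FROM watchlist"
            " WHERE wl_user = %s AND wl_id > %s AND wl_notificationtimestamp IS NOT NULL"
            " ORDER BY wl_id LIMIT %s",
            (user_id, last_id, BATCH_SIZE),
        )
        ids = [row[0] for row in cursor.fetchall()]
        if not ids:
            break
        cursor.execute(
            "UPDATE watchlist SET wl_notificationtimestamp = NULL"
            " WHERE wl_id IN ({})".format(",".join(["%s"] * len(ids))),
            ids,
        )
        last_id = ids[-1]
```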
If I'm following the code correctly based on the API parameters, it ends up calling setNotificationTimestampsForUser in WatchedItemStore, which seems to update all of that user's rows in the watchlist table at once, setting the wl_notificationtimestamp column to null. It does the whole thing inside the POST, as the return status of the API call seems to depend on the success or failure of that update call.
@Anomie you were the last to touch this API (ApiSetNotificationTimestamp) for MCR. Does it look to you like something that could be caused by recent MCR changes?
Some recent instances in api.log:
Nov 22 2018
There's one thing that's definitely bad in our results, though: since everything is interleaved, the high priority stuff is sharing bandwidth with low priority things. Some of that is happening in Pat's "good" example as well, with the first hidden image competing with the high priority things.
It looks like we're doing fine: https://www.webpagetest.org/result/181122_MY_402f279db0f28ab1957bb10b5d551d61/3/details/#waterfall_view_step1