@Tgr added this on a related mail thread:
Wed, Sep 20
If I recall correctly, ResourceLoader client code on desktop already looks at a list of modules needed in a given page, checks client side caches, and fetches the remaining modules from the RL API (in a single call), and caches those modules separately in localstorage. Given that this discussion is making no reference to this, I am getting the impression that this understanding might be wrong. Could you clarify?
@Fjalapeno, that comment touches on 1), but as I said, to me it looks like the API-focused discussion has moved on to 2). Either way, I am not sure we need a new API for either 1) or 2).
FTR, this is the graph with the alert I mentioned: https://grafana.wikimedia.org/dashboard/db/restbase?panelId=12&fullscreen&orgId=1
Tue, Sep 19
At today's team sync we agreed with @Pchelolo's proposal:
I honestly don't have a strong preference between the other "hearted" tasks. Given that all of them are fairly low volume, would it make sense to just deploy all of the hearted ones in the next wave?
Mon, Sep 18
It sounds like there are two separate questions:
I strongly support @Tgr's access request as well.
Added the "fetch metrics from graphite / prometheus" option.
Thu, Sep 14
Looks like adding the JSON_UNESCAPED_UNICODE flag should do it: http://php.net/manual/en/function.json-encode.php
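A minimal illustration of the flag's effect (the example value is made up):
```php
<?php
// Default: non-ASCII characters are escaped as \uXXXX sequences.
echo json_encode( [ 'title' => 'Köln' ] ), "\n";
// {"title":"K\u00f6ln"}

// With JSON_UNESCAPED_UNICODE they are emitted as literal UTF-8.
echo json_encode( [ 'title' => 'Köln' ], JSON_UNESCAPED_UNICODE ), "\n";
// {"title":"Köln"}
```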
Wed, Sep 13
Given the useful information we have in this task, I am proposing to widen the scope beyond the first job, towards generally coordinating the order of migrating individual jobs. @mobrovac, does that sound reasonable to you?
We briefly discussed this during today's sync meeting. While there are ways to set up targeted processing priorities for specific jobs (by wiki, type, or other criteria), we realized that there will likely be less of a need for this in the new setup. The current Redis-based job queue divides processing throughput evenly between projects, which makes it relatively likely for individual projects to accumulate large backlogs that then need manual intervention (re-prioritization) to address.
Raised priority, as this is a) blocking the migration to the Kafka job queue backend (T157088), and b) likely already causing performance and possibly reliability issues in the current job queue.
Tue, Sep 12
As far as I can tell, the page image(s) are handled as part of deferred linksUpdate processing. This means that the updates would be executed after the main web request, but on the same PHP thread that handled the original edit request.
Considering the scalability limits of Cassandra's schema synchronization we see in production, I think it would be good to reduce the number of storage groups more aggressively. Perhaps something like this?
Mon, Sep 11
Update from our month-end check-in:
@bearND, MediaWiki's section edit feature is implemented without knowledge of a DOM, so <div> wrappers do not suppress section edit links. Example: https://en.wikipedia.org/wiki/User:GWicke/TestSections with source
I believe it was the pageimages designation for those articles I mentioned above. I'm not exactly sure what happened on-wiki, since the revisions have been deleted from public archives (and I don't have permission to view them).
Just to clarify what exactly happened here: the offending edits added an image to the featured page itself, and also nominated that image as the pageimage?
@Ottomata, from a cursory look at those connectors, it looks like they all aim to capture all SQL updates (update, insert, delete). They don't seem to be targeted at emitting specific semantic events, such as the ones we are interested in for EventBus. This is where the SQL comment idea could help, by letting us essentially embed the events we want to have emitted in the statement, rather than trying to reverse-engineer an event from raw SQL statement(s).
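A rough sketch of that idea; the helper name and event fields here are hypothetical, purely for illustration:
```php
<?php
// Hypothetical helper: append the semantic event we want emitted to the SQL
// statement as a comment, so a binlog tailer can extract and publish it to
// a Kafka topic instead of reverse-engineering it from the raw SQL.
function annotateWithEvent( string $sql, array $event ): string {
	$json = json_encode( $event, JSON_UNESCAPED_UNICODE );
	// Make sure the payload cannot terminate the comment early.
	$json = str_replace( '*/', '', $json );
	return $sql . ' /* EventBus: ' . $json . ' */';
}

echo annotateWithEvent(
	"UPDATE page SET page_touched = '20170911000000' WHERE page_id = 42",
	[ 'topic' => 'mediawiki.page-update', 'page_id' => 42 ]
);
```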
Looking at the three custom changes we made on top of upstream in https://github.com/wikimedia/swagger-ui/commits/master, it seems that the build process we ran after each did not update the source map. However, the gulpfile defines "dist" to be part of the default task (see https://github.com/wikimedia/swagger-ui/blob/master/gulpfile.js#L188). Perhaps we "just" forgot to check in the updated source maps?
In terms of document structure, the behavior in line two (add section around <div>-wrapped heading) seems to make sense. I think it also matches edit section behavior, which should ignore the <div> completely (as it is not DOM-based).
Fri, Sep 8
From a practical perspective, I think the biggest question is how common clients behave these days when must-revalidate is omitted, and the client cache timeout expires. My memory on this is rather foggy, but I *think* in the dark ages behavior in that area was inconsistent, with early IE versions not re-validating even when they were online. If we can verify that all browsers we care about do the right thing (check as if must-revalidate was set when connected), then dropping must-revalidate in the headers would be harmless.
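For reference, the difference boils down to these two header variants (max-age value chosen arbitrarily):
```php
<?php
// Variant A: after max-age expires, the client must revalidate with the
// server before reusing the cached response.
header( 'Cache-Control: public, max-age=86400, must-revalidate' );

// Variant B: without must-revalidate, a client is technically allowed to
// reuse the stale copy in some cases, which is where the historic browser
// inconsistency lies.
// header( 'Cache-Control: public, max-age=86400' );
```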
We already support fetching specific HTML sections by ID in the REST API (see https://en.wikipedia.org/api/rest_v1/#!/Page_content/get_page_html_title), but until consistent <section> wrapping with a sensible granularity & perhaps a predictable section ID for the lead section are implemented in Parsoid (T114072), this is not as useful in practice as it could be.
This proposed optimization is similar to something I implemented in Parsoid's HTML5 serializer. In that case, we switch between single and double quotes for HTML attributes depending on whether the attribute value contains more single quotes or double quotes. This had a very significant impact on Parsoid HTML size, mainly because Parsoid HTML has many JSON values embedded in attributes.
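A simplified sketch of the quote-switching approach (not the actual Parsoid serializer code):
```php
<?php
// Serialize an attribute using whichever quote character appears less often
// in the value, minimizing entity escaping.
function serializeAttribute( string $name, string $value ): string {
	$value = str_replace( '&', '&amp;', $value ); // always escape '&'
	if ( substr_count( $value, '"' ) > substr_count( $value, "'" ) ) {
		return $name . "='" . str_replace( "'", '&#39;', $value ) . "'";
	}
	return $name . '="' . str_replace( '"', '&quot;', $value ) . '"';
}

// JSON embedded in attributes is full of double quotes, so single-quote
// wrapping avoids nearly all escaping:
echo serializeAttribute( 'data-mw', '{"parts":[{"template":{}}]}' );
// data-mw='{"parts":[{"template":{}}]}'
```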
@Pchelolo, based on our previous conversation about this I am assuming that the bulk of the task is a very large list of pages. Is this correct?
Thu, Sep 7
Facebook actually heavily relies on SQL comments to pass event information to binlog tailer daemons (see the TAO paper). We currently use those SQL comments only to mark the source of a SQL query (PHP function), but could potentially add some annotations that would make it easy to generically extract & export such events into individual Kafka topics.
Starting a new section when encountering a new heading of the same level is expected behavior, in line with MediaWiki section edit behavior. When encountering a heading of a higher level (higher number, lower prominence), the sectioning code I wrote in parsoid-utils creates a nested section. This is in line with typical HTML5 section and page outline semantics: https://developer.mozilla.org/en-US/docs/Web/Guide/HTML/Using_HTML_sections_and_outlines.
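A minimal sketch of that nesting logic over a flat sequence of heading levels (not the actual parsoid-utils code):
```php
<?php
// Emit <section> open/close operations for headings (2 = <h2>, 3 = <h3>, ...).
function sectionOps( array $levels ): array {
	$ops = [];
	$open = []; // levels of currently open sections
	foreach ( $levels as $level ) {
		// A heading of the same or higher prominence (same or lower number)
		// closes the open section(s) and starts a sibling.
		while ( $open && end( $open ) >= $level ) {
			array_pop( $open );
			$ops[] = '</section>';
		}
		// A lower-prominence heading (higher number) nests inside.
		$ops[] = "<section data-level=\"$level\">";
		$open[] = $level;
	}
	while ( array_pop( $open ) !== null ) {
		$ops[] = '</section>';
	}
	return $ops;
}

// h2, h3, h2: the h3 section nests inside the first h2 section; the second
// h2 closes both and starts a new sibling section.
print_r( sectionOps( [ 2, 3, 2 ] ) );
```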
Rebased PR now ready at https://github.com/wikimedia/change-propagation/pull/203.
I don't have strong views on how to scale metrics and log collection. In any case, we have been doing this remotely for a while now (using standard formats like gelf for logs), so whether things are aggregated per pod or more centrally doesn't make a big difference to the services themselves.
Wed, Sep 6
This service would replace the current Electron PDF renderer as well in the medium/long run, right?
Thanks for the update & clarity on the timeline, @ovasileva! It is much appreciated.
Tue, Sep 5
Thu, Aug 31
@Tgr, at first sight it looks like there are reasonable Python bindings for headless Chrome as well. Combined with the PDF post-processing library you have been testing, I could see a simple Python service that does both pre-/post-processing and the actual rendering working well. The service portion of either option is trivial in any case; all the heavy lifting is in the libraries and Chrome.
I updated https://gerrit.wikimedia.org/r/#/c/295027/ to apply on current master. This removes CDN purges from HTMLCacheUpdate, and only performs them after RefreshLinks, and only if nothing else caused a re-render since.
Having no Samsung spares would be surprising, given our last conversation on the topic in April and what I remember about the stock back then.
Since you asked for bikeshedding... How about
The replication issues discussed in T163337 could play a role in duplication / keeping old jobs alive.
Wed, Aug 30
I just looked into HTMLCacheUpdate jobs executed in the last 15 hours, and the number of really old jobs still being executed (presumably retried or respawned) is greater than I would expect with a retry limit of 3 (or 2?):
HTMLCacheUpdate root job timestamp distribution, jobs executed within the last 15 hours:
A possible contributor to the backlog build-up could be the infinite retry / immortal job problem described in T73853. Looking for old htmlCacheUpdate root jobs from April that are still executing over four months later (!) via grep htmlCacheUpdate runJobs.log | grep -c 'rootJobTimestamp=201704' in mwlog1001:/srv/mw-log yields 9208 executions, just today. Interestingly, jobs from May, June, and July are much less common (hundreds). Considering that HTMLCacheUpdateJob basically only updates touched timestamps in the DB and then quickly fires off CDN purges, seeing anything but zero ancient jobs might mean that T73853 is not actually resolved yet. To establish whether this significantly contributes to the current backlog, we would need to look at the distribution of rootJobTimestamp values for htmlCacheUpdate jobs from July, especially for the period since the backlog growth really started around the 8th.
@Krinkle, are you saying that we are confident that jobs are no longer retried for more times than the retry limit would nominally allow?
Signed JSON blobs are kind of what JWTs are designed for. There are good libraries for validation.
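A minimal HS256 sketch just to illustrate the shape of the thing; in practice a maintained library (e.g. firebase/php-jwt) should handle encoding and validation:
```php
<?php
// base64url encoding as required by the JWT spec (RFC 7515).
function b64url( string $data ): string {
	return rtrim( strtr( base64_encode( $data ), '+/', '-_' ), '=' );
}

function makeJwt( array $claims, string $secret ): string {
	$header  = b64url( json_encode( [ 'alg' => 'HS256', 'typ' => 'JWT' ] ) );
	$payload = b64url( json_encode( $claims ) );
	$sig = b64url( hash_hmac( 'sha256', "$header.$payload", $secret, true ) );
	return "$header.$payload.$sig";
}

// Hypothetical claims: an expiring, signed blob identifying a specific job.
echo makeJwt( [ 'sub' => 'job-1234', 'exp' => time() + 300 ], 'shared-secret' );
```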
I am personally not sure whether the startup issues are caused by the same underlying issue as the hangs. I would imagine that a restarting Electron worker process could run into hangs similar to those seen on service startup.
Tue, Aug 29
Bumped priority, as support for flagged revisions is important for serious reading use cases. There is also an opportunity to piggy-back on current storage schema migration efforts.
@Jdlrobson, in general MW can (and does) certainly fetch data from the REST API. However, there are some potential issues if we wanted to fetch the summary on each parse or skin render:
Mon, Aug 28
@ovasileva, thank you for the update. Does this mean that OCG will be switched off by the end of September, or end of October?
Categories are page metadata, and the default desktop rendering is done by the skin. Other experiences will want to display categories in different ways, which is also facilitated by separating the category data from its formatting.
Description from the original mail for slightly more detail:
Aug 21 2017
The Electron render service currently requires manual attention every few days, so we should address the reliability issues sooner rather than later.