Fri, Aug 10
Thu, Aug 9
The symptoms described by the user seem to suggest there is a problem in the Apache config of the site. Asked them to paste their configs here.
PR #1046 removes the support for it.
- pre-generation for mobile end points is enabled only for WPs
- the mobile end points have been removed from the public API for non-WP projects
- the data tables for the others and commons storage groups have been truncated
Wed, Aug 8
Thank you, @Pchelolo !
I like this idea, with the exception of making these classes JSON-serialisable. These objects may (and probably will) deliver more information than needed for our events, so we are really looking for a subset here, i.e. we should require they be EventBus-serialisable. Given this year's Platform Evolution programme's aim at rethinking interfaces inside MW, this can be part of that.
Tue, Aug 7
+1 on creating such an event, it sounds like a useful piece of information for clients to have/be able to react to.
One more data point: most of these failures come from wmf-15 code (less than 1% of the messages come from wmf-14 and wmf-13 combined), which seems to suggest a change in wmf-15 is causing this. Alas, after going through the diff, I couldn't find any (obvious) candidates.
Mon, Aug 6
IMHO, relying on client libraries for validation is not really an option if we want to ensure the well-functioning of the platform, given its stated openness. In EventBus we currently have server-side validation which is an aspect that I think we should keep (whether in the current form or a different one).
A quick investigation of merged commits for includes/jobqueue/jobs/ThumbnailRender.php, includes/file/File.php and related includes/media/* files (which are used in the generation of the URL) did not turn up any recent changes.
Fri, Aug 3
I would second the idea of switching the MCS' storage to key-value, at least in the short term, in this way reducing the storage capacity needs.
Raising the priority since there have been more than 20k such messages in the past 24h.
There should be progress on this before we enter the full production stage, since keeping Chromium instances working while Proton thinks the resources are free can quickly lead to resource starvation at our scale.
Thu, Aug 2
Indeed, the discussion is probably out of the scope of this ticket.
We currently have nsp run as part of npm test which automatically makes Jenkins run the test. When Jenkins gets npm v6+, we can then have npm audit run as part of the test.
I assume the task description implies the topic would get multiple messages every week, and that the total data size would be ~3GB (as opposed to one 3GB message). If so, LGTM. Note that we have snappy compression enabled in main, so the producer can simply send plain messages and they would be compressed on the fly. One question here: instead of bursting 3GB of data into Kafka in one go, is there a possibility of spacing the messages out a bit to ensure the normal functioning of the Kafka cluster?
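To illustrate what spacing the messages out could look like, here is a minimal sketch of byte-rate pacing on the producer side. Everything in it is an assumption made for illustration: `send_paced`, its parameters and the rate cap are hypothetical names, and `send` stands in for whatever call the actual producer client exposes.

```python
import time


def send_paced(messages, send, max_bytes_per_sec, sleep=time.sleep, clock=time.monotonic):
    """Send each message via `send`, pausing so the average rate stays under the cap.

    `send` is a placeholder for the real producer call (hypothetical here).
    """
    start = clock()
    sent_bytes = 0
    for msg in messages:
        send(msg)
        sent_bytes += len(msg)
        # Earliest point in time at which this many bytes fit under the cap.
        earliest = start + sent_bytes / max_bytes_per_sec
        delay = earliest - clock()
        if delay > 0:
            sleep(delay)
    return sent_bytes
```

With a cap of a few MB/s, the ~3GB would be spread over several minutes rather than arriving in a single burst; the right value would depend on the headroom available in the cluster.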
Wed, Aug 1
The caveat with the maximum number of topics is that Kafka has no hard limit on it because it depends on ZooKeeper, so effectively it can support as many topics as ZooKeeper can support znodes.
I would also suggest using X-Request-ID which uniquely identifies a single request.
Nginx and Varnish already attach the X-Client-IP header to incoming requests (cf. this sample request), so all you have to do is actually use the provided header.
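A rough sketch of how a service might pick up both headers, assuming the incoming headers are available as a plain name-to-value mapping. The function name `request_context` and the UUID fallback are assumptions for illustration, not something the service does today.

```python
import uuid


def request_context(headers):
    """Return (client_ip, request_id) taken from headers set upstream."""
    # HTTP header names are case-insensitive, so normalise the keys first.
    lowered = {name.lower(): value for name, value in headers.items()}
    client_ip = lowered.get("x-client-ip")
    # Assumption: generate a fresh ID if no X-Request-ID was passed along.
    request_id = lowered.get("x-request-id") or str(uuid.uuid4())
    return client_ip, request_id
```

Logging the pair together would make it possible to tie a given log line both to the originating client and to the single request that produced it.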
I'd hold off with this for the time being. @akosiaris what do you think?
Tue, Jul 31
This is no longer an issue: Mathoid has been moved to our k8s infrastructure (nominally it still exists on SCB, but it's not used there at all).
Mon, Jul 30
The errors have completely disappeared as of this morning UTC.
Not happening any longer after fixing T186750: Reset RESTBase deployment-prep environment.
The deployment-cassandra3-0x nodes have been removed and with the resolution of T186750: Reset RESTBase deployment-prep environment these errors will not appear again.
Resolved via T186750: Reset RESTBase deployment-prep environment
Fri, Jul 27
This has finally been resolved for good. Here's a summary/post-mortem for clarity and posterity.
Thu, Jul 26
Self-merging to deploy.
The public API is now exposed. Resolving.