Mon, Jan 27
Sun, Jan 26
Thu, Jan 23
@Dzahn can we please ensure this procedure is finished before next week?
Wed, Jan 22
Tue, Jan 21
Fri, Jan 17
We should complete this work before we perform the MediaWiki switchover.
@akosiaris didn't we just decommission this cluster last week?
We should also remove all the stale docker images.
Retagged for release engineering as they're managing blubber.
I am not convinced this is a great idea.
This task is generic enough, in this form, that I'm not sure whether it's a good idea.
Thu, Jan 16
The total number of images in the registry is 1003. I'm going to slowly remove most of the old ones in the coming week.
Wed, Jan 15
Tue, Jan 14
One problem I see with this: proxy IPs regularly get banned by third-party services by accident. So having multiple *external* IPs, and being able to switch between them, is a plus.
Mon, Jan 13
I think the main question to answer is "does it make sense to create a safe regex evaluation service?".
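To make the question concrete, here is a minimal sketch of what "safe" could mean: evaluate the regex in a separate process and kill it if it exceeds a time budget, which bounds catastrophic backtracking. This is an illustrative design, not an existing service; `safe_search` and its parameters are hypothetical names.

```python
import multiprocessing
import re

# "fork" keeps the child from re-importing this module (POSIX-only).
_ctx = multiprocessing.get_context("fork")

def _worker(pattern, text, queue):
    # Runs in a child process so a runaway evaluation can be killed.
    queue.put(re.search(pattern, text) is not None)

def safe_search(pattern, text, timeout=1.0):
    """Return True/False for a match, or None if evaluation timed out."""
    queue = _ctx.Queue()
    proc = _ctx.Process(target=_worker, args=(pattern, text, queue))
    proc.start()
    proc.join(timeout)
    if proc.is_alive():
        proc.terminate()  # kill a catastrophically backtracking match
        proc.join()
        return None
    return queue.get()

print(safe_search(r"b", "a" * 40 + "b"))       # True
print(safe_search(r"(a+)+$", "a" * 40 + "b"))  # None: exponential backtracking, cut off
```

An alternative answer is to avoid backtracking engines entirely (e.g. RE2-style linear-time matching), at the cost of dropping features like backreferences.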
Sat, Jan 11
Thu, Jan 9
Hi everyone, serviceops needs to decommission the scb cluster by April 2020, so work on the Kubernetes migration should be prioritized accordingly.
This happens because parsoid/js is still deployed in production, so it also gets monitoring.
Wed, Jan 8
Hi all, as you have noticed, we were working on a resolution and things should be ok now. An incident report will be published at a later time at https://wikitech.wikimedia.org/wiki/Incident_documentation
An incident report will be published later on wikitech at https://wikitech.wikimedia.org/wiki/Incident_documentation
Hi, thanks for your report!
Tue, Jan 7
This isn't going to happen until some effort is put into making scap's management of data saner.
Dec 23 2019
Yes, this is yet another WTF resolved by killing parsoid-js.
Dec 20 2019
Dec 19 2019
Port reservations are for now indicated here: https://wikitech.wikimedia.org/wiki/Service_ports
Dec 18 2019
I think most of the issues described here have in the meantime been solved by the implementation of the code stewardship review process and a list of developers/maintainers. @Pchelolo @Eevans @Clarakosi any opinions?
Dec 17 2019
All jobs have recovered. Not closing the task as we need to reduce concurrency again.
Dec 16 2019
Thank you @mobrovac for the help!
I see sometimes an error like the following:
This happens after the key is offered for authentication.
Again, looking specifically at recentChangesUpdate:
In the meantime, it seems that processing of recentChangesUpdate falls completely behind for some reason. Let's start with the logfile that began on the 12th:
Looking at cpjobqueue logs, it's clear it's getting 500 responses at least to some of the requests:
Specifically, we have a 6 million jobs backlog on recentChangesUpdate, ever-growing since Dec 11th. I don't see any of them in the logs for JobExecutor, so I guess the problem has to do with changepropagation.
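To quantify the 500s mentioned above, a quick sketch of tallying failing responses per rule from JSON-lines logs. The field names here ("rule", "status", "message") are illustrative assumptions, not changeprop's real log schema:

```python
import json
from collections import Counter

# Example log lines in roughly the shape a JSON-lines service log takes;
# the field names are assumed for illustration.
SAMPLE_LOG = """\
{"rule": "recentChangesUpdate", "status": 500, "message": "request failed"}
{"rule": "recentChangesUpdate", "status": 200, "message": "ok"}
{"rule": "wikidata-description", "status": 500, "message": "request failed"}
{"rule": "recentChangesUpdate", "status": 500, "message": "request failed"}
"""

def count_failures(lines):
    """Tally non-2xx/3xx responses per rule from JSON-lines log input."""
    failures = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        entry = json.loads(line)
        if entry.get("status", 200) >= 400:
            failures[entry["rule"]] += 1
    return failures

print(count_failures(SAMPLE_LOG.splitlines()))
```

Pointing this at the real logfile (one `json.loads` per line) would show whether the failures are concentrated on recentChangesUpdate or spread across rules.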
Dec 13 2019
Dec 12 2019
It appears to me that we try to send something on that UDP socket with send() after we have called close(). Does this only happen on the debug servers? If so, various things there are peculiar:
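The failure mode is easy to reproduce in isolation with plain sockets (this is a generic sketch, not the actual production code path; the address and payload are made up):

```python
import socket

s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
s.connect(("127.0.0.1", 9125))  # UDP "connect" just sets the peer; no packet goes out
s.send(b"metric:1|c")           # fine while the socket is open
s.close()
try:
    s.send(b"metric:1|c")       # send() after close()
except OSError as e:
    print("send after close failed:", e)  # Bad file descriptor
```

Whatever owns the socket is being torn down while another code path still holds a reference to it and keeps calling send().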
While it should be easy to swap nginx for envoy, we need to also convert profile::services_proxy to use envoy at the same time.
Dec 10 2019
The good news is that we only need to do a rolling restart in eqiad, not in codfw, where we still don't use the CA for peer connections.
Mcrouter can't be configured to listen on both a unix socket and a TCP port. This means that, apart from how cumbersome the change would be, we'd need to change our architecture and run separate "mcrouter proxies" for cross-DC replication.
Dec 9 2019
This project was supposed to be orphaned and removed in the last purge by Andrew. Please remove it completely at your earliest convenience.
Dec 6 2019
Dec 5 2019
I'm pretty sure using unix sockets would improve performance; it certainly did when we were on HHVM. This is easy to test: we now collect timing data per server, so we could change the configuration conditionally on one API server and one appserver, and quickly check whether there is a positive performance impact.
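For a rough local sense of the difference before touching production config, a self-contained sketch that times round trips against an echo server over loopback TCP vs a unix domain socket (the numbers are machine-dependent and this measures raw socket latency, not PHP-FPM itself):

```python
import os
import socket
import tempfile
import threading
import time

N = 2000  # round trips per transport

def echo_server(server_sock):
    # Accept one client and echo everything back until it disconnects.
    conn, _ = server_sock.accept()
    with conn:
        while True:
            data = conn.recv(64)
            if not data:
                break
            conn.sendall(data)

def round_trip_time(server_sock, connect_addr, family):
    threading.Thread(target=echo_server, args=(server_sock,), daemon=True).start()
    client = socket.socket(family, socket.SOCK_STREAM)
    client.connect(connect_addr)
    start = time.perf_counter()
    for _ in range(N):
        client.sendall(b"ping")
        client.recv(64)
    elapsed = time.perf_counter() - start
    client.close()
    return elapsed

# TCP over loopback
tcp_srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
tcp_srv.bind(("127.0.0.1", 0))
tcp_srv.listen(1)
tcp_time = round_trip_time(tcp_srv, tcp_srv.getsockname(), socket.AF_INET)

# Unix domain socket (POSIX-only)
path = os.path.join(tempfile.mkdtemp(), "echo.sock")
unix_srv = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
unix_srv.bind(path)
unix_srv.listen(1)
unix_time = round_trip_time(unix_srv, path, socket.AF_UNIX)

print(f"tcp:  {tcp_time:.4f}s for {N} round trips")
print(f"unix: {unix_time:.4f}s for {N} round trips")
```

The per-server timing data we already collect is the real test, of course; this only shows the shape of the comparison.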
Dec 3 2019
Except that's the name of the software we're packaging (https://github.com/Wikia/poolcounter-prometheus-exporter), so I think there's little we should do there.
Dec 2 2019
This has been resolved for some time: