Wed, Jan 16
Just a note that for today's promotion of group1 to 1.33.0-wmf.13 (T206667), I segmented the group1 error-log dashboard into a view of just these timeout errors and a view excluding them. It was very helpful in keeping tabs on both the rise in timeouts and any side effects or unrelated errors. I plan on saving the dashboards and adding links to them in the train docs.
Tue, Jan 15
Cutting the branch.
Thu, Jan 10
Today's all-wiki deployment went much like yesterday's: we saw an increase in the MediaWiki error rate due to a flood of "timed out" errors and, seemingly as a side effect, 500s caused by nginx/apache/HHVM socket timeouts. The rate increases lasted from approximately 20:11 until 20:34 UTC, at which point they subsided to pre-deployment levels.
Dec 13 2018
@thcipriani +1 to the proposal in general. I think it adds clarity to the definition of artifacts and to the purpose of copies. Just some notes on config structs and the implementation of sane defaults.
Dec 12 2018
The latest image for use with the initial chart implementation is docker-registry.wikimedia.org/wikimedia/blubber:20181212233039-production.
Dec 10 2018
One refactoring option I can think of would be to make the config for copying project/application files explicit and to use a sane default.
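To make that concrete, here's a minimal Go sketch of an explicit copies config with a sane fallback. The struct and field names are invented for illustration and aren't Blubber's actual schema:

```go
package config

// CopiesConfig illustrates making the copying of project/application
// files an explicit part of the config. Names here are hypothetical.
type CopiesConfig struct {
	Sources     []string `yaml:"sources"`     // files/dirs to copy into the image
	Destination string   `yaml:"destination"` // target directory inside the image
}

// WithDefaults applies the sane default: when nothing is configured,
// copy the whole project into the application directory.
func (c CopiesConfig) WithDefaults() CopiesConfig {
	if len(c.Sources) == 0 {
		c.Sources = []string{"."}
	}
	if c.Destination == "" {
		c.Destination = "/srv/app"
	}
	return c
}
```

That keeps the copy behavior visible in every config while preserving the current implicit behavior for configs that don't mention it.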
Dec 2 2018
I just merged the final patch related to T210030: RedisBagOStuff is broken on beta, which removed the stale Redis server entries from the labs-related mediawiki-config, and the instances are now terminated.
Dec 1 2018
I'm also seeing MW fatals resulting from the inoperable Redis servers and the redis_lock entries in mediawiki-config that still reference them.
AFAICT, this error and related ones are no longer occurring. It's not yet clear exactly what the underlying issue was, but while troubleshooting with @thcipriani yesterday we noticed a few things that might have had an impact. Since @thcipriani graciously jumped in to take care of the actual fixes, he may want/need to correct some of this information. :)
Nov 30 2018
Looking at @Krenair's log of replication lag, there's no indication of lag at any point since the first entry ("Thu Nov 29 20:13:01 UTC 2018").
Nov 29 2018
Sorry, I posted the wrong server's output. :) Editing...
I'm also wondering how MediaWiki detects replication lag. Does it look at all DB hosts? And which MariaDB variables does it inspect? Is it possible that it's looking at deployment-db03 and erroneously detecting it as a lagged replica?
Looking at SHOW SLAVE STATUS\G on deployment-db04 shows no current lag according to Seconds_Behind_Master, but I'm looking into where we might see its historical values. I thought there was a Prometheus collector set up for beta's MariaDB instances, but I'm not sure whether it monitors master/slave replication status.
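For reference, a minimal sketch of polling that value directly, written in Go against the go-sql-driver/mysql driver. The DSN, host, and credentials are placeholders; this just illustrates reading Seconds_Behind_Master, not how our monitoring actually works:

```go
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/go-sql-driver/mysql" // MySQL/MariaDB driver
)

func main() {
	// Placeholder DSN; point it at the replica to check.
	db, err := sql.Open("mysql", "monitor:secret@tcp(deployment-db04:3306)/")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// SHOW SLAVE STATUS returns a single row with many columns, so scan
	// generically and pick out Seconds_Behind_Master by column name.
	rows, err := db.Query("SHOW SLAVE STATUS")
	if err != nil {
		log.Fatal(err)
	}
	defer rows.Close()

	cols, err := rows.Columns()
	if err != nil {
		log.Fatal(err)
	}
	vals := make([]sql.RawBytes, len(cols))
	ptrs := make([]interface{}, len(cols))
	for i := range vals {
		ptrs[i] = &vals[i]
	}
	for rows.Next() {
		if err := rows.Scan(ptrs...); err != nil {
			log.Fatal(err)
		}
		for i, col := range cols {
			if col == "Seconds_Behind_Master" {
				fmt.Printf("replica lag: %s seconds\n", vals[i])
			}
		}
	}
}
```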
Nov 26 2018
@Legoktm could you clarify whether this task tracks progress of programmatic extension installation or of programmatic enablement/disablement? The title says (and comments seem to relate to) the latter, but the description links to a section of the feedback article about the former.
Oct 22 2018
Another example of how to implement a decent YAML-to-JSON layer might be the Golang OpenAPI library's YAML parser.
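For example, the ghodss/yaml package (which, as I understand it, the go-openapi loader resembles) does the conversion by unmarshalling the YAML and re-marshalling it as JSON, so everything downstream can rely on encoding/json struct tags and ordinary JSON validation. A minimal sketch; the input document is arbitrary:

```go
package main

import (
	"fmt"
	"log"

	"github.com/ghodss/yaml" // converts YAML through an intermediate JSON form
)

func main() {
	doc := []byte("name: example\nports:\n  - 8080\n  - 8443\n")

	// YAMLToJSON unmarshals the YAML and re-marshals it as JSON, so
	// downstream code can use plain JSON decoding and schema validation.
	j, err := yaml.YAMLToJSON(doc)
	if err != nil {
		log.Fatalf("converting YAML: %v", err)
	}
	fmt.Println(string(j)) // {"name":"example","ports":[8080,8443]}
}
```

The appeal is that the YAML layer stays thin: one conversion function, and the existing JSON machinery does the rest.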
@greg, we noticed this in the triage meeting you weren't at. Would you care to adjust priority? ;-)
@hashar, would you please triage priority and placement on the RelEng Kanban board? Thanks!
@hashar, this is marked as "Done" in RelEng Kanban. Any update?
@hashar, any update? Moving to backlog until we have one.