cc: everyone who has been active in the toolsbeta project
The future parser has drawn few complaints, so we're ready to move on to actual upgrade testing.
Current theory is that this happens when the labs-private repo is in the process of being rebased.
Mon, Oct 16
Thu, Oct 12
Here's my latest attempt to describe what works. Once the concerned patches are merged I'll try to get this down on wikitech someplace.
Horizon can't quite delete everything yet, so I generally delete everything that Horizon can see first and then use the 'delete' link in wikitech. Horizon is /close/ to being able to do everything but it needs a bit of work.
It looks to me like this is filtered in maintain-views.yaml via logging_whitelist. However, I don't see that 'pagetranslation' has ever been in that list (or at least not since 2016-10-12, which is when the history becomes murky).
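Roughly what I'd expect the fix to look like, as a sketch only; the surrounding list entries are placeholders, and only the logging_whitelist key name comes from maintain-views.yaml itself:

    # hypothetical excerpt from maintain-views.yaml; the other entries are placeholders
    logging_whitelist:
      - block
      - delete
      - move
      - pagetranslation   # would need to be added here for those rows to be exposed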
Wed, Oct 11
When I refreshed puppet on the affected host, it included this diff:
Tue, Oct 10
ok -- I was expecting this table to be present in enwiki. If it's wikidata-specific then we're probably done. @Ladsgroup can you confirm?
I've run maintain-views, but the wb_terms table isn't getting replicated at all. I don't see any evidence of filtering in the sanitarium files but I may be looking in the wrong place... @Marostegui, any ideas?
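In case it helps narrow down where the filtering happens, this is roughly how I've been checking; a sketch only, with enwiki/enwiki_p used as placeholder database names:

    -- on the labsdb side: is there a view at all?
    SHOW FULL TABLES IN enwiki_p LIKE 'wb_terms';
    -- on the sanitarium/replica side: does the underlying table even exist?
    SHOW TABLES IN enwiki LIKE 'wb_terms';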
ok! I've raised the quota to 4 IPs. Let's leave this task open and you can nudge me when you're ready to clean up.
We don't support CamelCase in project names, so I've created a project called 'mwstake'. @MarkAHershberger is a project admin and can add other users or admins as needed.
Do you already have enough RAM/CPU quota to create the additional instances? Is it really just the IPs that are holding you back?
Approved, will do shortly
This seems to have been caused by https://gerrit.wikimedia.org/r/#/c/382415/, which has now been reverted.
Mon, Oct 9
Fri, Oct 6
I'm trying to reproduce the tools puppet compiler described here. A few things have clearly changed since this was last built... The hiera setup I seem to need looks like this:
Thu, Oct 5
I ran the export and import by hand just now, and I think we're getting the complete wiki.
I've directed shinken-wm to talk in #wikimedia-cloud-feed.
A --current dump is 8.6M, a --full dump is 7.2G. So doing --full may not be practical.
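For reference, this is roughly the shape of the two dumps being compared; a sketch using MediaWiki's dumpBackup.php, with the wiki name and output paths as assumptions rather than the exact command we run:

    # sketch only; --wiki value and output paths are assumptions
    # --current: latest revision of each page (the 8.6M dump)
    php maintenance/dumpBackup.php --wiki=labswiki --current > wikitech-current.xml
    # --full: every revision of every page (the 7.2G dump)
    php maintenance/dumpBackup.php --wiki=labswiki --full > wikitech-full.xml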
Is there anything I can do to nudge this along, short of 'clone Jaime'?
These boxes are up and installed and seem OK. Actual service implementation is tracked in T168470.
Adding a regex validation to the instance name in Horizon turns out to be non-trivial in the current version.
rabbit is now much quieter, so this is /maybe/ better. Closing for now, optimistically.
Wikitech is dumped using
Wed, Oct 4
Every labvirt is now running Linux labvirt1008 4.4.0-81-generic #104~14.04.1-Ubuntu SMP Wed Jun 14 12:45:52 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
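If anyone wants to double-check later, something like this works (the host range here is a guess, not the canonical inventory):

    # prints the running kernel on each labvirt; host range is a guess
    for h in labvirt10{01..18}; do echo -n "$h: "; ssh "$h" uname -r; done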
Tue, Oct 3
To add new files, copy them to download-01.download.eqiad.wmlabs:/srv/public_files/
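For example, something like this (the filename is just a placeholder):

    # placeholder filename; destination directory as above
    scp big-dataset.tar.gz download-01.download.eqiad.wmlabs:/srv/public_files/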
In order to keep big files off of NFS, I've created a static download site for things like this. Your file, for example, is now:
Approved! @chasemp will help with the specifics of performance testing.
Closed by accident or vandalism
Closed in error, best I can tell
Sun, Oct 1
Here's the latest mcelog. Without timestamps it's hard to correlate this with the failures, but it still looks bad.
The last syslog entry before the reboot was at Oct 1 01:21:01. It was down for many hours and didn't page because I had downtimed it during the hardware replacement and didn't clear the downtime before putting it back into service :( There's nothing in the syslog or kernel log to indicate distress.
Fri, Sep 29
I've rebuilt labvirt1015, 1017 and 1018 (and the labtestvirts) with 4.4.0-81. So now all of our virt nodes are running that kernel except for 1016, which needs an evacuation before I mess with it.
This is as fixed as it's going to be. Any time there's a designate outage I need to run the dnsleaks script to clean up.
An equivalent to 375941 was merged as part of a larger refactor, and I'm pretty sure this is adequate for the problem.
There are only a few left that are broken, and I've emailed all the owners.
Tue, Sep 26
I've confirmed that nova detects name collisions between 'camelcase' and 'CamelCase', so this isn't especially urgent. There's still a potential race if users get really unlucky and create instances with colliding names at exactly the same time.
Mon, Sep 25
which, due to the requirements of mw-vagrant, isn't possible for me
Fri, Sep 22
As far as I can see, the docs only describe setting ca_server once, for agents, in the [main] block. I am missing an explanation of why we would set it twice, and what setting it in [agent] does vs. what setting it in [master] does.
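To make the question concrete, this is the shape I'm asking about; a puppet.conf sketch with placeholder hostnames, not our actual config:

    # puppet.conf sketch; hostnames are placeholders
    [main]
        # documented usage: agents send cert requests to this host
        ca_server = puppetca.example.wmnet

    [agent]
        # what does setting it here add beyond [main]?
        ca_server = puppetca.example.wmnet

    [master]
        # ...and what does it mean for the master itself?
        ca_server = puppetca.example.wmnet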
This has almost totally stopped happening; when it does happen it's usually for a good (but new) reason. So I don't think this bug itself is useful anymore.
Thu, Sep 21
Done via GRANT SHOW VIEW ON *.* TO 's53508'@'%' on labsdb1001 and labsdb1003
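To double-check on each host, a standard follow-up query (not a record of what I actually ran):

    SHOW GRANTS FOR 's53508'@'%';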
Wed, Sep 20
This is modestly different from T42022, but it needs to be retitled to make that clear: T42022 is about public HTTP APIs, while this is about internal services, which can break even when the public APIs keep working.
I've added fullstack success % to the above graph. We still need to add some auto-cleanup functions to the fullstack test to keep accurate numbers.
Tue, Sep 19
I've moved labvirt1018 to 4.4.0-83 but can't reproduce this issue.
Is this something that could be done within the existing ores project?
@Reedy definitely no need to cherry-pick if this is getting pushed out today :)
Was network connectivity lost to the server at large or to the VMs running on that labvirt instance?
Sep 17 2017
@Reedy I'm on holiday, so I only got as far as seeing that that one use case produces the problem. I don't immediately know how to find all the mismatches, although there are quite a few:
That file is produced on silver via this command:
Sep 16 2017
I re-imaged labvirt1015 and 1017. They're now running 4.4.0-93-generic and rebooting fine. Do I know what just happened here? I do not.
I just upgraded labvirt1015 and 1017 to -93 and rebooted, and both lost network config just like we saw with -83. So something very bad is going on here. I'm going to re-image 1015 and see where I get.
Sep 14 2017
I have some API uptime stats at https://grafana.wikimedia.org/dashboard/db/wmcs-api-uptimes?orgId=1