That spreadsheet is also available at https://docs.google.com/spreadsheets/d/1TRimo0kT_YzlXl_RD3Z7zOZHdj5Piev31ALIKku7Y8g
Actually, now that I've said that thing about pulling records from wmflabsdotorg, let me just add a library call for that. Stay tuned...
I'm assuming that A records are the only thing the GUI will care about... it's trivial to add queries for other record types if that proves useful.
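For illustration, a minimal sketch of such an A-record lookup in Python with dnspython (my assumption; the real library call may go through the Designate API instead):

import dns.resolver

def get_a_records(fqdn):
    """Return the A-record IPs for fqdn, or [] if none exist."""
    try:
        # dns.resolver.resolve is the dnspython >= 2.0 spelling.
        answers = dns.resolver.resolve(fqdn, 'A')
    except (dns.resolver.NXDOMAIN, dns.resolver.NoAnswer):
        return []
    return [record.address for record in answers]

# e.g. get_a_records('tools.wmflabs.org')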
Thu, Apr 27
I have three different sets of proposed answers:
This is now available on all db hosts except for 1001, which is misbehaving.
I merged the puppet change. Next we need to update things with
The particular maintenance that prompted this task is now complete. We still need to improve here, though.
(obviously I vote for option 1, 'switch from PHP5 to HHVM for serving wikitech', since that one moves us forward and the others move us backwards)
So the reason this doesn't make MediaWiki HHVM-only is that normal non-WMF deploys of MediaWiki don't include twemproxy?
When the metadata service is fully down, puppet errors out. If instead the metadata service is responding and returning an empty string for the project id... I can't think of how or why that would happen, since the empty string would have to be embedded in an otherwise well-formed response.
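To make those two failure modes concrete, a hedged sketch (the endpoint and key follow the standard OpenStack metadata layout, but treat the details as assumptions):

import json
import urllib.request

METADATA_URL = 'http://169.254.169.254/openstack/latest/meta_data.json'

def get_project_id():
    # A fully-down service raises here, which is the case where puppet
    # errors out cleanly.
    with urllib.request.urlopen(METADATA_URL, timeout=5) as resp:
        data = json.load(resp)
    project_id = data.get('project_id', '')
    if not project_id:
        # The mystery case: a structurally valid response carrying an
        # empty project id.
        raise ValueError('metadata service returned an empty project id')
    return project_id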
Worst case scenario: What if every one of those instances suddenly partitioned and filled every bit of allocated space?
Filesystem             Size  Used Avail Use% Mounted on
/dev/sdb1              2.2T  381G  1.9T  18% /var/lib/nova/instances
/dev/sdb1              2.2T  1.2T  1.1T  53% /var/lib/nova/instances
/dev/sdb1              2.2T  1.5T  748G  67% /var/lib/nova/instances
/dev/sdb1              2.2T  1.4T  810G  64% /var/lib/nova/instances
/dev/sdb1              2.2T  1.5T  770G  66% /var/lib/nova/instances
/dev/sdb1              2.2T  1.5T  764G  66% /var/lib/nova/instances
/dev/sdb1              2.2T  1.5T  745G  67% /var/lib/nova/instances
/dev/sdb1              2.2T  1.8T  485G  79% /var/lib/nova/instances
/dev/sdb1              2.2T  1.6T  627G  72% /var/lib/nova/instances
/dev/mapper/tank-data  4.1T  1.4T  2.7T  34% /var/lib/nova/instances
/dev/mapper/tank-data  4.1T  1.7T  2.5T  41% /var/lib/nova/instances
/dev/mapper/tank-data  4.1T  2.1T  2.0T  52% /var/lib/nova/instances
/dev/mapper/tank-data  4.1T  1.9T  2.2T  47% /var/lib/nova/instances
/dev/mapper/tank-data  4.1T   90G  4.0T   3% /var/lib/nova/instances
Looks good to me. I added clarification about short-term failovers since that's the most frequent use case: https://wikitech.wikimedia.org/w/index.php?title=Nova_Resource%3ATools%2FAdmin&type=revision&diff=1757725&oldid=1757724
Nope, I never heard anything back.
The stack trace doesn't implicate anything that's not running on all the other wikis... does anyone have a theory for why this is happening on wikitech and not elsewhere? Is it a PHP vs. HHVM thing?
I updated the password so that's all good now.
For a quick start, I'm attaching a spreadsheet that shows each instance's allocated disk (from its flavor) next to its actual disk usage (via 'du' on the labvirts). It shows us actually consuming about 50% of allocated disk space.
Wed, Apr 26
The answer to the question:
I've updated https://wikitech.wikimedia.org/wiki/Wikitech-static with some vague maintenance instructions.
The root password to this host no longer works. Either someone fancied it up to use keys, or we need to do a rescue per https://support.rackspace.com/how-to/reset-your-server-password/
In our config, we have disk_allocation_ratio=1.5
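To put numbers on the worst case mentioned above, a back-of-the-envelope sketch (host counts are read off the df output above; treat them as approximate, not an inventory):

# 9 hosts with 2.2T /dev/sdb1 volumes, 5 with 4.1T tank-data volumes.
physical_tb = 9 * 2.2 + 5 * 4.1     # ~40.3 TB of real disk
ceiling_tb = physical_tb * 1.5      # what nova may promise at ratio 1.5
shortfall_tb = ceiling_tb - physical_tb
print(f'physical {physical_tb:.1f} TB, allocatable {ceiling_tb:.1f} TB, '
      f'worst-case shortfall {shortfall_tb:.1f} TB')
# So if every instance filled its allocation we'd be ~20 TB short; the
# 1.5 ratio is a bet that the observed ~50% utilization holds.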
Maintenance tasks are just:
I still can't reproduce this, even creating the exact set of breakages that were present when the issue appeared in prod.
We're pretty sure that the only Labs thing affected by this is instance creation. I've disabled instance creation for now, with https://gerrit.wikimedia.org/r/#/c/350414/ for Horizon and a live hack in OSM on silver.
Tue, Apr 25
Just turning off various dns services (including mdns) does not reproduce this issue.
Another thing I'd suggest is that we make Stretch available to users before we start pushing them off Trusty. Jessie's support window isn't much longer than Trusty's, so we should jump ahead rather than migrating our users over and over.
(No real objection to granting this resource request, if you're sure it's actually going to be useful to you)
Unless there's a public interface for e.g. Hadoop, you won't be able to route there from a Labs VM. Having a public IP doesn't help with that.
Oh, hm, in theory the DNS record still knows the IP, maybe I can dig it out from there.
This might turn out to be hard -- sink gets notified of instance deletion, but not until after the instance's IP has been freed, so we don't have a good way of following up within sink to find the associated proxies. We'd have to include identifying metadata (e.g. instance ID) on the proxy API side in order to correlate an instance with a proxy after the fact.
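A hypothetical sketch of that correlation idea (names and structures are mine, not the actual proxy API):

# Store the instance ID alongside each proxy at creation time so that a
# deletion notification, which arrives after the IP is freed, can still
# find the associated proxies.
proxies = {}  # proxy hostname -> {'backend_ip': ..., 'instance_id': ...}

def register_proxy(hostname, backend_ip, instance_id):
    proxies[hostname] = {'backend_ip': backend_ip, 'instance_id': instance_id}

def on_instance_deleted(instance_id):
    # The IP is already gone, but the instance ID still correlates.
    stale = [h for h, p in proxies.items() if p['instance_id'] == instance_id]
    for hostname in stale:
        del proxies[hostname]
    return stale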
It should be possible to gather lifespan stats for existing instances and see what people are creating these days (rough sketch below). Also, there are some half-measures that we definitely can take immediately:
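The stats-gathering part might look something like this (a sketch only; assumes an authenticated novaclient Client and admin rights for all_tenants):

from datetime import datetime, timezone

def instance_ages_days(nova):
    """Return sorted instance ages in days, using each server's
    'created' timestamp from the standard server record."""
    now = datetime.now(timezone.utc)
    ages = []
    for server in nova.servers.list(search_opts={'all_tenants': True}):
        created = datetime.strptime(server.created, '%Y-%m-%dT%H:%M:%SZ')
        ages.append((now - created.replace(tzinfo=timezone.utc)).days)
    return sorted(ages)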
For the most part I'd expect DNS to fail over gracefully -- in the cases where it doesn't, that's misbehavior (or misconfig) on the part of the clients. That is https://phabricator.wikimedia.org/T119660.
As soon as we disable Trusty we'll also be violating 'cattle, not pets' for most of our users. It will mean that any time they need to recreate an instance, they will also have to learn how to configure a new OS and adapt their work to run there.
Mon, Apr 24
Email sent to labs-announce, subject "Does anyone care about service groups?"
The result of bd808's work is https://tools.wmflabs.org/openstack-browser/proxy/
Sat, Apr 22
OK, 1417 and 1419 are now up and pooled as well.
Fri, Apr 21
-1417 and -1419 are now in the process of being puppetized. This is a note to myself to remember to queue those two over the weekend.
I think this is fixed -- Horizon now uses the mediawiki 2fa plugin for verification, same as wikitech.
I don't know what this means. I've never needed two-factor authentication on wikitech.wikimedia.org. I can log in there just fine.
OK, sounds good. I'll try to do some capacity assessment and bring this up at our next meeting.
So, this question relates to both storage needs and the appropriateness of Labs use: is this giant storage use something persistent and valuable, or more like a scratch pad? That is, if we create a 250GB instance today and then in 2019 you need a 400GB instance to handle growth, can you just throw out the old instance and make a new one? Or is the actual storage on the old instance valuable and hard to reproduce, such that you'd have to copy or save the data somehow?
Thu, Apr 20
root@MISC m5[keystone]> SELECT COUNT(*) FROM token;
1 row in set (0.02 sec)
Wed, Apr 19
I figured a big dump that users can search is better than a search widget since it's not that much data -- but, y'know, either way.
This page is just a proof of concept (not live-updating) but is this the kind of thing we're talking about?
@Legoktm are you able to actually reproduce this issue, or are you still digging back into that previous occurrence?
If anything it's most likely that OSM is adding a user to a group when the user is already in the group -- much of T150091 involved duplicating OSM behavior in a keystone callback. That said, in my tests it handled the duplication of effort without complaint.
root@MISC m5[keystone]> SELECT COUNT(*) FROM token;
There were about 550,000 tokens found by the queries in those two added crons: novaobserver and novaadmin tokens too young to expire but more than 0.1 days old. I deleted them all just now to give the cron half a chance. We'll see if 'LIMIT 10000' once an hour is enough to keep things tidied up.
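For reference, a hedged sketch of what each cron's batch might look like (column names and the 0.1-day predicate are assumptions -- keystone's token table stores 'expires', so age is inferred here by assuming a 24-hour token lifetime):

import pymysql

BATCH_SQL = """
    DELETE FROM token
     WHERE user_id IN (%(observer)s, %(admin)s)
       AND expires > NOW()                         -- too young to expire
       AND expires < NOW() + INTERVAL 1296 MINUTE  -- issued > 0.1 days ago
     LIMIT 10000
"""

def purge_batch(conn, observer_id, admin_id):
    """Delete one bounded batch; hourly runs keep any single pass cheap."""
    with conn.cursor() as cur:
        deleted = cur.execute(BATCH_SQL, {'observer': observer_id,
                                          'admin': admin_id})
    conn.commit()
    return deleted

# usage: purge_batch(pymysql.connect(db='keystone', ...), OBSERVER, ADMIN)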
I haven't tracked down the relationship, but I suspect this issue is a symptom of token overload, addressed in T163259. That bug should be fixed (albeit poorly) -- does this one look better to you as well?
Tue, Apr 18
On labtest I tried purging all tokens more than 1 day old, and performance gains were considerable. Project deletion took 2.5 minutes with 7-day-old tokens but only 20 seconds with 1-day-old tokens.
And I removed a bunch of other spare role definitions (e.g. 'observer' and 'admin').
...and now I'm going through the project entries and removing all members other than 'novaadmin'. Novaadmin stays just to keep the schema happy. I'm not sure whether we could remove the 'groupofnames' type from project entries; I don't want to risk it.
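For the record, the pruning step might look roughly like this with ldap3 (the DN is hypothetical; 'member' is the standard groupOfNames attribute):

from ldap3 import Connection, MODIFY_DELETE

KEEP = 'uid=novaadmin,ou=people,dc=wikimedia,dc=org'  # assumed DN

def prune_members(conn, project_dn):
    conn.search(project_dn, '(objectClass=groupOfNames)',
                attributes=['member'])
    members = conn.entries[0].member.values
    extras = [m for m in members if m != KEEP]
    if extras:
        # groupOfNames requires at least one member, which is exactly
        # why novaadmin has to stay.
        conn.modify(project_dn, {'member': [(MODIFY_DELETE, extras)]})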