Long ago, the Wikimedia Operations team decided to phase out Ubuntu on our servers in favor of Debian. It's a long, slow process that is still ongoing, but in production Trusty now runs on an ever-shrinking minority of our servers.
One of our quarterly goals was "Define a metric to track OpenStack system availability". Despite the weak phrasing, we elected not only to pick something to measure but also to actually measure it.
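The simplest version of such a metric is the fraction of probes against a user-facing API endpoint that succeed. Here is a minimal, hypothetical sketch of that idea in Python; the endpoint URL and probe interval are stand-ins, not our actual measurement setup.

```python
# A minimal availability probe sketch, assuming a Keystone endpoint at a
# hypothetical URL; the real metric definition may differ.
import time
import requests

KEYSTONE_URL = "https://openstack.example.org:5000/v3/"  # hypothetical endpoint
CHECK_INTERVAL = 60  # seconds between probes

successes = 0
attempts = 0

while True:
    attempts += 1
    try:
        # Keystone answers unauthenticated GETs on its version document,
        # so a 200 here means the API endpoint is reachable and responding.
        response = requests.get(KEYSTONE_URL, timeout=10)
        if response.ok:
            successes += 1
    except requests.RequestException:
        pass  # network failure or timeout counts as downtime
    print(f"availability so far: {successes / attempts:.4%}")
    time.sleep(CHECK_INTERVAL)
```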
The current physical servers for the <wiki>_p Wiki Replica databases are at the end of their useful life. Work started over a year ago on a project involving the DBA team and cloud-services-team to replace these aging servers (T140788). Besides being five years old, the current servers have other issues that the DBA team took this opportunity to fix:
- Data drift from production (T138967)
- No way to offer different levels of service for realtime applications vs. analytics queries (a client-side sketch follows this list)
- No automatic failover to another server when one fails
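To make the service-level and failover points concrete, here is a hedged sketch of what split client endpoints could look like; the hostnames, credentials, and timeouts are hypothetical placeholders, not announced endpoints.

```python
# Hypothetical split endpoints: one pool tuned for fast realtime queries,
# one for long-running analytics, each able to fail over behind a proxy.
import pymysql

WEB_HOST = "enwiki.web.db.example.org"              # hypothetical realtime pool
ANALYTICS_HOST = "enwiki.analytics.db.example.org"  # hypothetical analytics pool

# An analytics client connects to the analytics pool, where long queries
# are tolerated without affecting realtime users on the other pool.
conn = pymysql.connect(
    host=ANALYTICS_HOST,
    user="u12345",            # hypothetical tool credentials
    password="...",
    database="enwiki_p",
    read_timeout=600,         # long queries allowed on this pool
)
with conn.cursor() as cur:
    cur.execute("SELECT COUNT(*) FROM recentchanges")
    print(cur.fetchone())
```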
24% of Wikipedia edits over a three month period in 2016 were completed by software hosted in Cloud Services projects. In the same time period, 3.8 billion Action API requests were made from Cloud Services. We are the newly formed Cloud Services team at the Foundation, which maintains a stable and efficient public cloud hosting platform for technical projects relevant to the Wikimedia movement. -- https://blog.wikimedia.org/2017/09/11/introducing-wikimedia-cloud-services/
Back in year zero of Wikimedia Labs, a shocking number of services ran on a single box. A server named 'virt0' hosted the Wikitech website, Keystone, Glance, LDAP, RabbitMQ, ran a puppetmaster, and did a bunch of other things.
(reposted with minor edits from https://lists.wikimedia.org/pipermail/labs-l/2017-July/005036.html)
Debian Stretch was officially released on Saturday, and I've built a new Stretch base image for VPS use in the WMF cloud. All projects should now see an image type of 'debian-9.0-stretch' available when creating new instances.
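If you prefer launching instances from a script rather than the Horizon UI, something like the following openstacksdk sketch should work; the cloud name, flavor, and network here are hypothetical placeholders.

```python
# A minimal sketch using the openstacksdk cloud layer to launch an instance
# from the new image; cloud entry, flavor, and network are hypothetical.
import openstack

conn = openstack.connect(cloud="wmcs")  # hypothetical clouds.yaml entry

server = conn.create_server(
    name="stretch-test",
    image="debian-9.0-stretch",  # the new base image named in this post
    flavor="m1.small",           # hypothetical flavor
    network="public",            # hypothetical network name
    wait=True,                   # block until the VM reaches ACTIVE
)
print(server.status)
```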
Back in the dark ages of Labs, all instance puppet configuration was handled using the puppet LDAP backend. Each instance had a big record in LDAP that handled DNS, puppet classes, puppet variables, etc. It was a bit clunky, but this monolithic setup allowed @yuvipanda to throw together a simple but very useful tool, 'watroles'. Watroles answered two questions: which puppet roles were applied to a given instance, and which instances were using a given role.
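As a rough illustration, the second of those questions maps onto a single LDAP search, assuming the schema of puppet's LDAP node terminus (a puppetClass attribute on each instance entry); the server address, base DN, and class name below are hypothetical.

```python
# A hedged sketch of the kind of query watroles answered: "which instances
# use a given puppet role?" Names here are hypothetical placeholders.
from ldap3 import Server, Connection, SUBTREE

server = Server("ldap://ldap.example.org")
conn = Connection(server, auto_bind=True)  # anonymous bind for illustration

# One attribute search finds every instance record carrying the class.
conn.search(
    search_base="ou=hosts,dc=example,dc=org",        # hypothetical base DN
    search_filter="(puppetClass=role::example::web)",  # hypothetical class
    search_scope=SUBTREE,
    attributes=["dc"],
)
for entry in conn.entries:
    print(entry.entry_dn)
```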
The first very visible step in the plan to rename things away from the term 'labs' happened around 2017-06-05 15:00Z when IRC admins made the #wikimedia-labs IRC channel on Freenode invite-only and set up an automatic redirect to the new #wikimedia-cloud channel.
When @Ryan_Lane first built OpenStackManager and Wikitech, one of the first features he added was an interface to set up project-wide sudo policies via LDAP.
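For a sense of how such a policy might look, here is a hedged sketch using the standard sudoRole LDAP schema via the ldap3 library; the DNs, credentials, and values are hypothetical, not the actual Wikitech implementation.

```python
# A hedged illustration of a project-wide sudo policy stored in LDAP,
# using the standard sudoRole schema; all names here are hypothetical.
from ldap3 import Server, Connection

server = Server("ldap://ldap.example.org")
conn = Connection(server, user="cn=admin,dc=example,dc=org",
                  password="...", auto_bind=True)

# One sudoRole entry grants every project member passwordless root.
conn.add(
    "cn=default,ou=sudoers,cn=myproject,dc=example,dc=org",  # hypothetical DN
    object_class=["sudoRole"],
    attributes={
        "cn": "default",
        "sudoUser": "ALL",              # who the policy applies to
        "sudoHost": "ALL",              # which hosts in the project
        "sudoCommand": "ALL",           # which commands may be run
        "sudoOption": "!authenticate",  # no password prompt
    },
)
print(conn.result)
```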
For nearly a year, Horizon has supported instance management. It is altogether a better tool than the Special:NovaInstance page on Wikitech -- Horizon provides more useful status information for VMs, and has much better configuration management (for example, changing security groups for already-running instances).
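That security-group change amounts to a single API call under the hood; the sketch below shows it via the openstacksdk compute interface, with a hypothetical cloud name, server, and group.

```python
# A minimal sketch of the kind of change Horizon now allows on running
# instances; cloud entry, server, and group names are hypothetical.
import openstack

conn = openstack.connect(cloud="wmcs")  # hypothetical clouds.yaml entry

server = conn.compute.find_server("stretch-test")         # hypothetical VM
group = conn.network.find_security_group("web-traffic")   # hypothetical group

# Attach the security group without stopping or rebuilding the VM.
conn.compute.add_security_group_to_server(server, group)
```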
I've just installed a new public base image, 'debian-9.0-stretch (experimental)', and made it available for all projects. It should appear in the standard 'Source' UI in Horizon any time you create a new VM.
Andrew will be upgrading our OpenStack install from version 'Kilo' to version 'Liberty' on Tuesday the 2nd. The upgrade is scheduled to take up to three hours. Here's what to expect:
If you are exclusively a user of Tool Labs, you can ignore this post. If you use or administer another Labs project, this REQUIRES ACTION ON YOUR PART.
The Kubernetes ('k8s') backend for Tool Labs webservices is open to beta testers from the community as a replacement for Grid Engine
Horizon is the canonical implementation of OpenStack's Dashboard, which provides a web-based user interface to OpenStack services.
tools-login.wmflabs.org is on a new bastion host with twice the RAM and CPU of the old one. This should hopefully provide a better bandaid against it getting overloaded. More discussion about a longer-term solution is at https://phabricator.wikimedia.org/T131541
On Tuesday, 2016-04-05, we'll be upgrading Kubernetes to 1.2 and using a different deployment method as well. While this should have no user-facing impact (ideally!), the following things might be flaky for a period of time on that day: