Joe (Giuseppe Lavagetto)
Spy

User Details

User Since
Oct 3 2014, 5:57 AM (129 w, 5 d)
Availability
Available
LDAP User
Giuseppe Lavagetto
MediaWiki User
Unknown

Recent Activity

Today

Joe updated subscribers of T161675: Re-think puppet management for deployment-prep.
Wed, Mar 29, 6:27 AM · User-Joe, Beta-Cluster-Infrastructure, Release-Engineering-Team, Labs, Puppet
Joe edited the description of T161675: Re-think puppet management for deployment-prep.
Wed, Mar 29, 6:27 AM · User-Joe, Beta-Cluster-Infrastructure, Release-Engineering-Team, Labs, Puppet
Joe created T161675: Re-think puppet management for deployment-prep.
Wed, Mar 29, 6:25 AM · User-Joe, Beta-Cluster-Infrastructure, Release-Engineering-Team, Labs, Puppet
Joe added a comment to T114104: pybal doesn't fully manage LVS table leaving stale services (on IP change).

The real solution for this is to dedicate real developer time to pybal to move it to use an FSM and a netlink-based Python ipvs client.

Wed, Mar 29, 6:15 AM · Traffic, Operations, Pybal

Mon, Mar 27

Joe added a comment to T97530: SCB services should not use a proxy for our domains.

This is now a blocker (sort of) for the current work on using DNS for discovery. As soon as I switched the parameter for the restbase URL to the discovery one (so restbase.svc.codfw.wmnet to restbase.discovery.wmnet, both resolving to the same IP), cxserver and mobileapps started complaining, and investigation showed that the issue was requests being directed to the proxy instead of going direct.

Mon, Mar 27, 12:58 PM · Services (done), User-Ryasmeen, VisualEditor, User-mobrovac, Operations, service-template-node, Graphoid, Citoid
Joe added a comment to T154658: Prepare and improve the datacenter switchover procedure.

What still needs to be done:

Mon, Mar 27, 7:09 AM · DC-Switchover-Prep-Q3-2016-17, Epic, Wikimedia-Multiple-active-datacenters, Operations
Joe edited the description of T154658: Prepare and improve the datacenter switchover procedure.
Mon, Mar 27, 6:44 AM · DC-Switchover-Prep-Q3-2016-17, Epic, Wikimedia-Multiple-active-datacenters, Operations

Thu, Mar 23

Joe added a comment to T156023: Check the size of every cluster in codfw to see if it matches eqiad's capacity.

@elukey looking at the numbers, the only slightly worrying situation is for apis in codfw: if we lose row B we lose more than half of the capacity. We might want to add apis in row A or row C once we get new hardware.

Thu, Mar 23, 1:29 PM · Patch-For-Review, DC-Switchover-Prep-Q3-2016-17, Epic, Wikimedia-Multiple-active-datacenters, Operations
Joe added a comment to T161159: Cannot access the database: Can't connect to MySQL server on '10.192.48.41' (111) (10.192.48.41).

@Marostegui is it depooled from mediawiki-config? If not, we might want to do so.

Thu, Mar 23, 7:13 AM · DBA

Tue, Mar 21

Joe edited the description of T149085: Kubernetes TODO List.
Tue, Mar 21, 10:22 AM · Kubernetes-production-experiment, Prod-Kubernetes
Joe added a comment to T156924: Allow integration of data from etcd into the MediaWiki configuration.

Something like that seems fine, though at first glance it doesn't seem to leverage JSON quite as much as it could. I'll try to whip something up with an EtcdConfig class in my next WIP patch. Editing somewhat deeply nested dictionary values via the command line will be interesting.

Tue, Mar 21, 7:54 AM · MediaWiki-Platform-Team, Patch-For-Review, Wikimedia-Multiple-active-datacenters, Services (watching), Performance-Team, discovery-system, User-Joe, User-mobrovac, MediaWiki-Configuration, Operations, Wikimedia-Developer-Summit (2017)

Sun, Mar 19

Joe added a comment to T160229: Back up of Commons files.

I am wondering how this task is related to "skills required: javascript, css, etc" at all?

Apparently this task was filed according to a standard format under the assumption that some volunteer or intern could potentially do something about it. If it's in WMF operations realm, then the task description should be adapted.

Sun, Mar 19, 10:05 AM · Operations, Datasets-General-or-Unknown, Community-Wishlist-Survey-2016, Commons

Thu, Mar 16

Joe added a comment to T159850: JobQueue Redis codfw replicas periodically lagging.

@akosiaris are you sure about that? If replica is broken the rdb file is transferred and from what I see only some are larger than 500 MB.

Thu, Mar 16, 9:19 AM · Patch-For-Review, User-Elukey, Operations

Wed, Mar 15

jcrespo awarded T156924: Allow integration of data from etcd into the MediaWiki configuration a Mountain of Wealth token.
Wed, Mar 15, 12:05 PM · MediaWiki-Platform-Team, Patch-For-Review, Wikimedia-Multiple-active-datacenters, Services (watching), Performance-Team, discovery-system, User-Joe, User-mobrovac, MediaWiki-Configuration, Operations, Wikimedia-Developer-Summit (2017)
Joe added a comment to T156924: Allow integration of data from etcd into the MediaWiki configuration.

Thinking of a general way to represent any mediawiki-config variable left me with awkward, overcomplicated objects like the following:

Wed, Mar 15, 12:03 PM · MediaWiki-Platform-Team, Patch-For-Review, Wikimedia-Multiple-active-datacenters, Services (watching), Performance-Team, discovery-system, User-Joe, User-mobrovac, MediaWiki-Configuration, Operations, Wikimedia-Developer-Summit (2017)

Tue, Mar 14

Joe added a comment to T156924: Allow integration of data from etcd into the MediaWiki configuration.

Assuming there are decent and simple libraries, having cache-aside APC with a logical TTL in front of etcd seems viable. Writing a spare JSON version on re-fetch might be useful as a fallback if etcd is down and hhvm was restarted or something, but if we assume etcd is truly HA, then that's *probably* overkill.

I'd want to know:

  • The topology of our etcd. Is it cross-DC or local-only and replicated somehow?
Tue, Mar 14, 5:55 PM · MediaWiki-Platform-Team, Patch-For-Review, Wikimedia-Multiple-active-datacenters, Services (watching), Performance-Team, discovery-system, User-Joe, User-mobrovac, MediaWiki-Configuration, Operations, Wikimedia-Developer-Summit (2017)
Joe updated subscribers of T159412: Convert all of our site.pp/roles to the role/profile paradigm.

@Ciencia_Al_Poder care to explain why you removed the "easy" tag?

Tue, Mar 14, 7:22 AM · Labs, Traffic, Technical-Debt, Puppet, Operations

Fri, Mar 10

Joe added parent tasks for T160178: MediaWiki Datacenter Switchover automation: T156100: DNS: dynamically generate entries for service discovery, T156924: Allow integration of data from etcd into the MediaWiki configuration.
Fri, Mar 10, 3:34 PM · Patch-For-Review, DC-Switchover-Prep-Q3-2016-17, Epic, Wikimedia-Multiple-active-datacenters, Operations
Joe added a subtask for T156100: DNS: dynamically generate entries for service discovery: T160178: MediaWiki Datacenter Switchover automation.
Fri, Mar 10, 3:34 PM · Patch-For-Review, Wikimedia-Multiple-active-datacenters, Services (watching), Performance-Team, discovery-system, User-Joe, User-mobrovac, MediaWiki-Configuration, Operations, Wikimedia-Developer-Summit (2017)
Joe added a subtask for T156924: Allow integration of data from etcd into the MediaWiki configuration: T160178: MediaWiki Datacenter Switchover automation.
Fri, Mar 10, 3:34 PM · MediaWiki-Platform-Team, Patch-For-Review, Wikimedia-Multiple-active-datacenters, Services (watching), Performance-Team, discovery-system, User-Joe, User-mobrovac, MediaWiki-Configuration, Operations, Wikimedia-Developer-Summit (2017)
Joe renamed T160178: MediaWiki Datacenter Switchover automation from "Switchover automation" to "MediaWiki Datacenter Switchover automation".
Fri, Mar 10, 3:33 PM · Patch-For-Review, DC-Switchover-Prep-Q3-2016-17, Epic, Wikimedia-Multiple-active-datacenters, Operations
Joe created T160178: MediaWiki Datacenter Switchover automation.
Fri, Mar 10, 3:33 PM · Patch-For-Review, DC-Switchover-Prep-Q3-2016-17, Epic, Wikimedia-Multiple-active-datacenters, Operations

Wed, Mar 8

Joe moved T159687: etcd switchover/enhancements from Backlog to Doing on the User-Joe board.
Wed, Mar 8, 3:13 PM · Patch-For-Review, User-Joe, Operations
Joe closed T152977: conftool service removal bugs as "Resolved".
Wed, Mar 8, 1:54 PM · User-Joe, Operations-Software-Development, Operations
Joe added a comment to T152977: conftool service removal bugs.

This is now solved with the latest version of conftool

Wed, Mar 8, 1:54 PM · User-Joe, Operations-Software-Development, Operations
Joe added a comment to T159163: PuppetDB is auto-deactivating hosts.

@akosiaris you are correct, but I think that's inevitable.

Wed, Mar 8, 1:21 PM · Puppet, Operations

Tue, Mar 7

Joe added a comment to T158730: Automate WMF wiki creation.

The point is that updating dblists via gerrit and running scap is one of the avoidable steps in the task description. I imagine etcd would have structured data about each wiki, and the canonical map from domain name to wiki ID. To figure out exactly what structured data should be in there, we need to survey all the services in my list above, but for mediawiki-config it is dblist membership (e.g. $wikiTags in CommonSettings.php line 165) and wikiversions.json.

Tue, Mar 7, 5:34 AM · Services (watching), Patch-For-Review, MediaWiki-Configuration, Release-Engineering-Team

Mon, Mar 6

Joe triaged T159687: etcd switchover/enhancements as "Normal" priority.
Mon, Mar 6, 11:50 AM · Patch-For-Review, User-Joe, Operations
Joe added a comment to T159687: etcd switchover/enhancements.

Just to give some context: it might be possible to try to have a true multi-dc cluster for etcd, but that will need:

Mon, Mar 6, 10:53 AM · Patch-For-Review, User-Joe, Operations
Joe created T159687: etcd switchover/enhancements.
Mon, Mar 6, 10:38 AM · Patch-For-Review, User-Joe, Operations
Joe added a comment to T159618: Job queue rising to nearly 3 million jobs.

@Lydia_Pintscher not really, I'm monitoring the jobqueue and it's constantly decreasing in size. We should be ok.

Mon, Mar 6, 9:33 AM · Wikidata, MediaWiki-JobQueue, Operations
Joe added a comment to T156924: Allow integration of data from etcd into the MediaWiki configuration.

I don't think it will be possible to have a local process (other than scap) write code to the disk after we deploy RepoAuthoritative. I did a few commits to improve RA support last year, and Faidon is keen to see that work be completed.

As such, a configuration mechanism that doesn't require rebuilding the RA repo will be very useful to have. Rebuild time could be an hour or more.

The above cases are all generated + code-reviewed, or manually edited, or created centrally by (in)direct human command. We may not want to have MediaWiki write a PHP file. In which case JSON is a natural pick.

We'll have to have JSON or some other config language (preferably supporting comments) in mediawiki-config if we're going to support RepoAuthoritative with fast config changes. It will need to have APC in front of it unless we want to have a substantial deserialization overhead on every request. That's what we do for extension.json.

Mon, Mar 6, 7:11 AM · MediaWiki-Platform-Team, Patch-For-Review, Wikimedia-Multiple-active-datacenters, Services (watching), Performance-Team, discovery-system, User-Joe, User-mobrovac, MediaWiki-Configuration, Operations, Wikimedia-Developer-Summit (2017)
Joe added a comment to T156924: Allow integration of data from etcd into the MediaWiki configuration.

and caching in APC sounds like it would simplify things a lot. But what constitutes simple discovery? Would DNS be appropriate for all state we want MediaWiki to read from etcd instead of manually scap'ed files, or just a subset? I imagine it could work for all of it, but not sure TXT fields are a good fit for everything. Might add more complexity. Presumably that would mean we centrally populate those DNS entries from etcd/confd?

Mon, Mar 6, 7:05 AM · MediaWiki-Platform-Team, Patch-For-Review, Wikimedia-Multiple-active-datacenters, Services (watching), Performance-Team, discovery-system, User-Joe, User-mobrovac, MediaWiki-Configuration, Operations, Wikimedia-Developer-Summit (2017)
Joe added a comment to T156924: Allow integration of data from etcd into the MediaWiki configuration.

As a general comment on the rest of the thread:

Mon, Mar 6, 6:59 AM · MediaWiki-Platform-Team, Patch-For-Review, Wikimedia-Multiple-active-datacenters, Services (watching), Performance-Team, discovery-system, User-Joe, User-mobrovac, MediaWiki-Configuration, Operations, Wikimedia-Developer-Summit (2017)
Joe added a comment to T156924: Allow integration of data from etcd into the MediaWiki configuration.
  • Given the way PHP works, we can't have one thread per server watching etcd; instead we would need to make a request to etcd for every request HHVM gets. This is highly suboptimal and etcd is not guaranteed to be able to take such abuse (on the order of 20K recursive reads/s, probably). At least, I never tried and I would recommend against it.
  • etcd is resilient to failures of a single machine but is not built to be a low-latency service, even for reads. It might well add 10-20 ms on average to every MediaWiki request.

No, you would have an APC cache in front of etcd, with locking so that there is no cache stampede from multiple threads when the cache is absent. If the polling interval is 10s, with say 150 servers, that is 15 req/s, and a 10s polling interval is still faster than scap. Even if all servers requested their configuration simultaneously, it would only tie up the server for perhaps a few tens of milliseconds.

  • Every request to MediaWiki would depend crucially on etcd being available, while we designed the rest of our apps to be able to work correctly if etcd goes down for any reason. Of course there are workarounds, but still...

No, the APC cache entry would have no expiry, so if etcd is down at the end of a polling interval, it could just use the stale cache. There would only be a hard dependency when HHVM restarts. It doesn't seem more risky than our dependency on DNS. It is supposed to be a highly-available service.

The advantage this has over etcd/confd/fs/APC is that it is simpler, there are fewer moving parts.
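
The scheme described in this reply can be sketched roughly as follows. This is a minimal illustration, not the MediaWiki or HHVM implementation: the class name `PolledConfigCache`, the `fetch` callable (standing in for a read from etcd) and the in-process attributes (standing in for APC) are all assumptions made for the example. The cached entry never expires; a logical TTL only decides when to re-fetch, a lock keeps multiple threads from refreshing at once, and a failed refresh falls back to the stale value.

```python
import threading
import time


class PolledConfigCache:
    """Cache-aside wrapper with a logical TTL and stale fallback (hypothetical)."""

    def __init__(self, fetch, poll_interval=10.0, clock=time.monotonic):
        self._fetch = fetch                # assumed callable that reads from etcd
        self._poll_interval = poll_interval
        self._clock = clock
        self._lock = threading.Lock()
        self._value = None
        self._fetched_at = None            # None means the cache is cold

    def _is_fresh(self):
        return (self._fetched_at is not None
                and self._clock() - self._fetched_at < self._poll_interval)

    def get(self):
        if self._is_fresh():
            return self._value
        have_value = self._fetched_at is not None
        # With a stale value available, do not wait on the lock: one thread
        # refreshes while the others keep serving the stale copy.
        # On a cold cache there is nothing to serve, so callers wait.
        if not self._lock.acquire(blocking=not have_value):
            return self._value
        try:
            if not self._is_fresh():       # another thread may have refreshed
                try:
                    self._value = self._fetch()
                    self._fetched_at = self._clock()
                except Exception:
                    if self._fetched_at is None:
                        raise              # cold cache: hard dependency on the backend
                    # Backend unavailable: keep the stale value and wait
                    # for the next polling interval before retrying.
                    self._fetched_at = self._clock()
            return self._value
        finally:
            self._lock.release()


def steady_state_request_rate(servers, poll_interval):
    """Requests per second hitting the backend if each server polls once per interval."""
    return servers / poll_interval
```

With the figures quoted above, `steady_state_request_rate(150, 10)` gives 15 requests per second, assuming the lock collapses each server's threads into a single refresh per interval.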

Mon, Mar 6, 6:55 AM · MediaWiki-Platform-Team, Patch-For-Review, Wikimedia-Multiple-active-datacenters, Services (watching), Performance-Team, discovery-system, User-Joe, User-mobrovac, MediaWiki-Configuration, Operations, Wikimedia-Developer-Summit (2017)

Fri, Mar 3

Joe added a comment to T139372: Set up oresrdb redis node in codfw.

So I thought a bit about it and came up with the following alternative solution

Fri, Mar 3, 5:04 PM · Patch-For-Review, Operations, Revision-Scoring-As-A-Service-Backlog
Joe added a comment to T156924: Allow integration of data from etcd into the MediaWiki configuration.

An example of an output file is:

Fri, Mar 3, 11:26 AM · MediaWiki-Platform-Team, Patch-For-Review, Wikimedia-Multiple-active-datacenters, Services (watching), Performance-Team, discovery-system, User-Joe, User-mobrovac, MediaWiki-Configuration, Operations, Wikimedia-Developer-Summit (2017)
Joe added a comment to T149617: Integrating MediaWiki (and other services) with dynamic configuration.

So, I just found out that the DNS cache feature we were supposedly using in HHVM has been removed from it for some time, so while we have the ini setting in our setup, it's not having much of an effect.

Fri, Mar 3, 10:45 AM · Patch-For-Review, Wikimedia-Multiple-active-datacenters, Services (watching), Performance-Team, discovery-system, User-Joe, User-mobrovac, MediaWiki-Configuration, Operations, Wikimedia-Developer-Summit (2017)
Joe added a comment to T158730: Automate WMF wiki creation.

@tstarling I agree, dblists is one of the things that could be stored in etcd and read from there. On the other hand, it's such a simple and relatively stable list that we could also decide to maintain this as a simple configuration file that we distribute across the cluster in a standard format, and we expect every application to read from disk.

Fri, Mar 3, 6:58 AM · Services (watching), Patch-For-Review, MediaWiki-Configuration, Release-Engineering-Team
Joe added a comment to T156924: Allow integration of data from etcd into the MediaWiki configuration.

Yes, I'd rather implement this with an EtcdConfig class or something instead of adding more complicated logic to mediawiki-config. mediawiki-config would just point MW to etcd for the list of servers instead of having code to read JSON, cache in APC, etc. And MW would then read from etcd (or JSON I guess), cache in APC. We could probably do the same mtime check that we do for extension.json for updating the cache when the source JSON files change.
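
The mtime check mentioned here could look roughly like the sketch below. It is only an illustration of the idea, not the extension.json loader: the function name `load_json_cached` is hypothetical, and a module-level dictionary stands in for APC.

```python
import json
import os

# path -> (mtime in nanoseconds, parsed data); a stand-in for APC
_cache = {}


def load_json_cached(path):
    """Return parsed JSON, re-reading the file only when its mtime changes."""
    mtime = os.stat(path).st_mtime_ns
    entry = _cache.get(path)
    if entry is not None and entry[0] == mtime:
        return entry[1]                    # source file unchanged: cache hit
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    _cache[path] = (mtime, data)
    return data
```

The stat call on every lookup is cheap compared with deserializing the file, which is the overhead the cache is meant to avoid.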

Fri, Mar 3, 6:51 AM · MediaWiki-Platform-Team, Patch-For-Review, Wikimedia-Multiple-active-datacenters, Services (watching), Performance-Team, discovery-system, User-Joe, User-mobrovac, MediaWiki-Configuration, Operations, Wikimedia-Developer-Summit (2017)

Thu, Mar 2

Joe added a comment to T152562: Port fundraising stats off Ganglia.

@Jgreen any idea when it will happen? (all FR to jessie, I mean).

Thu, Mar 2, 8:06 AM · Operations, fundraising-tech-ops
Joe created T159412: Convert all of our site.pp/roles to the role/profile paradigm.
Thu, Mar 2, 7:44 AM · Labs, Traffic, Technical-Debt, Puppet, Operations
Joe created T159411: Uniform cluster nomenclature across puppet.
Thu, Mar 2, 7:37 AM · Labs, Traffic, Technical-Debt, Puppet, Operations
Joe added a comment to T139372: Set up oresrdb redis node in codfw.

Again regarding precaching (which is surely duplicable): do we *really* need it?

Thu, Mar 2, 6:17 AM · Patch-For-Review, Operations, Revision-Scoring-As-A-Service-Backlog

Wed, Mar 1

Joe added a comment to T139372: Set up oresrdb redis node in codfw.

in case of active/active configuration the caches will be split though I don't think that's a problem in practice.

I tend to disagree. ORES caches scores based on revision. So if someone makes a request and it gets cached in codfw, it will be useless when someone else makes a request to eqiad. I highly encourage replication of redis caches (the broker part is okay without replication).

Wed, Mar 1, 5:25 PM · Patch-For-Review, Operations, Revision-Scoring-As-A-Service-Backlog

Tue, Feb 28

Joe moved T156924: Allow integration of data from etcd into the MediaWiki configuration from Backlog to Doing on the User-Joe board.
Tue, Feb 28, 3:51 PM · MediaWiki-Platform-Team, Patch-For-Review, Wikimedia-Multiple-active-datacenters, Services (watching), Performance-Team, discovery-system, User-Joe, User-mobrovac, MediaWiki-Configuration, Operations, Wikimedia-Developer-Summit (2017)
Joe moved T156922: Prepare a reasonably performant warmup tool for MediaWiki caches (memcached/apc) from Backlog to Doing on the User-Joe board.
Tue, Feb 28, 3:51 PM · User-Joe, Patch-For-Review, Performance-Team, DC-Switchover-Prep-Q3-2016-17, Epic, Wikimedia-Multiple-active-datacenters, Operations
Joe added a project to T156922: Prepare a reasonably performant warmup tool for MediaWiki caches (memcached/apc): User-Joe.
Tue, Feb 28, 3:38 PM · User-Joe, Patch-For-Review, Performance-Team, DC-Switchover-Prep-Q3-2016-17, Epic, Wikimedia-Multiple-active-datacenters, Operations
Joe added a comment to T156922: Prepare a reasonably performant warmup tool for MediaWiki caches (memcached/apc).

I took what @Krinkle did in his patchset, fixed a couple of things in order to implement the "clone mode" and be able to simulate the full procedure:

Tue, Feb 28, 3:01 PM · User-Joe, Patch-For-Review, Performance-Team, DC-Switchover-Prep-Q3-2016-17, Epic, Wikimedia-Multiple-active-datacenters, Operations

Mon, Feb 27

Joe moved T149617: Integrating MediaWiki (and other services) with dynamic configuration from Blocked on others to Doing on the User-Joe board.
Mon, Feb 27, 9:16 AM · Patch-For-Review, Wikimedia-Multiple-active-datacenters, Services (watching), Performance-Team, discovery-system, User-Joe, User-mobrovac, MediaWiki-Configuration, Operations, Wikimedia-Developer-Summit (2017)

Feb 24 2017

Joe closed T155823: Expand conftool to support multiple objects via a schema definition., a subtask of T149617: Integrating MediaWiki (and other services) with dynamic configuration, as "Resolved".
Feb 24 2017, 5:28 PM · Patch-For-Review, Wikimedia-Multiple-active-datacenters, Services (watching), Performance-Team, discovery-system, User-Joe, User-mobrovac, MediaWiki-Configuration, Operations, Wikimedia-Developer-Summit (2017)
Joe closed T155823: Expand conftool to support multiple objects via a schema definition. as "Resolved".
Feb 24 2017, 5:28 PM · DC-Switchover-Prep-Q3-2016-17, Patch-For-Review, Wikimedia-Multiple-active-datacenters, Services (watching), Performance-Team, discovery-system, User-Joe, User-mobrovac, MediaWiki-Configuration, Operations
Joe added a comment to T155823: Expand conftool to support multiple objects via a schema definition..

The code is done and a package has been created, although still only in experimental. This task can be considered resolved though.

Feb 24 2017, 5:28 PM · DC-Switchover-Prep-Q3-2016-17, Patch-For-Review, Wikimedia-Multiple-active-datacenters, Services (watching), Performance-Team, discovery-system, User-Joe, User-mobrovac, MediaWiki-Configuration, Operations
Joe closed T156009: Create an etcd cluster in codfw as "Resolved".
Feb 24 2017, 5:27 PM · Patch-For-Review, User-Joe, DC-Switchover-Prep-Q3-2016-17, Epic, Wikimedia-Multiple-active-datacenters, Operations
Joe closed T156009: Create an etcd cluster in codfw, a subtask of T154658: Prepare and improve the datacenter switchover procedure, as "Resolved".
Feb 24 2017, 5:27 PM · DC-Switchover-Prep-Q3-2016-17, Epic, Wikimedia-Multiple-active-datacenters, Operations

Feb 22 2017

Joe added a comment to T158239: Improve GettingStarted data storage strategy.

Answering some of my questions:

Feb 22 2017, 3:05 PM · User-Elukey, MediaWiki-extensions-GettingStarted, Collaboration-Team-Triage
Joe added a comment to T158239: Improve GettingStarted data storage strategy.

Thanks a lot Aaron for opening this task. I am a bit ignorant about this extension and I don't get the "this extension does reads/writes from any datacenter". From the ops point of view, the Redis cluster in codfw is used only for replication from eqiad, so we don't expect any live traffic to reach that pool directly. Is it a bad assumption?

It basically maintains a cache of which pages are in which categories, so it can use Redis SRANDMEMBER to select random pages from the categories.

It expects:

  1. To write to Redis master when page edits are made (so from the master data center), or from maintenance/populate_categories.php.
  2. Read from Redis slave, which tracks the master, on any web request.
  3. Redis persistence. It doesn't expect Redis to empty out. If it does, we need to re-run the maintenance script.

    Most of the key stuff is in https://phabricator.wikimedia.org/diffusion/EGST/browse/master/RedisCategorySync.php (including the hook listeners). Ping me on IRC or reply here if I can help explain.
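
The category cache described above can be pictured with a small in-memory stand-in. This is not the code in RedisCategorySync.php and does not talk to Redis: the class `CategoryIndex` and its method names are hypothetical, and a Python dict of sets imitates one Redis set per category, with `random_page` playing the role of SRANDMEMBER.

```python
import random


class CategoryIndex:
    """In-memory stand-in for one Redis set per category (hypothetical)."""

    def __init__(self, rng=None):
        self._members = {}                 # category -> set of page IDs
        self._rng = rng or random.Random()

    def add_page(self, category, page_id):
        # Analogous to SADD, done on the write path when a page is edited.
        self._members.setdefault(category, set()).add(page_id)

    def remove_page(self, category, page_id):
        # Analogous to SREM, when a page leaves the category.
        self._members.get(category, set()).discard(page_id)

    def random_page(self, category):
        # Analogous to SRANDMEMBER: one random member, or None if the set is empty.
        members = self._members.get(category)
        if not members:
            return None
        return self._rng.choice(sorted(members))


index = CategoryIndex()
for page_id in (101, 102, 103):
    index.add_page("Copyedit", page_id)
print(index.random_page("Copyedit"))       # one of 101, 102, 103
print(index.random_page("Missing"))        # None
```

If the sets are emptied, every lookup returns nothing until they are repopulated, which is why the extension relies on persistence and on re-running the maintenance script.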
Feb 22 2017, 2:51 PM · User-Elukey, MediaWiki-extensions-GettingStarted, Collaboration-Team-Triage
Joe added a comment to T94239: Scap is lacking a license.

My preference for standalone tools is always the GPL v3, because there is no reason for people to use it in different contexts

Feb 22 2017, 6:58 AM · Scap, Software-Licensing, WMF-Legal, Documentation

Feb 20 2017

GitHub <noreply@github.com> committed rMSCP6091c057b77c: Merge pull request #161 from Ladsgroup/cp_disable_ores (authored by Joe).
Merge pull request #161 from Ladsgroup/cp_disable_ores
Feb 20 2017, 5:19 PM

Feb 17 2017

Joe added a comment to T125735: Warning: timed out after 0.2 seconds when connecting to rdb1001.eqiad.wmnet [110]: Connection timed out.

Hi! I'm the one who suggested most of those timeout changes. Some have different historical reasons, but I think we can safely raise the connect timeout for the jobrunners (NOT for the common appservers).

Feb 17 2017, 7:26 AM · User-Elukey, Operations, Wikimedia-log-errors

Feb 15 2017

Joe moved T155823: Expand conftool to support multiple objects via a schema definition. from Blocked on others to Doing on the User-Joe board.
Feb 15 2017, 4:40 PM · DC-Switchover-Prep-Q3-2016-17, Patch-For-Review, Wikimedia-Multiple-active-datacenters, Services (watching), Performance-Team, discovery-system, User-Joe, User-mobrovac, MediaWiki-Configuration, Operations
Joe added a comment to T156922: Prepare a reasonably performant warmup tool for MediaWiki caches (memcached/apc).

warmup is one of the things that bounds our read-only time during the switchover, in that case we could start warming up wikis sorted by e.g. their pageviews to further shorten the acceptable read-only time.

That would significantly complicate the script as well as the actual switchover process. You'd have to deploy many changes to mw-config during the switchover to gradually read-only more and more wikis. The warmup script, meanwhile, takes less than a minute to run. I doubt we'd be reasonably saving any time considering the gradual read-only switching would have to be done manually and is about saving a subset of 50 seconds time.

Indeed, it seems a whole lot of effort for small gains over 50s. Do you know if we could simulate a warmup (and a wipe beforehand) in codfw given how it is configured today in mediawiki?

Feb 15 2017, 11:50 AM · User-Joe, Patch-For-Review, Performance-Team, DC-Switchover-Prep-Q3-2016-17, Epic, Wikimedia-Multiple-active-datacenters, Operations
Joe added a comment to T143349: Deprecate precise instances in Labs by 2017-03-31.

I think we should simply drop 5.3 from the CI tests, then. I wasn't aware that the PHP versions had to be co-installable, which makes a custom 5.3 build for trusty a far more complicated endeavour.

Feb 15 2017, 6:23 AM · Patch-For-Review, Labs-Infrastructure, Labs

Feb 14 2017

Joe added a comment to T156023: Check the size of every cluster in codfw to see if it matches eqiad's capacity.

Also note that while for videoscalers and jobrunners it is advisable to reimage, in the other cases a simple change of role in puppet is ok.

Feb 14 2017, 11:22 AM · Patch-For-Review, DC-Switchover-Prep-Q3-2016-17, Epic, Wikimedia-Multiple-active-datacenters, Operations
Joe added a comment to T156023: Check the size of every cluster in codfw to see if it matches eqiad's capacity.

If the above counts are consistent, I'd like to:

  1. reimage 3 appservers (40 cores) as api_appservers
  2. reimage 2 appservers (40 cores) as imagescalers
  3. reimage 1 appserver (40 cores) as jobrunner
  4. reimage 2 appservers (32 cores) as videoscalers

Seems sane to me to balance things a bit more in codfw
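
As a quick check of the plan above, the cores moved to each role can be totalled. The sketch assumes the figure in parentheses is the core count of each machine, which the comment does not state explicitly; the role names are taken from the list.

```python
# (role, machines, cores per machine); cores per machine is an assumed reading
PLAN = [
    ("api_appservers", 3, 40),
    ("imagescalers", 2, 40),
    ("jobrunner", 1, 40),
    ("videoscalers", 2, 32),
]


def cores_per_role(plan):
    """Total cores reassigned to each role."""
    return {role: machines * cores for role, machines, cores in plan}


totals = cores_per_role(PLAN)
for role, cores in totals.items():
    print(f"{role}: {cores} cores")
print("total:", sum(totals.values()))
```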

Feb 14 2017, 11:22 AM · Patch-For-Review, DC-Switchover-Prep-Q3-2016-17, Epic, Wikimedia-Multiple-active-datacenters, Operations

Feb 10 2017

Joe added a comment to T127976: Graphite DC fail-over / per-DC setup.

So basically either the connection is kept open on the client side and the name is never looked up again, or the applications cache dns indefinitely.

Feb 10 2017, 11:02 AM · Patch-For-Review, codfw-rollout-Jan-Mar-2016, codfw-rollout
Joe added a comment to T155098: Rework job queue usage for TimedMediaHandler (video scalers).

The prioritized queue is working well, but I'll probably raise the number of non-prioritized workers today as we're now underutilizing the systems.

Feb 10 2017, 6:54 AM · MW-1.29-release (WMF-deploy-2017-02-07_(1.29.0-wmf.11)), MW-1.29-release (WMF-deploy-2017-02-14_(1.29.0-wmf.12)), Patch-For-Review, TimedMediaHandler-Transcode

Feb 9 2017

Joe added a comment to T156009: Create an etcd cluster in codfw.

The codfw cluster is getting replicated data from eqiad under /eqiad.wmnet/conftool.

Feb 9 2017, 7:36 AM · Patch-For-Review, User-Joe, DC-Switchover-Prep-Q3-2016-17, Epic, Wikimedia-Multiple-active-datacenters, Operations
Joe added a comment to T156922: Prepare a reasonably performant warmup tool for MediaWiki caches (memcached/apc).

Another interesting possibility we might want to explore:

Feb 9 2017, 7:23 AM · User-Joe, Patch-For-Review, Performance-Team, DC-Switchover-Prep-Q3-2016-17, Epic, Wikimedia-Multiple-active-datacenters, Operations

Feb 6 2017

Joe closed T157206: ORES Overloaded (particularly 2017-02-05 02:25-02:30) as "Resolved".
Feb 6 2017, 4:42 PM · MW-1.29-release (WMF-deploy-2017-01-31_(1.29.0-wmf.10)), MW-1.29-release (WMF-deploy-2017-02-07_(1.29.0-wmf.11)), Wikimedia-Incident, Patch-For-Review, Revision-Scoring-As-A-Service, Operations, Revision-Scoring-As-A-Service-Backlog, ORES
Joe added a reverting commit for rMSCDac11ebec974e: ORES: reduce concurrency, disable various wikis: rMSCD5f932a398de5: Revert "ORES: reduce concurrency, disable various wikis".
Feb 6 2017, 3:27 PM
Joe committed rMSCD5f932a398de5: Revert "ORES: reduce concurrency, disable various wikis" (authored by Joe).
Revert "ORES: reduce concurrency, disable various wikis"
Feb 6 2017, 3:27 PM
Joe added a comment to T157206: ORES Overloaded (particularly 2017-02-05 02:25-02:30).

Looking into it better, the api user wasn't a red herring after all; I am going to ban the use of oresscores from the mw api since:

Feb 6 2017, 12:08 PM · MW-1.29-release (WMF-deploy-2017-01-31_(1.29.0-wmf.10)), MW-1.29-release (WMF-deploy-2017-02-07_(1.29.0-wmf.11)), Wikimedia-Incident, Patch-For-Review, Revision-Scoring-As-A-Service, Operations, Revision-Scoring-As-A-Service-Backlog, ORES
Joe added a comment to T157206: ORES Overloaded (particularly 2017-02-05 02:25-02:30).

scratch what I said; the counter for etwiki is most likely broken.

Feb 6 2017, 11:47 AM · MW-1.29-release (WMF-deploy-2017-01-31_(1.29.0-wmf.10)), MW-1.29-release (WMF-deploy-2017-02-07_(1.29.0-wmf.11)), Wikimedia-Incident, Patch-For-Review, Revision-Scoring-As-A-Service, Operations, Revision-Scoring-As-A-Service-Backlog, ORES
Joe added a comment to T157206: ORES Overloaded (particularly 2017-02-05 02:25-02:30).

So, graphing ores.*.scores_request.*.count shows that most requests seem to come from etwiki; I'm investigating this further. RecentChanges suggests this is not coming from any form of bot activity.

Feb 6 2017, 11:41 AM · MW-1.29-release (WMF-deploy-2017-01-31_(1.29.0-wmf.10)), MW-1.29-release (WMF-deploy-2017-02-07_(1.29.0-wmf.11)), Wikimedia-Incident, Patch-For-Review, Revision-Scoring-As-A-Service, Operations, Revision-Scoring-As-A-Service-Backlog, ORES
Joe added a comment to T157206: ORES Overloaded (particularly 2017-02-05 02:25-02:30).

From my further analysis of logs:

Feb 6 2017, 11:21 AM · MW-1.29-release (WMF-deploy-2017-01-31_(1.29.0-wmf.10)), MW-1.29-release (WMF-deploy-2017-02-07_(1.29.0-wmf.11)), Wikimedia-Incident, Patch-For-Review, Revision-Scoring-As-A-Service, Operations, Revision-Scoring-As-A-Service-Backlog, ORES
Joe committed rMSCDd917c2b92ec6: ORES: reduce concurrency, disable various wikis (authored by Joe).
ORES: reduce concurrency, disable various wikis
Feb 6 2017, 9:36 AM
Joe committed rMSCDac11ebec974e: ORES: reduce concurrency, disable various wikis (authored by Joe).
ORES: reduce concurrency, disable various wikis
Feb 6 2017, 9:36 AM
Joe added a comment to T157206: ORES Overloaded (particularly 2017-02-05 02:25-02:30).

So after taking a quick look at ORES's logs: around 70% of requests come from changepropagation for "precaching". Also

Feb 6 2017, 7:45 AM · MW-1.29-release (WMF-deploy-2017-01-31_(1.29.0-wmf.10)), MW-1.29-release (WMF-deploy-2017-02-07_(1.29.0-wmf.11)), Wikimedia-Incident, Patch-For-Review, Revision-Scoring-As-A-Service, Operations, Revision-Scoring-As-A-Service-Backlog, ORES
Joe added a comment to T157206: ORES Overloaded (particularly 2017-02-05 02:25-02:30).

Before raising the number of workers for ORES:

Feb 6 2017, 7:24 AM · MW-1.29-release (WMF-deploy-2017-01-31_(1.29.0-wmf.10)), MW-1.29-release (WMF-deploy-2017-02-07_(1.29.0-wmf.11)), Wikimedia-Incident, Patch-For-Review, Revision-Scoring-As-A-Service, Operations, Revision-Scoring-As-A-Service-Backlog, ORES

Feb 2 2017

Joe added a comment to T156922: Prepare a reasonably performant warmup tool for MediaWiki caches (memcached/apc).

Correct me if I'm wrong, but I think the Main page call can be skipped for all machines that do not serve standard wiki traffic (API, image/video scalers). Also, do we really need to warm up APC for all of the wikis, or could we target only the ones doing 99% of the traffic (which I guess are far fewer than all of them)?

Feb 2 2017, 11:23 AM · User-Joe, Patch-For-Review, Performance-Team, DC-Switchover-Prep-Q3-2016-17, Epic, Wikimedia-Multiple-active-datacenters, Operations

Feb 1 2017

Joe added a comment to T125069: Create a service location / discovery system for locating local/master resources easily across all WMF applications.

Duplicate of T149617

Feb 1 2017, 5:26 PM · Services (next), User-Joe, Services-next, User-mobrovac, Operations, codfw-rollout, codfw-rollout-Jan-Mar-2016
Joe closed T125069: Create a service location / discovery system for locating local/master resources easily across all WMF applications as "Resolved".
Feb 1 2017, 5:25 PM · Services (next), User-Joe, Services-next, User-mobrovac, Operations, codfw-rollout, codfw-rollout-Jan-Mar-2016
Joe created T156924: Allow integration of data from etcd into the MediaWiki configuration.
Feb 1 2017, 4:45 PM · MediaWiki-Platform-Team, Patch-For-Review, Wikimedia-Multiple-active-datacenters, Services (watching), Performance-Team, discovery-system, User-Joe, User-mobrovac, MediaWiki-Configuration, Operations, Wikimedia-Developer-Summit (2017)
Joe created T156922: Prepare a reasonably performant warmup tool for MediaWiki caches (memcached/apc).
Feb 1 2017, 4:11 PM · User-Joe, Patch-For-Review, Performance-Team, DC-Switchover-Prep-Q3-2016-17, Epic, Wikimedia-Multiple-active-datacenters, Operations

Jan 30 2017

Joe added a comment to T156009: Create an etcd cluster in codfw.

The cluster in codfw is installed and tested to work correctly with conftool. The performance of the cluster using nginx as a TLS/auth proxy seems to be much better, too.

Jan 30 2017, 11:25 AM · Patch-For-Review, User-Joe, DC-Switchover-Prep-Q3-2016-17, Epic, Wikimedia-Multiple-active-datacenters, Operations
Joe added a comment to T149617: Integrating MediaWiki (and other services) with dynamic configuration.

Yes, I am just unsure how, or to whom, I can attribute the template design. That's what is blocking me at the moment.

Jan 30 2017, 8:02 AM · Patch-For-Review, Wikimedia-Multiple-active-datacenters, Services (watching), Performance-Team, discovery-system, User-Joe, User-mobrovac, MediaWiki-Configuration, Operations, Wikimedia-Developer-Summit (2017)
Joe closed T149408: Asynchronous processing in production: one queue to rule them all as "Resolved".
Jan 30 2017, 7:47 AM · Analytics, User-mobrovac, EventBus, Services (watching), Performance-Team, MediaWiki-JobQueue, ChangeProp, Operations, Wikimedia-Developer-Summit (2017)
Joe closed T149408: Asynchronous processing in production: one queue to rule them all, a subtask of T147937: Facilitate Wikidev'17 main topic "How to manage our technical debt", as "Resolved".
Jan 30 2017, 7:46 AM · User-greg, Release-Engineering-Team, Wikimedia-Developer-Summit
Joe added a comment to T149408: Asynchronous processing in production: one queue to rule them all.

https://commons.wikimedia.org/wiki/File:Asynchronous_processing_on_the_WMF_cluster.pdf is the uploaded file.

Jan 30 2017, 7:40 AM · Analytics, User-mobrovac, EventBus, Services (watching), Performance-Team, MediaWiki-JobQueue, ChangeProp, Operations, Wikimedia-Developer-Summit (2017)
Joe added a comment to T149617: Integrating MediaWiki (and other services) with dynamic configuration.
Jan 30 2017, 7:03 AM · Patch-For-Review, Wikimedia-Multiple-active-datacenters, Services (watching), Performance-Team, discovery-system, User-Joe, User-mobrovac, MediaWiki-Configuration, Operations, Wikimedia-Developer-Summit (2017)

Jan 27 2017

Joe added a comment to T149408: Asynchronous processing in production: one queue to rule them all.

Slides for starting the discussion are available here: https://docs.google.com/presentation/d/1DCofLYbP1dWnTb1JWNNnsb0Zp_da8sBhDzlwjCXRoq8/edit?usp=sharing

I'll upload those to commons after the Developer Summit.

Did this happen? If not, could you please do so? :)

I moved the notes to https://www.mediawiki.org/wiki/Wikimedia_Developer_Summit/2017/Asynchronous_processing

Jan 27 2017, 11:32 AM · Analytics, User-mobrovac, EventBus, Services (watching), Performance-Team, MediaWiki-JobQueue, ChangeProp, Operations, Wikimedia-Developer-Summit (2017)

Jan 26 2017

Joe updated subscribers of T156356: Pages with an &stable=1 in their URL could not be viewed or edited.

@hashar rolled back to wmf.8 and I can confirm the pages I was looking at now render correctly.

Jan 26 2017, 11:13 AM · Patch-For-Review, Release-Engineering-Team, Operations
Joe added projects to T156356: Pages with an &stable=1 in their URL could not be viewed or edited: MediaWiki-Releasing, Release-Engineering-Team.
Jan 26 2017, 10:43 AM · Patch-For-Review, Release-Engineering-Team, Operations
Joe added a comment to T156356: Pages with an &stable=1 in their URL could not be viewed or edited.

I can reproduce the problem. Any idea since when this has been happening?

It obviously isn't broken in wmf.8, otherwise de.wikipedia would be broken, too, which would not go unnoticed. So, it's a bug in wmf.9.

Jan 26 2017, 10:41 AM · Patch-For-Review, Release-Engineering-Team, Operations
Joe added a comment to T156356: Pages with an &stable=1 in their URL could not be viewed or edited.

The error is the following:

Jan 26 2017, 10:41 AM · Patch-For-Review, Release-Engineering-Team, Operations
Joe added a comment to T156356: Pages with an &stable=1 in their URL could not be viewed or edited.

I can reproduce the problem. Any idea since when this has been happening?

Jan 26 2017, 10:33 AM · Patch-For-Review, Release-Engineering-Team, Operations

Jan 24 2017

Joe claimed T156009: Create an etcd cluster in codfw.
Jan 24 2017, 8:36 AM · Patch-For-Review, User-Joe, DC-Switchover-Prep-Q3-2016-17, Epic, Wikimedia-Multiple-active-datacenters, Operations
Joe moved T125069: Create a service location / discovery system for locating local/master resources easily across all WMF applications from Doing to Blocked on others on the User-Joe board.
Jan 24 2017, 8:35 AM · Services (next), User-Joe, Services-next, User-mobrovac, Operations, codfw-rollout, codfw-rollout-Jan-Mar-2016
Joe closed T147402: Investigate ways to deploy docker to production as "Resolved".
Jan 24 2017, 8:35 AM · Kubernetes-production-experiment, Prod-Kubernetes, User-Joe, Operations