Joe (Giuseppe Lavagetto)
Spy


User Details

User Since
Oct 3 2014, 5:57 AM (137 w, 6 d)
Availability
Available
LDAP User
Giuseppe Lavagetto
MediaWiki User
Unknown

Recent Activity

Wed, May 24

Joe added a comment to T165519: rack and setup mw1307-1348 .

@Cmjohnson I suggest we do the following:

Wed, May 24, 2:18 PM · User-Elukey, User-Joe, Operations, ops-eqiad
Joe added a comment to T165519: rack and setup mw1307-1348 .

Another option is not to care much how the current distribution goes but to just evenly distribute servers across rows, and then go on and rebalance the whole cluster.
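
The "evenly distribute servers across rows" option can be sketched as a plain round-robin assignment. This is only an illustration: the four row names and the `distribute` helper are hypothetical, and the actual racking constraints from the task are not modelled.

```python
from collections import defaultdict
from itertools import cycle


def distribute(hosts, rows):
    """Assign hosts to rows round-robin so row sizes differ by at most one."""
    placement = defaultdict(list)
    for host, row in zip(hosts, cycle(rows)):
        placement[row].append(host)
    return dict(placement)


# mw1307-mw1348 is the host range named in the task title (42 hosts).
hosts = [f"mw{n}" for n in range(1307, 1349)]
# Hypothetical row names, for illustration only.
rows = ["A", "B", "C", "D"]

placement = distribute(hosts, rows)
for row in rows:
    print(row, len(placement[row]))
```

Rebalancing the whole cluster afterwards would be a separate step that this sketch does not attempt.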

Wed, May 24, 9:45 AM · User-Elukey, User-Joe, Operations, ops-eqiad
Joe added a comment to T165519: rack and setup mw1307-1348 .

Here is my proposal regarding these systems:

Wed, May 24, 9:39 AM · User-Elukey, User-Joe, Operations, ops-eqiad
Joe added a comment to T147718: RFC: New puppet code organization paradigm/coding standards.

I have a question about the new profile guidelines:

Profile classes should only have parameters that default to an explicit hiera call with no fallback value.

Why no fallback defaults?

Wed, May 24, 5:57 AM · Patch-For-Review, RfC, Puppet, Operations
Joe added a comment to T147718: RFC: New puppet code organization paradigm/coding standards.

Also, if configuration of profiles can only be done via hiera, doesn't that mean any module parameter that we may want to override needs to be specified as a profile parameter?

Wed, May 24, 5:52 AM · Patch-For-Review, RfC, Puppet, Operations

Tue, May 23

Joe added a comment to T166081: rack/setup/install conf1004-conf1006.

Racking request is just that these new machines go in different rows. They can even be in the racks of the other conf* systems as those old systems will be eventually decommissioned.

Tue, May 23, 3:13 PM · ops-eqiad, User-Joe, Operations

Mon, May 22

Joe moved T147204: Update confd package from Blocking others to Backlog on the User-Joe board.
Mon, May 22, 4:43 PM · User-Joe, Beta-Cluster-reproducible, Operations
Joe moved T165519: rack and setup mw1307-1348 from Backlog to Blocking others on the User-Joe board.
Mon, May 22, 4:43 PM · User-Elukey, User-Joe, Operations, ops-eqiad
Joe added a project to T165519: rack and setup mw1307-1348 : User-Joe.
Mon, May 22, 4:43 PM · User-Elukey, User-Joe, Operations, ops-eqiad
Joe added projects to T166066: Integrate the puppet compiler in the puppet CI pipeline: Operations, Puppet.
Mon, May 22, 3:44 PM · Puppet, Operations
Joe created T166066: Integrate the puppet compiler in the puppet CI pipeline.
Mon, May 22, 3:44 PM · Puppet, Operations

Sun, May 21

Joe added a comment to T163337: Job queue corruption after codfw switch over (Queue worth, duplicate runs).

@aaron another interesting open bug that might be worth reviewing: https://github.com/antirez/redis/issues/1525 ("EVAL replicated + conditionals about key existence = replication bug.")

It relies on TTLs, though, which we don't use, so it probably won't resolve much, but it is good to keep as a reference.

Sun, May 21, 9:37 AM · Patch-For-Review, Wikimedia-Incident, MediaWiki-JobQueue, Scoring-platform-team, User-Ladsgroup, MediaWiki-extensions-ORES, ORES, MediaWiki-Watchlist, codfw-rollout
Joe added a comment to T156199: Low-latency current revision storage.

Without going deeper into the requirements that this ticket assumes to be true (I'm not sure all of them are justified, but that's another topic), I would say that the "application-level TTLs" option seems the best way to go, for a few reasons:

Sun, May 21, 9:14 AM · User-mobrovac, Cassandra, Services (designing), Wikimedia-Incident, RfC, RESTBase

Mon, May 15

Joe renamed T165024: Upgrade calico to 2.2, document build process. from Upgrade calico to 2.1, document build process. to Upgrade calico to 2.2, document build process..
Mon, May 15, 9:51 AM · Patch-For-Review, User-Joe, kubernetes, Goal, Operations
Joe moved T165024: Upgrade calico to 2.2, document build process. from Backlog to Doing on the User-Joe board.
Mon, May 15, 9:49 AM · Patch-For-Review, User-Joe, kubernetes, Goal, Operations
Joe moved T159687: etcd switchover/enhancements from Doing to Backlog on the User-Joe board.
Mon, May 15, 9:49 AM · Patch-For-Review, User-Joe, Operations

Fri, May 12

Joe added a comment to T125735: Warning: timed out after 0.2 seconds when connecting to rdb1001.eqiad.wmnet [110]: Connection timed out.

After deploying https://gerrit.wikimedia.org/r/353247 to labs I have observed no relevant changes to TCP metrics of any of the following:

  • deployment-jobrunner02
  • deployment-mediawiki05
  • deployment-redis01

    From prometheus-beta it is easy to select a graph and plot it. As an example, here's the list of ESTABLISHED TCP connections for the hosts mentioned above:
  • deployment-jobrunner02
  • mediawiki05
  • deployment-redis01

    After a chat with @hashar it seems that the RedisConnectionPool.php class (which offers persistent connections) is used only by mediawiki, not by the jobrunner service (which instantiates Redis() without any trace of pconnect).

    I tried to live hack on deployment-prep.mediawiki05 jobqueue-beta.php changing 'persistent' => defined( 'MEDIAWIKI_JOB_RUNNER' ) to 1 but nothing has really changed.

    My understanding of the job queues is still very very bad, so I have these (possibly dumb) questions for @Krinkle or @aaron:
  • Do you have any idea about what I am doing wrong with https://gerrit.wikimedia.org/r/#/c/353247 ? Is my understanding correct about the fact that this setting is only for mediawiki app/api servers and not jobrunners?
  • Should we patch the jobrunner service to offer persistent connections to Redis too?

    Sorry for pushing this but in my opinion we'd need to come up with a good plan forward, let me know your thoughts.
Fri, May 12, 1:08 PM · Patch-For-Review, User-Elukey, Operations, Wikimedia-log-errors

Thu, May 11

Joe added a comment to T165024: Upgrade calico to 2.2, document build process..

I am re-doing our calico-containers repository from scratch, importing a version from upstream and managing the now-minimal changes to the Dockerfiles with quilt. This will make it easier to build calicoctl (the debian package) properly.

Thu, May 11, 11:18 AM · Patch-For-Review, User-Joe, kubernetes, Goal, Operations
Joe updated the task description for T165024: Upgrade calico to 2.2, document build process..
Thu, May 11, 11:08 AM · Patch-For-Review, User-Joe, kubernetes, Goal, Operations
Joe created T165024: Upgrade calico to 2.2, document build process..
Thu, May 11, 10:03 AM · Patch-For-Review, User-Joe, kubernetes, Goal, Operations
Joe closed T163565: Install conftool on deployment masters, a subtask of T104352: Make scap able to depool/repool servers via the conftool API, as Resolved.
Thu, May 11, 6:25 AM · releng-201617-q4, Scap (Scap3-MediaWiki-MVP), scap2, Operations, HHVM, Performance-Team
Joe closed T163565: Install conftool on deployment masters as Resolved.
Thu, May 11, 6:25 AM · User-Joe, Patch-For-Review, Operations, Scap (Scap3-MediaWiki-MVP), Deployment-Systems
Joe closed T163565: Install conftool on deployment masters, a subtask of T125629: Depool proxies temporarily while scap is ongoing to avoid taxing those nodes, as Resolved.
Thu, May 11, 6:25 AM · Scap (Scap3-MediaWiki-MVP), scap2, Operations
Joe added a comment to T163565: Install conftool on deployment masters.

@Joe: That all seems reasonable. I don't particularly want to duplicate logic in scap unless it makes the most sense for that logic to live in scap.

This task is mostly concerned with implementing depooling for mediawiki deployments, but the functionality in scap could handle the process with other services as well. If this isn't the right approach then I think we could use some guidance from you on how to get this working.

Thu, May 11, 6:05 AM · User-Joe, Patch-For-Review, Operations, Scap (Scap3-MediaWiki-MVP), Deployment-Systems
Joe requested changes to D600: Create a wrapper around conftool for our pooling/depooling needs.

I think the basic idea for the patch is good, but the implementation can be improved, as it is not currently doing what it's intended to do.

Thu, May 11, 6:01 AM · Release-Engineering-Team

Tue, May 9

Joe added a comment to T164793: Parsoid deploy failed.

So, after some digging, I found out that conf2002.codfw.wmnet had auth enabled on etcd for some reason (while we now just proxy through nginx) and only had the root user available. The most probable cause is me doing something wrong when disabling auth in eqiad during the conversion of that cluster.

Tue, May 9, 3:09 PM · User-Joe, Scap, Parsoid
Joe claimed T164793: Parsoid deploy failed.
Tue, May 9, 2:23 PM · User-Joe, Scap, Parsoid

Wed, May 3

Joe added a comment to T164376: [Discuss] Split ORES scores in datacenters based on wiki.

Yes, my only doubt with this proposal is exactly that we want to be active/active but still be able to serve all the traffic from a single datacenter.

Wed, May 3, 2:54 PM · Traffic, Scoring-platform-team-Backlog, ORES, Operations, ChangeProp
Joe added a comment to T164177: switchdc: Improve wgReadOnly message.

The manual change + commit + deploy of the MW configuration might actually not be needed anymore, it depends on T163398. If that change lands in production before the switchback the related tasks in Switchdc will be updated to use conftool to change those values, hence that hardcoded part will go away anyway.

I thought this is to be removed at some point, didn't notice we're already on it. Great! So yeah, that should be the preferred way of course.

Wed, May 3, 9:31 AM · Patch-For-Review, Operations, codfw-rollout, Operations-Software-Development
Joe merged T164287: Degraded RAID on restbase1018 into T163280: Degraded RAID on restbase1018.
Wed, May 3, 9:02 AM · Operations
Joe merged task T164287: Degraded RAID on restbase1018 into T163280: Degraded RAID on restbase1018.
Wed, May 3, 9:02 AM · ops-eqiad, Operations
Joe merged T164342: Degraded RAID on restbase1018 into T163280: Degraded RAID on restbase1018.
Wed, May 3, 9:01 AM · Operations
Joe merged task T164342: Degraded RAID on restbase1018 into T163280: Degraded RAID on restbase1018.
Wed, May 3, 9:01 AM · ops-eqiad, Operations
Joe closed T164202: Degraded RAID on restbase1018 as Resolved.
Wed, May 3, 8:59 AM · ops-eqiad, Operations
Joe added a comment to T163292: Failed disk / degraded RAID arrays: restbase1018.eqiad.wmnet .

I just recreated the RAID arrays and rebooted the system with the new disk in place. @Eevans I'll let you re-start puppet and attend to cassandra. Of course, the data in /srv are gone for good.

Wed, May 3, 8:58 AM · Patch-For-Review, Operations, Cassandra, Services (doing)

Tue, May 2

Joe added a comment to T159687: etcd switchover/enhancements.

I converted the etcd cluster in eqiad to use nginx for auth/TLS, moved to ecdsa certs with the correct SANs, and started replication codfw => eqiad.

Tue, May 2, 7:58 AM · Patch-For-Review, User-Joe, Operations

Mon, May 1

Tbayer awarded T163438: VisualEditor broken on wikitech when codfw is primary: "Error loading data from server: apierror-visualeditor-docserver-http: HTTP 500." a Evil Spooky Haunted Tree token.
Mon, May 1, 5:02 AM · codfw-rollout, wikitech.wikimedia.org, Labs

Sat, Apr 29

Joe added a comment to T163337: Job queue corruption after codfw switch over (Queue worth, duplicate runs).

Note that the only keys that use a TTL are the root job de-duplication hash/timestamp keys. Everything else is normally just directly deleted (lpop,hdel,...) via Lua on ack() or by the JobChron Lua script (which uses similar commands). None of the main data structures have native redis expiration (which uses the special master deletion logic).

Sat, Apr 29, 11:25 AM · Patch-For-Review, Wikimedia-Incident, MediaWiki-JobQueue, Scoring-platform-team, User-Ladsgroup, MediaWiki-extensions-ORES, ORES, MediaWiki-Watchlist, codfw-rollout
Joe added a comment to T163337: Job queue corruption after codfw switch over (Queue worth, duplicate runs).

An additional case I'm going to study in more detail:

Sat, Apr 29, 7:42 AM · Patch-For-Review, Wikimedia-Incident, MediaWiki-JobQueue, Scoring-platform-team, User-Ladsgroup, MediaWiki-extensions-ORES, ORES, MediaWiki-Watchlist, codfw-rollout

Fri, Apr 28

Joe added a project to T163565: Install conftool on deployment masters: User-Joe.
Fri, Apr 28, 5:32 AM · User-Joe, Patch-For-Review, Operations, Scap (Scap3-MediaWiki-MVP), Deployment-Systems

Thu, Apr 27

Joe added a comment to T163337: Job queue corruption after codfw switch over (Queue worth, duplicate runs).

Just to err on the side of caution, I reviewed all the code of JobQueueRedis and of the JobChron, and I found no obvious parts of our Lua scripts that could cause replication to break, like non-deterministic statements.
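
To illustrate why non-determinism matters here: if a script is replicated by re-running it on the replica rather than by shipping its effects, any statement whose result depends on local randomness or time can leave the two sides with different data. The toy model below uses plain Python dictionaries as stand-ins for the master and the replica; the function names are hypothetical and nothing in it talks to a real Redis.

```python
import random


def apply_script(store, script, rng):
    """Run a 'script' against a key/value store, as a server would."""
    script(store, rng)


def deterministic(store, rng):
    # Same input state always yields the same output state.
    store["counter"] = store.get("counter", 0) + 1


def non_deterministic(store, rng):
    # The outcome depends on a random draw, so two servers can disagree.
    store["picked"] = rng.randrange(1_000_000)


master, replica = {}, {}
master_rng = random.Random(1)   # each server has its own source of randomness
replica_rng = random.Random(2)

# Replicate by re-executing the same scripts on both sides.
for script in (deterministic, non_deterministic):
    apply_script(master, script, master_rng)
    apply_script(replica, script, replica_rng)

print("counter matches:", master["counter"] == replica["counter"])
print("picked matches:", master["picked"] == replica["picked"])
```

The deterministic script always leaves both stores in agreement; the random one gives no such guarantee.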

Thu, Apr 27, 10:50 AM · Patch-For-Review, Wikimedia-Incident, MediaWiki-JobQueue, Scoring-platform-team, User-Ladsgroup, MediaWiki-extensions-ORES, ORES, MediaWiki-Watchlist, codfw-rollout
Joe added a comment to T163337: Job queue corruption after codfw switch over (Queue worth, duplicate runs).

Also let me add a few remarks on the redis replication:

Thu, Apr 27, 9:59 AM · Patch-For-Review, Wikimedia-Incident, MediaWiki-JobQueue, Scoring-platform-team, User-Ladsgroup, MediaWiki-extensions-ORES, ORES, MediaWiki-Watchlist, codfw-rollout
Joe added a comment to T163337: Job queue corruption after codfw switch over (Queue worth, duplicate runs).

So, I just re-ran showJobs in eqiad and codfw and there were big discrepancies, as expected

--- wasat.before	2017-04-27 11:45:20.725345007 +0200
+++ terbium.before	2017-04-27 11:45:31.189344455 +0200
@@ -1,15 +1,14 @@
-categoryMembershipChange: 3 queued; 52 claimed (3 active, 49 abandoned); 0 delayed
-cdnPurge: 0 queued; 0 claimed (0 active, 0 abandoned); 32 delayed
+categoryMembershipChange: 3398 queued; 30062 claimed (406 active, 29656 abandoned); 0 delayed
+cdnPurge: 32648 queued; 57664 claimed (959 active, 56705 abandoned); 21 delayed
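
A comparison like the one above can also be done programmatically. The sketch below parses lines in the format shown in the diff and reports the per-type gap in queued jobs; the regular expression is inferred from those four lines only, and the `parse` and `queued_gap` helpers are hypothetical, not part of showJobs.

```python
import re

# Line format inferred from the diff excerpt above.
LINE_RE = re.compile(
    r"^(?P<type>\w+): (?P<queued>\d+) queued; (?P<claimed>\d+) claimed "
    r"\((?P<active>\d+) active, (?P<abandoned>\d+) abandoned\); "
    r"(?P<delayed>\d+) delayed$"
)


def parse(output):
    """Turn showJobs-style text into {job_type: {field: count}}."""
    stats = {}
    for line in output.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            fields = match.groupdict()
            job_type = fields.pop("type")
            stats[job_type] = {name: int(value) for name, value in fields.items()}
    return stats


def queued_gap(a, b):
    """Queued count in b minus queued count in a, for job types in both."""
    return {job: b[job]["queued"] - a[job]["queued"] for job in sorted(a.keys() & b.keys())}


wasat = """\
categoryMembershipChange: 3 queued; 52 claimed (3 active, 49 abandoned); 0 delayed
cdnPurge: 0 queued; 0 claimed (0 active, 0 abandoned); 32 delayed
"""
terbium = """\
categoryMembershipChange: 3398 queued; 30062 claimed (406 active, 29656 abandoned); 0 delayed
cdnPurge: 32648 queued; 57664 claimed (959 active, 56705 abandoned); 21 delayed
"""

print(queued_gap(parse(wasat), parse(terbium)))
```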
Thu, Apr 27, 9:56 AM · Patch-For-Review, Wikimedia-Incident, MediaWiki-JobQueue, Scoring-platform-team, User-Ladsgroup, MediaWiki-extensions-ORES, ORES, MediaWiki-Watchlist, codfw-rollout
Joe added a comment to T163337: Job queue corruption after codfw switch over (Queue worth, duplicate runs).

On April 12 a (supposedly harmless) test was performed in codfw using the switchdc script. It's probably not a coincidence then that the jobs we saw running unexpectedly today relate to events from that day. Presumably something happened there that led to jobs being set to some state somewhere and being triggered a second time today.

I just thoroughly reviewed the switchdc logs from April 12 and no action was taken on eqiad's redises or jobrunners. So I think this is a red herring.

That's exactly why I think it isn't a red herring. It ran on codfw, where the jobqueue redis is a replica of eqiad's. It's conceivable to me that these tests started something that caused jobs after that point to be marked in a certain way (in codfw). Then on April 19, the minute the step "Stop MediaWiki maintenance in the old master DC" was run, this marking stopped - presumably whatever state was left behind was now normalised. Then a few minutes later, when the job queue was enabled in codfw, all jobs previously "marked" by codfw started running again. (For the first time, from codfw's perspective)

Thu, Apr 27, 9:41 AM · Patch-For-Review, Wikimedia-Incident, MediaWiki-JobQueue, Scoring-platform-team, User-Ladsgroup, MediaWiki-extensions-ORES, ORES, MediaWiki-Watchlist, codfw-rollout
Joe added a comment to T163337: Job queue corruption after codfw switch over (Queue worth, duplicate runs).

I finally managed to find some differences in one redis replication set, and those are all related to big data structures like enwiki:jobqueue:refreshLinks:l-unclaimed and enwiki:jobqueue:refreshLinks:z-claimed or enwiki:jobqueue:refreshLinks:h-idBySha1.

Thu, Apr 27, 9:21 AM · Patch-For-Review, Wikimedia-Incident, MediaWiki-JobQueue, Scoring-platform-team, User-Ladsgroup, MediaWiki-extensions-ORES, ORES, MediaWiki-Watchlist, codfw-rollout
Joe added a comment to T163337: Job queue corruption after codfw switch over (Queue worth, duplicate runs).

Ok so, I started using a (slightly modified) version of the script presented here:

Thu, Apr 27, 6:54 AM · Patch-For-Review, Wikimedia-Incident, MediaWiki-JobQueue, Scoring-platform-team, User-Ladsgroup, MediaWiki-extensions-ORES, ORES, MediaWiki-Watchlist, codfw-rollout
Joe added a comment to T163337: Job queue corruption after codfw switch over (Queue worth, duplicate runs).

I don't see anything weird going on from the Redis replication point of view (except for 2004's TTLs, but that could be a red herring). Maybe terbium's view of the queues is not correct?

Thu, Apr 27, 5:36 AM · Patch-For-Review, Wikimedia-Incident, MediaWiki-JobQueue, Scoring-platform-team, User-Ladsgroup, MediaWiki-extensions-ORES, ORES, MediaWiki-Watchlist, codfw-rollout

Wed, Apr 26

Joe updated subscribers of T148506: Rack and setup new eqiad row D switch stack (EX4300/QFX5100).

For the record, @akosiaris and I switched etcd client traffic to codfw to allow relocating conf1003 with ample time.

Wed, Apr 26, 9:31 AM · Patch-For-Review, Operations, ops-eqiad, netops
Joe added a comment to T159687: etcd switchover/enhancements.

All clients have been successfully switched to codfw, and replication has been stopped; I tested depooling and pooling back a client (to test again that nginx-based auth works) and everything seems working flawlessly for now.

Wed, Apr 26, 7:00 AM · Patch-For-Review, User-Joe, Operations
Joe updated the task description for T159687: etcd switchover/enhancements.
Wed, Apr 26, 6:58 AM · Patch-For-Review, User-Joe, Operations
Joe added a comment to T163565: Install conftool on deployment masters.

I don't think scap should interact with conftool by itself, unless it reproduces what the restart-<service> scripts are doing right now, which is:

Wed, Apr 26, 6:53 AM · User-Joe, Patch-For-Review, Operations, Scap (Scap3-MediaWiki-MVP), Deployment-Systems
Joe updated the task description for T159687: etcd switchover/enhancements.
Wed, Apr 26, 6:35 AM · Patch-For-Review, User-Joe, Operations

Apr 21 2017

Joe added a comment to T159687: etcd switchover/enhancements.

I've set up the replica and prepared changes for most next steps. When I'm back on Wednesday morning, we can decide if we want to failover to the new cluster directly or just do it in case something bad happens with the network maintenance and the eqiad cluster, and perform the switchover at a later date.

Apr 21 2017, 6:20 PM · Patch-For-Review, User-Joe, Operations
Joe updated the task description for T159687: etcd switchover/enhancements.
Apr 21 2017, 6:18 PM · Patch-For-Review, User-Joe, Operations
Joe added a comment to T156924: Allow integration of data from etcd into the MediaWiki configuration.

To summarize, I think it is possible to test EtcdConfig in beta at this point with the limited deployment I made.

Apr 21 2017, 6:03 PM · Availability (Multiple-active-datacenters), MW-1.30-release-notes (WMF-deploy-2017-05-09_(1.30.0-wmf.1)), MediaWiki-Platform-Team, Patch-For-Review, Services (watching), Performance-Team, discovery-system, User-Joe, User-mobrovac, Operations
Joe added a comment to T156924: Allow integration of data from etcd into the MediaWiki configuration.

I managed to get a bare-bones working installation of conftool in deployment-prep.

Apr 21 2017, 6:01 PM · Availability (Multiple-active-datacenters), MW-1.30-release-notes (WMF-deploy-2017-05-09_(1.30.0-wmf.1)), MediaWiki-Platform-Team, Patch-For-Review, Services (watching), Performance-Team, discovery-system, User-Joe, User-mobrovac, Operations
Joe moved T156924: Allow integration of data from etcd into the MediaWiki configuration from Blocking others to Doing on the User-Joe board.
Apr 21 2017, 5:15 PM · Availability (Multiple-active-datacenters), MW-1.30-release-notes (WMF-deploy-2017-05-09_(1.30.0-wmf.1)), MediaWiki-Platform-Team, Patch-For-Review, Services (watching), Performance-Team, discovery-system, User-Joe, User-mobrovac, Operations
Joe added a comment to T163418: jobqueue is full of refreshlinks duplicates after the switchover..

@GWicke I have seen the same job being re-executed multiple times (after succeeding) when I ran runJobs.php from the command-line to remove some pressure; I'm sure more cases can be found in the logs.

Apr 21 2017, 9:27 AM · User-Joe, codfw-rollout, Patch-For-Review, MediaWiki-JobRunner, MediaWiki-JobQueue, Operations
Joe added a comment to T163337: Job queue corruption after codfw switch over (Queue worth, duplicate runs).

While it's possible the underlying cause for this bug is specific to ORES (T163337#3196493 mentions a few log entries relating to ORES precaching on April 12), it is also possible that this was a problem with the Job Queue in general. To me that seems more likely, especially given T163418: jobqueue is full of refreshlinks duplicates after the switchover..

Apr 21 2017, 4:29 AM · Patch-For-Review, Wikimedia-Incident, MediaWiki-JobQueue, Scoring-platform-team, User-Ladsgroup, MediaWiki-extensions-ORES, ORES, MediaWiki-Watchlist, codfw-rollout
Joe added a comment to T163337: Job queue corruption after codfw switch over (Queue worth, duplicate runs).
Apr 21 2017, 4:25 AM · Patch-For-Review, Wikimedia-Incident, MediaWiki-JobQueue, Scoring-platform-team, User-Ladsgroup, MediaWiki-extensions-ORES, ORES, MediaWiki-Watchlist, codfw-rollout

Apr 20 2017

Joe triaged T163438: VisualEditor broken on wikitech when codfw is primary: "Error loading data from server: apierror-visualeditor-docserver-http: HTTP 500." as High priority.
Apr 20 2017, 12:38 PM · codfw-rollout, wikitech.wikimedia.org, Labs
Joe created T163438: VisualEditor broken on wikitech when codfw is primary: "Error loading data from server: apierror-visualeditor-docserver-http: HTTP 500.".
Apr 20 2017, 12:38 PM · codfw-rollout, wikitech.wikimedia.org, Labs
Joe lowered the priority of T163418: jobqueue is full of refreshlinks duplicates after the switchover. from Unbreak Now! to High.
Apr 20 2017, 12:04 PM · User-Joe, codfw-rollout, Patch-For-Review, MediaWiki-JobRunner, MediaWiki-JobQueue, Operations
elukey awarded T163418: jobqueue is full of refreshlinks duplicates after the switchover. a Heartbreak token.
Apr 20 2017, 11:47 AM · User-Joe, codfw-rollout, Patch-For-Review, MediaWiki-JobRunner, MediaWiki-JobQueue, Operations
Joe added a comment to T163418: jobqueue is full of refreshlinks duplicates after the switchover..

The queue is down to 250K jobs, and I am confident all the old refreshlinks jobs have been removed. I'm leaving the ticket open at lower priority as I need to still take a look at this.

Apr 20 2017, 11:45 AM · User-Joe, codfw-rollout, Patch-For-Review, MediaWiki-JobRunner, MediaWiki-JobQueue, Operations
Joe added a comment to T163418: jobqueue is full of refreshlinks duplicates after the switchover..

FTR, the queue is dropping fast, as is the number of processed jobs. I'll de-deploy my hack as soon as I'm confident I killed all the rogue refreshlinks jobs.

Apr 20 2017, 10:52 AM · User-Joe, codfw-rollout, Patch-For-Review, MediaWiki-JobRunner, MediaWiki-JobQueue, Operations
Joe moved T159687: etcd switchover/enhancements from Backlog to Doing on the User-Joe board.
Apr 20 2017, 10:50 AM · Patch-For-Review, User-Joe, Operations
Joe moved T156924: Allow integration of data from etcd into the MediaWiki configuration from Doing to Blocking others on the User-Joe board.
Apr 20 2017, 10:50 AM · Availability (Multiple-active-datacenters), MW-1.30-release-notes (WMF-deploy-2017-05-09_(1.30.0-wmf.1)), MediaWiki-Platform-Team, Patch-For-Review, Services (watching), Performance-Team, discovery-system, User-Joe, User-mobrovac, Operations
Joe moved T163418: jobqueue is full of refreshlinks duplicates after the switchover. from Backlog to Doing on the User-Joe board.
Apr 20 2017, 10:50 AM · User-Joe, codfw-rollout, Patch-For-Review, MediaWiki-JobRunner, MediaWiki-JobQueue, Operations
Joe moved T149617: Integrating MediaWiki (and other services) with dynamic configuration from Doing to Blocking others on the User-Joe board.
Apr 20 2017, 10:50 AM · Availability (Multiple-active-datacenters), Patch-For-Review, Services (watching), Performance-Team, discovery-system, User-Joe, User-mobrovac, MediaWiki-Configuration, Operations, Wikimedia-Developer-Summit (2017)
Joe updated the task description for T159687: etcd switchover/enhancements.
Apr 20 2017, 10:47 AM · Patch-For-Review, User-Joe, Operations
Joe added a project to T163418: jobqueue is full of refreshlinks duplicates after the switchover.: User-Joe.
Apr 20 2017, 10:24 AM · User-Joe, codfw-rollout, Patch-For-Review, MediaWiki-JobRunner, MediaWiki-JobQueue, Operations
Joe added a comment to T163418: jobqueue is full of refreshlinks duplicates after the switchover..

I'm testing this live hack:

Apr 20 2017, 9:15 AM · User-Joe, codfw-rollout, Patch-For-Review, MediaWiki-JobRunner, MediaWiki-JobQueue, Operations
Joe added a comment to T73853: Retry counts not working / jobs re-executed beyond retry limits.

Raised the priority as this keeps biting us again and again, see T163418

Apr 20 2017, 8:43 AM · WMF-deploy-2015-08-25_(1.26wmf20), WMF-deploy-2015-07-28_(1.26wmf16), Patch-For-Review, MediaWiki-JobQueue
Joe raised the priority of T73853: Retry counts not working / jobs re-executed beyond retry limits from Normal to High.
Apr 20 2017, 8:42 AM · WMF-deploy-2015-08-25_(1.26wmf20), WMF-deploy-2015-07-28_(1.26wmf16), Patch-For-Review, MediaWiki-JobQueue
Joe updated the task description for T163418: jobqueue is full of refreshlinks duplicates after the switchover..
Apr 20 2017, 8:41 AM · User-Joe, codfw-rollout, Patch-For-Review, MediaWiki-JobRunner, MediaWiki-JobQueue, Operations
Joe triaged T163418: jobqueue is full of refreshlinks duplicates after the switchover. as Unbreak Now! priority.
Apr 20 2017, 8:38 AM · User-Joe, codfw-rollout, Patch-For-Review, MediaWiki-JobRunner, MediaWiki-JobQueue, Operations
Joe created T163418: jobqueue is full of refreshlinks duplicates after the switchover..
Apr 20 2017, 8:38 AM · User-Joe, codfw-rollout, Patch-For-Review, MediaWiki-JobRunner, MediaWiki-JobQueue, Operations

Apr 19 2017

Joe added a comment to T163337: Job queue corruption after codfw switch over (Queue worth, duplicate runs).

One possible explanation is that this is due to jobs that were already running in eqiad before the switchover and got killed before acknowledging, due to the way we stop jobrunners in the switchover procedure.
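
The mechanism suspected here can be shown with a toy queue in which a job stays "claimed" until it is acknowledged, and claimed jobs that were never acknowledged are eventually put back. This is a simplified model under that assumption, not the actual JobQueueRedis logic; the `ToyQueue` class and its method names are hypothetical.

```python
from collections import deque


class ToyQueue:
    """Toy at-least-once queue: a claimed job stays claimed until acked."""

    def __init__(self, jobs):
        self.unclaimed = deque(jobs)
        self.claimed = set()

    def pop(self):
        job = self.unclaimed.popleft()
        self.claimed.add(job)
        return job

    def ack(self, job):
        self.claimed.discard(job)

    def recycle_abandoned(self):
        """Put every claimed-but-never-acked job back on the queue."""
        recycled = sorted(self.claimed)
        self.unclaimed.extend(recycled)
        self.claimed.clear()
        return recycled


queue = ToyQueue(["job-1", "job-2"])
runs = []

first = queue.pop()
runs.append(first)
queue.ack(first)             # finished normally and acknowledged

second = queue.pop()
runs.append(second)          # ran, but the runner was killed before ack()

queue.recycle_abandoned()    # later, the unacknowledged job is re-queued
runs.append(queue.pop())     # so the same job runs a second time

print(runs)
```

In this model the only way to avoid the second run is to let the runner reach `ack()` before it is stopped.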

Apr 19 2017, 5:51 PM · Patch-For-Review, Wikimedia-Incident, MediaWiki-JobQueue, Scoring-platform-team, User-Ladsgroup, MediaWiki-extensions-ORES, ORES, MediaWiki-Watchlist, codfw-rollout
Joe added a comment to T163337: Job queue corruption after codfw switch over (Queue worth, duplicate runs).

@Ladsgroup is this happening for new jobs enqueued now?

Apr 19 2017, 5:26 PM · Patch-For-Review, Wikimedia-Incident, MediaWiki-JobQueue, Scoring-platform-team, User-Ladsgroup, MediaWiki-extensions-ORES, ORES, MediaWiki-Watchlist, codfw-rollout
Joe added a comment to T163278: Four different PHP/HHVM versions on the cluster.

terbium will be upgraded to jessie as soon as we've switched over, for the record.

Apr 19 2017, 10:50 AM · Operations

Apr 12 2017

Joe added a comment to T162462: Standalone puppet masters are broken (uninstallable packages).

It has stopped happening since those last lines I've pasted above (something by cron? logrotate?). I'll keep an eye out for it though.

Apr 12 2017, 9:43 AM · Patch-For-Review, Labs, Operations
Joe claimed T162780: ocg1003 partitions are severely misconfigured.
Apr 12 2017, 9:05 AM · Operations
Joe created T162780: ocg1003 partitions are severely misconfigured.
Apr 12 2017, 9:05 AM · Operations

Apr 7 2017

Joe added a comment to T162462: Standalone puppet masters are broken (uninstallable packages).

Our production puppetmasters run on 3.8, several clients have been tested, and the agent should have minimal differences.

Apr 7 2017, 5:21 PM · Patch-For-Review, Labs, Operations
Joe added a comment to T161684: HHVM 3.18 deadlocks after 4-6 hours (stuck in in HPHP::Treadmill::getAgeOldestRequest() ).

@MoritzMuehlenhoff I'm ok with a wider but limited rollout, but at least until the dc switchover and rollback are done, I'd prefer to stick with the known evil of 3.12 on most systems.

Apr 7 2017, 7:04 AM · HHVM, Operations
Joe added a comment to T149006: elastic2020 is powered off and does not want to restart.

@Gehel @RobH I spoke again yesterday about this case with the HP engineer who helped me on the lvs2002 (T162099) issue. After going over the log, and taking into consideration what I found about the hard drive warning that the previous HP engineer didn't take time to investigate, here is what he thinks:

Hi Paul,

As discussed on the call, I noticed that you were using Intel SSDs on the server and these SSDs do not support HPE diagnostics on them and therefore we are unable to pull any details about the SSD from our Smart Storage Administrator. This issue you are facing could possibly be caused by the SSDs you are using but there is no way for HPE to confirm that since these are Intel SSDs.

Thanks & Regards,

Apr 7 2017, 5:45 AM · Patch-For-Review, Discovery-Search (Current work), Discovery, ops-codfw, DC-Ops, Operations, Elasticsearch

Apr 3 2017

Joe added a comment to T161684: HHVM 3.18 deadlocks after 4-6 hours (stuck in in HPHP::Treadmill::getAgeOldestRequest() ).

The current performance loss seems less significant though (e.g. compare mw1261 with HHVM 3.18 and stat_cache disabled to mw1266 with HHVM and stat_cache enabled in Grafana's server dashboard). The machines are identical hardware-wise.

Apr 3 2017, 8:26 PM · HHVM, Operations
Joe added a comment to T162035: Some PNG thumbnails and JPEG originals delivered as [text/html] content-type and hence not rendered in browser.

@stjn I cannot reproduce your case and we should've fixed the largest underlying reason for such problems. Do you still experience the same issue?

Apr 3 2017, 3:51 PM · Patch-For-Review, Traffic, Operations, media-storage, User-Urbanecm
Joe added a comment to T100793: [RFC] Define the on-disk and live structure of etcd pool data.

This has been practically superseded by so many specific tickets it doesn't really make much sense anymore.

Apr 3 2017, 6:43 AM · Services (watching), User-Joe, RfC, Operations, discovery-system, services-tooling
Joe closed T100793: [RFC] Define the on-disk and live structure of etcd pool data as Declined.
Apr 3 2017, 6:42 AM · Services (watching), User-Joe, RfC, Operations, discovery-system, services-tooling
Joe moved T159687: etcd switchover/enhancements from Doing to Backlog on the User-Joe board.
Apr 3 2017, 6:42 AM · Patch-For-Review, User-Joe, Operations
Joe closed T156100: DNS: dynamically generate entries for service discovery as Resolved.
Apr 3 2017, 6:39 AM · Availability (Multiple-active-datacenters), Patch-For-Review, Services (watching), Performance-Team, discovery-system, User-Joe, User-mobrovac, MediaWiki-Configuration, Operations, Wikimedia-Developer-Summit (2017)
Joe closed T156100: DNS: dynamically generate entries for service discovery, a subtask of T149617: Integrating MediaWiki (and other services) with dynamic configuration, as Resolved.
Apr 3 2017, 6:38 AM · Availability (Multiple-active-datacenters), Patch-For-Review, Services (watching), Performance-Team, discovery-system, User-Joe, User-mobrovac, MediaWiki-Configuration, Operations, Wikimedia-Developer-Summit (2017)
Joe updated the task description for T154658: Prepare and improve the datacenter switchover procedure.
Apr 3 2017, 6:23 AM · Availability (Multiple-active-datacenters), DC-Switchover-Prep-Q3-2016-17, Epic, Operations
Joe added a comment to T149617: Integrating MediaWiki (and other services) with dynamic configuration.

Status as of now:

Apr 3 2017, 6:20 AM · Availability (Multiple-active-datacenters), Patch-For-Review, Services (watching), Performance-Team, discovery-system, User-Joe, User-mobrovac, MediaWiki-Configuration, Operations, Wikimedia-Developer-Summit (2017)
Joe triaged T162013: etcd cluster in codfw has raft consensus issues as Low priority.
Apr 3 2017, 6:12 AM · User-Joe, Operations
Joe created T162013: etcd cluster in codfw has raft consensus issues.
Apr 3 2017, 6:12 AM · User-Joe, Operations

Mar 29 2017

Joe added a comment to T161675: Re-think puppet management for deployment-prep.

Will this repo just be a different site.pp for beta node definitions + hieradata? Or was something broader envisioned?

Mar 29 2017, 6:16 PM · Release-Engineering-Team (Next), User-Joe, Beta-Cluster-Infrastructure, Labs, Puppet
Joe added a comment to T161675: Re-think puppet management for deployment-prep.

on the deployment-prep puppetmaster, define a disk-based hiera hierarchy to mimic 1:1 what we have in production

I think this would need to be a git repo that is tracked in gerrit or diffusion primarily because we can't guarantee that any particular OpenStack VM will be durable. If something happens to the disk image itself or to the labvirt host that currently holds the image we would need to be able to rebuild the puppetmaster and put all the custom config back.

Mar 29 2017, 3:57 PM · Release-Engineering-Team (Next), User-Joe, Beta-Cluster-Infrastructure, Labs, Puppet