Joe (Giuseppe Lavagetto)
Spy

Projects (19)

User Details

User Since
Oct 3 2014, 5:57 AM (141 w, 6 d)
Availability
Available
LDAP User
Giuseppe Lavagetto
MediaWiki User
Unknown

Recent Activity

Yesterday

Joe added a comment to T97192: HHVM request timeouts not working; support lowering the API request timeout per request.

@Joe, has this been fixed with 3.18?

Wed, Jun 21, 7:37 AM · User-Joe, Operations, Services (watching), Wikimedia-Incident, Incident-20150423-Commons, HHVM, RESTBase, Parsoid, Availability, Performance, MediaWiki-API
Joe added a comment to T165105: Wiley requests for DOI and some other publishers don't work in production.

Hey sorry I was on vacation last week.

Wed, Jun 21, 7:29 AM · Patch-For-Review, Services (blocked), Operations, User-mobrovac, Citoid, VisualEditor
Joe added a project to T168462: codfw row A switch upgrade: User-Joe.
Wed, Jun 21, 7:14 AM · User-Joe, Traffic, netops, Operations
Joe updated the task description for T168462: codfw row A switch upgrade.
Wed, Jun 21, 7:13 AM · User-Joe, Traffic, netops, Operations

Tue, Jun 20

Joe added a project to T163337: Job queue corruption after codfw switch over (Queue growth, duplicate runs): User-Joe.
Tue, Jun 20, 1:36 PM · User-Joe, Performance-Team, User-Elukey, Wikimedia-Incident, MediaWiki-JobQueue, Scoring-platform-team, User-Ladsgroup, ORES, codfw-rollout
Joe placed T168360: Gerrit constantly throws HTTP 500 error when reviewing patches (due to "Too many open files") up for grabs.
Tue, Jun 20, 8:35 AM · Release-Engineering-Team (Kanban), Patch-For-Review, Operations, Gerrit
Joe added a comment to T168360: Gerrit constantly throws HTTP 500 error when reviewing patches (due to "Too many open files").

So what I did:

Tue, Jun 20, 8:35 AM · Release-Engineering-Team (Kanban), Patch-For-Review, Operations, Gerrit
Joe added a comment to T168360: Gerrit constantly throws HTTP 500 error when reviewing patches (due to "Too many open files").
Caused by: java.nio.file.FileSystemException: .../_1ut5m.nvm: Too many open files
Tue, Jun 20, 8:30 AM · Release-Engineering-Team (Kanban), Patch-For-Review, Operations, Gerrit
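
For context on the "Too many open files" exception quoted above: that message is what a process reports once it reaches its per-process file descriptor limit. The following is a minimal Python sketch, assuming a Linux host with /proc mounted, for comparing a process's open descriptors against its limits. The helper names (fd_limits, open_fd_count) are hypothetical and are not taken from the task, and this is not a description of how the Gerrit issue was actually diagnosed.

```python
import os
import resource


def fd_limits():
    """Return the (soft, hard) open-file limits of the current process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def open_fd_count(pid="self"):
    """Count open descriptors via /proc/<pid>/fd; None if unavailable."""
    fd_dir = f"/proc/{pid}/fd"
    try:
        return len(os.listdir(fd_dir))
    except OSError:
        # No /proc, no such process, or not enough permissions.
        return None


if __name__ == "__main__":
    soft, hard = fd_limits()
    used = open_fd_count()
    print(f"soft limit: {soft}, hard limit: {hard}, open now: {used}")
```

Note that fd_limits() reports the limits of the Python process itself; for another process, the limits in effect are listed in /proc/<pid>/limits.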
Joe claimed T168360: Gerrit constantly throws HTTP 500 error when reviewing patches (due to "Too many open files").
Tue, Jun 20, 8:29 AM · Release-Engineering-Team (Kanban), Patch-For-Review, Operations, Gerrit

Mon, Jun 19

Joe moved T167130: Decom mw1170-mw1179, and replace them with new systems. from Doing to Blocked on others on the User-Joe board.
Mon, Jun 19, 12:56 PM · User-Joe, ops-eqiad, Operations
Joe added a comment to T168271: Decommission mw1170-mw1179.

@Cmjohnson please proceed to decom/derack these servers and rack new ones in their place.

Mon, Jun 19, 12:50 PM · Patch-For-Review, hardware-requests, User-Joe, ops-eqiad, Operations
Joe updated the task description for T168271: Decommission mw1170-mw1179.
Mon, Jun 19, 12:49 PM · Patch-For-Review, hardware-requests, User-Joe, ops-eqiad, Operations
Joe updated the task description for T168271: Decommission mw1170-mw1179.
Mon, Jun 19, 12:47 PM · Patch-For-Review, hardware-requests, User-Joe, ops-eqiad, Operations
Joe created T168271: Decommission mw1170-mw1179.
Mon, Jun 19, 11:00 AM · Patch-For-Review, hardware-requests, User-Joe, ops-eqiad, Operations
Joe moved T167130: Decom mw1170-mw1179, and replace them with new systems. from Backlog to Doing on the User-Joe board.
Mon, Jun 19, 10:27 AM · User-Joe, ops-eqiad, Operations

Fri, Jun 9

Joe added a comment to T167048: Services need external monitoring.

both maps and restbase are now monitored at the load-balancers of the SSL terminators in all datacenters. Resolving.

Fri, Jun 9, 3:48 PM · Patch-For-Review, User-Joe, Operations, Services (next), User-mobrovac, monitoring

Thu, Jun 8

Joe added a comment to T167048: Services need external monitoring.

@faidon at first I was thinking of implementing the checks on the LVS host (in the end, the puppetization is mostly the same), but I thought the nrpe checks on caches would be better because they monitor each cache host instead of round-robining across every host in a pool. It might also help with spotting problems on individual caches.

Thu, Jun 8, 6:07 AM · Patch-For-Review, User-Joe, Operations, Services (next), User-mobrovac, monitoring

Wed, Jun 7

Joe added a comment to T167048: Services need external monitoring.

In order to do that, I want to do a local nrpe check on the cache edge servers, calling the SSL terminator, so that we cover as many logical layers as possible. Sadly, there is a bug in service-checker that I need to fix before this can go live, but apart from that it should be pretty straightforward.

Wed, Jun 7, 3:04 PM · Patch-For-Review, User-Joe, Operations, Services (next), User-mobrovac, monitoring
Joe added a project to T167269: Make security updates of docker images manageable: Operations.
Wed, Jun 7, 10:17 AM · Operations, User-Joe, Prod-Kubernetes (Experiment), Kubernetes
Joe added a parent task for T167269: Make security updates of docker images manageable: T162043: Define a process to keep images up-to-date on similar standards as the rest of production.
Wed, Jun 7, 10:17 AM · Operations, User-Joe, Prod-Kubernetes (Experiment), Kubernetes
Joe added a subtask for T162043: Define a process to keep images up-to-date on similar standards as the rest of production: T167269: Make security updates of docker images manageable.
Wed, Jun 7, 10:17 AM · Kubernetes, Goal, Operations
Joe added a subtask for T162042: Prepare and maintain base container images: T167269: Make security updates of docker images manageable.
Wed, Jun 7, 10:08 AM · Release-Engineering-Team (Watching / External), Patch-For-Review, User-mobrovac, Services (designing), Kubernetes, Goal, Operations
Joe added a parent task for T167269: Make security updates of docker images manageable: T162042: Prepare and maintain base container images.
Wed, Jun 7, 10:08 AM · Operations, User-Joe, Prod-Kubernetes (Experiment), Kubernetes
Joe updated the task description for T167269: Make security updates of docker images manageable.
Wed, Jun 7, 9:57 AM · Operations, User-Joe, Prod-Kubernetes (Experiment), Kubernetes
Joe updated the task description for T167269: Make security updates of docker images manageable.
Wed, Jun 7, 9:50 AM · Operations, User-Joe, Prod-Kubernetes (Experiment), Kubernetes
Joe created T167269: Make security updates of docker images manageable.
Wed, Jun 7, 9:45 AM · Operations, User-Joe, Prod-Kubernetes (Experiment), Kubernetes

Tue, Jun 6

Joe added a comment to T167048: Services need external monitoring.

I would start monitoring restbase on text-lb and maps on text-upload.

Tue, Jun 6, 2:26 PM · Patch-For-Review, User-Joe, Operations, Services (next), User-mobrovac, monitoring
Joe moved T167048: Services need external monitoring from Backlog to Doing on the User-Joe board.
Tue, Jun 6, 2:24 PM · Patch-For-Review, User-Joe, Operations, Services (next), User-mobrovac, monitoring
Joe added a comment to T144169: Flake8 for python files without extension in puppet repo.

After some discussion in https://gerrit.wikimedia.org/r/#/c/323559/ I've changed my vote to "automatically discover python files".

There is a mixture of conventions in python scripts inside puppet.git: underscore vs dash, extension vs no extension. But having pep/flake tests is important even without explicit opt-in; the tests can be non-voting until the violations are fixed.
Re: naming, I think an obvious convention would be to keep the name in puppet the same as what's installed on the filesystem, especially for things expected to be in PATH.

Tue, Jun 6, 8:22 AM · Patch-For-Review, Continuous-Integration-Config, Operations, Operations-Software-Development

Mon, Jun 5

Joe added a project to T166806: Server side upload for Yann: Operations.
Mon, Jun 5, 9:04 AM · Patch-For-Review, Operations, media-storage, Commons, Wikimedia-Site-requests

Thu, Jun 1

Joe added a comment to T166345: wmf/1.30.0-wmf.2 performance issue for Wikipedias.

(Continued investigation using the data we did capture in SAL and Logstash)

  • Alerts for HHVM response times taking 10+ seconds only happened in the inactive data center (codfw).
Thu, Jun 1, 6:28 AM · Performance-Team-notice, Patch-For-Review, Services (watching), Operations, Performance-Team, Wikimedia-General-or-Unknown, MediaWiki-General-or-Unknown

Wed, May 31

Joe added a project to T165760: Deploy Recommendation API as a service: User-Joe.
Wed, May 31, 2:59 PM · Patch-For-Review, User-mobrovac, Services (doing), User-Joe, Recommendation-API
Joe added a comment to T165760: Deploy Recommendation API as a service.

So I have quite a few questions regarding this:

Wed, May 31, 2:59 PM · Patch-For-Review, User-mobrovac, Services (doing), User-Joe, Recommendation-API
Joe created D671: Add Makefile.
Wed, May 31, 2:12 PM · Release-Engineering-Team
Joe closed T166552: Switch etcd back to eqiad, document switchover procedure as Resolved.
Wed, May 31, 12:57 PM · User-Joe, Operations
Joe added a comment to T166552: Switch etcd back to eqiad, document switchover procedure.

All done, the play-by-play is how I executed the switchover. I'll write up some more documentation, and close the ticket as resolved.

Wed, May 31, 9:42 AM · User-Joe, Operations
Joe added a comment to T166552: Switch etcd back to eqiad, document switchover procedure.
  1. Merge https://gerrit.wikimedia.org/r/356138
  2. sudo cumin 'R:class = role::configcluster and *.codfw.wmnet' 'run-puppet-agent' (begins read-only)
  3. sudo cumin 'R:class = role::configcluster' 'disable-puppet "etcd replication switchover"'
  4. Merge https://gerrit.wikimedia.org/r/#/c/356139
  5. sudo cumin 'R:class = role::configcluster and *.eqiad.wmnet' 'run-puppet-agent -e "etcd replication switchover"' (stops replica in eqiad)
  6. Merge https://gerrit.wikimedia.org/r/#/c/356136/ and update dns
  7. sudo cumin 'conf2002.codfw.wmnet' 'python /home/oblivian/switch_replica.py conf1001.eqiad.wmnet conftool' (sets the replication index in codfw)
  8. sudo cumin 'R:class = role::configcluster and *.codfw.wmnet' 'run-puppet-agent -e "etcd replication switchover"' (starts replica in codfw)
  9. Merge https://gerrit.wikimedia.org/r/356341
  10. sudo cumin 'R:class = role::configcluster and *.eqiad.wmnet' 'run-puppet-agent' (ends read-only)
  11. Merge and deploy https://gerrit.wikimedia.org/r/#/c/356137/
Wed, May 31, 9:00 AM · User-Joe, Operations
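
The eleven steps above can be written down as an ordered, dry-run checklist. This is a minimal Python sketch that only prints the plan and executes nothing. The commands and URLs are copied from the comment; the STEPS structure, the "merge"/"run" labels, and render_plan are hypothetical names introduced for illustration.

```python
# Each entry: (kind, target, note). Copied from the play-by-play; never executed.
STEPS = [
    ("merge", "https://gerrit.wikimedia.org/r/356138", ""),
    ("run", "sudo cumin 'R:class = role::configcluster and *.codfw.wmnet' "
            "'run-puppet-agent'", "begins read-only"),
    ("run", "sudo cumin 'R:class = role::configcluster' "
            "'disable-puppet \"etcd replication switchover\"'", ""),
    ("merge", "https://gerrit.wikimedia.org/r/#/c/356139", ""),
    ("run", "sudo cumin 'R:class = role::configcluster and *.eqiad.wmnet' "
            "'run-puppet-agent -e \"etcd replication switchover\"'",
     "stops replica in eqiad"),
    ("merge", "https://gerrit.wikimedia.org/r/#/c/356136/", "and update dns"),
    ("run", "sudo cumin 'conf2002.codfw.wmnet' "
            "'python /home/oblivian/switch_replica.py conf1001.eqiad.wmnet conftool'",
     "sets the replication index in codfw"),
    ("run", "sudo cumin 'R:class = role::configcluster and *.codfw.wmnet' "
            "'run-puppet-agent -e \"etcd replication switchover\"'",
     "starts replica in codfw"),
    ("merge", "https://gerrit.wikimedia.org/r/356341", ""),
    ("run", "sudo cumin 'R:class = role::configcluster and *.eqiad.wmnet' "
            "'run-puppet-agent'", "ends read-only"),
    ("merge", "https://gerrit.wikimedia.org/r/#/c/356137/", "merge and deploy"),
]


def render_plan(steps):
    """Return one numbered, human-readable line per step, in order."""
    lines = []
    for number, (kind, target, note) in enumerate(steps, start=1):
        verb = "MERGE" if kind == "merge" else "RUN  "
        suffix = f"  # {note}" if note else ""
        lines.append(f"{number:2d}. {verb} {target}{suffix}")
    return lines


if __name__ == "__main__":
    print("\n".join(render_plan(STEPS)))
```

Keeping the read-only window explicit (steps 2 through 10) makes it easier to see which merges and puppet runs must happen while writes are blocked.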
Joe updated the task description for T166552: Switch etcd back to eqiad, document switchover procedure.
Wed, May 31, 8:42 AM · User-Joe, Operations
Joe added a comment to T166552: Switch etcd back to eqiad, document switchover procedure.

The simple script to set the replication index in codfw before starting replication:

Wed, May 31, 8:29 AM · User-Joe, Operations
Joe updated the task description for T166552: Switch etcd back to eqiad, document switchover procedure.
Wed, May 31, 8:25 AM · User-Joe, Operations
Joe added a comment to T166038: Sync internal nutcracker package with Debian package.

The only thing we have added to 0.4.1 is https://github.com/wikimedia/operations-debs-nutcracker/commit/37fb9a2b939821c6d704ba09b7d80bcc88961224, which is useful if we raise the log verbosity but don't want details on every connection.

Wed, May 31, 7:27 AM · User-Joe, Operations
Joe moved T166081: rack/setup/install conf1004-conf1006 from Backlog to Blocked on others on the User-Joe board.
Wed, May 31, 7:22 AM · Patch-For-Review, User-Elukey, ops-eqiad, User-Joe, Operations

Tue, May 30

Joe added a comment to T166345: wmf/1.30.0-wmf.2 performance issue for Wikipedias.

Since the problem presented itself only ~15 minutes after the deploy, it could be that something that we were able to cache in WANCache before is now somehow uncacheable and thus very expensive to compute.

Tue, May 30, 4:44 PM · Performance-Team-notice, Patch-For-Review, Services (watching), Operations, Performance-Team, Wikimedia-General-or-Unknown, MediaWiki-General-or-Unknown
Joe added a comment to T165105: Wiley requests for DOI and some other publishers don't work in production.

In fact, I suppose the problem is our proxy IP in eqiad has been banned. From the proxy machine

Tue, May 30, 2:33 PM · Patch-For-Review, Services (blocked), Operations, User-mobrovac, Citoid, VisualEditor
Joe added a comment to T165105: Wiley requests for DOI and some other publishers don't work in production.

Actually it was a dumb comment - the log I pasted clearly reported TCP_MISS/302, so I'm not sure what's happening. Investigating further.

Tue, May 30, 2:28 PM · Patch-For-Review, Services (blocked), Operations, User-mobrovac, Citoid, VisualEditor
Joe added a comment to T165105: Wiley requests for DOI and some other publishers don't work in production.

So the problem is the eqiad proxy cached a redirect to localhost, likely sent by the remote host during an outage

Tue, May 30, 2:20 PM · Patch-For-Review, Services (blocked), Operations, User-mobrovac, Citoid, VisualEditor
Joe claimed T166552: Switch etcd back to eqiad, document switchover procedure.
Tue, May 30, 11:51 AM · User-Joe, Operations
Joe created T166552: Switch etcd back to eqiad, document switchover procedure.
Tue, May 30, 11:50 AM · User-Joe, Operations
Joe added a comment to T161710: Automate RESTBase blacklisting.

Thank you @Joe for getting to this!

A couple of questions:

  1. To access it in Change-Prop I would just need to get the reds::shards::jobqueue::<%= site =>::changeprop-1 and 2 in CP config, right? No additional configuration needed?
Tue, May 30, 6:27 AM · Services (done), ChangeProp, RESTBase

Mon, May 29

Joe added a project to T166038: Sync internal nutcracker package with Debian package: User-Joe.
Mon, May 29, 4:28 PM · User-Joe, Operations
Joe added a comment to T134811: Consider REST with SSL (HyperSwitch/Cassandra) for session storage.

I've seen there hasn't been much going on on this task, but I want to take the opportunity to say that I don't think it's a good idea to use software created for other purposes (HS) to serve our sessions.

Specifically, we want to create a dedicated service that is simple and specialized enough to be able to add features like (first things off the top of my head) brute-force-attack detection or token generation.

HyperSwitch is a generic REST API router library with added candy like pluggable (route) filters and a request template compiler. It itself does not expose any functionality, so it seems like a perfect fit for what we want here, given that the service's API will be relatively simple (get/set), leaving us room to focus on the actual issues that you mentioned.

Mon, May 29, 2:55 PM · Availability (Multiple-active-datacenters), Operations, Services, Performance-Team
Joe added a comment to T159922: pdfrender fails to serve requests since Mar 8 00:30:32 UTC on scb1003.

The problem (pdfrender hanging at startup) just showed up again on scb1002, and it seems there is no way to get around that race condition at the moment (no amount of waiting is ok).

Mon, May 29, 7:51 AM · Patch-For-Review, Operations, Services, Electron-PDFs
Joe added a comment to T134811: Consider REST with SSL (HyperSwitch/Cassandra) for session storage.

I've seen there hasn't been much going on on this task, but I want to take the opportunity to say that I don't think it's a good idea to use software created for other purposes (HS) to serve our sessions.

Mon, May 29, 6:56 AM · Availability (Multiple-active-datacenters), Operations, Services, Performance-Team
Joe closed T165024: Upgrade calico to 2.2, document build process. as Resolved.
Mon, May 29, 6:50 AM · Patch-For-Review, User-Joe, Kubernetes, Goal, Operations
Joe closed T165024: Upgrade calico to 2.2, document build process., a subtask of T162039: Prepare to service applications from kubernetes, as Resolved.
Mon, May 29, 6:50 AM · Kubernetes, Goal, Operations

Fri, May 26

Joe added a comment to T166345: wmf/1.30.0-wmf.2 performance issue for Wikipedias.

I believe this would merit an incident report to find out exactly what happened?

Fri, May 26, 6:25 AM · Performance-Team-notice, Patch-For-Review, Services (watching), Operations, Performance-Team, Wikimedia-General-or-Unknown, MediaWiki-General-or-Unknown
Joe added a comment to T166345: wmf/1.30.0-wmf.2 performance issue for Wikipedias.

To be more clear: there is exactly 0% probability this was caused by something other than the release of -wmf2 to the wikipedias. The issue started at 19:20 UTC and finished the instant the train was rolled back.

Fri, May 26, 6:13 AM · Performance-Team-notice, Patch-For-Review, Services (watching), Operations, Performance-Team, Wikimedia-General-or-Unknown, MediaWiki-General-or-Unknown
Joe added a comment to T166345: wmf/1.30.0-wmf.2 performance issue for Wikipedias.

So to summarize this succinctly, I'll post the list of requests that took more than 6 seconds to render on a random appserver:

Fri, May 26, 6:07 AM · Performance-Team-notice, Patch-For-Review, Services (watching), Operations, Performance-Team, Wikimedia-General-or-Unknown, MediaWiki-General-or-Unknown

Wed, May 24

Joe added a comment to T165519: rack and setup mw1307-1348 .

@Cmjohnson I suggest we do the following:

Wed, May 24, 2:18 PM · User-Elukey, User-Joe, Operations, ops-eqiad
Joe added a comment to T165519: rack and setup mw1307-1348 .

Another option is not to care much how the current distribution goes but to just evenly distribute servers across rows, and then go on and rebalance the whole cluster.

Wed, May 24, 9:45 AM · User-Elukey, User-Joe, Operations, ops-eqiad
Joe added a comment to T165519: rack and setup mw1307-1348 .

Here is my proposal regarding these systems:

Wed, May 24, 9:39 AM · User-Elukey, User-Joe, Operations, ops-eqiad
Joe added a comment to T147718: RFC: New puppet code organization paradigm/coding standards.

I have a question about the new profile guidelines:

Profile classes should only have parameters that default to an explicit hiera call with no fallback value.

Why no fallback defaults?

Wed, May 24, 5:57 AM · Patch-For-Review, RfC, Puppet, Operations
Joe added a comment to T147718: RFC: New puppet code organization paradigm/coding standards.

Also, if configuration of profiles can only be done via hiera, doesn't that mean any module parameter that we may want to override needs to be specified as a profile parameter?

Wed, May 24, 5:52 AM · Patch-For-Review, RfC, Puppet, Operations

May 23 2017

Joe added a comment to T166081: rack/setup/install conf1004-conf1006.

Racking request is just that these new machines go in different rows. They can even be in the racks of the other conf* systems as those old systems will be eventually decommissioned.

May 23 2017, 3:13 PM · Patch-For-Review, User-Elukey, ops-eqiad, User-Joe, Operations

May 22 2017

Joe moved T147204: Update confd package from Blocking others to Backlog on the User-Joe board.
May 22 2017, 4:43 PM · User-Joe, Beta-Cluster-reproducible, Operations
Joe moved T165519: rack and setup mw1307-1348 from Backlog to Blocking others on the User-Joe board.
May 22 2017, 4:43 PM · User-Elukey, User-Joe, Operations, ops-eqiad
Joe added a project to T165519: rack and setup mw1307-1348 : User-Joe.
May 22 2017, 4:43 PM · User-Elukey, User-Joe, Operations, ops-eqiad
Joe added projects to T166066: Integrate the puppet compiler in the puppet CI pipeline: Operations, Puppet.
May 22 2017, 3:44 PM · Puppet, Operations
Joe created T166066: Integrate the puppet compiler in the puppet CI pipeline.
May 22 2017, 3:44 PM · Puppet, Operations

May 21 2017

Joe added a comment to T163337: Job queue corruption after codfw switch over (Queue growth, duplicate runs).

@aaron another interesting open bug that might be worth reviewing: https://github.com/antirez/redis/issues/1525 ("EVAL replicated + conditionals about key existence = replication bug.")

It relies on TTLs, though, which we don't use, so it probably will not resolve much, but it is good to keep as a reference.

May 21 2017, 9:37 AM · User-Joe, Performance-Team, User-Elukey, Wikimedia-Incident, MediaWiki-JobQueue, Scoring-platform-team, User-Ladsgroup, ORES, codfw-rollout
Joe added a comment to T156199: Low-latency current revision storage.

Without going deeper into the reasoning behind the requirements that this ticket assumes to be true (I'm not sure all of those are justified, but that's another topic), I would say that the "application-level TTLs" option seems the best way to go for a few reasons:

May 21 2017, 9:14 AM · User-mobrovac, Cassandra, Services (designing), Wikimedia-Incident, RfC, RESTBase

May 15 2017

Joe renamed T165024: Upgrade calico to 2.2, document build process. from Upgrade calico to 2.1, document build process. to Upgrade calico to 2.2, document build process..
May 15 2017, 9:51 AM · Patch-For-Review, User-Joe, Kubernetes, Goal, Operations
Joe moved T165024: Upgrade calico to 2.2, document build process. from Backlog to Doing on the User-Joe board.
May 15 2017, 9:49 AM · Patch-For-Review, User-Joe, Kubernetes, Goal, Operations
Joe moved T159687: etcd switchover/enhancements from Doing to Backlog on the User-Joe board.
May 15 2017, 9:49 AM · Patch-For-Review, User-Joe, Operations

May 12 2017

Joe added a comment to T125735: Warning: timed out after 0.2 seconds when connecting to rdb1001.eqiad.wmnet [110]: Connection timed out.

After deploying https://gerrit.wikimedia.org/r/353247 to labs I have observed no relevant changes to TCP metrics of any of the following:

  • deployment-jobrunner02
  • deployment-mediawiki05
  • deployment-redis01

    From prometheus-beta it is easy to select a graph and plot it. As an example, here's the list of ESTABLISHED TCP connections for the hosts mentioned above:
  • deployment-jobrunner02
  • deployment-mediawiki05
  • deployment-redis01

    After a chat with @hashar it seems that the RedisConnectionPool.php class (which offers persistent connections) is used only by mediawiki, not by the jobrunner service (which instantiates Redis() without any trace of pconnect).

    I tried to live hack on deployment-prep.mediawiki05 jobqueue-beta.php changing 'persistent' => defined( 'MEDIAWIKI_JOB_RUNNER' ) to 1 but nothing has really changed.

    My understanding of the job queues is still very very bad, so I have these (possibly dumb) questions for @Krinkle or @aaron:
  • Do you have any idea about what I am doing wrong with https://gerrit.wikimedia.org/r/#/c/353247 ? Is my understanding correct about the fact that this setting is only for mediawiki app/api servers and not jobrunners?
  • Should we patch the jobrunner service to offer persistent connections to Redis too?

    Sorry for pushing this but in my opinion we'd need to come up with a good plan forward, let me know your thoughts.
May 12 2017, 1:08 PM · Patch-For-Review, User-Elukey, Operations, Wikimedia-log-errors

May 11 2017

Joe added a comment to T165024: Upgrade calico to 2.2, document build process..

I am re-doing our calico-containers repository from scratch, importing a version from upstream and managing the now-minimal changes to the Dockerfiles with quilt. This will make it easier to build calicoctl (the debian package) properly.

May 11 2017, 11:18 AM · Patch-For-Review, User-Joe, Kubernetes, Goal, Operations
Joe updated the task description for T165024: Upgrade calico to 2.2, document build process..
May 11 2017, 11:08 AM · Patch-For-Review, User-Joe, Kubernetes, Goal, Operations
Joe created T165024: Upgrade calico to 2.2, document build process..
May 11 2017, 10:03 AM · Patch-For-Review, User-Joe, Kubernetes, Goal, Operations
Joe closed T163565: Install conftool on deployment masters, a subtask of T104352: Make scap able to depool/repool servers via the conftool API, as Resolved.
May 11 2017, 6:25 AM · releng-201617-q4, Scap (Scap3-MediaWiki-MVP), scap2, Operations, HHVM, Performance-Team
Joe closed T163565: Install conftool on deployment masters as Resolved.
May 11 2017, 6:25 AM · User-Joe, Patch-For-Review, Operations, Scap (Scap3-MediaWiki-MVP), Deployment-Systems
Joe closed T163565: Install conftool on deployment masters, a subtask of T125629: Depool proxies temporarily while scap is ongoing to avoid taxing those nodes, as Resolved.
May 11 2017, 6:25 AM · Scap (Scap3-MediaWiki-MVP), scap2, Operations
Joe added a comment to T163565: Install conftool on deployment masters.

@Joe: That all seems reasonable. I don't particularly want to duplicate logic in scap unless it makes the most sense for that logic to live in scap.

This task is mostly concerned with implementing depooling for mediawiki deployments, but the functionality in scap could handle the process with other services as well. If this isn't the right approach then I think we could use some guidance from you on how to get this working.

May 11 2017, 6:05 AM · User-Joe, Patch-For-Review, Operations, Scap (Scap3-MediaWiki-MVP), Deployment-Systems
Joe requested changes to D600: Create a wrapper around conftool for our pooling/depooling needs.

I think the basic idea for the patch is good, but the implementation can be improved, as it is not currently doing what it's intended to do.

May 11 2017, 6:01 AM · Release-Engineering-Team

May 9 2017

Joe added a comment to T164793: Parsoid deploy failed.

so, after some digging, I found out that conf2002.codfw.wmnet had, for some reason, auth enabled on etcd (while we now just proxy through nginx) and moreover only had the root user available. The most probable cause is me doing something wrong when disabling auth in eqiad during the conversion of that cluster.

May 9 2017, 3:09 PM · User-Joe, Scap, Parsoid
Joe claimed T164793: Parsoid deploy failed.
May 9 2017, 2:23 PM · User-Joe, Scap, Parsoid

May 3 2017

Joe added a comment to T164376: [Discuss] Split ORES scores in datacenters based on wiki.

Yes, my only doubt with this proposal is exactly that we want to be active/active while still being able to serve all the traffic from a single datacenter.

May 3 2017, 2:54 PM · Traffic, Scoring-platform-team-Backlog, ORES, Operations, ChangeProp
Joe added a comment to T164177: switchdc: Improve wgReadOnly message.

The manual change + commit + deploy of the MW configuration might actually not be needed anymore, it depends on T163398. If that change lands in production before the switchback the related tasks in Switchdc will be updated to use conftool to change those values, hence that hardcoded part will go away anyway.

I thought this is to be removed at some point, didn't notice we're already on it. Great! So yeah, that should be the preferred way of course.

May 3 2017, 9:31 AM · Patch-For-Review, Operations, codfw-rollout, Operations-Software-Development
Joe merged T164287: Degraded RAID on restbase1018 into T163280: Degraded RAID on restbase1018.
May 3 2017, 9:02 AM · Operations
Joe merged task T164287: Degraded RAID on restbase1018 into T163280: Degraded RAID on restbase1018.
May 3 2017, 9:02 AM · ops-eqiad, Operations
Joe merged T164342: Degraded RAID on restbase1018 into T163280: Degraded RAID on restbase1018.
May 3 2017, 9:01 AM · Operations
Joe merged task T164342: Degraded RAID on restbase1018 into T163280: Degraded RAID on restbase1018.
May 3 2017, 9:01 AM · ops-eqiad, Operations
Joe closed T164202: Degraded RAID on restbase1018 as Resolved.
May 3 2017, 8:59 AM · ops-eqiad, Operations
Joe added a comment to T163292: Failed disk / degraded RAID arrays: restbase1018.eqiad.wmnet .

I just recreated the RAID arrays and rebooted the system with the new disk in place. @Eevans I'll let you re-start puppet and attend to cassandra. Of course, the data in /srv are gone for good.

May 3 2017, 8:58 AM · Patch-For-Review, Operations, Cassandra, Services (doing)

May 2 2017

Joe added a comment to T159687: etcd switchover/enhancements.

I converted the etcd cluster in eqiad to use nginx for auth/TLS, moved to ecdsa certs with the correct SANs, and started replication codfw => eqiad.

May 2 2017, 7:58 AM · Patch-For-Review, User-Joe, Operations

May 1 2017

Tbayer awarded T163438: VisualEditor broken on wikitech when codfw is primary: "Error loading data from server: apierror-visualeditor-docserver-http: HTTP 500." a Evil Spooky Haunted Tree token.
May 1 2017, 5:02 AM · codfw-rollout, wikitech.wikimedia.org, Labs

Apr 29 2017

Joe added a comment to T163337: Job queue corruption after codfw switch over (Queue growth, duplicate runs).

Note that the only keys that use a TTL are the root job de-duplication hash/timestamp keys. Everything else is normally just directly deleted (lpop,hdel,...) via Lua on ack() or by the JobChron Lua script (which uses similar commands). None of the main data structures have native redis expiration (which uses the special master deletion logic).

Apr 29 2017, 11:25 AM · User-Joe, Performance-Team, User-Elukey, Wikimedia-Incident, MediaWiki-JobQueue, Scoring-platform-team, User-Ladsgroup, ORES, codfw-rollout
Joe added a comment to T163337: Job queue corruption after codfw switch over (Queue growth, duplicate runs).

An additional case I'm going to study in more detail:

Apr 29 2017, 7:42 AM · User-Joe, Performance-Team, User-Elukey, Wikimedia-Incident, MediaWiki-JobQueue, Scoring-platform-team, User-Ladsgroup, ORES, codfw-rollout

Apr 28 2017

Joe added a project to T163565: Install conftool on deployment masters: User-Joe.
Apr 28 2017, 5:32 AM · User-Joe, Patch-For-Review, Operations, Scap (Scap3-MediaWiki-MVP), Deployment-Systems

Apr 27 2017

Joe added a comment to T163337: Job queue corruption after codfw switch over (Queue growth, duplicate runs).

Just to err on the side of caution, I reviewed all the code of JobQueueRedis and of the JobChron, and I found no obvious parts of our Lua scripts that could cause replication to break, like non-deterministic statements.

Apr 27 2017, 10:50 AM · User-Joe, Performance-Team, User-Elukey, Wikimedia-Incident, MediaWiki-JobQueue, Scoring-platform-team, User-Ladsgroup, ORES, codfw-rollout
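
To illustrate why non-deterministic statements matter in the review above: assuming scripts are replicated by re-running the same script on the replica, any script whose writes depend on node-local state (such as the local clock) can leave master and replica with different data. The following is a toy Python simulation of that idea, not Redis code and not the actual JobQueueRedis or JobChron logic; all names are hypothetical.

```python
import copy


def deterministic_ack(store, now):
    """Writes depend only on data already in the store."""
    if store["queue"]:
        job = store["queue"].pop(0)
        store["acked"].append(job)


def clock_dependent_claim(store, now):
    """Writes a node-local timestamp, so results differ per node."""
    if store["queue"]:
        job = store["queue"].pop(0)
        store["claimed"][job] = now


def replay(script, master_clock, replica_clock):
    """Run the same script on two identical stores with different local clocks."""
    master = {"queue": ["job-a", "job-b"], "acked": [], "claimed": {}}
    replica = copy.deepcopy(master)
    script(master, master_clock)
    script(replica, replica_clock)
    return master, replica


if __name__ == "__main__":
    for script in (deterministic_ack, clock_dependent_claim):
        master, replica = replay(script, master_clock=100, replica_clock=105)
        status = "in sync" if master == replica else "DIVERGED"
        print(f"{script.__name__}: {status}")
```

The usual remedy is to pass such values in as script arguments, so that every node applies identical input.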
Joe added a comment to T163337: Job queue corruption after codfw switch over (Queue growth, duplicate runs).

Also let me add a few remarks on the redis replication:

Apr 27 2017, 9:59 AM · User-Joe, Performance-Team, User-Elukey, Wikimedia-Incident, MediaWiki-JobQueue, Scoring-platform-team, User-Ladsgroup, ORES, codfw-rollout