
Joe (Giuseppe Lavagetto)
Spy

User Details

User Since
Oct 3 2014, 5:57 AM (267 w, 5 d)
Availability
Available
LDAP User
Giuseppe Lavagetto
MediaWiki User
GLavagetto (WMF) [ Global Accounts ]

Recent Activity

Today

Joe added a comment to T237319: 502 errors on ATS/8.0.5.

Nope, a 5xx doesn't translate to BAD_INCOMING_RESPONSE; it is actually specifically whitelisted:

case STATUS_CODE_SERVER_ERROR:
    TxnDebug("http_trans", "[is_response_valid] Response Error: Origin Server returned 500 - allowing");
    return true;
Wed, Nov 20, 10:18 AM · Operations, Traffic, User-DannyS712
Joe added a comment to T237319: 502 errors on ATS/8.0.5.

I find this pretty worrisome for the following reasons:

  1. right now we have one remap rule that catches all the requests handled by appservers-rw. This means that ATS only tracks one counter of server connection retries for all the sites handled by appservers-rw.discovery.wmnet, wikipedia, wikidata, wikivoyage.
  2. if one mw server behind appservers-rw.discovery.wmnet misbehaves and triggers enough BAD_INCOMING_RESPONSE errors, it could trigger ATS to mark appservers-rw.discovery.wmnet as down
  3. At that point that ats-be instance would return 5xx (502/504) for every request that otherwise would be handled by appservers-rw.discovery.wmnet
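
To illustrate the concern, here is a toy model in plain Python (not ATS code) of a single failure counter shared by every site that goes through one remap rule. The class name, the threshold value and the `serve` helper are hypothetical, and the sketch ignores any reset or time-window logic the real proxy applies.

```python
class OriginHealth:
    """Toy model: one failure counter per origin, shared by all sites mapped to it."""

    def __init__(self, max_failures):
        self.max_failures = max_failures  # hypothetical threshold, not an ATS setting
        self.failures = 0

    def record_bad_response(self):
        # Every bad response counts against the same origin,
        # no matter which backend server or site produced it.
        self.failures += 1

    @property
    def is_down(self):
        return self.failures >= self.max_failures


def serve(origin, site):
    """Return the status a client of `site` would see in this toy model."""
    return 502 if origin.is_down else 200


origin = OriginHealth(max_failures=3)
for _ in range(3):
    origin.record_bad_response()  # all three come from one misbehaving backend

statuses = {site: serve(origin, site) for site in ("wikipedia", "wikidata", "wikivoyage")}
print(statuses)
```

In this model every site behind the shared origin starts failing once the counter trips, which is the failure mode described in points 2 and 3.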
Wed, Nov 20, 10:08 AM · Operations, Traffic, User-DannyS712
Joe added a subtask for T238050: envoy overwrites the server header: T237235: Build and upload envoy 1.12.0 package..
Wed, Nov 20, 9:58 AM · RESTBase, Traffic, Operations
Joe added a parent task for T237235: Build and upload envoy 1.12.0 package.: T238050: envoy overwrites the server header.
Wed, Nov 20, 9:58 AM · Packaging, Operations, serviceops
Joe added a comment to T237235: Build and upload envoy 1.12.0 package..

In the meantime, we have a security release 1.12.1 - I will build it and upload it to stretch and buster.

Wed, Nov 20, 9:57 AM · Packaging, Operations, serviceops
Joe added a comment to T238050: envoy overwrites the server header.

And indeed it seems things are not working as expected:

Wed, Nov 20, 9:56 AM · RESTBase, Traffic, Operations
Joe added a comment to T238050: envoy overwrites the server header.

Not very mysteriously, the edges use ATS-BE so they call envoy, while the main dcs are still contacting restbase directly. Meh.

Wed, Nov 20, 9:52 AM · RESTBase, Traffic, Operations
Joe added a comment to T238050: envoy overwrites the server header.

Ok, a bit of digging:

Wed, Nov 20, 9:41 AM · RESTBase, Traffic, Operations
Joe added a comment to T238597: envoyproxy does not automatically reload certificates.

I'm confused. The hot restarter is the default since https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/536149/

Wed, Nov 20, 7:04 AM · serviceops, Operations
Joe closed T212828: SRE FY2019 Q3 goal: Ramp-up serving traffic to PHP 7 as Resolved.
Wed, Nov 20, 6:54 AM · User-Joe, serviceops, Operations
Joe added a comment to T238593: Phabricator downtime due to aphlict and websockets (aphlict current disabled).

I don't think the solution is removing aphlict, but instead proxying to it directly from envoy or ATS, our choice. @ema @Dzahn your choice :)

Wed, Nov 20, 6:51 AM · Operations, Traffic, serviceops, Phabricator

Yesterday

Joe added a comment to T238593: Phabricator downtime due to aphlict and websockets (aphlict current disabled).

The problem is most apache workers ended up being stuck talking to aphlict via proxy_wstunnel which has stability issues (we tried to use it, and it created issues there too, apparently).

Tue, Nov 19, 12:47 PM · Operations, Traffic, serviceops, Phabricator

Thu, Nov 14

Joe updated subscribers of T231011: (euwiki) Mysterious, coordinated slowdowns every ~ 25 minutes on API servers.

@Theklan might be interested in watching this task too.

Thu, Nov 14, 9:58 PM · PHP 7.2 support, serviceops, Operations

Wed, Nov 13

Joe created T238257: Unconference: how to restructure the current rest api.
Wed, Nov 13, 8:15 PM · Wikimedia-Technical-Conference-2019

Mon, Nov 11

Joe added a comment to T238018: mw1239 - Memory correctable errors -EDAC-.

@ayounsi this server is being decommissioned in a few weeks, I don't think it should be fixed at all, we can just acknowledge the alert.

Mon, Nov 11, 8:12 PM · Operations, serviceops

Fri, Nov 8

Joe claimed T237235: Build and upload envoy 1.12.0 package..
Fri, Nov 8, 9:13 AM · Packaging, Operations, serviceops
Joe added a comment to T236833: wt2html: Out of memory crashers.

I think it's practical, in the current moment and in general, to be able to vary the memory limits by payload and/or cluster.

Fri, Nov 8, 8:08 AM · Patch-For-Review, serviceops, Operations, Parsoid-PHP

Thu, Nov 7

Joe added a comment to T237259: Document all uses of the puppetCA certificate.

The calico/node service, and the kube-controller-manager service will need to be restarted on the kubernetes workers and masters respectively.

Thu, Nov 7, 10:16 AM · Patch-For-Review, DBA, User-jbond, Puppet, Operations
Joe added a comment to T237259: Document all uses of the puppetCA certificate.

@Eevans Moritz mentioned there may be some Cassandra considerations to take into account, and that you could enlighten me as to what they are :)

Thu, Nov 7, 9:49 AM · Patch-For-Review, DBA, User-jbond, Puppet, Operations
Joe added a comment to T237362: Rolling restart of etcd to pick up the renewed CA public certificate..
  1. We will need to restart etcd in eqiad as the CA is used in etcd::v3 for peer-to-peer communications
  2. We will not need to restart etcd in codfw as it's currently on etcd v2 and thus is not using certs for server-to-server communications.
Thu, Nov 7, 9:47 AM · Patch-For-Review, serviceops, User-jbond, Puppet, Operations
Joe closed T236275: Parsoid-php doesn't get updated after a code deploy as Resolved.

It is expected you see that failure as one of the two servers you're deploying to in beta, deployment-parsoid09.deployment-prep.eqiad.wmflabs, doesn't have MediaWiki or php7 installed. The other server, deployment-mediawiki-parsoid10.deployment-prep.eqiad.wmflabs, should restart php just fine.

Thu, Nov 7, 7:50 AM · PHP 7.2 support, serviceops, Parsoid, Parsing-Team
Joe added a comment to T236277: Extend Puppet CA Expiry date .

One suggestion: shouldn't we keep the old CA cert around while transitioning?

Thu, Nov 7, 7:33 AM · DBA, Patch-For-Review, User-jbond, Puppet, Operations

Wed, Nov 6

Joe added a comment to T236426: Configure Google Cloud Vision credentials in production.

Hi, the credentials file should be stored, as any secret that needs to be accessed by MediaWiki, within its private repository.

Wed, Nov 6, 5:39 PM · MW-1.35-notes (1.35.0-wmf.8; 2019-11-26), serviceops, Operations, Product-Infrastructure-Team-Backlog (Kanban), Machine vision
Joe added a comment to T228346: PHP 7.2 garbage collector segfault.

JFTR, the mwdebug* servers are running 7.2.24 and can be used for additional tests.

Wed, Nov 6, 10:26 AM · Release-Engineering-Team-TODO, MW-1.35-release, Upstream, MediaWiki-General, PHP 7.2 support

Tue, Nov 5

Joe updated the task description for T237362: Rolling restart of etcd to pick up the renewed CA public certificate..
Tue, Nov 5, 9:52 AM · Patch-For-Review, serviceops, User-jbond, Puppet, Operations
Joe created T237362: Rolling restart of etcd to pick up the renewed CA public certificate..
Tue, Nov 5, 9:50 AM · Patch-For-Review, serviceops, User-jbond, Puppet, Operations
Joe added a comment to T237259: Document all uses of the puppetCA certificate.

As far as etcd is concerned, a rolling restart should be enough to ensure the new CA is picked up. I will take care of that.

Tue, Nov 5, 9:49 AM · Patch-For-Review, DBA, User-jbond, Puppet, Operations
Joe closed T237304: EasyTimeline extension shell error, a subtask of T233654: Make the parsoid cluster support parsoid/PHP, as Resolved.
Tue, Nov 5, 9:45 AM · Operations, serviceops
Joe closed T237304: EasyTimeline extension shell error as Resolved.

Sorry for the inconvenience. This wasn't spotted earlier as we didn't actively remove the packages from the canaries in production.

Tue, Nov 5, 9:45 AM · serviceops, Parsoid-PHP
Joe added a comment to T236277: Extend Puppet CA Expiry date .

In terms of identifying services that use keys issued by the puppet CA -- is it wrong to think that the following would be a complete list?

  • keys created using cergen
  • users of base::expose_puppet_certs
  • the few users we have that are referencing either puppet_ssldir() or manually hardcoding the /var/lib/puppet/ssl directory

I'd think it should be relatively straightforward to find all such users just by querying puppetdb.

Tue, Nov 5, 6:46 AM · DBA, Patch-For-Review, User-jbond, Puppet, Operations
Joe added a comment to T236437: rack/setup/install mw13[49-84].eqiad.wmnet.

Sorry I'm lagging behind on this. Point is, I shouldn't be the only person to answer this question but most of my team is out this week. I'll try to reassign this to someone with time to respond.

Tue, Nov 5, 6:21 AM · Operations, ops-eqiad

Mon, Nov 4

Joe added a comment to T227542: b7-eqiad pdu refresh (Tuesday 11/5 @12pm UTC).

I don't want to conflict-edit the task description, but as far as the MW* and WTP* servers are concerned, no action is needed.

Mon, Nov 4, 5:27 PM · DC-Ops, Operations, ops-eqiad
Joe created T237235: Build and upload envoy 1.12.0 package..
Mon, Nov 4, 10:29 AM · Packaging, Operations, serviceops
Joe created T237234: Collect metrics from envoy where it is enabled on k8s.
Mon, Nov 4, 10:24 AM · Patch-For-Review, Kubernetes, serviceops, Operations
Joe updated the task description for T235411: Add TLS termination to services running on kubernetes.
Mon, Nov 4, 10:13 AM · Kubernetes, serviceops, Operations
Joe closed T236899: Allow testing of feature-flag-protected features in deployment-charts CI as Resolved.

The CI is far from perfect, but it catches the most mundane issues at least now.

Mon, Nov 4, 9:31 AM · Operations, Release-Engineering-Team, Prod-Kubernetes, serviceops, Release Pipeline
Joe closed T237198: Kubernetes workers frequent oom-killer in action as Invalid.

So:

  • kubernetes{1,2}00{5,6} are specialized nodes that only run kask for sessions, that's why you don't see ooms there.
  • The OOM killer doesn't only get called when the memory of the whole system exceeds its limits, but also when the processes in a cgroup try to allocate more memory than what is allowed to that cgroup.
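
A minimal sketch of how one might check whether a cgroup, rather than the whole host, is running into its memory limit. It assumes cgroup v1 file names (`memory.usage_in_bytes`, `memory.limit_in_bytes`, `memory.failcnt`); the function names and the 90% threshold are illustrative, and the directory to inspect depends on the container runtime.

```python
from pathlib import Path


def read_cgroup_memory(cgroup_dir):
    """Read usage, limit and failcnt from a cgroup v1 memory controller directory."""
    base = Path(cgroup_dir)

    def read_int(name):
        return int((base / name).read_text().strip())

    return {
        "usage": read_int("memory.usage_in_bytes"),
        "limit": read_int("memory.limit_in_bytes"),
        # failcnt counts how many times usage hit the limit
        "failcnt": read_int("memory.failcnt"),
    }


def near_limit(stats, threshold=0.9):
    """True when the cgroup uses at least `threshold` of its own limit."""
    return stats["usage"] >= threshold * stats["limit"]
```

A cgroup can be close to its limit, with a growing `failcnt`, while the node as a whole still has plenty of free memory.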
Mon, Nov 4, 9:04 AM · Operations, serviceops

Thu, Oct 31

Joe committed rDEPLOYCHARTS244d1389399f: blubberoid/scaffold: ports is an array (authored by Joe).
blubberoid/scaffold: ports is an array
Thu, Oct 31, 10:59 AM
Joe created P9505 Helmfile wut.
Thu, Oct 31, 10:23 AM · Kubernetes
Joe committed rDEPLOYCHARTSfd52310e667a: tls: env variables need to be strings in yaml (authored by Joe).
tls: env variables need to be strings in yaml
Thu, Oct 31, 10:20 AM
Joe committed rDEPLOYCHARTS312a5605d0d1: blubberoid: new chart version fixing TLS (authored by Joe).
blubberoid: new chart version fixing TLS
Thu, Oct 31, 10:03 AM
Joe committed rDEPLOYCHARTS2d1bd1909623: Rake: Add yaml validation (authored by Joe).
Rake: Add yaml validation
Thu, Oct 31, 9:53 AM

Wed, Oct 30

Joe added a comment to T236797: How should the MachineVision extension interact with external APIs from production?.

So after some quick grepping, we already define a proxy in mediawiki-config, and it can be retrieved at $wmfLocalServices['urldownloader'], so:

Wed, Oct 30, 6:11 PM · Patch-For-Review, Operations, serviceops, Product-Infrastructure-Team-Backlog, Machine vision
Joe added a comment to T236797: How should the MachineVision extension interact with external APIs from production?.

Basically what you need is:

Wed, Oct 30, 5:41 PM · Patch-For-Review, Operations, serviceops, Product-Infrastructure-Team-Backlog, Machine vision
Joe added a comment to T236797: How should the MachineVision extension interact with external APIs from production?.

The HTTP requests for labels happen asynchronously in a deferred update on upload complete, or when a maintenance script is run. They're fetched from the DB when the user requests the related special page.
What I specifically need to know here is how to get the proxy info from the environment, and if there are any other barriers unique to production for making external HTTP requests. I see that there is an http_proxy setting in hiera, but as far as I can tell from operations-wmf-config that isn't exposed to the MW appservers.

Wed, Oct 30, 5:27 PM · Patch-For-Review, Operations, serviceops, Product-Infrastructure-Team-Backlog, Machine vision
Joe added a comment to T236797: How should the MachineVision extension interact with external APIs from production?.

As far as UX is concerned...
The HTTP request should not happen on page load, it should be deferred and either run in the background (scheduled job) or on the client (JavaScript). If it's run on the client, an API module would need to be created to function as a proxy to the external service.

Wed, Oct 30, 4:56 PM · Patch-For-Review, Operations, serviceops, Product-Infrastructure-Team-Backlog, Machine vision
Joe added a comment to T236797: How should the MachineVision extension interact with external APIs from production?.

Hi, I assumed the fetching of such data would happen via an async job indeed, upon image upload. Anything synchronous is effectively discouraged, even in post-send as even with relatively aggressive timeouts it's easy to clog up a lot of php workers waiting for an external provider which is lagging.
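
As a back-of-the-envelope illustration of why this is risky (the numbers are made up, not production figures): by Little's law, the average number of busy workers equals the request rate multiplied by the time each request holds a worker.

```python
def busy_workers(requests_per_second, seconds_held_per_request):
    """Average number of occupied workers, per Little's law (L = lambda * W)."""
    return requests_per_second * seconds_held_per_request


# Hypothetical load: 40 external calls per second.
healthy = busy_workers(40, 0.2)  # provider answers in 200 ms
lagging = busy_workers(40, 5.0)  # every call runs into a 5 s timeout
print(healthy, lagging)
```

With the same request rate, a lagging provider multiplies the number of occupied workers by the ratio of the timeout to the normal response time.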

Wed, Oct 30, 4:52 PM · Patch-For-Review, Operations, serviceops, Product-Infrastructure-Team-Backlog, Machine vision
Joe committed rDEPLOYCHARTSfdf238516a69: Rakefile: add the ability to run fixtures with special values (authored by Joe).
Rakefile: add the ability to run fixtures with special values
Wed, Oct 30, 3:35 PM
Joe claimed T236899: Allow testing of feature-flag-protected features in deployment-charts CI.
Wed, Oct 30, 12:21 PM · Operations, Release-Engineering-Team, Prod-Kubernetes, serviceops, Release Pipeline
Joe created T236899: Allow testing of feature-flag-protected features in deployment-charts CI.
Wed, Oct 30, 12:21 PM · Operations, Release-Engineering-Team, Prod-Kubernetes, serviceops, Release Pipeline
Joe committed rDEPLOYCHARTSc3de5a88a94e: blubberoid/staging: brown paper bag fix (authored by Joe).
blubberoid/staging: brown paper bag fix
Wed, Oct 30, 11:26 AM
Joe committed rDEPLOYCHARTS85605163d9aa: blubberoid: add TLS termination in staging (authored by Joe).
blubberoid: add TLS termination in staging
Wed, Oct 30, 11:22 AM
Joe committed rDEPLOYCHARTS4c5902503c29: scaffold: only expose one port as a service by default (authored by Joe).
scaffold: only expose one port as a service by default
Wed, Oct 30, 11:19 AM
Joe updated subscribers of T234283: Messages in Logstash from php-fatal-error.php are missing from type:mediawiki/channel:fatal.

In the case of HHVM, fatals were correctly handled by the daemon and used whatever callback MediaWiki has to manage errors, so hhvm-fatal-error.php has nothing regarding logging, except for sending a counter to statsd:
https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/b999b4cba2ec27415d329a62277cdd64f24b440b/modules/mediawiki/templates/hhvm-fatal-error.php.erb

Wed, Oct 30, 8:19 AM · Performance-Team (Radar), Operations, serviceops, observability
Joe added a comment to T235188: Preemptive refresh in getMultiWithSetCallback() and getMultiWithUnionSetCallback() pollutes cache.

@WDoranWMF what I found out is that expunging values by prefix from memcached is impossible to do in a clean way without severely impacting performance.

Wed, Oct 30, 6:38 AM · serviceops, MW-1.35-notes (1.35.0-wmf.2; 2019-10-15), Core Platform Team Workboards (Clinic Duty Team), User-ArielGlenn, Language-Team (Language-2019-October-December), Patch-For-Review, MediaWiki-General, affects-translatewiki.net
Joe added a comment to T236800: Ensure apcu incr/decr are atomic.

We're on 5.1.17 so backporting the patch should be pretty simple.

Wed, Oct 30, 6:04 AM · Performance-Team (Radar), MediaWiki-Cache, Core Platform Team, serviceops
Joe added a comment to T236800: Ensure apcu incr/decr are atomic.

Instead of upgrading to a later version, we can backport that patch to the version we're using instead.

Wed, Oct 30, 6:02 AM · Performance-Team (Radar), MediaWiki-Cache, Core Platform Team, serviceops

Tue, Oct 29

Joe added projects to T236833: wt2html: Out of memory crashers: Operations, serviceops.

We can raise the memory limit for parsoid-php a bit.

Tue, Oct 29, 7:13 PM · Patch-For-Review, serviceops, Operations, Parsoid-PHP
Joe committed rDEPLOYCHARTS50a34af8dc60: Blubberoid: fix whitespace management (authored by Joe).
Blubberoid: fix whitespace management
Tue, Oct 29, 11:47 AM
Pablo-WMDE awarded T236709: Error when executing helmfile commands for the termbox service a Cup of Joe token.
Tue, Oct 29, 10:24 AM · Wikidata-Termbox, Wikidata, serviceops
Joe closed T236709: Error when executing helmfile commands for the termbox service as Resolved.

I have just tested and I can easily run helmfile diff on termbox now, in all environments. Resolving for now

Tue, Oct 29, 10:24 AM · Wikidata-Termbox, Wikidata, serviceops
Joe added a comment to T236709: Error when executing helmfile commands for the termbox service.

@Tarrow @Pablo-WMDE can someone try the release to staging? I should have fixed the rbac roles there. It should've fixed your issues.

Tue, Oct 29, 10:19 AM · Wikidata-Termbox, Wikidata, serviceops
Joe committed rDEPLOYCHARTS49c0a550113f: Revert "Remove the portforward right from deploy role" (authored by akosiaris).
Revert "Remove the portforward right from deploy role"
Tue, Oct 29, 10:13 AM
Joe added a reverting change for rDEPLOYCHARTSdf7431e40c70: Remove the portforward right from deploy role: rDEPLOYCHARTS49c0a550113f: Revert "Remove the portforward right from deploy role".
Tue, Oct 29, 10:13 AM
Joe added a comment to T236709: Error when executing helmfile commands for the termbox service.

@Tarrow if it's an urgent bugfix we can just revert the change to let you deploy immediately. Please let's coordinate on IRC, and sorry for the inconvenience :)

Tue, Oct 29, 10:09 AM · Wikidata-Termbox, Wikidata, serviceops

Mon, Oct 28

Joe added a comment to T235188: Preemptive refresh in getMultiWithSetCallback() and getMultiWithUnionSetCallback() pollutes cache.

To show better the cache issue:

{ echo 'stats items'; sleep 1; } | telnet mc1019 11211 | grep -F items:17:
STAT items:17:number 8433040
STAT items:17:age 88690
STAT items:17:evicted 446703589
STAT items:17:evicted_nonzero 446703231
STAT items:17:evicted_time 88690
STAT items:17:outofmemory 0
STAT items:17:tailrepairs 0
STAT items:17:reclaimed 395768512
STAT items:17:expired_unfetched 221048687
STAT items:17:evicted_unfetched 119428161
STAT items:17:crawler_reclaimed 0
STAT items:17:lrutail_reflocked 0
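
A small sketch for reading these counters programmatically. It parses `stats items` output for one slab class and computes the share of evictions that hit items which were never fetched. The function name and the shortened `SAMPLE` (copied from the output above) are illustrative only.

```python
def parse_item_stats(text, slab):
    """Parse 'STAT items:<slab>:<name> <value>' lines into a dict of ints."""
    prefix = f"STAT items:{slab}:"
    stats = {}
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(prefix):
            name, value = line[len(prefix):].split()
            stats[name] = int(value)
    return stats


SAMPLE = """\
STAT items:17:number 8433040
STAT items:17:evicted 446703589
STAT items:17:reclaimed 395768512
STAT items:17:expired_unfetched 221048687
STAT items:17:evicted_unfetched 119428161
"""

stats = parse_item_stats(SAMPLE, 17)
# Share of evicted items that were never read before being pushed out.
evicted_unfetched_share = stats["evicted_unfetched"] / stats["evicted"]
print(f"{evicted_unfetched_share:.1%} of evicted items were never fetched")
```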
Mon, Oct 28, 5:07 PM · serviceops, MW-1.35-notes (1.35.0-wmf.2; 2019-10-15), Core Platform Team Workboards (Clinic Duty Team), User-ArielGlenn, Language-Team (Language-2019-October-December), Patch-For-Review, MediaWiki-General, affects-translatewiki.net
Joe claimed T236709: Error when executing helmfile commands for the termbox service.

@Jakob_WMDE this is a result of our temporary fix for a CVE affecting kubernetes. We will try to revert the situation tomorrow. Thanks for your patience.

Mon, Oct 28, 4:59 PM · Wikidata-Termbox, Wikidata, serviceops
Joe added a comment to T235188: Preemptive refresh in getMultiWithSetCallback() and getMultiWithUnionSetCallback() pollutes cache.

After some digging, it appears impossible to reliably get all keys from a memcached server as loaded as our production ones.

Mon, Oct 28, 4:56 PM · serviceops, MW-1.35-notes (1.35.0-wmf.2; 2019-10-15), Core Platform Team Workboards (Clinic Duty Team), User-ArielGlenn, Language-Team (Language-2019-October-December), Patch-For-Review, MediaWiki-General, affects-translatewiki.net
Joe added a comment to T235188: Preemptive refresh in getMultiWithSetCallback() and getMultiWithUnionSetCallback() pollutes cache.

@Joe Apologies for following newb questions - is access to those memcached instances through SRE? And if so what do you want us to prepare for you to make that as straightforward as possible for the SRE side?
Would it be most helpful for us to:

  1. Set the clear criteria for the keys to be removed
  2. Define how to dump and filter the keys
  3. Provide the scripts to do the above

Is there an instance we can use to test the above against?

Mon, Oct 28, 2:55 PM · serviceops, MW-1.35-notes (1.35.0-wmf.2; 2019-10-15), Core Platform Team Workboards (Clinic Duty Team), User-ArielGlenn, Language-Team (Language-2019-October-December), Patch-For-Review, MediaWiki-General, affects-translatewiki.net
Joe committed rDEPLOYCHARTS19c462b6dd83: scaffold: Add option for TLS termination (authored by Joe).
scaffold: Add option for TLS termination
Mon, Oct 28, 11:58 AM

Fri, Oct 25

Joe committed rDEPLOYCHARTS03e1491f5808: blubberoid: Add TLS termination (authored by Joe).
blubberoid: Add TLS termination
Fri, Oct 25, 7:52 AM

Thu, Oct 24

Joe added a comment to T235188: Preemptive refresh in getMultiWithSetCallback() and getMultiWithUnionSetCallback() pollutes cache.
Thu, Oct 24, 10:58 AM · serviceops, MW-1.35-notes (1.35.0-wmf.2; 2019-10-15), Core Platform Team Workboards (Clinic Duty Team), User-ArielGlenn, Language-Team (Language-2019-October-December), Patch-For-Review, MediaWiki-General, affects-translatewiki.net
Joe added a comment to T236275: Parsoid-php doesn't get updated after a code deploy.

Thanks @Joe. What are the perf. implications of solution (2)? Is there a reason why you don't prefer (1) as the temporary solution?

Thu, Oct 24, 6:28 AM · PHP 7.2 support, serviceops, Parsoid, Parsing-Team

Wed, Oct 23

Joe added a comment to T152478: Upgrade Doxygen (to enable INHERIT_DOCS for methods from parent classes).

@Krinkle given it's a CI container, why not install doxygen with pip (properly frozen) instead of importing a package?

Wed, Oct 23, 7:00 PM · serviceops, Release-Engineering-Team, Upstream, MediaWiki-Documentation
Joe created T236275: Parsoid-php doesn't get updated after a code deploy.
Wed, Oct 23, 1:33 PM · PHP 7.2 support, serviceops, Parsoid, Parsing-Team
Joe added a comment to T215465: Development policy: Require use of common storage abstractions.

I think what would be desirable, from the point of view of a developer, is to know which specific abstractions they can use, and which ones they should use.

Wed, Oct 23, 10:07 AM · TechCom-RFC, Performance-Team (Radar), User-mobrovac, Services (watching), Core Platform Team Legacy (Watching / External), TechCom

Tue, Oct 22

Joe added a comment to T224589: Migrate dbmonitor hosts to Stretch/Buster.

2 blockers:

  • Exec of /usr/sbin/a2enmod php7.0 fails, as the right module would be php7.3 - No support for buster on the http module? Httpd/Httpd::Mod_conf[php7.0]/Exec[ensure_present_mod_php7.0]
Tue, Oct 22, 1:38 PM · Operations
Joe added a comment to T236125: Trigger envoy reload upon TLS certificate update.

Given we have the hot restarter now, that's probably a good idea.

Tue, Oct 22, 10:15 AM · Traffic, Operations

Mon, Oct 21

Joe created P9415 (An Untitled Masterwork).
Mon, Oct 21, 3:25 PM

Oct 21 2019

Joe updated the task description for T234646: Wikimedia Technical Conference 2019 Session: Self-service Stateless Microservices (for APIs).
Oct 21 2019, 9:24 AM · International-Developer-Events, Wikimedia-Technical-Conference-2019
Joe created T236017: Move blubberoid to use TLS only..
Oct 21 2019, 8:15 AM · Release Pipeline (Blubber), Kubernetes, serviceops, Operations
Joe created T236008: New Deployment charts should allow exposing services via TLS.
Oct 21 2019, 6:03 AM · Kubernetes, serviceops, Operations
Joe claimed T235411: Add TLS termination to services running on kubernetes.
Oct 21 2019, 5:50 AM · Kubernetes, serviceops, Operations
Joe changed the status of T230951: Transfer ownership of mediawiki-security mailman list to Security Team from Open to Stalled.

changing to stalled, and vacating assignment.

Oct 21 2019, 5:49 AM · Security-Team, Wikimedia-Mailing-lists, Operations
Joe closed T233973: remove service objects from etcd and update documentation as Resolved.
Oct 21 2019, 5:48 AM · Operations, conftool
Joe closed T235675: Upload 3.11.4 packages to APT repo, a subtask of T200803: Upgrade Cassandra 3.11.2 clusters to 3.11.4 (bugfix release), as Resolved.
Oct 21 2019, 5:35 AM · Core Platform Team Workboards (Green), Patch-For-Review, User-Eevans, Cassandra
Joe closed T235675: Upload 3.11.4 packages to APT repo as Resolved.
Oct 21 2019, 5:35 AM · Patch-For-Review, Operations, Core Platform Team Legacy (Later), User-Eevans, Cassandra

Oct 18 2019

Joe closed T234464: Echostore service endpoints as Resolved.
~$ curl https://echostore.discovery.wmnet:8082/healthz
{
  "version": "v1.0.5",
  "build_date": "2019-10-03T19:30:15+00:00",
  "build_host": "87a0b5ccf9a6",
  "go_version": "go1.11.5"
}
Oct 18 2019, 10:09 AM · serviceops, Operations, Core Platform Team Workboards (User Stories), Notifications, Growth-Team, CPT Initiatives (Multi-DC Echo Notification Storage)
Joe closed T234464: Echostore service endpoints, a subtask of T234402: Wikimedia infrastructure is configured for multi-DC echo notification storage, as Resolved.
Oct 18 2019, 10:09 AM · Core Platform Team Workboards (User Stories), Story, Notifications, Growth-Team, CPT Initiatives (Multi-DC Echo Notification Storage)

Oct 17 2019

Joe added a comment to T235437: RESTBase/RESTRouter/service-runner rate limiting plans.

One detail I want to understand about the Redis hypothesis:

Oct 17 2019, 2:46 PM · service-runner, User-mobrovac, Core Platform Team Workboards (Clinic Duty Team), Services (doing), CPT Initiatives (RESTBase Split (CDP2)), serviceops, Kubernetes, Service-deployment-requests, Operations
Joe reassigned T234646: Wikimedia Technical Conference 2019 Session: Self-service Stateless Microservices (for APIs) from Joe to WMDE-leszek.
Oct 17 2019, 1:43 PM · International-Developer-Events, Wikimedia-Technical-Conference-2019
Joe added a comment to T234376: Provision Kask for Echo timestamp storage in k8s.

Heh yes sorry, I forgot to tell you yesterday - you need to use helmfile destroy in newer versions of helmfile.

Oct 17 2019, 5:32 AM · Patch-For-Review, Core Platform Team Workboards (Green), serviceops, Operations, Notifications, Growth-Team, CPT Initiatives (Multi-DC Echo Notification Storage)

Oct 16 2019

Joe closed T147204: Update confd package as Resolved.

All stretch+ servers in production have been updated to the newer version. Jessie hosts should go away soon.

Oct 16 2019, 11:25 AM · Patch-For-Review, serviceops, User-Joe, Beta-Cluster-reproducible, Operations
Joe closed T235412: Upgrade the envoyproxy package to its latest version., a subtask of T235411: Add TLS termination to services running on kubernetes, as Resolved.
Oct 16 2019, 9:26 AM · Kubernetes, serviceops, Operations
Joe closed T235412: Upgrade the envoyproxy package to its latest version. as Resolved.

All servers in production are upgraded.

Oct 16 2019, 9:26 AM · Kubernetes, serviceops, Operations

Oct 15 2019

Joe added a comment to T235216: Reconsider memcached connection method for MW in PHP7 world.

As far as I remember, the php memcached extension didn't allow using a unix socket to connect to memcached. According to https://www.php.net/manual/en/memcached.addserver.php this is possible since version 2.0 of the memcached extension. We're using 3.0.1 in production now, and we might think of moving mcrouter to listen on a unix socket locally as well, but it would need some testing.

Oct 15 2019, 9:09 PM · Performance-Team (Radar), serviceops
Joe merged Restricted Task into T235412: Upgrade the envoyproxy package to its latest version..
Oct 15 2019, 4:00 PM · Kubernetes, serviceops, Operations
Joe placed T233236: Move labtestwikitech database to clouddb2001-dev up for grabs.

The system for assigning a particular wiki to a particular db host in mediawiki-config has changed a lot since I last touched this code. @Joe, if you could write me a sample patch of how to break out labtestwiki into its own group and direct it to a different db server, I should be able to take it from there.
Moving the database itself (or building a fresh one) is straightforward and something I can do myself.
Thanks!

Oct 15 2019, 3:37 PM · cloud-services-team (Kanban)
Joe added a comment to T234464: Echostore service endpoints.

I've done all the puppet/dns prep work. You can now proceed to prepare this new kask deployment in operations/deployment-charts.

Oct 15 2019, 3:18 PM · serviceops, Operations, Core Platform Team Workboards (User Stories), Notifications, Growth-Team, CPT Initiatives (Multi-DC Echo Notification Storage)
Joe added a comment to T235488: Jobrunners: allow to check that they are in sync with the etcd data.

I think the best way is probably writing a small endpoint in operations/mediawiki-config that just exposes that.

Oct 15 2019, 10:34 AM · Operations, serviceops