
Joe (Giuseppe Lavagetto)
Spy

Projects (22)

User Details

User Since
Oct 3 2014, 5:57 AM (241 w, 4 d)
Availability
Available
LDAP User
Giuseppe Lavagetto
MediaWiki User
GLavagetto (WMF) [ Global Accounts ]

Recent Activity

Today

Joe edited projects for T223952: Increased instability in MediaWiki backends (according to load balancers), added: DBA; removed Patch-For-Review, Traffic.

Having confirmed pybal is not an issue here, while databases might have a part in the issue, I'll change the tags accordingly.

Tue, May 21, 9:01 AM · Performance-Team, DBA, PHP 7.2 support, HHVM, serviceops, Operations
Joe added a comment to T223952: Increased instability in MediaWiki backends (according to load balancers).

While we have to wait and see whether the absence of php7 traffic improves the situation (and if so, why), I've noticed a few facts about the latest cases of such instabilities:

  • The fetch times reported by pybal correspond to what the appservers report
  • Looking at logs, the effect seems to be much more prominent in the last two days (starting on May 20th): on a couple of random application servers, the number of timeouts (i.e. requests taking more than 5 seconds to complete) more than doubled compared to preceding periods
  • Again looking at apache logs, the slow requests around the time of pybal's detections show the problem is concentrated on enwiki. All data from a single appserver:
    • enwiki accounts for 35% of all requests in the 10 seconds around pybal's detection; of those, 22% took 5 seconds or longer to complete
    • In the same time period, only 0.3% of all requests not going to enwiki took longer than 5 seconds
  • Verifying across the fleet with some cumin sorcery, it seems almost all requests taking more than 5 seconds at the time were on enwiki
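
The per-wiki breakdown above can be reproduced with a short script. What follows is only a minimal sketch, assuming each log line has already been parsed into a (wiki, duration in seconds) pair; the record format, the sample data and the function name slow_share_by_wiki are hypothetical and are not the tooling actually used for the analysis.

```python
from collections import defaultdict

SLOW_THRESHOLD_S = 5.0  # requests taking this long or longer count as slow


def slow_share_by_wiki(records, threshold=SLOW_THRESHOLD_S):
    """Return {wiki: (total_requests, slow_requests, slow_fraction)}."""
    totals = defaultdict(int)
    slow = defaultdict(int)
    for wiki, duration in records:
        totals[wiki] += 1
        if duration >= threshold:
            slow[wiki] += 1
    return {w: (totals[w], slow[w], slow[w] / totals[w]) for w in totals}


# Hypothetical sample, not real log data.
sample = [
    ("enwiki", 0.3), ("enwiki", 6.1), ("enwiki", 5.0), ("enwiki", 0.2),
    ("dewiki", 0.4), ("frwiki", 0.1), ("dewiki", 7.5), ("frwiki", 0.2),
]
stats = slow_share_by_wiki(sample)
for wiki, (total, slow_count, fraction) in sorted(stats.items()):
    print(f"{wiki}: {slow_count}/{total} slow ({fraction:.0%})")
```

Restricting the input to the 10 seconds around a pybal detection, as in the analysis above, would be a filter on the records before they reach this function.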
Tue, May 21, 9:00 AM · Performance-Team, DBA, PHP 7.2 support, HHVM, serviceops, Operations
Joe added a comment to T223647: Investigate increase in GET ops registered by mcrouter for the mediawiki appserver cluster.

Switching off php7 confirmed it's the cause of the increased number of GETs.

Tue, May 21, 7:20 AM · User-jijiki, serviceops, Operations
Joe added a comment to T223310: Investigate increase in tx bandwidth usage for mc1033.

So, after turning off php7 this morning we saw no modification in the rate of requests to mc1033.

Tue, May 21, 7:17 AM · Patch-For-Review, Growth-Team, Wikidata, Wikidata-Campsite, Performance-Team, MediaWiki-Cache, User-jijiki, serviceops, Operations
Joe added a comment to T223310: Investigate increase in tx bandwidth usage for mc1033.

@kostajh for now I'm switching off php7 for other investigations, so we will know immediately if the additional traffic is due to that or not.

Tue, May 21, 6:52 AM · Patch-For-Review, Growth-Team, Wikidata, Wikidata-Campsite, Performance-Team, MediaWiki-Cache, User-jijiki, serviceops, Operations
Joe added a comment to T223952: Increased instability in MediaWiki backends (according to load balancers).

While there is no evidence that the increase in traffic sent to php7 is the cause of this increase in errors, there are a couple tasks that we'd need to investigate better and that could point at some form of resource starvation due to the coexistence of php7 and HHVM. See T223310 and T223647. So my first order of business will be to turn the php7 sampling back to zero for a couple days, to see if that changes the status quo for pybal's checks.

Tue, May 21, 6:47 AM · Performance-Team, DBA, PHP 7.2 support, HHVM, serviceops, Operations
Joe claimed T223952: Increased instability in MediaWiki backends (according to load balancers).
Tue, May 21, 6:45 AM · Performance-Team, DBA, PHP 7.2 support, HHVM, serviceops, Operations
Joe created T223952: Increased instability in MediaWiki backends (according to load balancers).
Tue, May 21, 6:45 AM · Performance-Team, DBA, PHP 7.2 support, HHVM, serviceops, Operations
Joe added a comment to T223310: Investigate increase in tx bandwidth usage for mc1033.

I think I know what happened here - and it's possibly related to T223180.

Tue, May 21, 5:56 AM · Patch-For-Review, Growth-Team, Wikidata, Wikidata-Campsite, Performance-Team, MediaWiki-Cache, User-jijiki, serviceops, Operations
Joe added a comment to T223180: Monitoring PHP 7 APC usage.

Hi @Krinkle the metrics for php7 exist already, they're exported to prometheus as follows:

Tue, May 21, 5:39 AM · Performance-Team (Radar), PHP 7.2 support, Operations

Wed, May 15

Joe closed T219128: Remove php7 beta feature as Resolved.
Wed, May 15, 4:15 PM · MW-1.34-notes (1.34.0-wmf.4; 2019-05-07), User-jijiki, Patch-For-Review, Beta-Feature, Operations, serviceops
Joe closed T219128: Remove php7 beta feature, a subtask of T219150: Ramp up percentage of users on php7.2 to 100% on both API and appserver clusters, as Resolved.
Wed, May 15, 4:15 PM · Patch-For-Review, User-jijiki, Operations, serviceops
Joe added a comment to T223336: [Regression] fatal-errors.php action=segfault results in a 503 error under php7-fpm..

I changed the title of the task to reflect my findings, and changed the associated tags accordingly.

Wed, May 15, 11:57 AM · Performance-Team (Radar), serviceops, User-jijiki, observability, Operations, PHP 7.2 support
Joe renamed T223336: [Regression] fatal-errors.php action=segfault results in a 503 error under php7-fpm. from [Regression] Varnish is replacing the detailed HTTP 500 page from PHP 7 with "503 Service Temporarily Unavailable" to [Regression] fatal-errors.php action=segfault results in a 503 error under php7-fpm..
Wed, May 15, 11:55 AM · Performance-Team (Radar), serviceops, User-jijiki, observability, Operations, PHP 7.2 support
Joe added a project to T223336: [Regression] fatal-errors.php action=segfault results in a 503 error under php7-fpm.: serviceops.
Wed, May 15, 11:54 AM · Performance-Team (Radar), serviceops, User-jijiki, observability, Operations, PHP 7.2 support
Joe removed a project from T223336: [Regression] fatal-errors.php action=segfault results in a 503 error under php7-fpm.: Traffic.
Wed, May 15, 11:54 AM · Performance-Team (Radar), serviceops, User-jijiki, observability, Operations, PHP 7.2 support
Joe added a comment to T223336: [Regression] fatal-errors.php action=segfault results in a 503 error under php7-fpm..

Hi, I've tested a few combinations of errors, and the only case where this happens is when you choose action=segfault.

Wed, May 15, 11:53 AM · Performance-Team (Radar), serviceops, User-jijiki, observability, Operations, PHP 7.2 support
Joe added a comment to T219128: Remove php7 beta feature.

HHVM feels very slow after using PHP7, so it doesn't make any sense to use HHVM. I'm very sad because I cannot choose PHP7 anymore and I'm forced to use HHVM.

Wed, May 15, 9:33 AM · MW-1.34-notes (1.34.0-wmf.4; 2019-05-07), User-jijiki, Patch-For-Review, Beta-Feature, Operations, serviceops
Joe added a comment to T219128: Remove php7 beta feature.

What was the point of removing the PHP7 beta feature? Now all logged-in users are forced to use HHVM. There is a huge difference between PHP7 and HHVM.

Please enable it again. This beta feature should be removed only after PHP7 is running at 100% by default.

Wed, May 15, 9:11 AM · MW-1.34-notes (1.34.0-wmf.4; 2019-05-07), User-jijiki, Patch-For-Review, Beta-Feature, Operations, serviceops
Joe added a comment to T219128: Remove php7 beta feature.

A net effect of this patch is that all logged-in users are now back on HHVM. I think we need to backport the patch above to the running versions of MediaWiki so that we also sample logged-in users (who constitute a sizeable part of the backend requests anyway).

Wed, May 15, 7:49 AM · MW-1.34-notes (1.34.0-wmf.4; 2019-05-07), User-jijiki, Patch-For-Review, Beta-Feature, Operations, serviceops
Joe added a comment to T223345: Zotero container: Production is running candidate version, last production version is broken due to lack of ca-certificates package.

It's working in production because we connect to external URIs via a proxy, hence we don't need ca-certificates.

Wed, May 15, 6:03 AM · Beta-Cluster-reproducible, Editing-team, Core Platform Team Backlog (Next), Services (next), serviceops

Tue, May 14

Joe added a comment to T220235: Migrate Beta cluster services to use Kubernetes .

Let's put together a list of all the services we need to set up as containers within beta (either because stuff is already broken without it or because it will be soon), and figure out how best to arrange the VMs for it. We might not want to be making a new VM for each service if they all run inside containers?

Tue, May 14, 6:50 AM · Patch-For-Review, Editing-team, Core Platform Team Backlog (Next), Services (next), Kubernetes, Release Pipeline, serviceops, Beta-Cluster-Infrastructure
Joe added a comment to T220235: Migrate Beta cluster services to use Kubernetes .

Could we use image version: latest in beta hiera? And somehow pull down the new latest and restart the image whenever a new version is created and uploaded to the registry?

Tue, May 14, 6:22 AM · Patch-For-Review, Editing-team, Core Platform Team Backlog (Next), Services (next), Kubernetes, Release Pipeline, serviceops, Beta-Cluster-Infrastructure
Joe added a comment to T197126: Create tool to handle the state of database configuration in MediaWiki in etcd.

EtcdConfig in MediaWiki has been extensively tested against failures before it was introduced, if that's what @jcrespo was referring to.

Tue, May 14, 6:20 AM · User-ArielGlenn, Patch-For-Review, User-Joe, MediaWiki-Configuration, Operations, DBA
Joe updated subscribers of T206504: Create a new endpoint which returns articles in need of a description.

Redis has issues with cross-dc replication, and it is slowly being removed (jobqueue was, sessions next).

I'm confused about the status of Redis in our infrastructure. Just to be clear, is the plan to (eventually) phase it out completely? I'd emphasize that not all current usages of Redis are as a generic key-value store. As noted in T158239#3223921, the GettingStarted extension is using (extremely useful) Redis-specific set functionality for its edit suggestion engine. It will have to undergo significant changes to move to MySQL or a generic object cache interface. I'd planned to do the same or similar to what GettingStarted is doing here, which was my reason for preferring Redis.

Tue, May 14, 6:12 AM · WikimediaEditorTasks, Wikipedia-Android-App-Backlog, Reading-Infrastructure-Team-Backlog (Kanban), Mobile-Content-Service

Mon, May 13

Jdforrester-WMF awarded T197126: Create tool to handle the state of database configuration in MediaWiki in etcd a Like token.
Mon, May 13, 3:48 PM · User-ArielGlenn, Patch-For-Review, User-Joe, MediaWiki-Configuration, Operations, DBA
Jdforrester-WMF awarded T197126: Create tool to handle the state of database configuration in MediaWiki in etcd a Like token.
Mon, May 13, 3:47 PM · User-ArielGlenn, Patch-For-Review, User-Joe, MediaWiki-Configuration, Operations, DBA
Joe added a comment to T215746: Checkup on cssjanus PHP 7 compat.

Should we backport this to wmf.4? This is blocking further deployment of PHP7.

Mon, May 13, 9:00 AM · MW-1.34-notes (1.34.0-wmf.4; 2019-05-07), Performance-Team, MediaWiki-ResourceLoader, PHP 7.2 support, PHP 7.0 support
Joe added a comment to T220235: Migrate Beta cluster services to use Kubernetes .

The status quo is that services always run their code in beta before it reaches production. For manually deployed code (MW) and most services (before k8s) this is/was automated by Jenkins using basically just git-pull, or cron (puppet). Services not interacting with MW, that use scap (such as webperf services), require individual teams to do their routine deployments in both beta and prod manually (and sometimes skip this in beta).

Mon, May 13, 5:01 AM · Patch-For-Review, Editing-team, Core Platform Team Backlog (Next), Services (next), Kubernetes, Release Pipeline, serviceops, Beta-Cluster-Infrastructure
Joe added a comment to T220235: Migrate Beta cluster services to use Kubernetes .

An example of environmental differences: service-runner uses statsd. In prod we use prometheus-statsd-exporter in a k8s container with service specific metric mappings to get those metrics into prometheus.

Mon, May 13, 4:36 AM · Patch-For-Review, Editing-team, Core Platform Team Backlog (Next), Services (next), Kubernetes, Release Pipeline, serviceops, Beta-Cluster-Infrastructure
Joe added a comment to T117845: Rename the language codes sr-el and sr-ec to the BCP 47 conform codes sr-Latn and sr-Cyrl.

Hi all,

Mon, May 13, 4:25 AM · Patch-For-Review, MediaWiki-Internationalization, I18n

Fri, May 10

Joe closed T216712: Switch PHP 7.2 packages to an internal component as Resolved.
Fri, May 10, 3:37 PM · Patch-For-Review, cloud-services-team (Kanban), Toolforge, Operations
Joe closed T216712: Switch PHP 7.2 packages to an internal component, a subtask of T176370: Migrate to PHP 7 in WMF production, as Resolved.
Fri, May 10, 3:36 PM · Core Platform Team Kanban (Doing), Core Platform Team (PHP7 (TEC4)), Patch-For-Review, TechCom-RFC (TechCom-Approved), User-ArielGlenn, HHVM, Operations
Joe closed T222705: Improve Pybal's url checks as Resolved.
Fri, May 10, 10:48 AM · Patch-For-Review, User-jijiki, PHP 7.2 support, Operations, serviceops, Traffic
Joe closed T222705: Improve Pybal's url checks, a subtask of T219150: Ramp up percentage of users on php7.2 to 100% on both API and appserver clusters, as Resolved.
Fri, May 10, 10:48 AM · Patch-For-Review, User-jijiki, Operations, serviceops
Joe added a project to T216664: MWException when viewing or comparing certain pages with Preprocessor_DOM (PHP7 beta feature): serviceops.
Fri, May 10, 9:46 AM · serviceops, Core Platform Team Kanban (Waiting for Review), Core Platform Team (Security, stability, performance and scalability (TEC1)), PHP 7.2 support, MediaWiki-History-and-Diffs, MediaWiki-Parser, Wikimedia-production-error
Joe added a comment to T216664: MWException when viewing or comparing certain pages with Preprocessor_DOM (PHP7 beta feature).

Going deeper, this template looks up on this and other tables:

Fri, May 10, 9:44 AM · serviceops, Core Platform Team Kanban (Waiting for Review), Core Platform Team (Security, stability, performance and scalability (TEC1)), PHP 7.2 support, MediaWiki-History-and-Diffs, MediaWiki-Parser, Wikimedia-production-error
Joe added a comment to T216664: MWException when viewing or comparing certain pages with Preprocessor_DOM (PHP7 beta feature).

And indeed, this page I just created containing only the use of that template:

Fri, May 10, 9:18 AM · serviceops, Core Platform Team Kanban (Waiting for Review), Core Platform Team (Security, stability, performance and scalability (TEC1)), PHP 7.2 support, MediaWiki-History-and-Diffs, MediaWiki-Parser, Wikimedia-production-error
Joe added a comment to T216664: MWException when viewing or comparing certain pages with Preprocessor_DOM (PHP7 beta feature).

More interesting: I raised the memory limit on mwdebug1002 and got profiling information from both hhvm and php7 and the main difference is the memory spent in Preprocessor_Hash::preprocessToObj:

Fri, May 10, 9:10 AM · serviceops, Core Platform Team Kanban (Waiting for Review), Core Platform Team (Security, stability, performance and scalability (TEC1)), PHP 7.2 support, MediaWiki-History-and-Diffs, MediaWiki-Parser, Wikimedia-production-error
Joe added a comment to T216664: MWException when viewing or comparing certain pages with Preprocessor_DOM (PHP7 beta feature).

from the original request's parsing report:

Fri, May 10, 8:39 AM · serviceops, Core Platform Team Kanban (Waiting for Review), Core Platform Team (Security, stability, performance and scalability (TEC1)), PHP 7.2 support, MediaWiki-History-and-Diffs, MediaWiki-Parser, Wikimedia-production-error
Joe added a comment to T216664: MWException when viewing or comparing certain pages with Preprocessor_DOM (PHP7 beta feature).

Just did some tests in production. The aforementioned page will render fine if I raise the memory limit to 2 GB. It also takes 25 seconds to render vs the 5 seconds it takes with HHVM.

Fri, May 10, 8:17 AM · serviceops, Core Platform Team Kanban (Waiting for Review), Core Platform Team (Security, stability, performance and scalability (TEC1)), PHP 7.2 support, MediaWiki-History-and-Diffs, MediaWiki-Parser, Wikimedia-production-error
Joe added a comment to T220212: Wikimedia Technical Conference 2019: Discussion .
Fri, May 10, 8:08 AM · International-Developer-Events
Joe added a comment to T220212: Wikimedia Technical Conference 2019: Discussion .

Participant selection should be based on need and not on a fixed quota of participants.

Fri, May 10, 5:41 AM · International-Developer-Events
Joe added a comment to T220212: Wikimedia Technical Conference 2019: Discussion .

Given this is the Wikimedia tech conference, I wouldn't consider it central to work on defining the relations between the Wikimedia movement and third-party users of Wikimedia products. I'm not against having it as a theme, but it's by far not the first thing I would talk about.

Fri, May 10, 5:20 AM · International-Developer-Events

Thu, May 9

Joe added a comment to T221654: Puppet broken on VMs in deployment-prep.

The way to go for such things is to use role::beta::docker_services on a fresh VM.

Thu, May 9, 3:26 PM · Patch-For-Review, Beta-Cluster-Infrastructure, serviceops
Joe added a comment to T220235: Migrate Beta cluster services to use Kubernetes .

Will the service run into any differences in its environment due to being run with role::beta::docker_services instead of k8s? I'm not 100% thrilled with the idea of introducing another infrastructure difference from production, but it might not be a big deal if the service behaves the same.
When @Ottomata mentioned that role to me, he thought that role::beta::docker_services does not work on stretch, is that correct?

Thu, May 9, 3:24 PM · Patch-For-Review, Editing-team, Core Platform Team Backlog (Next), Services (next), Kubernetes, Release Pipeline, serviceops, Beta-Cluster-Infrastructure
Joe added a comment to T218609: Figure out future for newly created deployment-prep jessie instances.

The only caveat is that apparently you need to run puppet once, run apt-get update, then run puppet again to make it work.

Thu, May 9, 3:21 PM · Beta-Cluster-Infrastructure
Joe added a comment to T218609: Figure out future for newly created deployment-prep jessie instances.

Ok so I can now confirm:

Thu, May 9, 3:21 PM · Beta-Cluster-Infrastructure
Joe added a comment to T218609: Figure out future for newly created deployment-prep jessie instances.

I don't really know why stretch won't work, are we sure that's the case?

Thu, May 9, 11:49 AM · Beta-Cluster-Infrastructure
Joe moved T222705: Improve Pybal's url checks from Backlog to Doing on the serviceops board.
Thu, May 9, 11:47 AM · Patch-For-Review, User-jijiki, PHP 7.2 support, Operations, serviceops, Traffic
Joe added a comment to T220235: Migrate Beta cluster services to use Kubernetes .

Please also note you can run multiple services on the same VM if you really want to, it's enough to add a second stanza in the hiera definition.

Thu, May 9, 9:41 AM · Patch-For-Review, Editing-team, Core Platform Team Backlog (Next), Services (next), Kubernetes, Release Pipeline, serviceops, Beta-Cluster-Infrastructure
Joe added a comment to T220235: Migrate Beta cluster services to use Kubernetes .

There is a simple solution to run services that are now on k8s on deployment-prep:

Thu, May 9, 9:40 AM · Patch-For-Review, Editing-team, Core Platform Team Backlog (Next), Services (next), Kubernetes, Release Pipeline, serviceops, Beta-Cluster-Infrastructure
Joe added a comment to T214975: proton experienced a period of high CPU usage, busy queue, lockups.

So what happened is that the server ran out of available memory and OOM'd.

Thu, May 9, 7:07 AM · Reading-Infrastructure-Team-Backlog, Proton, Operations

Wed, May 8

Joe added a comment to T221347: PHP7 opcache sometimes corrupts when cleared (was: Fatal ConfigException, undefined InitialiseSettings variable).

Today's incident at https://wikitech.wikimedia.org/wiki/Incident_documentation/20190507-opcache was very similar and presumably has the same root cause.

mw1320: ConfigException from line 53 of /srv/mediawiki/php-1.34.0-wmf.3/includes/config/GlobalVarConfig.php: GlobalVarConfig::get: undefined option: 'UseKeyHe`der'

The string literal UseKeyHeader had somehow been corrupted in memory to become UseKeyHe`der.
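
As a side note on the two literals quoted above: they differ in exactly one byte, and that byte differs in a single bit, since `a` is 0x61 and the backtick is 0x60 in ASCII. The sketch below only compares the two strings from the error message; the helper name diff_bytes is made up for illustration, and the observation by itself says nothing about the root cause.

```python
def diff_bytes(expected: str, actual: str):
    """List (index, expected_byte, actual_byte, xor) for each differing byte."""
    a, b = expected.encode("ascii"), actual.encode("ascii")
    if len(a) != len(b):
        raise ValueError("strings must have equal length")
    return [(i, x, y, x ^ y) for i, (x, y) in enumerate(zip(a, b)) if x != y]


diffs = diff_bytes("UseKeyHeader", "UseKeyHe`der")
for index, want, got, xor in diffs:
    # The xor column shows which bits changed between the two bytes.
    print(f"byte {index}: {want:#04x} -> {got:#04x}, xor {xor:#010b}")
```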

Keeping task open for now, as I'm unsure what the wider context is and what other changes have been or will be made.

If everything else is still the same, then performing opcache invalidation at regular intervals is presumably not enough to avoid the issue. It would merely mean that the sites are "only" corrupt and potentially compromised security/privacy-wise for less than 60 seconds at a time.

Wed, May 8, 2:08 PM · PHP 7.2 support, Operations, Wikimedia-production-error
Joe renamed T221347: PHP7 opcache sometimes corrupts when cleared (was: Fatal ConfigException, undefined InitialiseSettings variable) from PHP7 opcache sometimes corrupts (was: Fatal ConfigException, undefined InitialiseSettings variable) to PHP7 opcache sometimes corrupts when cleared (was: Fatal ConfigException, undefined InitialiseSettings variable).
Wed, May 8, 2:02 PM · PHP 7.2 support, Operations, Wikimedia-production-error
Joe added a parent task for T215746: Checkup on cssjanus PHP 7 compat: T219127: SRE FY2019 Q4 goal: complete the transition to PHP7.
Wed, May 8, 12:51 PM · MW-1.34-notes (1.34.0-wmf.4; 2019-05-07), Performance-Team, MediaWiki-ResourceLoader, PHP 7.2 support, PHP 7.0 support
Joe added a subtask for T219127: SRE FY2019 Q4 goal: complete the transition to PHP7: T215746: Checkup on cssjanus PHP 7 compat.
Wed, May 8, 12:51 PM · Operations, serviceops
Joe added a parent task for T219279: Some pages will become completely unreachable after PHP7 update due to Unicode changes: T219127: SRE FY2019 Q4 goal: complete the transition to PHP7.
Wed, May 8, 12:50 PM · Core Platform Team Kanban (Waiting for Review), MW-1.34-notes (1.34.0-wmf.4; 2019-05-07), Patch-For-Review, Core Platform Team (PHP7 (TEC4)), serviceops, Operations, PHP 7.2 support, MediaWiki-General-or-Unknown
Joe added a subtask for T219127: SRE FY2019 Q4 goal: complete the transition to PHP7: T219279: Some pages will become completely unreachable after PHP7 update due to Unicode changes.
Wed, May 8, 12:50 PM · Operations, serviceops
Joe added a parent task for T219901: Default to Preprocessor_Hash for PHP 7: T219127: SRE FY2019 Q4 goal: complete the transition to PHP7.
Wed, May 8, 12:49 PM · MW-1.34-release, Performance-Team (Radar), MediaWiki-Parser
Joe added a subtask for T219127: SRE FY2019 Q4 goal: complete the transition to PHP7: T219901: Default to Preprocessor_Hash for PHP 7.
Wed, May 8, 12:49 PM · Operations, serviceops
Joe added a comment to T218005: Variable from InitialiseSettings can be undefined (corrupt opcache?) .

We have done some work to overcome these sporadic failures:

Wed, May 8, 12:48 PM · PHP 7.2 support, Wikimedia-production-error, MediaWiki-extensions-PagedTiffHandler, User-DannyS712
Joe added a parent task for T218005: Variable from InitialiseSettings can be undefined (corrupt opcache?) : T219127: SRE FY2019 Q4 goal: complete the transition to PHP7.
Wed, May 8, 12:46 PM · PHP 7.2 support, Wikimedia-production-error, MediaWiki-extensions-PagedTiffHandler, User-DannyS712
Joe added a subtask for T219127: SRE FY2019 Q4 goal: complete the transition to PHP7: T218005: Variable from InitialiseSettings can be undefined (corrupt opcache?) .
Wed, May 8, 12:46 PM · Operations, serviceops
Joe added a parent task for T216664: MWException when viewing or comparing certain pages with Preprocessor_DOM (PHP7 beta feature): T219127: SRE FY2019 Q4 goal: complete the transition to PHP7.
Wed, May 8, 12:41 PM · serviceops, Core Platform Team Kanban (Waiting for Review), Core Platform Team (Security, stability, performance and scalability (TEC1)), PHP 7.2 support, MediaWiki-History-and-Diffs, MediaWiki-Parser, Wikimedia-production-error
Joe added a subtask for T219127: SRE FY2019 Q4 goal: complete the transition to PHP7: T216664: MWException when viewing or comparing certain pages with Preprocessor_DOM (PHP7 beta feature).
Wed, May 8, 12:41 PM · Operations, serviceops

Tue, May 7

Joe added a comment to T216664: MWException when viewing or comparing certain pages with Preprocessor_DOM (PHP7 beta feature).

@Krinkle should we raise the memory limit slightly? I can do some tests, but apart from that what remains to be done?

Tue, May 7, 4:20 PM · serviceops, Core Platform Team Kanban (Waiting for Review), Core Platform Team (Security, stability, performance and scalability (TEC1)), PHP 7.2 support, MediaWiki-History-and-Diffs, MediaWiki-Parser, Wikimedia-production-error
Joe added a comment to T219901: Default to Preprocessor_Hash for PHP 7.

Hi, this is considered a blocker for further deployment of php7. @Krinkle do you think the patch you merged tonight solves the issue and can unblock further deployments?

Tue, May 7, 4:19 PM · MW-1.34-release, Performance-Team (Radar), MediaWiki-Parser
Joe added a comment to T220820: Add a CI check for the use of hiera() function.

I would even suggest that, if we write a puppet-lint plugin for this, we add the fix capability. It should allow a relatively quick removal of all hiera() calls.

Tue, May 7, 10:13 AM · Puppet, Operations

Mon, May 6

Joe added a comment to T187147: Port mediawiki/php/wmerrors to PHP7 and deploy.

So a couple things:

Mon, May 6, 2:58 PM · wmerrors, Wikimedia-Logstash, MediaWiki-Logging, Operations, User-herron, Core Platform Team Backlog, MW-1.34-notes (1.34.0-wmf.5; 2019-05-14), Patch-For-Review, PHP 7.2 support, Core Platform Team (PHP7 (TEC4)), Performance-Team (Radar)
Joe merged T222614: CI no more triggers for some/all? repositories! into T222605: CI is unavailable since around 10:00 UTC.
Mon, May 6, 1:55 PM · Wikimedia-Incident, Patch-For-Review, Continuous-Integration-Config, Release-Engineering-Team
Joe merged task T222614: CI no more triggers for some/all? repositories! into T222605: CI is unavailable since around 10:00 UTC.
Mon, May 6, 1:55 PM · Release-Engineering-Team (Kanban), Continuous-Integration-Infrastructure
D3r1ck01 awarded T222605: CI is unavailable since around 10:00 UTC a The World Burns token.
Mon, May 6, 12:54 PM · Wikimedia-Incident, Patch-For-Review, Continuous-Integration-Config, Release-Engineering-Team
Joe triaged T222605: CI is unavailable since around 10:00 UTC as Unbreak Now! priority.
Mon, May 6, 11:52 AM · Wikimedia-Incident, Patch-For-Review, Continuous-Integration-Config, Release-Engineering-Team
Joe created T222605: CI is unavailable since around 10:00 UTC.
Mon, May 6, 11:51 AM · Wikimedia-Incident, Patch-For-Review, Continuous-Integration-Config, Release-Engineering-Team
Joe closed T216676: Set up A/B testing mechanism for PHP7, as Resolved.
Mon, May 6, 6:29 AM · MW-1.33-notes (1.33.0-wmf.22; 2019-03-19), Patch-For-Review, User-Joe, serviceops, Operations
Joe closed T216676: Set up A/B testing mechanism for PHP7,, a subtask of T212828: SRE FY2019 Q3 goal: Ramp-up serving traffic to PHP 7 , as Resolved.
Mon, May 6, 6:29 AM · User-Joe, serviceops, Operations

Fri, May 3

Joe added a comment to T222452: PHP Fatal Errors on mw1275 after deployment.

some things from my very initial analysis:

  • I tried to purge first the directory that the deployment had invalidated, the error didn't go away
  • I tried purging the autoload file in composer that was supposedly loading the library from the wrong place, still no dice
  • I finally tried purging all of the opcache, which solved the problem.
Fri, May 3, 2:55 PM · Wikimedia-production-error, serviceops, Operations

Thu, May 2

Joe added a member for netbox: Volans.
Thu, May 2, 9:50 AM
Joe created netbox.
Thu, May 2, 9:49 AM

Mon, Apr 29

Joe added a comment to T219279: Some pages will become completely unreachable after PHP7 update due to Unicode changes.
  1. An opinion on whether it's ok to live with the issue on the pages @Anomie found to be affected
  2. An implementation of a solution for such pages, or at least a proposed workflow around that.

Corey asked me now to work on this task, so I can now get started on doing #2.

For #1, it's hard to say since I don't know anything about the languages that might actually be affected. I can say that on wikis like enwiki it seems the articles affected are mostly about the letters themselves, the main question would be whether in any of these cases enwiki has the article at the lowercase title with a redirect from uppercase rather than vice versa. And I don't see many multi-character article titles in the list from other wikis either.

Mon, Apr 29, 7:55 AM · Core Platform Team Kanban (Waiting for Review), MW-1.34-notes (1.34.0-wmf.4; 2019-05-07), Patch-For-Review, Core Platform Team (PHP7 (TEC4)), serviceops, Operations, PHP 7.2 support, MediaWiki-General-or-Unknown

Mon, Apr 22

Joe added a comment to T221347: PHP7 opcache sometimes corrupts when cleared (was: Fatal ConfigException, undefined InitialiseSettings variable).

Btw, logs don't tell you much in the case you meet such a bug - I would've had to go scavenge a core dump of the running process to try to figure out what was actually going on there. I think the priority should be lowered and we should look at reoccurrences of the same bug after today's change has been widely applied.

Does that mean we are aware of _other_ occurrences of a similar bug? Unless I'm blind, there's no comment mentioning that above.

Mon, Apr 22, 6:50 AM · PHP 7.2 support, Operations, Wikimedia-production-error
Joe added a comment to T219279: Some pages will become completely unreachable after PHP7 update due to Unicode changes.

@kchapman regarding point 1 above - I've prepared various patches, including https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/505487 that should act as a stopgap solution for now.

Mon, Apr 22, 6:27 AM · Core Platform Team Kanban (Waiting for Review), MW-1.34-notes (1.34.0-wmf.4; 2019-05-07), Patch-For-Review, Core Platform Team (PHP7 (TEC4)), serviceops, Operations, PHP 7.2 support, MediaWiki-General-or-Unknown

Apr 20 2019

Joe added a comment to T221347: PHP7 opcache sometimes corrupts when cleared (was: Fatal ConfigException, undefined InitialiseSettings variable).

I realized I failed to update this ticket with my investigation:

Apr 20 2019, 9:37 AM · PHP 7.2 support, Operations, Wikimedia-production-error
Joe added a comment to T211488: Audit and sync INI settings as needed between HHVM and PHP 7 .

FWIW, the doc_root being set was causing severe issues under php7.2. I removed it from the list of ini settings we want to use, and from the actual config as well.

Apr 20 2019, 8:49 AM · Performance-Team, User-jijiki, PHP 7.2 support, Operations

Apr 19 2019

Joe updated subscribers of T220246: Session storage service Cassandra schema.

I was also wondering if @Marostegui and @jcrespo could share insights from how we manage schema changes and configuration changes on mysql.

Apr 19 2019, 3:17 PM · Core Platform Team (Session Management Service (CDP2)), User-Clarakosi, Core Platform Team Backlog (Next), User-Eevans
Joe added a comment to T220246: Session storage service Cassandra schema.

Let's try to break down the procedures of:

  1. Upgrading an existing table for a running application
  2. Creating a new table for a new application
  3. Altering the configuration of a table for a running application
Apr 19 2019, 3:14 PM · Core Platform Team (Session Management Service (CDP2)), User-Clarakosi, Core Platform Team Backlog (Next), User-Eevans
Joe closed T215339: No jobs running on beta cluster as Resolved.

I fixed the configuration of cpjobqueue in deployment-prep, restarted the service, and verified requests are now getting through to the jobrunner:

Apr 19 2019, 11:01 AM · serviceops, Services (done), Wikidata, SDC General, Beta-Cluster-Infrastructure
Joe closed T215339: No jobs running on beta cluster, a subtask of T186993: Beta Cluster search box displays unexisting pages as results, as Resolved.
Apr 19 2019, 11:00 AM · Discovery-Search, Services (next), MediaWiki-Search, Beta-Cluster-Infrastructure
Joe claimed T215339: No jobs running on beta cluster.
Apr 19 2019, 9:42 AM · serviceops, Services (done), Wikidata, SDC General, Beta-Cluster-Infrastructure
Joe added a comment to T215339: No jobs running on beta cluster.

FWIW, I don't think we need the TLS configuration in beta. I can try to simplify things. Sorry for not noticing this bug earlier, but adding Operations, or better, serviceops, could've helped bring it to my attention.

Apr 19 2019, 9:42 AM · serviceops, Services (done), Wikidata, SDC General, Beta-Cluster-Infrastructure

Apr 18 2019

Joe updated subscribers of T221365: MassMessage not delivering.

@Elitre the problem seems to be a regression in MediaWiki, tracked at T221368. @Reedy reverted the group 1 wikis (including meta) to the previous version, and now the messages can be re-sent. I don't think there should be duplicates; you can quote me on that and blame me for the annoyance :)

Apr 18 2019, 1:10 PM · Patch-For-Review, MassMessage, Operations
Joe added a comment to T221368: cdnPurge and other jobs fail completely to execute.

We decided to revert given that the spike in errors started yesterday at 19:15 UTC, which corresponds to the SAL entry for moving group 1 to wmf.1.

Apr 18 2019, 12:59 PM · MW-1.34-notes (1.34.0-wmf.3; 2019-04-30), Core Platform Team Kanban (Done with CPT), Services (done), Core Platform Team (Security, stability, performance and scalability (TEC1)), WMF-JobQueue, Performance-Team, MediaWiki-JobQueue, Wikimedia-production-error
Joe updated subscribers of T221368: cdnPurge and other jobs fail completely to execute.

@Reedy graciously reverted group 1 for me, as this was the cause for a UBN! ticket.

Apr 18 2019, 12:57 PM · MW-1.34-notes (1.34.0-wmf.3; 2019-04-30), Core Platform Team Kanban (Done with CPT), Services (done), Core Platform Team (Security, stability, performance and scalability (TEC1)), WMF-JobQueue, Performance-Team, MediaWiki-JobQueue, Wikimedia-production-error
Joe triaged T221368: cdnPurge and other jobs fail completely to execute as Unbreak Now! priority.
Apr 18 2019, 12:57 PM · MW-1.34-notes (1.34.0-wmf.3; 2019-04-30), Core Platform Team Kanban (Done with CPT), Services (done), Core Platform Team (Security, stability, performance and scalability (TEC1)), WMF-JobQueue, Performance-Team, MediaWiki-JobQueue, Wikimedia-production-error
Joe created T221368: cdnPurge and other jobs fail completely to execute.
Apr 18 2019, 12:57 PM · MW-1.34-notes (1.34.0-wmf.3; 2019-04-30), Core Platform Team Kanban (Done with CPT), Services (done), Core Platform Team (Security, stability, performance and scalability (TEC1)), WMF-JobQueue, Performance-Team, MediaWiki-JobQueue, Wikimedia-production-error
Joe triaged T221346: Renew certs for mcrouter on all application servers. as High priority.
Apr 18 2019, 11:19 AM · Patch-For-Review, User-Elukey, Operations, serviceops
Joe assigned T221346: Renew certs for mcrouter on all application servers. to fsero.
Apr 18 2019, 11:16 AM · Patch-For-Review, User-Elukey, Operations, serviceops
Joe added a comment to T221346: Renew certs for mcrouter on all application servers..

So the CA public cert will expire as well at the end of May.

Apr 18 2019, 11:10 AM · Patch-For-Review, User-Elukey, Operations, serviceops
Joe created T221346: Renew certs for mcrouter on all application servers..
Apr 18 2019, 10:18 AM · Patch-For-Review, User-Elukey, Operations, serviceops