After performing this edit using rollback on the english wikipedia, I saw an error at the "Action complete" page (here). Screenshots of the page, and of the html that caused it to display, are attached. There were no errors in the console relevant to this. I believe it is related to https://github.com/wikimedia/operations-mediawiki-config/blob/master/wmf-config/CommonSettings.php#L517, since the error reports that $wmgUsePagedTiffHandler is undefined, and I can't find it mentioned previously in that file.
|Stalled||None||T216734 Drop devDependency on phpunit-mock-objects, which is abandoned and warns against use|
|Stalled||None||T216735 Drop support for PHPUnit 6.x|
|Stalled||None||T192167 Drop support for PHPUnit 4.x|
|Stalled||None||T196206 Bump symfony libraries when we no longer need hhvm support|
|Open||None||T220589 Support phan in MediaWiki's default composer.json|
|Open||None||T192166 Drop HHVM support from MediaWiki|
|Open||None||T176370 Migrate to PHP 7 in WMF production|
|Open||None||T219127 SRE FY2019 Q4 goal: complete the transition to PHP7|
|Duplicate||None||T218005 Variable from InitialiseSettings can be undefined (corrupt opcache?)|
This error suggests that somehow (part of) InitialiseSettings' variables were not exported under PHP 7. That's very concerning.
The code we use to cache and export these globals via a temporary file may need to be re-reviewed for this. In addition to potential PHP 7 compatibility, it's also possible we're seeing a pre-existing race condition where maybe different threads see a partially written file there or something else weird.
Alternatively, if the setting of this variable works fine, it might be that something is un-setting it. But that seems rather unlikely (and code search suggests nothing else refers to this except the 1 line that sets it, and the 1 line that reads it).
Logs show 461 such errors in the last week, all between 2019-03-11T06:32:48 and 2019-03-11T06:49:09, all on the same server: mw1258. Cross-referencing with SAL shows that we were deploying around that time:
06:58 marostegui@deploy1001: Synchronized wmf-config/db-eqiad.php: Give more traffic to db1099 after upgrade (duration: 00m 48s) 06:49 marostegui@deploy1001: Synchronized wmf-config/db-eqiad.php: Slowly repool db1099 after upgrade (duration: 00m 52s) 06:38 kart_: Finished manual run of unpublished ContentTranslation draft purge script (T217818) 06:37 marostegui: Power cycle mw1280 - server down 06:35 marostegui: Upgrade mysql and kernel on db1099 06:34 marostegui@deploy1001: Synchronized wmf-config/db-eqiad.php: Depool db1099 for upgrade (duration: 03m 01s) 06:03 effie: Restarting pdfrender on scb1003
And know what was the end of these errors?
2019-03-11T06:49:10 php PHP Warning: Cannot modify header information - headers already sent by (output started at /srv/mediawiki/wmf-config/CommonSettings.php:517) 2019-03-11T06:49:22 scap http://mw1258:9181 "GET /opcache-free HTTP/1.1" 200 None
So while the errors found by @MaxSem on logstash seem to refer to a host that was somehow in a bad state in terms of opcache state, which will need further analysis, the other issue to fix is the display_errors ini setting, which used to be in our php configuration and I didn't modify.
Looking at the logs on the server:
- 2019-03-11T06:32:48 errors start
- 2019-03-11T06:34:57 opcache received a request to invalidate a single file /opcache-free?file=wmf-config%2Fdb-eqiad.php, during a scap run.
- 2019-03-11T06:49:09 last error is logged
- 2019-03-11T06:49:22 a full opcache flush was requested /opcache-free - not sure what caused this, but it came from scap.
Now it seems clear to me that a full opcache flush repaired the issue, but the daemon has been restarted since so I lost any ability to further inspect what happened with its opcache. Grafana data don't show anything unusual either.
So while an opcache flush did the trick of stopping the errors, there is no obvious issue at play here. Having opcache revalidation active every 60 seconds would've probably prevented this from happening.
I'll solve the issue with display_errors and leave the bug open at lower priority so that we can relate it to other episodes of this happening.
We have done some work to overcome this sporadic failures:
- Reenabled opcache revalidation, which should hopefully refresh the corrupt cache within 1 minute
- Allowed pybal to monitor both HHVM and php7, so that a failing server will be depooled in case of failures.