Page MenuHomePhabricator

Variable from InitialiseSettings can be undefined (corrupt opcache?)
Closed, DuplicatePublicPRODUCTION ERROR

Description

After performing this edit using rollback on the english wikipedia, I saw an error at the "Action complete" page (here). Screenshots of the page, and of the html that caused it to display, are attached. There were no errors in the console relevant to this. I believe it is related to https://github.com/wikimedia/operations-mediawiki-config/blob/master/wmf-config/CommonSettings.php#L517, since the error reports that $wmgUsePagedTiffHandler is undefined, and I can't find it mentioned previously in that file.

diff error 2.jpg (1×2 px, 442 KB)

diff error 1.jpg (1×2 px, 523 KB)

Event Timeline

Aklapper renamed this task from Rollback error to PHP notice "Undefined variable: wmgUsePagedTiffHandler" shown after rollback on enwp.Mar 11 2019, 8:27 AM

Thanks for reporting this!

I can't find it mentioned previously in that file.

It's in https://noc.wikimedia.org/conf/InitialiseSettings.php.txt

@DannyS712 do you have the PHP 7 beta feature enabled?

@DannyS712 do you have the PHP 7 beta feature enabled?

@Legoktm yes, I do

This error suggests that somehow (part of) InitialiseSettings' variables were not exported under PHP 7. That's very concerning.

The code we use to cache and export these globals via a temporary file may need to be re-reviewed for this. In addition to potential PHP 7 compatibility, it's also possible we're seeing a pre-existing race condition where maybe different threads see a partially written file there or something else weird.

Alternatively, if the setting of this variable works fine, it might be that something is un-setting it. But that seems rather unlikely (and code search suggests nothing else refers to this except the 1 line that sets it, and the 1 line that reads it).

Logs show 461 such errors in the last week, all between 2019-03-11T06:32:48 and 2019-03-11T06:49:09, all on the same server: mw1258. Cross-referencing with SAL shows that we were deploying around that time:

06:58 marostegui@deploy1001: Synchronized wmf-config/db-eqiad.php: Give more traffic to db1099 after upgrade (duration: 00m 48s)
06:49 marostegui@deploy1001: Synchronized wmf-config/db-eqiad.php: Slowly repool db1099 after upgrade (duration: 00m 52s)
06:38 kart_: Finished manual run of unpublished ContentTranslation draft purge script (T217818)
06:37 marostegui: Power cycle mw1280 - server down
06:35 marostegui: Upgrade mysql and kernel on db1099
06:34 marostegui@deploy1001: Synchronized wmf-config/db-eqiad.php: Depool db1099 for upgrade (duration: 03m 01s)
06:03 effie: Restarting pdfrender on scb1003

And know what was the end of these errors?

2019-03-11T06:49:10	php	PHP Warning: Cannot modify header information - headers already sent by (output started at /srv/mediawiki/wmf-config/CommonSettings.php:517)
2019-03-11T06:49:22	scap	http://mw1258:9181 "GET /opcache-free HTTP/1.1" 200 None

(dashboard)

Change 497024 had a related patch set uploaded (by MaxSem; owner: MaxSem):
[operations/puppet@production] Disable display_errors in FPM mode

https://gerrit.wikimedia.org/r/497024

Krinkle renamed this task from PHP notice "Undefined variable: wmgUsePagedTiffHandler" shown after rollback on enwp to Variable wmgUsePagedTiffHandler from InitialiseSettings undefined.Mar 16 2019, 6:16 AM
Krinkle moved this task from Sep2019/1.34.wmf.21+ to Resolved on the Wikimedia-production-error board.

So while the errors found by @MaxSem on logstash seem to refer to a host that was somehow in a bad state in terms of opcache state, which will need further analysis, the other issue to fix is the display_errors ini setting, which used to be in our php configuration and I didn't modify.

Looking at the logs on the server:

  • 2019-03-11T06:32:48 errors start
  • 2019-03-11T06:34:57 opcache received a request to invalidate a single file /opcache-free?file=wmf-config%2Fdb-eqiad.php, during a scap run.
  • 2019-03-11T06:49:09 last error is logged
  • 2019-03-11T06:49:22 a full opcache flush was requested /opcache-free - not sure what caused this, but it came from scap.

Now it seems clear to me that a full opcache flush repaired the issue, but the daemon has been restarted since so I lost any ability to further inspect what happened with its opcache. Grafana data don't show anything unusual either.

So while an opcache flush did the trick of stopping the errors, there is no obvious issue at play here. Having opcache revalidation active every 60 seconds would've probably prevented this from happening.

I'll solve the issue with display_errors and leave the bug open at lower priority so that we can relate it to other episodes of this happening.

Change 497024 merged by Giuseppe Lavagetto:
[operations/puppet@production] profile::mediawiki::php: disable display_errors in FPM mode

https://gerrit.wikimedia.org/r/497024

Krinkle renamed this task from Variable wmgUsePagedTiffHandler from InitialiseSettings undefined to Variable from InitialiseSettings can be undefined (corrupt opcache?) .Mar 28 2019, 12:23 AM

We have done some work to overcome this sporadic failures:

  • Reenabled opcache revalidation, which should hopefully refresh the corrupt cache within 1 minute
  • Allowed pybal to monitor both HHVM and php7, so that a failing server will be depooled in case of failures.

@Krinkle this task predates T221347 by a month - isn't that one the duplicate?

@DannyS712: Date is not relevant - whatever task has more info or a patch should be the one to merge into.

mmodell changed the subtype of this task from "Task" to "Production Error".Aug 28 2019, 11:07 PM