
Watchlist and RecentChanges failure due to ORES on frwiki and ruwiki
Closed, ResolvedPublic

Description

[WhNY-gpAAEAAAI@AlH0AAACN] 2017-11-20 22:36:46: Fatal exception of type "RuntimeException"

This error appears for all readers and users on Russian Wikipedia.

Logstash

RuntimeException: Unable to parse threshold: [..]
 at /srv/mediawiki/php-1.31.0-wmf.7/extensions/ORES/includes/Stats.php on line 277

https://grafana.wikimedia.org/dashboard/file/varnish-http-errors.json

Screen Shot 2017-11-20 at 14.49.55.png (687×1 px, 127 KB)

Server Admin Log

21:37 <awight@tin> Started deploy [ores/deploy@5084251]: Updating ORES to revscoring 2.0.10, T179711
22:04 <demon@tin> rebuilt wikiversions.php and synchronized wikiversions files: group2 to wmf.8
22:10 Sharp rise in HTTP 500 errors
22:27 <awight@tin> Finished deploy [ores/deploy@5084251]: Updating ORES to revscoring 2.0.10 (duration: 49m 54s)
22:54 <awight@tin> Started deploy [ores/deploy@5084251]: Rollback ORES; T179711
22:55 <awight> rolling back ORES to fix T181006
22:55 <demon@tin> rebuilt wikiversions.php and synchronized wikiversions files: no wmf.8 for group2. i hate my life
22:55 <awight@tin> Finished deploy [ores/deploy@5084251]: Rollback ORES (duration: 01m 05s)
23:11 <awight> purged memcache key 'ruwiki:ORES:threshold_statistics:goodfaith:1', T181006
23:25 <awight@tin> Started deploy [ores/deploy@82a13ae]: Rollback ORES (take 3); T181006

Event Timeline

MBH triaged this task as Unbreak Now! priority.Nov 20 2017, 10:41 PM
[WhNY-gpAAEAAAI@AlH0AAACN] /wiki/%D0%A1%D0%BB%D1%83%D0%B6%D0%B5%D0%B1%D0%BD%D0%B0%D1%8F:%D0%A1%D0%BF%D0%B8%D1%81%D0%BE%D0%BA_%D0%BD%D0%B0%D0%B1%D0%BB%D1%8E%D0%B4%D0%B5%D0%BD%D0%B8%D1%8F?days=0.004001782407407407   RuntimeException from line 277 of /srv/mediawiki/php-1.31.0-wmf.8/extensions/ORES/includes/Stats.php: Unable to parse threshold: {"levelName":"verylikelybad","levelConfig":"maximum recall @ precision >= 0.75","bound":"max","statsData":{"false":{"maximum recall @ precision >= 0.15":{"!f1":0.923,"!precision":0.995,"!recall":0.861,"accuracy":0.86,"f1":0.256,"filter_rate":0.841,"fpr":0.139,"match_rate":0.159,"precision":0.151,"recall":0.842,"threshold":0.252},"maximum recall @ precision >= 0.45":{"!f1":0.985,"!precision":0.977,"!recall":0.993,"accuracy":0.97,"f1":0.269,"filter_rate":0.988,"fpr":0.007,"match_rate":0.012,"precision":0.452,"recall":0.192,"threshold":0.797},"maximum recall @ precision >= 0.75":null},"true":{"maximum recall @ precision >= 0.995":{"!f1":0.254,"!precision":0.149,"!recall":0.854,"accuracy":0.856,"f1":0.921,"filter_rate":0.164,"fpr":0.146,"match_rate":0.836,"precision":0.995,"recall":0.856,"threshold":0.766}}}}
#0 /srv/mediawiki/php-1.31.0-wmf.8/extensions/ORES/includes/Stats.php(241): ORES\Stats->extractBoundValue(string, string, string, array)
#1 /srv/mediawiki/php-1.31.0-wmf.8/extensions/ORES/includes/Stats.php(44): ORES\Stats->parseThresholds(array, string)
#2 /srv/mediawiki/php-1.31.0-wmf.8/extensions/ORES/includes/Hooks.php(316): ORES\Stats->getThresholds(string)
#3 /srv/mediawiki/php-1.31.0-wmf.8/includes/Hooks.php(177): ORES\Hooks::onChangesListSpecialPageStructuredFilters(SpecialWatchlist)
#4 /srv/mediawiki/php-1.31.0-wmf.8/includes/Hooks.php(205): Hooks::callHook(string, array, array, NULL)
#5 /srv/mediawiki/php-1.31.0-wmf.8/includes/specialpage/ChangesListSpecialPage.php(882): Hooks::run(string, array)
#6 /srv/mediawiki/php-1.31.0-wmf.8/includes/specials/SpecialWatchlist.php(152): ChangesListSpecialPage->registerFilters()
#7 /srv/mediawiki/php-1.31.0-wmf.8/includes/specialpage/ChangesListSpecialPage.php(1023): SpecialWatchlist->registerFilters()
#8 /srv/mediawiki/php-1.31.0-wmf.8/includes/specialpage/ChangesListSpecialPage.php(843): ChangesListSpecialPage->setup(NULL)
#9 /srv/mediawiki/php-1.31.0-wmf.8/includes/specials/SpecialWatchlist.php(85): ChangesListSpecialPage->getOptions()
#10 /srv/mediawiki/php-1.31.0-wmf.8/includes/specialpage/SpecialPage.php(522): SpecialWatchlist->execute(NULL)
#11 /srv/mediawiki/php-1.31.0-wmf.8/includes/specialpage/SpecialPageFactory.php(578): SpecialPage->run(NULL)
#12 /srv/mediawiki/php-1.31.0-wmf.8/includes/MediaWiki.php(287): SpecialPageFactory::executePath(Title, RequestContext)
#13 /srv/mediawiki/php-1.31.0-wmf.8/includes/MediaWiki.php(851): MediaWiki->performRequest()
#14 /srv/mediawiki/php-1.31.0-wmf.8/includes/MediaWiki.php(523): MediaWiki->main()
#15 /srv/mediawiki/php-1.31.0-wmf.8/index.php(43): MediaWiki->run()
#16 /srv/mediawiki/w/index.php(3): include(string)
#17 {main}

It looks like this was probably started by the deploy logged at 22:27: awight@tin: Finished deploy [ores/deploy@5084251]: Updating ORES to revscoring 2.0.10, T179711 (duration: 49m 54s).

@Catrope This was definitely caused by my deployment. The strange thing is that what we deployed was a fix for T179711, which should only have added that "null" value for requests that were already failing due to impossible config.

The fix (ideally) is to tweak the ruwiki thresholds config until it's within the possible range.
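As a rough illustration of what "outside the possible range" means here, the sketch below scans a trimmed copy of the statsData from the exception payload above and lists the configured formulas for which ORES returned null. The function name `unattainable_levels` is made up for this example, and only the "threshold" field of each entry is kept.

```python
import json

# Trimmed copy of the statsData from the exception payload above
# (only the "threshold" field of each entry is kept).
STATS_DATA = json.loads("""
{
  "false": {
    "maximum recall @ precision >= 0.15": {"threshold": 0.252},
    "maximum recall @ precision >= 0.45": {"threshold": 0.797},
    "maximum recall @ precision >= 0.75": null
  },
  "true": {
    "maximum recall @ precision >= 0.995": {"threshold": 0.766}
  }
}
""")


def unattainable_levels(stats_data):
    """Return (outcome, formula) pairs whose statistics came back as null.

    Hypothetical helper for illustration; not part of the ORES extension.
    """
    return [
        (outcome, formula)
        for outcome, formulas in stats_data.items()
        for formula, stats in formulas.items()
        if stats is None
    ]


missing = unattainable_levels(STATS_DATA)
print(missing)  # [('false', 'maximum recall @ precision >= 0.75')]
```

In this payload the only null entry is the precision >= 0.75 formula, which is the one configured for the "verylikelybad" level, so that is the config value that would need to be lowered.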

It still doesn't work. When will it be fixed?

Mentioned in SAL (#wikimedia-operations) [2017-11-20T23:11:47Z] <awight> purged memcache key 'ruwiki:ORES:threshold_statistics:goodfaith:1’, T181006

Change 392535 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/mediawiki-config@master] Ores: Emergency disable on frwiki and ruwiki

https://gerrit.wikimedia.org/r/392535

Krinkle renamed this task from Watchlist and RecentChanges don't work on ruwiki to Watchlist and RecentChanges failure due to ORES on frwiki and ruwiki.Nov 20 2017, 11:28 PM
Krinkle edited projects, added Wikimedia-Incident; removed WMF-General-or-Unknown.
Krinkle updated the task description. (Show Details)

Change 392535 merged by jenkins-bot:
[operations/mediawiki-config@master] Ores: Emergency disable on frwiki and ruwiki

https://gerrit.wikimedia.org/r/392535

Mentioned in SAL (#wikimedia-operations) [2017-11-20T23:35:47Z] <legoktm@tin> Synchronized wmf-config/InitialiseSettings.php: emergency disable ORES on frwp/ruwp T181006 (duration: 00m 49s)

I've confirmed that both wikis have recovered.

So, for clarity, it seems that in this case, ORES began to work as documented and that caused a failure in Watchlist/RecentChanges. It seems that the next step WRT completing this reverted deployment is to fix the way that Watchlist/RecentChanges degrade.
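A minimal sketch of the kind of degradation meant here, written in Python rather than the extension's PHP and not taken from the actual ORES code: instead of raising when a threshold is null, the affected filter level is simply left out. The function names, the shape of `levels`, and the "likelybad" entry are assumptions for illustration; the "verylikelybad" entry mirrors the exception payload in this task.

```python
from typing import Optional


def extract_threshold(stats_data: dict, outcome: str, formula: str) -> Optional[float]:
    """Return the threshold for a formula, or None if it is null or missing."""
    stats = stats_data.get(outcome, {}).get(formula)
    if stats is None:
        return None
    return stats.get("threshold")


def usable_filters(levels: dict, stats_data: dict) -> dict:
    """Keep only the levels whose threshold resolved; skip the rest instead of raising."""
    resolved = {}
    for name, (outcome, formula) in levels.items():
        threshold = extract_threshold(stats_data, outcome, formula)
        if threshold is not None:
            resolved[name] = threshold
    return resolved


# Hypothetical level config: name -> (outcome, formula).
levels = {
    "likelybad": ("false", "maximum recall @ precision >= 0.45"),
    "verylikelybad": ("false", "maximum recall @ precision >= 0.75"),
}
stats_data = {
    "false": {
        "maximum recall @ precision >= 0.45": {"threshold": 0.797},
        "maximum recall @ precision >= 0.75": None,
    }
}

filters = usable_filters(levels, stats_data)
print(filters)  # {'likelybad': 0.797}
```

With this approach the page would still render, just without the filter level whose threshold cannot be computed.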

awight lowered the priority of this task from Unbreak Now! to High.Nov 21 2017, 12:03 AM

Reducing the priority. We need to reenable ORES on these wikis very carefully. The ORES server code is rolled back, so this should *theoretically* be a smooth re-enablement.

When will ORES be reenabled?

Change 392845 had a related patch set uploaded (by Ladsgroup; owner: Amir Sarabadani):
[mediawiki/extensions/ORES@master] Disable the filter if ORES says the threshold doesn't exist

https://gerrit.wikimedia.org/r/392845

In T181006#3780900, @MaxBioHazard wrote:

When will ORES be reenabled?

Unfortunately, as we're still dealing with the fallout of the events, we do not plan on reenabling until next week, and even then we aren't entirely sure. Sorry for the inconvenience.

@MaxBioHazard it seems like Release-Engineering-Team would like us to wait until next week -- after the US holiday. I wish we could have it re-enabled sooner. Thanks for your patience and sorry for the inconvenience.

OK, but can you explain why you can't just undo the change that caused this crash? And is the neural network for the Russian language damaged?

The machine predictor for Russian is intact. It's an incompatibility with MediaWiki that caused the problem. We can't just switch the configuration back because it may cause an outage again.

I've just asked in #wikimedia-releng what the chances are of getting a config change through today and will report back.


It is the Wednesday before a long weekend when all of Release Engineering and many in Ops will not be working. There is a rule: "No deploys on Friday." Today is this week's Friday. No :)

This new feature can wait until next week.

Thanks for chiming in @greg. The good news for @MaxBioHazard is that we've narrowed in on the issue so we know what caused it and can move forward with confidence next week. See T181168.

Change 392845 merged by jenkins-bot:
[mediawiki/extensions/ORES@master] Disable the filter if ORES says the threshold doesn't exist

https://gerrit.wikimedia.org/r/392845

"Next week" is here, so we are waiting for ORES to be reenabled.

My patch is merged and I will backport it today. Then we will reenable one wiki to make sure it doesn't cause a problem.

Change 393659 had a related patch set uploaded (by Awight; owner: Amir Sarabadani):
[mediawiki/extensions/ORES@wmf/1.31.0-wmf.8] Disable the filter if ORES says the threshold doesn't exist

https://gerrit.wikimedia.org/r/393659

Change 393659 merged by jenkins-bot:
[mediawiki/extensions/ORES@wmf/1.31.0-wmf.8] Disable the filter if ORES says the threshold doesn't exist

https://gerrit.wikimedia.org/r/393659

Change 393667 had a related patch set uploaded (by Awight; owner: Awight):
[operations/mediawiki-config@master] Reenable ORES on frwiki, ruwiki, and wikidatawiki

https://gerrit.wikimedia.org/r/393667

Change 393667 merged by jenkins-bot:
[operations/mediawiki-config@master] Reenable ORES on frwiki, ruwiki, and wikidatawiki

https://gerrit.wikimedia.org/r/393667

Mentioned in SAL (#wikimedia-operations) [2017-11-27T21:59:37Z] <awight@tin> Synchronized wmf-config/InitialiseSettings.php: Reenable ORES on frwiki, ruwiki, and wikidata; T181006 (duration: 00m 45s)

awight claimed this task.

ORES is reenabled on these wikis.