
June 2021: appservers accumulating active php-fpm workers, requiring rolling restarts to avoid user-visible latency impact
Closed, ResolvedPublicPRODUCTION ERROR

Description

Since approximately Friday, June 25, appservers have been accumulating 'stuck' active workers, which eventually causes a large tail-latency impact.

Number of appservers with zero idle php-fpm workers: https://w.wiki/3YvS

image.png (722×1 px, 94 KB)

Tail latency impact:
image.png (812×1 px, 139 KB)

@Dzahn and @elukey and others have been doing various rolling restarts as seen in SAL.

The cause is not yet clear; perhaps it's related to a certain kind of request, or perhaps it's a newly-introduced bug?

There's also a notable correlation with increased cache misses / regenerations on the WANObjectCache types SqlBlobStore_blob and also filerepo_file_foreign_description. It's unclear whether this is a symptom or a cause.

https://w.wiki/3YvU

image.png (691×1 px, 161 KB)

Event Timeline

CDanis triaged this task as High priority. Sun, Jun 27, 4:30 PM

Some notes from the restarts:

  • The appservers with many busy workers eventually failed pybal health checks and were unable to respond to php7adm /apcu-frag (see T285593)
  • Before I started the last big batch of restarts (~30 hosts), I checked php7adm /apcu-frag on all of those nodes; fragmentation was in the 75%-85% range
  • After a php-fpm restart, each appserver returned to a healthy state and did not fall back into the degraded state (and its latency problems). In other words, I only had to restart php-fpm once per server.

My feeling is that, for some reason, APCu fragmentation on the appservers started to grow steadily, and the latency impact was a side effect of one or more appservers running with more than 90% APCu fragmentation.
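For reference, the fragmentation figure that php7adm /apcu-frag reports is, roughly, the share of APCu's free memory that is locked up in small, non-contiguous blocks. Here is a minimal Python stand-in for that calculation (the 5 MB small-block threshold and the block lists are illustrative assumptions, not the real php-monitoring logic):

```python
# Illustrative sketch: APCu-style fragmentation as the share of free
# memory held in small, non-contiguous blocks. The 5 MB threshold and
# the example block lists are assumptions, not the exact lib.php logic.

SMALL_BLOCK = 5 * 1024 * 1024  # blocks below this count as "fragmented"

def fragmentation_pct(free_blocks):
    """free_blocks: list of free-block sizes in bytes."""
    total_free = sum(free_blocks)
    if total_free == 0:
        return 0.0
    fragmented = sum(b for b in free_blocks if b < SMALL_BLOCK)
    return 100.0 * fragmented / total_free

# A healthy cache: one large contiguous free block.
print(fragmentation_pct([64 * 1024 * 1024]))  # 0.0
# A badly fragmented cache: many tiny leftover slots plus one big block.
print(fragmentation_pct([4096] * 100_000 + [64 * 1024 * 1024]))
```

Under this metric, a host in the 75%-85% range has most of its free APCu memory in slivers too small to satisfy a typical allocation.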

and also filerepo_file_foreign_description

This is something that @Ladsgroup and @tstarling have been working on, specifically around file metadata. I did notice https://gerrit.wikimedia.org/r/c/mediawiki/core/+/699907 in wmf.11 adding/modifying some recursion, but it looks correct to me, assuming the $metadata isn't super deep.

Just as a reminder, mw1384 was depooled during the second wave of appserver issues (over the weekend) so that we could use it to investigate further. Some details about what I currently see:

elukey@mw1384:~$ php7adm opcache-info | jq '.'
{
  "opcache_enabled": true,
  "cache_full": false,
  "restart_pending": false,
  "restart_in_progress": false,
  "memory_usage": {
    "used_memory": 540358888,
    "free_memory": 428286856,
    "wasted_memory": 105096080,
    "current_wasted_percentage": 9.78783518075943
  },
  "interned_strings_usage": {
    "buffer_size": 100663296,
    "used_memory": 38252016,
    "free_memory": 62411280,
    "number_of_strings": 843283
  },
  "opcache_statistics": {
    "num_cached_scripts": 12237,
    "num_cached_keys": 20577,
    "max_cached_keys": 32531,
    "hits": 34126771539,
    "start_time": 1624458727,
    "last_restart_time": 0,
    "oom_restarts": 0,
    "hash_restarts": 0,
    "manual_restarts": 0,
    "misses": 19543,
    "blacklist_misses": 0,
    "blacklist_miss_ratio": 0,
    "opcache_hit_rate": 99.99994273414117
  }
}
elukey@mw1384:~$ php7adm apcu-frag
<!DOCTYPE html>
<html lang="en" dir="ltr">
<meta charset="utf-8">
<title>Wikimedia Error</title>
[..]
elukey@mw1384:~$ sudo journalctl -u php7.2-fpm | tail -n 2
Jun 29 06:02:59 mw1384 php7.2-fpm[36464]: @cee: {"channel":"exception","message":"[YNq3klJLXfqpAEce5jvd5gAAAAQ] PHP Fatal error: Allowed memory size of 524288000 bytes exhausted (tried to allocate 67108872 bytes) in /var/www/php-monitoring/lib.php:41","exception":{"message":"Allowed memory size of 524288000 bytes exhausted (tried to allocate 67108872 bytes)","file":"/var/www/php-monitoring/lib.php:41","trace":"#0 [internal function]: unknown()\n#1 /var/www/php-monitoring/lib.php(41): apcu_sma_info()\n#2 /var/www/php-monitoring/lib.php(348): apcu_frag()\n#3 /var/www/php-monitoring/index.php(41): show_apcu_frag()\n"},"caught_by":"/etc/php/php7-fatal-error.php via php-wmerrors","phpversion":"7.2.34-18+0~20210223.60+debian10~1.gbpb21322+wmf1","servergroup":"appserver","type":"mediawiki","normalized_message":"Allowed memory size of 524288000 bytes exhausted (tried to allocate 67108872 bytes) in lib.php","reqId":"YNq3klJLXfqpAEce5jvd5gAAAAQ","url":"/apcu-frag","ip":"::1","http_method":"GET","server":"wikimedia.org"}
Jun 29 06:02:59 mw1384 php7.2-fpm[36464]: PHP Fatal error:  Allowed memory size of 524288000 bytes exhausted (tried to allocate 67108872 bytes) in /var/www/php-monitoring/lib.php on line 41
elukey@mw1384:~$ free -m
              total        used        free      shared  buff/cache   available
Mem:          63879       12080       10930        6525       40868       44743
Swap:             0           0           0

The php slow log shows entries up to the depool time, and most of the stack traces end in apcu-related calls. I can't really find much more; we should either debug the host further or just restart php-fpm and repool it.

Mentioned in SAL (#wikimedia-operations) [2021-06-29T08:21:38Z] <elukey> depool mw1355 (mw appserver) for debugging - T285634

The problem seems to be quite clearly caused by excessive apcu locking. Let's review the symptoms:

  1. The slow log shows all long-running requests are stuck waiting for apcu
  2. The APCu fragmentation is clearly very high when this happens
  3. The apcu fragmentation check runs out of memory on these appservers, showing that there are far too many slots to enumerate

We should also work backwards from the timing to find which patch caused this.

Mentioned in SAL (#wikimedia-operations) [2021-06-29T08:25:44Z] <elukey> cumin 'A:mw-eqiad' '/usr/local/sbin/restart-php7.2-fpm' -b 2 -s 30 - T285634

Also: the memory gets exhausted by this operation:

$sma_info = apcu_sma_info();

This means that data structure is larger than the ~500 MB memory limit on those servers, which again supports the idea that the problem is many small slots being created in APCu, with overall APCu usage rising significantly.
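A back-of-the-envelope sketch of why that one call can blow the memory limit: apcu_sma_info() returns one entry per free block, so the size of the returned structure grows linearly with fragmentation. The per-entry overhead below is an assumed figure for illustration; the 500 MB limit is the one from the fatal error above.

```python
# Back-of-the-envelope sketch of why materializing apcu_sma_info() can
# exhaust a 500 MB memory limit: the call returns one entry per free
# block, so its size grows linearly with fragmentation. The per-entry
# overhead is an assumed illustrative figure, not PHP's real one.

MEMORY_LIMIT = 524_288_000     # from the fatal error above (500 MB)
ASSUMED_BYTES_PER_ENTRY = 150  # assumed per-block array-entry overhead

def sma_info_estimate(num_free_blocks):
    return num_free_blocks * ASSUMED_BYTES_PER_ENTRY

# With a few thousand free blocks the structure is tiny...
print(sma_info_estimate(10_000) < MEMORY_LIMIT)     # True
# ...but millions of tiny fragmented slots push it past the limit.
print(sma_info_estimate(4_000_000) > MEMORY_LIMIT)  # True
```

So the monitoring endpoint failing with OOM is itself a symptom of the same fragmentation it was trying to measure.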

Joe raised the priority of this task from High to Unbreak Now!. Tue, Jun 29, 8:35 AM
Joe added projects: Release, Train Deployments.
Joe changed the subtype of this task from "Task" to "Production Error".

The recurring problem seems to have started last Friday, June 25th, in the early hours of the day. This points to the latest train release as the culprit:

19:10 dduvall@deploy1002: rebuilt and synchronized wikiversions files: all wikis to 1.37.0-wmf.11

Adding relevant tags because this seems like a train blocker; also setting the priority as UBN! as we have to do rolling restarts of appservers daily, and that's not ok.

and also filerepo_file_foreign_description

This is something that @Ladsgroup and @tstarling have been working on, specifically around file metadata.

I don't think I changed ForeignDBFile::getDescriptionText() or anything in that area. I might have changed some other WANObjectCache calls, but not that one.

In general, WANObjectCache is just meant to work. If calling getWithSetCallback() too much causes this issue, that's a bug in itself, never mind what is calling it. It should either be configured in a different way, or it should be smart enough to not break things.
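For context, the getWithSetCallback() pattern under discussion boils down to: look the key up, and on a miss run the regeneration callback and write the result back. A minimal Python stand-in (not WANObjectCache's real implementation; names and TTL handling are simplified assumptions) that also shows why a jump in miss rate necessarily drags store traffic up with it:

```python
# Minimal sketch of a get-with-set-callback cache pattern. This is a
# Python stand-in, not WANObjectCache's real implementation: every miss
# costs both a get and a set on the backing store, so elevated miss
# rates show up as elevated store traffic too.

import time

class TinyCache:
    def __init__(self):
        self._data = {}  # key -> (expiry_timestamp, value)
        self.gets = 0
        self.sets = 0

    def get_with_set_callback(self, key, ttl, callback):
        self.gets += 1
        entry = self._data.get(key)
        if entry and entry[0] > time.monotonic():
            return entry[1]  # hit: serve the cached value
        value = callback()   # miss: regenerate...
        self.sets += 1
        self._data[key] = (time.monotonic() + ttl, value)  # ...and store
        return value

cache = TinyCache()
for _ in range(100):
    cache.get_with_set_callback("page:Q42", ttl=60, callback=lambda: "rendered")
print(cache.gets, cache.sets)  # 100 1
```

With a healthy hit rate the store path is nearly idle; the graphs above show the opposite happening.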

Mentioned in SAL (#wikimedia-operations) [2021-06-29T08:47:21Z] <elukey> repool mw13[55,84] after debugging - T285634

I don't know if it helps, but the increasing number of busy workers correlates with an elevated number of apcu gets: https://w.wiki/3ZMn

image.png (932×1 px, 125 KB)

image.png (914×1 px, 117 KB)

@jijiki I think this correlation is another hint that we're hitting some form of deadlock in apcu under excessive pressure. Why that happens is still something of a mystery at the moment.

If you expand the above graphs, you can clearly see that something changed during the night between the 24th and the 25th.

I have a suspicion that this Wikibase cache is related:

https://grafana.wikimedia.org/d/u5wAugyik/wikibase-statsdrecordingsimplecache?orgId=1&from=now-7d&to=now

image.png (505×1 px, 99 KB)

This shows a 5-100x increase in hit rate.

There are also some non-trivial increases in miss rate:

image.png (476×1 px, 81 KB)

I think what @Addshore just found is a good candidate for being the source of the issue.

I'll try and get some more info from apcu on servers, although they've all been recently restarted to ease the pressure. I will take several snapshots of the apcu metadata during the day.

Sadly, php-fpm doesn't expose apcu data, just the metadata. We might be able to extract the data with some gdb wizardry, but I'd leave that as a last resort.

Scavenging the production logs, we found that Special:EntityData requests for rdf documents were possibly the culprit.

This is the result of profiling http://www.wikidata.org/wiki/Special:EntityData/Q146190.rdf : https://performance.wikimedia.org/xhgui/run/view?id=60daeb09163043ed29f0b1db

We perform 340k apcu_store calls and 178k apcu_get calls just for that page.
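Toy accounting (hypothetical names and numbers, not Wikibase's actual code) of how a per-referenced-entity cache write turns one render into hundreds of thousands of APCu operations, and how request-local memoization of distinct entities would collapse that:

```python
# Toy accounting of cache traffic per page render. Assumes (for
# illustration only) one apcu get and one store per entity stub built.

def naive_ops(refs):
    # One get + one store per reference, repeats included.
    return {"get": len(refs), "store": len(refs)}

def memoized_ops(refs):
    # Request-local memoization: one get + one store per distinct entity.
    distinct = len(set(refs))
    return {"get": distinct, "store": distinct}

# Hypothetical reference list: two properties referenced over and over.
refs = ["P31", "P279", "P31"] * 50_000
print(naive_ops(refs))     # {'get': 150000, 'store': 150000}
print(memoized_ops(refs))  # {'get': 2, 'store': 2}
```

The 340k stores observed in the profile are consistent with the naive column: the count scales with references in the document, not with distinct entities.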

Scavenging the production logs, we found that Special:EntityData requests for rdf documents were possibly the culprit.

Did the code change, or is it just that someone is spidering Special:EntityData, hitting that code an awful lot?

Scavenging the production logs, we found that Special:EntityData requests for rdf documents were possibly the culprit.

Did the code change, or is it just that someone is spidering Special:EntityData, hitting that code an awful lot?

I'd say the former, given the perfect time correspondence between wmf.11 being deployed to group 1 and sudden increases in apcu activity.

Change 702096 had a related patch set uploaded (by Tarrow; author: Tarrow):

[mediawiki/extensions/Wikibase@master] Use EntityLookup backed TermLookup for Rdf PropertyStubs

https://gerrit.wikimedia.org/r/702096

Change 702018 had a related patch set uploaded (by Ladsgroup; author: Tarrow):

[mediawiki/extensions/Wikibase@wmf/1.37.0-wmf.11] Use EntityLookup backed TermLookup for Rdf PropertyStubs

https://gerrit.wikimedia.org/r/702018

Change 702019 had a related patch set uploaded (by Ladsgroup; author: Tarrow):

[mediawiki/extensions/Wikibase@wmf/1.37.0-wmf.12] Use EntityLookup backed TermLookup for Rdf PropertyStubs

https://gerrit.wikimedia.org/r/702019

Did the code change, or is it just that someone is spidering Special:EntityData, hitting that code an awful lot?

Yes, entities mentioned by other entities started being looked up from secondary data.

https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Wikibase/+/699417 is where this started for all entities and https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Wikibase/+/698474/12/repo/includes/Rdf/RdfBuilder.php#421 is where it happened a little earlier for just properties.

However, with wmf.10 being skipped, it looks like both happened at the same time.

Change 702018 merged by jenkins-bot:

[mediawiki/extensions/Wikibase@wmf/1.37.0-wmf.11] Use EntityLookup backed TermLookup for Rdf PropertyStubs

https://gerrit.wikimedia.org/r/702018

Mentioned in SAL (#wikimedia-operations) [2021-06-29T11:35:10Z] <ladsgroup@deploy1002> sync-file aborted: Backport: [[gerrit:702018|Use EntityLookup backed TermLookup for Rdf PropertyStubs (T285634)]] (duration: 00m 10s)

Mentioned in SAL (#wikimedia-operations) [2021-06-29T11:36:47Z] <ladsgroup@deploy1002> Synchronized php-1.37.0-wmf.11/extensions/Wikibase/repo/includes/Rdf/PropertyStubRdfBuilder.php: Backport: [[gerrit:702018|Use EntityLookup backed TermLookup for Rdf PropertyStubs (T285634)]], Part I (duration: 00m 56s)

Mentioned in SAL (#wikimedia-operations) [2021-06-29T11:38:03Z] <ladsgroup@deploy1002> Synchronized php-1.37.0-wmf.11/extensions/Wikibase/repo/: Backport: [[gerrit:702018|Use EntityLookup backed TermLookup for Rdf PropertyStubs (T285634)]], Part II (duration: 00m 58s)

This is basically done; we just need to wait and see whether it recurs.

Change 702096 merged by jenkins-bot:

[mediawiki/extensions/Wikibase@master] Use EntityLookup backed TermLookup for Rdf PropertyStubs

https://gerrit.wikimedia.org/r/702096

Change 702019 merged by jenkins-bot:

[mediawiki/extensions/Wikibase@wmf/1.37.0-wmf.12] Use EntityLookup backed TermLookup for Rdf PropertyStubs

https://gerrit.wikimedia.org/r/702019

Data on the number of apcu gets/s normalized after the release: https://grafana-rw.wikimedia.org/goto/4-r5Lhk7z

I'm going to optimistically resolve this bug.