
Investigate doubling of hhvm APC value size
Closed, Invalid · Public

Description

Since around September 8th there has been a rise in HHVM APC value size and entry count.

I am not sure if it is an issue, but it seems worth investigating.

See https://grafana.wikimedia.org/dashboard/db/hhvm-apc-usage

apcvalue.png (169 KB)

Screen Shot 2016-09-29 at 20.07.41.png (416 KB)

  • APC value size: from ~1.5 GB to ~3.7 GB (≈2.5× the previous size)
  • APC entries count: from ~85K to ~190K (≈2.2× the previous count)

Event Timeline

The two large bumps happened in the weeks starting exactly on the day group2 (e.g. enwiki) moved to 1.28.0-wmf18. Before that, the graph looks flat and stable.

I don't see any new APC users in core added in wmf18 (nor wmf19).

Looking through MW for ObjectCache::getLocalServerInstance(), I see that CirrusSearch has two APC callers. There were also some suggester changes deployed in 1.28.0-wmf18, and many of the graphs at https://grafana.wikimedia.org/dashboard/db/elasticsearch line up with the APC graphs.
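For reference, this is roughly what a local-server (APC-backed) cache call looks like in MediaWiki; the key components and the callback below are hypothetical, not CirrusSearch's actual code:

```php
// Minimal sketch of the usual local-server cache pattern in MediaWiki.
// The key components and computeClusterSettings() are hypothetical.
$cache = ObjectCache::getLocalServerInstance( CACHE_NONE ); // APC under HHVM, or the given fallback

$settings = $cache->getWithSetCallback(
	$cache->makeKey( 'examplesearch', 'cluster-settings' ), // hypothetical key
	600, // keep for 10 minutes per server
	function () {
		// Expensive lookup that should only be repeated every few minutes per server
		return computeClusterSettings(); // hypothetical helper
	}
);
```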

Hmm, the only cirrus change that involved the APC cache in wmf17->wmf18 should have been a test-only change. The number of keys also increased from 90k to 150k, and is now 180k, but cirrus's usage of the server-local cache is un-parameterized, so in total we should have <2k keys.

The bump in the elasticsearch graphs on 9/8 should coincide with a new background check. Basically we have a cron job that runs regularly and queues jobs to check that the search indices and the database are in sync. That check was pre-existing, but we added a new condition to it that ensures not only that things exist and are in the correct index, but also that they carry the latest revision id. This turned up a bug in our re-indexing process that loses the stored revision ids, which has triggered a reindex (spread over ~2 weeks) of 20% of the search index. I'm not aware of anything in core that this triggers which would use the local server cache when parsing wikitext pages to extract their search fields, though.
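In code terms, the new condition amounts to something like the sketch below (hypothetical; the helper functions and the job class are invented for illustration and are not CirrusSearch's actual Saneitizer code):

```php
// Hypothetical sketch of the background consistency check described above.
// getLatestRevisionIdFromDb(), getRevisionIdFromSearchIndex() and
// ExampleReindexJob are invented names, not CirrusSearch code.
foreach ( $pageIds as $pageId ) {
	$dbRevId = getLatestRevisionIdFromDb( $pageId );
	$indexedRevId = getRevisionIdFromSearchIndex( $pageId );

	// Previously: only check that the page exists in the correct index.
	// New condition: also require the indexed revision id to be current.
	if ( $indexedRevId === null || $indexedRevId !== $dbRevId ) {
		JobQueueGroup::singleton()->push( new ExampleReindexJob( $pageId ) );
	}
}
```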

One potential option: the HHVM admin server has /dump-apc and /dump-apc-meta endpoints. These are marked as "This is extremely expensive and not recommended for use outside of development scenarios". That said, we could probably depool a server that has a particularly large APC and then dump the information. Since the key count increased from 90k to 180k, it might not be too hard to figure out what is taking half the key space.

The /dump-apc endpoint is already enabled and you can use it with curl on a server: https://wikitech.wikimedia.org/wiki/HHVM/Troubleshooting
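A rough sketch of how that could be used on a depooled server (the admin port and the dump file path are assumptions; check the wikitech page and the local HHVM config before relying on either):

```sh
# Rough sketch; run only on a depooled app server, since the dump is expensive.
ADMIN_PORT=9001  # assumption: use whatever port the HHVM admin server is configured on

# Ask HHVM to dump the APC contents.
curl -s "http://localhost:${ADMIN_PORT}/dump-apc"

# HHVM writes the dump to a file on the server (assumed here to be /tmp/apc_dump).
# Group keys by their leading colon-separated components to see what dominates
# the ~180k entries; adjust the field count to the actual key layout.
cut -d: -f1-2 /tmp/apc_dump | sort | uniq -c | sort -rn | head -n 25
```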


Removing the search team tag on this - please let us know if you need more information from us! :)

Krinkle renamed this task from "rise in hhvm apc value size" to "Investigate 250% increase in hhvm apc value size". Sep 29 2016, 7:10 PM
Krinkle triaged this task as High priority.
Krinkle edited projects, added Performance-Team; removed Performance Issue.
Krinkle updated the task description.
Krinkle renamed this task from "Investigate 250% increase in hhvm apc value size" to "Investigate doubling of hhvm APC value size". Sep 29 2016, 7:34 PM

Many HHVM instances have not been restarted since August. Since many APC keys include the branch name, we can expect the size to grow substantially as each new branch is deployed, provided HHVM is not restarted. This appears to be the case.
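As a hedged illustration of that effect (the key shape below is hypothetical; the point is only that embedding the deployed version in a key makes every branch's entries distinct, while the previous branch's entries linger until they expire or HHVM restarts):

```php
// Hypothetical example of a version-scoped cache key; not taken from core or
// any particular extension.
global $wgVersion; // e.g. "1.28.0-wmf18", then "1.28.0-wmf19" after the next deploy

$cache = ObjectCache::getLocalServerInstance( CACHE_NONE );
$key = $cache->makeKey( 'example-component', $wgVersion, 'settings' ); // hypothetical key shape

// Each weekly branch creates a fresh set of such keys, roughly doubling the
// footprint across a deploy boundary if the old entries have long TTLs and
// HHVM is never restarted.
$cache->set( $key, computeSettings(), 7 * 86400 ); // computeSettings() is hypothetical
```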