Page MenuHomePhabricator

GrowthExperiments:graduateEligibleMentees.php maintenance scripts OoMs for eswiki
Open, HighPublic

Description

I tried running GrowthExperiments:graduateEligibleMentees.php (a brand new maintenance script for T403563) in production eswiki, and it OoM'ed:

[urbanecm@deploy1003 ~]$ mwscript-k8s -f GrowthExperiments:graduateEligibleMentees.php -- --wiki=eswiki --dry-run
⏳ Starting GrowthExperiments:graduateEligibleMentees.php on Kubernetes as job mw-script.eqiad.vlmj41vw ...
🚀 Job is running.
📜 Streaming logs:
Processing mentor Oscar ....    done! Would graduate 25 mentees so far.
Processing mentor Tsaorin...    done! Would graduate 35 mentees so far.
/usr/local/bin/mwscript: line 124:     8 Killed                  ${RUNNER} ${CMD} "${@}"
Processing mentor Hard...[urbanecm@deploy1003 ~]$
[urbanecm@deploy1003 ~]$ kubectl get job mw-script.eqiad.vlmj41vw
NAME                       COMPLETIONS   DURATION   AGE
mw-script.eqiad.vlmj41vw   0/1           18m        18m
[urbanecm@deploy1003 ~]$ kubectl get pods -l job-name=mw-script.eqiad.vlmj41vw
NAME                             READY   STATUS      RESTARTS   AGE
mw-script.eqiad.vlmj41vw-vgcbs   0/3     OOMKilled   0          18m
[urbanecm@deploy1003 ~]$

The resource limits are currently set to:

resources:
  limits:
    cpu: "1"
    memory: 1200Mi
  requests:
    cpu: "1"
    memory: 1000Mi

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald Transcript

Since it would be very useful to have profiling data on this (as well as for T395893, another OoM issue Growth has ATM), I tried looking into ways how to profile scripts in production. I used https://wikitech.wikimedia.org/wiki/WikimediaDebug#Plaintext_CLI_profile to start from.

Unfortunately, that page says "FIXME: Update using mwscript-k8s" and merely replacing mwscript with mwscript-k8s doesn't work:

[urbanecm@deploy1003 ~]$ mwscript-k8s -f showSiteStats.php -- --wiki=nlwiki --profiler=text
⏳ Starting showSiteStats.php on Kubernetes as job mw-script.eqiad.eqzqaeuk ...
🚀 Job is running.
📜 Streaming logs:
Total edits       : 69727796
Number of articles:  2197122
Total pages       :  4728808
Number of users   :  1443595
Active users      :     8531
Number of images  :       20
[urbanecm@deploy1003 ~]$

I looked into why that might be. Firstly, --profiler depends on having a reasonable profiling config in $wgProfiler. In production, this variable is constructed as $wmgProfiler in src/Profiler.php (in config repo), which is then copied over to $wgProfiler by CommonSettings.php.

src/Profiler.php is supposed to be started via PhpAutoPrepend.php, which is in turn executed via the auto_prepend_file config by PHP itself. For some reason, auto_prepend_file is only set for web workers, not in CLI, which seems to be intentional:

* PHP is configured to execute this file before the main script. This uses the
* `auto_prepend_file` setting. This is currently enabled on web requests
* only, not on Maintenance/CLI processes.

However, in beta, this is _not_ the case:

urbanecm@deployment-deploy04:~$ mwscript shell.php cswiki
Psy Shell v0.12.10 (PHP 8.3.24 — cli) by Justin Hileman
> ini_get('auto_prepend_file')
= "/srv/mediawiki/wmf-config/PhpAutoPrepend.php"

> ^D

   INFO  Ctrl+D.

urbanecm@deployment-deploy04:~$

and indeed, profiling information can be generated using --profiler=text in beta:

urbanecm@deployment-deploy04:~$ mwscript showSiteStats.php --wiki=cswiki --profiler=text
Total edits       : 4823
Number of articles: 2681
Total pages       : 2902
Number of users   :  581
Active users      :    5
Number of images  :    0
<!--
100.00% 251.247      1 - main()
 40.87% 102.680      1 - MediaWiki\Maintenance\MaintenanceRunner::run
 38.98% 97.927      1 - ShowSiteStats::execute
 33.97% 85.359      1 - Wikimedia\Rdbms\SelectQueryBuilder::fetchRow
 33.97% 85.349      1 - Wikimedia\Rdbms\DBConnRef::selectRow
 33.97% 85.337      1 - Wikimedia\Rdbms\DBConnRef::__call
 31.64% 79.497      1 - Wikimedia\Rdbms\DBConnRef::ensureConnection
 31.62% 79.453      1 - Wikimedia\Rdbms\LoadBalancer::getConnectionInternal
 30.84% 77.474      1 - Wikimedia\Rdbms\LoadBalancer::getReaderIndex
[...]

That all being said, --profiler=text (if it works) only includes information about the time spent in individual methods, not information about memory consumption. So even if it was working, it would not be directly useful.

There was also a short conversation on this in #wikimedia-sre:

00:59 <urbanecm> is it possible to get profiling data from jobs (even completed ones)?
00:59 <urbanecm> i can run the job on smaller set of data and profiling could help me understand what is going on
01:00 <rzl> hm, I don't know enough about PHP profiling to give you a solid answer, but I suspect it's doable with a little wiggling
01:01 <rzl> two broad approaches are to log whatever data you need to stdout on some container (in which case you can retrieve it after completion with `kubectl logs`) or to get the pod to stay running so that you can talk to it -- in which case maybe you launch a `shell.php` job and invoke your real work from inside it
01:05 <urbanecm> hmm, https://wikitech.wikimedia.org/wiki/WikimediaDebug#Plaintext_CLI_profile seems to have some guidance, but just running things with --profiler=text doesn't seem to do anything
01:08 <urbanecm> it _does_ work in beta though
01:09 <rzl> yeah I'm not sure -- the wrapper script should pass --profiler=text through to mwscript just fine, but my guess is whatever profiling stuff it needs isn't actually available in the cli image
01:14 <swfrench-wmf> wait, hmmm ... urbanecm: do you happen to know why profiler you need? I just realized that xhprof _is_ in the cli image :)
01:16 <urbanecm> yeah, i'm just reading https://github.com/wikimedia/operations-mediawiki-config/blob/master/src/Profiler.php#L65, which does have some logic for setting that up
01:18 <swfrench-wmf> indeed, we recently switched it over from tideways
01:23 <bd808> urbanecm: K.rinkle would be a good person to poke for help in figuring out profiling. He's the wizard at most of that kind of stuff in prod.
01:23 <swfrench-wmf> ^ this :)

@Krinkle, do you have any suggestions for how to proceed here? If we had information about how much memory is used (ideally, by uploading data to https://performance.wikimedia.org/xhgui on-request, similar to XWikimediaDebug), this would be much easier to investigate.

I succeeded in getting some profiling data for the GrowthExperiments:graduateEligibleMentees.php --wiki=eswiki --mentor=Astromessier execution@eswiki. They are available on https://performance.wikimedia.org/xhgui/run/view?id=68dbff845b0226f971a7f33d, and I'll be posting an explanatory comment that describes how I got there.

That all being said, --profiler=text (if it works) only includes information about the time spent in individual methods, not information about memory consumption. So even if it was working, it would not be directly useful.

This is only partially true. ProfilerXhprof::getFunctionReport() only prints time spent, but it has access to memory consumption information as well. So, this would be a matter of getting --profiler working _and_ updating the output format to include more information.

I spent some time looking into how we can get more data about this. First, I was thinking how to trigger profiling code from src/Profiler.php (in the config repo) in a maintenance script context. Whenever profiling is in use, src/Profiler.php (in the config repo) is responsible for controlling it. Specifically, that file:

  1. Runs xhprof_enable() with XHPROF_FLAGS_MEMORY (among other flags), meaning we do have memory usage information
  2. Relies on ProfilerXhprof (in core) to generate textual representation of the profiling data (see few comments above for an example)
  3. When XWikimediaDebug's profiling option is enabled, it uses XhguiSaverPdo (in the config repo) to push profiling data to https://performance.wikimedia.org/xhgui/, where it can be analyzed

As prior investigation showed, the textual representation generated by ProfilerXhprof does not include any information about memory. This doesn't mean it cannot be added, as the XHPROF_FLAGS_MEMORY flag is passed, meaning this information is available.

However, memory consumption data is pushed to the https://performance.wikimedia.org/xhgui database. So, I started looking into whether I can convince Profiler.php to do this via a CLI. It turns out this is possible under two conditions:

  • wmf-config/PhpAutoPrepend.php is included "fairly early" during the lifecycle (under normal condtions, it is included by the autoprepend feature as the very first thing that runs),
  • $profileToXhgui in Profiler::xhprofSetup evaluates to true

The second condition is easy to fulfil by live hacking (simply override the value of that variable to true). The first one is more trickier, as we cannot livehack php configuration (at least, not easily). However, it turns out the file doesn't need to be included as the very first thing (at the cost of profiling accuracy). In practice, it seems to be sufficient to run it as the first command in CommonSettings.php (which can be also done via livehacking).

In production, mw-experimental is the environment for livehacking. This made me ask: how can I run maintenance scripts in mw-experimental?

It turns out that with the mw-experimental-deploy credentials, I can run kubectl exec, shelling into the mw-experimental pod, running /srv/mediawiki/multiversion/MWScript.php manually. So, I made the above-described changes in mw-experimental using the documented process, and then run the maintenance script as follows:

[urbanecm@deploy2002 ~]$ kubectl exec -c mediawiki-pinkllama-app mw-experimental.codfw.pinkllama-779d86d874-znl5n -it -- php /srv/mediawiki/multiversion/MWScript.php Version.php --wiki=dewiki

As a result, https://performance.wikimedia.org/xhgui has the profiling data, as expected.


As I'm writing this, I realise it might've been wiser to include PhpAutoPrepend.php from MWScript.php instead, as that is the entrypoint I am executing. In that case, the final effect might be very close to what auto prepend would do.

To verify whether this is still an issue, I'm re-running the script again (on eswiki, in dry-run):

[urbanecm@deploy2002 ~]$ mwscript-k8s -f GrowthExperiments:graduateEligibleMentees.php -- --wiki=eswiki --dry-run
ℹ️ Your job will run on PHP 8.3 (T405955). If you encounter a compatibility issue, you can use --php_version 8.1 to explicitly select 8.1.
⏳ Starting GrowthExperiments:graduateEligibleMentees.php on Kubernetes as job mw-script.codfw.1eu0hage ...

Okay, this OOM'ed fast:

[urbanecm@deploy2002 ~]$ kubectl get pods -l job-name=mw-script.codfw.1eu0hage
NAME                             READY   STATUS      RESTARTS   AGE
mw-script.codfw.1eu0hage-h8rrt   0/3     OOMKilled   0          4m53s
[urbanecm@deploy2002 ~]$