Have some dedicated jobrunners that aren't active videoscalers
Closed, ResolvedPublic

Description

Follow-up from https://wikitech.wikimedia.org/wiki/Incident_documentation/2021-03-30_Jobqueue_overload

Currently all of our jobrunners are also videoscalers. This is usually fine, but high videoscaling demand can cause spikes of ffmpeg usage that overload the host. There's also a difference in importance: it's pretty important that the job queue runs smoothly, and users will notice quickly when it's delayed, while video scaling is usually less important as it only affects people uploading new media and trying to use them in articles.

In this week's outage the job queue basically stalled while videoscalers sucked up all the CPU. I propose that we *always* have 2-4 jobrunners that are depooled from the videoscaling cluster and only execute standard jobs, so that if there is another videoscaler overload it doesn't fully stop the job queue.

Event Timeline

On the change-prop side we already route all video scaling jobs to videoscaler.discovery.wmnet and all other jobs to jobrunner.discovery.wmnet, so no change is needed there.
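
For context, here is a minimal sketch of that routing split, assuming the video-scaling job types are webVideoTranscode and webVideoTranscodePrioritized (an assumption; the real rules live in the change-prop configuration, not in Python code like this):

# Illustrative only: change-prop routes jobs via its rule configuration,
# not via a function like this. The job type names are assumptions.
VIDEOSCALING_JOB_TYPES = {"webVideoTranscode", "webVideoTranscodePrioritized"}

def pick_endpoint(job_type: str) -> str:
    """Return the discovery hostname a job of the given type is routed to."""
    if job_type in VIDEOSCALING_JOB_TYPES:
        return "videoscaler.discovery.wmnet"
    return "jobrunner.discovery.wmnet"

assert pick_endpoint("webVideoTranscode") == "videoscaler.discovery.wmnet"
assert pick_endpoint("refreshLinks") == "jobrunner.discovery.wmnet"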

I agree; we should start with 2 servers with a higher weight than the others, and adjust if we run into similar issues again. I think we are getting new servers in eqiad, so we could use 2 from the new batch for this.

We should probably keep reserved capacity for both clusters: something like 4 for jobrunners and 2 for videoscalers.

@akosiaris I would start like this then: 2 (jr) + 2 (vs) + 20 (both), and see if it is sensible enough.

edit: typo in number of servers

jijiki triaged this task as Medium priority. Apr 2 2021, 11:47 AM

@akosiaris I would start like this then: 2 (jr) + 2 (vs) + 2 (both), and see if it is sensible enough.

Sure. Don't go overboard with trying to pick the right numbers; this is about mitigating a risk (a massive server-side upload) that we can also react to quickly and have runbooks for, plus this work will become moot when mediawiki is moved to kubernetes. Interestingly, the jobrunners are one of the first (if not THE FIRST) clusters we can deploy kubernetes pods for, as they're not in the hot path of end-user requests.

videoscalers

{ 'host': 'mw1335.eqiad.wmnet', 'weight':20, 'enabled': True }
{ 'host': 'mw1336.eqiad.wmnet', 'weight':20, 'enabled': True }

jobrunners

{ 'host': 'mw1337.eqiad.wmnet', 'weight':20, 'enabled': True }
{ 'host': 'mw1338.eqiad.wmnet', 'weight':20, 'enabled': True }

Let's monitor and see if that is sensible.

+1

Once we give it a thumbs up, let's not forget to make it permanent in https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/conftool-data/node/eqiad.yaml#134 (and the videoscalers stanza a few lines below).

I just mentioned this in T279309, since new hardware is being racked in T273915 and I was wondering if I should assign any of it as dedicated jobrunners.

But the ones you picked are fine as well; they will be in use for at least another year.

+1 to those from me too. Note that we'll also need to make this distinction in codfw.

I think it would make sense to have the newer hardware on double duty (both jobrunner+videoscaler) since it'll be less likely to hit an overload.

ACK, thanks Legoktm. That does make sense.

Change 678926 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] site/conftool-data: designate mw2394,mw2395 as dedicated jobrunners

https://gerrit.wikimedia.org/r/678926

Change 678926 merged by Dzahn:

[operations/puppet@production] site/conftool-data: designate mw2394,mw2395 as dedicated jobrunners

https://gerrit.wikimedia.org/r/678926

Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts:

mw2394.codfw.wmnet

The log can be found in /var/log/wmf-auto-reimage/202104132056_dzahn_9242_mw2394_codfw_wmnet.log.

Mentioned in SAL (#wikimedia-operations) [2021-04-13T20:58:56Z] <mutante> mw2395, mw2395 - reimaging as jobrunners (T279100)

Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts:

mw2395.codfw.wmnet

The log can be found in /var/log/wmf-auto-reimage/202104132103_dzahn_10155_mw2395_codfw_wmnet.log.

Completed auto-reimage of hosts:

['mw2394.codfw.wmnet']

and were ALL successful.

Completed auto-reimage of hosts:

['mw2395.codfw.wmnet']

and were ALL successful.

mw2394 and mw2395 have been reimaged as jobrunners/videoscalers and then I pooled them into the jobrunner cluster but NOT the videoscaler cluster.

I set the weight for jobrunner to 15 (regular jobrunners have weight 10).
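
To put those weights in rough perspective, here is a back-of-the-envelope share calculation for weight 15 versus weight 10; the host counts used below are assumptions for the sake of the example, not the real cluster sizes:

# Rough illustration of weighted balancing: a pooled host's share of
# jobrunner traffic is its weight divided by the sum of all pooled weights.
# Host counts are assumptions, not the real cluster inventory.
regular_hosts, regular_weight = 20, 10
dedicated_hosts, dedicated_weight = 2, 15

total_weight = regular_hosts * regular_weight + dedicated_hosts * dedicated_weight
print(f"regular host share:   {regular_weight / total_weight:.1%}")    # ~4.3%
print(f"dedicated host share: {dedicated_weight / total_weight:.1%}")  # ~6.5%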

@Legoktm Would you say this is resolved (for) now?

I think we're missing 2 codfw servers that are *only* videoscalers?

I'm a bit confused now. I thought that was the question we talked about in today's meeting.

Change 679022 had a related patch set uploaded (by Legoktm; author: Legoktm):

[operations/puppet@production] conftool-data: Document which servers are only pooled as jobrunner/videoscaler

https://gerrit.wikimedia.org/r/679022

uh, that's right, my bad >.<

I submitted a documentation patch just to make sure it's explained why those servers are depooled, but I think this is done otherwise.

Change 679022 merged by Legoktm:

[operations/puppet@production] conftool-data: Document which servers are only pooled as jobrunner/videoscaler

https://gerrit.wikimedia.org/r/679022

Documentation patch +1; confirmed that's how it is now. Thanks. I could also be wrong; if we still want "videoscaler-only" in codfw as well, I can do that of course.

Change 679258 had a related patch set uploaded (by Alexandros Kosiaris; author: Alexandros Kosiaris):

[operations/puppet@production] conftool: Create a shared jobrunner_videoscaler

https://gerrit.wikimedia.org/r/679258

I've gone a bit overboard and created https://gerrit.wikimedia.org/r/679258, which uses YAML map merges to implement the patches above in the data structures as well (a generic sketch of the merge-key mechanism follows below). That will probably help avoid some unfortunate typo or mistake where the actual state in conftool differs from what we've documented. What do you think?

While crafting it I also noted a minor discrepancy, so there is a TODO in the patch about that. We seem to only have 1 dedicated videoscaler in codfw; we probably want to have 2 to match eqiad?

akosiaris@cumin1001:~$ sudo confctl select 'cluster=videoscaler,service=apache2' get |grep no
{"mw2395.codfw.wmnet": {"weight": 0, "pooled": "no"}, "tags": "dc=codfw,cluster=videoscaler,service=apache2"}
{"mw2394.codfw.wmnet": {"weight": 0, "pooled": "no"}, "tags": "dc=codfw,cluster=videoscaler,service=apache2"}
{"mw1338.eqiad.wmnet": {"weight": 10, "pooled": "no"}, "tags": "dc=eqiad,cluster=videoscaler,service=apache2"}
{"mw1337.eqiad.wmnet": {"weight": 10, "pooled": "no"}, "tags": "dc=eqiad,cluster=videoscaler,service=apache2"}
akosiaris@cumin1001:~$ sudo confctl select 'cluster=jobrunner,service=apache2' get |grep no
{"mw2410.codfw.wmnet": {"weight": 10, "pooled": "no"}, "tags": "dc=codfw,cluster=jobrunner,service=apache2"}
{"mw1336.eqiad.wmnet": {"weight": 10, "pooled": "no"}, "tags": "dc=eqiad,cluster=jobrunner,service=apache2"}
{"mw1335.eqiad.wmnet": {"weight": 10, "pooled": "no"}, "tags": "dc=eqiad,cluster=jobrunner,service=apache2"}

I think we are overengineering this a bit. Next time we have an issue like that, our first actions will be to simply check the status of each cluster on etcd and rebalance. I feel that comments on puppet are good enough.

True, but I wasn't wondering about what we will do when we have an issue; that much is clear. I wanted to more accurately depict the "normal" state of things and make it a tad easier to return to it after having taken corrective action due to an issue.

We seem to only have 1 dedicated videoscaler in codfw, we probably want to have 2 to match

I am still a bit unclear on whether we want 0 or 2, but 1 seems wrong either way, and for my part at least I did not intend it to be that way.

Alex, your patch looks good but I can also see Effie's point. hmm...

+1 to overengineering but given you've already done the overengineering I don't really see any reason not to merge it.

Some "errors" restarting php-fpm and depooling services popped up while running the train today and it was suggested to mention it here. Here is the stacktrace:

[Screenshot attachment: Screen Shot 2021-04-14 at 12.35.54 PM.png, 204 KB]

Change 679432 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] conftool: fix TODO by adding 2 dedicated codfw videoscalers

https://gerrit.wikimedia.org/r/679432

Reopening. The bug that @jeena reported in T279100#7000270 is reproducible. It seems that safe-service-restart won't take the previous state (pooled/depooled) of a resource into account when verifying that everything is pooled.

So, the crux of the issue is in the two functions below:

def pool(self, pooled):
    """Pool a list of services"""
    try:
        for obj in pooled:
            obj.update({"pooled": "yes"})
        self._verify_status(True)
        return True 
    except (BackendError, PoolStatusError) as e:
        logger.error("Error depooling the servers: %s", e)
        return False

def _verify_status(self, want_pooled):
    if want_pooled:
        desired_status = "enabled/up/pooled"
    else:
        desired_status = "disabled/*/not pooled"
    for baseurl in self.lvs_uris:
        parsed = parse.urlparse(baseurl)
        url = "{baseurl}/{fqdn}".format(baseurl=baseurl, fqdn=self.fqdn)
        # This will raise a PoolStatusError
        logger.debug("Now verifying %s", url) 
        self._fetch_retry(url, want_pooled, parsed, desired_status)

The pool() function gets a list of services to pool. In the specific case that jeena raised, it will only pool the service that was already pooled before the script started (that is, the jobrunner service) and correctly leave the videoscaler service depooled. However, it then calls the _verify_status() function, which has no knowledge of that previous state and relies only on the LVS URIs passed on the command line.

Let me see if I can fix that easily.
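
For illustration, here is a simplified, standalone sketch of the direction such a fix could take; this is not the merged patch, and the helper, exception, and data shapes are assumptions. The point is that verification is restricted to the services that were actually (re)pooled instead of every LVS URI passed on the command line.

# Simplified sketch, not the real safe-service-restart code.

class PoolStatusError(Exception):
    """Raised when a host is not in the desired pool state."""

def fetch_status(lvs_uri, fqdn):
    # Stand-in for the real HTTP check against the load balancer API;
    # stubbed out so the sketch is self-contained.
    return "enabled/up/pooled"

def verify_status(services, fqdn, want_pooled):
    """Verify the pool state only for the services in scope."""
    desired = "enabled/up/pooled" if want_pooled else "disabled/*/not pooled"
    for svc in services:
        status = fetch_status(svc["lvs_uri"], fqdn)
        if status != desired:
            raise PoolStatusError(f"{fqdn}: wanted {desired}, got {status}")

def pool(previously_pooled, fqdn):
    """Re-pool only the services that were pooled before the restart."""
    for obj in previously_pooled:
        obj.update({"pooled": "yes"})
    # Unlike the version above, only the re-pooled services are verified, so
    # a host deliberately depooled from videoscaler is not flagged as broken.
    verify_status(previously_pooled, fqdn, want_pooled=True)

# Hypothetical usage: only the jobrunner service is in scope here.
pool([{"lvs_uri": "http://lvs.example/jobrunner", "pooled": "no"}], "mw2394.codfw.wmnet")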

Change 681676 had a related patch set uploaded (by Effie Mouzeli; author: Effie Mouzeli):

[operations/puppet@production] (WIP) conftool: improve safe-service-restart multiple cluster support

https://gerrit.wikimedia.org/r/681676

Change 679432 merged by Dzahn:

[operations/puppet@production] conftool: add comments about 2 dedicated videoscalers

https://gerrit.wikimedia.org/r/679432

I think we should pool all servers back to all pools, and lower the weight of those we originally wanted in one pool, until we fix the restart script properly.

Change 682619 had a related patch set uploaded (by Alexandros Kosiaris; author: Alexandros Kosiaris):

[operations/puppet@production] safe-service-restart: Only verify in scope services

https://gerrit.wikimedia.org/r/682619

I think we should pool all servers back to all pools, and lower the weight of those we originally wanted in one pool, until we fix the restart script properly.

Yeah, that could work. Overall it might indeed be better to play with weights instead of the actual pool/depool status. It's probably a fine stopgap until mediawiki makes it to k8s, which will make most of this discussion moot.

That being said, I did try a quick solution at least to the bug that @jeena mentioned, see https://gerrit.wikimedia.org/r/682619

Change 684287 had a related patch set uploaded (by Giuseppe Lavagetto; author: Giuseppe Lavagetto):

[operations/puppet@production] safe-service-restart: only verify pooled services

https://gerrit.wikimedia.org/r/684287

Change 681676 abandoned by Effie Mouzeli:

[operations/puppet@production] (WIP) conftool: improve safe-service-restart multiple cluster support

Reason:

https://gerrit.wikimedia.org/r/681676

Change 684287 merged by Giuseppe Lavagetto:

[operations/puppet@production] safe-service-restart: only verify pooled services

https://gerrit.wikimedia.org/r/684287

Removing SRE; this has already been triaged to a more specific SRE subteam.

Joe claimed this task.
Joe moved this task from Backlog to Done on the SRE-Sprint-Week-Sustainability-March2023 board.