
WDQS data-transfer cookbook needs to wait for updater to catch up on lag
Closed, Resolved (Public)

Description

As a user of WDQS, I only want non-lagged servers to be pooled.
As an operator of WDQS, I don't want to be alerted about transient issues that are part of normal operations during data reload.

The WDQS data-transfer cookbook needs to wait for the updater to catch up on lag before removing the Icinga downtime and repooling servers.

We already have code to wait for updater to catch up on lag in the data-reload cookbook. This code should be moved to Spicerack (so that it is in a central location) and should be used for both the data-reload and data-transfer cookbooks.

AC:

  • no alerts are raised during normal operations of the data-transfer cookbook
  • wdqs servers are not pooled until they have fully caught up on lag
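
The wait described above can be sketched as a simple polling loop. This is a minimal illustration, not the code from the data-reload cookbook or Spicerack: `wait_for_lag`, its parameters, and the default threshold, retry count, and delay are all hypothetical placeholders, and the caller is assumed to supply a `get_lag` function that returns the updater lag in seconds.

```python
import time
from typing import Callable


def wait_for_lag(
    get_lag: Callable[[], float],
    max_lag: float = 1200.0,   # hypothetical threshold, in seconds
    tries: int = 1000,         # hypothetical retry budget
    delay: float = 600.0,      # hypothetical pause between polls, in seconds
    sleep: Callable[[float], None] = time.sleep,
) -> float:
    """Poll get_lag() until the lag is at or below max_lag, then return it."""
    for attempt in range(1, tries + 1):
        lag = get_lag()
        if lag <= max_lag:
            return lag
        # Do not sleep after the final attempt; fail immediately instead.
        if attempt < tries:
            sleep(delay)
    raise TimeoutError(
        "updater lag still above {}s after {} attempts".format(max_lag, tries)
    )
```

A cookbook built this way would only remove the downtime and repool the host after `wait_for_lag` returns; if it raises, the host stays depooled.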

Event Timeline

Restricted Application added a subscriber: Aklapper.

Change 679320 had a related patch set uploaded (by Gehel; author: Gehel):

[operations/cookbooks@master] WDQS: Wait for updater to catchup during data transfer.

https://gerrit.wikimedia.org/r/679320

Change 679320 merged by Ryan Kemper:

[operations/cookbooks@master] WDQS: Wait for updater to catchup during data transfer.

https://gerrit.wikimedia.org/r/679320

Mentioned in SAL (#wikimedia-operations) [2021-04-15T04:14:14Z] <ryankemper> T280108 T267927 wdqs2008 (source) caught up on lag, xfering to wdqs1004: sudo -i cookbook sre.wdqs.data-transfer --source wdqs2008.codfw.wmnet --dest wdqs1004.eqiad.wmnet --reason "transferring wikidata journal following reload from dumps" --blazegraph_instance blazegraph --task-id T267927

Mentioned in SAL (#wikimedia-operations) [2021-04-15T06:33:53Z] <ryankemper> !log T280108 T267927 data-transfer to wdqs1004 was successful; cookbook failed due to a newly introduced minor type error that didn't effect the transfer itself

Tested out the new cookbook; there's at least one small issue:

Exception raised while executing cookbook sre.wdqs.data-transfer:
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/spicerack/_menu.py", line 234, in run
    raw_ret = runner.run()
  File "/usr/lib/python3/dist-packages/spicerack/_module_api.py", line 18, in run
    return self._run(self.args, self.spicerack)
  File "/srv/deployment/spicerack/cookbooks/sre/wdqs/data-transfer.py", line 179, in run
    wait_for_updater(prometheus, site(source), source)
  File "/usr/lib/python3/dist-packages/spicerack/decorators.py", line 104, in wrapper
    return func(*args, **kwargs)  # type: ignore
  File "/srv/deployment/spicerack/cookbooks/sre/wdqs/__init__.py", line 25, in wait_for_updater
    last_updated = int(result['value'][1])
TypeError: list indices must be integers or slices, not str
END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99)

The error seems quite straightforward: `result` is a list, not a dictionary, so `result['value'][1]` fails. We'll want to patch this.
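
For context, the Prometheus HTTP API returns differently shaped results for scalar and vector instant queries: a scalar is a `[timestamp, "value"]` pair, while a vector is a list of objects that each carry a `value` pair. The sketch below shows how both shapes could be handled. It assumes the client hands back the API's `result` field unchanged, which is consistent with the traceback but not confirmed here; `extract_sample` is a hypothetical helper, not part of Spicerack.

```python
from typing import Any, List


def extract_sample(result: List[Any]) -> str:
    """Return the sample value (a string) from an instant-query result."""
    # Scalar result: [<unix_timestamp>, "<value>"]
    if len(result) == 2 and isinstance(result[1], str):
        return result[1]
    # Vector result: [{"metric": {...}, "value": [<ts>, "<value>"]}, ...]
    if result and isinstance(result[0], dict):
        return result[0]["value"][1]
    raise ValueError("unexpected result shape: {!r}".format(result))
```

Indexing a scalar result with `result['value']` raises the `TypeError` shown in the traceback, because the list only accepts integer indices.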

Change 679702 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/cookbooks@master] wdqs: result is list not dictionary

https://gerrit.wikimedia.org/r/679702

(For future reference)

import logging
logging.basicConfig(level=logging.DEBUG)
from spicerack import Spicerack
spicerack = Spicerack(verbose=True, dry_run=True)


def site(host):
    """Hacky way to get the site in which a host is located."""
    if 'eqiad' in str(host):
        return 'eqiad'
    if 'codfw' in str(host):
        return 'codfw'
    raise ValueError('Site is unknown for {host}.'.format(host=host))

remote = spicerack.remote()
prometheus = spicerack.prometheus()

remote_host = remote.query('wdqs2008.codfw.wmnet')

host = remote_host.hosts[0].split(".")[0]
query = "scalar(time() - blazegraph_lastupdated{instance='%s:9193'})" % host
result = prometheus.query(query, site(remote_host))
last_updated = int(result[1])
if last_updated > 1200:
    raise ValueError("Let's wait for updater to catch up, current last_updated of {} is too high.".format(last_updated))

Snippet used to diagnose the above issue without actually running the cookbook.

Change 679702 merged by Ryan Kemper:

[operations/cookbooks@master] wdqs: result is list not dictionary

https://gerrit.wikimedia.org/r/679702

Mentioned in SAL (#wikimedia-operations) [2021-04-15T16:17:13Z] <ryankemper> T280108 T267927 Merged https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/679702 and ran puppet-agent on cumin2001 before next round of wdqs data-transfers

Mentioned in SAL (#wikimedia-operations) [2021-04-15T16:21:44Z] <ryankemper> T280108 T267927 Current wdqs transfers in progress: wqds1004->wdqs1005, wdqs2008->wdqs2001

100.0% (2/2) success ratio (>= 100.0% threshold) for command: 'systemctl start ...art wdqs-updater'.
100.0% (2/2) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
[1/1000, retrying in 600.00s] Attempt to run 'cookbooks.sre.wdqs.wait_for_updater' raised: invalid literal for int() with base 10: 'NaN'
[2/1000, retrying in 600.00s] Attempt to run 'cookbooks.sre.wdqs.wait_for_updater' raised: invalid literal for int() with base 10: '4458.164000034332'
[3/1000, retrying in 600.00s] Attempt to run 'cookbooks.sre.wdqs.wait_for_updater' raised: invalid literal for int() with base 10: '4230.2699999809265'
[4/1000, retrying in 600.00s] Attempt to run 'cookbooks.sre.wdqs.wait_for_updater' raised: invalid literal for int() with base 10: '3850.37700009346'
[5/1000, retrying in 600.00s] Attempt to run 'cookbooks.sre.wdqs.wait_for_updater' raised: invalid literal for int() with base 10: '3175.4830000400543'
[6/1000, retrying in 600.00s] Attempt to run 'cookbooks.sre.wdqs.wait_for_updater' raised: invalid literal for int() with base 10: '888.5910000801086'
[7/1000, retrying in 600.00s] Attempt to run 'cookbooks.sre.wdqs.wait_for_updater' raised: invalid literal for int() with base 10: '39.6949999332428'
[8/1000, retrying in 600.00s] Attempt to run 'cookbooks.sre.wdqs.wait_for_updater' raised: invalid literal for int() with base 10: '42.799999952316284'

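Another small issue: Prometheus returns the value as a float-formatted string (e.g. '4458.164000034332', or 'NaN' on the first attempt), which int() cannot parse directly. Looks like we need to tell it to truncate (or just use floats instead of ints).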
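
One way to truncate, sketched here under the assumption that the value always arrives as a string, is to parse it as a float first and then convert to int. `parse_lag_seconds` is a hypothetical helper for illustration and is not the change that was eventually merged.

```python
import math


def parse_lag_seconds(raw: str) -> int:
    """Parse a Prometheus sample string into whole seconds."""
    value = float(raw)  # accepts '4458.164000034332' as well as 'NaN'
    if math.isnan(value):
        # No sample yet (e.g. the updater has just started); let the caller retry.
        raise ValueError("lag is NaN; metric not available yet")
    return int(value)  # int() on a float truncates toward zero
```

Raising on `NaN` keeps the retry behaviour seen in the first attempt above, while real values such as `'42.799999952316284'` now parse and can be compared against the threshold.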

Mentioned in SAL (#wikimedia-operations) [2021-04-15T22:33:28Z] <ryankemper> T280108 T267927 Data transfers completed successfully; small issue with new wait_for_updater logic is preventing termination so I ctrl+c'd manually

Mentioned in SAL (#wikimedia-operations) [2021-04-15T22:46:40Z] <ryankemper> T280108 T267927 Manually re-enabled and ran puppet on wdqs1005 (had closed the tmux pane which terminated the cookbook without letting it do its final cleanup)

Change 680095 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/cookbooks@master] wdqs: int can't take in float as string

https://gerrit.wikimedia.org/r/680095

Change 680095 merged by Ryan Kemper:

[operations/cookbooks@master] wdqs: int can't take in float as string

https://gerrit.wikimedia.org/r/680095

Gehel claimed this task.