
asoranking failed its monthly run on stat1007
Closed, Resolved (Public)

Description

Hi! asoranking failed to run, most likely because of the recent OS upgrade on stat1007 (stretch -> buster), which brought in a newer pandas:

Nov 01 12:07:19 stat1007 performance-asoranking[24862]: 2020-11-01 12:07:19,857 - ASORanking - DEBUG - 350 whitelisted ASNs for cellular
Nov 01 12:07:20 stat1007 performance-asoranking[24862]: Traceback (most recent call last):
Nov 01 12:07:20 stat1007 performance-asoranking[24862]:   File "/srv/deployment/performance/asoranking/asoranking.py", line 354, in <module>
Nov 01 12:07:20 stat1007 performance-asoranking[24862]:     aso.run()
Nov 01 12:07:20 stat1007 performance-asoranking[24862]:   File "/srv/deployment/performance/asoranking/asoranking.py", line 35, in run
Nov 01 12:07:20 stat1007 performance-asoranking[24862]:     self.generate_report()
Nov 01 12:07:20 stat1007 performance-asoranking[24862]:   File "/srv/deployment/performance/asoranking/asoranking.py", line 320, in generate_report
Nov 01 12:07:20 stat1007 performance-asoranking[24862]:     self.args.threshold
Nov 01 12:07:20 stat1007 performance-asoranking[24862]:   File "/srv/deployment/performance/asoranking/asoranking.py", line 234, in generate_ranking
Nov 01 12:07:20 stat1007 performance-asoranking[24862]:     median_ttfb_by_aso = median_ttfb_by_aso.sort(['ttfb'])
Nov 01 12:07:20 stat1007 performance-asoranking[24862]:   File "/usr/lib/python3/dist-packages/pandas/core/generic.py", line 4378, in __getattr__
Nov 01 12:07:20 stat1007 performance-asoranking[24862]:     return object.__getattribute__(self, name)
Nov 01 12:07:20 stat1007 performance-asoranking[24862]: AttributeError: 'DataFrame' object has no attribute 'sort'
Nov 01 12:07:20 stat1007 systemd[1]: performance-asoranking.service: Main process exited, code=exited, status=1/FAILURE
Nov 01 12:07:20 stat1007 systemd[1]: performance-asoranking.service: Failed with result 'exit-code'.

elukey@stat1007:~$ dpkg -l | grep pandas
ii  python3-pandas                        0.23.3+dfsg-3                                          all          data structures for "relational" or "labeled" data - Python 3
ii  python3-pandas-lib                    0.23.3+dfsg-3                                          amd64        low-level implementations and bindings for pandas - Python 3

From a quick reading it might be just a matter of moving to something like:

median_ttfb_by_aso = median_ttfb_by_aso.sort_values(by='ttfb')
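
For illustration, a minimal sketch of the suggested change. The variable name and the ttfb column come from the traceback above; the aso column and all the values are made up for the example and are not taken from the real report. DataFrame.sort() was deprecated in pandas 0.17 and removed in 0.20, which is why the call fails on the 0.23.3 package installed here.

```python
import pandas as pd

# Hypothetical sample data; only the 'ttfb' column name appears in the traceback.
median_ttfb_by_aso = pd.DataFrame(
    {
        "aso": ["ISP A", "ISP B", "ISP C"],
        "ttfb": [420.0, 180.0, 310.0],
    }
)

# Old call, gone since pandas 0.20:
#   median_ttfb_by_aso = median_ttfb_by_aso.sort(['ttfb'])
# Replacement: sort_values() returns a new DataFrame ordered by the column.
ranked = median_ttfb_by_aso.sort_values(by="ttfb")

print(ranked)
```

sort_values() sorts in ascending order by default, as the old sort() did, so the ranking order should stay the same.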

Event Timeline

Change 638093 had a related patch set uploaded (by Gilles; owner: Gilles):
[performance/asoranking@master] Make code compatible with Pandas >= 0.20

https://gerrit.wikimedia.org/r/638093

Gilles triaged this task as Medium priority. (Nov 2 2020, 12:36 PM)
Gilles moved this task from Inbox, needs triage to Doing (old) on the Performance-Team board.

Change 638093 merged by jenkins-bot:
[performance/asoranking@master] Make code compatible with Pandas >= 0.20

https://gerrit.wikimedia.org/r/638093

Mentioned in SAL (#wikimedia-operations) [2020-11-03T10:21:37Z] <gilles@deploy1001> Started deploy [performance/asoranking@2a2cb05]: T266985

Mentioned in SAL (#wikimedia-operations) [2020-11-03T10:22:03Z] <gilles@deploy1001> Finished deploy [performance/asoranking@2a2cb05]: T266985 (duration: 00m 26s)

@elukey the scap deploy fails on permissions, does this look familiar?

gilles@deploy1001:/srv/deployment/performance/asoranking$ scap deploy T266985
10:21:37 Started deploy [performance/asoranking@2a2cb05]
10:21:37 Deploying Rev: HEAD = 2a2cb059fbeef89805503d2a4597852ca80b0c97
10:21:37 Started deploy [performance/asoranking@2a2cb05]: T266985
10:21:37 
== DEFAULT ==
:* stat1007.eqiad.wmnet
10:21:39 ['/usr/bin/scap', 'deploy-local', '-v', '--repo', 'performance/asoranking', '-g', 'default', 'fetch', '--refresh-config'] on stat1007.eqiad.wmnet returned [70]: 10:21:39 WARNING  - Unhandled error:
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/scap/cli.py", line 334, in run
    app._load_config()
  File "/usr/lib/python2.7/dist-packages/scap/deploy.py", line 114, in _load_config
    overrides = self._get_config_overrides()
  File "/usr/lib/python2.7/dist-packages/scap/deploy.py", line 531, in _get_config_overrides
    with open(self.context.local_config, 'w') as cfg:
IOError: [Errno 13] Permission denied: '/srv/deployment/performance/asoranking-cache/.config'
10:21:39 ERROR    - deploy-local failed: <IOError> [Errno 13] Permission denied: '/srv/deployment/performance/asoranking-cache/.config'

performance/asoranking: fetch stage(s): 100% (ok: 0; fail: 1; left: 0)          
10:21:39 1 targets had deploy errors
10:21:39 1 targets failed
10:21:39 1 of 1 default targets failed, exceeding limit

/srv/deployment/performance/asoranking-cache/.config is owned by this user:

-rw-r--r-- 1 druid analytics-privatedata  558 Jun 11 19:20 .config

And I'm not quite sure why "druid" owns this.

These are the scap settings:

[global]
git_repo: performance/asoranking
git_deploy_dir: /srv/deployment
git_repo_user: analytics
ssh_user: analytics-deploy
dsh_targets: targets

The last time I deployed an update of this service was on June 11, and it was successful. Has anything changed since then with regard to UNIX users/groups on the stat machines?

@Gilles yes, we reimaged the host to Debian Buster and some system users didn't keep the same uid. I just ran chown -R analytics. Can you retry?
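
Files store a numeric uid, not a user name, so after a reimage the same number can map to a different account (here, "druid"). A rough sketch of how leftover mismatches could be listed before retrying a deploy. The helper names owner_name and find_mismatched are hypothetical, not part of scap, and which user should own the directory is an assumption the caller has to supply.

```python
import os
import pwd


def owner_name(path):
    """Return the user name that the file's numeric uid maps to right now."""
    uid = os.lstat(path).st_uid
    try:
        return pwd.getpwuid(uid).pw_name
    except KeyError:
        # No account has this uid any more; fall back to the raw number.
        return str(uid)


def find_mismatched(root, expected_user):
    """List (path, owner) pairs under root that expected_user does not own."""
    mismatched = []
    for dirpath, _dirnames, filenames in os.walk(root):
        candidates = [dirpath] + [os.path.join(dirpath, f) for f in filenames]
        for path in candidates:
            owner = owner_name(path)
            if owner != expected_user:
                mismatched.append((path, owner))
    return mismatched
```

For example, find_mismatched('/srv/deployment/performance/asoranking-cache', 'analytics') would list anything in the cache directory still owned by another account, assuming 'analytics' is the user scap is expected to write as.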

Mentioned in SAL (#wikimedia-operations) [2020-11-03T10:45:12Z] <gilles@deploy1001> Started deploy [performance/asoranking@2a2cb05]: T266985

Mentioned in SAL (#wikimedia-operations) [2020-11-03T10:45:19Z] <gilles@deploy1001> Finished deploy [performance/asoranking@2a2cb05]: T266985 (duration: 00m 07s)

@elukey still the same. Shouldn't it be analytics-deploy?

Mentioned in SAL (#wikimedia-operations) [2020-11-03T11:51:05Z] <gilles@deploy1001> Started deploy [performance/asoranking@2a2cb05]: T266985

Mentioned in SAL (#wikimedia-operations) [2020-11-03T11:51:11Z] <gilles@deploy1001> Finished deploy [performance/asoranking@2a2cb05]: T266985 (duration: 00m 03s)

Mentioned in SAL (#wikimedia-analytics) [2020-11-03T13:02:17Z] <elukey> force a restart of performance-asoranking.service on stat1007 after fix for pandas' sort() - T266985

I ran it manually and it completed successfully.