
ORES deployment - Spring 2021
Closed, ResolvedPublic

Description

In this change:

  • Adds 6 new topic models
  • Switches to 10k optimized vectors from the 100k most common word vectors
  • Adds the nlwiki article quality model

Things to look out for:

  1. Changes in memory usage (should go down)
  2. Faster startup time

Event Timeline


Change 675612 had a related patch set uploaded (by Halfak; author: Halfak):
[mediawiki/services/ores/deploy@master] Adds new topic models, limits vectors to 10k, and nlwiki wp10

https://gerrit.wikimedia.org/r/675612

Change 675612 merged by Accraze:

[mediawiki/services/ores/deploy@master] Adds new topic models, limits vectors to 10k, and nlwiki wp10

https://gerrit.wikimedia.org/r/675612

Patchset merged. Should be good to move to Beta for now.

Thank you! Will run a test on beta when I get a chance and report back here.

Deploy failed with the following error:

halfak@deployment-deploy01:/srv/deployment/ores/deploy$ scap deploy -v T278723
16:57:47 Running ['git', 'show', '-s', '--format=%ct', '257a349d02347537c1cbb5d6a4a367ccaf08a3cb'] with {'stdin': <open file '/dev/null', mode 'rb' at 0x7ffa4436cc90>, 'cwd': '/srv/deployment/ores/deploy', 'stderr': -1, 'stdout': -1}
16:57:47 Command exited with code 0
16:57:47 Running ['git', 'ls-remote', '--get-url'] with {'stdin': <open file '/dev/null', mode 'rb' at 0x7ffa4436cdb0>, 'cwd': '/srv/deployment/ores/deploy', 'stderr': -1, 'stdout': -1}
16:57:47 Command exited with code 0
16:57:47 Started deploy [ores/deploy@257a349]
16:57:47 Running ['git', 'tag', '--list', 'scap/sync/2021-04-13/*'] with {'stdin': <open file '/dev/null', mode 'rb' at 0x7ffa4436cae0>, 'cwd': '/srv/deployment/ores/deploy', 'stderr': -1, 'stdout': -1}
16:57:47 Command exited with code 0
16:57:47 Running ['git', 'rev-parse', '--verify', 'HEAD'] with {'stdin': <open file '/dev/null', mode 'rb' at 0x7ffa4436cd20>, 'cwd': '/srv/deployment/ores/deploy', 'stderr': -1, 'stdout': -1}
16:57:47 Command exited with code 0
16:57:47 Deploying Rev: HEAD = 257a349d02347537c1cbb5d6a4a367ccaf08a3cb
16:57:47 Update DEPLOY_HEAD
16:57:47 Creating /srv/deployment/ores/deploy/.git/DEPLOY_HEAD
16:57:47 Running ['git', 'for-each-ref', '--sort=taggerdate', '--format=%(refname)', 'refs/tags'] with {'stdin': <open file '/dev/null', mode 'rb' at 0x7ffa4436cae0>, 'cwd': '/srv/deployment/ores/deploy', 'stderr': -1, 'stdout': -1}
16:57:47 Command exited with code 0
16:57:47 Running ['git', 'tag', '-d', u'scap/sync/2019-12-03/0001'] with {'stdin': <open file '/dev/null', mode 'rb' at 0x7ffa4436cd20>, 'cwd': '/srv/deployment/ores/deploy', 'stderr': -1, 'stdout': -1}
16:57:47 Command exited with code 0
16:57:47 Update server info
16:57:47 Running ['git', 'update-server-info'] with {'stdin': <open file '/dev/null', mode 'rb' at 0x7ffa4436cc90>, 'cwd': '/srv/deployment/ores/deploy', 'stderr': -1, 'stdout': -1}
16:57:47 Command exited with code 0
16:57:47 Running ['git', 'submodule', 'foreach', '--recursive', 'git update-server-info'] with {'stdin': <open file '/dev/null', mode 'rb' at 0x7ffa4436cdb0>, 'cwd': '/srv/deployment/ores/deploy', 'stderr': -1, 'stdout': -1}
16:57:47 Command exited with code 0
16:57:47 Started deploy [ores/deploy@257a349]: T278723
16:57:47 
== DEFAULT ==
:* deployment-ores01.deployment-prep.eqiad.wmflabs
16:57:47 Running remote deploy cmd ['/usr/bin/scap', 'deploy-local', '-v', '--repo', 'ores/deploy', '-g', 'default', 'fetch', '--refresh-config']
16:57:47 Using key: /etc/keyholder.d/deploy_service.pub
16:57:47 ['/usr/bin/scap', 'deploy-local', '-v', '--repo', 'ores/deploy', '-g', 'default', 'fetch', '--refresh-config'] on deployment-ores01.deployment-prep.eqiad.wmflabs returned [255]: OpenSSH_7.4p1 Debian-10+deb9u6, OpenSSL 1.0.2u  20 Dec 2019
debug1: Reading configuration data /dev/null
debug1: Connecting to deployment-ores01.deployment-prep.eqiad.wmflabs [172.16.4.95] port 22.
debug1: Connection established.
debug1: identity file /etc/keyholder.d/deploy_service.pub type 1
debug1: key_load_public: No such file or directory
debug1: identity file /etc/keyholder.d/deploy_service.pub-cert type -1
debug1: Enabling compatibility mode for protocol 2.0
debug1: Local version string SSH-2.0-OpenSSH_7.4p1 Debian-10+deb9u6
debug1: Remote protocol version 2.0, remote software version OpenSSH_7.4p1 Debian-10+deb9u6
debug1: match: OpenSSH_7.4p1 Debian-10+deb9u6 pat OpenSSH* compat 0x04000000
debug1: Authenticating to deployment-ores01.deployment-prep.eqiad.wmflabs:22 as 'deploy-service'
debug1: SSH2_MSG_KEXINIT sent
debug1: SSH2_MSG_KEXINIT received
debug1: kex: algorithm: curve25519-sha256@libssh.org
debug1: kex: host key algorithm: ecdsa-sha2-nistp256
debug1: kex: server->client cipher: chacha20-poly1305@openssh.com MAC: <implicit> compression: none
debug1: kex: client->server cipher: chacha20-poly1305@openssh.com MAC: <implicit> compression: none
debug1: expecting SSH2_MSG_KEX_ECDH_REPLY
debug1: Server host key: ecdsa-sha2-nistp256 SHA256:gNywH2BdkKg2mU45nnQhMo6HX336cntrTME3iKfbczo
Host key verification failed.

16:57:47 connection to deployment-ores01.deployment-prep.eqiad.wmflabs failed and future stages will not be attempted for this target
ores/deploy: fetch stage(s): 100% (ok: 0; fail: 1; left: 0)                     
16:57:47 1 targets had deploy errors
16:57:47 1 targets failed
16:57:47 1 of 1 default targets failed, exceeding limit
Rollback all deployed groups? [Y/n]: Y
16:57:56 Finished deploy [ores/deploy@257a349]: T278723 (duration: 00m 09s)
16:57:56 Finished deploy [ores/deploy@257a349] (duration: 00m 09s)

After checking the SSH known hosts on deployment-ores01 without finding anything weird, I tried to deploy with my user and it worked:

elukey@deployment-deploy01:/srv/deployment/ores/deploy$ scap deploy -v T278723
17:29:45 Running ['git', 'show', '-s', '--format=%ct', '257a349d02347537c1cbb5d6a4a367ccaf08a3cb'] with {'stdin': <open file '/dev/null', mode 'rb' at 0x7fb0b9cfbc90>, 'cwd': '/srv/deployment/ores/deploy', 'stderr': -1, 'stdout': -1}
17:29:45 Command exited with code 0
17:29:45 Running ['git', 'ls-remote', '--get-url'] with {'stdin': <open file '/dev/null', mode 'rb' at 0x7fb0b9cfbdb0>, 'cwd': '/srv/deployment/ores/deploy', 'stderr': -1, 'stdout': -1}
17:29:45 Command exited with code 0
17:29:45 Started deploy [ores/deploy@257a349]
17:29:45 Running ['git', 'tag', '--list', 'scap/sync/2021-04-13/*'] with {'stdin': <open file '/dev/null', mode 'rb' at 0x7fb0b9cfbae0>, 'cwd': '/srv/deployment/ores/deploy', 'stderr': -1, 'stdout': -1}
17:29:46 Command exited with code 0
17:29:46 Running ['git', 'rev-parse', '--verify', 'HEAD'] with {'stdin': <open file '/dev/null', mode 'rb' at 0x7fb0b9cfbd20>, 'cwd': '/srv/deployment/ores/deploy', 'stderr': -1, 'stdout': -1}
17:29:46 Command exited with code 0
17:29:46 Deploying Rev: HEAD = 257a349d02347537c1cbb5d6a4a367ccaf08a3cb
17:29:46 Update DEPLOY_HEAD
17:29:46 Creating /srv/deployment/ores/deploy/.git/DEPLOY_HEAD
17:29:46 Running ['git', 'for-each-ref', '--sort=taggerdate', '--format=%(refname)', 'refs/tags'] with {'stdin': <open file '/dev/null', mode 'rb' at 0x7fb0b9cfbae0>, 'cwd': '/srv/deployment/ores/deploy', 'stderr': -1, 'stdout': -1}
17:29:46 Command exited with code 0
17:29:46 Running ['git', 'tag', '-d', u'scap/sync/2020-01-23/0001'] with {'stdin': <open file '/dev/null', mode 'rb' at 0x7fb0b9cfbd20>, 'cwd': '/srv/deployment/ores/deploy', 'stderr': -1, 'stdout': -1}
17:29:46 Command exited with code 0
17:29:46 Update server info
17:29:46 Running ['git', 'update-server-info'] with {'stdin': <open file '/dev/null', mode 'rb' at 0x7fb0b9cfbc90>, 'cwd': '/srv/deployment/ores/deploy', 'stderr': -1, 'stdout': -1}
17:29:46 Command exited with code 0
17:29:46 Running ['git', 'submodule', 'foreach', '--recursive', 'git update-server-info'] with {'stdin': <open file '/dev/null', mode 'rb' at 0x7fb0b9cfbdb0>, 'cwd': '/srv/deployment/ores/deploy', 'stderr': -1, 'stdout': -1}
17:29:46 Command exited with code 0
17:29:46 Started deploy [ores/deploy@257a349]: T278723
17:29:46 
== DEFAULT ==
:* deployment-ores01.deployment-prep.eqiad.wmflabs
17:29:46 Running remote deploy cmd ['/usr/bin/scap', 'deploy-local', '-v', '--repo', 'ores/deploy', '-g', 'default', 'fetch', '--refresh-config']
17:29:46 Using key: /etc/keyholder.d/deploy_service.pub
ores/deploy: fetch stage(s): 100% (ok: 1; fail: 0; left: 0)
17:31:11 Running remote deploy cmd ['/usr/bin/scap', 'deploy-local', '-v', '--repo', 'ores/deploy', '-g', 'default', 'config_deploy', '--refresh-config']
17:31:11 Using key: /etc/keyholder.d/deploy_service.pub
ores/deploy: config_deploy stage(s): 100% (ok: 1; fail: 0; left: 0)
17:31:13 Running remote deploy cmd ['/usr/bin/scap', 'deploy-local', '-v', '--repo', 'ores/deploy', '-g', 'default', 'promote', '--refresh-config']
17:31:13 Using key: /etc/keyholder.d/deploy_service.pub
ores/deploy: promote and restart_service stage(s): 100% (ok: 1; fail: 0; left: 0)
17:31:23 
== DEFAULT ==
:* deployment-ores01.deployment-prep.eqiad.wmflabs
17:31:23 Running remote deploy cmd ['/usr/bin/scap', 'deploy-local', '-v', '--repo', 'ores/deploy', '-g', 'default', 'finalize', '--refresh-config']
17:31:23 Using key: /etc/keyholder.d/deploy_service.pub
ores/deploy: finalize stage(s): 100% (ok: 1; fail: 0; left: 0)
17:31:27 Finished deploy [ores/deploy@257a349]: T278723 (duration: 01m 41s)
17:31:27 Finished deploy [ores/deploy@257a349] (duration: 01m 41s)

deployment-ores01.deployment-prep.eqiad.wmflabs

This may or may not be the issue, but I'd encourage you to change your hosts list to use .wikimedia.cloud names rather than .wmflabs; any new or rebuilt hosts won't exist under .wmflabs.

Looks like we can't reach ores-beta.wmflabs.org. I created a task for exploring it T280420: ores-beta.wmflabs.org is unreachable

It's possible that the deployment brought down the web service without reporting an error. I haven't had time to look into it, so I'm posting here in case someone can help out.

It looks like @elukey's deployment run failed to update the submodules on the deployment host (deployment-ores01). Here's what I see on the deployment host:

halfak@deployment-ores01:/srv/deployment/ores/deploy$ git status
HEAD detached at 257a349
Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git checkout -- <file>..." to discard changes in working directory)
  (commit or discard the untracked or modified content in submodules)

	modified:   submodules/articlequality (modified content)
	modified:   submodules/draftquality (modified content)
	modified:   submodules/drafttopic (modified content)
	modified:   submodules/editquality (modified content)
	modified:   submodules/wheels (modified content)

It looks like the submodule for "assets" was updated, but not the submodules for editquality, draftquality, etc. I'm guessing that scap had some sort of hiccup. It also appears that I still can't run scap deploy myself. I'm not sure why there's still a key failure, since I'm able to SSH directly to both the deployment-deploy01 server and the deployment-ores01 server.
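For spotting this kind of submodule drift programmatically, one could parse `git status --porcelain` output like the listing above. This is a hypothetical helper, not part of the deploy tooling; it assumes the standard porcelain v1 two-character status codes:

```python
def dirty_submodules(porcelain_output):
    """Return submodule paths reported as modified by `git status --porcelain`."""
    dirty = []
    for line in porcelain_output.splitlines():
        # Porcelain v1 lines are "XY path" (two status chars, a space, the path).
        status, path = line[:2], line[3:]
        if status.strip() == "M" and path.startswith("submodules/"):
            dirty.append(path)
    return dirty
```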

@Halfak the git status is the same on ores1001:

elukey@ores1001:/srv/deployment/ores/deploy$ git status
HEAD detached at 6912889
Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git checkout -- <file>..." to discard changes in working directory)
  (commit or discard the untracked or modified content in submodules)

	modified:   submodules/articlequality (modified content)
	modified:   submodules/draftquality (modified content)
	modified:   submodules/drafttopic (modified content)
	modified:   submodules/editquality (modified content)
	modified:   submodules/wheels (modified content)

Untracked files:
  (use "git add <file>..." to include in what will be committed)

	venv/

no changes added to commit (use "git add" and/or "git commit -a")

Re-tried the deployment, no hiccups:

elukey@deployment-deploy01:/srv/deployment/ores/deploy$ scap deploy -v T278723                                             
17:34:56 Running ['git', 'show', '-s', '--format=%ct', '257a349d02347537c1cbb5d6a4a367ccaf08a3cb'] with {'stdin': <open file '/dev/null', mode 'rb' at 0x7f2d5fe98c90>, 'cwd': '/srv/deployment/ores/deploy', 'stderr': -1, 'stdout': -1}
17:34:56 Command exited with code 0
17:34:56 Running ['git', 'ls-remote', '--get-url'] with {'stdin': <open file '/dev/null', mode 'rb' at 0x7f2d5fe98db0>, 'cwd': '/srv/deployment/ores/deploy', 'stderr': -1, 'stdout': -1}
17:34:56 Command exited with code 0
17:34:56 Started deploy [ores/deploy@257a349]
17:34:56 Running ['git', 'tag', '--list', 'scap/sync/2021-04-22/*'] with {'stdin': <open file '/dev/null', mode 'rb' at 0x7f2d5fe98ae0>, 'cwd': '/srv/deployment/ores/deploy', 'stderr': -1, 'stdout': -1}
17:34:56 Command exited with code 0
17:34:56 Running ['git', 'rev-parse', '--verify', 'HEAD'] with {'stdin': <open file '/dev/null', mode 'rb' at 0x7f2d5fe98d20>, 'cwd': '/srv/deployment/ores/deploy', 'stderr': -1, 'stdout': -1}
17:34:56 Command exited with code 0
17:34:56 Deploying Rev: HEAD = 257a349d02347537c1cbb5d6a4a367ccaf08a3cb
17:34:56 Update DEPLOY_HEAD
17:34:56 Creating /srv/deployment/ores/deploy/.git/DEPLOY_HEAD
17:34:56 Running ['git', 'for-each-ref', '--sort=taggerdate', '--format=%(refname)', 'refs/tags'] with {'stdin': <open file '/dev/null', mode 'rb' at 0x7f2d5fe98ae0>, 'cwd': '/srv/deployment/ores/deploy', 'stderr': -1, 'stdout': -1}
17:34:56 Command exited with code 0
17:34:56 Running ['git', 'tag', '-d', u'scap/sync/2020-01-24/0001'] with {'stdin': <open file '/dev/null', mode 'rb' at 0x7f2d5fe98d20>, 'cwd': '/srv/deployment/ores/deploy', 'stderr': -1, 'stdout': -1}
17:34:56 Command exited with code 0
17:34:56 Update server info
17:34:56 Running ['git', 'update-server-info'] with {'stdin': <open file '/dev/null', mode 'rb' at 0x7f2d5fe98c90>, 'cwd': '/srv/deployment/ores/deploy', 'stderr': -1, 'stdout': -1}
17:34:56 Command exited with code 0
17:34:56 Running ['git', 'submodule', 'foreach', '--recursive', 'git update-server-info'] with {'stdin': <open file '/dev/null', mode 'rb' at 0x7f2d5fe98db0>, 'cwd': '/srv/deployment/ores/deploy', 'stderr': -1, 'stdout': -1}
17:34:56 Command exited with code 0
17:34:56 Started deploy [ores/deploy@257a349]: T278723
17:34:56 
== DEFAULT ==
:* deployment-ores01.deployment-prep.eqiad.wmflabs
17:34:56 Running remote deploy cmd ['/usr/bin/scap', 'deploy-local', '-v', '--repo', 'ores/deploy', '-g', 'default', 'fetch', '--refresh-config']
17:34:56 Using key: /etc/keyholder.d/deploy_service.pub
ores/deploy: fetch stage(s): 100% (ok: 1; fail: 0; left: 0)
17:34:58 Running remote deploy cmd ['/usr/bin/scap', 'deploy-local', '-v', '--repo', 'ores/deploy', '-g', 'default', 'config_deploy', '--refresh-config']
17:34:58 Using key: /etc/keyholder.d/deploy_service.pub
ores/deploy: config_deploy stage(s): 100% (ok: 1; fail: 0; left: 0)
17:34:59 Running remote deploy cmd ['/usr/bin/scap', 'deploy-local', '-v', '--repo', 'ores/deploy', '-g', 'default', 'promote', '--refresh-config']
17:34:59 Using key: /etc/keyholder.d/deploy_service.pub
ores/deploy: promote and restart_service stage(s): 100% (ok: 1; fail: 0; left: 0)
17:35:00 
== DEFAULT ==
:* deployment-ores01.deployment-prep.eqiad.wmflabs
17:35:00 Running remote deploy cmd ['/usr/bin/scap', 'deploy-local', '-v', '--repo', 'ores/deploy', '-g', 'default', 'finalize', '--refresh-config']
17:35:00 Using key: /etc/keyholder.d/deploy_service.pub
ores/deploy: finalize stage(s): 100% (ok: 1; fail: 0; left: 0)
17:35:00 Finished deploy [ores/deploy@257a349]: T278723 (duration: 00m 04s)
17:35:00 Finished deploy [ores/deploy@257a349] (duration: 00m 04s)

I have now. I think the non-updating submodules were a red herring. I see now that the code and asset filenames were not aligned. I've got a change in progress that should resolve T280420.

OK I've made the updates and we're ready for a new deployment to Beta, but I'm still blocked on being able to run scap myself.

I suspect these are the key lines from my output where everything goes wrong.

debug1: identity file /etc/keyholder.d/deploy_service.pub type 1
debug1: key_load_public: No such file or directory

I'll see if I can learn more by asking in #wikimedia-cloud

elukey@deployment-deploy01:/srv/deployment/ores/deploy$ scap deploy -v T278723
16:54:55 Running ['git', 'show', '-s', '--format=%ct', '666f1dd6ab44434f7eaf938a38e06e4a4c34217b'] with {'stdin': <open file '/dev/null', mode 'rb' at 0x7f160bee1c90>, 'cwd': '/srv/deployment/ores/deploy', 'stderr': -1, 'stdout': -1}
16:54:55 Command exited with code 0
16:54:55 Running ['git', 'ls-remote', '--get-url'] with {'stdin': <open file '/dev/null', mode 'rb' at 0x7f160bee1db0>, 'cwd': '/srv/deployment/ores/deploy', 'stderr': -1, 'stdout': -1}
16:54:55 Command exited with code 0
16:54:55 Started deploy [ores/deploy@666f1dd]
16:54:55 Running ['git', 'tag', '--list', 'scap/sync/2021-04-24/*'] with {'stdin': <open file '/dev/null', mode 'rb' at 0x7f160bee1ae0>, 'cwd': '/srv/deployment/ores/deploy', 'stderr': -1, 'stdout': -1}
16:54:55 Command exited with code 0
16:54:55 Running ['git', 'rev-parse', '--verify', 'HEAD'] with {'stdin': <open file '/dev/null', mode 'rb' at 0x7f160bee1d20>, 'cwd': '/srv/deployment/ores/deploy', 'stderr': -1, 'stdout': -1}
16:54:55 Command exited with code 0
16:54:55 Deploying Rev: HEAD = 666f1dd6ab44434f7eaf938a38e06e4a4c34217b
16:54:55 Update DEPLOY_HEAD
16:54:55 Creating /srv/deployment/ores/deploy/.git/DEPLOY_HEAD
16:54:55 Running ['git', 'for-each-ref', '--sort=taggerdate', '--format=%(refname)', 'refs/tags'] with {'stdin': <open file '/dev/null', mode 'rb' at 0x7f160bee1ae0>, 'cwd': '/srv/deployment/ores/deploy', 'stderr': -1, 'stdout': -1}
16:54:55 Command exited with code 0
16:54:55 Running ['git', 'tag', '-d', u'scap/sync/2020-04-20/0003'] with {'stdin': <open file '/dev/null', mode 'rb' at 0x7f160bee1d20>, 'cwd': '/srv/deployment/ores/deploy', 'stderr': -1, 'stdout': -1}
16:54:55 Command exited with code 0
16:54:55 Update server info
16:54:55 Running ['git', 'update-server-info'] with {'stdin': <open file '/dev/null', mode 'rb' at 0x7f160bee1c90>, 'cwd': '/srv/deployment/ores/deploy', 'stderr': -1, 'stdout': -1}
16:54:55 Command exited with code 0
16:54:55 Running ['git', 'submodule', 'foreach', '--recursive', 'git update-server-info'] with {'stdin': <open file '/dev/null', mode 'rb' at 0x7f160bee1db0>, 'cwd': '/srv/deployment/ores/deploy', 'stderr': -1, 'stdout': -1}
16:54:55 Command exited with code 0
16:54:55 Started deploy [ores/deploy@666f1dd]: T278723
16:54:55 
== DEFAULT ==
:* deployment-ores01.deployment-prep.eqiad.wmflabs
16:54:55 Running remote deploy cmd ['/usr/bin/scap', 'deploy-local', '-v', '--repo', 'ores/deploy', '-g', 'default', 'fetch', '--refresh-config']
16:54:55 Using key: /etc/keyholder.d/deploy_service.pub
ores/deploy: fetch stage(s): 100% (ok: 1; fail: 0; left: 0)
16:56:05 Running remote deploy cmd ['/usr/bin/scap', 'deploy-local', '-v', '--repo', 'ores/deploy', '-g', 'default', 'config_deploy', '--refresh-config']
16:56:05 Using key: /etc/keyholder.d/deploy_service.pub
ores/deploy: config_deploy stage(s): 100% (ok: 1; fail: 0; left: 0)
16:56:06 Running remote deploy cmd ['/usr/bin/scap', 'deploy-local', '-v', '--repo', 'ores/deploy', '-g', 'default', 'promote', '--refresh-config']
16:56:06 Using key: /etc/keyholder.d/deploy_service.pub
ores/deploy: promote and restart_service stage(s): 100% (ok: 1; fail: 0; left: 0)
16:57:40 
== DEFAULT ==
:* deployment-ores01.deployment-prep.eqiad.wmflabs
16:57:40 Running remote deploy cmd ['/usr/bin/scap', 'deploy-local', '-v', '--repo', 'ores/deploy', '-g', 'default', 'finalize', '--refresh-config']
16:57:40 Using key: /etc/keyholder.d/deploy_service.pub
ores/deploy: finalize stage(s): 100% (ok: 1; fail: 0; left: 0)
16:57:42 Finished deploy [ores/deploy@666f1dd]: T278723 (duration: 02m 47s)
16:57:42 Finished deploy [ores/deploy@666f1dd] (duration: 02m 47s)

https://ores-beta.wmflabs.org/ seems now visible again :)

www-data  9237 12.7 29.7 3152332 2428900 ?     Ss   16:56   0:32 /srv/deployment/ores/deploy-cache/revs/666f1dd6ab44434f7eaf938a38e06e4a4c34217b/venv/bin/python3 /srv/deployment/ores/deploy/venv/bin/celery worker --app ores_celery.application --loglevel ERROR
www-data  9470  0.0 29.3 3152332 2397828 ?     S    16:57   0:00  \_ /srv/deployment/ores/deploy-cache/revs/666f1dd6ab44434f7eaf938a38e06e4a4c34217b/venv/bin/python3 /srv/deployment/ores/deploy/venv/bin/celery worker --app ores_celery.application --loglevel ERROR
www-data  9471  0.0 29.3 3152332 2397828 ?     S    16:57   0:00  \_ /srv/deployment/ores/deploy-cache/revs/666f1dd6ab44434f7eaf938a38e06e4a4c34217b/venv/bin/python3 /srv/deployment/ores/deploy/venv/bin/celery worker --app ores_celery.application --loglevel ERROR
www-data  9472  0.0 29.3 3152332 2397828 ?     S    16:57   0:00  \_ /srv/deployment/ores/deploy-cache/revs/666f1dd6ab44434f7eaf938a38e06e4a4c34217b/venv/bin/python3 /srv/deployment/ores/deploy/venv/bin/celery worker --app ores_celery.application --loglevel ERROR
www-data  9473  0.0 29.3 3152332 2397828 ?     S    16:57   0:00  \_ /srv/deployment/ores/deploy-cache/revs/666f1dd6ab44434f7eaf938a38e06e4a4c34217b/venv/bin/python3 /srv/deployment/ores/deploy/venv/bin/celery worker --app ores_celery.application --loglevel ERROR
www-data  9474  0.0 29.3 3152332 2398020 ?     S    16:57   0:00  \_ /srv/deployment/ores/deploy-cache/revs/666f1dd6ab44434f7eaf938a38e06e4a4c34217b/venv/bin/python3 /srv/deployment/ores/deploy/venv/bin/celery worker --app ores_celery.application --loglevel ERROR
www-data  9477  0.0 29.3 3152332 2398020 ?     S    16:57   0:00  \_ /srv/deployment/ores/deploy-cache/revs/666f1dd6ab44434f7eaf938a38e06e4a4c34217b/venv/bin/python3 /srv/deployment/ores/deploy/venv/bin/celery worker --app ores_celery.application --loglevel ERROR
www-data  9481  0.0 29.3 3152332 2398020 ?     S    16:57   0:00  \_ /srv/deployment/ores/deploy-cache/revs/666f1dd6ab44434f7eaf938a38e06e4a4c34217b/venv/bin/python3 /srv/deployment/ores/deploy/venv/bin/celery worker --app ores_celery.application --loglevel ERROR
www-data  9521  7.4  7.3 1325476 599284 ?      Ssl  16:57   0:12 /usr/bin/uwsgi --die-on-term --ini /etc/uwsgi/apps-enabled/ores.ini
www-data 10378  0.0  6.9 1259680 570400 ?      S    16:58   0:00  \_ /usr/bin/uwsgi --die-on-term --ini /etc/uwsgi/apps-enabled/ores.ini
www-data 10379  0.0  6.9 1259424 568428 ?      S    16:58   0:00  \_ /usr/bin/uwsgi --die-on-term --ini /etc/uwsgi/apps-enabled/ores.ini
www-data 10380  0.0  6.9 1259424 570276 ?      S    16:58   0:00  \_ /usr/bin/uwsgi --die-on-term --ini /etc/uwsgi/apps-enabled/ores.ini
www-data 10381  0.0  6.9 1259424 568428 ?      S    16:58   0:00  \_ /usr/bin/uwsgi --die-on-term --ini /etc/uwsgi/apps-enabled/ores.ini
www-data 10382  0.0  6.9 1259424 570272 ?      S    16:58   0:00  \_ /usr/bin/uwsgi --die-on-term --ini /etc/uwsgi/apps-enabled/ores.ini
www-data 10383  0.0  6.9 1259424 568428 ?      S    16:58   0:00  \_ /usr/bin/uwsgi --die-on-term --ini /etc/uwsgi/apps-enabled/ores.ini
www-data 10384  0.0  6.9 1259424 568428 ?      S    16:58   0:00  \_ /usr/bin/uwsgi --die-on-term --ini /etc/uwsgi/apps-enabled/ores.ini
www-data 10385  0.0  6.9 1259424 568428 ?      S    16:58   0:00  \_ /usr/bin/uwsgi --die-on-term --ini /etc/uwsgi/apps-enabled/ores.ini

No more Celery errors afaics!
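One way to sanity-check before/after memory numbers is to sum the RSS column (KiB, column 6 of `ps aux` output) for the service user, as in the listing above. A hypothetical helper, not part of any existing tooling:

```python
def total_rss_kib(ps_output, user="www-data"):
    """Sum the RSS column (KiB) of `ps aux`-style output for one user."""
    total = 0
    for line in ps_output.splitlines():
        fields = line.split()
        # ps aux columns: USER PID %CPU %MEM VSZ RSS ...
        if len(fields) > 5 and fields[0] == user:
            total += int(fields[5])
    return total
```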

We got a connection error when trying to talk to redis.

Traceback (most recent call last):
  File "./ores/wsgi/routes/v3/util.py", line 105, in process_score_request
    score_response = scoring_system.score(score_request)
  File "./ores/scoring_systems/scoring_system.py", line 60, in score
    response = self._score(request)
  File "./ores/scoring_systems/celery_queue.py", line 194, in _score
    self._check_queue_full()
  File "./ores/scoring_systems/celery_queue.py", line 204, in _check_queue_full
    queue_size = self.redis.llen(DEFAULT_CELERY_QUEUE)
  File "/srv/deployment/ores/deploy/venv/lib/python3.5/site-packages/redis/client.py", line 1953, in llen
    return self.execute_command('LLEN', name)
  File "/srv/deployment/ores/deploy/venv/lib/python3.5/site-packages/redis/client.py", line 898, in execute_command
    conn = self.connection or pool.get_connection(command_name, **options)
  File "/srv/deployment/ores/deploy/venv/lib/python3.5/site-packages/redis/connection.py", line 1192, in get_connection
    connection.connect()
  File "/srv/deployment/ores/deploy/venv/lib/python3.5/site-packages/redis/connection.py", line 567, in connect
    self.on_connect()
  File "/srv/deployment/ores/deploy/venv/lib/python3.5/site-packages/redis/connection.py", line 643, in on_connect
    auth_response = self.read_response()
  File "/srv/deployment/ores/deploy/venv/lib/python3.5/site-packages/redis/connection.py", line 739, in read_response
    response = self._parser.read_response()
  File "/srv/deployment/ores/deploy/venv/lib/python3.5/site-packages/redis/connection.py", line 340, in read_response
    raise error
redis.exceptions.AuthenticationError: Client sent AUTH, but no password is set

I'm not sure where this came from. I'll dig into our configuration. I'm wondering if there was an update to redis that required/removed a password somewhere.

It looks like we do use a password to connect and when I use that password on deployment-ores01, it connects to redis just fine.

halfak@deployment-ores01:/etc/ores$ redis-cli -a <password from config> -h deployment-ores01.deployment-prep.eqiad.wmflabs
deployment-ores01.deployment-prep.eqiad.wmflabs:6379> llen celery
(integer) 0

Looks like redis-cli ignores the supplied password if the server doesn't require one. I tried setting the password requirement by running CONFIG SET requirepass "<thepassword>" in the terminal, and that seemed to work.

Now I'm hitting this error:

Traceback (most recent call last):
  File "/srv/deployment/ores/deploy/venv/lib/python3.5/site-packages/mwapi/session.py", line 101, in _request
    auth=auth)
  File "/srv/deployment/ores/deploy/venv/lib/python3.5/site-packages/requests/sessions.py", line 530, in request
    resp = self.send(prep, **send_kwargs)
  File "/srv/deployment/ores/deploy/venv/lib/python3.5/site-packages/requests/sessions.py", line 643, in send
    r = adapter.send(request, **kwargs)
  File "/srv/deployment/ores/deploy/venv/lib/python3.5/site-packages/requests/adapters.py", line 516, in send
    raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPConnectionPool(host='localhost', port=6500): Max retries exceeded with url: /w/api.php?action=query&rvslots=main&format=json&rvprop=userid%7Ccomment%7Csize%7Cuser%7Cids%7Ctimestamp%7Ccontentmodel%7Ccontent&prop=revisions&revids=34567 (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f074a5615c0>: Failed to establish a new connection: [Errno 111] Connection refused',))

It looks like this is related to work that was done in this commit: https://phabricator.wikimedia.org/rORESDEPLOYe860508bb36d64683434e79e646795530a529c97
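As I understand that change, API requests now go to a local proxy port (6500, per the error above) while the Host header names the real wiki. A sketch of only the request construction; the helper name and defaults are illustrative assumptions, not ORES's actual config keys:

```python
def build_api_request(host, params, proxy="http://localhost:6500"):
    """Build (url, headers, query) for an API call routed via a local proxy.

    The proxy (Envoy in production) terminates TLS and forwards to the
    wiki named in the Host header.
    """
    url = proxy + "/w/api.php"
    headers = {"Host": host}
    query = dict(params, format="json")
    return url, headers, query
```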

We need to set up Envoy in our cloud instance. What is Envoy?

I found https://wikitech.wikimedia.org/wiki/Envoy.

@Ladsgroup, it appears that you were the one to set this up in production. Did you also test it on beta? Should this be working?

Here's the most relevant patchset. https://gerrit.wikimedia.org/r/c/mediawiki/services/ores/deploy/+/621522

AFAICT, Envoy is not working in our test environment, and this breaks our deployment patterns. Was it tested in production? Either way, I think the right way forward is to set up Envoy for deployment-ores01 so that the beta deployment is functional.

We don't have Envoy anywhere on the beta cluster at the moment. It's on my long-term to-do list to set up, and there's a task somewhere, but if you don't have an easy way to avoid using it on ORES, I can see if I can set it up for you, minus some blocked features that aren't needed here.

Thanks for the consideration, Majavah. Right now I don't see a good option; the two I can think of are:

  1. Add Envoy to the beta cluster and config for ORES.
  2. Add Envoy handling to mwapi and add it to the config setup for ORES. Change the ORES deployment config to not reference Envoy directly, and add some optional config, read from hiera, to set up local Envoy config in production.

@Halfak IIUC https://ores.wmflabs.org is backed by ORES instances in the "ores" WMCS project (not deployment-prep), which in theory should already be configured with Envoy. Have you tried there?

https://github.com/wikimedia/ores-wmflabs-deploy

We do our production test deployments in deployment-prep (ores-beta.wmflabs.org). ores.wmflabs.org is for experimental model deployments. E.g. scap is set up to work in beta. For ores.wmflabs.org we use fabric to deploy because scap is unavailable.

@Halfak good to know, is there any documentation about the whole deployment process somewhere?

@Majavah after some thought, I think it would be great if you would look into Envoy for Beta. Honestly, I am asking this because it makes my work easier. But I also figure that anything else that is using Envoy in production would probably want to also use Envoy in Beta.

Do you think it will be much work? I've been reading the docs for it, but I don't have a clear picture for what it would take to set up.

Spent some time today trying to add the Envoy config to the Ores instance in Beta, and all the production code assumes (rightfully) TLS + LVS IPs, so adapting it to beta may not be possible without further puppet changes.

One alternative solution (I'll try it this afternoon) is to simply add a manual envoy/nginx install of a simple proxy listening on port 6500, proxying requests to deployment-mediawiki11 (the mw host in beta). Not great, but it could be a simple workaround to see whether more things are needed for beta to become testable again.
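A minimal nginx fragment along those lines might look like the following. This is only a sketch, not the actual config on deployment-ores01; the listen port and upstream host are taken from the comments above, and everything else is an assumption:

```nginx
# Hypothetical stand-in for Envoy's local proxy: listen where ORES
# expects it (localhost:6500) and forward API traffic to the beta
# MediaWiki host, preserving the Host header the client sends.
server {
    listen 127.0.0.1:6500;

    location / {
        proxy_pass https://deployment-mediawiki11.deployment-prep.eqiad.wmflabs;
        proxy_set_header Host $http_host;
    }
}
```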

Ok there is a basic nginx listening on localhost:6500 on deployment-ores01, @Halfak can you tell me how to repro the connection error highlighted in T278723#7031400?

Thanks for your work @elukey!

I was able to get around this by re-setting the password on redis to the password we have in the configuration for connecting to redis. See my comment here: T278723#7031400

Are you still seeing that error? If so, it seems the password setting didn't stick. Maybe redis got reset, or some process is dropping the password from the configuration.

We can also set the password in the /etc/ configuration. See https://www.stackink.com/how-to-set-password-for-redis-server for a discussion.

Note that we have redis servers on ports 6380 and 6379, so the password will need to be set for both.
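For both ports, the live patch amounts to repeating the CONFIG SET call from earlier. A hypothetical helper that just emits those commands (the durable fix belongs in the /etc/ config / puppet, not in ad-hoc CONFIG SET calls):

```python
def requirepass_cmds(password, ports=(6379, 6380)):
    """Return the redis-cli invocations to set the same password on each port."""
    return ['redis-cli -p {} CONFIG SET requirepass "{}"'.format(port, password)
            for port in ports]
```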

@Halfak No, I meant: how do I trigger the "Failed to establish a new connection: [Errno 111] Connection refused" problem (related, IIUC, to connections to localhost:6500 failing)? Now that we have a proxy, things should flow nicely, but I'm not sure how to test it.

Aha! I misunderstood. I'm seeing the auth error re-appear at https://ores-beta.wmflabs.org/v3/scores/enwiki/

It is masking the connection-refused error, which I got when I tried to get a prediction. You can trigger that by accessing this URL once the AUTH issue is dealt with again: https://ores-beta.wmflabs.org/v3/scores/enwiki/1234567

@Halfak beta seems unblocked for the moment; please check if there are other issues. Current problems that were live-patched and may require a better fix:

  • Redis is deployed via role::labs::ores::redis, which directly instantiates ores::redis (now prohibited by the current puppet guidelines), and the password is not set there, so it doesn't match the ORES one (picked from the labs/private repository). We should file a code review to restructure the role to include a profile, and probably a hiera lookup for the password (which we'll store in labs/private).
  • The nginx config is of course not puppetized yet, but afaics the ORES beta testing relies on a connection to the main wikis, not the api-appservers in deployment-prep. I made a hack to make the proxy work, but in theory a good integration-testing environment should be self-contained (namely, ORES should call deployment-mediawiki11, not the outside internet). Unfortunately deployment-mediawiki11 seems to have only the test domain configured, so we have to rely on the outside internet/wikis for the moment.

I filed T281495: Restructure ORES labs redis puppet role because that seems like it is solvable and might be part of the ORES test environment migration to the MWCS project.

@kevinbazira, it looks like we're ready to proceed with a production deployment. I'll drop you an email about scheduling.

@Halfak I'd ask, whenever you have a moment, for some details about the following points:

  • What are the high-level changes in this deployment? Anything big/new that we'll have to pay attention to?
  • Is there a clean rollback plan in case we notice instability of Ores? (I am thinking even after hours/days)

I am asking from an SRE point of view to be prepared. As you mentioned in our last conversation, ORES has not seen a deployment in a while, and we want to be careful in changing its current state (namely, doing so in a controlled way with all the necessary precautions :).

Thanks!

I added some details about the nature of the deployment to the task description. Main concern is memory usage changes. We saw a drop in memory usage on Beta as expected. It appears that we see a drop in memory usage from 40% to 22% on web nodes.

There are details on rollbacks in the runbook I linked to earlier, but I'll make sure that we make an explicit post here with the version we'll need to roll back to if something goes wrong. It's standard practice to keep it handy while doing the deployment anyway.

Looks like I was mistaken while reading the graphs. The change is from 27% to 22% of available memory. So no substantial change, which is a minor surprise, but not a big one given that we are adding 7 new models while reducing memory consumption from the word vector embeddings.

On paper we should have free memory available on the production nodes, but ideally the three changes outlined in the description could have been broken down into three separate deployments, to get a better sense of the performance impact of each change. I know there may be some interconnection between the changes, and that it would be a problem to break everything down now, but let's keep it in mind next time. Big deployments are not great in general; I really prefer smaller ones :)

Question about the rollback: I see from https://wikitech.wikimedia.org/wiki/ORES/Deployment#Rollback that it is basically a scap deployment (assuming it is not necessary to disable the ORES extension in MediaWiki). Is there anything else to do? For example, Redis state, etc.

Question about checks post deployment:

  • What checks are we going to do to validate that the new version of ORES is stable? And who should do them? (We have little experience operating it, so in case we have to do something, a runbook would be really appreciated :)
  • What are the channels where people usually report Ores problems? It would be nice to proactively ping people using Ores rather than waiting for bug reports etc..

Any info/suggestion is appreciated :)

I totally agree about big deployments! It's been too long.

It is a scap deployment. In theory, rollback is really fast so long as we have the target commit hash. Nothing else to do after rollback.
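For the record, the rollback itself is just a redeploy of the previous revision from the deploy repo, roughly along these lines (the commit hash is a placeholder and the exact scap invocation may differ — see the Rollback section of the deployment docs):

```shell
cd /srv/deployment/ores/deploy
git log --oneline -n 5               # find the last known-good deploy commit
git checkout PREVIOUS_GOOD_HASH      # placeholder: the hash noted before deploying
scap deploy -v "Rollback ORES to previous revision"
```

This is why we keep the pre-deploy commit hash posted on the task: with it in hand, rollback is a single checkout plus redeploy.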

For checks, see the "Monitoring" section of the deployment docs.

Production ORES service graphs: https://grafana.wikimedia.org/dashboard/db/ores?orgId=1
Production ORES extension graphs: https://grafana.wikimedia.org/dashboard/db/ores-extension?orgId=1
Site-wide error graphs: https://grafana.wikimedia.org/dashboard/file/varnish-http-errors.json?refresh=5m&orgId=1
Watch the logs, especially for ERROR-level messages: https://logstash.wikimedia.org/app/kibana#/dashboard/ORES
Watch MediaWiki fatal logs: https://logstash.wikimedia.org/app/kibana#/dashboard/mediawiki-errors

99% of the time, if there is an issue, it'll be visible in the production ORES service graphs.

What are the channels where people usually report Ores problems? It would be nice to proactively ping people using Ores rather than waiting for bug reports etc..

I'm not sure what you are asking. #wikimedia-operations is usually a good place to be during the deployment. For model fitness issues, people tend to bring it to #wikimedia-ai, [[mw:ORES]], or a related page on their local wiki.

We have plans to experiment with the new Dutch model with Dutch Wikipedians. They have been waiting for a looong time, so we might need to blow some cobwebs from the communication channels. The right time to do that is usually when we have a deployment scheduled. @Chtnnh and I were working on new/more-complete process documentation for how to manage communication around models and deployments, but @calbon told us to hold off for some foundation-wide discussion/consultation.

I totally agree about big deployments! It's been too long.

Sure, but it is not only a matter of how much time has passed, but of how we are deploying things :)

It is a scap deployment. In theory, rollback is really fast so long as we have the target commit hash. Nothing else to do after rollback.

Sure, this part is clear, and I feel more confident about a deploy, but the more changes that go out at the same time, the harder it is to figure out where a problem is (if one comes up). Fine for this use case, but in the future small deployments are better (even if multiple things are backlogged).

For checks, see the "Monitoring" section of the deployment docs.

Production ORES service graphs: https://grafana.wikimedia.org/dashboard/db/ores?orgId=1
Production ORES extension graphs: https://grafana.wikimedia.org/dashboard/db/ores-extension?orgId=1
Site-wide error graphs: https://grafana.wikimedia.org/dashboard/file/varnish-http-errors.json?refresh=5m&orgId=1
Watch the logs, especially for ERROR-level messages: https://logstash.wikimedia.org/app/kibana#/dashboard/ORES
Watch MediaWiki fatal logs: https://logstash.wikimedia.org/app/kibana#/dashboard/mediawiki-errors

99% of the time, if there is an issue, it'll be visible in the production ORES service graphs.

Thanks a lot!

What are the channels where people usually report Ores problems? It would be nice to proactively ping people using Ores rather than waiting for bug reports etc..

I'm not sure what you are asking. #wikimedia-operations is usually a good place to be during the deployment. For model fitness issues, people tend to bring it to #wikimedia-ai, [[mw:ORES]], or a related page on their local wiki.

You answered my question, thanks :)

We have plans to experiment with the new Dutch model with Dutch Wikipedians. They have been waiting for a looong time, so we might need to blow some cobwebs from the communication channels. The right time to do that is usually when we have a deployment scheduled. @Chtnnh and I were working on new/more-complete process documentation for how to manage communication around models and deployments, but @calbon told us to hold off for some foundation-wide discussion/consultation.

+1, makes sense. My 2 cents on this: we need to make sure that future efforts for new models keep in mind that we'll migrate to a new pipeline over the next months, to avoid forcing the community and the ML team to do the work twice :)

Mentioned in SAL (#wikimedia-operations) [2021-05-05T13:41:26Z] <kevinbazira@deploy1002> Started deploy [ores/deploy@5612f30]: Regular ORES Deployment T278723

Mentioned in SAL (#wikimedia-operations) [2021-05-05T13:58:13Z] <kevinbazira@deploy1002> Finished deploy [ores/deploy@5612f30]: Regular ORES Deployment T278723 (duration: 16m 47s)

The ORES deployment has been completed. Thanks to @elukey and @klausman.

In case there are any issues we could have missed, please let us know.

There seems to be a regression in the "scores errored" metric, mostly in codfw (ORES is active/active), so some traffic is impacted. This is an example:

https://logstash.wikimedia.org/app/discover#/doc/logstash-*/logstash-2021.05.05?id=5qMhPXkBfVMx58vqE8EF

Stacktrace is:

Traceback (most recent call last):
  File "/srv/deployment/ores/deploy-cache/revs/5612f30290e00e4d76500b65d67cae7ac102c3ac/venv/lib/python3.5/site-packages/celery/app/trace.py", line 375, in trace_task
    R = retval = fun(*args, **kwargs)
  File "/srv/deployment/ores/deploy-cache/revs/5612f30290e00e4d76500b65d67cae7ac102c3ac/venv/lib/python3.5/site-packages/celery/app/trace.py", line 632, in __protected_call__
    return self.run(*args, **kwargs)
  File "/srv/deployment/ores/deploy-cache/revs/5612f30290e00e4d76500b65d67cae7ac102c3ac/ores/scoring_systems/celery_queue.py", line 70, in _process_score_map
    root_cache=root_cache)
  File "/srv/deployment/ores/deploy-cache/revs/5612f30290e00e4d76500b65d67cae7ac102c3ac/ores/scoring_systems/scoring_system.py", line 167, in _process_score_map
    seconds=self.timeout)
  File "/srv/deployment/ores/deploy-cache/revs/5612f30290e00e4d76500b65d67cae7ac102c3ac/ores/util.py", line 25, in timeout
    result = func(*args, **kwargs)
  File "/srv/deployment/ores/deploy-cache/revs/5612f30290e00e4d76500b65d67cae7ac102c3ac/ores/scoring_context.py", line 91, in process_model_scores
    self._process_score(model_name, dependency_cache=root_cache)
  File "/srv/deployment/ores/deploy-cache/revs/5612f30290e00e4d76500b65d67cae7ac102c3ac/ores/scoring_context.py", line 133, in _process_score
    score = self[model_name].score(feature_values)
  File "/srv/deployment/ores/deploy-cache/revs/5612f30290e00e4d76500b65d67cae7ac102c3ac/venv/lib/python3.5/site-packages/revscoring/scoring/models/sklearn.py", line 262, in score
    prediction.append(estimator.predict([scaled_fv_vector])[0])
  File "/srv/deployment/ores/deploy-cache/revs/5612f30290e00e4d76500b65d67cae7ac102c3ac/venv/lib/python3.5/site-packages/sklearn/ensemble/_gb.py", line 2165, in predict
    raw_predictions = self.decision_function(X)
  File "/srv/deployment/ores/deploy-cache/revs/5612f30290e00e4d76500b65d67cae7ac102c3ac/venv/lib/python3.5/site-packages/sklearn/ensemble/_gb.py", line 2121, in decision_function
    raw_predictions = self._raw_predict(X)
  File "/srv/deployment/ores/deploy-cache/revs/5612f30290e00e4d76500b65d67cae7ac102c3ac/venv/lib/python3.5/site-packages/sklearn/ensemble/_gb.py", line 1655, in _raw_predict
    raw_predictions = self._raw_predict_init(X)
  File "/srv/deployment/ores/deploy-cache/revs/5612f30290e00e4d76500b65d67cae7ac102c3ac/venv/lib/python3.5/site-packages/sklearn/ensemble/_gb.py", line 1649, in _raw_predict_init
    raw_predictions = self.loss_.get_init_raw_predictions(
AttributeError: 'BinomialDeviance' object has no attribute 'get_init_raw_predictions'
elukey@ores2001:~$ sudo journalctl -u celery-ores-worker.service  | grep Warning
May 05 13:59:19 ores2001 celery-ores-worker[25245]: /srv/deployment/ores/deploy-cache/revs/5612f30290e00e4d76500b65d67cae7ac102c3ac/venv/lib/python3.5/site-packages/sklearn/utils/deprecation.py:144: FutureWarning: The sklearn.ensemble.gradient_boosting module is  deprecated in version 0.22 and will be removed in version 0.24. The corresponding classes / functions should instead be imported from sklearn.ensemble. Anything that cannot be imported from sklearn.ensemble is now part of the private API.
May 05 13:59:19 ores2001 celery-ores-worker[25245]:   warnings.warn(message, FutureWarning)
May 05 13:59:19 ores2001 celery-ores-worker[25245]: /srv/deployment/ores/deploy-cache/revs/5612f30290e00e4d76500b65d67cae7ac102c3ac/venv/lib/python3.5/site-packages/sklearn/base.py:318: UserWarning: Trying to unpickle estimator GradientBoostingClassifier from version 0.20.3 when using version 0.22.1. This might lead to breaking code or invalid results. Use at your own risk.
May 05 13:59:19 ores2001 celery-ores-worker[25245]:   UserWarning)
May 05 13:59:19 ores2001 celery-ores-worker[25245]: /srv/deployment/ores/deploy-cache/revs/5612f30290e00e4d76500b65d67cae7ac102c3ac/venv/lib/python3.5/site-packages/sklearn/utils/deprecation.py:144: FutureWarning: The sklearn.tree.tree module is  deprecated in version 0.22 and will be removed in version 0.24. The corresponding classes / functions should instead be imported from sklearn.tree. Anything that cannot be imported from sklearn.tree is now part of the private API.
May 05 13:59:19 ores2001 celery-ores-worker[25245]:   warnings.warn(message, FutureWarning)
May 05 13:59:19 ores2001 celery-ores-worker[25245]: /srv/deployment/ores/deploy-cache/revs/5612f30290e00e4d76500b65d67cae7ac102c3ac/venv/lib/python3.5/site-packages/sklearn/base.py:318: UserWarning: Trying to unpickle estimator DecisionTreeRegressor from version 0.20.3 when using version 0.22.1. This might lead to breaking code or invalid results. Use at your own risk.
May 05 13:59:19 ores2001 celery-ores-worker[25245]:   UserWarning)

Can we figure out what request caused this error?

It's likely that a model was accidentally built with the wrong version of sklearn. This is a relatively easy fix if we know which model it is.
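The UserWarning in the journalctl output above ("Trying to unpickle estimator GradientBoostingClassifier from version 0.20.3 when using version 0.22.1") is the telltale: sklearn embeds the fitting version inside each pickled estimator and warns on mismatch at load time. A pre-deploy check could turn that warning into a hard failure — a sketch with a toy model (in the real case you'd pickle.load() the built .model file instead of fitting one here):

```python
import pickle
import warnings

from sklearn.ensemble import GradientBoostingClassifier

# Toy stand-in for a real model file.
clf = GradientBoostingClassifier(n_estimators=5)
clf.fit([[0], [1], [2], [3]], [0, 0, 1, 1])
blob = pickle.dumps(clf)

# sklearn warns (UserWarning) when the unpickling version differs from the
# fitting version; escalating it to an error makes the mismatch unmissable.
with warnings.catch_warnings():
    warnings.simplefilter("error", UserWarning)
    model = pickle.loads(blob)  # raises here if the versions differ
```

Running this check against each model file with the production venv active would catch a wrong-sklearn build before it reaches the deploy repo.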

@Halfak I see mostly 'model_names': ['reverted', 'articletopic'] for viwiki in codfw that leads to errors.

Aha! Looks like https://ores-beta.wmflabs.org/v3/scores/viwiki/123125/articletopic raises the error.

This is surprising because the viwiki model was built in batch with the others.

In my opinion we should roll back, work on a patch, and re-deploy once we are confident, after more testing. Thoughts?

I'll try to find some time this evening to rebuild the viwiki model with the right version of sklearn.

In the meantime, this is unlikely to be causing issues for users right now. The only tool that uses the reverted model is Huggle. Huggle won't see the precached value, but it makes requests in near realtime so it shouldn't affect performance anyway. The articletopic model for viwiki is new, so it likely doesn't have any use yet.

@Halfak what is the likelihood that other models have the same issues, but we haven't seen errors yet due to not enough requests ending up in ERRORS?

Also, is there documentation about how to build models for ORES? This is something that we as ML Team should be able to do in case of need, so better to know how it gets done if possible :)

The pipelines are documented/automated in the relevant Makefiles. E.g. if you install the dependencies for https://github.com/wikimedia/drafttopic, delete the old viwiki models, and run make models, it should rebuild the relevant models.
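Concretely, rebuilding the viwiki models might look roughly like this (the model filename pattern is a guess — check the repo's models/ directory and Makefile for the actual target names):

```shell
git clone https://github.com/wikimedia/drafttopic
cd drafttopic
pip install -r requirements.txt    # pulls the pinned revscoring/sklearn versions
rm models/viwiki.articletopic.*    # filename pattern is a guess; check models/
make models                        # make only rebuilds targets whose files are missing
```

The key point is doing the pip install first, so the rebuild happens under the same sklearn version that production will unpickle with.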

The pipelines are documented/automated in the relevant Makefiles. E.g. if you install the dependencies for https://github.com/wikimedia/drafttopic, delete the old viwiki models, and run make models, it should rebuild the relevant models.

Thanks a lot!

Sorry I missed one of your other questions.

@Halfak what is the likelihood that other models have the same issues, but we haven't seen errors yet due to not enough requests ending up in ERRORS?

It is unlikely given that precaching is hitting most of the models constantly. I just ran through all of the other newly deployed models and I was able to confirm that they are not returning an error for some random revision.
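That kind of sweep is easy to script. A sketch — the endpoint layout follows the v3 URLs used elsewhere in this task, and the target list is illustrative, not the full set of newly deployed models:

```python
# Hypothetical smoke-test helper: build one score URL per (wiki, model) pair,
# ready to be fetched and checked for an "error" key in the response.
BASE = "https://ores.wikimedia.org/v3/scores"

def check_urls(targets, rev_id):
    """Build one check URL per (wiki, model) pair for a given revision id."""
    return [f"{BASE}/{wiki}/{rev_id}/{model}" for wiki, model in targets]

targets = [("viwiki", "articletopic"), ("viwiki", "reverted"),
           ("nlwiki", "articlequality")]
for url in check_urls(targets, 123125):
    print(url)
```

Fetching each URL (e.g. with requests) and flagging any response containing an error key would reproduce the manual check described above.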

Found a few minutes. Rebuild in progress.

@Halfak quick check in to understand the status of the fix (and if my team should follow up to fix the regression etc..) :)

The fix is merged. Because of T212818, you'll need to manually propagate the changes to gerrit for the drafttopic repo before updating the deploy repo and re-deploying to Beta.

I'm surprised to not see this documented on the https://wikitech.wikimedia.org/wiki/ORES/Deployment page. I remember walking @kevinbazira through this with the intention of getting it documented. I bet it is documented in a task somewhere but didn't get copied to the wiki page. Maybe he remembers where?

IIUC the next steps should be to run something like T212818#4865070 for drafttopic, then update the related submodule in the deploy repo, and then re-test in Beta.
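As a generic sketch of the submodule bump (the deploy-repo path and submodule path here are assumptions — the exact procedure is the one in T212818#4865070):

```shell
cd /srv/deployment/ores/deploy     # deploy repo path assumed
cd submodules/drafttopic           # submodule path is a guess; check .gitmodules
git fetch origin && git checkout origin/master
cd ../..
git add submodules/drafttopic
git commit -m "Update drafttopic to its latest version"
# then push the change for review and, once merged, scap deploy to Beta
```

This only records the new submodule pointer in the deploy repo; the actual propagation to gerrit still follows the T212818 workflow.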

@kevinbazira if you have time we can work on this together to get the new changes deployed and tested in Beta (so we'll hopefully do a quick deployment afterwards). What do you think?

@elukey, that's fine, I set up a meeting and shared it with you on your calendar. Please feel free to adjust the time to a favourable one.

Change 689884 had a related patch set uploaded (by Elukey; author: Elukey):

[mediawiki/services/ores/deploy@master] Update drafttopic to its latest version

https://gerrit.wikimedia.org/r/689884

Change 689884 merged by Elukey:

[mediawiki/services/ores/deploy@master] Update drafttopic to its latest version

https://gerrit.wikimedia.org/r/689884

https://ores-beta.wmflabs.org/v3/scores/viwiki/123125/articletopic seems to be working now :)

@Halfak do you want to do a sanity check in beta before we move to prod?

Mentioned in SAL (#wikimedia-operations) [2021-05-13T07:10:47Z] <kevinbazira@deploy1002> Started deploy [ores/deploy@8fd23ed]: Regular ORES Deployment T278723

Mentioned in SAL (#wikimedia-operations) [2021-05-13T07:43:37Z] <kevinbazira@deploy1002> Finished deploy [ores/deploy@8fd23ed]: Regular ORES Deployment T278723 (duration: 32m 50s)

The ORES deployment with rebuilt viwiki models has been completed. Thanks to @elukey!

In case there are any issues we could have missed, please let us know.

Metrics are good, no weird ERRORs registered in logstash afaics, and https://ores.wikimedia.org/v3/scores/viwiki/123125/articletopic works. I think we can call this task done :)

Please re-open if anything is missing!