
Deploy the retrained model
Closed, Resolved · Public

Description

Once we have improved the signals and gathered enough new training data, we need to retrain the model and deploy it.

Acceptance criteria:

  • ORES uses the new improved Item quality model on Wikidata
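One way to check this criterion after a deploy is to compare the model version ORES reports before and after. The following is only a minimal sketch: it assumes the v3 scores endpoint for a wiki returns JSON shaped like `{"wikidatawiki": {"models": {"itemquality": {"version": ...}}}}`, and the helper names, sample payload, and version strings are made-up placeholders rather than real values.

```python
import json


def model_version(payload, wiki="wikidatawiki", model="itemquality"):
    """Return the reported version of `model` on `wiki`, or None if missing.

    Assumes the payload shape described above (an assumption, not verified
    against a live ORES instance).
    """
    return payload.get(wiki, {}).get("models", {}).get(model, {}).get("version")


def is_upgraded(payload, old_version, wiki="wikidatawiki", model="itemquality"):
    """True if a version is reported and it differs from `old_version`."""
    version = model_version(payload, wiki, model)
    return version is not None and version != old_version


# Hypothetical response body; the version string is a placeholder.
sample = json.loads(
    '{"wikidatawiki": {"models": {"itemquality": {"version": "0.0.2"}}}}'
)

print(model_version(sample))
print(is_upgraded(sample, old_version="0.0.1"))
```

In practice the payload would be fetched from the ORES API once before and once after the deploy, and the two versions compared.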

Event Timeline

Restricted Application added a subscriber: Aklapper. · Aug 26 2020, 4:19 PM
Lydia_Pintscher renamed this task from "deploy the retraien the model" to "deploy the retrained the model". · Aug 26 2020, 4:19 PM
Restricted Application added a project: Wikidata. · Aug 26 2020, 4:19 PM
Lydia_Pintscher renamed this task from "deploy the retrained the model" to "deploy the retrained model". · Aug 26 2020, 4:31 PM
Lydia_Pintscher renamed this task from "deploy the retrained model" to "Deploy the retrained model".
Restricted Application added a project: User-Ladsgroup. · Sep 8 2020, 11:25 PM

Current status: pull request needs to be merged and then pushed to production

Change 636463 had a related patch set uploaded (by Ladsgroup; owner: Ladsgroup):
[mediawiki/services/ores/deploy@master] Upgrade articlequality to master

https://gerrit.wikimedia.org/r/636463

This git lfs thing is a mess... I hope T264651 (Migrate ORES/Revscoring/etc. repos to Gitlab or Gerrit) gets done ASAP.

Deploying to beta cluster now.

Change 636463 merged by Ladsgroup:
[mediawiki/services/ores/deploy@master] Upgrade articlequality to master

https://gerrit.wikimedia.org/r/636463

Trying to deploy to beta:

17:13:56 ['/usr/bin/scap', 'deploy-local', '-v', '--repo', 'ores/deploy', '-g', 'default', 'fetch', '--refresh-config'] on deployment-ores01.deployment-prep.eqiad.wmflabs returned [255]: Permission denied (publickey).

17:13:56 connection to deployment-ores01.deployment-prep.eqiad.wmflabs failed and future stages will not be attempted for this target
ores/deploy: fetch stage(s): 100% (ok: 0; fail: 1; left: 0)                     
17:13:56 1 targets had deploy errors
17:13:56 1 targets failed
17:13:56 1 of 1 default targets failed, exceeding limit

What?

Logging in:

$ ssh deployment-ores01.deployment-prep.eqiad.wmflabs
Linux deployment-ores01 4.9.0-8-amd64 #1 SMP Debian 4.9.144-3.1 (2019-02-19) x86_64
Debian GNU/Linux 9.3 (stretch)
The last Puppet run was at Tue Sep  8 22:26:12 UTC 2020 (68806 minutes ago). 

Puppet is not working...
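The "68806 minutes ago" in the login banner is easier to judge in days. A quick sketch of the arithmetic, using only the timestamp and minute count shown in the banner above (the variable names are just for illustration):

```python
from datetime import datetime, timedelta

# Values copied from the login banner (UTC).
last_puppet_run = datetime(2020, 9, 8, 22, 26, 12)
age = timedelta(minutes=68806)

# Break the age into days, hours and minutes.
days = age.days
hours, remainder = divmod(age.seconds, 3600)
minutes = remainder // 60

print(f"{days} days, {hours} hours, {minutes} minutes")  # 47 days, 18 hours, 46 minutes
print(last_puppet_run + age)  # 2020-10-26 17:12:12
```

So Puppet had not run successfully for almost seven weeks, and the implied "now" matches the time of the failed scap run.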

ladsgroup@deployment-ores01:~$ sudo puppet agent -tv
2020-10-26 17:18:17.532026 WARN  puppetlabs.facter - locale environment variables were bad; continuing with LANG=C LC_ALL=C
Info: Using configured environment 'production'
Info: Retrieving pluginfacts
Info: Retrieving plugin
Info: Retrieving locales
Info: Loading facts
Error: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Could not find class role::ores::redis for deployment-ores01.deployment-prep.eqiad.wmflabs on node deployment-ores01.deployment-prep.eqiad.wmflabs
Warning: Not using cache on failed catalog
Error: Could not retrieve catalog; skipping run


It seems that putting ORES on Envoy broke ORES on the beta cluster:

ladsgroup@deployment-ores01:~$ sudo puppet agent -tv
2020-10-26 17:27:00.475030 WARN  puppetlabs.facter - locale environment variables were bad; continuing with LANG=C LC_ALL=C
Info: Using configured environment 'production'
Info: Retrieving pluginfacts
Info: Retrieving plugin
Info: Retrieving locales
Info: Loading facts
Error: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Evaluation Error: Error while evaluating a Resource Statement, Evaluation Error: Error while evaluating a Function Call, If you want non-sni TLS to be supported, you need to define  profile::tlsproxy::envoy::global_cert_name or  profile::tlsproxy::envoy::acme_cert_name (file: /etc/puppet/modules/profile/manifests/tlsproxy/envoy.pp, line: 144, column: 13) on node deployment-ores01.deployment-prep.eqiad.wmflabs
Warning: Not using cache on failed catalog
Error: Could not retrieve catalog; skipping run

Change 636492 had a related patch set uploaded (by Ladsgroup; owner: Ladsgroup):
[mediawiki/services/ores/deploy@master] Bump to HEAD of articlequality again

https://gerrit.wikimedia.org/r/636492

Change 636492 merged by Ladsgroup:
[mediawiki/services/ores/deploy@master] Bump to HEAD of articlequality again

https://gerrit.wikimedia.org/r/636492

Mentioned in SAL (#wikimedia-operations) [2020-10-26T20:08:37Z] <ladsgroup@deploy1001> Started deploy [ores/deploy@6912889]: Deploy new version of articlequality for wikidata (T261326)

Mentioned in SAL (#wikimedia-operations) [2020-10-26T20:15:30Z] <ladsgroup@deploy1001> Finished deploy [ores/deploy@6912889]: Deploy new version of articlequality for wikidata (T261326) (duration: 06m 53s)
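The reported duration can be cross-checked against the two SAL timestamps. A small sketch, with the timestamps copied from the log entries above (the helper name is hypothetical):

```python
from datetime import datetime

SAL_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def deploy_duration(started, finished):
    """Return the elapsed time between two SAL timestamps as 'MMm SSs'."""
    delta = datetime.strptime(finished, SAL_FORMAT) - datetime.strptime(started, SAL_FORMAT)
    minutes, seconds = divmod(int(delta.total_seconds()), 60)
    return f"{minutes:02d}m {seconds:02d}s"


print(deploy_duration("2020-10-26T20:08:37Z", "2020-10-26T20:15:30Z"))  # 06m 53s
```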

The timing of precaching requests dropped by about 20%:

[Graphs: Before, After]

For Wikidata precaching requests the drop is much bigger, around half:

[Graphs: Before, After]

This should help with the capacity issues as well.