
research/mwaddlink has failing CI on the main branch
Closed, Resolved · Public

Description

As of today, the CI setup for research/mwaddlink fails on the main branch; see https://gerrit.wikimedia.org/r/c/research/mwaddlink/+/1000867, which fails with:

INFO: Scanner configuration file: /opt/sonar-scanner/conf/sonar-scanner.properties
INFO: Project root configuration file: /srv/app/sonar-project.properties
INFO: SonarScanner 4.6.0.2311
INFO: Java 11.0.18 Debian (64-bit)
INFO: Linux 4.19.0-25-amd64 amd64
INFO: User cache: /cache/sonar/cache
INFO: Scanner configuration file: /opt/sonar-scanner/conf/sonar-scanner.properties
INFO: Project root configuration file: /srv/app/sonar-project.properties
INFO: Analyzing on SonarCloud
INFO: Default locale: "en", source code encoding: "UTF-8" (analysis is platform dependent)
INFO: Load global settings
INFO: Load global settings (done) | time=452ms
INFO: Server id: 1BD809FA-AWHW8ct9-T_TB3XqouNu
INFO: User cache: /cache/sonar/cache
INFO: Loading required plugins
INFO: Load plugins index
INFO: Load plugins index (done) | time=406ms
INFO: Load/download plugins
INFO: Load/download plugins (done) | time=1136ms
INFO: Load project settings for component key: 'research-mwaddlink'
INFO: Load project settings for component key: 'research-mwaddlink' (done) | time=341ms
INFO: Process project properties
INFO: Project key: research-mwaddlink
INFO: Base dir: /srv/app
INFO: Working dir: /srv/app/.scannerwork
INFO: Load project branches
INFO: Load project branches (done) | time=369ms
INFO: Check ALM binding of project 'research-mwaddlink'
INFO: Detected project binding: NOT_BOUND
INFO: Check ALM binding of project 'research-mwaddlink' (done) | time=587ms
INFO: Load project pull requests
INFO: Load project pull requests (done) | time=340ms
INFO: Load branch configuration
INFO: Load branch configuration (done) | time=2ms
INFO: Load quality profiles
INFO: Load quality profiles (done) | time=643ms
INFO: Load active rules
INFO: Load active rules (done) | time=10399ms
INFO: Organization key: wmftest
INFO: Branch name: 1000867-1, type: short-lived
WARN: The property 'sonar.login' is deprecated and will be removed in the future. Please use the 'sonar.token' property instead when passing a token.
INFO: Preprocessing files...
INFO: 1 language detected in 20 preprocessed files
INFO: 0 files ignored because of inclusion/exclusion patterns
INFO: 8 files ignored because of scm ignore settings
INFO: ------------------------------------------------------------------------
INFO: EXECUTION FAILURE
INFO: ------------------------------------------------------------------------
INFO: Total time: 18.662s
INFO: Final Memory: 10M/188M
INFO: ------------------------------------------------------------------------
ERROR: Error during SonarScanner execution
ERROR: 

The version of Java (11.0.18) used to run this analysis is deprecated, and SonarCloud no longer supports it. Please upgrade to Java 17 or later.
You can find more information here: https://docs.sonarsource.com/sonarcloud/appendices/scanner-environment/

Event Timeline

Restricted Application added a subscriber: Aklapper.

Blocks research/mwaddlink merges.

Change 1001958 had a related patch set uploaded (by Kosta Harlan; author: Kosta Harlan):

[research/mwaddlink@main] build: Use Java 17 for codehealth job

https://gerrit.wikimedia.org/r/1001958

@Urbanecm_WMF I'm stepping away from this one. Hopefully the patch linked here will help you. There are a couple of issues:

  1. Old version of sonar-scanner image that uses Java 11. This is easy to fix by updating the sonar-scanner image to use the new version with Java 17
  2. scikit-learn doesn't build -- fixed by updating from 0.2.x to 1.4.x. Not sure if that breaks things in the pipeline code
  3. scikit-learn 1.4.x needs a newer version of joblib (0.16.0->1.2.0), however, joblib 1.2.0 needs python > 3.7
  4. Update to debian bullseye to get python 3.9
    1. noting that stat1008 has python3.7 by default, not sure if python3.9 is available there

The patch fails on Werkzeug:

==================================== ERRORS ====================================
______________________ ERROR collecting tests/test_app.py ______________________
ImportError while importing test module '/srv/app/tests/test_app.py'.
Hint: make sure your test modules/packages have valid Python names.
Traceback:
/usr/lib/python3.9/importlib/__init__.py:127: in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
tests/test_app.py:7: in <module>
    import app
app.py:2: in <module>
    from flask import (
.tox/pytest-integration/lib/python3.9/site-packages/flask/__init__.py:7: in <module>
    from .app import Flask as Flask
.tox/pytest-integration/lib/python3.9/site-packages/flask/app.py:28: in <module>
    from . import cli
.tox/pytest-integration/lib/python3.9/site-packages/flask/cli.py:18: in <module>
    from .helpers import get_debug_flag
.tox/pytest-integration/lib/python3.9/site-packages/flask/helpers.py:16: in <module>
    from werkzeug.urls import url_quote
E   ImportError: cannot import name 'url_quote' from 'werkzeug.urls' (/srv/app/.tox/pytest-integration/lib/python3.9/site-packages/werkzeug/urls.py)

I guess next step is to try updating Werkzeug library.
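For context on that error: `url_quote` was deprecated in Werkzeug 2.3 and removed in 3.0, so an old Flask paired with a new Werkzeug breaks on import. The proper fix is to pin Flask and Werkzeug to compatible versions; as a stopgap, the stdlib offers a near equivalent (a sketch only — note the defaults differ slightly, e.g. Werkzeug's version treated `:` as safe):

```python
from urllib.parse import quote

try:
    # Present in Werkzeug < 3.0, removed afterwards.
    from werkzeug.urls import url_quote
except ImportError:
    def url_quote(string, safe="/:"):
        # Stdlib near-equivalent; Werkzeug defaulted to safe="/:".
        return quote(string, safe=safe)

print(url_quote("a b/c:d"))  # -> a%20b/c:d
```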

Anyway, all of this would require some careful testing and review of the pipeline code and the app serving code.

Please feel free to use the patch I submitted, if it is helpful.

hashar subscribed.

Definitely. That is because the PipelineLib jobs are entirely defined in the source repositories, and the images/definitions are neither tracked nor managed centrally. For busy enough repositories, such failures tend to be caught rather early; this research/mwaddlink repo seems to have drifted a bit.

Looks like it now fails on:

.tox/pytest-integration/lib/python3.9/site-packages/flask/helpers.py:16: in <module>
    from werkzeug.urls import url_quote
E   ImportError: cannot import name 'url_quote' from 'werkzeug.urls' (/srv/app/.tox/pytest-integration/lib/python3.9/site-packages/werkzeug/urls.py)

I guess that is related to Python modules being out of sync somehow.

  1. Update to debian bullseye to get python 3.9
    1. noting that stat1008 has python3.7 by default, not sure if python3.9 is available there

I checked stat1008 and python3.9 is not available as shown below:

kevinbazira@stat1008:~/fix-add-a-link-CI-deps/mwaddlink$ virtualenv -p python3.9 venv
The path python3.9 (from --python=python3.9) does not exist

The model training pipeline runs on stat1008; without python3.9, this process will be blocked.
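Since the pipeline may land on hosts that only have python3.7, a small guard at the training entry point could fail fast instead of crashing mid-run (a minimal sketch; the minimum version and the message are illustrative, not from the repo):

```python
import sys

MINIMUM = (3, 9)  # illustrative bound

def meets_minimum(version_info, minimum=MINIMUM):
    """True if the running interpreter is at least `minimum` (major, minor)."""
    return tuple(version_info[:2]) >= minimum

if __name__ == "__main__":
    if not meets_minimum(sys.version_info):
        sys.exit(f"python {MINIMUM[0]}.{MINIMUM[1]}+ required, "
                 f"found {sys.version_info.major}.{sys.version_info.minor}")
    print("interpreter OK")
```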

There are two solutions I was thinking through:

1. Update only the model serving requirements, since these are the dependencies currently affecting CI.
IIUC, app.py is the one that relies on the werkzeug dependency. Is app.py only used in the Flask app that creates an API to query add-a-link models?

If that's the case, then we could update requirements-query.txt and not requirements.txt.

Pros: Fixes CI. Doesn't block model training.
Cons: Model training and model serving environments won't match which may cause unforeseen errors in production.

2. Update both the model training and serving requirements, since this will achieve more predictable results in prod.

The models in prod were trained and saved using pickle protocol 4. In order for the sqlitedict dependency to continue using pickle protocol 4, the pipeline had to use python3.7.

If we are ready to retrain and redeploy all models, then we could update both requirements-query.txt and requirements.txt to use python3.10.

Pros: Fixes CI. Doesn't block model training. Model training and model serving environments match.
Cons: We'll have to retrain and redeploy 250+ models.
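On the pickle point: protocol 4 pickles remain readable on every Python >= 3.4, so reading the existing models on a newer interpreter is safe; the risk runs the other way, because sqlitedict's default encoder uses pickle.HIGHEST_PROTOCOL (5 on Python >= 3.8), which Python 3.7 cannot read. If sqlitedict is given custom encode/decode callables, the protocol can be pinned; a sketch (the payload and helper names are illustrative):

```python
import pickle

PICKLE_PROTOCOL = 4  # readable by every Python >= 3.4

def encode(obj):
    # Pin the protocol rather than relying on pickle.HIGHEST_PROTOCOL,
    # which is 5 on Python >= 3.8 and unreadable on Python 3.7.
    return pickle.dumps(obj, protocol=PICKLE_PROTOCOL)

def decode(blob):
    return pickle.loads(blob)

blob = encode({"anchor": "example", "probability": 0.9})
# Protocol >= 2 pickles start with the PROTO opcode (0x80) followed by
# the protocol number, so the pin is easy to verify on the wire format.
assert blob[0] == 0x80 and blob[1] == PICKLE_PROTOCOL
assert decode(blob) == {"anchor": "example", "probability": 0.9}
```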

CCing @MGerlach, in case he would like to share his thoughts on this issue.

Following T357217#9534633, I had a chat with @MGerlach and option 1 is the most feasible at the moment.

Building on top of @kostajh's patch, I have updated requirements-query.txt with scikit-learn==1.0.2 to match requirements.txt.
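For reference, the shape of that change is a pin in the serving requirements only (the scikit-learn pin is from this task; the comment is mine):

```
# requirements-query.txt -- serving-only pins; requirements.txt (training) stays put
scikit-learn==1.0.2
```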

Testing locally, the images built successfully for the build, production-build, and production variants. The test variant ran into the issue below:

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
tox 4.12.1 requires chardet>=5.2, but you have chardet 3.0.4 which is incompatible.
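That message means the image ships chardet 3.0.4 while tox 4.12.1 declares chardet>=5.2. Assuming the old chardet comes in via the test image's preinstalled packages, the usual resolution is to require the newer one wherever that image's Python dependencies are declared (fragment illustrative):

```
# test-image requirements fragment (illustrative)
chardet>=5.2  # tox 4.x requires >= 5.2
```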

Going to push a WIP patch for this.

Finally got the test build to succeed. @kostajh and @Urbanecm_WMF please review whenever you get a minute: https://gerrit.wikimedia.org/r/1001958. Thanks!

Hi @kevinbazira, thanks for working on this issue and for submitting a patch. I'm, however, concerned that merging https://gerrit.wikimedia.org/r/c/research/mwaddlink/+/1001958 could come with significant side effects. That patch changes the production build significantly, and that build is what we use in the production linkrecommendation service. Namely, it upgrades the OS from buster to bullseye, which also bumps the Python version, and it upgrades the query requirements. All of those changes are made without touching how new datasets are generated.

In other words, if we merge that patch, generating datasets would use older libraries, an older OS, older Python, etc., only for the output to be read by significantly newer software. In a way, I feel we'd be planting a time bomb by merging the patch. We could easily find ourselves in a situation where we produce datasets the production service is unable to read because of library/OS inconsistencies.

In general, I think CI should use the same versions as we use in production. If we're stuck with Python 3.7 on the dataset generation end, I think we should use Python 3.7 everywhere else. Otherwise, CI wouldn't actually test what the application will do after the change (since production runs in a very different environment), which rather defeats the purpose of CI.

Before moving ahead, I'd like to understand why upgrading Java requires updating Python itself or Python libraries. Shouldn't that be independent?

Hi @Urbanecm_WMF, both you and I share the same concerns regarding model training matching model serving requirements as I wrote in T357217#9534633.

Since I had built on top of Kosta's patch in T357217#9541884, he was kind enough to share more context on gerrit: https://gerrit.wikimedia.org/r/c/research/mwaddlink/+/1001958/comments/165454ac_2367cabe

The tl;dr is that the current sonar codehealth image, which has Java 17, doesn't support python3.7. If we found a way to copy coverage reports from the test variant to the codehealth variant, we could retain the Python 3.7 build variant as proposed by Kosta. (see T307772)

Hi @hashar, in this blubber file, we have been trying to copy files from the test to the codehealth variant but keep running into the error below:

#6 local://context
#6 sha256:177a22c08429ba9a74acccc3451f709da925e7f932e391246aa7c960ee0b7b7d
#6 DONE 0.0s
failed to solve with frontend dockerfile.v0: failed to solve with frontend gateway.v0: rpc error: code = Unknown desc = failed to compile to LLB state: preparation: pull access denied, repository does not exist or may require authorization: server message: insufficient_scope: authorization failed

Here is the full build log: https://integration.wikimedia.org/ci/job/research-mwaddlink-pipeline-test/1023/execution/node/86/log/

What could we be missing in the copies configuration that leads to this permission issue?

Change 1004131 had a related patch set uploaded (by Urbanecm; author: Urbanecm):

[research/mwaddlink@main] build: Instead of re-running tox in codehealth, reuse coverage data

https://gerrit.wikimedia.org/r/1004131

[...]
What could we be missing in the copies configuration that leads to this permission issue?

It doesn't work, because you've added it into an incorrect file. The copies declaration in blubber.yaml copies files between build variants when building the container, but we need to do the copying while running the container, which means it needs to be in the .pipeline/config.yaml file (unhelpfully, it is called copy there). I uploaded https://gerrit.wikimedia.org/r/1004131, which fixes the problem. No idea how it results in the permission error specifically though.
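For readers unfamiliar with the distinction: blubber.yaml `copies` moves files between build variants at image build time, while PipelineLib's stage-level `copy` extracts files from the container after it runs. A rough sketch of the latter (field names per my reading of the PipelineLib docs; stage names and the path are assumptions, not this repo's actual config):

```yaml
# .pipeline/config.yaml (illustrative fragment)
pipelines:
  test:
    blubberfile: blubber.yaml
    stages:
      - name: test
        build: test
        run: true
        copy:
          - coverage.xml        # extracted from the container after the run
      - name: codehealth
        build: codehealth
        run: true
```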

Change 1004131 merged by jenkins-bot:

[research/mwaddlink@main] build: Instead of re-running tox in codehealth, reuse coverage data

https://gerrit.wikimedia.org/r/1004131

Change 1001958 abandoned by Kosta Harlan:

[research/mwaddlink@main] [WIP] blubber: support Java 17 for codehealth job

Reason:

I465b73664b087c96fe6a056bc642b00b726a0c27

https://gerrit.wikimedia.org/r/1001958