
Migrate Tech Wishes scraper to gitlab.wikimedia.org
Open, Needs Triage, Public

Description

In order to use the Wikimedia Airflow "artifact" mechanism, we'll need to move https://gitlab.com/wmde/technical-wishes/scrape-wiki-html-dump/ to the Wikimedia GitLab infrastructure. This was discovered when the production Airflow failed to load our packaged binary from gitlab.com.

Steps:

Importing groups by direct transfer is currently disabled. Please ask your Administrator to enable it in the Admin settings. Remember to enable it also on the instance you are migrating from.

Event Timeline

I'm surprised that this is the case. Can you share the error and the definition of the artifact() mechanism? Perhaps this is a config issue.

https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/jobs/761299 shows the issue

excerpt mildly de-mangled and unwrapped:

Exception occurred while synchronizing page-summary-scraper-0.6.1.tgz
https://gitlab.com/wmde/technical-wishes/scrape-wiki-html-dump/-/package_files/278501910/download
Traceback (most recent call last):
  File "/home/blunderbuss/.cache/pypoetry/virtualenvs/blunderbuss-2uZo5AhP-py3.11/lib/python3.11/site-packages/aiohttp/connector.py", line 1298, in _wrap_create_connection
    sock = await aiohappyeyeballs.start_connection(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/blunderbuss/.cache/pypoetry/virtualenvs/blunderbuss-2uZo5AhP-py3.11/lib/python3.11/site-packages/aiohappyeyeballs/impl.py", line 141, in start_connection
    raise OSError(first_errno, msg)
TimeoutError: [Errno 110] Multiple exceptions: [Errno 110] Connect call failed ('2606:4700:90:0:f22e:fbec:5bed:a9b9', 443, 0, 0), [Errno 110] Connect call failed ('172.65.251.78', 443)

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/blunderbuss/.cache/pypoetry/virtualenvs/blunderbuss-2uZo5AhP-py3.11/lib/python3.11/site-packages/fsspec/implementations/http.py", line 437, in _info
    await _file_info(
  File "/home/blunderbuss/.cache/pypoetry/virtualenvs/blunderbuss-2uZo5AhP-py3.11/lib/python3.11/site-packages/fsspec/implementations/http.py", line 849, in _file_info
    r = await session.get(url, allow_redirects=ar, **kwargs)
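The repeated Errno 110 connect timeouts suggest the Airflow worker simply has no network egress to gitlab.com on port 443, rather than a problem with the artifact itself. A minimal sketch to confirm that from the worker host (the helper name is mine, not part of the scraper or of Airflow):

```python
# Sketch: check whether this host can open a TCP connection to the
# artifact's server, mirroring the Errno 110 connect timeouts in the job log.
import socket
from urllib.parse import urlparse


def can_reach(url: str, port: int = 443, timeout: float = 5.0) -> bool:
    """True if a TCP connection to the URL's host succeeds within timeout."""
    host = urlparse(url).hostname
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers TimeoutError (Errno 110) and refused connections
        return False
```

Run from a production Airflow worker, this would presumably return False for the gitlab.com download URL while succeeding for hosts inside Wikimedia infrastructure.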

The artifact was defined like so, in wmde/config/artifacts.yaml:

page-summary-scraper-0.6.1.tgz:
  id: https://gitlab.com/wmde/technical-wishes/scrape-wiki-html-dump/-/package_files/278501910/download
  source: url
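After the migration, the entry would presumably keep the same shape but point at the new host. A hypothetical sketch of the updated wmde/config/artifacts.yaml entry, where the project path and package-file id are placeholders, not real values:

```yaml
# Hypothetical post-migration entry in wmde/config/artifacts.yaml.
# The group path and package-file id below are placeholders.
page-summary-scraper-0.6.1.tgz:
  id: https://gitlab.wikimedia.org/repos/<group>/scrape-wiki-html-dump/-/package_files/<package_file_id>/download
  source: url
```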

Just for the record, I think it's a good policy to not pull binaries from servers outside of Wikimedia infrastructure—so I'm not asking for a change here, only providing information about what went wrong :-)

Ah, yes, yes, I had missed that. Carry on with this task.

Pausing a moment before killing the old repo.

On second thought, I'm going to say there's no rush to kill off the old repo yet. Forking and syncing new changes is exactly what git is good at.

@awight / @lilients_WMDE This task is opened but has no active project tag. Could you please associate one? Thanks!