
Airflow devenv (WMDE) cannot see webproxy
Closed, Resolved, Public

Description

Some of the tasks in my work-in-progress DAG (https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/1957) need to connect to the Enterprise API, and my task expects to reach it through the normal webproxy. However, webproxy cannot be reached from this environment.

As a diagnostic, I've replaced my bash_command with the following:

nc -X connect -x webproxy:8080 -v api.enterprise.wikimedia.com 443

which times out like so:

[2026-02-17, 12:05:36 UTC] {bash.py:95} INFO - nc: connect to webproxy port 8080 (tcp) failed: Connection timed out
[2026-02-17, 12:07:51 UTC] {bash.py:95} INFO - nc: connect to webproxy port 8080 (tcp) failed: Connection timed out
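
For anyone who would rather run this check from Python than from nc, here is a rough sketch of the same HTTP CONNECT probe. The function names are made up for illustration, and the proxy host and port are whatever you pass in; nothing here is specific to the Airflow setup.

```python
import socket


def build_connect_request(host: str, port: int) -> bytes:
    """Build the HTTP CONNECT request that `nc -X connect` sends to the proxy."""
    return f"CONNECT {host}:{port} HTTP/1.1\r\nHost: {host}:{port}\r\n\r\n".encode("ascii")


def parse_status_code(status_line: bytes) -> int:
    """Extract the numeric status from a line such as b'HTTP/1.1 200 OK'."""
    parts = status_line.decode("iso-8859-1").split(" ", 2)
    if len(parts) < 2 or not parts[0].startswith("HTTP/"):
        raise ValueError(f"not an HTTP status line: {status_line!r}")
    return int(parts[1])


def probe_via_proxy(proxy_host: str, proxy_port: int,
                    target_host: str, target_port: int,
                    timeout: float = 10.0) -> bool:
    """Return True if the proxy accepts a tunnel to the target.

    Raises OSError (including a timeout) if the proxy itself is unreachable,
    which is the failure mode shown in the log lines above.
    """
    with socket.create_connection((proxy_host, proxy_port), timeout=timeout) as sock:
        sock.sendall(build_connect_request(target_host, target_port))
        reply = sock.recv(4096)
    status_line = reply.split(b"\r\n", 1)[0]
    return 200 <= parse_status_code(status_line) < 300
```

Calling probe_via_proxy("webproxy", 8080, "api.enterprise.wikimedia.com", 443) would be the equivalent of the nc command above.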

When I attempt to egress to other internal resources without a proxy jump, the network seems fine:

nc -v analytics-hive.eqiad.wmnet 10000

[2026-02-17, 12:07:51 UTC] {bash.py:95} INFO - Connection to analytics-hive.eqiad.wmnet (2620:0:861:100:10:64:138:7) 10000 port [tcp/*] succeeded!

Event Timeline

Airflow egress works by assigning external-service network policies to specific components (webserver, scheduler, task-pod, gitsync, etc.).

Assuming you need your task pod to connect to the internet (api.enterprise.wikimedia.com points to some AWS ALB domain), you indeed need to:

  • connect to that domain via a proxy
  • set the appropriate HTTP_PROXY env vars (or use some kind of proxy argument to your http request client).
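
As a sketch of the "proxy argument" option, this is roughly what it looks like with Python's standard-library urllib. The helper name is hypothetical and the proxy URL is a parameter you would supply; other HTTP clients have their own equivalent setting.

```python
import urllib.request


def make_proxied_opener(proxy_url: str) -> urllib.request.OpenerDirector:
    """Return an opener that sends http and https requests through proxy_url."""
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(handler)
```

Requests made with opener.open(url) then go through the given proxy without relying on environment variables being set in the task's shell.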

We don't currently have an external-service network policy for webproxy, but we have one for urldownloader, which is used by airflow-platform-eng: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/deployment-charts/+/master/helmfile.d/dse-k8s-services/airflow-platform-eng/values-production.yaml#51
This configuration generates a NetworkPolicy that allows egress to the urldownloader proxy for all Airflow task pods labeled with both component: task-pod (set automatically by Airflow) and proxy: urldownloader (which you need to set in your operator configuration). The second label is required because we don't want to grant internet access to every task in an instance just because one task requires it.
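
To make the two-label requirement concrete, here is a small sketch of how equality-based label selection behaves. It assumes the generated NetworkPolicy selects pods with a plain matchLabels-style selector on these two labels; the generated policy itself is not shown in this thread, and the function name is made up.

```python
# Labels the policy is assumed to require (taken from the description above).
REQUIRED_LABELS = {"component": "task-pod", "proxy": "urldownloader"}


def selected_by_policy(pod_labels: dict, required: dict = REQUIRED_LABELS) -> bool:
    """A pod is selected only if every required label is present with the same value."""
    return all(pod_labels.get(key) == value for key, value in required.items())


# A default task pod only carries the component label, so it is not selected.
default_pod = {"component": "task-pod"}
# A task pod whose operator adds the proxy label is selected.
proxied_pod = {"component": "task-pod", "proxy": "urldownloader"}
```

Under that assumption, selected_by_policy(default_pod) is False and selected_by_policy(proxied_pod) is True, which is why a task without the extra label still cannot reach the proxy.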

Reading your code in a bit more detail, it appears that you're reaching out to the internet from two different locations:

  • from your task pod, via the BashOperator: this should be taken care of by adding the following configuration block to deployment-charts/helmfile.d/dse-k8s-eqiad/airflow-wmde/values-production.yaml, and by ensuring that your task pod is labeled with proxy: urldownloader:
deployment-charts/helmfile.d/dse-k8s-eqiad/airflow-wmde/values-production.yaml
worker:
  proxy:
    urldownloader:
      enabled: true
wmde/dags/wiki_page_cite_references/summarize_pages.py
from kubernetes.client import models as k8s

proxy_env = {
    "http_proxy": "http://url-downloader.eqiad.wikimedia.org:8080",
    "https_proxy": "http://url-downloader.eqiad.wikimedia.org:8080",
    "no_proxy": (
        ".wikipedia.org,.wikimedia.org,.wikibooks.org,.wikinews.org,.wikiquote.org,.wikisource.org,.wikiversity.org,"
        ".wikivoyage.org,.wikidata.org,.wikiworkshop.org,.wikifunctions.org,.wiktionary.org,.mediawiki.org,"
        ".wmfusercontent.org,.w.wiki,.wikimediacloud.org,.wmnet,127.0.0.1,::1"
    ),
}

executor_config_with_proxy = {
    "pod_override": k8s.V1Pod(metadata=k8s.V1ObjectMeta(labels={"proxy": "urldownloader"})),
}
with create_easy_dag(
    dag_id="wiki_page_summary_monthly",
    doc_md=__doc__,
    start_date=props.start_date,
    schedule="@monthly",
    dagrun_timeout=timedelta(days=3),
    tags=["monthly", "to_iceberg", "page_summary"],
    sla=props.sla,
    max_active_runs=1,
) as dag:

    BashSensor(
        task_id="check_enterprise_snapshots",
        env=scraper_env,
        bash_command=elixir_command(
            f'Wiki.ReleaseTasks.check_availability("{ props.wiki }", "{ snapshot_month }")'
        ),
        poke_interval=timedelta(hours=3).total_seconds(),
        executor_config=executor_config_with_proxy,
    )
  • from the Skein job in YARN. That should simply be taken care of by passing env=proxy_env to the SimpleSkeinOperator, as you're already doing.

Let me know how it goes!

Change #1243841 had a related patch set uploaded (by Awight; author: Awight):

[operations/deployment-charts@master] Add helm value to optionally allow egress for airflow-wmde

https://gerrit.wikimedia.org/r/1243841

@brouberol That's amazing, thank you. I'll wait for the chart deployment and will post the outcome here.

Change #1243841 merged by Brouberol:

[operations/deployment-charts@master] Add helm value to optionally allow egress for airflow-wmde

https://gerrit.wikimedia.org/r/1243841

I have merged the patch. It will be taken into account in your devenv after you delete and re-create it.

I think it works!

nc -X connect -x url-downloader.eqiad.wikimedia.org:8080 -v api.enterprise.wikimedia.com 443

INFO - Connection to api.enterprise.wikimedia.com 443 port [tcp/*] succeeded!

Actually, I think it worked even before the chart change, maybe thanks to something about how the charts are inherited from one another...

awight claimed this task.