Investigate Gobblin failures
Closed, Resolved, Public

Description

We have seen more Gobblin errors over the past two weeks. From what we saw in the logs, they appear to be related to a Kubernetes resource constraint issue.
We should investigate and make sure we understand the root cause of this problem.

Note: The dpe-sre and search teams have been testing semantic-search infra changes on the dse-k8s-eqiad cluster during the same weeks. Could it be related?

Event Timeline

JAllemandou renamed this task from Investigate Gobblin failures between 2026-02-16 and 2026-03-07 to Investigate Gobblin failures. Mar 27 2026, 9:51 AM
JAllemandou updated the task description.

We have experienced failures in the past few days (March 24, 25, 26, 27).
Here's a detailed summary of the failures:
False errors:

  • Airflow log retrieval failure. The underlying task was successful.

Real errors:

  • 8x connection problem to metawiki API:
FailureRequest to uri https://meta.wikimedia.org/w/api.php?format=json&action=streamconfigs&all_settings=true failed. BasicHttpResult(failure)  encountered local exception: Connect to mw-api-int-ro.discovery.wmnet:4446 [mw-api-int-ro.discovery.wmnet/10.2.2.81] failed: Connection timed out (Connection timed out)
  • Error with file availability in task pod. Possibly related to git sync synchronicity.
skein.exceptions.DriverError: Failed to submit application, exception:
File file:/opt/airflow/dags/.worktrees/9a4d46e7e7fd1cf2774e8a92eec5724f00e250ab/main/dags/gobblin/config/analytics-common.properties does not exist
  • Error with Yarn launching skein app.
org.apache.hadoop.yarn.exceptions.InvalidApplicationMasterRequestException: Application doesn't exist in cache appattempt_1773779850057_1912_000001
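
For anyone tallying these by hand, here is a minimal sketch of how the three real error types above could be told apart from their log text. The class name, category names, and matching substrings are all hypothetical and only derived from the messages quoted in this task. The Airflow log retrieval false error is not covered because no log message for it is quoted here.

```java
import java.util.EnumMap;
import java.util.List;
import java.util.Map;

// Hypothetical triage helper. Category names and substrings are assumptions
// based only on the error messages quoted in this task.
class GobblinFailureTriage {

    enum Category { METAWIKI_API_TIMEOUT, MISSING_DAG_FILE, YARN_SKEIN_LAUNCH, UNKNOWN }

    // Map one log line to a category by looking for the quoted signatures.
    static Category classify(String logLine) {
        if (logLine.contains("action=streamconfigs") && logLine.contains("Connection timed out")) {
            return Category.METAWIKI_API_TIMEOUT;
        }
        if (logLine.contains("/.worktrees/") && logLine.contains("does not exist")) {
            return Category.MISSING_DAG_FILE;
        }
        if (logLine.contains("InvalidApplicationMasterRequestException")) {
            return Category.YARN_SKEIN_LAUNCH;
        }
        return Category.UNKNOWN;
    }

    // Count how many lines fall into each category.
    static Map<Category, Integer> tally(List<String> logLines) {
        Map<Category, Integer> counts = new EnumMap<>(Category.class);
        for (String line : logLines) {
            counts.merge(classify(line), 1, Integer::sum);
        }
        return counts;
    }
}
```

Calling `GobblinFailureTriage.tally(lines)` on the collected failure messages would give per-category counts like the ones listed above. Anything landing in `UNKNOWN` still needs a manual look.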

More failures from March 27th to March 30th:

Real errors:

  • 2x Error with file availability in task pod. Possibly related to git sync synchronicity.
skein.exceptions.DriverError: Failed to submit application, exception:
File file:/opt/airflow/dags/.worktrees/9a4d46e7e7fd1cf2774e8a92eec5724f00e250ab/main/dags/gobblin/config/analytics-common.properties does not exist
  • 4x connection problem to metawiki API:

FailureRequest to uri https://meta.wikimedia.org/w/api.php?format=json&action=streamconfigs&all_settings=true failed. BasicHttpResult(failure) encountered local exception: Connect to mw-api-int-ro.discovery.wmnet:4446 [mw-api-int-ro.discovery.wmnet/10.2.2.81] failed: Connection timed out (Connection timed out)

False errors:

  • 1x Airflow log retrieval failure. The underlying task was successful.

FailureRequest to uri https://meta.wikimedia.org/w/api.php?format=json&action=streamconfigs&all_settings=true failed. BasicHttpResult(failure) encountered local exception: Connect to mw-api-int-ro.discovery.wmnet:4446 [mw-api-int-ro.discovery.wmnet/10.2.2.81] failed: Connection timed out (Connection timed out)

This is a client side timeout, yes? I wonder what our client timeout is...

We use Apache HttpClient version 4.5 (https://gerrit.wikimedia.org/r/plugins/gitiles/wikimedia-event-utilities/+/refs/heads/master/pom.xml#93).
Since version 4.3, the default connection timeout is whatever the system sets (https://stackoverflow.com/questions/9734384/default-timeout-for-httpcomponent-client).
I don't know what that value is on our system (probably 60s?).
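
To illustrate the difference between relying on the system default and setting an explicit connect timeout, here is a minimal sketch. Note that it uses the JDK's built-in `java.net.http.HttpClient` as a stand-in, not the Apache client that wikimedia-event-utilities uses. In Apache HttpClient 4.x the equivalent knob should be `RequestConfig.Builder.setConnectTimeout`. The class name and the 10 second value are hypothetical, not something decided in this task.

```java
import java.net.http.HttpClient;
import java.time.Duration;

// Sketch only: uses the JDK HTTP client as a stand-in for Apache HttpClient 4.5.
class ConnectTimeoutSketch {

    // Hypothetical value for illustration, not a recommendation from this task.
    static final Duration ASSUMED_CONNECT_TIMEOUT = Duration.ofSeconds(10);

    // No connect timeout configured: the client itself imposes no limit,
    // so a connection attempt is bounded only by the OS/network stack.
    static HttpClient withoutExplicitTimeout() {
        return HttpClient.newHttpClient();
    }

    // Explicit connect timeout: connection attempts fail after the given duration.
    static HttpClient withExplicitTimeout(Duration timeout) {
        return HttpClient.newBuilder()
                .connectTimeout(timeout)
                .build();
    }
}
```

`withoutExplicitTimeout().connectTimeout()` returns an empty `Optional`, while `withExplicitTimeout(ASSUMED_CONNECT_TIMEOUT).connectTimeout()` returns the configured duration. That makes it easy to check which behaviour a given client has before chasing the system-level value.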

2 failed instances this morning:

  • 2x connection problem to metawiki API.

No recent failures. Closing for now.