
Root cause Archiva outage from 2023-09-24
Closed, Resolved, Public

Description

While working on testing T346281 in production we hit an Archiva outage:

Archiva outage:

Because we use Spark 3.2 for the backfill, we created a custom conda environment. To avoid complicating matters further, I am using Spark's Ivy mechanism to pull one dependency at runtime: iceberg-spark-runtime-3.3_2.12-1.2.1.jar. The dependency is fetched through our Archiva instance, which acts as a proxy. This suddenly started failing with:

sudo -u analytics yarn logs -appOwner analytics -applicationId application_1694521537759_58660
...
:::: ERRORS
	SERVER ERROR: Server Error url=https://archiva.wikimedia.org/repository/mirrored/org/apache/iceberg/iceberg-spark-runtime-3.3_2.12/1.2.1/iceberg-spark-runtime-3.3_2.12-1.2.1.pom

	SERVER ERROR: Server Error url=https://archiva.wikimedia.org/repository/mirrored/org/apache/iceberg/iceberg-spark-runtime-3.3_2.12/1.2.1/iceberg-spark-runtime-3.3_2.12-1.2.1.jar
...
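For context, the two failing URLs follow the standard Maven repository layout, so they can be derived from the artifact coordinates. A minimal sketch, assuming nothing beyond that layout: the helper name `maven_artifact_url` is hypothetical, and the base URL and coordinates are copied from the log above.

```python
def maven_artifact_url(base_url: str, group_id: str, artifact_id: str,
                       version: str, extension: str = "jar") -> str:
    """Build the URL of an artifact file in a Maven 2 layout repository."""
    # Dots in the group id become path separators.
    group_path = group_id.replace(".", "/")
    filename = f"{artifact_id}-{version}.{extension}"
    return f"{base_url.rstrip('/')}/{group_path}/{artifact_id}/{version}/{filename}"


# Base URL and coordinates as they appear in the error log.
ARCHIVA_MIRROR = "https://archiva.wikimedia.org/repository/mirrored"

for ext in ("pom", "jar"):
    print(maven_artifact_url(ARCHIVA_MIRROR, "org.apache.iceberg",
                             "iceberg-spark-runtime-3.3_2.12", "1.2.1", ext))
```

Ivy requests the .pom first to resolve metadata and then the .jar, which is why both URLs show up in the error list.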

And indeed, when I tried to pull the artifact manually, I got:

$ curl https://archiva.wikimedia.org/repository/mirrored/org/apache/iceberg/iceberg-spark-runtime-3.3_2.12/1.2.1/iceberg-spark-runtime-3.3_2.12-1.2.1.pom
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1"/>
<title>Error 500 Server Error</title>
</head>
<body><h2>HTTP ERROR 500</h2>
<p>Problem accessing /repository/mirrored/org/apache/iceberg/iceberg-spark-runtime-3.3_2.12/1.2.1/iceberg-spark-runtime-3.3_2.12-1.2.1.pom. Reason:
<pre>    Server Error</pre></p><h3>Caused by:</h3><pre>java.lang.RuntimeException: /var/cache/archiva/temp2871579391051196831: No space left on device
	at org.apache.archiva.proxy.DefaultRepositoryProxyConnectors.createWorkingDirectory(DefaultRepositoryProxyConnectors.java:1078)
...

I've reported this to the SRE team on Slack. As of this writing, a temporary fix has been applied.

The temporary fix was to rm -rf all temporary folders generated by Archiva.
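A more conservative variant of that cleanup could remove only temporary folders that have not been touched for a while, rather than all of them. The sketch below is an illustration, not what was actually run. It assumes the leftover folders sit directly under the cache root and are named `temp` followed by digits, as in the path from the stack trace (`/var/cache/archiva/temp2871579391051196831`). The function names and the age threshold are hypothetical.

```python
import re
import shutil
import time
from pathlib import Path

# Assumption: leftover working directories are named "temp<digits>".
TEMP_DIR_PATTERN = re.compile(r"^temp\d+$")


def find_stale_temp_dirs(cache_root, max_age_seconds, now=None):
    """Return temp directories under cache_root older than max_age_seconds."""
    now = time.time() if now is None else now
    stale = []
    for entry in Path(cache_root).iterdir():
        # Skip symlinks and anything that is not a real directory.
        if entry.is_symlink() or not entry.is_dir():
            continue
        if not TEMP_DIR_PATTERN.match(entry.name):
            continue
        if now - entry.stat().st_mtime > max_age_seconds:
            stale.append(entry)
    return sorted(stale)


def remove_stale_temp_dirs(cache_root, max_age_seconds, dry_run=True):
    """Delete stale temp directories; with dry_run=True only report them."""
    stale = find_stale_temp_dirs(cache_root, max_age_seconds)
    if not dry_run:
        for path in stale:
            shutil.rmtree(path)
    return stale
```

Calling `remove_stale_temp_dirs("/var/cache/archiva", 24 * 3600)` would only list candidates older than a day; passing `dry_run=False` deletes them. The dry-run default is deliberate, since a running download may still be using a recent folder.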

In this task we should:

  • Figure out the root cause of this outage.
  • Fix the root cause, or implement an alert to reduce the likelihood of a similar outage.
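As an illustration of the alerting option, a check on the free space of the filesystem that holds the Archiva cache could look roughly like the sketch below. The function names and the 10% threshold are assumptions made for the example. How such a check would be wired into our actual monitoring is not covered here.

```python
import shutil


def free_space_fraction(path):
    """Return the fraction of free space on the filesystem containing path."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        # Some pseudo-filesystems report a total of zero; treat as full.
        return 0.0
    return usage.free / usage.total


def check_disk_space(path, min_free_fraction=0.10):
    """Return (ok, message); ok is False when free space is below the threshold."""
    free = free_space_fraction(path)
    ok = free >= min_free_fraction
    status = "OK" if ok else "CRITICAL"
    return ok, f"{status}: {free:.1%} free on filesystem containing {path}"
```

For example, `check_disk_space("/var/cache/archiva")` would return `(False, "CRITICAL: ...")` once less than 10% of the filesystem is free, which should fire well before the "No space left on device" error seen above.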

Event Timeline

LSobanski renamed this task from Root cause Archiva outage from 2024-09-24 to Root cause Archiva outage from 2023-09-24. (Sep 29 2023, 4:16 PM)
BTullis triaged this task as High priority. (Oct 11 2023, 8:48 AM)
BTullis subscribed.

I think that we can resolve this task, given that we intend to deprecate Archiva. Feel free to reopen if you disagree.