
wikidata-dev instances causing git "Internal error during upload-pack" every 5 minutes
Closed, Resolved (Public)

Description

While browsing Gerrit logs on logstash, I found out two WMCS instances are doing git fetches against the primary Gerrit every 5 minutes to update some MediaWiki repository.

For some reason, each operation causes an error on the server side:

message: Internal error during upload-pack from /srv/gerrit/git/mediawiki/core.git
type: org.eclipse.jetty.io.EofException
thread: HTTP POST /r/mediawiki/core/git-upload-pack

The traffic comes from:

fedprops-euspecies.wikidata-dev.eqiad1.wikimedia.cloud. (172.16.2.3), created by @Addshore
wb-reconcile.wikidata-dev.eqiad1.wikimedia.cloud. (172.16.6.4), created by @Lucas_Werkmeister_WMDE

The Gerrit server side trace indicates the socket got terminated before all data got written by the server:

Server side stacktrace
org.eclipse.jetty.io.EofException
	at org.eclipse.jetty.io.ChannelEndPoint.flush(ChannelEndPoint.java:279)
	at org.eclipse.jetty.io.WriteFlusher.flush(WriteFlusher.java:422)
	at org.eclipse.jetty.io.WriteFlusher.write(WriteFlusher.java:277)
	at org.eclipse.jetty.io.AbstractEndPoint.write(AbstractEndPoint.java:381)
	at org.eclipse.jetty.server.HttpConnection$SendCallback.process(HttpConnection.java:804)
	at org.eclipse.jetty.util.IteratingCallback.processing(IteratingCallback.java:241)
	at org.eclipse.jetty.util.IteratingCallback.iterate(IteratingCallback.java:223)
	at org.eclipse.jetty.server.HttpConnection.send(HttpConnection.java:528)
	at org.eclipse.jetty.server.HttpChannel.sendResponse(HttpChannel.java:915)
	at org.eclipse.jetty.server.HttpChannel.write(HttpChannel.java:987)
...
Caused by: java.io.IOException: Connection reset by peer
	at java.base/sun.nio.ch.FileDispatcherImpl.writev0(Native Method)
	at java.base/sun.nio.ch.SocketDispatcher.writev(SocketDispatcher.java:51)
...

These might not be the only systems or users triggering the issue, but those two instances stand out since they update several repositories every five minutes.

My aim for this task is to get rid of the server side error. How? Well, I don't know what the cause is.

Things that might help diagnose the issue:

  • which commands are used to update the repository (most probably git)
  • get the git version being used (git --version)
    • we might want to try a newer git version from -backports
  • check whether git protocol v2 is turned on (git config --get protocol.version)
    • Can be changed in /etc/gitconfig
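The checks listed above could be gathered in one go on each instance. A minimal sketch, assuming a POSIX shell; the function name git_client_report is made up for illustration, and it only runs the two git commands already mentioned:

```shell
#!/bin/sh
# git_client_report: hypothetical helper, prints the client-side facts
# asked for in this task. It only reads state and changes nothing.
git_client_report() {
    # Installed git version, e.g. "git version 2.20.1".
    printf 'git: %s\n' "$(git --version 2>/dev/null || echo 'not found')"

    # Effective protocol.version. git config exits non-zero when the key
    # is not set anywhere, in which case git's built-in default applies.
    printf 'protocol.version: %s\n' \
        "$(git config --get protocol.version 2>/dev/null || echo 'unset')"
}

git_client_report
```

Running it as the user that owns the cron job matters, since a per-user ~/.gitconfig can override /etc/gitconfig.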

We might consider fetching from gerrit-replica.wikimedia.org instead of the primary server. But that is not essential since the queries hit the in-memory cache and I don't think they cause any performance trouble on the server.
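If switching to the replica turns out to be worthwhile, the origin URL could be rewritten roughly as below. This is a sketch only: it assumes the checkouts use https://gerrit.wikimedia.org/ URLs and that the replica serves repositories under the same paths, neither of which has been verified here. to_replica_url is a hypothetical helper.

```shell
#!/bin/sh
# to_replica_url: hypothetical helper, swaps the primary Gerrit host for
# the replica host and leaves any other URL untouched.
# Assumption: the replica exposes the same /r/<project> paths.
to_replica_url() {
    printf '%s\n' "$1" |
        sed 's#^https://gerrit\.wikimedia\.org/#https://gerrit-replica.wikimedia.org/#'
}

# Possible usage inside a checkout (not executed here):
#   old=$(git remote get-url origin)
#   git remote set-url origin "$(to_replica_url "$old")"
```

Submodules keep their own remote URLs, so each submodule checkout would need the same treatment.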

Acceptance Criteria: 🏕️🌟(August 2021)

  • The git-updater does not cause Gerrit server side stack traces

Event Timeline

The wb-reconcile instance uses our git-updater ansible role to keep a MediaWiki checkout up to date. For reasons we haven’t figured out yet (T286292), it regularly fills up the hard drive (the .git/objects store appears to grow unbounded for some reason), so I’m thrilled to hear it’s causing server-side issues as well.

  • command: git
  • version: git version 2.20.1
  • check whether git protocol v2 is turned on: yes
lucaswerkmeister-wmde@wb-reconcile:~$ grep -A2 protocol /etc/gitconfig
# git::systemconfig for 'protocol_v2'
[protocol]
version = 2
lucaswerkmeister-wmde@wb-reconcile:~$ git config protocol.version
2

I believe fedprops-euspecies is set up similarly, but I’m not sure.

Honestly, at this point, we should probably just turn off the automatic updater at least for the reconcile instance. (The Federated Properties team would have to decide if they still need their automatic updates.) It causes a variety of issues and we don’t update the software (WikibaseReconcileEdit extension) very frequently anymore anyways.

Mentioned in SAL (#wikimedia-cloud) [2021-07-27T12:47:49Z] <Lucas_WMDE> wb-reconcile Edited the mediawiki user’s crontab to disable all automatic updates (T287459, T286292)

Thanks @Lucas_Werkmeister_WMDE. As far as I can tell, that is the proper git, and v2 is usually nice. The update script looks really straightforward:

cd ${GIT_PATH}
git pull origin master 2>> "$ERROR_LOG" | tee -a ${GIT_LOG}
git submodule update 2>> "$ERROR_LOG" | tee -a ${GIT_LOG}

Which looks, well, straightforward. Note that today the error only shows up for:

mediawiki/core
mediawiki/skins/Vector
mediawiki/extensions/UniversalLanguageSelector
mediawiki/extensions/Wikibase

Maybe those repositories can be recreated from scratch? It might be an issue with the local .git directory. T286292 indicates that the repo keeps filling up the disk, which might well be a bug in git :-\
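For anyone trying this, a rough sketch of how the local object store could be inspected and a checkout recreated. Both function names are hypothetical, and the clone URL in the usage comment is an assumption derived from the request path in the log; the real one should be read from the existing remote first.

```shell
#!/bin/sh
# repo_stats: hypothetical helper, prints object-store statistics for one
# checkout (loose object count, pack count, pack size). Read-only.
repo_stats() {
    git -C "$1" count-objects -v
}

# recreate_checkout: hypothetical helper, moves the old checkout aside
# and clones afresh. The old directory is kept so nothing is lost.
recreate_checkout() {
    rc_repo=$1
    rc_url=$2
    rc_backup="$rc_repo.old.$(date +%s)"
    mv "$rc_repo" "$rc_backup" || return 1
    if ! git clone --quiet "$rc_url" "$rc_repo"; then
        echo "clone failed, previous checkout is at $rc_backup" >&2
        return 1
    fi
    echo "fresh clone in $rc_repo, previous checkout kept at $rc_backup"
}

# Possible usage (not executed here; the URL is an assumption):
#   repo_stats "$GIT_PATH"
#   recreate_checkout "$GIT_PATH" https://gerrit.wikimedia.org/r/mediawiki/core
```

A plain clone does not initialise submodules, so the extensions and skins would still need a git submodule update --init afterwards.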

Note: the error does not threaten the Gerrit server. It just slightly spams the error log.

Addshore raised the priority of this task from Low to Medium. Aug 4 2021, 9:56 AM

The wb-reconcile instance is already gone. The fedprops-euspecies instance still exists, but the main configured web proxy in Horizon, https://eu-invasive-species-federated-properties.wmflabs.org/, yields 502 Bad Gateway. The query service proxy https://eu-invasive-species-query.wmflabs.org/ works, but the underlying Blazegraph has apparently not been updated in a bit over a year (query).

Given that this never threatened the Gerrit server in the first place, I think we might just end up closing this task. (But maybe we should check if anyone still needs that euspecies instance, and properly delete it if not.)

hashar claimed this task.

It was still an issue last time I checked, but I have been unable to reproduce it. I agree with Lucas: given it is harmless, there is no point in spending more time on this. Thank you!