wikidata-dev instances causing git "Internal error during upload-pack" every 5 minutes
Closed, Resolved · Public

Assigned To

Authored By

	hashar
	Jul 27 2021, 11:29 AM

Description

While browsing Gerrit logs on logstash, I found that two WMCS instances run git fetches against the primary Gerrit every 5 minutes to update MediaWiki repositories.

For some reason, each operation causes an error on the server side:

message	Internal error during upload-pack from /srv/gerrit/git/mediawiki/core.git
type	org.eclipse.jetty.io.EofException
thread	HTTP POST /r/mediawiki/core/git-upload-pack

The traffic comes from:

fedprops-euspecies.wikidata-dev.eqiad1.wikimedia.cloud.	172.16.2.3	Created by @Addshore
wb-reconcile.wikidata-dev.eqiad1.wikimedia.cloud.	172.16.6.4	Created by @Lucas_Werkmeister_WMDE

The Gerrit server side trace indicates the socket got terminated before all data got written by the server:

Server side stacktrace

org.eclipse.jetty.io.EofException
	at org.eclipse.jetty.io.ChannelEndPoint.flush(ChannelEndPoint.java:279)
	at org.eclipse.jetty.io.WriteFlusher.flush(WriteFlusher.java:422)
	at org.eclipse.jetty.io.WriteFlusher.write(WriteFlusher.java:277)
	at org.eclipse.jetty.io.AbstractEndPoint.write(AbstractEndPoint.java:381)
	at org.eclipse.jetty.server.HttpConnection$SendCallback.process(HttpConnection.java:804)
	at org.eclipse.jetty.util.IteratingCallback.processing(IteratingCallback.java:241)
	at org.eclipse.jetty.util.IteratingCallback.iterate(IteratingCallback.java:223)
	at org.eclipse.jetty.server.HttpConnection.send(HttpConnection.java:528)
	at org.eclipse.jetty.server.HttpChannel.sendResponse(HttpChannel.java:915)
	at org.eclipse.jetty.server.HttpChannel.write(HttpChannel.java:987)
...
Caused by: java.io.IOException: Connection reset by peer
	at java.base/sun.nio.ch.FileDispatcherImpl.writev0(Native Method)
	at java.base/sun.nio.ch.SocketDispatcher.writev(SocketDispatcher.java:51)
...

These might not be the only systems or users triggering the issue, but those two instances stand out since they update several repositories every five minutes.

My aim for this task is to get rid of the server-side error. How? Well, I don't know what causes it.

Things that might help diagnose the issue:

which commands are used to update the repository (most probably git)
get the git version being used (git --version)
- we might want to try a newer git version from -backports
check whether git protocol v2 is turned on (git config --get protocol.version)
- Can be changed in /etc/gitconfig
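
The checks listed above could be gathered in one go with something like the sketch below. The function name report_git_client is made up for illustration and is not part of the git-updater role; the two git invocations are the ones named in the list.

```shell
#!/bin/sh
# Hypothetical helper: print the git client details asked for above.
report_git_client() {
    # Client version, for example "git version 2.20.1"
    git --version
    # git config --get exits non-zero and prints nothing when the key is unset
    proto=$(git config --get protocol.version) || proto="unset"
    echo "protocol.version: ${proto}"
}

# Usage: run as the user that owns the cron job, e.g.
# report_git_client
```

Running a single fetch by hand with GIT_CURL_VERBOSE=1 set in the environment might additionally show at which point the HTTP connection is closed, though that is untested here.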

We might consider fetching from gerrit-replica.wikimedia.org instead of the primary server. But that is not essential, since the queries hit the in-memory cache and I don't think they cause any performance trouble on the server.
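
If the replica route is tried, repointing an existing checkout could look roughly like this. It is a sketch under assumptions: the checkouts are assumed to use anonymous https URLs on gerrit.wikimedia.org and a remote named origin, neither of which is confirmed in this task, and to_replica_url is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: rewrite only the host part of an https Gerrit URL.
# URLs that do not point at gerrit.wikimedia.org are returned unchanged.
to_replica_url() {
    printf '%s\n' "$1" | sed 's#//gerrit\.wikimedia\.org/#//gerrit-replica.wikimedia.org/#'
}

# Usage inside a checkout (assumes the remote is called "origin"):
# git remote set-url origin "$(to_replica_url "$(git remote get-url origin)")"
```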

Acceptance Criteria: 🏕️🌟(August 2021)

The git-updater does not cause Gerrit server side stack traces

Related Objects

Mentioned In: T329452: 500 server error when pulling Pywikibot i18n
T286292: The ansible-role setup fills up the disk
Mentioned Here: T286292: The ansible-role setup fills up the disk

Event Timeline

hashar created this task. Jul 27 2021, 11:29 AM

Restricted Application added a subscriber: Aklapper. Jul 27 2021, 11:29 AM

The wb-reconcile instance uses our git-updater ansible role to keep a MediaWiki checkout up to date. For reasons we haven’t figured out yet (T286292), it regularly fills up the hard drive (the .git/objects store appears to grow unbounded for some reason), so I’m thrilled to hear it’s causing server-side issues as well.

command: git
version: git version 2.20.1
check whether git protocol v2 is turned on: yes

lucaswerkmeister-wmde@wb-reconcile:~$ grep -A2 protocol /etc/gitconfig
# git::systemconfig for 'protocol_v2'
[protocol]
version = 2
lucaswerkmeister-wmde@wb-reconcile:~$ git config protocol.version
2

I believe fedprops-euspecies is set up similarly, but I’m not sure.

Honestly, at this point, we should probably just turn off the automatic updater at least for the reconcile instance. (The Federated Properties team would have to decide if they still need their automatic updates.) It causes a variety of issues and we don’t update the software (WikibaseReconcileEdit extension) very frequently anymore anyways.

Mentioned in SAL (#wikimedia-cloud) [2021-07-27T12:47:49Z] <Lucas_WMDE> wb-reconcile Edited the mediawiki user’s crontab to disable all automatic updates (T287459, T286292)

Addshore added a project: [DEPRECATED] wdwb-tech. Jul 28 2021, 8:42 AM

Addshore moved this task from Inbox to To Prioritize on the [DEPRECATED] wdwb-tech board. Jul 28 2021, 9:17 AM

Thanks @Lucas_Werkmeister_WMDE. As far as I can tell that is the proper git, and v2 is usually nice. The update script looks really straightforward:

cd ${GIT_PATH}
git pull origin master 2>> "$ERROR_LOG" | tee -a ${GIT_LOG}
git submodule update 2>> "$ERROR_LOG" | tee -a ${GIT_LOG}
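
One detail that may matter for diagnosis: in a plain POSIX shell pipeline the exit status is that of the last command (tee here), so a failing git pull would go unnoticed unless the script enables pipefail, which the excerpt does not show. A minimal sketch of a wrapper that keeps git's own exit status follows; run_logged and its argument order are hypothetical and not part of the git-updater role. Unlike the tee version, it does not echo output to the terminal.

```shell
#!/bin/sh
# Hypothetical helper: run_logged LOG ERRLOG COMMAND [ARGS...]
# Appends stdout to LOG and stderr to ERRLOG, then returns the command's status.
run_logged() {
    log=$1
    errlog=$2
    shift 2
    status=0
    # No pipe involved, so the status captured here is the command's own.
    "$@" >>"$log" 2>>"$errlog" || status=$?
    if [ "$status" -ne 0 ]; then
        printf '%s: "%s" exited with status %s\n' \
            "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$*" "$status" >>"$errlog"
    fi
    return "$status"
}

# Usage with the variables from the script above:
# cd "${GIT_PATH}" && run_logged "${GIT_LOG}" "${ERROR_LOG}" git pull origin master
```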

Which looks, well, straightforward. Note that today the error only shows up for:

mediawiki/core

mediawiki/skins/Vector

mediawiki/extensions/UniversalLanguageSelector

mediawiki/extensions/Wikibase

Maybe those repositories can be recreated from scratch? It might be an issue with the local .git directory. T286292 indicates that the repo keeps filling up, which might well be a bug in git :-\
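
Before recreating a checkout, it might be worth recording how large its object store has become, so the growth reported in T286292 can be compared before and after. A rough sketch, where objects_kib is a hypothetical helper and GIT_PATH is the variable from the update script:

```shell
#!/bin/sh
# Hypothetical helper: size of a checkout's .git/objects directory in KiB.
objects_kib() {
    du -sk "$1/.git/objects" | awk '{print $1}'
}

# Usage:
# objects_kib "${GIT_PATH}"
# git -C "${GIT_PATH}" count-objects -v   # loose vs. packed object counts
```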

Note: the error does not threaten the Gerrit server. It just slightly spams the error log.

Addshore moved this task from To Prioritize to Triaged Medium (50+) on the [DEPRECATED] wdwb-tech board. Aug 3 2021, 1:33 PM

Addshore moved this task from Incoming to Prioritized Wikidata Tech Backlog (prioritised from top to bottom) on the Wikidata-Campsite board. Aug 3 2021, 1:47 PM

Addshore updated the task description. Aug 4 2021, 8:30 AM

Addshore raised the priority of this task from Low to Medium. Aug 4 2021, 9:56 AM

Is this one still an issue?

The wb-reconcile instance is already gone. The fedprops-euspecies instance still exists, but the main configured web proxy in Horizon, https://eu-invasive-species-federated-properties.wmflabs.org/, yields 502 Bad Gateway. The query service proxy https://eu-invasive-species-query.wmflabs.org/ works, but the underlying Blazegraph has apparently not been updated in a bit over a year (query).

Given that this never threatened the Gerrit server in the first place, I think we might just end up closing this task. (But maybe we should check if anyone still needs that euspecies instance, and properly delete it if not.)

It was still an issue last time I checked but I have been unable to reproduce it. I agree with Lucas, given it is harmless, there is no point in spending more time on this. Thank you!

Dzahn mentioned this in T329452: 500 server error when pulling Pywikibot i18n. Feb 14 2023, 7:19 PM
