
CI unstable/mass erroring (2019-02-17)
Closed, Resolved, Public

Description

At the moment, CI/Zuul is failing on most patches with a variety of errors such as:

  • Content-Length mismatch, received $received bytes out of the expected $expected
  • SlowTimer [60000ms] at curl (this one stalls CI until the tests are cancelled, so patches are left untested).

Recent examples:

Setting to Unbreak Now! (UBN) as this blocks development.

Event Timeline

MarcoAurelio triaged this task as Unbreak Now! priority. (Feb 17 2019, 6:58 PM)
MarcoAurelio created this task.
greg lowered the priority of this task from Unbreak Now! to High. (Feb 17 2019, 7:00 PM)
greg subscribed.
18:51:10          +greg-g | hauskatze: there is still on-going maintenance with the WMCS VPS project (afaik) so there may not be anything we can do right now
                          | especially on a sunday.
18:54:26          +greg-g | all of the failures I'm seeing right now (phan/seccheck) are of this type:
18:54:27          +greg-g | every Wednesday at 16:00 UTC (we always keep the meeting at 17:00 MEZ)
18:54:31          +greg-g | grr, mispaste
18:54:40          +greg-g | 18:21:08   [Composer\Downloader\TransportException]
18:54:40          +greg-g | 18:21:08   Content-Length mismatch, received 215489 bytes out of the expected 330567
18:54:59          +greg-g | that was from https://integration.wikimedia.org/ci/job/release-quibble-vendor-mysql-php70-docker/364/console, which isn't phan
                          | related
18:55:02          +greg-g | it's composer
18:57:44          +greg-g | castor-save is having issues, which was one of the instances that was corrupted by the wmcs hardware outage

Setting to High to set expectations that this may not be resolved before normal business hours. There are a multitude of confounding circumstances involved (the WMCS hardware outage and the recovery thereof, including castor, etc.).
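
For anyone reading the Composer error in the log above: the message means the downloader received fewer bytes than the server announced in its Content-Length header, i.e. the response was cut short. A minimal sketch of that kind of check, in Python rather than PHP. This is not Composer's actual code, and the function name check_content_length is hypothetical:

```python
def check_content_length(headers, body):
    """Raise if the body is shorter than the announced Content-Length.

    `headers` is a plain dict of response headers and `body` is the
    bytes actually received. Both names are illustrative assumptions.
    """
    announced = headers.get("Content-Length")
    if announced is None:
        # No announced length, so there is nothing to compare against.
        return
    expected = int(announced)
    received = len(body)
    if received < expected:
        # Same wording as the error quoted in the CI console output.
        raise IOError(
            "Content-Length mismatch, received %d bytes out of the expected %d"
            % (received, expected)
        )
```

With the numbers from the console output (215489 received, 330567 expected) the check fails, which is consistent with a transfer that was interrupted or a truncated cached file rather than a problem in the job itself.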

hashar subscribed.

Recent examples:

https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/Wikibase/+/491057/
https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/AbuseFilter/+/491059/

I have rechecked both and they passed. I also looked at various jobs that use composer, and they no longer appear to be affected.

  • Content-Length mismatch, received $received bytes out of the expected $expected
  • SlowTimer [60000ms] at curl (this one stalls CI until the tests are cancelled, so patches are left untested).

The SlowTimer error would suggest an issue with packagist.org. Someone might have cleaned the CI cache, which might have helped. I can't confirm which one was the root cause: I am on vacation today and can't really do the full forensics. But it seems to be fixed now :-)
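
If this recurs, one way to tell a slow upstream apart from a bad cache is to time a plain fetch against the upstream from a CI host. A rough sketch, assuming Python 3 is available there; the function name timed_fetch and the 60 second default (chosen to mirror the 60000ms in the SlowTimer message) are assumptions, not anything CI currently runs:

```python
import time
import urllib.request

# Assumption: mirrors the 60000ms threshold in the SlowTimer message.
SLOW_THRESHOLD_S = 60.0


def timed_fetch(url, timeout=SLOW_THRESHOLD_S, opener=urllib.request.urlopen):
    """Fetch `url` and return (ok, size_in_bytes, elapsed_seconds).

    `opener` is injectable so the function can be exercised without
    network access; by default it is urllib.request.urlopen.
    """
    start = time.monotonic()
    try:
        with opener(url, timeout=timeout) as response:
            size = len(response.read())
        ok = True
    except OSError:
        # URLError and socket timeouts are both OSError subclasses.
        size = 0
        ok = False
    elapsed = time.monotonic() - start
    return ok, size, elapsed
```

Pointing it at a packagist.org URL (which path to use is left open here) and getting a failure or an elapsed time near the timeout would point at the upstream. A fast, complete response would point back at the CI cache.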