java.io.NotSerializableException: groovy.lang.IntRange in Jenkins CI
Closed, Resolved, Public

Description

Tests pass, but CI fails during (or directly after) the teardown stage.

Failure was triggered by this patch:

https://gerrit.wikimedia.org/r/c/mediawiki/services/function-evaluator/+/708116

The error cannot be reproduced locally. However, when running the Blubber test variant locally, the container does hang for a while after the tests complete.

Event Timeline

dduvall triaged this task as Medium priority.
dduvall moved this task from Backlog to CI on the Release Pipeline board.
dduvall subscribed.

This is likely a Groovy-CPS related error stemming from Jenkins, specifically our integration/pipelinelib shared library. I'll have a look.

More info on this: it looks like the error is specifically triggered when running a Python subprocess that imports code. The PYTHONPATH isn't the problem, as far as I can tell; the tests all still pass, after all. Here is a minimal patch that reproduces the error (look specifically at executor.py):

https://gerrit.wikimedia.org/r/c/mediawiki/services/function-evaluator/+/708326

@cmassaro I don't think the error has anything to do with your project's code. I believe it's related to some implementation within our pipelinelib shared library.

Currently I'm suspecting something within the parallel execution handler of our PipelineBuilder—you have a custom execution graph which requires parallel stage execution—and some edge case that exposes Jenkins issue JENKINS-63074. I'll keep digging.

Change 708595 had a related patch set uploaded (by Dduvall; author: Dduvall):

[integration/pipelinelib@master] WIP: trying to reproduce CPS IntRange serialization bug JENKINS-63074

https://gerrit.wikimedia.org/r/708595

Reproducing this bug seems quite difficult. I've managed to write a system level test that can do so intermittently. It also seems the more independent execution branches I create, the more likely the bug is to surface. The relevant test pipeline config is:

pipelines:
  repro-T287507:
    blubberfile: blubber.yaml
    stages:
      - name: stage1
        build: bash
        run:
          arguments: ['sleep 5']

      - name: stage2
        build: bash
        run:
          arguments: ['sleep 5']

      - name: stage3
        build: bash
        run:
          arguments: ['exit 1']

      - name: stage4
        build: bash
        run:
          arguments: ['sleep 5']

    execution:
      - [stage1]
      - [stage2]
      - [stage3]
      - [stage4]

The failure in one of the branches seems to be a factor in the edge case that is surfacing the bug. It's also curious to see a serialization error for both RegexMatcher and IntRange, which makes me suspect that one of the threads is perhaps somewhere in ExecutionContext.NodeContext#interpolate (the only place I can think of where both a regex and an IntRange are in use). I'm going to see if I can factor out the IntRange there and see if that avoids the edge case.
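For context on why a live matcher would matter here: Jenkins Pipeline (CPS) persists the state of the running program, so objects that are in scope when state is saved need to be serializable. The following is a minimal plain-Java sketch, not pipelinelib code, showing that a `java.util.regex.Matcher` is rejected by Java serialization while the `Pattern` it came from is accepted. The class and method names (`SerializationProbe`, `canSerialize`) and the `${name}` pattern are hypothetical and used only for illustration. `groovy.lang.IntRange` is not on a plain Java classpath, so only the regex half is shown.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.NotSerializableException;
import java.io.ObjectOutputStream;
import java.io.UncheckedIOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of pipelinelib or Jenkins.
final class SerializationProbe {
    /** Returns true if Java serialization accepts the value, false if it is rejected. */
    public static boolean canSerialize(Object value) {
        try (ObjectOutputStream out = new ObjectOutputStream(new ByteArrayOutputStream())) {
            out.writeObject(value);
            return true;
        } catch (NotSerializableException e) {
            // Same exception type as the one reported in this task.
            return false;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Assumed variable syntax, purely as an example input.
        Pattern pattern = Pattern.compile("\\$\\{(\\w+)\\}");
        Matcher matcher = pattern.matcher("image: ${name}");

        System.out.println("Pattern serializable: " + canSerialize(pattern));
        System.out.println("Matcher serializable: " + canSerialize(matcher));
    }
}
```

If a matcher like this is still a live local when the pipeline's state is saved, a failure of this kind would be expected, which is consistent with the suspicion about the interpolation code.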

> Reproducing this bug seems quite difficult.

The test patch https://gerrit.wikimedia.org/r/c/mediawiki/services/function-evaluator/+/708326 reliably triggers it in current production CI/PipelineLib; not sure if that can help test your patch?

Thanks, @Jdforrester-WMF. I was able to get it more reliably repro'd with the following—turns out the failure had little to do with exposing the bug.

pipelines:
  repro-T287507:
    blubberfile: blubber.yaml
    stages:
      - name: stage1
        build: bash
        run:
          arguments: ['sleep 10']

      - name: stage2
        build: bash
        run:
          arguments: ['sleep 10']

      - name: stage3
        build: bash
        run:
          arguments: ['sleep 2']

      - name: stage4
        build: bash
        run:
          arguments: ['sleep 10']

    execution:
      - [stage1]
      - [stage2]
      - [stage3]
      - [stage4]

Now I'm flailing around with @NonCPS tags and gutting implementation to find where the problem arises in our lib—if anywhere. 🎉

[low priority] @dduvall Hey, any further work on this one? It's a tad irritating to have to recheck patches when they refuse to pass (we've no longer got a patch that fails 100% of the time).

Yes, sorry, I was close to finding the root cause of this, but I wasn't quite able to do that before taking more parental leave. However, I'm back now and I believe I might have a fix. I'll push it up for review today.

Change 708595 merged by jenkins-bot:

[integration/pipelinelib@master] Replace regex based variable parser with a lexer

https://gerrit.wikimedia.org/r/708595
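The change title describes replacing a regex-based variable parser with a lexer. The actual pipelinelib implementation is not shown in this thread, so the following is only a sketch of the general technique in plain Java. The `${name}` syntax, the `VariableLexer` class, and the `interpolate` signature are assumptions made for illustration. The point is that a character-by-character scan holds only strings, ints, and a `StringBuilder` as locals, with no `Matcher` or range object involved.

```java
import java.util.Map;

// Hypothetical sketch; names and variable syntax are assumptions, not pipelinelib's API.
final class VariableLexer {
    /** Replaces each ${name} in the template with its value from vars. */
    public static String interpolate(String template, Map<String, String> vars) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < template.length()) {
            if (template.startsWith("${", i)) {
                int close = template.indexOf('}', i + 2);
                if (close < 0) {
                    // Unterminated variable: emit the remainder literally and stop.
                    out.append(template, i, template.length());
                    break;
                }
                String name = template.substring(i + 2, close);
                if (!vars.containsKey(name)) {
                    throw new IllegalArgumentException("undefined variable: " + name);
                }
                out.append(vars.get(name));
                i = close + 1;
            } else {
                // Ordinary character: copy it through unchanged.
                out.append(template.charAt(i));
                i++;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, String> vars = Map.of("name", "bash", "tag", "latest");
        System.out.println(interpolate("image: ${name}:${tag}", vars)); // prints "image: bash:latest"
    }
}
```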

@Jdforrester-WMF the workaround should be deployed already (we pull the plugin from its master branch) if you want to verify.

Thank you! I owe you a strong drink. Unfortunately my reproduction case no longer applies to HEAD. I'm going to declare that it works, and if we do get this again I'll owe you a crate. :-)