Page MenuHomePhabricator

[Investigate] some revisions frequently return TaskRevokedError
Closed, ResolvedPublic

Description

Lately, some revisions are frequently returning an error instead of the expected ORES features and scores.

Here's the one that is in my test suite, which has intermittently been failing for the last week or so: https://ores.wikimedia.org/v2/scores/enwiki/wp10/641962088?features

{
  "scores": {
    "enwiki": {
      "wp10": {
        "scores": {
          "641962088": {
            "error": {
              "message": "revoked",
              "type": "TaskRevokedError"
            }
          }
        },
        "version": "0.5.0"
      }
    }
  }
}

Currently, it seems to work fine when the features param is not included: https://ores.wikimedia.org/v2/scores/enwiki/wp10/641962088

{
  "scores": {
    "enwiki": {
      "wp10": {
        "scores": {
          "641962088": {
            "prediction": "Stub",
            "probability": {
              "B": 0.05442654153740996,
              "C": 0.0410184196903844,
              "FA": 0.0030839593054811758,
              "GA": 0.00289370787403999,
              "Start": 0.11865341961498213,
              "Stub": 0.7799239519777024
            }
          }
        },
        "version": "0.5.0"
      }
    }
  }
}

Event Timeline

I think that we probably have the error sitting in celery's backend. See https://stackoverflow.com/questions/41501362/celery-how-to-remove-results-of-tasks-in-redis-after-read

Here's how the task ID gets formatted: https://github.com/wiki-ai/ores/blob/master/ores/scoring_context.py#L44

I think we should expect that the above example will have the task_id: "enwiki:wp10:0.5.0:641962088"

Docs for celery backend settings suggest that we can configure the duration at which results are cached. The default is supposed to be 1 day, but this has stuck around for more than one day.

Halfak triaged this task as High priority.Jul 3 2017, 4:00 PM

I'm poking at the Redis db and found some interesting things...

oresrdb.svc.eqiad.wmnet:6379> ttl "celery-task-meta-enwiki:wp10:0.5.0:641962088"                  
(integer) 81089         

oresrdb.svc.eqiad.wmnet:6379> get "celery-task-meta-enwiki:wp10:0.5.0:641962088"                  
"\x80\x02}q\x00(X\b\x00\x00\x00childrenq\x01]q\x02X\x06\x00\x00\x00resultq\x03}q\x04(X\x05\x00\x00\x00scoreq\x05}q\x06(X\n\x00\x00\x00predictionq\acnumpy.core.multiarray\nscalar\nq\bcnumpy\ndtype\nq\tX\x02\x00\x00\x00U4q\nK\x00K\x01\x87q\x0bRq\x0c(K\x03X\x01\x00\x00\x00<q\rNNNK\x10K\x04K\btq\x0ebc_codecs\nencode\nq\x0fX\x10\x00\x00\x00S\x00\x00\x00t\x00\x00\x00u\x00\x00\x00b\x00\x00\x00q\x10X\x06\x00\x00\x00latin1q\x11\x86q\x12Rq\x13\x86q\x14Rq\x15X\x0b\x00\x00\x00probabilityq\x16}q\x17(h\bh\tX\x02\x00\x00\x00U2q\x18K\x00K\x01\x87q\x19Rq\x1a(K\x03h\rNNNK\bK\x04K\btq\x1bbh\x0fX\b\x00\x00\x00F\x00\x00\x00A\x00\x00\x00q\x1ch\x11\x86q\x1dRq\x1e\x86q\x1fRq G?iC\x88\x0b~\xdd\x98h\bh\tX\x02\x00\x00\x00U1q!K\x00K\x01\x87q\"Rq#(K\x03h\rNNNK\x04K\x04K\btq$bh\x0fX\x04\x00\x00\x00C\x00\x00\x00q%h\x11\x86q&Rq'\x86q(Rq)G?\xa5\x00]\xc65%\xbah\bh\tX\x02\x00\x00\x00U4q*K\x00K\x01\x87q+Rq,(K\x03h\rNNNK\x10K\x04K\btq-bh\x0fX\x10\x00\x00\x00S\x00\x00\x00t\x00\x00\x00u\x00\x00\x00b\x00\x00\x00q.h\x11\x86q/Rq0\x86q1Rq2G?\xe8\xf5#\x13c\x8f\xd2h\bh\tX\x02\x00\x00\x00U2q3K\x00K\x01\x87q4Rq5(K\x03h\rNNNK\bK\x04K\btq6bh\x0fX\b\x00\x00\x00G\x00\x00\x00A\x00\x00\x00q7h\x11\x86q8Rq9\x86q:Rq;G?g\xb4\x8b\x95\xdc\x9b#h\bh\tX\x02\x00\x00\x00U1q<K\x00K\x01\x87q=Rq>(K\x03h\rNNNK\x04K\x04K\btq?bh\x0fX\x04\x00\x00\x00B\x00\x00\x00q@h\x11\x86qARqB\x86qCRqDG?\xab\xdd\xcb\xaf\xe0\b\nh\bh\tX\x02\x00\x00\x00U5qEK\x00K\x01\x87qFRqG(K\x03h\rNNNK\x14K\x04K\btqHbh\x0fX\x14\x00\x00\x00S\x00\x00\x00t\x00\x00\x00a\x00\x00\x00r\x00\x00\x00t\x00\x00\x00qIh\x11\x86qJRqK\x86qLRqMG?\xbe`\x12\x0c\xce\x0e\xcauuX\b\x00\x00\x00featuresqN}qO(X(\x00\x00\x00feature.wikitext.revision.external_linksqPK\x04X%\x00\x00\x00feature.enwiki.main_article_templatesqQK\x00X'\x00\x00\x00feature.wikitext.revision.content_charsqRMM\x02X#\x00\x00\x00feature.enwiki.revision.image_linksqSK\x00X#\x00\x00\x00feature.wikitext.revision.templatesqTK\x00X&\x00\x00\x00feature.enwiki.revision.category_linksqUK\x00X.\x00\x00\x00feature.wikitext.revision.headings_by_level(2)qVK\x00X-\x00\x00\x00feature.english.stemmed.revision.stems_lengthqWK]X#\x00\x00\x00feature.wikitext.revision.wikilinksqXK\x00X)\x00\x00\x00feature.enwiki.revision.infobox_templatesqYK\x00X.\x00\x00\x00feature.wikitext.revision.headings_by_level(3)qZK\x00X&\x00\x00\x00feature.enwiki.revision.cite_templatesq[K\x00X\"\x00\x00\x00feature.wikitext.revision.ref_tagsq\\K\x00X\x1f\x00\x00\x00feature.wikitext.revision.charsq]MM\x02X$\x00\x00\x00feature.enwiki.revision.cn_templatesq^K\x00uuX\t\x00\x00\x00tracebackq_NX\x06\x00\x00\x00statusq`X\a\x00\x00\x00SUCCESSqau."

The revocation is only 2 hours old (assuming TTL is set to 24h on creation), and the task meta seems to contain a stack trace. My working guess is that this revision is erroring out, and will continue to error out again once the TaskRevoked expires.

Here's the decoded data, performed using pickle.loads(b"LITERAL..."). There was no traceback.

{
    "children": [],
    "result": {
        "features": {
            "feature.english.stemmed.revision.stems_length": 93,
            "feature.enwiki.main_article_templates": 0,
            "feature.enwiki.revision.category_links": 0,
            "feature.enwiki.revision.cite_templates": 0,
            "feature.enwiki.revision.cn_templates": 0,
            "feature.enwiki.revision.image_links": 0,
            "feature.enwiki.revision.infobox_templates": 0,
            "feature.wikitext.revision.chars": 589,
            "feature.wikitext.revision.content_chars": 589,
            "feature.wikitext.revision.external_links": 4,
            "feature.wikitext.revision.headings_by_level(2)": 0,
            "feature.wikitext.revision.headings_by_level(3)": 0,
            "feature.wikitext.revision.ref_tags": 0,
            "feature.wikitext.revision.templates": 0,
            "feature.wikitext.revision.wikilinks": 0
        },
        "score": {
            "prediction": "Stub",
            "probability": {
                "B": 0.05442654153740996,
                "C": 0.0410184196903844,
                "FA": 0.0030839593054811758,
                "GA": 0.00289370787403999,
                "Start": 0.11865341961498213,
                "Stub": 0.7799239519777024
            }
        }
    },
    "status": "SUCCESS",
    "traceback": null
}

Celery supports event logging, which might help us monitor and debug. It doesn't look like we've enabled this?

Reading celery-3.1.23/worker/job.py, these are the conditions that cause a task to be in the "revoked" state:

  • Task "expires" timestamp elapses.
  • Job received a signal.

One thing on my mind is, how we construct the task IDs and whether the "options" are included. I couldn't find anything in the celery cache or score cache for the ?features request, hopefully it's not being stored as the opaque UUID?

I'm either grossly misunderstanding something here, or (unlikely) have accidentally found the bug. The "message": "revoked" property is surprising, this value is the result of a string cast str(e), and since TaskRevokedError is an empty subclass of Exception, it should have the default behavior of returning ex.message as its stringified value. At the beginning of the chain, in Celery we create the error with TaskRevokedError(uuid), where uuid is the task's ID.

Is it possible that the task ID is being set to "revoked" at some point? Or that we've damaged the error while bubbling it to the surface?

I just stumbled upon steps to reproduce locally:

  • Use scoring_system: local_celery in config
  • ores applications.wsgi
  • Request a score via the web interface and let it time out.
  • Stop ores
  • ores applications.celery
  • ores applications.wsgi
  • Request the same score as above

et voila! TaskRevokedError.

I'm seeing a pretty mundane failure, the key celery-task-meta-testwiki:revid:0.0.0:641962090 contains (pickled):

{'traceback': None, 'children': [], 'result': TaskRevokedError('revoked',), 'status': 'REVOKED'}

TTL on the key is 24 hours.

Note that the real-world error contained good data under the task-meta key.

The example bad record in production has cleared itself. I recommend we close this task until it happens again...

@Ragesoss Have you seen any of these errors in the last 24h or so?

@awight Yes, ~12 hours ago.

I marked the spec as 'pending' so it doesn't break the build, but that test failed in one of the travis-ci jobs the last time I pushed to master.

Ragesoss mentioned that the failure was intermittent, which makes it sound like the bad data was only cached at one data center. In my earlier tests, I only investigated eqiad.

For next time, log into both scb1001 and scb2001, and:

curl 'http://0.0.0.0:8081/v2/scores/enwiki/wp10/641962088?features'

Halfak renamed this task from On ORES, some revisions frequently return TaskRevokedError to [Investigate] some revisions frequently return TaskRevokedError .Jul 10 2017, 4:53 PM
Halfak closed this task as Resolved.
Halfak moved this task from Review to Completed on the Machine-Learning-Team (Active Tasks) board.