Page MenuHomePhabricator

Wikimedia\Rdbms\ChronologyProtector::initPositions: expected but failed to find position index.
Closed, ResolvedPublic

Description

Around 3000 warnings per hour since 2018-05-09T19:31:36. I don't know if this is a really bad breakage or it is just an informative text, I am going to assume the latter based on the level used. Probably some debugging introduced on the latest release?

{
  "_index": "logstash-2018.05.09",
  "_type": "mediawiki",
  "_id": "AWNGYzcQIqXYUKtvgxO6",
  "_version": 1,
  "_score": null,
  "_source": {
    "server": "www.mediawiki.org",
    "unique_id": "WvNMkwpAIDEAALd35wsAAAAN",
    "level": "WARNING",
    "ip": "10.64.0.105",
    "wiki": "mediawikiwiki",
    "channel": "DBReplication",
    "mwversion": "1.32.0-wmf.3",
    "message": "Wikimedia\\Rdbms\\ChronologyProtector::initPositions: expected but failed to find position index.",
    "type": "mediawiki",
    "normalized_message": "Wikimedia\\Rdbms\\ChronologyProtector::initPositions: expected but failed to find position index.",
    "url": "/wiki/Special:CentralAutoLogin/createSession?token=<token>&type=1x1&from=enwiki&proto=https&cpPosIndex=7",
    "tags": [
      "syslog",
      "es",
      "es"
    ],
    "reqId": "WvNMkwpAIDEAALd35wsAAAAN",
    "referrer": "https://en.wikipedia.org/",
    "@timestamp": "2018-05-09T19:31:36.000Z",
    "http_method": "GET",
    "cpPosIndex": 7,
    "@version": 1,
    "host": "mw1328",
    "waitTimeMs": 5000.3664493561,
    "shard": "s3",
    "indexReached": null
  },
  "fields": {
    "@timestamp": [
      1525894296000
    ]
  },
  "sort": [
    1525894296000
  ]
}

The errors seem constant, so it doesn't look like a one-time lagging issue. Only added Releng in case it is a train blocker, remove them otherwise.

Event Timeline

jcrespo created this task.May 10 2018, 3:17 PM
Restricted Application added a subscriber: Aklapper. · View Herald TranscriptMay 10 2018, 3:17 PM

The logging level went from INFO to WARNING. I suppose this has been happening for a very long time then.

Change 432541 had a related patch set uploaded (by Aaron Schulz; owner: Aaron Schulz):
[mediawiki/core@master] Give ChronologyProtector cookies a lower TTL

https://gerrit.wikimedia.org/r/432541

Imarlier moved this task from Inbox to Doing on the Performance-Team board.May 14 2018, 7:58 PM

Change 432541 merged by jenkins-bot:
[mediawiki/core@master] Avoid unnecessary WaitConditionLoop delays in ChronologyProtector

https://gerrit.wikimedia.org/r/432541

Gilles assigned this task to aaron.May 21 2018, 8:26 PM
aaron added a comment.May 24 2018, 6:31 PM

I suspect that some RESTBase service forwards a user's cookies (for permissions) but uses a local IP, judging from the logs. Since the CP position redis key is based on the client IP/agent hash, then it will not be found and will timeout. I don't know if the agent is passed through or not.

E.g.:

ip: 10.64.0.104
url: /w/api.php?inprop=protection%7Csubjectid&titles=Q1778821&rvlimit=1&format=json&action=query&rvprop=content%7Cids%7Cflags%7Ctimestamp%7Cuser%7Ccomment%7Csize&rawcontinue=&prop=revisions%7Cinfo

Change 435122 had a related patch set uploaded (by Aaron Schulz; owner: Aaron Schulz):
[mediawiki/core@master] rdbms: include client ID hash in ChronologyProtector cookies

https://gerrit.wikimedia.org/r/435122

Change 435122 merged by jenkins-bot:
[mediawiki/core@master] rdbms: include client ID hash in ChronologyProtector cookies

https://gerrit.wikimedia.org/r/435122

Change 439324 had a related patch set uploaded (by Aaron Schulz; owner: Aaron Schulz):
[mediawiki/core@master] Add more logging to ChronologyProtector::initPositions()

https://gerrit.wikimedia.org/r/439324

Change 439324 merged by jenkins-bot:
[mediawiki/core@master] Add more logging to ChronologyProtector::initPositions()

https://gerrit.wikimedia.org/r/439324

Change 439326 had a related patch set uploaded (by Aaron Schulz; owner: Aaron Schulz):
[mediawiki/core@wmf/1.32.0-wmf.7] Add more logging to ChronologyProtector::initPositions()

https://gerrit.wikimedia.org/r/439326

Change 439326 merged by jenkins-bot:
[mediawiki/core@wmf/1.32.0-wmf.7] Add more logging to ChronologyProtector::initPositions()

https://gerrit.wikimedia.org/r/439326

I am getting more messages about this- I guess this is expected based on the latest deploy. They don't seem to have any pattern, except a percentage of the total requests, e.g. enwiki and commons at the top (around 100 warning per minutes more or less).

Change 439778 had a related patch set uploaded (by Aaron Schulz; owner: Aaron Schulz):
[mediawiki/core@master] rdbms: make getCPInfoFromCookieValue() stricter about allowed values

https://gerrit.wikimedia.org/r/439778

Change 440030 had a related patch set uploaded (by Aaron Schulz; owner: Aaron Schulz):
[mediawiki/core@wmf/1.32.0-wmf.8] rdbms: make getCPInfoFromCookieValue() stricter about allowed values

https://gerrit.wikimedia.org/r/440030

Change 440031 had a related patch set uploaded (by Aaron Schulz; owner: Aaron Schulz):
[mediawiki/core@wmf/1.32.0-wmf.7] rdbms: make getCPInfoFromCookieValue() stricter about allowed values

https://gerrit.wikimedia.org/r/440031

Change 439778 merged by jenkins-bot:
[mediawiki/core@master] rdbms: make getCPInfoFromCookieValue() stricter about allowed values

https://gerrit.wikimedia.org/r/439778

Change 440030 merged by jenkins-bot:
[mediawiki/core@wmf/1.32.0-wmf.8] rdbms: make getCPInfoFromCookieValue() stricter about allowed values

https://gerrit.wikimedia.org/r/440030

Change 440031 merged by jenkins-bot:
[mediawiki/core@wmf/1.32.0-wmf.7] rdbms: make getCPInfoFromCookieValue() stricter about allowed values

https://gerrit.wikimedia.org/r/440031

aaron added a comment.Jun 13 2018, 7:25 AM

There are now only ~10/min of these now. I still see no 'redis' channel errors, but I wonder if the random eviction model of redis is at play. redis 3.0 is a bit better at LRU per https://redis.io/topics/lru-cache than our 2.8.

Please excuse my ignorance, but you are talking redis for sessions, not for the jobqueue (which is, or is close to be, deprecated, right?).

aaron added a comment.Jun 13 2018, 8:53 AM

Please excuse my ignorance, but you are talking redis for sessions, not for the jobqueue (which is, or is close to be, deprecated, right?).

Right, the mc* server redis instances used by getMainObjectStash().

aaron added a comment.Jun 13 2018, 9:04 AM

That said, from mc1019, I see:

used_memory_human:292.78M
used_memory_rss:353058816
used_memory_peak:331823400
used_memory_peak_human:316.45M

...the limit is 500Mb and evicted_keys shows 0. That suggests no eviction is at play here.

If keys are not were they would be expected, then I'd assume that is consistent (re-)hashing, though again I see no errors.

I'm not sure what causes the remaining small trickle of timeouts.

Vvjjkkii renamed this task from Wikimedia\Rdbms\ChronologyProtector::initPositions: expected but failed to find position index. to 86caaaaaaa.Jul 1 2018, 1:10 AM
Vvjjkkii removed aaron as the assignee of this task.
Vvjjkkii triaged this task as High priority.
Vvjjkkii updated the task description. (Show Details)
Vvjjkkii removed subscribers: gerritbot, Aklapper.
Marostegui renamed this task from 86caaaaaaa to Wikimedia\Rdbms\ChronologyProtector::initPositions: expected but failed to find position index.Jul 1 2018, 6:39 PM
Marostegui assigned this task to aaron.
Marostegui lowered the priority of this task from High to Medium.
Marostegui updated the task description. (Show Details)
CommunityTechBot renamed this task from Wikimedia\Rdbms\ChronologyProtector::initPositions: expected but failed to find position index to Wikimedia\Rdbms\ChronologyProtector::initPositions: expected but failed to find position index..Jul 5 2018, 6:40 PM
CommunityTechBot raised the priority of this task from Medium to Needs Triage.
CommunityTechBot updated the task description. (Show Details)
Krinkle triaged this task as High priority.Jul 11 2018, 2:13 AM
Krinkle moved this task from Untriaged to Rdbms library on the Wikimedia-Rdbms board.
Krinkle added a subscriber: Krinkle.

Moving to "Production impact". This is still happening around 900-1000 times a day, and the impact is the backend response to the user being blocked for 5-6 seconds.

Looking at a few recent in Logstash under mediawiki-errors for "ChronologyProtector", this seems to include regular page views from modern browsers (e.g. latest Chrome). So that should rule out bugs in client handling of cookies (e.g. bots or apps).

@Krinkle may I ask to update the description with your sensible analysis? It still has my ignorant comments about the issue, and that may mislead readers.

aaron added a comment.Jul 11 2018, 9:07 AM

Given how low server_failure_limit is, it might help to lower server_retry_timeout from 30s to something < 5s. Consistent hash ejections seem like the most obvious thing that could cause an acknowledged write to be seen as not being there for any of the next 5 seconds.

Change 445110 had a related patch set uploaded (by Aaron Schulz; owner: Aaron Schulz):
[mediawiki/core@master] rdbms: fix value of ChronologyProtector::POSITION_COOKIE_TTL

https://gerrit.wikimedia.org/r/445110

Change 445110 had a related patch set uploaded (by Aaron Schulz; owner: Aaron Schulz):
[mediawiki/core@master] rdbms: fix value of ChronologyProtector::POSITION_COOKIE_TTL
https://gerrit.wikimedia.org/r/445110

Not sure how that was missed before.

Change 445112 had a related patch set uploaded (by Aaron Schulz; owner: Aaron Schulz):
[mediawiki/core@wmf/1.32.0-wmf.10] rdbms: fix value of ChronologyProtector::POSITION_COOKIE_TTL

https://gerrit.wikimedia.org/r/445112

Change 445113 had a related patch set uploaded (by Aaron Schulz; owner: Aaron Schulz):
[mediawiki/core@wmf/1.32.0-wmf.12] rdbms: fix value of ChronologyProtector::POSITION_COOKIE_TTL

https://gerrit.wikimedia.org/r/445113

Change 445112 merged by jenkins-bot:
[mediawiki/core@wmf/1.32.0-wmf.10] rdbms: fix value of ChronologyProtector::POSITION_COOKIE_TTL

https://gerrit.wikimedia.org/r/445112

Change 445113 merged by jenkins-bot:
[mediawiki/core@wmf/1.32.0-wmf.12] rdbms: fix value of ChronologyProtector::POSITION_COOKIE_TTL

https://gerrit.wikimedia.org/r/445113

aaron closed this task as Resolved.Jul 11 2018, 12:43 PM

Change 445110 merged by jenkins-bot:
[mediawiki/core@master] rdbms: fix value of ChronologyProtector::POSITION_COOKIE_TTL

https://gerrit.wikimedia.org/r/445110

Since Jul 11 this is happening very rarely (from 1100 per day to ~8 per day).

mmodell changed the subtype of this task from "Task" to "Production Error".Aug 28 2019, 11:09 PM