Page MenuHomePhabricator

Special:LintErrors page had not been updated more than 90 mins after a page has been edited to fix errors
Closed, ResolvedPublic

Description

This edit fixed all the obsolete <tt> tag errors on the mw:Continuous_Integration/Zuul page but the LintErrors page continues to list it as of 90+ mins after that edit.

Not sure if this is simply a delay in when Parsoid is reparsing the page via ChangeProp or if this is a bug in the Linter extension code. IIRC, the Linter code removes stale entries from the database after a page is parsed via Parsoid.

Event Timeline

We have had some events being lost today in CP due to the update to a newer version of the Kafka native driver, but that shouldn't have affected this case, as the loss pertains only to issuing dependency updates. I have forced a purge on mw:Continuous_integration/Zuul, let's see if that helps.

<mobrovac> re T160573
<mobrovac> have you noticed that today?
<subbu> yes .. i am mostly checking if that page linked in the phab task went through page prop or not
<subbu> *change prop
<subbu> the linter ext was deployed today (@ 2017-03-15T18:25:16Z )
<subbu> and James_.F edited that page.
<subbu> so, trying ot figure out if this is a bug in linter extension code or a bug in changeprop / parsoid.
<mobrovac> uh bad timing
<mobrovac> we had some event loss today in CP due to the migration to the new kafka native driver lib
<subbu> i see ... i suppose i can issue a purge vent on the page to force a reparse.
<subbu> can you update the phab task with that info?
<mobrovac> but wait
<mobrovac> how is CP involved here?
<subbu> page is edited => CP should notice that => issue reparse req. to Parsoid => linter should remove those lint errors from its db
<subbu> as i said eralier .. i am trying to figure out what in that chain is broken.
<mobrovac> I just purged the CI page
<mobrovac> ah ok
<mobrovac> i see now, so the linter is triggered on a parsoid api req?
<subbu> parsoid emits lint errors on prase to the m/w api which goes to the linter extension.
<subbu> so, yes.
<mobrovac> oh oh i see
<mobrovac> then it's not CP-related
<mobrovac> that event should have been processed
<subbu> mobrovac, the purge fixed it .. https://www.mediawiki.org/wiki/Special:LintErrors/obsolete-tag is updated now.
<mobrovac> the event loss was only for dependencies
<mobrovac> ah hm ok
<subbu> so, only explanation is that the reparse didn't happen or the event got lost somewhere.
<subbu> curious what happened there .. if you want to dig into it.
<subbu> will update the ticket with this IRC transcript.
ssastry renamed this task from Special:LintErrors page isn't updated more than 90 mins after a page has been edited to fix errors to Special:LintErrors page had not been updated more than 90 mins after a page has been edited to fix errors.Mar 15 2017, 9:24 PM
ssastry reassigned this task from Legoktm to mobrovac.
ssastry triaged this task as Medium priority.
ssastry added a subscriber: Legoktm.

Just found another case...

https://www.mediawiki.org/w/index.php?diff=2427878&oldid=2382189&title=Extension%3AUploadLogo&type=revision

but:

mysql:wikiadmin@db1077 [mediawikiwiki]> select * from linter where linter_id=30130;
+-----------+-------------+------------+--------------+------------+-----------------------+
| linter_id | linter_page | linter_cat | linter_start | linter_end | linter_params         |
+-----------+-------------+------------+--------------+------------+-----------------------+
|     30130 |      636946 |          3 |         1568 |       1638 | {"items":["options"]} |
+-----------+-------------+------------+--------------+------------+-----------------------+
1 row in set (0.00 sec)

Question: Do we keep logs for pages/events that changeprop processed? I'll try to investigate this from the parsoid/linter/jobqueue side too.

The specific edit that should have fixed this was revid=2427878. If changeprop did process that event, I'm starting to suspect that it might be a race condition when parsoid contacts MW...:

		if ( !$title || !$title->getArticleID()
			|| $title->getLatestRevID() != $params['revision']
		) {
			$this->dieUsage( 'Invalid, non-existent, or outdated title', 'invalid-title' );
		}

Here's the event from Kafka that was processed:

{  
   "comment":"Lint errors",
   "database":"mediawikiwiki",
   "meta":{  
      "domain":"www.mediawiki.org",
      "dt":"2017-03-21T13:56:40+00:00",
      "id":"35735e8f-0e3e-11e7-9813-b083fecf0c72",
      "request_id":"54e949da-94aa-49ee-a487-f5110a9668f6",
      "schema_uri":"mediawiki/revision/create/1",
      "topic":"mediawiki.revision-create",
      "uri":"https://www.mediawiki.org/wiki/Extension:UploadLogo"
   },
   "page_id":636946,
   "page_is_redirect":false,
   "page_namespace":102,
   "page_title":"Extension:UploadLogo",
   "performer":{  
      "user_groups":[  
         "sysop",
         "translationadmin",
         "*",
         "user",
         "autoconfirmed"
      ],
      "user_id":472859,
      "user_is_bot":false,
      "user_text":"Shirayuki"
   },
   "rev_content_format":"wikitext",
   "rev_content_model":"wikitext",
   "rev_id":2427878,
   "rev_len":1866,
   "rev_minor_edit":false,
   "rev_parent_id":2382189,
   "rev_sha1":"qi27jrcf6nn0webnlfelkkmbeorasex",
   "rev_timestamp":"2017-03-21T13:56:40Z"
}

The event was not retried, so apparently it got processed successfully.

Change 344998 had a related patch set uploaded (by Legoktm):
[operations/mediawiki-config@master] Linter: whitelist parsoid canaries too

https://gerrit.wikimedia.org/r/344998

Change 344998 merged by jenkins-bot:
[operations/mediawiki-config@master] Linter: whitelist parsoid canaries too

https://gerrit.wikimedia.org/r/344998

Mentioned in SAL (#wikimedia-operations) [2017-03-27T19:56:12Z] <legoktm@tin> Synchronized wmf-config/InitialiseSettings.php: Linter: whitelist parsoid canaries too - https://gerrit.wikimedia.org/r/#/c/344998/ - T160573 (duration: 00m 44s)

After looking into the parsoid logs, I noticed that wtp2001 wasn't whitelisted, and it turns out that we were dropping events from all of the four canary servers, about 9% of events in total. I'm not closing this yet because I saw a few other errors in the logs that need investigating, but I think this should fix the largest source of issues.

No more reports of delayed updates have come in since, closing as resolved.