Page MenuHomePhabricator

Popups build failing consistently for a week
Closed, ResolvedPublic

Description

https://integration.wikimedia.org/ci/job/selenium-daily-beta-Popups/

chrome.Dwelling_on_a_valid_page_link.I_should_see_a_page_preview is failing consistently since 24th.

Popups build is usually stable so this is likely due to a performance regression - the feature it still working but the test timing is off.

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald TranscriptJan 30 2019, 5:04 PM
Jdlrobson triaged this task as High priority.Jan 30 2019, 5:07 PM
Jdlrobson updated the task description. (Show Details)
This comment was removed by tstarling.
Jdlrobson updated the task description. (Show Details)

Looks like there was an edit that day it broke which broke the summary.

Made an edit and waiting for cache to update
https://en.wikipedia.beta.wmflabs.org/w/index.php?title=Main_Page&type=revision&diff=388612&oldid=377057&diffmode=source

Looks like https://en.wikipedia.beta.wmflabs.org/api/rest_v1/page/summary/Main_Page is returning an empty string for a summary after an edit (it wasnt before).

I've made some subsequent edits but it looks like the response is in cache and might needs flushing... anyone in RI able to check that?

Change 487540 had a related patch set uploaded (by Jdlrobson; owner: Jdlrobson):
[mediawiki/extensions/Popups@master] Adapt Popups browser tests to recent breaking change

https://gerrit.wikimedia.org/r/487540

Change 487540 merged by jenkins-bot:
[mediawiki/extensions/Popups@master] Adapt Popups browser tests to recent breaking change

https://gerrit.wikimedia.org/r/487540

Change 487567 had a related patch set uploaded (by Jdlrobson; owner: Jdlrobson):
[mediawiki/extensions/Popups@master] QA: Test page in Selenium needs a lead section

https://gerrit.wikimedia.org/r/487567

Change 487567 merged by jenkins-bot:
[mediawiki/extensions/Popups@master] QA: Test page in Selenium needs a lead section

https://gerrit.wikimedia.org/r/487567

That change was merged on January 16 of 2018, not 2019. It doesn't seem likely that it's the culprit here.

I don't know why your test was passing despite Popups not handling the mainpage type, but FWIW, that type isn't new, either: https://gerrit.wikimedia.org/r/#/c/mediawiki/services/mobileapps/+/402835/

Jdlrobson added a comment.EditedFeb 5 2019, 4:32 PM

@Mholloway it was merged then yes ... but when was it deployed? Summaries are also cached so could easily be a week or more before manifesting on the beta cluster browser tests.

This is definitely new I assure you and this patch was definitely the cause. There is no way Popups frontend could handle the mainpage type and the main page links have been hovering up to this point.

Edit:

and this patch was definitely the cause

seems like an edit may have been the actual cause and this has actually been broken for a year.

The main page was edited the day the browser tests failed. Before that the last edit was in March 2018 and before that April 2017. I suspect one of those has been cached ever since?

I recall testing this exact workflow in December last year and main page link was working fine on beta cluster.

What is the caching policy for pages that haven't been edited?

Jdlrobson claimed this task.Feb 5 2019, 6:38 PM
Mholloway added a comment.EditedFeb 5 2019, 6:49 PM

Yes, after chatting with Petr in -services I believe this was a case of very old content stored in RESTBase being served until the page was edited. RESTBase doesn't enforce any time to expire on its own.

The mainpage summary type has been in MCS for a long time (since https://gerrit.wikimedia.org/r/#/c/mediawiki/services/mobileapps/+/402835/ was deployed on Jan 8, 2018)[1]. Unfortunately I didn't document the rationale for it in T184430 (it was probably decided in a video chat) and can't find it anywhere else. If it's useless to clients, maybe it's best just to take it out.

[1] https://github.com/wikimedia/mediawiki-services-mobileapps-deploy/commit/1bfd4b071d0b6c7a91289cca850a78ee3b3549ea

My best recollection/guess is that it was intended to give clients a hint as to why extracts were being returned empty.

Jdlrobson closed this task as Resolved.Feb 5 2019, 6:58 PM

Yes, after chatting with Petr in -services I believe this was a case of very old content stored in RESTBase being served until the page was edited. RESTBase doesn't enforce any time to expire on its own.

Is there any follow ups we need to do here? e.g. make sure those are flushed?

If it's useless to clients, maybe it's best just to take it out.

The tests are passing (and now the test is more appropriate) so I think we can resolve this task. Deciding what to do with the main page case can be followed up with in T215080. I suspect it's too small an edge case for us to worry too much. Will report back if we need any changes upstream.

Yes, after chatting with Petr in -services I believe this was a case of very old content stored in RESTBase being served until the page was edited. RESTBase doesn't enforce any time to expire on its own.

Is there any follow ups we need to do here? e.g. make sure those are flushed?

From an architectural point of view I believe RESTBase is behaving as intended here (i.e., by serving a stored response until the underlying content changes), though considering the infrequency of editing, maybe we should disable RESTBase storage in the Beta Cluster (unless Core Platform wants it for their own purposes). Something to chat about in the biweekly sync meeting.

If it's useless to clients, maybe it's best just to take it out.

The tests are passing (and now the test is more appropriate) so I think we can resolve this task. Deciding what to do with the main page case can be followed up with in T215080. I suspect it's too small an edge case for us to worry too much. Will report back if we need any changes upstream.

👍

(unless Core Platform wants it for their own purposes). Something to chat about in the biweekly sync meeting.

How would we test things in beta if we disable storage? :)

As a workaround - can you bypass Varnish in beta cluster? Are the tests running inside our labs network? If yes, you can also add cache-control: no-cache to your request to bypass the storage.

It might make sense to purge pages after a month regardless. If the endpoint logic has changed in the summary endpoint wouldn't the summary be outdated?

If the endpoint logic has changed in the summary endpoint wouldn't the summary be outdated?

This is true and that's why summary uses semantic versioning, however, we only enforce re-rendering of the content in case a major version changes.