
[Regression] All query pages (maintenance reports) updates are broken since 1.21wmf3
Closed, ResolvedPublic

Description

29782 (given by User:Malyacko on Village pump (technical))

Hi there!

I often use the Special:SpecialPages links to quickly access places I watch for maintenance (mostly Special:BrokenRedirects). I can access such reports in other ways, but it seems the currently active reports linked on the SpecialPages page aren't updating, and haven't since November 4.

The live pages seem to be working fine, but the ones which require caching aren't updating.

No big deal, but thought somebody should know.

Thanks for all you folks' fine work.

Scott Douglas
Harvard, IL


Version: 1.21.x
Severity: critical
See Also:
https://bugzilla.wikimedia.org/show_bug.cgi?id=41919

Details

Reference
bz42152

Event Timeline

bzimport raised the priority of this task to Unbreak Now! (Nov 22 2014, 1:08 AM)
bzimport set Reference to bz42152.

romaine.wiki wrote:

We also have this same problem since November 1 on the Dutch Wikipedia (nl.wikipedia.org). Please keep on updating these pages!

Romaine

It's obviously halted everywhere...

(In reply to comment #2)

It's obviously halted everywhere...

And does it correspond to the dates of 1.21wmf3 deployment everywhere?

I know what's causing the breakage, but I'm not sure why.

Currently running manually in a screen session on hume

Probably take a little while to run.

Reedy: Is bug 41919 the same issue (dup)?

*** Bug 41919 has been marked as a duplicate of this bug. ***

Clarifying summary: it's all query pages and it's for all wikis since the moment they switched to 1.21wmf3.

TBH, I couldn't care less which bug stays open.

*** This bug has been marked as a duplicate of bug 41918 ***

I think this is caused by bug 42210, but I'm not certain.

Relatedly, my script run is up to frwiki currently...

Manual run has been completed across all wikis now.

Lowering priority/severity

This shouldn't be closed at least until the automatic script runs once with complete success.

(In reply to comment #12)

This shouldn't be closed at least until the automatic script runs once with complete success.

Indeed. Which is why I just changed the Importance of the bug as there had at least been a clusterwide run recently

(In reply to comment #13)

(In reply to comment #12)

This shouldn't be closed at least until the automatic script runs once with complete success.

Indeed. Which is why I just changed the Importance of the bug as there had at least been a clusterwide run recently

But the run was "manual". Is it sustainable to run it that way?

Is it better to kick the can down the road or keep the older bug report open until "completely" solved?

I submitted a hack/config variable addition to prevent the broken item from being executed and breaking the other updates.

Aaron has also committed what should be a fix to the hanging query, but it's in need of testing and review. Deployment can then happen.

Obviously this is preferred.

Running it manually can be done easily enough with 1 shell command.
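For readers following the workaround discussed above, here is a minimal sketch of the general idea: a configurable skip list that keeps one broken update from blocking the rest of a batch run. It is an illustration only, not the code in g33846 or the actual maintenance script. The function name `run_updates`, the `disabled` parameter, and the report name `HypotheticalBrokenReport` are all assumptions made for this example. Note that catching exceptions does not help with a query that hangs rather than fails, which is why the skip list is the part that matters here.

```python
from typing import Callable, Dict, Iterable, List, Tuple


def run_updates(
    updaters: Dict[str, Callable[[], None]],
    disabled: Iterable[str] = (),
) -> Tuple[List[str], List[str], List[str]]:
    """Run each updater in order and return (updated, skipped, failed) names.

    `disabled` plays the role of a hypothetical config variable listing
    updates that must not be executed at all.
    """
    disabled_set = set(disabled)
    updated: List[str] = []
    skipped: List[str] = []
    failed: List[str] = []
    for name, update in updaters.items():
        if name in disabled_set:
            # Never start a disabled update, so it cannot hang or crash the run.
            skipped.append(name)
            continue
        try:
            update()
        except Exception:
            # Record the failure and carry on with the remaining updates.
            failed.append(name)
        else:
            updated.append(name)
    return updated, skipped, failed


def broken() -> None:
    # Stand-in for the broken update; the real one hung instead of raising.
    raise RuntimeError("stand-in for the broken update")


updaters = {
    "BrokenRedirects": lambda: None,
    "HypotheticalBrokenReport": broken,
    "DoubleRedirects": lambda: None,
}

print(run_updates(updaters, disabled=["HypotheticalBrokenReport"]))
```

With the broken entry on the skip list, the other two reports still update; without it, the sketch records the failure and continues, though a real hang would need the skip list or a timeout.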

Great. I'm glad that the manual run is easy, but good luck to Aaron with the testing and review of the more durable fix.

Removing as a deployment blocker, since we're not holding up 1.21wmf5 over this one. Sam: a couple questions: 1. which commit from Aaron are we waiting on? 2. Should this one be assigned to Aaron?

Aaron's proposed fix is g33856

My "workaround" (read, disabling the broken code) was g33846

Can the FlaggedRevs update be disabled so that the automated run will succeed and continue running for all other pages? (Or was that already done? Not sure whether Sam considered it or did it.)

The automatic runs were every few days or so. It has now been more than a week since the manual replacement run. Depending on a person's memory/schedule is not ideal from a user PoV.

(In reply to comment #20)

Can the FlaggedRevs update be disabled so that the automated run will succeed and continue running for all other pages? (Or was that already done? Not sure whether Sam considered it or did it.)

Is this confirmed to be a Flagged Revisions issue only? I've only checked a handful of wikis (not finding anything disproving it).

(In reply to comment #20)

Can the FlaggedRevs update be disabled so that the automated run will succeed and continue running for all other pages? (Or was that already done? Not sure whether Sam considered it or did it.)

*bump*

Adding bug 38865 as blocker. This is an unacceptable regression. Yes, it was introduced in the previous deployment. But it seems the only way to get things done here is to block the next deployment.

People use these pages to work off of every day. Uninstall FlaggedRevs globally for all I care, but please restore the updates.

Revision to disable that module has been pushed to the cluster

Aaron's fix is still to be reviewed and, as such, deployed

It isn't fixed until it is deployed.

(In reply to comment #24)

Revision to disable that module has been pushed to the cluster

Removing blocking 38865 as the module has been temporarily disabled and the situation does not block WMF-deployment itself anymore.

Aaron's fix is still to be reviewed and, as such, deployed

https://gerrit.wikimedia.org/r/#/c/33856/ is now merged.
RESOLVED FIXED means that a fix has been merged in the codebase, not that it has been deployed on servers. Restoring previous state.

It has now been more than a week since the last update of the special pages on en.wikt. Has the fix been deployed or is it time for another manual run?

(In reply to comment #27)

It has now been more than a week since the last update of the special pages on en.wikt. Has the fix been deployed or is it time for another manual run?

As it currently stands, it's now even more broken than it was before: the job isn't even being partially run, because the cron jobs were not puppetised and hume was re-installed.

Then it would seem inappropriate to call this "RESOLVED FIXED". Or is a new bug report the better course of action?

I don't know why this was closed, only the FR one should be.

So we're running into the same issue again...

Aaron / Reedy: Any idea who could investigate the underlying reason?

I don't know why this was closed, only the FR one should be.

Sorry - my fault, see comment 26.

sumanah wrote:

Our users depend on these reports and if they're very out-of-date and broken then we should make it a priority to fix this issue. I'm bumping it up to Immediate and I'll bring it up in a managers' meeting today.

So, here's the situation as I understand it talking to Aaron. Aaron discovered that there were several cron jobs that were supposed to run on hume that weren't running, so he pointed this out to Peter Youngmeister, who deployed this:
https://gerrit.wikimedia.org/r/#/c/37946/

...on December 10. So, this *should* be fixed now. If it's not, reopen and ping the on-duty ops person.

Looking at this page (Special:ValidationStatistics):
http://de.wikipedia.org/wiki/Spezial:Sichtungsstatistik

...it appears this has updated.

It seems to be finally fixed, for real. Thanks.