
inconsistent revision id and html content
Closed, Resolved (Public)

Description

Author: anthonyzhang

Description:
the html crawled from wikipedia page http://en.wikipedia.org/wiki/Netherlands

When we crawled the Wikipedia page http://en.wikipedia.org/wiki/Netherlands, the returned HTML had the following content:

<div class="printfooter"> Retrieved from "<a href="http://en.wikipedia.org/w/index.php?title=Netherlands&amp;oldid=543458973">

So this should be the HTML for revision 543458973. But it also contains the text "Netherland people are also homosexual.", which comes from the previous revision, 543458897. The revision ID and the content do not match, which is a serious inconsistency.
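A minimal sketch of how a crawler could flag this kind of mismatch in a single snapshot. It assumes the revision ID can be read from the "Retrieved from" link in the printfooter div, as in the snippet above; the function names are made up for illustration, and `removed_text` stands for any string known to exist only in the previous revision.

```python
import re
from typing import Optional

# Matches the oldid in the "Retrieved from" permalink that follows the
# printfooter div (assumption: the footer layout shown in this report).
_FOOTER_OLDID = re.compile(r'class="printfooter".*?oldid=(\d+)', re.DOTALL)


def footer_revision_id(html: str) -> Optional[int]:
    """Return the revision ID the page footer claims, or None if absent."""
    match = _FOOTER_OLDID.search(html)
    return int(match.group(1)) if match else None


def is_inconsistent(html: str, expected_revid: int, removed_text: str) -> bool:
    """True when the footer claims expected_revid but text that the
    revision removed is still present in the HTML."""
    return footer_revision_id(html) == expected_revid and removed_text in html
```

This only detects staleness when the caller already knows a string that the newer revision removed, so it is a check for reproducing the report rather than a general validator.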

The live page has since been fixed, so I have attached a snapshot of the response we received. Please take a look.

Thanks!


Version: wmf-deployment
Severity: normal

Attached:

Details

Reference
bz46014

Event Timeline

bzimport raised the priority of this task to High. Nov 22 2014, 1:38 AM
bzimport set Reference to bz46014.
bzimport added a subscriber: Unknown Object (MLST).

How did you crawl that page? Please provide steps to reproduce.

[Not a bug report about Bugzilla itself, hence moving]

anthonyzhang wrote:

We sent a standard HTTP request for the URL "http://en.wikipedia.org/wiki/Netherlands" to Wikipedia's servers, received the HTTP response, and stored it in the attached file.

I can't reproduce the same HTTP response now; the Netherlands page has since been fully updated. Is this kind of inconsistency possible? Is the HTTP response I got unexpected?

(In reply to comment #2)

We sent a standard HTTP request for the URL
"http://en.wikipedia.org/wiki/Netherlands" to Wikipedia's servers, received the
HTTP response, and stored it in the attached file.

With which tool? With which command? This is a bit vague so far.

anthonyzhang wrote:

I used a Google-internal library written in C++. It generates requests using HTTP/1.1. It is similar to the command "wget http://en.wikipedia.org/wiki/Netherlands".

What's the difference between different tools? Why does it matter?

When MediaWiki generates the HTML content of a page, it should use a matching revision ID and wikitext, right?
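For anyone trying to reproduce this, here is a rough stand-in for the crawl, not the internal library itself. It is a sketch that sends an anonymous GET (no cookies) and records the fetch time and all response headers, which is the information asked for above. The User-Agent string and function names are placeholders.

```python
import time
import urllib.request

# Placeholder identification string; replace with real contact details.
USER_AGENT = "stale-cache-repro/0.1 (hypothetical; contact: ops@example.org)"


def build_request(url: str) -> urllib.request.Request:
    """Build an anonymous GET request: no cookies are attached, so the
    response may be served from a front-end cache."""
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})


def fetch_snapshot(url: str) -> dict:
    """Fetch url and keep everything needed to compare against edit times."""
    fetched_at = time.time()
    with urllib.request.urlopen(build_request(url), timeout=30) as resp:
        return {
            "url": url,
            "fetched_at_unix": fetched_at,
            "status": resp.status,
            # A list of pairs keeps repeated headers such as X-Cache.
            "headers": list(resp.headers.items()),
            "body": resp.read(),
        }
```

Calling `fetch_snapshot("http://en.wikipedia.org/wiki/Netherlands")` performs the actual network request; storing the whole dictionary makes it possible to line up the fetch time with the page history later.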

anthonyzhang wrote:

gentle ping.

(In reply to comment #2)

I can't reproduce the same HTTP response now; the Netherlands page has
since been fully updated.

It would be great to know the exact time at which you ran your query, to compare it with the timestamps of the changes on the wiki page. It might have been either "bad timing" or a caching server that had not been updated yet (in this case mw1039).

(In reply to comment #4)

What's the difference between different tools? Why does it matter?

It's helpful to know any parameters when somebody tries to reproduce your problem.
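One way to get the edit timestamps for that comparison is the MediaWiki web API. The sketch below only builds the query URL for the two revision IDs from the original report; the helper name is made up, and the choice of `rvprop` fields is just what seemed useful here.

```python
from typing import Sequence
from urllib.parse import urlencode


def revision_info_url(api_base: str, revids: Sequence[int]) -> str:
    """Build an api.php query URL that returns IDs, timestamps, users and
    edit summaries for the given revision IDs."""
    params = {
        "action": "query",
        "prop": "revisions",
        "revids": "|".join(str(r) for r in revids),
        "rvprop": "ids|timestamp|user|comment",
        "format": "json",
    }
    return api_base + "?" + urlencode(params)


url = revision_info_url(
    "http://en.wikipedia.org/w/api.php", [543458897, 543458973]
)
```

The timestamps in the JSON response can then be compared with the time the crawl was made.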

khullah wrote:

I have seen the same inconsistency with the article http://en.wikipedia.org/wiki/Curry%E2%80%93Howard_correspondence

Reading the article, it shows this text:

"In programming language theory and proof theory, the Curry–Howard correspondence (also known as the Curry–Howard isomorphism or equivalence, or the proofs-as-programs and propositions- or formulae-as-types interpretation) is the direct relationship between computer programs and mathematical proofs. It is a generalization of a syntactic analogy between systems of formal logic and computational calculi.

It is the link between Logic and Computation that is usually attributed to H. B. Curry and W. A. Howard, although the idea is related to the operational interpretation of intuitionistic logic given in various formulations by Brouwer, Heyting and Kolmogorov.

Origin, scope, and consequences (...)"

When I try to edit, the code shows (in visual edit OR source edit):

"In programming language theory and proof theory, the Curry–Howard correspondence (also known as the Curry–Howard isomorphism or equivalence, or the proofs-as-programs and propositions- or formulae-as-types interpretation) is the direct relationship between computer programs and mathematical proofs. It is a generalization of a syntactic analogy between systems of formal logic and computational calculi that was first discovered by the American mathematician Haskell Curry and logician William Alvin Howard.[citation needed]

Origin, scope, and consequences (...)"

Comparing the last two edits, it can be seen that the text shown is not from the latest edit, but from the previous one.

Revision as of 08:01, 18 August 2013 (oldid=569063806)
Latest revision as of 08:01, 18 August 2013 (oldid=569063819)

Hope this helps confirm the reported bug.

That is interesting. Adding some other people who might know why that is happening.

We keep seeing this issue. Here is one more, from Aug. 21:

We were fetching revid=569587966, which is a fix after vandalism:
http://en.wikipedia.org/w/index.php?title=Abdullah_of_Saudi_Arabia&oldid=569587966
But the content still contains the bad edit from revid=569587944
http://en.wikipedia.org/w/index.php?title=Abdullah_of_Saudi_Arabia&oldid=569587944

The HTML (rev=569587966) we got looks like: (details in attachment)

HTTP/1.0 200 OK
X-Content-Type-Options: nosniff
Content-Language: en
Last-Modified: Wed, 21 Aug 2013 15:53:30 GMT
Content-Encoding: gzip
Content-Length: 69500
Content-Type: text/html; charset=UTF-8
Date: Wed, 21 Aug 2013 15:54:05 GMT
Server: Apache
Cache-Control: private, s-maxage=0, max-age=0, must-revalidate
Vary: Accept-Encoding,Cookie
Age: 124
X-Cache: HIT from cp1017.eqiad.wmnet
X-Cache-Lookup: HIT from cp1017.eqiad.wmnet:3128
X-Cache: MISS from cp1010.eqiad.wmnet
X-Cache-Lookup: MISS from cp1010.eqiad.wmnet:80
Connection: keep-alive

<!DOCTYPE html>
<html lang="en" dir="ltr" class="client-nojs">

......

mw.config.set({"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":false,"wgNamespaceNumber":0,"wgPageName":"Abdullah_of_Saudi_Arabia","wgTitle":"Abdullah of Saudi Arabia","wgCurRevisionId":569587966,"wgArticleId":19186951,"wgIsArticle":true,"wgIsRedirect":false,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Pages using citations with accessdate and no URL","All articles with dead external links","Articles with dead external links from August 2013","Wikipedia indefinitely move-protected pages","Use dmy dates from August 2013",

......

<div id="mw-content-text" lang="en" dir="ltr" class="mw-content-ltr"><p><b>Hey King OF ISLAMIC World, Hey King OF SAUDI ARABIA. Please Dont Support to Terrorist Militaray of Egypt. They are killing Muslims out of law and out of Democracy. I know you know everything But You are not Muslim.Thats why You are support to kill Muslims. I humbly request to people of saudi arabia please show the straight way to the YOUR'S king.</b></p>
<table class="infobox vcard" style="font-size: 88%; text-align: left; width: 22em">

......

<noscript><img src="//en.wikipedia.org/w/index.php?title=Special:CentralAutoLogin/start&amp;type=1x1&amp;from=enwiki" alt="" title="" width="1" height="1" style="border: none; position: absolute;" /></noscript></div> <div class="printfooter">
Retrieved from "<a href="http://en.wikipedia.org/w/index.php?title=Abdullah_of_Saudi_Arabia&amp;oldid=569587966">http://en.wikipedia.org/w/index.php?title=Abdullah_of_Saudi_Arabia&amp;oldid=569587966</a>" </div>
<div id='catlinks' class='catlinks'><div id="mw-normal-catlinks" class="mw-no

......

<li id="coll-create_a_book"><a href="/w/index.php?title=Special:Book&amp;bookcmd=book_creator&amp;referer=Abdullah+of+Saudi+Arabia">Create a book</a> </li>
<li id="coll-download-as-rl"><a href="/w/index.php?title=Special:Book&amp;bookcmd=render_article&amp;arttitle=Abdullah+of+Saudi+Arabia&amp;oldid=569587966&amp;writer=rl">Download as PDF</a></li>
<li id="t-print"><a href="/w/index.php?title=Abdullah_of_Saudi_Arabia&amp;printable=yes" title="Printable version of this page [p]" accesskey="p">Printable version</a></li>
</ul>

......
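The Last-Modified, Date and Age headers above give a rough timeline for this cached copy. The sketch below assumes the usual HTTP caching reading, where Date is when the origin generated the response and Age is how many seconds it had been held by caches when it was served; the proxies involved may not follow that exactly, so treat the result as an estimate. The function and key names are invented for this example.

```python
from datetime import timedelta
from email.utils import parsedate_to_datetime
from typing import Dict


def cache_timeline(headers: Dict[str, str]) -> dict:
    """Estimate when a cached response was generated and when it was served."""
    last_modified = parsedate_to_datetime(headers["Last-Modified"])
    generated = parsedate_to_datetime(headers["Date"])
    age = timedelta(seconds=int(headers.get("Age", "0")))
    return {
        "last_modified": last_modified,
        "generated": generated,
        # Assumption: served time is roughly Date plus Age.
        "served_about": generated + age,
        "date_minus_last_modified_s": (generated - last_modified).total_seconds(),
    }


timeline = cache_timeline({
    "Last-Modified": "Wed, 21 Aug 2013 15:53:30 GMT",
    "Date": "Wed, 21 Aug 2013 15:54:05 GMT",
    "Age": "124",
})
```

Under that reading, the copy was generated 35 seconds after the Last-Modified time and served at about 15:56:09 GMT, which can be compared with the timestamps of the two revisions.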

stale cache of Abdullah of Saudi Arabia on Aug. 21

This attachment begins with the HTTP response headers, followed by the HTML we received.

Attached:

The revision was created at 1377593916 (Tue Aug 27 08:58:36 UTC 2013), and we fetched it at 1377596440 (Tue Aug 27 09:40:40 UTC 2013), ~40 minutes later.
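The two Unix timestamps can be checked with a few lines of Python. This is only arithmetic on the numbers quoted above, and the function name is arbitrary.

```python
from datetime import datetime, timezone


def describe_gap(created_unix: int, fetched_unix: int):
    """Convert two Unix timestamps to UTC and return them with their gap."""
    created = datetime.fromtimestamp(created_unix, tz=timezone.utc)
    fetched = datetime.fromtimestamp(fetched_unix, tz=timezone.utc)
    return created, fetched, fetched - created


created, fetched, gap = describe_gap(1377593916, 1377596440)
```

The gap is 2524 seconds, a little over 42 minutes, so the stale copy was not served in the first few seconds after the edit.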

But we still got a stale cached copy:

The HTML contains both content from the claimed revision "570370203"

"""
<div class="printfooter">

Retrieved from "<a href="http://en.wikipedia.org/w/index.php?title=Russell_Crowe&amp;oldid=570370203">http://en.wikipedia.org/w/index.php?title=Russell_Crowe&amp;oldid=570370203</a>" </div>

"""

and stale content from the previous revision "570370195"

"""
is a ball sack loving faggot who I actually enjoyed in
"""

Bryan/Brian, since you guys have been working on this for images, do you have thoughts about how the cache is purged when revisions are rolled back or deleted? Google is finding that the removed html is often still returned. Any ideas would help.

Re Zark Khullah in comment 7 - Are you logged in or logged out when this happens?

To clarify, is this only Google getting the old versions, or is it also logged-out users that have cleared all their cookies? Or do logged-in users sometimes get the old version?

Additionally, is this only for things that were reverted within roughly a minute of the edit being made (aka ClueBot_NG reverts)? There may be some sort of race condition with the reverts so close to the original edit.

[I checked cache purging in general, and it seems to be working on the pages I tested, so it's not a site-wide outage of cache purging]

Bawolff-- it's logged out users (ops asked them to crawl anonymously, so they hit the cache). I hadn't looked into how the reverts were made, whether by ClueBot or manually, but that could be an issue. All the examples have been very blatant stuff that ClueBot could have picked up.

We discussed this in the office a few days ago. The updated revision ID in the HTML means that the front-end cache was properly purged.

Some possible issues to check:

  • Bad parser cache validation.
  • PHP reading some information from MySQL master (the revision id used for the footer) while using lagged slaves for other info (the revision id used for parser cache validation / parsing).

Since this is relatively rare it could be a race condition, where an anon request happens shortly after an update. It might help to correlate edit timestamps with render timestamps in the bad HTML and slave lag at the time.
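To make the second hypothesis concrete, here is a toy model of the suspected race. It is not MediaWiki code: the `Db` class and both render functions are invented, and "replica" stands for a lagged slave. It shows how taking the footer revision ID from the master while loading text from a lagged replica produces exactly the mismatch reported, and how fixing the revision ID first avoids it.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class Db:
    """Toy stand-in for one database server's view of a page."""
    latest_rev: int
    text_by_rev: Dict[int, str]


def render_racy(master: Db, replica: Db) -> dict:
    # Footer ID comes from the master, body text from whatever the
    # replica currently believes is the latest revision.
    body = replica.text_by_rev[replica.latest_rev]
    return {"footer_rev": master.latest_rev, "body": body}


def render_consistent(master: Db, replica: Db) -> dict:
    # Pick one revision ID first, then load exactly that revision's text,
    # falling back to the master when the replica has not caught up.
    rev = master.latest_rev
    text = replica.text_by_rev.get(rev)
    if text is None:
        text = master.text_by_rev[rev]
    return {"footer_rev": rev, "body": text}


# Master already has the revert (rev 2); the replica still lags at rev 1.
master = Db(latest_rev=2, text_by_rev={1: "vandalised text", 2: "clean text"})
lagged = Db(latest_rev=1, text_by_rev={1: "vandalised text"})
```

With these values, `render_racy(master, lagged)` returns revision 2 in the footer together with the vandalised text, while `render_consistent(master, lagged)` returns revision 2 with the clean text.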

Hi,
Any update on this?
It looks like the bug always appears when ClueBot reverts vandalism (not every revert triggers it, though). We just noticed one more on http://en.wikipedia.org/w/index.php?title=Barbra_Streisand&oldid=571125240

The symptom is similar to the one I reported before.

greg added a comment. Sep 3 2013, 9:45 PM

Assigning to Aaron for now to do some deeper digging here.

TheDJ added a comment. Sep 6 2013, 9:39 AM

Again, ClueBot reverts. Coincidence? It seems like we might be missing some cache-clear signals from the API entry points.

Hi wiki-dev,

is there any update on this issue?

Thanks

Change 85917 had a related patch set uploaded by Aaron Schulz:
Reduce chance for parser cache race conditions

https://gerrit.wikimedia.org/r/85917

Change 85917 merged by jenkins-bot:
Reduce chance for parser cache race conditions

https://gerrit.wikimedia.org/r/85917

(In reply to comment #22)

Change 85917 merged by jenkins-bot:
Reduce chance for parser cache race conditions

Aaron: Is more work needed, or can this be closed as RESOLVED FIXED?

aaron added a comment. Oct 18 2013, 9:17 PM

Assuming this is fixed now after that change (aside from maybe a few older cached entries, which should all be gone within a month).

title=St._Louis_Cardinals, oldid=579097247

old version of St. Louis Cardinals, right before we hit the stale cache:
http://en.wikipedia.org/w/index.php?title=St._Louis_Cardinals&oldid=579097247

Attached:

stale cache of title=St._Louis_Cardinals, oldid=579097280

The new (i.e. stale) revision of the article that we got when crawling it:
http://en.wikipedia.org/w/index.php?title=St._Louis_Cardinals&oldid=579097280

Attached:

(I don't know how to reopen this bug)

We actually hit a stale cache again on Oct. 28. I've attached the stale version (rev=579097280).

Again, it seems the 2 revisions (579097247, 579097280) happened within a very short time window, and the HTML of the latter (579097280) was rendered using the old wikitext.

The strange thing is that part of the new HTML did change:
The old one looks like:
.. "wgRevisionId": 579097247..
<p>
…a gay butt sex team based…
</p>

<h2>Further reading</h2>
...

The new one (the stale one) looks like:
.. "wgRevisionId": 579097280..
<p>
…a gay butt sex team based…
</p>

<h2>Further reading</h2><span class="mw-editsection">...</span>
...

I am reopening this based on last comments.

Aaron: Could you take a look at this again?
Any specific / more information that could be gathered / provided?

aaron added a comment. Nov 19 2013, 5:28 PM

What instances were encountered after 11/18/13? I said there would be stale items for a month after I closed this (the maximum cache time for our proxies).

The one we encountered (I've uploaded 2 attachments) happened at 12:41, 28 October 2013.

Aaron: Sorry, didn't read comment 24 closely.

Closing this ticket again as there is no indication that there is still a problem. Please inform us if this problem still happens after November 18, 2013.

de.wikipedia.org/w/index.php?title=Albanien&oldid=125280145

a vandalized revision of Albanien on dewiki, at 2013-12-09 17:07:12

Attached:

de.wikipedia.org/w/index.php?title=Albanien&oldid=125280167

a fix for "Albanien" on dewiki, at 2013-12-09 17:07:58

Attached:

@Andre,

we hit another stale case on Dec. 9. I've uploaded 2 snapshots of the stale page. As you can see, the vandalized content "ist ein dummes Land mit Zigeunern" appears in both revisions.

Could you reopen the bug and assign it appropriately?

Thanks
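Regarding the earlier question of whether this only happens for reverts made within roughly a minute of the edit: the two dewiki timestamps given in the attachment descriptions can be compared directly. The helper names and the 60-second threshold below are just illustrative, and both timestamps are assumed to be in the same time zone.

```python
from datetime import datetime

_FMT = "%Y-%m-%d %H:%M:%S"


def seconds_between(earlier: str, later: str) -> float:
    """Seconds elapsed between two 'YYYY-MM-DD HH:MM:SS' timestamps."""
    delta = datetime.strptime(later, _FMT) - datetime.strptime(earlier, _FMT)
    return delta.total_seconds()


def is_quick_revert(edit_time: str, revert_time: str, threshold_s: float = 60.0) -> bool:
    """True when the revert followed the edit within threshold_s seconds."""
    return 0 <= seconds_between(edit_time, revert_time) <= threshold_s


gap = seconds_between("2013-12-09 17:07:12", "2013-12-09 17:07:58")
```

The fix came 46 seconds after the vandalized revision, so this case also fits the quick-revert pattern.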

(In reply to comment #32)

Created attachment 14054 [details]
de.wikipedia.org/w/index.php?title=Albanien&oldid=125280145

a vandalized revision of Albanien on dewiki, at 2013-12-09 17:07:12

I believe the original bug report was for having incorrect revision when viewing the current version, not when viewing oldids. Are you reporting that an old version showed the incorrect revision?

Attached:

I used "oldid" in the attachment names and descriptions just to make sure people can easily check those history revisions.

The stale content appeared when we fetched rev 125280167, and we simply crawled it via "http://de.wikipedia.org/wiki/Albanien". No "oldid" was used.

So I think it's still the same bug.

stale cache of title=Roslindale, oldid=583299083

This html was fetched 100 minutes after the creation of the revision via
http://en.wikipedia.org/Roslindale
(i.e. no oldid is used)

As you can see, the bad content ("COW ...") from the previous revision is still there.

So the content remains stale for 100 minutes.

Attached:

anthonyzhang wrote:

stale cache of title=Memorial_to_the_Murdered_Jews_of_Europe, oldid=599805302

The source code of this HTML has revision id 599805302, but it has the HTML text "Holyhoax" from the previous revision (id is 599805297). We crawled this HTML about 2 hours after revision 599805302 was made.

Attached:

Change 122847 had a related patch set uploaded by Anomie:
Include parsed revision ID in parser cache

https://gerrit.wikimedia.org/r/122847

Change 122847 merged by jenkins-bot:
Include parsed revision ID in parser cache

https://gerrit.wikimedia.org/r/122847

Anomie added a comment. Apr 1 2014, 7:28 PM

Please reopen if this bug occurs again after 1.23wmf21 is deployed, see https://www.mediawiki.org/wiki/MediaWiki_1.23/Roadmap for the schedule (short version: after April 10).
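For readers following along, here is a toy illustration of the idea suggested by the title of change 122847. It is a guess at the general approach, not the actual patch: each cache entry remembers which revision it was parsed from, and a lookup is treated as a miss when that does not match the revision being served. All class and method names are invented.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class CacheEntry:
    html: str
    parsed_rev_id: int  # revision whose wikitext produced this HTML


class ToyParserCache:
    """Minimal in-memory cache keyed by page title."""

    def __init__(self) -> None:
        self._entries: Dict[str, CacheEntry] = {}

    def save(self, page: str, html: str, parsed_rev_id: int) -> None:
        self._entries[page] = CacheEntry(html, parsed_rev_id)

    def get(self, page: str, current_rev_id: int) -> Optional[str]:
        """Return cached HTML only if it was parsed from current_rev_id."""
        entry = self._entries.get(page)
        if entry is None or entry.parsed_rev_id != current_rev_id:
            return None  # miss: caller must re-parse the current revision
        return entry.html


cache = ToyParserCache()
cache.save("Example", "<p>old text</p>", parsed_rev_id=1)
```

In this model, `cache.get("Example", 2)` returns None, so HTML parsed from revision 1 can never be served under revision 2's ID.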