MobileFrontend corrupts parser cache for regular page views
Closed, Resolved | Public

Description

Starting around 4 July, there have been many reports (approximately one every two days) on [[de:Wikipedia:Fragen_zur_Wikipedia]] (the dewiki village pump) about issues obviously caused by badly nested HTML. Purging fixes the issue, but it seems that tidy was not executed in these cases. Since this kind of issue now occurs definitely more frequently than before, it should be investigated why tidy fails to run so often.


Version: unspecified
Severity: major
See Also:
https://bugzilla.wikimedia.org/show_bug.cgi?id=58042

Details

Reference
bz38273
bzimport set Reference to bz38273.
Schnark created this task. Jul 10 2012, 8:12 AM
TheDJ added a comment. Jul 10 2012, 1:40 PM

If this happens, can people use the "View source" feature of their browser, look at the bottom of the source for "<!-- Served by mw## in 2.259 secs. -->", and note the mw## id before purging the page?

That will probably help in pinpointing the problem further.
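For anyone collecting these reports in bulk, the lookup can be scripted. The following is only a minimal sketch: the function name extractServedBy and the returned array shape are made up for illustration, and the pattern assumes the comment has exactly the form quoted above.

```php
<?php
// Hypothetical helper: pull the backend host name and the render time out of
// the "Served by" HTML comment found at the bottom of the page source.
function extractServedBy(string $html): ?array
{
    // \S+ matches host names such as mw53 or srv271.
    $pattern = '/<!--\s*Served by (\S+) in ([0-9.]+) secs\.\s*-->/';
    if (preg_match($pattern, $html, $m) !== 1) {
        return null; // the comment is not present in this page source
    }
    return ['host' => $m[1], 'seconds' => (float) $m[2]];
}

$source = "<html><body>...</body></html>\n<!-- Served by mw53 in 2.259 secs. -->";
$info = extractServedBy($source);
echo $info['host'], "\n"; // mw53
```

Saving the page source before purging and running it through a helper like this records which host produced the broken output.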

Reedy added a comment. Jul 10 2012, 1:41 PM

(In reply to comment #1)

If this happens, can people use the "View source" feature of their browser,
look at the bottom of the source for "<!-- Served by mw## in 2.259 secs. -->", and note the mw## id before purging the page?

That will probably help in pinpointing the problem further.

Or <!-- Served by srv#### in 2.259 secs. -->

Reedy added a comment. Jul 10 2012, 1:51 PM

No need to even do that: the dpkg output suggests that multiple servers are missing it

Reedy added a comment. Jul 10 2012, 2:11 PM

(In reply to comment #3)

No need to even do that: the dpkg output suggests that multiple servers are
missing it

bleh, ignore me

mw53 just served untidied HTML for [[de:Keith Jarrett]]

TheDJ added a comment. Jul 19 2012, 9:52 AM

Can someone with shell access do a sanity check on that host, please?

reedy@mw53:~$ which tidy
/usr/bin/tidy
reedy@mw53:~$ tidy --version
HTML Tidy for Linux released on 25 March 2009
reedy@mw53:~$ php /usr/local/apache/common-local/multiversion/MWScript.php eval.php enwiki

echo $wgTidyConf

/usr/local/apache/common-local/php-1.20wmf7/includes/tidy.conf

echo $wgTidyBin

tidy

We need to check the page source for the presence of "Tidy was unable to run" or "Tidy found serious XHTML errors"
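That check is easy to script against a saved copy of the page source. A minimal sketch follows; the two marker strings are the ones quoted above, while the function name and return value are purely illustrative.

```php
<?php
// Return whichever of the two Tidy failure messages occur in the page source.
// An empty array means neither message was found.
function findTidyFailureMarkers(string $html): array
{
    $markers = [
        'Tidy was unable to run',
        'Tidy found serious XHTML errors',
    ];
    $found = [];
    foreach ($markers as $marker) {
        if (strpos($html, $marker) !== false) {
            $found[] = $marker;
        }
    }
    return $found;
}
```

An empty result on a visibly broken page would suggest that tidy was never invoked for that render, rather than invoked and failed.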

We have a similar problem on frwiki. As far as I know, the first error was reported on 25 June, and there have been at least 10 reports since then. Today, I loaded a page (Richard Feynman) twice. The server was mw6 the first time and srv243 the second time, and I got exactly the same incorrect rendering both times (then I purged the cache, which fixed the problem).

The closing </div> of <div id="content" class="mw-body"> comes after the <!-- Served ... comment.
See: http://imageshack.us/f/801/capturedcran20120720011.png/

Served by mw4 in 0.208 secs. on dewiki

http://de.wikipedia.org/wiki/DB_City_Night_Line: Served by srv240 in 0.351 secs. (A few days ago.)

Served by mw30 in 0.196 secs. ([[de:Galatasaray Istanbul]])

<div clear="all" style="clear:both;" /><br />
<div style="background-color:#888; height:1px; width:8em;"/>

became

<p><br style="clear:both;" clear="all"/>
</br>
</p>
<div style="background-color:#888; height:1px; width:8em;"/>

Served by srv229 in 0.118 secs. ([[de:Glosche]])
Served by mw11 in 0.111 secs. ([[de:Toupet]])

(In reply to comment #12)

<div clear="all" style="clear:both;" /><br />
<div style="background-color:#888; height:1px; width:8em;"/>

became

<p><br style="clear:both;" clear="all"/>
</br>
</p>
<div style="background-color:#888; height:1px; width:8em;"/>

Sorry, I wrote nonsense there. That strange
<br style="clear:both;" clear="all"/>
</br>
was already in the wikitext. I didn't notice it because I use a script to automatically clean up some errors.

Happening right now, and not yet fixed by purging: http://de.wikipedia.org/wiki/Bundespr%C3%A4sident_%28Deutschland%29: Served by srv260 in 0.155 secs

Increasing priority, because HTML tidy is being skipped on pages more often, at least on de.wp.

HTML tidy is still missing on some pages.

Is there no solution?

As a workaround, I replaced <div ... /> with <div ... ></div> in the most used templates on frwiki. I have not seen any complaints since I did that 20 days ago. The display of some pages is still broken, but this is virtually invisible.
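For illustration only, the same replacement can be expressed as a regular expression. This is a sketch, not the script used on frwiki: the function name is made up, and the pattern only handles div tags whose attributes contain no angle brackets.

```php
<?php
// Rewrite self-closed <div ... /> tags into explicit <div ...></div> pairs,
// so the markup stays well nested even when tidy does not run.
function expandSelfClosedDivs(string $wikitext): string
{
    // \b keeps longer tag names that merely start with "div" from matching;
    // [^<>]*? captures the attributes without crossing a tag boundary.
    return preg_replace('/<div\b([^<>]*?)\s*\/>/i', '<div$1></div>', $wikitext);
}

echo expandSelfClosedDivs('<div style="background-color:#888; height:1px; width:8em;"/>'), "\n";
// <div style="background-color:#888; height:1px; width:8em;"></div>
```

Other void-style tags such as <br /> are left untouched, since only div is matched.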

This is still happening on en.wp. On 14 August 2012, mw4 and mw44 both served invalid HTML; see http://en.wikipedia.org/wiki/Wikipedia:Village_pump_(technical)/Archive_102#Are_the_HTML_generators_out_of_sync_on_some_servers.3F

Today, a few minutes ago, I saw exactly the same problem on a different article. On checking, I found that I was again served invalid HTML, this time by srv272.

The basic problem concerns two table cells which each contain an unordered list with several items. The last </li> of each list and the </ul> immediately following were not in the proper place, but placed later on: either in between one </th> and the next <td>, or between a </tr> and the </table> following.

This needs further investigation soon, because it breaks many pages on many wikis. Some users get confused because the problem disappears after a while, once someone else has purged the page.

Please have a look at this. Thanks.

I think this may be a dupe: bug 40121. Interesting thing: the NewPP limit report is missing. Page served by mw15.

Same problem seen in the navbox at the bottom of http://en.wikipedia.org/wiki/Operation_Nougat - this was served by mw20. Omitting all attributes, the textual content of enclosures, and all correctly-paired tags which do not enclose bad tags, the mis-ordered tags are:

<table>
<tr>
  <td>
    <div>
            <ul>
              <li>
    </div>
    <div>
              </li>
            </ul>
    </div>
    <table>
      <tr>
        <td>
          <div>
                  <ul>
                    <li>
          </div>
          <div>
                    </li>
                    <li>
          </div>
        </td>
      </tr>
      <tr>
        <td>
          <div>
                    </li>
                    <li>
          </div>
        </td>
      </tr>
      <tr>
        <td>
          <div>
                    </li>
                  </ul>
          </div>
        </td>
      </tr>
    </table>
  </td>
</tr>

</table>

I think I found the reason why Tidy sometimes isn't executed:

./includes/job/RefreshLinksJob.php calls
ParserOptions::newFromUserAndLang( new User, $wgContLang )
while in other places makeParserOptions from ./includes/WikiPage.php is called, which additionally calls
$options->enableLimitReport();
$options->setTidy( true );

This also explains why the limit report is missing.
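To make the difference concrete, here is a small self-contained sketch. ParserOptionsSketch is a hypothetical stand-in, not MediaWiki's ParserOptions class, and it assumes that both flags are off by default, which is what the missing tidy pass and limit report suggest.

```php
<?php
// Hypothetical stand-in for ParserOptions; only the two flags discussed
// above are modelled, and both are assumed to default to off.
final class ParserOptionsSketch
{
    private $tidy = false;
    private $limitReport = false;

    public function setTidy(bool $on): void { $this->tidy = $on; }
    public function enableLimitReport(bool $on = true): void { $this->limitReport = $on; }
    public function getTidy(): bool { return $this->tidy; }
    public function getLimitReport(): bool { return $this->limitReport; }
}

// What a bare "new options" call yields under the assumption above.
function newDefaultOptions(): ParserOptionsSketch
{
    return new ParserOptionsSketch();
}

// What the page-view path does: the same object plus the two extra calls.
function makePageViewOptions(): ParserOptionsSketch
{
    $options = newDefaultOptions();
    $options->enableLimitReport();
    $options->setTidy(true);
    return $options;
}

var_dump(newDefaultOptions()->getTidy());   // bool(false)
var_dump(makePageViewOptions()->getTidy()); // bool(true)
```

Any caller that builds its options the first way and then stores the output where page views can read it would produce exactly the symptoms reported here.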

This means that this bug report is in the wrong component, but since I know neither where it actually belongs nor how to change both product and component, I'm leaving it as is.

btw: The structure in the previous comment reminds me a bit of Alice in Wonderland: https://en.wikisource.org/wiki/Alice%27s_Adventures_in_Wonderland/Chapter_3

*** Bug 40121 has been marked as a duplicate of this bug. ***

Happened again with server mw44. BTW, the long and sad tale (which I first read in about 1973) was not at the forefront of my mind: I wanted to illustrate the mismatching by means of indent levels. For indents as deep as twelve levels, tabs are impractical, so I used two spaces.

A new one: srv200 did this too

Comment 23 contains what could be a patch. Could somebody competent look at this?

(Updating component and project.)

(In reply to comment #23)

I think I found the reason why Tidy sometimes isn't executed:

./includes/job/RefreshLinksJob.php calls

ParserOptions::newFromUserAndLang( new User, $wgContLang )

while in other places makeParserOptions from ./includes/WikiPage.php is called,
which additionally calls

$options->enableLimitReport();
$options->setTidy( true );

This also explains why the limit report is missing.

RefreshLinksJob doesn't save the results of this parse, so people shouldn't be getting these results, and avoiding Tidy calls here makes a lot of sense, as the results are used only for link updates.

I tried to reproduce the issue by creating a page with wrongly nested syntax and changing linked/transcluded pages, but everything displayed correctly.

But all reports about broken layout that say something about the NewPP limit report mention that it is missing. (Latest report: [[de:Frankfurt (Main) Hauptbahnhof]] served by mw29 and srv264)

So either there is some other place where ParserOptions is created without enabling Tidy and LimitReport, or under some strange circumstances I wasn't able to replicate, RefreshLinksJob does save the result to cache.

Tim, could you please poke at this? This seems like your kind of thing. :-)

[[de:Blue October]]: no tidy, no NewPP limit report
Saved in parser cache with key dewiki:pcache:idhash:3450708-0!*!0!!de!4!* and timestamp 20120912181102
Served by srv271 in 0.190 secs

Looks like a MobileFrontend bug. Call stack:

  • require
  • ApiMain::execute
  • ApiMain::executeActionWithErrorHandling
  • ApiMain::executeAction
  • ApiMobileView::execute
  • ApiMobileView::getData
  • WikiPage::getParserOutput
  • PoolCounterWork::execute
  • PoolWorkArticleView::doWork
  • ParserCache::save

ApiMobileView creates a new default ParserOptions object; it doesn't get one from Article::getParserOptions() etc., where tidy and limit reports are enabled.
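The effect on regular page views can be shown with a toy model. Nothing below is MediaWiki code: ToyParserCache, renderPage and getParserOutput are invented for illustration, and the sketch assumes the cache key does not vary with the tidy setting, which is what the reports in this thread suggest.

```php
<?php
// Toy model, not MediaWiki code. Assumption: the cache key is the page id
// alone, i.e. it does not vary with the tidy setting of the request.
final class ToyParserCache
{
    private $entries = [];

    public function get(int $pageId): ?string
    {
        return $this->entries[$pageId] ?? null;
    }

    public function save(int $pageId, string $html): void
    {
        $this->entries[$pageId] = $html;
    }
}

// Stand-in for the parser: "tidy" is reduced to expanding self-closed divs.
function renderPage(string $wikitext, bool $tidy): string
{
    if (!$tidy) {
        return $wikitext;
    }
    return preg_replace('/<div\b([^<>]*?)\s*\/>/', '<div$1></div>', $wikitext);
}

// Return the cached output if there is one, otherwise parse and save.
function getParserOutput(ToyParserCache $cache, int $pageId, string $wikitext, bool $tidy): string
{
    $cached = $cache->get($pageId);
    if ($cached !== null) {
        return $cached; // cache hit: this request's tidy setting is ignored
    }
    $html = renderPage($wikitext, $tidy);
    $cache->save($pageId, $html);
    return $html;
}

$cache = new ToyParserCache();
$wikitext = '<div style="clear:both;" />text';

// 1. An API request with default options (tidy off) fills the cache.
getParserOutput($cache, 1, $wikitext, false);

// 2. A regular page view asks for tidied output but gets the stale entry.
$served = getParserOutput($cache, 1, $wikitext, true);
var_dump($served === $wikitext); // bool(true)
```

In this model a purge corresponds to dropping the cache entry, after which the next regular view re-renders with tidy on. That matches the observation that purging fixes a page until an untidied render is saved again.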

Was deployed at least a week ago.
