Some pages have a 'null' lastmod field in the sitemap!
Closed, Resolved (Public)

Description

Author: dan.bolser

Description:
I'm creating a sitemap for my wiki with the following command:

php /memberroot/dmb/public_html/metabase/mw/maintenance/generateSitemap.php \
--fspath /memberroot/dmb/public_html/metabase/mw/sitemap \
--server http://metadatabase.org \
--urlpath http://metadatabase.org/sitemap

When I load this into google webmaster tools, almost everything works fine. However, a couple of pages have a weird 'null' lastmod field:

<url>
        <loc>http://metadatabase.org/wiki/Main_Page</loc>
        <lastmod></lastmod>
        <priority>1.0</priority>
</url>

and:

<url>
        <loc>http://metadatabase.org/wiki/Help:About</loc>
        <lastmod></lastmod>
        <priority>0.5</priority>
</url>

It's always these two pages!

This causes Google to barf with an error about an incorrect date format.
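To spot the affected entries before submitting the sitemap, something like the following could be used. It is only a sketch in Python, not part of MediaWiki: the function names are made up for illustration, the `<urlset>` wrapper is added so the fragment parses, and the third entry in the sample is a hypothetical well-formed page for contrast.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Drop an optional '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def urls_with_empty_lastmod(sitemap_xml):
    """Return the <loc> of every <url> whose <lastmod> is present but empty."""
    root = ET.fromstring(sitemap_xml)
    bad = []
    for url in root.iter():
        if local_name(url.tag) != "url":
            continue
        loc, lastmod, has_lastmod = None, None, False
        for child in url:
            name = local_name(child.tag)
            if name == "loc":
                loc = (child.text or "").strip()
            elif name == "lastmod":
                has_lastmod = True
                lastmod = (child.text or "").strip()
        if has_lastmod and not lastmod:
            bad.append(loc)
    return bad


# The first two entries are the ones quoted in this report.
# The third is a hypothetical valid entry, included only for contrast.
SAMPLE = """<urlset>
<url>
        <loc>http://metadatabase.org/wiki/Main_Page</loc>
        <lastmod></lastmod>
        <priority>1.0</priority>
</url>
<url>
        <loc>http://metadatabase.org/wiki/Help:About</loc>
        <lastmod></lastmod>
        <priority>0.5</priority>
</url>
<url>
        <loc>http://example.org/wiki/Hypothetical_Page</loc>
        <lastmod>2011-07-05T21:40:06Z</lastmod>
        <priority>0.5</priority>
</url>
</urlset>"""

if __name__ == "__main__":
    for loc in urls_with_empty_lastmod(SAMPLE):
        print(loc)
```

The tag comparison ignores any XML namespace, so the same check should also work on a full sitemap file that declares one.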


Version: 1.17.x
Severity: normal

Details

Reference
bz29687

Event Timeline

bzimport raised the priority of this task to Medium. Nov 21 2014, 11:27 PM
bzimport set Reference to bz29687.
bzimport added a subscriber: Unknown Object (MLST).

dan.bolser wrote:

Here is the exact error message from Google Webmaster Tools:

6680 Invalid date
An invalid date was found. Please fix the date or formatting before resubmitting.

Parent tag: url
Tag: lastmod
Value:

Problem detected on: Jul 3, 2011

Dan, can you check what the page_touched values for these rows in the page table are?

Normally this should carry a timestamp, which MediaWiki on MySQL stores as a 14-character string (YYYYMMDDHHMMSS). A null or empty value *should* end up being formatted as the current time, though there may be some bad values in those rows.
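As a rough illustration of the expected behaviour (a Python sketch, not MediaWiki's actual code), a well-formed 14-character value maps onto a date that a sitemap can carry, and a null or empty value falls back to the current time. The function name, the `now` parameter, and the choice of the ISO 8601 UTC form for the output are assumptions made for this example.

```python
import re
from datetime import datetime, timezone

# A stored timestamp is expected to be exactly 14 ASCII digits: YYYYMMDDHHMMSS.
_TS_14 = re.compile(r"[0-9]{14}")


def lastmod_from_page_touched(page_touched, now=None):
    """Format a 14-character timestamp as an ISO 8601 UTC string.

    None or '' falls back to the current time, or to `now` if given
    (a hypothetical parameter that keeps the example deterministic).
    Anything else that is not 14 digits raises ValueError.
    """
    if not page_touched:
        moment = now or datetime.now(timezone.utc)
    else:
        if not _TS_14.fullmatch(page_touched):
            raise ValueError(f"not a 14-digit timestamp: {page_touched!r}")
        moment = datetime.strptime(page_touched, "%Y%m%d%H%M%S")
    return moment.strftime("%Y-%m-%dT%H:%M:%SZ")


if __name__ == "__main__":
    print(lastmod_from_page_touched("20110705214006"))
    print(lastmod_from_page_touched(""))  # current time
```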

dan.bolser wrote:

I debugged this a bit with TimStarling, but the results are still a bit confusing.

It seems that MediaWiki is corrupting the page_touched field!

I recently imported the data for this wiki from MW 1.11 (using MySQL dump 10.13 Distrib 5.1.56) into MW 1.17 (using MySQL 4.1.22-standard-log). Since that import, I touched a couple of pages (guess which?) and discovered this problem with the site map.

Since reporting the two problem pages, I ran a process that touched many pages, and here is the state of the page_touched field:

mysql> select page_touched, count(*) from mb_page group by page_touched limit 20;
+----------------+----------+
| page_touched   | count(*) |
+----------------+----------+
| 2.01107052046E |      606 |
| 2.01107052047E |     1179 |
| 2.01107052048E |     1116 |
| 2.01107052049E |     1255 |
| 2.01107052092E |        2 |
| 2.01107052094E |        1 |
| 2.01107052095E |        1 |
| 2.01107052096E |        5 |
| 2.01107052097E |      275 |
| 2.01107052098E |      132 |
| 2.01107052227E |        2 |
| 2.01107052229E |        1 |
| 2.0110705223E+ |        1 |
| 20070810130314 |        1 |
| 20090609211125 |        1 |
| 20100315174918 |        1 |
| 20110705214006 |        1 |
+----------------+----------+
17 rows in set (0.01 sec)
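To put a number on the damage, the rows above can be split into values that are 14 plain digits and values that are not. This is a small Python sketch over the figures copied from the query output; the helper names are made up for illustration.

```python
import re

# (page_touched, count) pairs copied from the query output above.
ROWS = [
    ("2.01107052046E", 606),
    ("2.01107052047E", 1179),
    ("2.01107052048E", 1116),
    ("2.01107052049E", 1255),
    ("2.01107052092E", 2),
    ("2.01107052094E", 1),
    ("2.01107052095E", 1),
    ("2.01107052096E", 5),
    ("2.01107052097E", 275),
    ("2.01107052098E", 132),
    ("2.01107052227E", 2),
    ("2.01107052229E", 1),
    ("2.0110705223E+", 1),
    ("20070810130314", 1),
    ("20090609211125", 1),
    ("20100315174918", 1),
    ("20110705214006", 1),
]


def is_valid_timestamp(value):
    """True only for exactly 14 ASCII digits (YYYYMMDDHHMMSS)."""
    return re.fullmatch(r"[0-9]{14}", value) is not None


def summarize(rows):
    """Return (pages with a valid value, pages with a corrupted value)."""
    valid = sum(count for value, count in rows if is_valid_timestamp(value))
    corrupted = sum(count for value, count in rows if not is_valid_timestamp(value))
    return valid, corrupted


if __name__ == "__main__":
    valid, corrupted = summarize(ROWS)
    print(f"valid: {valid}, corrupted: {corrupted}")
```

On these figures, all but four pages carry a value that is not a plain 14-digit timestamp.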

With help from TimStarling, I checked the data in my import 'dump' file, and concluded that it looks fine. I re-imported the dump, and it looked fine in the database (no 'corruption' like the above). I then followed the 1.17 DB update procedure (first using the GUI and then again using the CLI), and both looked fine (no corruption).

Then I 'recovered' the correct page_touched field using the update (to keep my changes post import).

Then I edited a page, and saw the same corruption!

Before edit:

mysql> select * from mb_page where page_title = "Main_Page";
+---------+----------------+------------+-------------------+--------------+------------------+-------------+----------------+----------------+-------------+----------+
| page_id | page_namespace | page_title | page_restrictions | page_counter | page_is_redirect | page_is_new | page_random    | page_touched   | page_latest | page_len |
+---------+----------------+------------+-------------------+--------------+------------------+-------------+----------------+----------------+-------------+----------+
|    4792 |              0 | Main_Page  |                   |        93887 |                0 |           0 | 0.940133380737 | 20090811074455 |       15093 |     1069 |
+---------+----------------+------------+-------------------+--------------+------------------+-------------+----------------+----------------+-------------+----------+
2 rows in set (0.01 sec)

After edit:

mysql> select * from mb_page where page_title = "Main_Page";
+---------+----------------+------------+-------------------+--------------+------------------+-------------+----------------+----------------+-------------+----------+
| page_id | page_namespace | page_title | page_restrictions | page_counter | page_is_redirect | page_is_new | page_random    | page_touched   | page_latest | page_len |
+---------+----------------+------------+-------------------+--------------+------------------+-------------+----------------+----------------+-------------+----------+
|    4792 |              0 | Main_Page  |                   |        93888 |                0 |           0 | 0.940133380737 | 2.01107060996E |       15097 |     1074 |
+---------+----------------+------------+-------------------+--------------+------------------+-------------+----------------+----------------+-------------+----------+
2 rows in set (0.02 sec)

As I said, in the interim I had touched many pages. Going to Google Webmaster Tools, I now see many errors!

Seems pretty clear, now that I've set it out, that MW 1.17 (+ extensions) is borking the page_touched field on this version of MySQL, leading to an error in the sitemap.
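The shape of the bad values is at least consistent with a 14-digit number being written out in scientific notation and then cut off at the 14-character column width. The Python sketch below only reproduces that shape. It is a guess, not an established cause: the input values are hypothetical, the 12 significant digits are an assumption picked because they match the observed strings, and nothing in this report confirms where the conversion happens. Some of the observed values (for example 2.01107052092E, which would read as minute 92) also do not correspond to a valid clock time, so this cannot be the whole story.

```python
def sci_truncated(value, width=14, digits=12):
    """Render a number in scientific notation, then cut it to `width` characters.

    `digits` is the number of significant digits kept before truncation.
    Both defaults are assumptions chosen to match the strings seen above.
    """
    return format(float(value), f".{digits}G")[:width]


if __name__ == "__main__":
    # Hypothetical timestamps treated as numbers instead of strings.
    for ts in (20110705204612, 20110705223004):
        print(ts, "->", sci_truncated(ts))
```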

Dan: Is this still a problem?
If so, which MW version do you use nowadays?

Krinkle closed this task as Resolved. Jan 7 2020, 7:58 PM
Krinkle claimed this task.
Krinkle subscribed.

From what I can tell, the root cause here is data corruption in the database. As best I can tell, the source of the corruption was fixed sometime between 2012 and now.

The failure mode could be improved. For example, when such an invalid value is encountered, a fatal exception should be thrown instead of silently continuing, because the contract for the return value of the internal Revision method, which the sitemap script embeds in its output, is violated when an empty string is returned (likely the result of casting boolean false). In any event, that would not fix the bad value itself.
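A fail-fast variant of that idea might look like the Python sketch below. It is not the sitemap script's code: `InvalidTimestampError` and `render_url_entry` are names invented for this example, and the ISO 8601 output format is an assumption.

```python
import re
from datetime import datetime
from xml.sax.saxutils import escape


class InvalidTimestampError(ValueError):
    """Raised when a stored timestamp cannot be turned into a date."""


def render_url_entry(loc, page_touched, priority):
    """Build one <url> entry, refusing to emit an empty <lastmod>."""
    # Check the shape explicitly rather than relying on the date parser alone.
    if not re.fullmatch(r"[0-9]{14}", page_touched or ""):
        raise InvalidTimestampError(
            f"bad page_touched {page_touched!r} for {loc}"
        )
    try:
        moment = datetime.strptime(page_touched, "%Y%m%d%H%M%S")
    except ValueError as exc:
        # 14 digits, but not a real calendar date or clock time.
        raise InvalidTimestampError(
            f"bad page_touched {page_touched!r} for {loc}"
        ) from exc
    lastmod = moment.strftime("%Y-%m-%dT%H:%M:%SZ")
    return (
        "<url>\n"
        f"\t<loc>{escape(loc)}</loc>\n"
        f"\t<lastmod>{lastmod}</lastmod>\n"
        f"\t<priority>{priority:.1f}</priority>\n"
        "</url>\n"
    )


if __name__ == "__main__":
    url = "http://metadatabase.org/wiki/Main_Page"
    print(render_url_entry(url, "20090811074455", 1.0))
    try:
        render_url_entry(url, "2.01107060996E", 1.0)
    except InvalidTimestampError as err:
        print("refused:", err)
```

Raising with the page URL in the message points straight at the row that needs repair, which an empty `<lastmod>` in the output does not.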

Closing for now, as I believe this stricter failure mode is now also in place: we have a number of more recent reports about timestamp conversion causing exceptions, which have helped trace down cases where we still store or accept bad dates. As such, there is no further action item here.