tag_summary missing records
Closed, Resolved, Public

Description

tag_summary duplicates data in change_tag, but is missing some records.

E.g.:

select * from change_tag where ct_rev_id = 563615370;
+-----------+-----------+-----------+--------------+-----------+
| ct_rc_id  | ct_log_id | ct_rev_id | ct_tag       | ct_params |
+-----------+-----------+-----------+--------------+-----------+
| 589674173 | NULL      | 563615370 | visualeditor | NULL      |
+-----------+-----------+-----------+--------------+-----------+

select * from tag_summary where ts_rev_id = 563615370;
Empty set (0.01 sec)
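
The single-revision check above could be generalized into an anti-join that lists every tagged revision with no tag_summary row. The following is only a sketch: it uses Python's sqlite3 with in-memory stand-in tables instead of the production MySQL database, and the ts_* column names other than ts_rev_id and ts_log_id are assumptions not shown in this report.

```python
import sqlite3

# In-memory stand-ins for the two tables; column types are simplified.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE change_tag (ct_rc_id INT, ct_log_id INT, ct_rev_id INT,
                         ct_tag TEXT, ct_params TEXT);
CREATE TABLE tag_summary (ts_rc_id INT, ts_log_id INT, ts_rev_id INT,
                          ts_tags TEXT);
-- The row from the example above; tag_summary is left empty on purpose.
INSERT INTO change_tag VALUES (589674173, NULL, 563615370, 'visualeditor', NULL);
""")

# Anti-join: tagged revisions that have no matching tag_summary row.
missing = conn.execute("""
    SELECT ct.ct_rev_id, ct.ct_tag
    FROM change_tag AS ct
    LEFT JOIN tag_summary AS ts ON ts.ts_rev_id = ct.ct_rev_id
    WHERE ct.ct_rev_id IS NOT NULL
      AND ts.ts_rev_id IS NULL
""").fetchall()

print(missing)
```

On a table the size of enwiki's, a query like this would presumably need to be run in batches against a replica rather than in one pass.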

Cause unknown at time of writing.

Relevant recent activity:

https://en.wikipedia.org/wiki/Wikipedia:Village_pump_%28technical%29#VisualEditor_tag_not_working_correctly

https://bugzilla.wikimedia.org/show_bug.cgi?id=40867


Version: unspecified
Severity: major

Details

Blocks
T42867: Update change tag indexes
Reference
bz51254
bzimport set Reference to bz51254.
Springle created this task. Jul 12 2013, 7:18 PM

Only seems to affect en.wp right now (works correctly on pl.wp and mw.org, for example).

Sean and Asher narrowed this down to a problem with the schema change tool that we use, and are working on a strategy to fix the data. This looks like a strictly db-related problem that, once fixed, should stay fixed (assuming we don't try another similar schema migration before an upstream fix is made to the migration tool).

Was it determined if any other databases apart from en.wp's one were affected?

Reedy added a comment. Jul 12 2013, 9:00 PM

(In reply to comment #3)

Was it determined if any other databases apart from en.wp's one were
affected?

Not sure. The wikis that potentially may have this issue are:

+ 'arwiki' => true,
+ 'commonswiki' => true,
+ 'cswiki' => true,
+ 'dewiki' => true,
+ 'elwiki' => true,
+ 'enwiki' => true,
+ 'enwikisource' => true,
+ 'enwiktionary' => true,
+ 'eswiki' => true,
+ 'etwiki' => true,
+ 'fawiki' => true,
+ 'fiwiki' => true,
+ 'frwiki' => true,
+ 'hewiki' => true,
+ 'huwiki' => true,
+ 'idwiki' => true,
+ 'itwiki' => true,
+ 'jawiki' => true,
+ 'ltwiki' => true,
+ 'mrwiki' => true,
+ 'nlwiki' => true,
+ 'plwiki' => true,
+ 'ptwiki' => true,
+ 'rowiki' => true,
+ 'ruwiki' => true,
+ 'simplewiki' => true,
+ 'svwiki' => true,
+ 'trwiki' => true,
+ 'ukwiki' => true,
+ 'zhwiki' => true,

cf bug 40867#c6

Firstly, we've determined this problem occurred due to an (apparent) bug in pt-online-schema-change when using a combination of:

  • A table without primary key
  • A table with unique indexes that all include nullable columns
  • An unfortunately timed REPLACE statement in normal db traffic

Posc does online table alteration by:

  • Creating a copy of the table with altered schema
  • Setting triggers on the original table to keep the copy updated
  • Copying data across using a batch process

In this case, posc set a DELETE trigger on tag_summary using a poor UNIQUE index (ts_log_id) with low cardinality and a nullable field. Then, during the batching process, an external REPLACE statement with ts_log_id=NULL caused far too many rows to be deleted in the temporary table being altered. Given that many rows in tag_summary have ts_log_id=NULL, the table was massively reduced in size.
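
The failure mode described above can be reproduced in miniature. This is a sketch under stated assumptions, not the trigger posc actually generated: it uses SQLite via Python's sqlite3 rather than MySQL, the trigger body and the name tag_summary_new are hypothetical, SQLite's IS stands in for a NULL-safe comparison, and a plain DELETE stands in for the row removal that a REPLACE performs on a key conflict.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tag_summary (ts_rc_id INT, ts_log_id INT, ts_rev_id INT, ts_tags TEXT);
-- No primary key; the only UNIQUE key chosen here is on a nullable column.
CREATE UNIQUE INDEX tag_summary_log_id ON tag_summary (ts_log_id);
-- Hypothetical copy of the table being built by the migration.
CREATE TABLE tag_summary_new (ts_rc_id INT, ts_log_id INT, ts_rev_id INT, ts_tags TEXT);
""")

# Five revision tags (ts_log_id is NULL) and one log tag; ids are illustrative.
rows = [(n, None, 100 + n, "visualeditor") for n in range(1, 6)]
rows.append((6, 7, None, "blanking"))
for table in ("tag_summary", "tag_summary_new"):
    conn.executemany(f"INSERT INTO {table} VALUES (?, ?, ?, ?)", rows)

# Hypothetical sync trigger: mirror deletes into the copy, matching rows
# on ts_log_id only, with a NULL-safe comparison.
conn.execute("""
CREATE TRIGGER sync_delete AFTER DELETE ON tag_summary
BEGIN
    DELETE FROM tag_summary_new WHERE ts_log_id IS OLD.ts_log_id;
END
""")

# Normal traffic removes ONE row whose ts_log_id is NULL ...
conn.execute("DELETE FROM tag_summary WHERE ts_rev_id = 101")

# ... but the copy loses EVERY row whose ts_log_id is NULL.
original_rows = conn.execute("SELECT COUNT(*) FROM tag_summary").fetchone()[0]
copy_rows = conn.execute("SELECT COUNT(*) FROM tag_summary_new").fetchone()[0]
print(original_rows, copy_rows)
```

The original table loses the one intended row, while the copy keeps only the row with a non-NULL ts_log_id, which matches the "massively reduced in size" symptom.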

Now to the fix:

We've checked the other wikis and found no problems; only enwiki was affected.

Furthermore, only enwiki.tag_summary was affected. We've verified that enwiki.change_tag is complete and did not suffer the same problem. This was based on:

  • Index cardinality and table size information collected before running the schema migration
  • An investigation of the events in the binary log surrounding the migration period

Currently we are rebuilding tag_summary based on change_tag data. That will complete within 30 mins at the time of writing this comment.

enwiki.tag_summary rebuild is complete.

swalling wrote:

Just checked this on-wiki as well. Seems fixed.

Sorry to add to what I'm sure was a bit of a hectic day for someone, but I'm still seeing lingering bits of corruption. Perhaps some sort of edge case that wasn't handled correctly by the rebuild? 99.9% of tags may be okay at this point, but here are some examples that still seem to be errors.

An API query for 200 revisions flagged as "blanking":

http://en.wikipedia.org/w/api.php?action=query&list=recentchanges&rctag=blanking&rclimit=200&rcprop=user%7Ccomment%7Ctitle%7Ctags%7Ctimestamp|ids&rccontinue=2013-07-12T22:20:40Z|589061595

While this query returns 200 entries, we find that only 188 of them report as actually having the "blanking" tag.

The remainder are things like

rcid="590123889" timestamp="2013-07-12T14:30:16Z"
<tag>visualeditor</tag>

rcid="590032703" timestamp="2013-07-12T00:33:31Z" 
<tag>mobile edit</tag>

Where some other tag is reported but the expected "blanking" tag is not reported.
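
A check like the one described above could be scripted. The sketch below assumes each result has already been parsed into a dict with an `rcid` and a `tags` list (an assumed shape for illustration); the helper name and the third sample entry are hypothetical, while the first two entries are the ones quoted above.

```python
def entries_missing_tag(entries, tag):
    """Return the entries whose reported tags do not include `tag`."""
    return [entry for entry in entries if tag not in entry.get("tags", [])]


# Entries returned for rctag=blanking.
sample = [
    {"rcid": 590123889, "tags": ["visualeditor"]},
    {"rcid": 590032703, "tags": ["mobile edit"]},
    {"rcid": 1, "tags": ["blanking"]},  # hypothetical well-formed entry
]

suspect = entries_missing_tag(sample, "blanking")
print([entry["rcid"] for entry in suspect])
```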

For another example of this issue see the API query for the "visualeditor-needcheck" tag:

http://en.wikipedia.org/w/api.php?action=query&list=recentchanges&rctag=visualeditor-needcheck&rclimit=200&rcprop=user%7Ccomment%7Ctitle%7Ctags%7Ctimestamp|ids

This tag should only be applied if the "visualeditor" tag is also present, but we observe that most of the results have either "visualeditor" or "visualeditor-needcheck" but not both. A few entries even have other tags entirely.

What appears to have happened is that rebuild didn't correctly handle cases where a single revision was subject to multiple tags. Instead it looks as though the rebuilt table applies at most one tag to each of the historical revisions. Most of the time that's okay since few revisions actually have multiple tags, but it still leaves a bit of corruption and missing data on the rare cases when a revision is expected to have multiple tags.

(In reply to comment #8)

A API query of 200 revisions tags as flagged as "blanking":
While this query returns 200 entries, we find that only 188 of them report as
actually having the "blanking" tag.

That's still the case today.

greg added a comment. Jul 15 2013, 5:31 PM

Lowering priority a bit since I don't think there is data loss here (the table that was used to recreate the data still exists).

James: Assigning to you to determine the priority for getting around to fixing this data (since it affects VE related data, and you know what metrics are being tracked).

Am investigating whether the tag_summary rebuild was conceptually flawed with regard to revisions with multiple tags, or not.

Also dumping enwiki binlogs on a slave (we have a month's worth) and pulling out all change_tag queries. Will reload them offline and join against a copy of change_tag to prove whether it is, in fact, completely intact.

As Robert suggested in comment 8, the rebuild process missed some rows where revisions had multiple tags.

The script has been fixed and will run in batches on enwiki today. More info shortly...
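
For illustration only, a rebuild that keeps multiple tags per revision might group change_tag rows by their id columns and concatenate the tags. The actual rebuild script is not shown in this report; the sketch below assumes tag_summary stores a comma-separated list in a ts_tags column, runs on SQLite via Python's sqlite3 rather than MySQL, and uses made-up ids.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE change_tag (ct_rc_id INT, ct_log_id INT, ct_rev_id INT,
                         ct_tag TEXT, ct_params TEXT);
CREATE TABLE tag_summary (ts_rc_id INT, ts_log_id INT, ts_rev_id INT,
                          ts_tags TEXT);

-- Revision 101 carries two tags; revision 102 carries one.
INSERT INTO change_tag VALUES
    (1, NULL, 101, 'visualeditor', NULL),
    (1, NULL, 101, 'visualeditor-needcheck', NULL),
    (2, NULL, 102, 'blanking', NULL);

-- One summary row per change, with all of its tags joined together.
INSERT INTO tag_summary
SELECT ct_rc_id, ct_log_id, ct_rev_id, group_concat(ct_tag, ',')
FROM change_tag
GROUP BY ct_rc_id, ct_log_id, ct_rev_id;
""")

# Tags per revision, sorted because group_concat order is not guaranteed.
summary = {
    rev_id: sorted(tags.split(","))
    for rev_id, tags in conn.execute("SELECT ts_rev_id, ts_tags FROM tag_summary")
}
print(summary)
```

A rebuild that inserted one summary row per change_tag row, or kept only the first tag per revision, would produce the single-tag symptom reported in comment 8.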

Btw, change_tag still looks complete to me; the binlog shows no problems there. Should just be the tag_summary rebuild logic at fault.

Rebuild #2 of tag_summary has completed and the reports in comment 8 look better (to me). Anyone care to verify...

(In reply to comment #14)

Rebuild #2 of tag_summary has completed and the reports in comment 8 look
better (to me). Anyone care to verify...

Appears to work for me, yes. Might be worth waiting for others to weigh in, but from my POV this is fixed.

Much better, but I'm still seeing some issues:

Looking for 500 "blanking" tags gives 498 "blanking" plus 2 labeled as just "mobile edit".

http://en.wikipedia.org/w/api.php?action=query&list=recentchanges&rctag=blanking&rclimit=500&rcprop=user%7Ccomment%7Ctitle%7Ctags%7Ctimestamp

As a follow up, the two problematic tags I note in Comment 16 are both recent. It is possible they have a different underlying cause than the previous corruption. For example, this might represent a logic error in how the "mobile edit" tag is being recorded.

swalling wrote:

(In reply to comment #16)

Much better, but I'm still seeing some issues:

Looking for 500 "blanking" tags gives 498 "blanking" plus 2 labeled as just
"mobile edit".

http://en.wikipedia.org/w/api.php?action=query&list=recentchanges&rctag=blanking&rclimit=500&rcprop=user%7Ccomment%7Ctitle%7Ctags%7Ctimestamp

There are other strange things going on with tags...

http://en.wikipedia.org/wiki/Wikipedia_talk:Tags#Incorrect_tagging

Not sure if it's related or if we should file a separate bug for incorrect tagging. I think mobile is also suffering from this issue (or was as of yesterday).

Whatever is causing that (maybe just a misconfigured local filter?), it's most likely not related to this bug.

That was bug 52077. Closing this.
