Page MenuHomePhabricator

Timestamp column in EventLogging tables have incompatible collation
Closed, DeclinedPublic

Description

Working on T216185, I chose to use the PrefUpdate EventLogging tables in the log database as they provide data going back the couple of years that we're looking for. In doing so, I ran into the problem that the timestamp column in PrefUpdate_5563398 has collation defined, whereas the one in PrefUpdate_5563398_15423246 does not, as shown in the SHOW CREATE TABLE output below (condensed for brevity):

CREATE TABLE `PrefUpdate_5563398` (
...
`timestamp` varchar(14) COLLATE utf8_unicode_ci NOT NULL,
...
) ENGINE=TokuDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci `compression`='tokudb_zlib'
CREATE TABLE `PrefUpdate_5563398_15423246` (
...
`timestamp` varchar(14) DEFAULT NULL,
...
) ENGINE=TokuDB AUTO_INCREMENT=68969415 DEFAULT CHARSET=utf8 `compression`='tokudb_zlib'

This results in the following unexpected result when trying to naively combine data from both tables:

SELECT *
FROM (
    (SELECT timestamp
     FROM PrefUpdate_5563398_15423246
     WHERE timestamp >= '20160101000000'
     AND timestamp < '20160201000000')
UNION ALL
    (SELECT timestamp
     FROM PrefUpdate_5563398
     WHERE timestamp >= '20180101000000'
     AND timestamp < '20180201000000')
) AS pu
LIMIT 5;
ERROR 1271 (HY000): Illegal mix of collations for operation 'UNION'

One solution to the problem is to force collation for the non-collated column by changing the first sub-select so it starts with: SELECT timestamp COLLATE utf8_unicode_ci.

I have not investigated whether other EventLogging schema tables in the log database is also affected by this.

Event Timeline

Milimetric subscribed.

This sucks but we're not likely to work on it, as we're moving away from mysql. We don't want to be mean though, so we can help sqoop this stuff into Hadoop if you need to use your painful workaround too much.

This sucks but we're not likely to work on it, as we're moving away from mysql. We don't want to be mean though, so we can help sqoop this stuff into Hadoop if you need to use your painful workaround too much.

Understood, but do we know the reason for this discrepancy? Does it have to do with the general changes to the event capsule (T179625, see also T179540) that happened inbetween PrefUpdate_5563398_15423246 and PrefUpdate_5563398 ?
If it's an issue that affects more than one EL schema, it would be worth documenting it on e.g. https://wikitech.wikimedia.org/wiki/Analytics/Systems/EventLogging , with a link to @nettrom_WMF's workaround.

This sucks but we're not likely to work on it, as we're moving away from mysql. We don't want to be mean though, so we can help sqoop this stuff into Hadoop if you need to use your painful workaround too much.

Understood, but do we know the reason for this discrepancy? Does it have to do with the general changes to the event capsule (T179625, see also T179540) that happened inbetween PrefUpdate_5563398_15423246 and PrefUpdate_5563398 ?

PS: actually I think that happened later; rather, the relevant capsule change was T160454 (I also just added that to the documentation).

If it's an issue that affects more than one EL schema, it would be worth documenting it on e.g. https://wikitech.wikimedia.org/wiki/Analytics/Systems/EventLogging , with a link to @nettrom_WMF's workaround.

I haven't looked into it, but the naming of PrefUpdate_5563398_15423246 is unusual. IIRC, tables with an extra suffix are some kind of backup or archive table that exist because of either some migration or bug. I'm sure it has real data in it, but possibly the collation mismatch has something to do with some old data issue or migration?

I haven't looked into it, but the naming of PrefUpdate_5563398_15423246 is unusual. IIRC, tables with an extra suffix are some kind of backup or archive table that exist because of either some migration or bug. I'm sure it has real data in it, but possibly the collation mismatch has something to do with some old data issue or migration?

I guess this comment crossed with T216658#4973658 - see the explanation there. One could for example double-check if any of the changes implemented in T160454 could have affected collation.