Handle message group stats caching for long IDs
Long message group names (especially for translatable pages and
especially on meta.wikimedia.org) were silently truncated in the
database. This was masked by the fact that we automatically
re-calculated any missing stats.
It is likely that this is one of the main causes of contention on
the stats table, for which we have made many changes to use fewer
queries, to delay their execution and so on.
It seems that on meta.wikimedia.org the number of affected message
groups has recently grown so much (there are now 97) that the issue is
no longer fully masked by the execution being limited to two seconds.
The clue that led me to this discovery was that the API
for language statistics returned both truncated and non-truncated group
ids. As a side effect, this fix will make those API results smaller
by filtering non-existing groups away.
For now I solved the issue internally in the MessageGroupStats class by
truncating message group ids which are too long and appending a
hash to them to avoid collisions. I don't see a major security impact
from collisions here, but I used a 20-byte prefix of a hex-encoded
sha256 hash, just for the sake of not using algorithms that are known
to be broken. This limits the ids to 72 bytes (including separator),
which seems to be an okay balance of human readability and space use.
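
The truncate-and-hash scheme described above can be sketched roughly as
follows. This is an illustration in Python, not the code from the
MessageGroupStats class: the function name, the "||" separator and the
resulting 50-byte prefix (72 minus the separator and the 20 hash
characters) are assumptions made for the example.

```python
import hashlib

MAX_LENGTH = 72    # cap on stored ids in bytes, as stated in the text
HASH_LENGTH = 20   # number of hex characters kept from the sha256 digest
SEPARATOR = "||"   # hypothetical separator, chosen for this sketch


def to_db_id(group_id: str) -> str:
    """Map a message group id to an id that fits within MAX_LENGTH bytes."""
    raw = group_id.encode("utf-8")
    if len(raw) <= MAX_LENGTH:
        # Short ids are stored unchanged, so they stay human readable.
        return group_id

    # Keep a readable prefix; drop any multi-byte character cut in half.
    prefix_len = MAX_LENGTH - len(SEPARATOR) - HASH_LENGTH
    prefix = raw[:prefix_len].decode("utf-8", errors="ignore")

    # Hash the full original id so ids sharing a prefix remain distinct.
    digest = hashlib.sha256(raw).hexdigest()[:HASH_LENGTH]
    return prefix + SEPARATOR + digest
```

Because the hash is computed over the full id, two long ids that share
the same first bytes still map to different stored ids, while the
visible prefix keeps the rows recognizable when inspecting the table.
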
Because of this bug, this change and other reasons, the stats
table can contain unused rows. For now, some queries (specifically
MessageGroupStats::forLanguage()) load those rows but then ignore
them. It would be possible to detect those rows at that point and
schedule a clean-up, but for simplicity we just ignore them for now.
A manual full purge of the table can be done from time to time, such
as when upgrading MediaWiki.
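
The "load but ignore" behaviour can be pictured with a small sketch.
This is Python for illustration only, not the forLanguage() code: the
function name, the row shape and the id-mapping callable passed in are
all hypothetical.

```python
from typing import Callable, Dict, Iterable


def usable_stats(
    rows: Dict[str, dict],
    known_group_ids: Iterable[str],
    to_db_id: Callable[[str], str],
) -> Dict[str, dict]:
    """Return stats keyed by real group id, skipping unused rows.

    `rows` maps stored (possibly shortened) ids to stats rows.
    Rows that no known group maps to are simply never looked up.
    """
    result = {}
    for group_id in known_group_ids:
        row = rows.get(to_db_id(group_id))
        if row is not None:
            result[group_id] = row
    return result
```

Looking rows up by known group id, rather than iterating over the rows,
is what makes stale entries harmless here: they cost some loading time
but never reach the caller.
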
I checked the users of the tgs_group field in the Translate extension.
There were none outside the sql file and this class. I don't believe
any other code would access these tables directly.