
Support hash-based deduplication in KeyValueDependencyStore
Closed, Declined · Public

Description

KeyValueDependencyStore should be able to use a two-level key scheme (entity key => blob hash; blob hash key => blob) to massively de-duplicate the numerous identical entries.
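A minimal sketch of what that two-level scheme could look like on top of a BagOStuff, purely to illustrate the idea; the function names, key names, and value layout below are made up for illustration and are not the actual KeyValueDependencyStore API:

// Illustration only: entity keys hold a small hash reference, and the
// hash keys hold the actual dependency blob (shareable across wikis).
function storeDeps( BagOStuff $cache, string $entityKey, array $paths, int $ttl ) {
    $blob = json_encode( $paths );
    $hash = sha1( $blob );
    // Per-entity record: just "ref:<hash>" plus a timestamp (~74 bytes).
    $cache->set(
        $cache->makeKey( 'deps-entity', $entityKey ),
        [ 'paths' => "ref:$hash", 'asOf' => time() ],
        $ttl
    );
    // One copy of the blob per distinct dependency list.
    $cache->set( $cache->makeGlobalKey( 'deps-blob', $hash ), $blob, $ttl );
}

function fetchDeps( BagOStuff $cache, string $entityKey ): ?array {
    $entry = $cache->get( $cache->makeKey( 'deps-entity', $entityKey ) );
    if ( !is_array( $entry ) || strncmp( $entry['paths'] ?? '', 'ref:', 4 ) !== 0 ) {
        return null;
    }
    $blob = $cache->get( $cache->makeGlobalKey( 'deps-blob', substr( $entry['paths'], 4 ) ) );
    return is_string( $blob ) ? json_decode( $blob, true ) : null;
}

Using a global key for the blob side is what would allow the distinct blobs to be shared across wikis, per the IRC discussion below.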

From IRC:

<•AaronSchulz> we could have key => hash value => hash key => list of deps
14:06 <•Krinkle> (testwiki)> SELECT COUNT(*) FROM module_deps; |     1474 |
14:06 (testwiki)> SELECT COUNT(DISTINCT(md_deps)) FROM module_deps LIMIT 2; |216 |
14:07 commonswiki: 38932 vs 255
14:09 so the 38932 rows would reduce from a JSON array of 1..10 relative file paths, to a hash
14:09 and then 255 additional rows
14:10 s/rows/keys (assuming only for KeyValue)
14:11 well, fewer additional rows I suppose, since we can share that part across wikis
14:13 I think we currently average about ~ 120 bytes per row (incl key)
14:13 e.g.
14:13 | SpecialConstraintReportPage | vector|de | ["resources/lib/ooui/wikimedia-ui-base.less","resources/src/mediawiki.less/mediawiki.ui/variables.less"]

Event Timeline

Depending on whether and how quickly we move the main stash to SQL, we might be able to do this later instead.

It would also be worth doing a rough napkin calculation of how much space this would take up with vs. without the change, given that we'd still need the same number of keys and the same metadata field values, just a shorter value for one of the fields. And then for every unique dependency value we'd add a new key holding the larger blob.

I don't have a sense of how this plays out, but it doesn't seem impossible that it might end up larger, or smaller but still "too big".
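For a hedged sketch of that napkin math, plugging in the testwiki numbers from the IRC log above (1474 entries, 216 distinct blobs, ~120 bytes per entry today) and assuming each entry shrinks to the ~74-byte reference record measured further down in this task:

// Rough napkin estimate for testwiki; all inputs come from this task.
$entries = 1474;        // total module_deps rows
$distinct = 216;        // distinct md_deps blobs
$avgEntryBytes = 120;   // current average per entry, including the key
$refEntryBytes = 74;    // hash reference + asOf, as measured below

$inline = $entries * $avgEntryBytes;   // ≈ 177 KB today
// Each entry keeps a small reference; each distinct blob is stored once
// (approximated here at the same ~120 bytes, which overstates it slightly).
$hashed = $entries * $refEntryBytes + $distinct * $avgEntryBytes; // ≈ 135 KB

The mwscript runs below do the same kind of estimate per wiki against the real data.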

Krinkle triaged this task as Medium priority. Jun 2 2020, 8:15 PM
Krinkle removed a project: Epic.

Some more info:

aaron@mwmaint1002:~$ mwscript eval.php --wiki=testwiki
> echo strlen(json_encode(['paths'=>'ref:af18298daea159b6ca5283c0c1aa45e7155e4412', 'asOf' => time()]));
74
>
aaron@mwmaint1002:~$ mwscript sql.php --wiki=mediawikiwiki
> select sum(len) + 71 * sum(n) as bytes_if_hashed, sum(n*len) as bytes_if_inline from (select SHA1(md_deps) as deps_hash, count(*) as n, max(LENGTH(md_deps)) as len from module_deps group by deps_hash) tmp;
stdClass Object
(
    [bytes_if_hashed] => 717663
    [bytes_if_inline] => 2565027
)

>
aaron@mwmaint1002:~$ mwscript sql.php --wiki=wikidatawiki
> select sum(len) + 71 * sum(n) as bytes_if_hashed, sum(n*len) as bytes_if_inline from (select SHA1(md_deps) as deps_hash, count(*) as n, max(LENGTH(md_deps)) as len from module_deps group by deps_hash) tmp;
stdClass Object
(
    [bytes_if_hashed] => 6266199
    [bytes_if_inline] => 26180160
)

>
aaron@mwmaint1002:~$ mwscript sql.php --wiki=commonswiki
> select sum(len) + 71 * sum(n) as bytes_if_hashed, sum(n*len) as bytes_if_inline from (select SHA1(md_deps) as deps_hash, count(*) as n, max(LENGTH(md_deps)) as len from module_deps group by deps_hash) tmp;
stdClass Object
(
    [bytes_if_hashed] => 4333020
    [bytes_if_inline] => 18545998
)

aaron@mwmaint1002:~$ mwscript sql.php --wiki=svwiki
> select sum(len) + 71 * sum(n) as bytes_if_hashed, sum(n*len) as bytes_if_inline from (select SHA1(md_deps) as deps_hash, count(*) as n, max(LENGTH(md_deps)) as len from module_deps group by deps_hash) tmp;
stdClass Object
(
    [bytes_if_hashed] => 500686
    [bytes_if_inline] => 1649965
)

>
aaron@mwmaint1002:~$ mwscript sql.php --wiki=enwiki
> select sum(len) + 71 * sum(n) as bytes_if_hashed, sum(n*len) as bytes_if_inline from (select SHA1(md_deps) as deps_hash, count(*) as n, max(LENGTH(md_deps)) as len from module_deps group by deps_hash) tmp;
stdClass Object
(
    [bytes_if_hashed] => 9878934
    [bytes_if_inline] => 41584304
)
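To spell out what those queries estimate (a restatement of the SQL above, not anything the store itself would run): bytes_if_inline charges every entry the full blob, while bytes_if_hashed charges every entry a ~71-byte hash reference plus one shared copy of each distinct blob. In PHP terms, assuming a hypothetical list of (count, length) pairs per distinct blob:

$bytesIfInline = 0;
$bytesIfHashed = 0;
// $distinctBlobs is hypothetical: one [count, length] pair per distinct md_deps value.
foreach ( $distinctBlobs as [ $n, $len ] ) {
    $bytesIfInline += $n * $len;      // blob repeated for every entry
    $bytesIfHashed += $len + 71 * $n; // blob stored once + a reference per entry
}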

We have 18 x 500MB Redis instances = 9000MB, or about 9MB/wiki.

Assuming sessions are moved and the worst-ish case of "all wikis like enwiki", it would be a little tight.

Declined, assuming that the DB main stash has enough GB of space.