
Support hash-based deduplication in KeyValueDependencyStore
Closed, Declined · Public

Description

KeyValueDependencyStore should be able to use a two-level key scheme (entity key => blob hash; blob hash key => blob) to massively de-duplicate the numerous identical entries.
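A minimal sketch of what that two-level scheme could look like on top of a BagOStuff, purely to illustrate the idea; the function names, key names, and value layout below are made up for illustration and are not the actual KeyValueDependencyStore API:

// Illustration only: entity keys hold a small hash reference, and the
// hash keys hold the actual dependency blob (shareable across wikis).
function storeDeps( BagOStuff $cache, string $entityKey, array $paths, int $ttl ) {
    $blob = json_encode( $paths );
    $hash = sha1( $blob );
    // Per-entity record: just "ref:<hash>" plus a timestamp (~74 bytes).
    $cache->set(
        $cache->makeKey( 'deps-entity', $entityKey ),
        [ 'paths' => "ref:$hash", 'asOf' => time() ],
        $ttl
    );
    // One copy of the blob per distinct dependency list.
    $cache->set( $cache->makeGlobalKey( 'deps-blob', $hash ), $blob, $ttl );
}

function fetchDeps( BagOStuff $cache, string $entityKey ): ?array {
    $entry = $cache->get( $cache->makeKey( 'deps-entity', $entityKey ) );
    if ( !is_array( $entry ) || strncmp( $entry['paths'] ?? '', 'ref:', 4 ) !== 0 ) {
        return null;
    }
    $blob = $cache->get( $cache->makeGlobalKey( 'deps-blob', substr( $entry['paths'], 4 ) ) );
    return is_string( $blob ) ? json_decode( $blob, true ) : null;
}

Using a global key for the blob side is what would allow the distinct blobs to be shared across wikis, per the IRC discussion below.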

From IRC:

<•AaronSchulz> we could have key => hash value => hash key => list of deps
14:06 <•Krinkle> (testwiki)> SELECT COUNT(*) FROM module_deps; |     1474 |
14:06 (testwiki)> SELECT COUNT(DISTINCT(md_deps)) FROM module_deps LIMIT 2; |216 |
14:07 commonswiki: 38932 vs 255
14:09 so the 38932 rows would reduce from a JSON array of 1..10 relative file paths, to a hash
14:09 and then 255 additional rows
14:10 s/rows/keys (assuming only for KeyValue)
14:11 well, fewer additional rows I suppose, since we can share that part across wikis
14:13 I think we currently average about ~ 120 bytes per row (incl key)
14:13 e.g.
14:13 | SpecialConstraintReportPage | vector|de | ["resources/lib/ooui/wikimedia-ui-base.less","resources/src/mediawiki.less/mediawiki.ui/variables.less"]

Event Timeline

Depending on whether and how quickly we move the main stash to SQL, we might be able to do this later instead.

It would also be worth doing a rough napkin calculation of how much space this would take up with vs. without the change, given that we'd still need the same number of keys and the same metadata field values, just a shorter value for one of the fields. And then for every unique dependency value we'd add a new key holding the larger blob.

I don't have a sense of how this plays out, but it doesn't seem impossible that it might end up larger, or smaller but still "too big".
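For a hedged sketch of that napkin math, plugging in the testwiki numbers from the IRC log above (1474 entries, 216 distinct blobs, ~120 bytes per entry today) and assuming each entry shrinks to the ~74-byte reference record measured further down in this task:

// Rough napkin estimate for testwiki; all inputs come from this task.
$entries = 1474;        // total module_deps rows
$distinct = 216;        // distinct md_deps blobs
$avgEntryBytes = 120;   // current average per entry, including the key
$refEntryBytes = 74;    // hash reference + asOf, as measured below

$inline = $entries * $avgEntryBytes;   // ≈ 177 KB today
// Each entry keeps a small reference; each distinct blob is stored once
// (approximated here at the same ~120 bytes, which overstates it slightly).
$hashed = $entries * $refEntryBytes + $distinct * $avgEntryBytes; // ≈ 135 KB

The mwscript runs below do the same kind of estimate per wiki against the real data.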

Krinkle triaged this task as Medium priority. Jun 2 2020, 8:15 PM
Krinkle removed a project: Epic.

Some more info:

aaron@mwmaint1002:~$ mwscript eval.php --wiki=testwiki
> echo strlen(json_encode(['paths'=>'ref:af18298daea159b6ca5283c0c1aa45e7155e4412', 'asOf' => time()]));
74
>
aaron@mwmaint1002:~$ mwscript sql.php --wiki=mediawikiwiki
> select sum(len) + 71 * sum(n) as bytes_if_hashed, sum(n*len) as bytes_if_inline from (select SHA1(md_deps) as deps_hash, count(*) as n, max(LENGTH(md_deps)) as len from module_deps group by deps_hash) tmp;
stdClass Object
(
    [bytes_if_hashed] => 717663
    [bytes_if_inline] => 2565027
)

>
aaron@mwmaint1002:~$ mwscript sql.php --wiki=wikidatawiki
> select sum(len) + 71 * sum(n) as bytes_if_hashed, sum(n*len) as bytes_if_inline from (select SHA1(md_deps) as deps_hash, count(*) as n, max(LENGTH(md_deps)) as len from module_deps group by deps_hash) tmp;
stdClass Object
(
    [bytes_if_hashed] => 6266199
    [bytes_if_inline] => 26180160
)

>
aaron@mwmaint1002:~$ mwscript sql.php --wiki=commonswiki
> select sum(len) + 71 * sum(n) as bytes_if_hashed, sum(n*len) as bytes_if_inline from (select SHA1(md_deps) as deps_hash, count(*) as n, max(LENGTH(md_deps)) as len from module_deps group by deps_hash) tmp;
stdClass Object
(
    [bytes_if_hashed] => 4333020
    [bytes_if_inline] => 18545998
)

aaron@mwmaint1002:~$ mwscript sql.php --wiki=svwiki
> select sum(len) + 71 * sum(n) as bytes_if_hashed, sum(n*len) as bytes_if_inline from (select SHA1(md_deps) as deps_hash, count(*) as n, max(LENGTH(md_deps)) as len from module_deps group by deps_hash) tmp;
stdClass Object
(
    [bytes_if_hashed] => 500686
    [bytes_if_inline] => 1649965
)

>
aaron@mwmaint1002:~$ mwscript sql.php --wiki=enwiki
> select sum(len) + 71 * sum(n) as bytes_if_hashed, sum(n*len) as bytes_if_inline from (select SHA1(md_deps) as deps_hash, count(*) as n, max(LENGTH(md_deps)) as len from module_deps group by deps_hash) tmp;
stdClass Object
(
    [bytes_if_hashed] => 9878934
    [bytes_if_inline] => 41584304
)
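To spell out what those queries estimate (a restatement of the SQL above, not anything the store itself would run): bytes_if_inline charges every entry the full blob, while bytes_if_hashed charges every entry a ~71-byte hash reference plus one shared copy of each distinct blob. In PHP terms, assuming a hypothetical list of (count, length) pairs per distinct blob:

$bytesIfInline = 0;
$bytesIfHashed = 0;
// $distinctBlobs is hypothetical: one [count, length] pair per distinct md_deps value.
foreach ( $distinctBlobs as [ $n, $len ] ) {
    $bytesIfInline += $n * $len;      // blob repeated for every entry
    $bytesIfHashed += $len + 71 * $n; // blob stored once + a reference per entry
}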

We have 18 x 500MB Redis instances = 9000MB, or about 9MB/wiki.

Assuming sessions are moved and the worst-ish case of "all wikis like enwiki", it would be a little tight.

Declined, assuming that the DB main stash has enough GB of space.