
How to update/delete ExternalStore entries?
Closed, Declined · Public

Description

The ExternalStore class (and the new ExternalStoreMedium class) has methods for storing new data, but apparently, there are no methods for updating or deleting records. This came up in T246539#5942051.

Perhaps it's not too common to delete or update records in ES, given its current usage, but I wonder if there are other reasons why these operations are currently impossible. If not, we should consider implementing them (and I can do that).
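
For reference, a minimal sketch of the write path that does exist (assuming a loaded MediaWiki environment, e.g. maintenance/shell.php; the blob contents and URL shape are illustrative):

// Sketch only: the existing API lets you insert a blob and read it back
// by URL, but offers no update or delete counterpart.
$blobContents = 'some serialized data';                  // made-up example value
$url = ExternalStore::insertToDefault( $blobContents );  // e.g. "DB://<cluster>/<id>"
$text = ExternalStore::fetchFromURL( $url );             // reads the blob back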


Event Timeline

I skimmed through the ExternalStoreMedium subclasses, and it seems possible to make them all CRUD-capable, the only exception being Http, which is read-only.

OT: I don't really like isReadOnly, as it seems to be a mild violation of the LSP. IMHO, there should be different interfaces for read-only and read+write stores.
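
For illustration, a minimal PHP sketch of that kind of split; these interface names are invented here and are not part of MediaWiki:

// Illustrative only: separate read-only from writable media instead of a
// single interface with an isReadOnly() flag. A read-only backend (like
// the Http medium above) would implement only the first interface, so it
// can never be handed to code that needs write access.
interface ReadableExternalStoreMedium {
	/** Fetch a blob from the given store-local location, or false if missing. */
	public function fetchFromURL( $url );
}

interface WritableExternalStoreMedium extends ReadableExternalStoreMedium {
	/** Insert a new blob and return its location within the store. */
	public function store( $location, $data );
}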

I don't think it's generally desired... Except when you recompress stuff maybe?

> I don't think it's generally desired... Except when you recompress stuff maybe?

Exactly my thought. The use case for the other task is switching from PHP serialization to JSON. I'll try writing a patch for U and D, then.

Change 577234 had a related patch set uploaded (by Daimona Eaytoy; owner: Daimona Eaytoy):
[mediawiki/core@master] externalstore: Add methods for deletion and update

https://gerrit.wikimedia.org/r/577234

Adding this capability to ExternalStore is a significant strategic decision in terms of its contract and what other parts of MW can expect from it.

Tagging Platform Engineering and @tstarling to advise on the given use case(s) and decide accordingly.

The ES clusters which are not currently being written to have read-only mode enabled in MariaDB, so you can't modify them in any way. They also don't have replication running. So ES is managed by the DBAs as an append-only cluster, a strategy which is enabled by the restricted API presented by MediaWiki. Presenting ES as an append-only store provides some flexibility for implementation options which would not be there if we allowed deletion and update.

I don't think it's worth changing the system just for this script. I suggest just writing new ES blobs and orphaning the old ES rows. Considering the ridiculous inefficiency of text storage in ES, this seems like a minor crime. The old recompression system dropped orphaned rows, so if we ever run that again, storage space would be freed at that time.
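
For concreteness, a rough sketch of that append-and-orphan pattern as it could look in the script under discussion; loadOldDump() and getPrimaryDb() are hypothetical placeholders, while ExternalStore::insertToDefault() is the existing insert entry point:

// Sketch of "append and orphan": write a brand-new blob and repoint the
// referencing row, leaving the old blob unreferenced rather than updating
// or deleting it. The two helpers are hypothetical stand-ins for the real
// script's plumbing.
function migrateVarDump( $aflId ) {
	$oldData = loadOldDump( $aflId );    // hypothetical: fetch and decode the old dump
	$newBlob = json_encode( $oldData );  // re-encode in the new format

	// Write to the currently writable ES cluster; returns a new blob URL.
	$newUrl = ExternalStore::insertToDefault( $newBlob );

	// Repoint the row at the new blob; the old one becomes an orphan.
	getPrimaryDb()->update(
		'abuse_filter_log',
		[ 'afl_var_dump' => $newUrl ],
		[ 'afl_id' => $aflId ],
		__METHOD__
	);
}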

> The ES clusters which are not currently being written to have read-only mode enabled in MariaDB, so you can't modify them in any way. They also don't have replication running. So ES is managed by the DBAs as an append-only cluster, a strategy which is enabled by the restricted API presented by MediaWiki. Presenting ES as an append-only store provides some flexibility for implementation options which would not be there if we allowed deletion and update.

> I don't think it's worth changing the system just for this script. I suggest just writing new ES blobs and orphaning the old ES rows. Considering the ridiculous inefficiency of text storage in ES, this seems like a minor crime. The old recompression system dropped orphaned rows, so if we ever run that again, storage space would be freed at that time.

I strongly agree with this.

> The ES clusters which are not currently being written to have read-only mode enabled in MariaDB, so you can't modify them in any way. They also don't have replication running. So ES is managed by the DBAs as an append-only cluster, a strategy which is enabled by the restricted API presented by MediaWiki. Presenting ES as an append-only store provides some flexibility for implementation options which would not be there if we allowed deletion and update.

Thanks for the explanation; I wasn't aware of this. Perhaps it should be documented somewhere (code and/or mw.org)?

> I don't think it's worth changing the system just for this script. I suggest just writing new ES blobs and orphaning the old ES rows. Considering the ridiculous inefficiency of text storage in ES, this seems like a minor crime. The old recompression system dropped orphaned rows, so if we ever run that again, storage space would be freed at that time.

OK, I'll just make it append new rows then. However, I have a question: if I make it print a list of orphaned records, would it be possible for DBAs to delete them? I might be naïve, but leaving tons of orphaned records is painful for me. We're talking about roughly 20 million records on enwiki, as can be seen with

SELECT COUNT(DISTINCT afl_var_dump) FROM abuse_filter_log

(quarry link).

> OK, I'll just make it append new rows then. However, I have a question: if I make it print a list of orphaned records, would it be possible for DBAs to delete them? I might be naïve, but leaving tons of orphaned records is painful for me. We're talking about roughly 20 million records on enwiki, as can be seen with

> SELECT COUNT(DISTINCT afl_var_dump) FROM abuse_filter_log

> (quarry link).

Based on

> The ES clusters which are not currently being written to have read-only mode enabled in MariaDB, so you can't modify them in any way. They also don't have replication running.

It's basically a no, unless the DBAs un-mark things readonly, delete on each replica... Or as Tim said, recompression might do that at a later date

OOI, do we have an idea of how much space it's actually using? Yes it's not great, but I imagine in comparison to other stuff...

> It's basically a no, unless the DBAs un-mark things readonly, delete on each replica...

This looks annoying, so I'll take it as 'no' unless there's some quick way to do that.

> Or as Tim said, recompression might do that at a later date

Better than nothing!

> OOI, do we have an idea of how much space it's actually using? Yes it's not great, but I imagine in comparison to other stuff...

Interesting question! Unfortunately, no. As I wrote on T246539, the plan was to measure the difference to gather this data *after* running the script. Estimating it now is complicated; however, you can use the following facts:

  • One million records (quarry source + quarry link in my previous comment) are not for edits; these are usually smaller because they do not contain wikitext

For records after 2013 (rEABF42bd0d84f4244ca2304c6d161f71d90a6a53030c):

  • Each record takes (on average) at least double the space of an ordinary revision, because it contains the wikitext of before and after the edit.
  • Records may contain additional variables; some are small (like usernames), others are big (like PSTed or parsed wikitext)

For records before 2013:

  • These records contain serialized classes, hence they're bigger than plain serialized arrays, especially because the list of serialized classes includes Revisions, Titles, WikiPages, Articles, ContextSources and whatnot. These classes stand in place of otherwise small variables (e.g. we may have a whole User object instead of a plain username).
  • These records still contain old and new wikitext, hence they're always bigger than their post-2013 counterparts
  • In the commit message of rEABF42bd0d84f4244ca2304c6d161f71d90a6a53030c, Hoo estimated that these records are 10 times bigger than the others.

Note that replacing records with the new serialization basically means saving space for old records (JSON-encoded arrays instead of serialized PHP classes).
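
As a standalone toy example of where that saving comes from (the class and values below are made up and far smaller than real dumps):

<?php
// Toy comparison: serialize() embeds class and property names in every
// record, while a JSON-encoded plain array does not. Real AbuseFilter
// dumps are much larger and include wikitext; this only shows the
// per-record serialization overhead.
class FakeUser {
	public $mName = 'Example user';
	public $mId = 12345;
	public $mTouched = '20200301000000';
}

$oldStyle = [ 'user' => new FakeUser(), 'action' => 'edit' ];
$newStyle = [ 'user_name' => 'Example user', 'user_id' => 12345, 'action' => 'edit' ];

printf( "serialize() of objects: %d bytes\n", strlen( serialize( $oldStyle ) ) );
printf( "json_encode() of array: %d bytes\n", strlen( json_encode( $newStyle ) ) );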


That said, I wanted to give a numerical answer, so I tried the following on my local wiki (which has 8182 AF-related text entries, uncompressed, in the post-2013 format).

CREATE TABLE text_experiment ( foo MEDIUMBLOB );
INSERT INTO text_experiment SELECT old_text FROM text JOIN abuse_filter_log ON ( old_id = REPLACE( CONVERT( afl_var_dump USING utf8 ), 'tt:', '' ) );
SELECT ROUND( ( data_length / 1024 / 1024 ), 2 ) FROM information_schema.TABLES WHERE table_schema = 'my_wiki' AND table_name = 'text_experiment';

and got a total of 177 MB.
Now, assume the following:

  • Per above, let's just consider roughly 20 million rows for edits on enwp;
  • In 2009-2013, suppose that 1/5 of the total rows were created (=> roughly 4 million rows ); the remaining 16m rows use the new format;
  • Suppose that gzip can reduce the file size to 20% of the original [do we have precise numbers?]
  • Suppose that the data on my local wiki is similar to real data, that it's distributed uniformly, etc.
  • Suppose that old records take 10 times the space of new records

We get:

  • Average size of a new record (compressed): 177 MB / 8200 / 5 ≈ 4.32 kB [this seems quite low]
  • Size of all new records on enwiki: ( 177 MB / 8200 / 5 ) × 16 × 10^6 ≈ 69 GB
  • Size of all old records on enwiki: ( 177 MB / 8200 / 5 ) × 10 × 4 × 10^6 ≈ 173 GB
  • Total size on enwiki: 173 + 69 ≈ 242 GB

This is very imprecise: the compression ratio is little more than a guess, and several of the assumptions may be wrong. Still, 242 GB is not too bad, and certainly not as bad as I expected.
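
For anyone who wants to tweak the assumptions, the arithmetic boils down to this standalone PHP snippet (all inputs are the rough guesses above, in decimal units):

<?php
// Reproduces the back-of-the-envelope estimate above. All inputs are the
// rough assumptions stated in the comment, not measured values.
$sampleMb      = 177;    // uncompressed size of the local text_experiment table
$sampleRows    = 8200;   // local AF-related text rows (rounded)
$compression   = 0.2;    // assume gzip keeps ~20% of the original size
$newFormatRows = 16e6;   // enwiki rows in the post-2013 format
$oldFormatRows = 4e6;    // enwiki rows in the pre-2013 format
$oldFactor     = 10;     // old records assumed ~10x larger than new ones

$avgNewKb = $sampleMb * 1000 / $sampleRows * $compression;    // ≈ 4.32 kB
$newGb    = $avgNewKb * $newFormatRows / 1e6;                 // ≈ 69 GB
$oldGb    = $avgNewKb * $oldFactor * $oldFormatRows / 1e6;    // ≈ 173 GB

printf( "average new record: %.2f kB\n", $avgNewKb );
printf( "all new records:    %.0f GB\n", $newGb );
printf( "all old records:    %.0f GB\n", $oldGb );
printf( "total:              %.0f GB\n", $newGb + $oldGb );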

Change 577234 abandoned by Daimona Eaytoy:
externalstore: Add methods for deletion and update

Reason:
T246938#5955203

https://gerrit.wikimedia.org/r/577234

Writing docs is T247383, fixing the script is T246539. No more actionables here.