
Compress data at external storage
Stalled · Medium · Public

Description

External storage can be compressed by roughly a factor of 20 by running the maintenance scripts trackBlobs.php and recompressTracked.php. There has recently been a space shortage on ES hosts (T105843); that will be fixed in the short term with hardware, but investigation on the software/operations side should follow.

Two important issues, however, must be fixed first:

  • Record Flow references to ExternalStore in text T106363
  • The script has gone untested for years and should be checked before being run (T106388, plus additional general testing)

References:

See T119056: External Storage on codfw (es2005-2010) is consuming 100-90GB of disk space per server and per month and it has 370GB available.

Related Objects

Status     Assigned
Stalled    None
Resolved   jcrespo
Stalled    None
Declined   None
Resolved   Mattflaschen-WMF
Resolved   Mattflaschen-WMF
Resolved   Mattflaschen-WMF
Resolved   Mattflaschen-WMF
Resolved   Mattflaschen-WMF
Resolved   Mattflaschen-WMF
Resolved   matthiasmullie
Declined   None
Declined   Tgr
Resolved   jcrespo
Resolved   Daimona
Resolved   Urbanecm
Declined   Daimona
Resolved   Daimona

Event Timeline

jcrespo raised the priority of this task to Needs Triage.
jcrespo updated the task description. (Show Details)
jcrespo subscribed.

Can the current External Store be backed up before being decommissioned?

@Mattflaschen once new servers are available, data will be transferred without losing records or availability. The data will not be backed up in the traditional sense, but it will be available on different servers in multiple datacenters, and those servers can be managed separately.

@jcrespo: Flow doesn't currently record its entries in text: they're stored separately (extension1, flow_revision.rev_content) to be easily accessible cross-wiki. We'd rather not store or duplicate them in tables that are specific to a single wiki.
So instead of trying to duplicate Flow's references to DB://cluster24/... & DB://cluster25/... into text, I suggest we set up a new ExternalStore cluster specific to Flow. We're building a script that will loop over all existing Flow entries in cluster24 & cluster25 and recreate them elsewhere. Then Flow no longer blocks anything on cluster24 & cluster25, and trackBlobs.php and recompressTracked.php can be run.
Is that a sane idea?
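A rough sketch of the kind of copy loop described above, assuming the 2015-era static ExternalStore helpers (ExternalStore::fetchFromURL() and ExternalStore::insertToDefault()). How the Flow (extension1) database handles are obtained, the batching, and the flags/length bookkeeping a real migration needs are left out or assumed here, so this is an illustration rather than the actual script:

// Sketch only (PHP, run inside MediaWiki's maintenance environment).
// $dbr / $dbw stand in for read and write handles to the Flow (extension1)
// database; obtaining them goes through Flow's own DB factory and is omitted.
$res = $dbr->select(
    'flow_revision',
    [ 'rev_id', 'rev_content' ],
    'rev_content ' . $dbr->buildLike( 'DB://cluster24/', $dbr->anyString() ) .
        ' OR rev_content ' . $dbr->buildLike( 'DB://cluster25/', $dbr->anyString() ),
    __METHOD__,
    [ 'LIMIT' => 1000 ]    // batch size is an arbitrary example
);

foreach ( $res as $row ) {
    // Pull the blob from the old cluster and re-insert it into whatever
    // cluster(s) $wgDefaultExternalStore currently points at.
    $text = ExternalStore::fetchFromURL( $row->rev_content );
    if ( $text === false ) {
        continue;    // a real run would log and investigate this
    }
    $newUrl = ExternalStore::insertToDefault( $text );

    $dbw->update(
        'flow_revision',
        [ 'rev_content' => $newUrl ],
        [ 'rev_id' => $row->rev_id ],
        __METHOD__
    );
}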

Eventually, Flow's data might also need to be re-compressed the way recompressTracked.php currently does it, and if we don't move our data to text, we can't use those scripts for Flow's ExternalStore entries.
I think it's better to then recreate recompressTracked.php's functionality specifically for Flow, rather than trying to re-use the text table & recompressTracked.php: the most complex parts of that script are not so much the compression (which is mostly isolated in ConcatenatedGzipHistoryBlob and easily reusable) but iterating over and updating text and grouping per page.page_id, all of which are pretty much irrelevant to Flow, and which we would already have an alternative for with the previous script that copies Flow entries to a new ES cluster.

Is that a sane idea?
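For what it's worth, the compression primitive itself is small. A minimal sketch of how ConcatenatedGzipHistoryBlob is used (the class the comment above points at, and the one recompressTracked.php also relies on), assuming an array of revision texts keyed by arbitrary IDs; how the returned item hashes would be recorded against Flow revisions is left open:

// Sketch only: pack a batch of revision texts into one compressed blob and
// store it in ExternalStore. The input array and IDs are made up for illustration.
$texts = [
    101 => 'first revision text ...',
    102 => 'second revision text ...',
];

$blob = new ConcatenatedGzipHistoryBlob();
$hashes = [];
foreach ( $texts as $revId => $text ) {
    // addItem() returns the hash later passed to getItem() to read the text back.
    $hashes[$revId] = $blob->addItem( $text );
}

// serialize() is what triggers the gzip compression; the serialized object is
// what actually gets written to the external cluster.
$url = ExternalStore::insertToDefault( serialize( $blob ) );

// In core's text table a revision stored this way is referenced with a URL of
// the form DB://cluster/id/<itemhash>; a Flow-specific equivalent would need
// its own way to record the ($url, $hash) pair per revision.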

@matthiasmullie I do not have enough architecture and Flow knowledge to agree or disagree with your suggestion (I only created this task because it was suggested by Tim). I can only own the original issue (lack of space) and have a support role on the compression. Please, other subscribers, comment on this (here, or probably better on the more specific T106388).

The only thing I can comment on is the operational point of view. A new cluster requires new resources (which requires budget, and we are limited by that). Please agree first on whether that is sane from the application point of view (or whether other options are a better idea), and then submit a more detailed request describing the consequences for connections, shards, storage changes, latency requirements, etc., so I can better advise on the physical requirements for high availability, performance and reliability. You can find me on IRC as jynus. Please note that with higher-capacity servers it could be possible to consolidate several services on the same hardware.

My only request: please reduce the scope of short-term architectural changes to the minimum possible, as we are on a timer (less than 3 months to failure); further improvements can be done after the immediate space problem is solved. In particular, new hardware deployment will not be blocked by this ticket.

I think trackBlobs would currently fail the initial integrity check (at least on dewiki, and probably on other wikis that are old enough) due to the presence of HistoryBlobStub objects in the text table. More details on T108495; see the sketch after the query output below.

mysql:wikiadmin@db2038.codfw.wmnet [dewiki]> select count(*) from text where old_flags LIKE '%object%' AND old_flags NOT LIKE '%external%' AND LOWER(CONVERT(LEFT(old_text,22) USING latin1)) = 'o:15:"historyblobstub"';
+----------+
| count(*) |
+----------+
|     3449 |
+----------+
1 row in set (1 min 46.46 sec)
mysql:wikiadmin@db2034.codfw.wmnet [enwiki]> select count(*) from text where old_flags LIKE '%object%' AND old_flags NOT LIKE '%external%' AND LOWER(CONVERT(LEFT(old_text,22) USING latin1)) = 'o:15:"historyblobstub"';
+----------+
| count(*) |
+----------+
|     2602 |
+----------+
1 row in set (9 min 56.51 sec)
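To make the failure mode concrete: the rows counted above contain a serialized PHP object whose text lives in another text row, not an ExternalStore URL, so there is no DB:// reference for trackBlobs.php to track. A hedged sketch of what loading such a row looks like (the row id and query shape are illustrative):

// Sketch only: what the rows matched by the queries above hold.
// old_flags says "object" (and not "external"), and old_text is a serialized
// HistoryBlobStub instead of a DB:// URL.
$dbr = wfGetDB( DB_SLAVE );
$row = $dbr->selectRow(
    'text',
    [ 'old_id', 'old_text', 'old_flags' ],
    [ 'old_id' => 12345 ],    // hypothetical row id
    __METHOD__
);

$obj = unserialize( $row->old_text );
if ( $obj instanceof HistoryBlobStub ) {
    // The actual revision text sits inside a ConcatenatedGzipHistoryBlob held
    // by a *different* text row that the stub points at, so resolving it goes
    // through the stub rather than through ExternalStore. That is why these
    // rows trip trackBlobs.php's initial integrity check (see T108495).
}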

The new servers are about to arrive. There are 2 options for the immediate migration, before compression:

  • Stop writing to es2 (blobs_cluster24) and es3 (blobs_cluster25), put those clusters in read-only mode (the old cluster configuration) and start writing to blobs_cluster26 and blobs_cluster27 on 2 new shards; a configuration sketch follows this list. Eventually migrate es[123] to a new read-only set of servers. This may be easier and faster to do, but the new servers would be under-utilized for a long time.
  • Clone es2 and es3 "as is" onto the new servers, with larger space and capacity. Maybe eventually create new clusters, but on the same physical hosts, before compression. This requires a data copy, but that is not a huge issue, and utilization will be good from the beginning.
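A sketch of what the first option roughly looks like on the MediaWiki side, using core's plain ExternalStore settings rather than the production LBFactory configuration (cluster names, hosts and credentials are placeholders). Old clusters stay readable because existing DB://cluster24/... URLs still resolve; they simply stop receiving new writes once they are removed from $wgDefaultExternalStore:

// Sketch only (core-style settings; WMF production expresses the same idea
// through its LBFactory configuration instead).
$wgExternalStores = [ 'DB' ];

// New writes go only to the new shards; cluster24/25 become de facto read-only
// for ExternalStore because they are no longer write targets.
$wgDefaultExternalStore = [ 'DB://cluster26', 'DB://cluster27' ];

// Every cluster that still has to be read keeps a server definition
// ($wgDBservers format). Hosts and credentials below are placeholders.
$wgExternalServers = [
    'cluster24' => [ [ 'host' => 'es-old-1.example', 'user' => 'wikiuser', 'password' => 'secret', 'type' => 'mysql', 'load' => 1 ] ],
    'cluster25' => [ [ 'host' => 'es-old-2.example', 'user' => 'wikiuser', 'password' => 'secret', 'type' => 'mysql', 'load' => 1 ] ],
    'cluster26' => [ [ 'host' => 'es-new-1.example', 'user' => 'wikiuser', 'password' => 'secret', 'type' => 'mysql', 'load' => 1 ] ],
    'cluster27' => [ [ 'host' => 'es-new-2.example', 'user' => 'wikiuser', 'password' => 'secret', 'type' => 'mysql', 'load' => 1 ] ],
];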

I would prefer the second option, assuming things can be reorganized (old clusters moved around) afterwards with no issue (after all, compression should be exactly that: a reorganization of the data). Does anyone have any comments?

It sounds like either will work for Flow.

If there's anything we need to do (e.g. during the time window when switching to the clone), let us know.

We will have to go through a process for Flow later, before the compression.

I am soon going to resolve T105843#1614596. Please be prepared to resume this task soon!

Please keep me in the loop, so that I can help, and provide a rollback plan in case something goes wrong.

Data migration done. I would recommend starting testing the compression on a slave on codfw to avoid production impact.

jcrespo raised the priority of this task from Medium to High. Sep 27 2015, 3:22 PM

We should start by creating a new cluster (the enwiki table has surpassed 1 TB in size). Inserting into a separate table will be faster.

I can own this, but I need support from someone with good MediaWiki database code knowledge to +1 my configuration patches.

Raising the priority because codfw will fail in 1-2 months unless we do this.

This was mentioned in yesterday's SoS: @jcrespo: I'd like to make sure you get the help you need, but I'm not sure who to task with that. Do you have anyone in mind?

@greg No, I do not know who would be the best fit. This is a "mediawiki-core" change. Who maintains the MySQL ORM?

If the answer to that is "nobody", then someone with some MediaWiki/PHP experience. *I will be doing the actual maintenance*, but I lack a lot of experience with the MediaWiki codebase. This will be a stab at its internal guts, and I need a second pair of eyes before I bring the whole editing workflow down.

There isn't really anyone specifically. Probably a handful of people that could help.

A big question would be whether you want someone in a European-based TZ.

Europe is ok, but not a hard blocker.

Tickets to read before starting this: T22757.

fgiunchedi lowered the priority of this task from High to Medium. Dec 1 2016, 8:18 PM
fgiunchedi subscribed.

Lowering to 'normal'; it looks like we aren't under high pressure for ES compression right now.

Anomie changed the task status from Open to Stalled. Feb 14 2020, 10:56 PM
Anomie subscribed.

This can't be done until it is certain that Flow's data won't be deleted when the recompression is run. But the tasks for moving Flow's data out of the MW ExternalStore have now been declined...