
Migrate Flow content to new separate logical External Store in production
Closed, Declined (Public)

Description

ExternalStore is nearly full and ops will either buy more storage and/or compress the data.
According to Tim, the script that compresses ES data omits entries that are missing from the text table.

I believe we are not currently storing references in text. Let's make sure we do that soon enough. <- outdated, might go with another approach

We plan to solve this by setting up a new External Store (one that will only be used by Flow) then migrating Flow to use that (details at T107610: Setup separate logical External Store for Flow in production). That will then free up the non-Flow one to use the normal compression procedure.

Use:

foreachwikiindblist flow.dblist extensions/Flow/maintenance/FlowExternalStoreMoveCluster.php
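For reference, a minimal configuration sketch of what a dedicated logical External Store for Flow could look like. This is an illustration only, not the actual wmf-config: the cluster names and the server address are placeholders.

// Existing clusters keep serving text-table-tracked content (placeholder names).
$wgDefaultExternalStore = [ 'DB://cluster24', 'DB://cluster25' ];

// Hypothetical new logical cluster used only by Flow.
$wgFlowExternalStore = [ 'DB://flow_cluster' ];

// With LBFactoryMulti, the new logical cluster would also need load/credential
// entries so ExternalStoreDB can reach it (placeholder host below).
$wgLBFactoryConf['externalLoads']['flow_cluster'] = [
	'10.64.0.1' => 1,
];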

Related Objects

Status    Assigned
Stalled   None
Stalled   None
Declined  None
Resolved  Mattflaschen-WMF
Resolved  Mattflaschen-WMF
Resolved  Mattflaschen-WMF
Resolved  Mattflaschen-WMF
Resolved  Mattflaschen-WMF
Resolved  Mattflaschen-WMF
Resolved  matthiasmullie
Declined  None
Declined  Tgr
Resolved  jcrespo
Resolved  Daimona
Resolved  Urbanecm
Declined  Daimona
Resolved  Daimona

Event Timeline

There are a very large number of changes, so older changes are hidden.
Mattflaschen-WMF lowered the priority of this task from Unbreak Now! to High. (Aug 5 2015, 11:12 PM)
Mattflaschen-WMF renamed this task from "Don't block trackBlobs.php and recompressTracked.php" to "Migrate Flow content to new separate logical External Store". (Aug 6 2015, 11:42 PM)
Mattflaschen-WMF updated the task description.
Mattflaschen-WMF added a subscriber: tstarling.

T105843#1654271 indicates the new hardware phase is almost done, so adding this back to Current.

Change 226544 merged by jenkins-bot:
Move Flow ExternalStore entries to separate cluster

https://gerrit.wikimedia.org/r/226544

Catrope subscribed.

It hasn't actually been migrated yet.

Thanks to @Volans, new codfw external storage servers were set up. Would the old servers be helpful in any way for this task? (They would have similar specs and contents to the real servers, but they are not in production right now. Otherwise, they may be erased, etc.: T129452)

Yes, they could potentially be used for the dedicated Flow External Store.

@matthiasmullie No, those can be around for testing for a bit longer, but they are old and out of warranty (which means they cannot be used for production, only for testing). I was asking if they could be used as test hosts for the script (before being fully decommissioned).

You really want to use the new servers for the final service (10 times faster), and we have 20 TB free on those (and 0 on these).

Okay. I forget, are we doing a test run in production, or just Beta? If we're doing one on prod, we could use it.

(BTW, you @-ed Matthias, when I think you meant to reply to me; we both worked on it though).

Sorry about that, I just pressed tab.

are we doing a test run in production

Well, considering we now have a reasonable way to test it closer to production, I was just providing more options in case they were needed (if only to see how long it would take on non-trivial datasets like beta). But it was just a suggestion/offer (which we didn't have before).

Mattflaschen-WMF renamed this task from "Migrate Flow content to new separate logical External Store" to "Migrate Flow content to new separate logical External Store in production". (Apr 12 2016, 9:12 PM)

This ticket still needs to happen. However, I am wondering whether we should refactor the external storage servers in a different way than regular compression (parent ticket): making the external storage transparent to the application (a pure key-value store) and using a similar strategy, but implemented with mainstream open-source software such as RocksDB or similar. I will ask around.

Removing @Mattflaschen-WMF as task assignee to avoid cookie-licking.
(Matt, if you still like/plan to work on this, feel very welcome to re-claim via your personal Phab account - thanks!)

daniel subscribed.

Pinging TechCom for a quick check-in on this.

kostajh subscribed.

@daniel & TechCom most of the setup work is happening on T107610, fyi.

I discussed this a little bit with @kostajh and @Tgr. Summarizing my comments here.

Back in 2015, SRE/CPT wanted to recompress the ExternalStore data to make more space. The script that performs the recompression assumes that all ES URLs pointing to blobs are in the text table. It's able to find "orphaned" blobs that aren't pointed to from the text table and preserve them, but the recompression process changes the URL of each blob. The URLs are updated in the text table, but for orphaned blobs there is no text table row to update. MW core's revision storage uses the text table (although T183490 proposes to change that), as does AbuseFilter. The only thing that stores content in ES but doesn't use the text table is Flow, which instead puts ES URLs directly in the flow_revision table (kind of like what T183490 proposes, except without a separate content table). That means that running the recompression script as-is would cause us to lose all the Flow data in ES (i.e. the content of all Flow posts).
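To make the difference concrete, here is a rough illustration of the two write paths described above. It is not actual core or Flow code: the ExternalStore calls are the real MediaWiki helpers, but the row values are invented for the example and the flow_revision column names are given from memory.

// Illustrative only; assumes a MediaWiki maintenance/request context.
$content   = 'Some post text'; // placeholder blob
$flowRevId = 12345;            // placeholder Flow revision id
$dbw = wfGetDB( DB_MASTER );

// Both paths store the blob in External Store and get back a URL
// such as "DB://cluster24/67890".
$url = ExternalStore::insertToDefault( $content );

// Core (and AbuseFilter): the URL is tracked in the text table, so the
// recompression script can rewrite it when the blob is moved.
$dbw->insert( 'text', [
	'old_text'  => $url,
	'old_flags' => 'external,utf-8',
], __METHOD__ );

// Flow: the URL goes directly into flow_revision, with no text row, so the
// recompression script has nothing to update when the blob's URL changes.
$dbw->update( 'flow_revision',
	[ 'rev_content' => $url, 'rev_flags' => 'external,utf-8' ],
	[ 'rev_id' => $flowRevId ],
	__METHOD__
);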

The initial proposal was to have Flow add rows to the text table when it inserts things into ES, and backfill the existing ES pointers into the text table. This was rejected because it would mean moving the source of truth for these pointers, and using a per-wiki table while everything else in Flow uses global tables (see T106386#1487961).

It was then proposed to move all Flow entries to a separate ES cluster, so that the original ES cluster only contains text-table-tracked blobs and can be safely recompressed. This is what's currently planned. It's already been done in beta labs, but hasn't been done in production yet, mostly because this doesn't seem to be a priority for anyone. (The Growth team hasn't proactively worked on it for a while, and SRE/CPT haven't asked us to.)
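Per row, the migration amounts to roughly the following. This is only an outline of the idea, not the actual FlowExternalStoreMoveCluster.php implementation: batching, flag handling, and the per-wiki loop are omitted, and "DB://flow_cluster" is a placeholder target.

$dbw = wfGetDB( DB_MASTER );

// A real migration would also filter on rev_flags to pick only rows whose
// content is stored externally; omitted here for brevity.
$res = $dbw->select( 'flow_revision', [ 'rev_id', 'rev_content' ], [], __METHOD__ );
foreach ( $res as $row ) {
	// Fetch the blob from the old cluster via its current URL.
	$blob = ExternalStore::fetchFromURL( $row->rev_content );
	if ( $blob === false ) {
		continue;
	}

	// Re-insert it on the new, Flow-only cluster (placeholder cluster name)...
	$newUrl = ExternalStore::insert( 'DB://flow_cluster', $blob );

	// ...and repoint the Flow revision at the new URL.
	$dbw->update( 'flow_revision',
		[ 'rev_content' => $newUrl ],
		[ 'rev_id' => $row->rev_id ],
		__METHOD__
	);
}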

An alternative approach would be to add a hook to the recompression script notifying Flow of changes in orphaned blob URLs and allowing it to update them itself, but that could be more work than performing the separate store migration that we already have code for.

Also it would mean having a different setup on beta and production, unless we undo the beta migration somehow.
And hooks seem like a fragile mechanism for something that would cause content loss on failure.
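Purely to make the hook idea concrete, a sketch of what such a mechanism might look like. Neither the hook name nor the handler exists today; both are hypothetical, and the flow_revision update mirrors the pointer layout described above.

// Hypothetical values for illustration.
$oldUrl = 'DB://cluster24/123';
$newUrl = 'DB://cluster24-recompressed/456';

// In the recompression script, after relocating an orphaned blob
// (hypothetical hook name):
Hooks::run( 'RecompressTrackedOrphanMoved', [ $oldUrl, $newUrl ] );

// In Flow, a hypothetical handler would rewrite its own pointer:
class FlowOrphanMoveHandler {
	public static function onRecompressTrackedOrphanMoved( $oldUrl, $newUrl ) {
		$dbw = wfGetDB( DB_MASTER );
		$dbw->update( 'flow_revision',
			[ 'rev_content' => $newUrl ],
			[ 'rev_content' => $oldUrl ],
			__METHOD__
		);
	}
}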

It's already been done in beta labs, but hasn't been done in production yet, mostly because this doesn't seem to be a priority for anyone. (The Growth team hasn't proactively worked on it for a while, and SRE/CPT haven't asked us to.)

@jcrespo / @daniel, do you have any feedback on the priority of this task?

@jcrespo Priority or availability to work on it (they are not the same)? CC @Marostegui

I mean, this task only exists because T106386: Compress data at external storage exists. Is that something intended to happen soon? Or is it something that's a good idea in theory but no one really cares about it ATM? How urgent is it to fix Flow being a blocker?

Untagging TechCom, since this has been decoupled from the text table and content table.

Per T107610#5878347. Feel free to reopen if there's a clear need and timeframe for this.