
Migrate Flow content to new separate logical External Store in production
Open, High, Public

Description

ExternalStore is nearly full, and ops will buy more storage, compress the data, or both.
According to Tim, the script that compresses ES data omits entries that are missing from the text table.

I believe we are not currently storing references in text. Let's make sure we do that soon enough. <- outdated, might go with another approach

We plan to solve this by setting up a new External Store (one that will only be used by Flow) then migrating Flow to use that (details at T107610: Setup separate logical External Store for Flow in production). That will then free up the non-Flow one to use the normal compression procedure.

Use:

foreachwikiindblist flow.dblist extensions/Flow/maintenance/FlowExternalStoreMoveCluster.php

Event Timeline

matthiasmullie renamed this task from Record Flow references to ExternalStore in `text` to Don't block trackBlobs.php and recompressTracked.php. Jul 28 2015, 12:25 PM
matthiasmullie updated the task description.

Suggestion: move Flow's ExternalStore entries on cluster24 & cluster25 elsewhere.
We'll need ops to weigh in here and set up that cluster. This was suggested here: https://phabricator.wikimedia.org/T106386#1487961
If agreed, we can use https://gerrit.wikimedia.org/r/#/c/226544/ to complete the move.

After that, we'll probably need to make a 2nd script to enable the same kind of recompression recompressTracked.php currently provides, but for Flow entries. This is less urgent.

Mattflaschen-WMF lowered the priority of this task from Unbreak Now! to High. Aug 5 2015, 11:12 PM
Mattflaschen-WMF renamed this task from Don't block trackBlobs.php and recompressTracked.php to Migrate Flow content to new separate logical External Store. Aug 6 2015, 11:42 PM
Mattflaschen-WMF updated the task description.
Mattflaschen-WMF added a subscriber: tstarling.

T105843#1654271 indicates the new hardware phase is almost done, so adding this back to Current.

Change 226544 merged by jenkins-bot:
Move Flow ExternalStore entries to separate cluster

https://gerrit.wikimedia.org/r/226544

Catrope closed this task as Resolved. Dec 10 2015, 4:33 AM
Catrope added a subscriber: Catrope.
Mattflaschen-WMF reopened this task as Open. Dec 16 2015, 9:53 PM

It hasn't actually been migrated yet.

Thanks to @Volans, new codfw external storage servers were set up. Would the old servers be helpful in any way for this task? They have similar specs and contents to the real servers, but they are not in production right now. Otherwise, they may be erased, etc.: T129452

Yes, they could potentially be used for the dedicated Flow External Store.

jcrespo added a comment. Edited Mar 10 2016, 5:17 PM

@matthiasmullie No, those can be around for testing for a bit longer, but they are old and out of warranty (which means they cannot be used for production, only for testing). I was asking if they could be used as test hosts for the script (before being fully decommissioned).

You really want to use the new servers for the final service (10x faster), and we have 20 TB free on those (and 0 on these).

Okay. I forget, are we doing a test run in production, or just Beta? If we're doing one on prod, we could use it.

(BTW, you @-ed Matthias, when I think you meant to reply to me; we both worked on it though).

Sorry about that, I just pressed tab.

are we doing a test run in production

Well, considering we now have a reasonable way to test it closer to production, I was just providing more options in case they were needed (if only to see how much it would take on non-trivial datasets like beta). But it was just a suggestion/offering (which we didn't have before).

Mattflaschen-WMF renamed this task from Migrate Flow content to new separate logical External Store to Migrate Flow content to new separate logical External Store in production. Apr 12 2016, 9:12 PM
Restricted Application added a subscriber: TerraCodes. Apr 19 2016, 5:40 PM
Restricted Application added a project: Collaboration-Team-Triage. Oct 4 2016, 10:54 PM

This ticket still needs to happen. However, I am wondering whether we should refactor the external storage servers in some way other than regular compression (parent ticket): make external storage transparent to the application (a pure key-value store) and use a similar strategy, but implemented by mainstream open-source software such as RocksDB or similar. I will ask around.

Aklapper removed Mattflaschen-WMF as the assignee of this task. Apr 18 2018, 1:51 PM

Removing @Mattflaschen-WMF as task assignee to avoid cookie-licking.
(Matt, if you still like/plan to work on this, feel very welcome to re-claim via your personal Phab account - thanks!)

daniel added a subscriber: daniel.

Pinging TechCom for a quick check-in on this.

Restricted Application added a project: Growth-Team. Tue, Aug 27, 6:16 PM
kostajh moved this task from Inbox to Revisit on the Growth-Team board. Wed, Aug 28, 7:27 AM
kostajh added a subscriber: kostajh.

@daniel & TechCom most of the setup work is happening on T107610, fyi.

Catrope added a subscriber: Tgr. Wed, Aug 28, 10:24 AM

I discussed this a little bit with @kostajh and @Tgr. Summarizing my comments here.

Back in 2015, SRE/CPT wanted to recompress the ExternalStore data to make more space. The script that performs the recompression assumes that all ES URLs pointing to blobs are in the text table. It's able to find "orphaned" blobs that aren't pointed to from the text table and preserve them, but the recompression process changes the URL of each blob. The URLs are updated in the text table, but for orphaned blobs there is no text table row to update. MW core's revision storage uses the text table (although T183490 proposes to change that), as does AbuseFilter. The only thing that stores content in ES but doesn't use the text table is Flow, which instead puts ES URLs directly in the flow_revision table (kind of like what T183490 proposes, except without a separate content table). That means that running the recompression script as-is would cause us to lose all the Flow data in ES (i.e. the content of all Flow posts).
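
To make the failure mode concrete, here is a minimal toy model of it. It is only a sketch: plain dicts stand in for the ES cluster, the text table and flow_revision, and the `DB://cluster/id` URL shape, the target name `cluster26` and the function name `recompress` are assumptions made for illustration, not the behaviour of the real recompressTracked.php.

```python
# Toy model only: dicts stand in for the ES cluster and the two pointer tables.
# URL shapes, cluster names and function names are illustrative assumptions.

def recompress(blobs, text_table):
    """Rewrite every blob under a new URL and fix up text-table pointers only."""
    new_blobs = {}
    mapping = {}
    for n, (old_url, data) in enumerate(sorted(blobs.items()), start=1):
        new_url = f"DB://cluster26/{n}"  # hypothetical target cluster
        new_blobs[new_url] = data        # the blob itself is preserved
        mapping[old_url] = new_url
    # Only rows in the text table get their URL rewritten.
    new_text = {text_id: mapping[url] for text_id, url in text_table.items()}
    return new_blobs, new_text


blobs = {
    "DB://cluster24/1": "wikitext of a core revision",
    "DB://cluster24/2": "content of a Flow post",
}
text_table = {101: "DB://cluster24/1"}         # tracked by core
flow_revision = {"rev-a": "DB://cluster24/2"}  # tracked only by Flow

blobs, text_table = recompress(blobs, text_table)

# The Flow blob still exists, but nothing tells Flow its new URL.
dangling = [rev for rev, url in flow_revision.items() if url not in blobs]
print(dangling)
```

In this model the core pointer follows its blob, while the Flow pointer is left referring to a URL that no longer resolves, which is the data loss described above.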

The initial proposal was to have Flow add rows to the text table when it inserts things into ES, and backfill the existing ES pointers into the text table. This was rejected because it would mean moving the source of truth for these pointers, and using a per-wiki table while everything else in Flow uses global tables (see T106386#1487961).

It was then proposed to move all Flow entries to a separate ES cluster, so that the original ES cluster only contains text-table-tracked blobs and can be safely recompressed. This is what's currently planned. It's already been done in beta labs, but hasn't been done in production yet, mostly because this doesn't seem to be a priority for anyone. (The Growth team hasn't proactively worked on it for a while, and SRE/CPT haven't asked us to.)
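
As a rough illustration of what such a move involves, the sketch below copies each Flow blob to a new store and then repoints the flow_revision row. It is a toy model under stated assumptions: the real work is done by FlowExternalStoreMoveCluster.php, whose internals are not shown in this task, and the function name, the `DB://flow_cluster1/` prefix and the dict-based stores are all hypothetical.

```python
# Toy model only; every name here is hypothetical and chosen for illustration.

def move_flow_blobs(flow_revision, old_cluster, new_cluster,
                    prefix="DB://flow_cluster1/"):
    """Copy each Flow blob to the new cluster, then repoint its row."""
    for rev_id, url in list(flow_revision.items()):
        if url not in old_cluster:
            continue  # already moved, or never stored on the old cluster
        new_url = f"{prefix}{len(new_cluster) + 1}"
        new_cluster[new_url] = old_cluster[url]  # copy the content first
        flow_revision[rev_id] = new_url          # then update the pointer
    return flow_revision


old_cluster = {
    "DB://cluster24/1": "wikitext of a core revision",
    "DB://cluster24/2": "content of a Flow post",
}
new_cluster = {}
flow_revision = {"rev-a": "DB://cluster24/2"}

move_flow_blobs(flow_revision, old_cluster, new_cluster)
first_pass = dict(flow_revision)

# Running it again changes nothing, because the pointer no longer
# refers to the old cluster.
move_flow_blobs(flow_revision, old_cluster, new_cluster)
```

Copying before repointing means an interrupted run leaves every pointer resolvable, and once no Flow pointer refers to the old cluster, that cluster holds only text-table-tracked content that Flow depends on.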

An alternative approach would be to add a hook to the recompression script notifying Flow of changes in orphaned blob URLs and allowing it to update them itself, but that could be more work than performing the separate store migration that we already have code for.
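
For comparison, the hook idea could look roughly like the sketch below. No such hook exists in the scripts discussed here; the registry, the decorator name `on_orphan_moved` and the URLs are all hypothetical and only show the shape of the mechanism.

```python
# Hypothetical callback mechanism, for illustration only.
listeners = []


def on_orphan_moved(callback):
    """Register a callback to be told when an orphaned blob changes URL."""
    listeners.append(callback)
    return callback


def notify_orphan_moved(old_url, new_url):
    """What a recompression script would call after rewriting an orphan."""
    for callback in listeners:
        callback(old_url, new_url)


flow_revision = {"rev-a": "DB://cluster24/2"}


@on_orphan_moved
def update_flow_pointer(old_url, new_url):
    # Flow repoints its own rows, since they are not in the text table.
    for rev_id, url in flow_revision.items():
        if url == old_url:
            flow_revision[rev_id] = new_url


notify_orphan_moved("DB://cluster24/2", "DB://cluster26/7")
```

The sketch also shows the weak spot: if a listener is not registered, or fails partway through, the recompression side has no way to know that a pointer was left behind.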

daniel moved this task from Inbox to Watching on the TechCom board. Wed, Aug 28, 8:27 PM
Tgr added a comment. Mon, Sep 16, 10:33 AM

An alternative approach would be to add a hook to the recompression script notifying Flow of changes in orphaned blob URLs and allowing it to update them itself, but that could be more work than performing the separate store migration that we already have code for.

Also it would mean having a different setup on beta and production, unless we undo the beta migration somehow.
And hooks seem like a fragile mechanism for something that would cause content loss on failure.

Tgr added a comment. Mon, Sep 16, 10:34 AM

It's already been done in beta labs, but hasn't been done in production yet, mostly because this doesn't seem to be a priority for anyone. (The Growth team hasn't proactively worked on it for a while, and SRE/CPT haven't asked us to.)

@jcrespo / @daniel, do you have any feedback on the priority of this task?

@jcrespo Priority or availability to work on it (they are not the same)? CC @Marostegui

Tgr added a comment. Mon, Sep 16, 11:06 AM

I mean, this task only exists because T106386: Compress data at external storage exists. Is that something intended to happen soon? Or is it something that's a good idea in theory but no one really cares about it ATM? How urgent is it to fix Flow being a blocker?

Untagging TechCom, since this has been decoupled from the text table and content table.