
Export all current (wiki_id, page_id) data from ml_cache.page_paragraph_tone_scores (Cassandra)
Closed, ResolvedPublic

Description

From Slack:

I have a request/question regarding Revise Tone - would it be possible to export all current (wiki_id, page_id) data from Cassandra, maybe as a csv file for example?
The reason: we need this to update the existing entries in Cassandra and the Search weighted tags by recomputing old suggestions against their latest revisions, because we deployed a change (T412210) to use HTML instead of Wikitext for the Revise Tone Task Generator. This better filters out quotes and certain sections of articles, and the Growth team needs it for their A/B test launches this Thursday.

Event Timeline

Eevans triaged this task as Medium priority.Mon, Jan 12, 9:51 PM

So, as a rule of thumb: we should not rely on the ability to run any "all of" query — or really any query that we didn't specifically model for, or that doesn't otherwise have a reasonably bounded result-set cardinality. Such a query is basically a scatter-gather and full merge over all the nodes in the cluster. As the size of a dataset grows, these become very expensive (prohibitively so) and time-consuming to run. It's also problematic because the set of (wiki_id, page_id) tuples here isn't deterministic; we should assume it has changed by the time the query completes. I don't think this is a good strategy going forward.

All of that said, the table is currently small enough to reasonably export with a COPY from the CQL shell (cqlsh), and I have attached the output of that below. Let me know if this will work.
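For reference, an export along these lines can be done with cqlsh's COPY TO; a minimal sketch, assuming the keyspace and table from the task title and an illustrative output filename:

```
-- Run from cqlsh; exports only the two key columns to CSV.
-- The output path 'page_paragraph_tone_scores.csv' is illustrative.
COPY ml_cache.page_paragraph_tone_scores (wiki_id, page_id)
  TO 'page_paragraph_tone_scores.csv'
  WITH HEADER = TRUE;
```

Note that COPY pages through the table client-side, so it carries the same full-scan cost described above — it is only reasonable while the table stays small.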

Boldly closing this; please reopen if there continues to be work needed here.