
Section Level Image Suggestions - Data Persistence Request
Closed, Resolved · Public

Description

The existing image suggestions data pipeline suggests images at an article level. There is a new data pipeline being built that will suggest images at an article section level.

The output of the new data pipeline is expected to be the same as the article-level suggestions, with the addition of a field containing the section identifier.

When the new data pipeline is built, the existing article-level data pipeline will continue to run and its output will be consumed as it is currently.

Write Frequency and Method:

  • Weekly bulk from Airflow > Cassandra connector job

Size and Growth:

  • Still being investigated as the final algorithm is worked out -> https://phabricator.wikimedia.org/T315976
  • Size is not known yet, but would realistically be a multiple of the existing page-level image suggestions (10x?)

Access:

  • Cassandra Data Gateway

Timeline:

  • SD writing data to production by end of year / early January 2023.

Questions for Data Persistence:

  • Based on the above, does it make sense to simply modify the existing schema to add a new field? I don't think we need new tables.
  • I don't think the new section id field needs to be part of the key. Requests would still be by wiki/page_id (requestors likely wouldn't know the section id; SD, correct me if I'm wrong). Growth may need to add a filter for items that have a section id.
  • Do we have a Cassandra test instance that SD engineers could use during their development phases?

Update

This is an example row currently available in Cassandra for article-level image suggestions:

 wiki   | page_id | id                                   | image                  | confidence | found_on | kind                        | origin_wiki | page_rev
--------+---------+--------------------------------------+------------------------+------------+----------+-----------------------------+-------------+----------
 anwiki |    3326 | 839c1112-97ae-11ed-89f7-bc97e1581854 | 14_Agosto_2016_(1).jpg |         80 |     null | {'istype-commons-category'} | commonswiki |  1853430

For Section-Level-Image-Suggestions, we agreed to add the following fields:

  • page_qid: string - the page Wikidata ID
  • section_index: int - the numeric index as seen by mwparserfromhell
  • section_title: string - the section title extracted from wikitext, e.g., Discography. The special string ### zero ### is reserved for lead sections.
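As a hedged illustration of the agreed fields, a section-level suggestion row could be modelled as below. The field names follow the list above; the class name, the subset of article-level fields shown, and the sample values are hypothetical:

```python
from dataclasses import dataclass

LEAD_SECTION_TITLE = "### zero ###"  # reserved string for lead sections

@dataclass
class SectionImageSuggestion:
    # existing article-level fields (subset shown for brevity)
    wiki: str
    page_id: int
    image: str
    confidence: int
    # new section-level fields
    page_qid: str        # page Wikidata ID, e.g. "Q42"
    section_index: int   # numeric index as seen by mwparserfromhell
    section_title: str   # extracted from wikitext, e.g. "Discography"

    @property
    def is_lead_section(self) -> bool:
        # Lead sections carry the reserved sentinel title
        return self.section_title == LEAD_SECTION_TITLE

# Illustrative row (values are made up)
s = SectionImageSuggestion("enwiki", 3326, "Example.jpg", 80,
                           "Q42", 0, LEAD_SECTION_TITLE)
print(s.is_lead_section)  # True
```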

Event Timeline

The existing image suggestions data pipeline suggests images at an article level. There is a new data pipeline being built that will suggest images at an article section level.

The output of the new data pipeline is expected to be the same as the article-level suggestions, with the addition of a field containing the section identifier.

When the new data pipeline is built the existing article level data pipeline will continue to run and the output consumed as it's done currently.

Write Frequency and Method:

  • Weekly bulk from Airflow > Cassandra connector job

Size and Growth:

  • Still being investigated as the final algorithm is worked out -> https://phabricator.wikimedia.org/T315976
  • Size is not known yet but realistically would be a multiple of the existing page level image suggestions (10x?)

What is it that makes it a multiple of the existing (page-based) suggestions? Will the algorithm be somehow producing that many more results? Will we be (for example) storing N * num_sections suggestions (where N is the current per-page limit)? I guess what I'm wondering is whether this will result in a corresponding change to the size of result responses as well.
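To make the "N * num_sections" scenario concrete, the back-of-envelope arithmetic behind the "10x" guess might look like the following sketch; both the per-page limit and the average section count are hypothetical placeholders, not figures from the pipeline:

```python
# Rough storage estimate for section-level suggestions (placeholder numbers).
N_PER_PAGE = 3              # assumed current per-page suggestion limit
AVG_SECTIONS_PER_PAGE = 10  # hypothetical average number of sections

# If the new pipeline stores up to N suggestions per *section* rather
# than per page, row count scales with the number of sections:
rows_per_page = N_PER_PAGE * AVG_SECTIONS_PER_PAGE
growth_factor = rows_per_page / N_PER_PAGE

print(growth_factor)  # 10.0 with these placeholder values
```

The same factor would apply to result-response sizes only if reads return all sections of a page at once, which is exactly the open question above.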

Access:

  • Cassandra Data Gateway

Timeline:

  • SD writing data to production by end of year/early Jan 23.

Questions for Data Persistence:

  • Based on the above does it make sense to just modify the existing schema to add a new field? I don't think we need new tables.

Based on my understanding, yes. If the only change to the model is adding an attribute to the image entries (i.e., no change to the PRIMARY KEY group), then it's a non-breaking change.

  • I don’t think we need to have the new section id field as part of the key. Requests would still be on wiki/page_id (requestors wouldn't likely know the section id - SD correct me if I'm wrong?) - Growth may need to add a filter on items with a section id

Yes, this is important. If there is a need to query by section, then changes to the data model will be needed. The sooner we establish this, the better.
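If sections stay out of the key, consumers such as Growth would still fetch by wiki/page_id and filter client-side. A minimal sketch, assuming rows arrive as dicts in which legacy article-level suggestions simply lack a section_index:

```python
def section_level_only(rows):
    """Keep only suggestions that carry a section identifier.

    Rows without a section_index (or with it set to None) are assumed
    to be legacy article-level suggestions.
    """
    return [r for r in rows if r.get("section_index") is not None]

# Illustrative mixed result set for one page (values are made up)
rows = [
    {"page_id": 3326, "image": "A.jpg", "section_index": None},
    {"page_id": 3326, "image": "B.jpg", "section_index": 2},
]
print(section_level_only(rows))  # keeps only the section-level row
```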

  • Do we have a Cassandra test instance that SD engineers could use during their development phases?

No, we don't. :(

[ ... ]

Size and Growth:

  • Still being investigated as the final algorithm is worked out -> https://phabricator.wikimedia.org/T315976
  • Size is not known yet but realistically would be a multiple of the existing page level image suggestions (10x?)

What is it that makes it a multiple of the existing (page-based) suggestions? Will the algorithm be somehow producing that many more results? Will we be (for example) storing N * num_sections suggestions (where N is the current per-page limit)? I guess what I'm wondering is whether this will result in a corresponding change to the size of result responses as well.

[ ... ]

  • I don’t think we need to have the new section id field as part of the key. Requests would still be on wiki/page_id (requestors wouldn't likely know the section id - SD correct me if I'm wrong?) - Growth may need to add a filter on items with a section id

Yes, this is important. If there is a need to query by-section, then changes to the data model will be needed. The sooner we establish this, the better.

To expound on this a bit more:

The stated timeline (EOY to early January) is pretty short, especially given the constraints the holidays impose. If this is as simple as it initially seems (adding an attribute to the image entry) then what is proposed shouldn't be a problem. For DP's part, I think we'd just need to (re)evaluate/document any new capacity requirements, and carry out the schema change. If it turns out to be more complicated, then the sooner we know the better we can establish a timeline.
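Assuming the change really is just new non-key columns, the schema change itself is additive. The sketch below builds the corresponding CQL; the table name comes from T328670's title, the column names and types from the agreed field list, and everything else is illustrative:

```python
# Hedged sketch of the non-breaking schema change: adding regular
# (non-key) columns in Cassandra is additive and does not rewrite
# existing rows or alter the PRIMARY KEY group.
TABLE = "image_suggestions.suggestions"  # from T328670
NEW_COLUMNS = {
    "page_qid": "text",
    "section_index": "int",
    "section_title": "text",
}

statements = [
    f"ALTER TABLE {TABLE} ADD {name} {ctype};"
    for name, ctype in NEW_COLUMNS.items()
]
for stmt in statements:
    print(stmt)
```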

  • Do we have a Cassandra test instance that SD engineers could use during their development phases?

No, we don't. :(

I'll add to this one too: we do not have a test/staging environment for this cluster. But let's see what, at a minimum, would be necessary to test these changes, and maybe we can figure something out.

Having spoken to @kostajh, I think we'll need a section_name field as well as (or instead of) section_id, so that Growth can filter out sections on their end.

Did we decide definitively which fields need to be added to the data model? If not, then we ought to ASAP ...

@Cparle I think we wrapped up that discussion, and agreed to expose the following fields in Cassandra:

  • wiki
  • page ID
  • revision ID
  • page QID
  • section index, i.e., numeric index as seen by mwparserfromhell
  • section title, i.e., a string extracted from wikitext, e.g., Discography. The special string ### zero ### is reserved for lead sections.
mfossati claimed this task.
mfossati updated the task description.

Updated the task description, closing.

I had originally understood this ticket to be a request for action/resources from Data-Persistence. If that is the case, should it have been closed? Is there a follow-up ticket coming?

Also...

...

  • Do we have a Cassandra test instance that SD engineers could use during their development phases?

The answer to this is now: Yes.


That's good news @Eevans, thanks for the heads up! Can you please indicate how to connect to it?


Ok, maybe I oversold this... let me start again. :)

We have a cluster that exists for testing, for some value of "testing". I think in this context the meaning is probably closer to staging than it is to experimenting, for which Cloud VPS is probably a better fit. So for example, as a sink to receive data from a pipeline that hasn't yet been cleared for production, but maybe not the place to actively develop new applications against. At least, this is my current thinking. I'd like to hear what folks need first.

So as a step 1, can you create a separate ticket with your requirements here?

We have a cluster that exists for testing, for some value of "testing". I think in this context the meaning is probably closer to staging than it is to experimenting, for which Cloud VPS is probably a better fit.
So for example, as a sink to receive data from a pipeline that hasn't yet been cleared for production, but maybe not the place to actively develop new applications against. At least, this is my current thinking. I'd like to hear what folks need first.

Fully agree, a staged pathway to production is exactly what we need.
In other words, something we can feel free to feed with data and eventually wipe clean.

So as a step 1, can you create a separate ticket with your requirements here?

Here you are: T328778: Cassandra test cluster as a staged pathway to production for image suggestions data pipelines

As a subtask for you, @Eevans, T328670: Add section title column to image_suggestions.suggestions table schema would be the final requirement for production.

I think we can safely close this ticket as soon as the subtasks are resolved.

Eevans triaged this task as Medium priority. Apr 5 2024, 8:45 PM


Doing a little hygiene work here: T328778 feels more like a follow-up task, than a dependency for completion of this one, so I'll be bold and close this. Feel free to re-open it if I got that wrong!