
Section Level Image Suggestions - Data Persistence Request
Closed, Resolved · Public

Description

The existing image suggestions data pipeline suggests images at an article level. There is a new data pipeline being built that will suggest images at an article section level.

The output of the new data pipeline is expected to be the same as the article-level suggestions, with the addition of a field containing the section identifier.

When the new data pipeline is built, the existing article-level data pipeline will continue to run and its output will be consumed as it is currently.

Write Frequency and Method:

  • Weekly bulk from Airflow > Cassandra connector job

Size and Growth:

  • Still being investigated as the final algorithm is worked out -> https://phabricator.wikimedia.org/T315976
  • Size is not known yet, but would realistically be a multiple of the existing page-level image suggestions (10x?)

Access:

  • Cassandra Data Gateway

Timeline:

  • SD writing data to production by end of year / early January 2023.

Questions for Data Persistence:

  • Based on the above, does it make sense to simply modify the existing schema to add a new field? I don't think we need new tables.
  • I don't think the new section id field needs to be part of the key. Requests would still be by wiki/page_id (requestors likely wouldn't know the section id; SD, correct me if I'm wrong). Growth may need to add a filter for items that have a section id.
  • Do we have a Cassandra test instance that SD engineers could use during their development phases?

Update

This is an example row currently available in Cassandra for article-level image suggestions:

 wiki   | page_id | id                                   | image                  | confidence | found_on | kind                        | origin_wiki | page_rev
--------+---------+--------------------------------------+------------------------+------------+----------+-----------------------------+-------------+----------
 anwiki |    3326 | 839c1112-97ae-11ed-89f7-bc97e1581854 | 14_Agosto_2016_(1).jpg |         80 |     null | {'istype-commons-category'} | commonswiki |  1853430

For Section-Level-Image-Suggestions, we agreed to add the following fields:

  • page_qid: string - the page Wikidata ID
  • section_index: int - the numeric index as seen by mwparserfromhell
  • section_title: string - the section title extracted from wikitext, e.g., Discography. The special string ### zero ### is reserved for lead sections.
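As a hedged illustration of the agreed fields, a section-level suggestion row could be modelled as below. The field names follow the list above; the class name, the subset of article-level fields shown, and the sample values are hypothetical:

```python
from dataclasses import dataclass

LEAD_SECTION_TITLE = "### zero ###"  # reserved string for lead sections

@dataclass
class SectionImageSuggestion:
    # existing article-level fields (subset shown for brevity)
    wiki: str
    page_id: int
    image: str
    confidence: int
    # new section-level fields
    page_qid: str        # page Wikidata ID, e.g. "Q42"
    section_index: int   # numeric index as seen by mwparserfromhell
    section_title: str   # extracted from wikitext, e.g. "Discography"

    @property
    def is_lead_section(self) -> bool:
        # Lead sections carry the reserved sentinel title
        return self.section_title == LEAD_SECTION_TITLE

# Illustrative row (values are made up)
s = SectionImageSuggestion("enwiki", 3326, "Example.jpg", 80,
                           "Q42", 0, LEAD_SECTION_TITLE)
print(s.is_lead_section)  # True
```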

Event Timeline

The existing image suggestions data pipeline suggests images at an article level. There is a new data pipeline being built that will suggest images at an article section level.

The output of the new data pipeline is expected to be the same as the article-level suggestions, with the addition of a field containing the section identifier.

When the new data pipeline is built the existing article level data pipeline will continue to run and the output consumed as it's done currently.

Write Frequency and Method:

  • Weekly bulk from Airflow > Cassandra connector job

Size and Growth:

  • Still being investigated as the final algorithm is worked out -> https://phabricator.wikimedia.org/T315976
  • Size is not known yet but realistically would be a multiple of the existing page level image suggestions (10x?)

What is it that makes it a multiple of the existing (page-based) suggestions? Will the algorithm be somehow producing that many more results? Will we be (for example) storing N * num_sections suggestions (where N is the current per-page limit)? I guess what I'm wondering is whether this will result in a corresponding change to the size of result responses as well.
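To make the "N * num_sections" scenario concrete, the back-of-envelope arithmetic behind the "10x" guess might look like the following sketch; both the per-page limit and the average section count are hypothetical placeholders, not figures from the pipeline:

```python
# Rough storage estimate for section-level suggestions (placeholder numbers).
N_PER_PAGE = 3              # assumed current per-page suggestion limit
AVG_SECTIONS_PER_PAGE = 10  # hypothetical average number of sections

# If the new pipeline stores up to N suggestions per *section* rather
# than per page, row count scales with the number of sections:
rows_per_page = N_PER_PAGE * AVG_SECTIONS_PER_PAGE
growth_factor = rows_per_page / N_PER_PAGE

print(growth_factor)  # 10.0 with these placeholder values
```

The same factor would apply to result-response sizes only if reads return all sections of a page at once, which is exactly the open question above.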

Access:

  • Cassandra Data Gateway

Timeline:

  • SD writing data to production by end of year/early Jan 23.

Questions for Data Persistence:

  • Based on the above does it make sense to just modify the existing schema to add a new field? I don't think we need new tables.

Based on my understanding, yes. If the only change to the model is adding an attribute to the image entries (i.e., no change to the PRIMARY KEY group), then it's a non-breaking change.

  • I don’t think we need to have the new section id field as part of the key. Requests would still be on wiki/page_id (requestors wouldn't likely know the section id - SD correct me if I'm wrong?) - Growth may need to add a filter on items with a section id

Yes, this is important. If there is a need to query by section, then changes to the data model will be needed. The sooner we establish this, the better.
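If sections stay out of the key, consumers such as Growth would still fetch by wiki/page_id and filter client-side. A minimal sketch, assuming rows arrive as dicts in which legacy article-level suggestions simply lack a section_index:

```python
def section_level_only(rows):
    """Keep only suggestions that carry a section identifier.

    Rows without a section_index (or with it set to None) are assumed
    to be legacy article-level suggestions.
    """
    return [r for r in rows if r.get("section_index") is not None]

# Illustrative mixed result set for one page (values are made up)
rows = [
    {"page_id": 3326, "image": "A.jpg", "section_index": None},
    {"page_id": 3326, "image": "B.jpg", "section_index": 2},
]
print(section_level_only(rows))  # keeps only the section-level row
```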

  • Do we have a Cassandra test instance that SD engineers could use during their development phases?

No, we don't. :(

[ ... ]

Size and Growth:

  • Still being investigated as the final algorithm is worked out -> https://phabricator.wikimedia.org/T315976
  • Size is not known yet but realistically would be a multiple of the existing page level image suggestions (10x?)

What is it that makes it a multiple of the existing (page-based) suggestions? Will the algorithm be somehow producing that many more results? Will we be (for example) storing N * num_sections suggestions (where N is the current per-page limit)? I guess what I'm wondering is whether this will result in a corresponding change to the size of result responses as well.

[ ... ]

  • I don’t think we need to have the new section id field as part of the key. Requests would still be on wiki/page_id (requestors wouldn't likely know the section id - SD correct me if I'm wrong?) - Growth may need to add a filter on items with a section id

Yes, this is important. If there is a need to query by-section, then changes to the data model will be needed. The sooner we establish this, the better.

To expound on this a bit more:

The stated timeline (EOY to early January) is pretty short, especially given the constraints the holidays impose. If this is as simple as it initially seems (adding an attribute to the image entry) then what is proposed shouldn't be a problem. For DP's part, I think we'd just need to (re)evaluate/document any new capacity requirements, and carry out the schema change. If it turns out to be more complicated, then the sooner we know the better we can establish a timeline.
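Assuming the change really is just new non-key columns, the schema change itself is additive. The sketch below builds the corresponding CQL; the table name comes from T328670's title, the column names and types from the agreed field list, and everything else is illustrative:

```python
# Hedged sketch of the non-breaking schema change: adding regular
# (non-key) columns in Cassandra is additive and does not rewrite
# existing rows or alter the PRIMARY KEY group.
TABLE = "image_suggestions.suggestions"  # from T328670
NEW_COLUMNS = {
    "page_qid": "text",
    "section_index": "int",
    "section_title": "text",
}

statements = [
    f"ALTER TABLE {TABLE} ADD {name} {ctype};"
    for name, ctype in NEW_COLUMNS.items()
]
for stmt in statements:
    print(stmt)
```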

  • Do we have a Cassandra test instance that SD engineers could use during their development phases?

No, we don't. :(

I'll add to this one too: we do not have a test/staging environment for this cluster. But let's see what, at a minimum, would be necessary to test these changes, and maybe we can figure something out.

Having spoken to @kostajh, I think we'll need a section_name field as well as (or instead of) section_id, so that Growth can filter out sections on their end.

Did we decide definitively which fields need to be added to the data model? If not, then we ought to ASAP ...

@Cparle I think we wrapped up that discussion, and agreed to expose the following fields in Cassandra:

  • wiki
  • page ID
  • revision ID
  • page QID
  • section index, i.e., numeric index as seen by mwparserfromhell
  • section title, i.e., a string extracted from wikitext, e.g., Discography. The special string ### zero ### is reserved for lead sections.
mfossati claimed this task.
mfossati updated the task description.

Updated the task description, closing.

I had originally understood this ticket to be a request for action/resources from Data-Persistence. If that is the case, should it have been closed? Is there a follow-up ticket coming?

Also...

...

  • Do we have a Cassandra test instance that SD engineers could use during their development phases?

The answer to this is now: Yes.


That's good news @Eevans, thanks for the heads up! Can you please indicate how to connect to it?


Ok, maybe I oversold this... let me start again. :)

We have a cluster that exists for testing, for some value of "testing". I think in this context the meaning is probably closer to staging than it is to experimenting, for which Cloud VPS is probably a better fit. So for example, as a sink to receive data from a pipeline that hasn't yet been cleared for production, but maybe not the place to actively develop new applications against. At least, this is my current thinking. I'd like to hear what folks need first.

So as a step 1, can you create a separate ticket with your requirements here?

We have a cluster that exists for testing, for some value of "testing". I think in this context the meaning is probably closer to staging than it is to experimenting, for which Cloud VPS is probably a better fit.
So for example, as a sink to receive data from a pipeline that hasn't yet been cleared for production, but maybe not the place to actively develop new applications against. At least, this is my current thinking. I'd like to hear what folks need first.

Fully agree, a staged pathway to production is exactly what we need.
In other words, something we can feel free to feed with data and eventually wipe clean.

So as a step 1, can you create a separate ticket with your requirements here?

Here you are: T328778: Cassandra test cluster as a staged pathway to production for image suggestions data pipelines

As a subtask for you, @Eevans, T328670: Add section title column to image_suggestions.suggestions table schema would be the final requirement for production.

I think we can safely close this ticket as soon as the subtasks are resolved.

Eevans triaged this task as Medium priority. Apr 5 2024, 8:45 PM


Doing a little hygiene work here: T328778 feels more like a follow-up task, than a dependency for completion of this one, so I'll be bold and close this. Feel free to re-open it if I got that wrong!