
Design Image Suggestion Schema
Closed, ResolvedPublic

Description

User Story
As a platform engineer, I need to design a database schema that allows storage of data output by the Image Suggestion process
Success Criteria
  • Schema stores all fields from output
  • Supports retrieval of data set records by project & page ID
  • Optionally(?) supports lookup/retrieval by project & page title
  • Storage accommodates bulk import of new records, expiry / deletion of stale data
Out of scope
  • Storage of data for training

Cassandra Storage

-- The recommended images dataset
CREATE TABLE image_suggestions.suggestions (
    wiki text,            -- Wikimedia project
    page_id int,          -- MediaWiki page ID attribute
    id timeuuid,          -- Globally unique, but also a timestamp; unique to an algorithm run
    image text,           -- Image being recommended
    origin_wiki text,     -- Where the recommended image resides
    confidence float,     -- Strength of recommendation; value in the range 0.0-1.0
    found_on set<text>,   -- Other wikis that use the image
    kind set<text>,       -- ??
    page_rev int,         -- Revision of page_id at time of recommendation (informational)
    PRIMARY KEY ((wiki, page_id), id, image)
);

-- A record of user feedback, replicating whatever attributes of the corresponding
-- suggestions are necessary.
CREATE TABLE image_suggestions.feedback (
    wiki text,            -- Corresponds to suggestions.wiki
    page_id int,          -- Corresponds to suggestions.page_id
    image text,           -- Corresponds to suggestions.image
    id timeuuid,          -- ID (& timestamp) of feedback
    origin_wiki text,     -- Corresponds to suggestions.origin_wiki
    user text,            -- User who submitted feedback
    accepted boolean,     -- True if feedback indicates acceptance
    rejected boolean,     -- True if feedback indicates rejection
    comment text,         -- User-submitted comment for a rejection
    PRIMARY KEY ((wiki, page_id), image, id)
);

-- Page ID/page title mapping.
--
-- NOTE: This table is a duplication of a relationship that MediaWiki is canonical
-- for. It is maintained here for convenience, with the understanding that it is
-- not trustworthy (it should not be considered a source of truth).
CREATE TABLE image_suggestions.title_cache (
    wiki text,            -- Wikimedia project
    page_id int,          -- MediaWiki page ID attribute
    page_rev int,         -- Revision of page_id
    title text,           -- Title of page at corresponding page_rev
    PRIMARY KEY ((wiki, title))
);

-- Values of the P31 property for the Wikidata item that corresponds with the page.
--
-- NOTE: This table is a duplication of a relationship that MediaWiki is canonical
-- for. It is maintained here for convenience, with the understanding that it is
-- not trustworthy (it should not be considered a source of truth).
CREATE TABLE image_suggestions.instanceof_cache (
    wiki text,            -- Wikimedia project
    page_id int,          -- MediaWiki page ID attribute
    page_rev int,         -- Revision of page_id (FIXME: shouldn't this be the Wikidata page_rev?)
    instance_of set<text>, -- P31 property values
    PRIMARY KEY ((wiki, page_id))
);
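For illustration, a hedged sketch of how the primary access patterns map onto this schema (the wiki and page ID values here are hypothetical):

-- All suggestions for a given project & page (retrieval by the partition key);
-- rows come back ordered by the clustering columns (id, image), so the newest
-- algorithm run sorts last.
SELECT id, image, origin_wiki, confidence, found_on, kind, page_rev
FROM image_suggestions.suggestions
WHERE wiki = 'enwiki' AND page_id = 9132808;

-- Feedback already recorded for the same page:
SELECT image, id, user, accepted, rejected, comment
FROM image_suggestions.feedback
WHERE wiki = 'enwiki' AND page_id = 9132808;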

Proposal
  • From a Product perspective, the suggestions table is The Dataset (i.e., it could appear in a catalog of published datasets for reuse). It establishes a one-to-many relationship between a page (identified by the (wiki, page_id) tuple) and an arbitrary number of suggestion IDs (a type-1 UUID). As this dataset is the product of a batch analytics job, generated periodically, the suggestion ID (suggestions.id) corresponds to each batch run. There is a one-to-many relationship between suggestion IDs, the images suggested, and the attributes that correspond to each.
  • The feedback table keeps a record of user-supplied feedback for image suggestions. It is considered application state for Structured Data & Growth's use-cases, and not a part of the image suggestions dataset.
  • The title_cache and instanceof_cache tables store attribute relationships that are canonically modeled in other systems (MediaWikis), and are only maintained here for convenience.
  • Retention of data in the suggestions table will be managed by TTLs. The length of the TTL will be a multiple of the update frequency that provides some historical results (and a buffer against late/missing batch jobs), while keeping result sets bounded for performance (clients will receive the full result set, even when they only require the most recent).
  • Since joins between these tables are not possible, multiple queries will be needed in some scenarios. For example, if relevant image suggestions are those without feedback, then separate queries of suggestions & feedback will need to be performed (they can be performed concurrently), with the set difference computed by the application. Lookups by page title will first require a query against title_cache to find the page ID, followed by a query against suggestions. A sketch of both patterns follows.
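A hedged sketch of those multi-query patterns, plus the TTL-based expiry described above (all literal values, including the 90-day TTL, are illustrative assumptions, not decisions):

-- By-title lookup: resolve the page ID via title_cache first...
SELECT page_id FROM image_suggestions.title_cache
WHERE wiki = 'enwiki' AND title = 'Foo';

-- ...then fetch suggestions and feedback (concurrently, if desired); the
-- "suggestions without feedback" set difference is computed in the
-- application, since CQL has no joins.
SELECT image FROM image_suggestions.suggestions
WHERE wiki = 'enwiki' AND page_id = 9132808;
SELECT image FROM image_suggestions.feedback
WHERE wiki = 'enwiki' AND page_id = 9132808;

-- Bulk imports write each row with a TTL so stale runs expire automatically:
INSERT INTO image_suggestions.suggestions (wiki, page_id, id, image, origin_wiki, confidence)
VALUES ('enwiki', 9132808, now(), 'Example.jpg', 'commonswiki', 0.85)
USING TTL 7776000; -- 90 days; assumed to be a multiple of the update frequency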

Event Timeline

Based on the TSV files in imagerec_prod.tar.bz2, the dataset seems to consist of the following:

attribute          (implied) type  comment
page_id            int             Monotonically increasing integer, unique per wiki; the MediaWiki primary key
page_title         text            Textual name for a page
image_id           text            Textual name (filename) of an image
confidence_rating  text            One of low, medium, or high
source             text            Where the image lives
dataset_id         uuid            Data generation "version": globally unique identifier of the job responsible for this dataset
insertion_ts       double(?)       Timestamp of insertion
wiki               text            Wiki (project/site) containing the corresponding page_id
found_on           text            Wikis (projects/sites) this image currently appears on

Some questions:

  • Is page_title required? Assuming it's an extension (read: MediaWiki) that's consuming this, the page's title is something the client-side would already have, rendering this an unnecessary duplication. Additionally, a page's title can change, which could make what we're storing here incorrect (read: duplication is Bad™).
  • Judging by the enwiki TSV file, only ~8.6% of the rows have an image; is this correct? If so, are we expecting to store entries without an image (read: without a valid recommendation)?

@Clarakosi @gmodena - Are you able to answer the first question above? Seems like there isn't a good reason to store page title? Not sure if there was any reasoning to it in the original requirements.

Second question, please see this thread here. I asked the same question and Gabriele confirmed the client team explicitly asked for these to be stored.

[ ... ]

Second question, please see this thread here. I asked the same question and Gabriele confirmed the client team explicitly asked for these to be stored.

Thanks; since I don't imagine everyone can access that...

image.png (123 KB)


Looking at the Superset link referenced earlier in that thread, is it fair to assume it is canonical, and imagerec_prod.tar.bz2 is out of date?

Based on this Superset query, the dataset seems to consist of the following:

attribute          (implied) type  comment
page_id            int             Monotonically increasing integer, unique per wiki; the MediaWiki primary key
page_title         text            Textual name for a page
image_id           text            Textual name (filename) of an image
confidence_rating  text            One of low, medium, or high
source             text            Where the image lives
dataset_id         uuid            Data generation "version": globally unique identifier of the job responsible for this dataset
insertion_ts       double(?)       Timestamp of insertion
wiki               text            Wiki (project/site) containing the corresponding page_id
found_on           text            Wikis (projects/sites) this image currently appears on
instance_of        text            Wikidata ID
is_article_page    bool
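For illustration, a hedged sketch of how one row of this dataset might map onto the Cassandra schema in the description (the column mapping and all literal values are assumptions on my part):

-- dataset_id -> suggestions.id, image_id -> image, source -> origin_wiki,
-- confidence_rating -> confidence (assuming a textual-to-numeric mapping):
INSERT INTO image_suggestions.suggestions (wiki, page_id, id, image, origin_wiki, confidence, found_on)
VALUES ('enwiki', 9132808, now(), 'Example.jpg', 'commonswiki', 0.85, {'frwiki', 'dewiki'});

-- instance_of -> the instanceof_cache table (page_rev is a placeholder):
INSERT INTO image_suggestions.instanceof_cache (wiki, page_id, page_rev, instance_of)
VALUES ('enwiki', 9132808, 1055432100, {'Q5'});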

@Clarakosi @gmodena - Are you able to answer the first question above? Seems like there isn't a good reason to store page title? Not sure if there was any reasoning to it in the original requirements.

I believe this was an ask for the proof of concept. I doubt we need it moving forward and can probably just have the API populate that field if client teams need it.

...
Looking at the Superset link referenced earlier in that thread, is it fair to assume it is canonical, and imagerec_prod.tar.bz2 is out of date?

They are from the same algorithm run; it's just that imagerec_prod.tar.bz2 has filtered out articles that should not have images (see: T276137)

Based on the TSV files in imagerec_prod.tar.bz2, the dataset seems to consist of the following:

[...]

  • Judging by the enwiki TSV file, only ~8.6% of the rows have an image; is this correct? If so, are we expecting to store entries without an image (read: without a valid recommendation)?

That's correct, and use case specific. For the Structured Data PoC, the API team expected a dataset with

  1. a list of all unillustrated articles detected on a wiki.
  2. at most three candidate images that match an unillustrated article.

An empty image_ids denotes the case of "unillustrated article with no recommendations". This semantic was required (IIRC) to compare this dataset with Elasticsearch (MediaSearch) result sets. See https://phabricator.wikimedia.org/T274798.

As for the page_title, we should maybe revisit this requirement with client teams.

Based on this Superset query, the dataset seems to consist of the following:

[...]

Superset exposes staged datasets (imagerec, imagerec_prod) that are used to create use-case specific materialised views. Both are meant for internal use.

From imagerec_prod we generate three datasets, shared with client teams via a file-based API (analytics.wikimedia.org or HDFS).

  1. Data for the Structured Data use case was exported with this query: https://github.com/mirrys/ImageMatching/blob/main/ddl/export_prod_data.hql
  2. Data for the Android use case was exported with this query: https://github.com/mirrys/ImageMatching/blob/main/ddl/export_prod_data-android.hql
  3. Data for the Search use case was exported with this query: https://github.com/mirrys/ImageMatching/blob/main/ddl/external_search_imagerec.hql

The schema you reference in https://phabricator.wikimedia.org/T293808#7449361 refers to the Structured Data export, which I'd consider the canonical dataset for our modelling exercise.

[...]

  • Judging by the enwiki TSV file, only ~8.6% of the rows have an image; is this correct? If so, are we expecting to store entries without an image (read: without a valid recommendation)?

That's correct, and use case specific. For the Structured Data PoC, the API team expected a dataset with

  1. a list of all unillustrated articles detected on a wiki.
  2. at most three candidate images that match an unillustrated article.

An empty image_ids denotes the case of "unillustrated article with no recommendations". This semantic was required (IIRC) to compare this dataset with Elasticsearch (MediaSearch) result sets. See https://phabricator.wikimedia.org/T274798.

I'm not sure whether this is the hill I want to die on, but this doesn't seem right to me. At least, if you believe that what we're trying to do here is model image recommendations (in the abstract sense), rather than simply persisting what this implementation currently produces.

Abstractly, a recommendation is something that could apply to any article on a wiki (sans a few exceptions), regardless of whether it is currently illustrated or not. Creating an implicit (read: un-modeled) distinction between articles with or without images like this would seem to create an unnecessary coupling to this (version of the) recommendation algorithm.

I would propose the following:

page_id            int        Monotonically increasing integer, unique per wiki; the MediaWiki primary key
page_title         text       Textual name for a page
image_id           text       Textual name (filename) of an image
confidence_rating  float      One of low, medium, or high
source             text       Where the image lives
dataset_id         timeuuid   Data generation "version" & timestamp; a type 1 UUID
insertion_ts       timestamp  Timestamp of insertion
wiki               text       Wiki (project/site) containing the corresponding page_id
found_on           text       Wikis (projects/sites) this image currently appears on
  • Eliminating the page_title, on the basis that it is a duplication (rather than a reference)
  • Storing confidence_rating as a float, as a guard against future needs for more granularity
  • Using a type 1 UUID for dataset_id (can double as a timestamp)
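To illustrate that last point, a type 1 UUID carries its timestamp with it, so CQL can both generate one at write time and recover the time on read (a sketch; the literal values are hypothetical):

-- now() yields a type 1 (time-based) UUID at insert time:
INSERT INTO image_suggestions.suggestions (wiki, page_id, id, image)
VALUES ('enwiki', 9132808, now(), 'Example.jpg');

-- toTimestamp() extracts the embedded timestamp, e.g. to identify the batch run:
SELECT toTimestamp(id) AS run_ts, image, confidence
FROM image_suggestions.suggestions
WHERE wiki = 'enwiki' AND page_id = 9132808;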

I would propose the following:

page_id            int        Monotonically increasing integer, unique per wiki; the MediaWiki primary key
page_title         text       Textual name for a page
image_id           text       Textual name (filename) of an image
confidence_rating  float      One of low, medium, or high
source             text       Where the image lives
dataset_id         timeuuid   Data generation "version" & timestamp; a type 1 UUID
insertion_ts       timestamp  Timestamp of insertion
wiki               text       Wiki (project/site) containing the corresponding page_id
found_on           text       Wikis (projects/sites) this image currently appears on
  • Eliminating the page_title, on the basis that it is a duplication (rather than a reference)
  • Storing confidence_rating as a float, as a guard against future needs for more granularity
  • Using a type 1 UUID for dataset_id (can double as a timestamp)

Just noting that we currently use page_title for the endpoint added in rMSIS89adfa11057b: Add /:wiki/:lang/pages/:title path. We could still drop it from the proposed schema, but the application code would need to be updated to find a page ID when given a page title.

@kostajh - what would happen if a page title changes between the image rec output and someone viewing the image rec then calling the API?

@kostajh - what would happen if a page title changes between the image rec output and someone viewing the image rec then calling the API?

I should have clarified in my previous comment – we're only using page titles with the API for local development and beta wikis, where it's not (easily) possible to get the page ID in those wikis to match their production equivalents (e.g. page "Foo" on my local wiki has ID 1010 but on enwiki it is ID 9132808). So page title renames wouldn't really be a problem that anyone has to spend engineering effort on; if I want recommendations for page "Foo" in my local wiki and that's renamed to "FooBar" in production, then I'll just rename it in my local wiki.

@kostajh - what would happen if a page title changes between the image rec output and someone viewing the image rec then calling the API?

I should have clarified in my previous comment – we're only using page titles with the API for local development and beta wikis, where it's not (easily) possible to get the page ID in those wikis to match their production equivalents (e.g. page "Foo" on my local wiki has ID 1010 but on enwiki it is ID 9132808). So page title renames wouldn't really be a problem that anyone has to spend engineering effort on; if I want recommendations for page "Foo" in my local wiki and that's renamed to "FooBar" in production, then I'll just rename it in my local wiki.

I'm still unclear here: are by-title lookups planned for production, or are they only something you're using during development? If they are supported in production, and titles are persisted in the dataset (as they are now), what happens when a page is renamed (and the dataset does not reflect this)?

@kostajh and @Eevans : regarding page titles, regardless of what we intend, if something is publicly available in production, then people may invent their own uses for it. Which we may then find ourselves obligated to support. I don't have an objection to any specific proposed implementation, and I definitely want to support local/beta development in whatever way we reasonably can. But if whatever implementation we choose has limitations or "gotchas", let's be sure we understand and document them. Maybe that's as simple as setting expectations by noting in the documentation that "page titles may change without warning and any request by page title will attempt to reference suggestions for the current page by that title".

@Eevans , I share your discomfort with the current approach where some suggestions lack, well, suggestions. The empty image_id thing was always a bit hacky, and if we can do better as we move from an experimental prototype phase to something more resembling a real production service, I'm all in favor of it. But it does seem to me that "pages in need of an image" is a valuable and useful set of data regardless of whether we have Image Matching Algorithm suggestions for all those pages. And some clients have specifically requested pages from that broader set, so that they can use MediaSearch suggestions rather than Image Matching Algorithm suggestions. So I'd be interested in a solution that allows clients to get either of those things (pages with IMA suggestions or just pages that need images) in a pseudorandom way. I'm concerned that we may be losing that with the current proposed solution (but then again, I may be misunderstanding the proposal).

One might then argue that the "Image Suggestions API" is poorly named, if one of the things it provides is pages with no suggestions. Names are hard.

Along those lines, during development we got tired of inconsistency between "recommendation" and "suggestions". Within the team that implemented the service, we agreed to use the term "suggestion". I notice that the proposal uses the term "recommendation". I don't object to either word, and if other parts of the pipeline use "recommendation" I'm not averse to renaming the service. But it might help all our long-term sanity to stick to one word or the other.

FWIW, I'm extremely uncomfortable with the current way the service implements MediaSearch suggestions - it doesn't seem to scale or cache well - so I'd be very supportive of removing that functionality from the service. IMO, if a client wants to get suggestions from something other than the IMA, the service's responsibility ends with providing pages that need images. The client can then get its own suggestions however it likes (MediaSearch or whatever). That may be inconvenient for some clients, but the point of our experiment was to learn things. One of the things I learned is that the way we handled MediaSearch results was pretty bad.

[ ... ]

@Eevans , I share your discomfort with the current approach where some suggestions lack, well, suggestions. The empty image_id thing was always a bit hacky, and if we can do better as we move from an experimental prototype phase to something more resembling a real production service, I'm all in favor of it. But it does seem to me that "pages in need of an image" is a valuable and useful set of data regardless of whether we have Image Matching Algorithm suggestions for all those pages. And some clients have specifically requested pages from that broader set, so that they can use MediaSearch suggestions rather than Image Matching Algorithm suggestions.

So, we have pages, almost any of which could have suggestions, even ones that already have images. Then we have (qualifying) pages that are unillustrated, for which IMA will attempt to generate suggestions. And finally, we have those that it was successfully able to do so for. If we are saying that this data set models unillustrated pages, and any corresponding image suggestions IMA was able to make, then we're OKish (the wisdom of distinguishing the latter by a non-nil suggestion notwithstanding). If however we later make refinements to IMA, or add one or more additional suggestion algorithms, any of which is able to make suggestions for already illustrated pages, then we'll have a data model unable to make that distinction. I'm willing to cross that bridge when we come to it if everyone else is; I just wanted to point it out.

So I'd be interested in a solution that allows clients to get either of those things (pages with IMA suggestions or just pages that need images) in a pseudorandom way. I'm concerned that we may be losing that with the current proposed solution (but then again, I may be misunderstanding the proposal).

Wait, both? You want a way of pseudorandomly selecting either pages which are unillustrated, but for which there are no IMA suggestions, and pages that do have suggestions?

[ ... ]

@Eevans , I share your discomfort with the current approach where some suggestions lack, well, suggestions. The empty image_id thing was always a bit hacky, and if we can do better as we move from an experimental prototype phase to something more resembling a real production service, I'm all in favor of it. But it does seem to me that "pages in need of an image" is a valuable and useful set of data regardless of whether we have Image Matching Algorithm suggestions for all those pages. And some clients have specifically requested pages from that broader set, so that they can use MediaSearch suggestions rather than Image Matching Algorithm suggestions.

So, we have pages, almost any of which could have suggestions, even ones that already have images. Then we have (qualifying) pages that are unillustrated, for which IMA will attempt to generate suggestions. And finally, we have those that it was successfully able to do so for. If we are saying that this data set models unillustrated pages, and any corresponding image suggestions IMA was able to make, then we're OKish (the wisdom of distinguishing the latter by a non-nil suggestion notwithstanding). If however we later make refinements to IMA, or add one or more additional suggestion algorithms, any of which is able to make suggestions for already illustrated pages, then we'll have a data model unable to make that distinction. I'm willing to cross that bridge when we come to it if everyone else is; I just wanted to point it out.

To restate what you said (hopefully fairly), we have:

  1. pages, almost any of which could have suggestions, even ones that already have images
  2. qualifying unillustrated pages for which IMA will attempt to generate suggestions
  3. pages IMA was able to generate suggestions for

I'm not sure I understand what distinction you're making between #1 and #2, so let's dig into that. The service doesn't know about pages in general, in an all-pages-on-a-wiki sense. It only knows about pages that are in the dataset provided to it. And the service doesn't know or care if these pages are unillustrated or if they already have images. I think of them as "under-illustrated" pages.

I guess what I'm not following is what issue arises with the data model if we include pages that already have images. If the IMA decides that an existing page that already has one or more images needs more, what breaks?

So I'd be interested in a solution that allows clients to get either of those things (pages with IMA suggestions or just pages that need images) in a pseudorandom way. I'm concerned that we may be losing that with the current proposed solution (but then again, I may be misunderstanding the proposal).

Wait, both? You want a way of pseudorandomly selecting either pages which are unillustrated, but for which there are no IMA suggestions, and pages that do have suggestions?

Almost. We want a way to pseudorandomly select:

  1. pages that have IMA suggestions
  2. under-illustrated pages, regardless of whether they have IMA suggestions or not.

The difference in what you said vs what I said is that we don't need a way to select pages WITHOUT suggestions from the IMA. We just need a way to provide the full set of under-illustrated pages to clients that want to generate their own suggestions.

Sorry that I didn't make that clear in our previous discussion.

My understanding and recollection is that IMA originally only generated #1. Then clients asked for #2, so IMA was extended to include that data in the .tsv files. And that's how we ended up with the empty image_id hack solution. @gmodena , do I have that right?

For that second use case, the service currently attempts to get MediaSearch suggestions, but IMO that was a Bad Idea and we should revisit how the client and service interact going forward. However, I don't think that whether the service does the MediaSearch queries or pushes those to the client impacts the data model. So I'm happy to ignore the MediaSearch bits for the purposes of this task, and negotiate that elsewhere with the affected people. I mostly mention the MediaSearch part as a real-world example of why clients requested this functionality from us. Otherwise, you'd probably (reasonably) ask "why the heck would you want both"?

[ ... ]
So, we have pages, almost any of which could have suggestions, even ones that already have images. Then we have (qualifying) pages that are unillustrated, for which IMA will attempt to generate suggestions. And finally, we have those that it was successfully able to do so for. If we are saying that this data set models unillustrated pages, and any corresponding image suggestions IMA was able to make, then we're OKish (the wisdom of distinguishing the latter by a non-nil suggestion notwithstanding). If however we later make refinements to IMA, or add one or more additional suggestion algorithms, any of which is able to make suggestions for already illustrated pages, then we'll have a data model unable to make that distinction. I'm willing to cross that bridge when we come to it if everyone else is; I just wanted to point it out.

To restate what you said (hopefully fairly), we have:

  1. pages, almost any of which could have suggestions, even ones that already have images
  2. qualifying unillustrated pages for which IMA will attempt to generate suggestions
  3. pages IMA was able to generate suggestions for

I'm not sure I understand what distinction you're making between #1 and #2, so let's dig into that. The service doesn't know about pages in general, in an all-pages-on-a-wiki sense. It only knows about pages that are in the dataset provided to it. And the service doesn't know or care if these pages are unillustrated or if they already have images. I think of them as "under-illustrated" pages.

I guess what I'm not following is what issue arises with the data model if we include pages that already have images. If the IMA decides that an existing page that already has one or more images needs more, what breaks?

The data set contains a subset of all pages, and what defines that subset is a function of the current implementation (and it's one that seems... arbitrary, to me). If you later decide to change the criteria for what qualifies for this data set, anything that made assumptions about those criteria could break. Maybe that's nothing, I don't know.

And for what it's worth, your characterization here as "under-illustrated" is wholly new to me. Thus far, everyone I have corresponded with has referred to them as unillustrated, or as articles without any images. What really prompted me to question this is the framing of records with concrete suggestions versus those of pages that simply had no images; the latter sounded to me like it might be an un-modeled attribute of those records.

So I'd be interested in a solution that allows clients to get either of those things (pages with IMA suggestions or just pages that need images) in a pseudorandom way. I'm concerned that we may be losing that with the current proposed solution (but then again, I may be misunderstanding the proposal).

Wait, both? You want a way of pseudorandomly selecting either pages which are unillustrated, but for which there are no IMA suggestions, and pages that do have suggestions?

Almost. We want a way to pseudorandomly select:

  1. pages that have IMA suggestions
  2. under-illustrated pages, regardless of whether they have IMA suggestions or not.

The difference in what you said vs what I said is that we don't need a way to select pages WITHOUT suggestions from the IMA. We just need a way to provide the full set of under-illustrated pages to clients that want to generate their own suggestions.

Ok, let me take another stab at this then. We need:

  1. The ability to retrieve a record from the data set by its wiki and page_id attributes
  2. A way of pseudorandomly choosing from any of the records in the data set, with or without suggestions (by its wiki attribute)
  3. A way of pseudorandomly choosing records from only those that have suggestions (by its wiki attribute)
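Since the suggestions table partitions on (wiki, page_id), requirements #2 and #3 would likely need a companion table. A hedged sketch of one possible approach (the table name, bucket scheme, and flag are assumptions, not part of any agreed schema):

-- Hypothetical index table for pseudorandom selection within a wiki; rows are
-- spread across a fixed number of buckets (e.g. hash(page_id) % 64) at import.
CREATE TABLE image_suggestions.pages_by_wiki (
    wiki text,
    bucket int,
    page_id int,
    has_suggestion boolean, -- distinguishes requirement #3 from #2
    PRIMARY KEY ((wiki, bucket), page_id)
);

-- The client picks a bucket at random, then samples that partition; for
-- requirement #3 it would filter on has_suggestion (or use a second table
-- keyed the same way, restricted to pages with suggestions).
SELECT page_id, has_suggestion
FROM image_suggestions.pages_by_wiki
WHERE wiki = 'enwiki' AND bucket = 17
LIMIT 50;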

And for what it's worth, your characterization here as "under-illustrated" is wholly new to me. Thus far, everyone I have corresponded with has referred to them as unillustrated, or as articles without any images.

I may be the only person who thinks of it that way. I'm just trying to minimize assumptions.

Ok, let me take another stab at this then. We need:

  1. The ability to retrieve a record from the data set by its wiki and page_id attributes
  2. A way of pseudorandomly choosing from any of the records in the data set, with or without suggestions (by its wiki attribute)
  3. A way of pseudorandomly choosing records from only those that have suggestions (by its wiki attribute)

Yes.

For anyone who skipped to the bottom, the service can still support requesting suggestions by page title. But it will convert title to page_id outside the dataset, probably via the Action API. This carries with it a risk that pages may be renamed, so the page_title => page_id relationship may have changed after the dataset was generated. We'll document this consideration so callers are aware of this possibility.

[ ... ]
For anyone who skipped to the bottom, the service can still support requesting suggestions by page title. But it will convert title to page_id outside the dataset, probably via the Action API. This carries with it a risk that pages may be renamed, so the page_title => page_id relationship may have changed after the dataset was generated. We'll document this consideration so callers are aware of this possibility.

@BPirkle To be clear, are you talking about development environments here? Otherwise, using the Action API to map page_title to page_id will eliminate any risk of a mismatch occurring.

[ ... ]
For anyone who skipped to the bottom, the service can still support requesting suggestions by page title. But it will convert title to page_id outside the dataset, probably via the Action API. This carries with it a risk that pages may be renamed, so the page_title => page_id relationship may have changed after the dataset was generated. We'll document this consideration so callers are aware of this possibility.

@BPirkle To be clear, are you talking about development environments here? Otherwise, using the Action API to map page_title to page_id will eliminate any risk of a mismatch occurring.

Development environments (including beta) are the context for which this endpoint was created. But of course, it is hard to predict what people may use it for "in the wild".

Specifically, I was thinking of pathological situations that may not be a practical concern. For example, stuff like:

  1. algorithm executes, and records that page_id 123, by page_title Foo, is underillustrated
  2. algorithm data is loaded into Cassandra, which stores page_id 123 (but not the page_title Foo)
  3. page Foo is moved to FooBar, leaving a redirect page Foo
  4. redirect page Foo is edited to no longer be a redirect page, but instead have actual content
  5. client requests image suggestions for page_title Foo
  6. Action API tries to find a page by title Foo and "succeeds"
  7. user adds an image that was originally identified by the algorithm as appropriate for the page that is now titled FooBar to the page now titled Foo

Note that if Foo exists as a redirect, we can follow that via the Action API and find the id of the intended original page. The above is only a (theoretical) concern if Foo no longer redirects to FooBar.

I can make that sequence happen on my local dev wiki, but I don't know if it actually happens in practice on the actual projects. There may also be various other pathological situations that I'm unaware of. It is also possible, of course, that no page moves happen but page Foo changes significantly via the normal editing process between suggestion generation and the time a user is presented with that suggestion. There's a reason we called these "suggestions", and I'm not overly concerned about any of this.

All I was really advocating for was documenting that lookup by page title finds the current page by that title, which is not necessarily the same page that the suggestion was generated for.

(edited for typo)

Eevans renamed this task from Design Image Recommendations Schema to Design Image Suggestion Schema.Feb 24 2022, 12:53 AM
Eevans triaged this task as Medium priority.
Eevans updated the task description. (Show Details)
lbowmaker moved this task from QA/Review ❓ to Sign-off ✔️ on the Generated Data Platform board.

Signing off on proposal as detailed in the description after discussing with impacted teams.

Thanks for everyone's efforts in modeling this and reaching consensus.

Is the user column under the feedback table supposed to be text? The feedback event schema currently outputs a user_id instead, so I'm wondering if it's supposed to be transformed into a username or if the Cassandra table needs to be updated.

Is the user column under the feedback table supposed to be text? The feedback event schema currently outputs a user_id instead, so I'm wondering if it's supposed to be transformed into a username or if the Cassandra table needs to be updated.

It is text, but it could be (and probably would make more sense as) an int.

Is the user column under the feedback table supposed to be text? The feedback event schema currently outputs a user_id instead, so I'm wondering if it's supposed to be transformed into a username or if the Cassandra table needs to be updated.

It is text, but it could be (and probably would make more sense as) an int.

Actually, let's expound on this...

What is proposed here is to change feedback.user (type text) to feedback.user_id (type int).
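If so, the corresponding schema change would presumably look something like this (a sketch of the migration, not the reviewed patch itself):

ALTER TABLE image_suggestions.feedback DROP user;
ALTER TABLE image_suggestions.feedback ADD user_id int;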

Is this correct? /cc @Cparle @lbowmaker ... ?

Change 805175 had a related patch set uploaded (by Eevans; author: Eevans):

[generated-data-platform/datasets/image-suggestions@main] Drop feedback.user (text), add feedback.user_id (int)

https://gerrit.wikimedia.org/r/805175

Change 805175 merged by jenkins-bot:

[generated-data-platform/datasets/image-suggestions@main] Drop feedback.user (text), add feedback.user_id (int)

https://gerrit.wikimedia.org/r/805175

Change 805175 merged by jenkins-bot:

[generated-data-platform/datasets/image-suggestions@main] Drop feedback.user (text), add feedback.user_id (int)

https://gerrit.wikimedia.org/r/805175

This requires no deployment; it amounts to a documentation change, and the production DB has been updated accordingly.