Design Image Suggestion Schema
Closed, Resolved; Public

Assigned To

	Eevans

Authored By

	lbowmaker
	Oct 19 2021, 4:09 PM

Description

User Story

As a platform engineer, I need to design a database schema that allows storage of data output by the Image Suggestion process

Success Criteria

Schema stores all fields from output
Supports retrieval of data set records by project & page ID
Optionally(?) supports lookup/retrieval by project & page title
Storage accommodates bulk import of new records, expiry / deletion of stale data

Out of scope

Storage of data for training

Cassandra Storage

P21420 schema.cql

-- The recommended images dataset
CREATE TABLE image_suggestions.suggestions (
    wiki text,             -- Wikimedia project
    page_id int,           -- MediaWiki page ID attribute
    id timeuuid,           -- Globally unique, but also a timestamp; Unique to an algorithm run
    image text,            -- Image being recommended
    origin_wiki text,      -- Where the recommended image resides
    confidence float,      -- Strength of recommendation; Value in the range 0.0-1.0
    found_on set<text>,    -- Other wikis that use the image
    kind set<text>,        -- ??
    page_rev int,          -- Revision of page_id at time of recommendation (informational)
    PRIMARY KEY((wiki, page_id), id, image)
);

-- A record of user feedback, replicating whatever attributes of the corresponding
-- suggestions that is necessary.
CREATE TABLE image_suggestions.feedback(
    wiki text,             -- Corresponds to suggestions.wiki
    page_id int,           -- Corresponds to suggestions.page_id
    image text,            -- Corresponds to suggestions.image
    id timeuuid,           -- ID (& timestamp) of feedback
    origin_wiki text,      -- Corresponds to suggestions.origin_wiki
    user text,             -- User who submitted feedback
    accepted boolean,      -- True if feedback indicates acceptance
    rejected boolean,      -- True if feedback indicates rejection
    comment text,          -- User-submitted comment for a rejection
    PRIMARY KEY((wiki, page_id), image, id)
);

-- Page ID/page title mapping.

-- NOTE: This table is a duplication of a relationship that MediaWiki is canonical
-- for. It is maintained here for convenience, with the understanding that it is
-- not trustworthy (it should not be considered a source of truth).
CREATE TABLE image_suggestions.title_cache (
    wiki text,             -- Wikimedia project
    page_id int,           -- MediaWiki page ID attribute
    page_rev int,          -- Revision of page_id
    title text,            -- Title of page at corresponding page_rev
    PRIMARY KEY((wiki, title))
);

-- Values of the P31 property for the Wikidata item that corresponds with the page.

-- NOTE: This table is a duplication of a relationship that MediaWiki is canonical
-- for. It is maintained here for convenience, with the understanding that it is
-- not trustworthy (it should not be considered a source of truth).
CREATE TABLE image_suggestions.instanceof_cache (
    wiki text,             -- Wikimedia project
    page_id int,           -- MediaWiki page ID attribute
    page_rev int,          -- Revision of page_id (FIXME: shouldn't this be the Wikidata page_rev?)
    instance_of set<text>, -- P31 property values
    PRIMARY KEY((wiki, page_id))
);
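
As a rough illustration of how the suggestions primary key shapes reads, the sketch below models a partition keyed by (wiki, page_id) whose rows are ordered by (id, image), and then picks out the rows of the most recent algorithm run. It is an in-memory stand-in rather than driver code: plain integers stand in for timeuuid values, and the names `Row`, `build_partitions`, `read_partition`, and `latest_batch` are hypothetical.

```python
from collections import defaultdict
from typing import NamedTuple


class Row(NamedTuple):
    wiki: str
    page_id: int
    id: int          # stand-in for the timeuuid of an algorithm run
    image: str
    confidence: float


def build_partitions(rows):
    """Group rows by partition key (wiki, page_id), ordered by clustering key (id, image)."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[(row.wiki, row.page_id)].append(row)
    for part in partitions.values():
        part.sort(key=lambda r: (r.id, r.image))
    return partitions


def read_partition(partitions, wiki, page_id):
    """Return every row stored for a page: all retained runs, oldest first."""
    return list(partitions.get((wiki, page_id), []))


def latest_batch(rows):
    """Keep only the rows that belong to the most recent algorithm run."""
    if not rows:
        return []
    newest = max(r.id for r in rows)
    return [r for r in rows if r.id == newest]


rows = [
    Row("enwiki", 42, 1, "Old.jpg", 0.4),
    Row("enwiki", 42, 2, "B.jpg", 0.7),
    Row("enwiki", 42, 2, "A.jpg", 0.9),
    Row("dewiki", 42, 2, "C.jpg", 0.5),
]
partitions = build_partitions(rows)
page_rows = read_partition(partitions, "enwiki", 42)
print([r.image for r in latest_batch(page_rows)])
```

A read by (wiki, page_id) returns rows from every retained run, so a client that only wants the newest suggestions filters on the largest id.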

Proposal

From a Product perspective, the suggestions table is The Dataset (i.e., it could appear in a catalog of published datasets for reuse). It establishes a one-to-many relationship between a page (identified by the (wiki, page_id) tuple) and an arbitrary number of suggestion IDs (type 1 UUIDs). As this dataset is the product of a batch analytics job that is generated periodically, each suggestion ID (suggestions.id) corresponds to one batch run. There is in turn a one-to-many relationship between a suggestion ID and the images suggested, each with its corresponding attributes.
The feedback table keeps a record of user-supplied feedback for image suggestions. It is considered application state for Structured Data & Growth's use-cases, and not a part of the image suggestions dataset.
The title_cache and instanceof_cache tables store attribute relationships that are canonically modeled in other systems (MediaWikis), and are only maintained here for convenience.
Retention of data in the suggestions table will be managed by TTLs. The TTL will be a multiple of the update interval, long enough to retain some historical results (and to provide a buffer against late or missing batch jobs) while keeping result sets bounded for performance, since clients receive the full result set even when they only require the most recent.
Since joins between these tables are not possible, multiple queries will be needed in some scenarios. For example, if the relevant image suggestions are those without feedback, then suggestions and feedback will need to be queried separately (the two queries can run concurrently) and the set difference computed from the results. Lookups by page title will first require a query against title_cache to find the page ID, followed by a query against suggestions.
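
The multi-query pattern described here might look roughly like the following. Dictionaries stand in for the three tables, and every function and variable name is hypothetical; a real client would issue the reads against Cassandra and compute the difference from the two result sets.

```python
# In-memory stand-ins for the tables; keys mirror each table's partition key.
title_cache = {("enwiki", "Foo"): 123}
suggestions = {("enwiki", 123): {"A.jpg", "B.jpg", "C.jpg"}}
feedback = {("enwiki", 123): {"B.jpg"}}


def suggestions_without_feedback(wiki, page_id):
    """Two independent reads followed by a set difference on the results."""
    suggested = suggestions.get((wiki, page_id), set())
    reviewed = feedback.get((wiki, page_id), set())
    return suggested - reviewed


def suggestions_by_title(wiki, title):
    """Resolve the title to a page ID first, then query by (wiki, page_id)."""
    page_id = title_cache.get((wiki, title))
    if page_id is None:
        return set()
    return suggestions_without_feedback(wiki, page_id)


print(sorted(suggestions_by_title("enwiki", "Foo")))
```

The two reads in `suggestions_without_feedback` do not depend on each other, which is what allows them to be performed concurrently.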

Details

	Subject	Repo	Branch	Lines +/-
	Drop feedback.user (text), add feedback.user_id (int)	generated-data-platform/datasets/image-suggestions	main	+6 -6

Related Objects

		Status	Subtype	Assigned	Task
		Resolved		lbowmaker	T293807 Data Persistence for Image Suggestions
		Resolved		Eevans	T293808 Design Image Suggestion Schema

Event Timeline

lbowmaker created this task. Oct 19 2021, 4:09 PM

lbowmaker removed a subtask: T293809: Define Capacity Management Process. Oct 19 2021, 4:24 PM

lbowmaker updated the task description. Oct 19 2021, 5:35 PM

lbowmaker updated the task description. Oct 19 2021, 6:06 PM

lbowmaker reassigned this task from lbowmaker to Eevans. Oct 19 2021, 6:16 PM

lbowmaker added a subscriber: Eevans.

lbowmaker moved this task from Backlog to Work in Progress ⚙️ on the Generated Data Platform board. Oct 19 2021, 6:22 PM

Based on the TSV files in imagerec_prod.tar.bz2, the dataset seems to consist of the following:

attribute	(implied) type	comment
page_id	int	Monotonically increasing integer, unique per wiki; The MediaWiki primary key
page_title	text	Textual name for a page
image_id	text	Textual name (filename) of an image
confidence_rating	text	One of `low`, `medium`, or `high`
source	text	Where the image lives
dataset_id	uuid	Data generation "version": Globally unique identifier of the job responsible for this dataset
insertion_ts	double(?)	Timestamp of insertion
wiki	text	Wiki (project/site) containing the corresponding page_id
found_on	text	Wikis (projects/sites) this image currently appears on

Some questions:

Is page_title required? Assuming it's an extension (read: MediaWiki) that's consuming this, the page's title is something the client-side would already have, rendering this an unnecessary duplication. Additionally, a page's title can change, which could make what we're storing here incorrect (read: duplication is Bad™).
Judging by the enwiki TSV file, only ~8.6% of the rows have an image, is this correct? If so, are we expecting to store entries without an image (read: without a valid recommendation)?

@Clarakosi @gmodena - Are you able to answer the first question above? Seems like there isn't a good reason to store page title? Not sure if there was any reasoning to it in the original requirements.

Second question, please see this thread here. I asked the same question and Gabriele confirmed the client team explicitly asked for these to be stored.

In T293808#7449369, @lbowmaker wrote:

[ ... ]

Second question, please see this thread here. I asked the same question and Gabriele confirmed the client team explicitly asked for these to be stored.

Thanks; Since I don't guess everyone can access that...

Looking at the Superset link referenced earlier in that thread, is it fair to assume it is canonical, and imagerec_prod.tar.bz2 is out of date?

Based on this Superset query, the dataset seems to consist of the following:

attribute	(implied) type	comment
page_id	int	Monotonically increasing integer, unique per wiki; The MediaWiki primary key
page_title	text	Textual name for a page
image_id	text	Textual name (filename) of an image
confidence_rating	text	One of `low`, `medium`, or `high`
source	text	Where the image lives
dataset_id	uuid	Data generation "version": Globally unique identifier of the job responsible for this dataset
insertion_ts	double(?)	Timestamp of insertion
wiki	text	Wiki (project/site) containing the corresponding page_id
found_on	text	Wikis (projects/sites) this image currently appears on
instance_of	text	Wikidata ID
is_article_page	bool

In T293808#7449369, @lbowmaker wrote:

@Clarakosi @gmodena - Are you able to answer the first question above? Seems like there isn't a good reason to store page title? Not sure if there was any reasoning to it in the original requirements.

I believe this was an ask for the proof of concept. I doubt we need it moving forward and can probably just have the API populate that field if client teams need it.

In T293808#7449413, @Eevans wrote:

...
Looking at the Superset link referenced earlier in that thread, is it fair to assume it is canonical, and imagerec_prod.tar.bz2 is out of date?

They are from the same algorithm run; it's just that imagerec_prod.tar.bz2 has filtered out articles that should not have images (see T276137).

In T293808#7449361, @Eevans wrote:

Based on the TSV files in imagerec_prod.tar.bz2, the dataset seems to consist of the following:

[...]

Judging by the enwiki TSV file, only ~8.6% of the rows have an image, is this correct? If so, are we expecting to store entries without an image (read: without a valid recommendation)?

That's correct, and use case specific. For the Structured Data PoC, the API team expected a dataset with

a list of all unillustrated articles detected on a wiki.
at most three candidate images that match an unillustrated article.

An empty image_ids denotes the case of "unillustrated article with no recommendations". This semantic was required (IIRC) to compare this dataset with Elasticsearch (MediaSearch) result sets. See https://phabricator.wikimedia.org/T274798.

As for the page_title, we should maybe revisit this requirement with client teams.

In T293808#7449447, @Eevans wrote:

Based on this Superset query, the dataset seems to consist of the following:

[...]

Superset exposes staged datasets (imagerec, imagerec_prod) that are used to create use-case specific materialised views. Both are meant for internal use.

From imagerec_prod we generate three datasets, shared with client teams through a file-based API (analytics.wikimedia.org or HDFS).

Data for the Structured Data use case was exported with this query: https://github.com/mirrys/ImageMatching/blob/main/ddl/export_prod_data.hql
Data for the Android use case was exported with this query: https://github.com/mirrys/ImageMatching/blob/main/ddl/export_prod_data-android.hql
Data for the Search use case was exported with this query: https://github.com/mirrys/ImageMatching/blob/main/ddl/external_search_imagerec.hql

The schema you reference in https://phabricator.wikimedia.org/T293808#7449361 refers to the Structured Data use case, which I'd consider the canonical dataset for our modelling exercise.

gmodena mentioned this in T293256: Migrate database from SQLite to MySQL. Oct 25 2021, 2:49 PM

akosiaris subscribed. Oct 25 2021, 2:50 PM

In T293808#7454112, @gmodena wrote:

In T293808#7449361, @Eevans wrote:

[...]

Judging by the enwiki TSV file, only ~8.6% of the rows have an image, is this correct? If so, are we expecting to store entries without an image (read: without a valid recommendation)?

That's correct, and use case specific. For the Structured Data PoC, the API team expected a dataset with

a list of all unillustrated articles detected on a wiki.

at most three candidate images that match an unillustrated article.

An empty image_ids denotes the case of "unillustrated article with no recommendations". This semantic was required (IIRC) to compare this dataset with Elasticsearch (MediaSearch) result sets. See https://phabricator.wikimedia.org/T274798.

I'm not sure whether this is the hill I want to die on, but this doesn't seem right to me. At least, if you believe that what we're trying to do here is model image recommendations (in the abstract sense), rather than simply persisting what this implementation currently produces.

Abstractly, a recommendation is something that could apply to any article on a wiki (sans a few exceptions), regardless of whether it is currently illustrated or not. Creating an implicit (read: un-modeled) distinction between articles with or without images like this would seem to create an unnecessary coupling to this (version of the) recommendation algorithm.

I would propose the following:

page_id	int	Monotonically increasing integer, unique per wiki; The MediaWiki primary key
~~page_title~~	~~text~~	~~Textual name for a page~~
image_id	text	Textual name (filename) of an image
confidence_rating	float	~~One of `low`, `medium`, or `high`~~
source	text	Where the image lives
dataset_id	timeuuid	Data generation "version" & timestamp ; A type 1 UUID
insertion_ts	timestamp	Timestamp of insertion
wiki	text	Wiki (project/site) containing the corresponding page_id
found_on	text	Wikis (projects/sites) this image currently appears on

Eliminating the page_title, on the basis that it is a duplication (rather than reference)
Storing confidence_rating as a float, as a guard against future requirements for more granularity
Using a type 1 UUID for dataset_id (can double as a timestamp)
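
To illustrate how a type 1 UUID can double as a timestamp, here is a minimal sketch using only the Python standard library. The helper names `timeuuid_from_datetime` and `datetime_from_timeuuid` are hypothetical, and the zero node and clock-sequence values are placeholders chosen to keep the example deterministic; a real job would let its UUID library fill those in.

```python
import uuid
from datetime import datetime, timedelta, timezone

# Number of 100-ns intervals between 1582-10-15 (the UUID epoch) and 1970-01-01.
UUID_EPOCH_OFFSET = 0x01B21DD213814000
UNIX_EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def timeuuid_from_datetime(dt, node=0, clock_seq=0):
    """Build a version 1 UUID whose timestamp field encodes `dt`."""
    delta = dt - UNIX_EPOCH
    ticks = (delta.days * 86400 + delta.seconds) * 10_000_000 + delta.microseconds * 10
    ticks += UUID_EPOCH_OFFSET
    time_low = ticks & 0xFFFFFFFF
    time_mid = (ticks >> 32) & 0xFFFF
    time_hi_version = ((ticks >> 48) & 0x0FFF) | 0x1000       # version 1
    clock_seq_hi_variant = ((clock_seq >> 8) & 0x3F) | 0x80   # RFC 4122 variant
    clock_seq_low = clock_seq & 0xFF
    return uuid.UUID(fields=(time_low, time_mid, time_hi_version,
                             clock_seq_hi_variant, clock_seq_low, node))


def datetime_from_timeuuid(u):
    """Recover the timestamp embedded in a version 1 UUID."""
    unix_ticks = u.time - UUID_EPOCH_OFFSET
    return UNIX_EPOCH + timedelta(microseconds=unix_ticks // 10)


run_started = datetime(2021, 11, 1, 18, 57, tzinfo=timezone.utc)
dataset_id = timeuuid_from_datetime(run_started)
print(dataset_id.version, datetime_from_timeuuid(dataset_id).isoformat())
```

Because the generation time is recoverable from the identifier itself, dataset_id can order algorithm runs chronologically without consulting a separate column.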

Eevans updated the task description. Oct 25 2021, 8:11 PM

In T293808#7455801, @Eevans wrote:

I would propose the following:

page_id int Monotonically increasing integer, unique per wiki; The MediaWiki primary key

~~page_title~~ ~~text~~ ~~Textual name for a page~~

image_id text Textual name (filename) of an image

confidence_rating float ~~One of low, medium, or high~~

source text Where the image lives

dataset_id timeuuid Data generation "version" & timestamp ; A type 1 UUID

insertion_ts timestamp Timestamp of insertion

wiki text Wiki (project/site) containing the corresponding page_id

found_on text Wikis (projects/sites) this image currently appears on

Eliminating the page_title, on the basis that it is a duplication (rather than reference)

Storing confidence_rating as a float, as a guard against future requirements for more granularity

Using a type 1 UUID for dataset_id (can double as a timestamp)

Just noting that we currently use page_title for the endpoint added in rMSIS89adfa11057b: Add /:wiki/:lang/pages/:title path. We could still drop it from the proposed schema, but the application code would need to be updated to find a page ID when given a page title.

@kostajh - what would happen if a page title changes between the image rec output and someone viewing the image rec then calling the API?

In T293808#7471788, @lbowmaker wrote:

@kostajh - what would happen if a page title changes between the image rec output and someone viewing the image rec then calling the API?

I should have clarified in my previous comment – we're only using page titles with the API for local development and beta wikis, where it's not (easily) possible to get the page ID in those wikis to match their production equivalents (e.g. page "Foo" on my local wiki has ID 1010 but on enwiki it is ID 9132808). So page title renames wouldn't really be a problem that anyone has to spend engineering effort on; if I want recommendations for page "Foo" in my local wiki and that's renamed to "FooBar" in production, then I'll just rename it in my local wiki.

In T293808#7471802, @kostajh wrote:

In T293808#7471788, @lbowmaker wrote:

@kostajh - what would happen if a page title changes between the image rec output and someone viewing the image rec then calling the API?

I should have clarified in my previous comment – we're only using page titles with the API for local development and beta wikis, where it's not (easily) possible to get the page ID in those wikis to match their production equivalents (e.g. page "Foo" on my local wiki has ID 1010 but on enwiki it is ID 9132808). So page title renames wouldn't really be a problem that anyone has to spend engineering effort on; if I want recommendations for page "Foo" in my local wiki and that's renamed to "FooBar" in production, then I'll just rename it in my local wiki.

I'm still unclear here; By-title lookups are something that are planned for production, yes... or is it only something you're using during development? If it is supported in production, and titles were persisted in the dataset (as they are now), what happens when a page is renamed (and the dataset does not reflect this)?

Eevans updated the task description. Nov 1 2021, 6:57 PM

Eevans updated the task description. Nov 1 2021, 7:18 PM

Eevans updated the task description. Nov 1 2021, 7:27 PM

Eevans updated the task description. Nov 1 2021, 7:33 PM

@kostajh and @Eevans : regarding page titles, regardless of what we intend, if something is publicly available for production, then people may invent their own uses for it, which we may then find ourselves obligated to support. I don't have an objection to any specific proposed implementation, and I definitely want to support local/beta development in whatever way we reasonably can. But if whatever implementation we choose has limitations or "gotchas", let's be sure we understand and document them. Maybe that's as simple as setting expectations by noting in the documentation that "page titles may change without warning and any request by page title will attempt to reference suggestions for the current page by that title".

@Eevans , I share your discomfort with the current approach where some suggestions lack, well, suggestions. The empty image_id thing was always a bit hacky, and if we can do better as we move from an experimental prototype phase to something more resembling a real production service, I'm all in favor of it. But it does seem to me that "pages in need of an image" is a valuable and useful set of data regardless of whether we have Image Matching Algorithm suggestions for all those pages. And some clients have specifically requested pages from that broader set, so that they can use MediaSearch suggestions rather than Image Matching Algorithm suggestions. So I'd be interested in a solution that allows clients to get either of those things (pages with IMA suggestions or just pages that need images) in a pseudorandom way. I'm concerned that we may be losing that with the current proposed solution (but then again, I may be misunderstanding the proposal).

One might then argue that the "Image Suggestions API" is poorly named, if one of the things it provides is pages with no suggestions. Names are hard.

Along those lines, during development we got tired of inconsistency between "recommendation" and "suggestions". Within the team that implemented the service, we agreed to use the term "suggestion". I notice that the proposal uses the term "recommendation". I don't object to either word, and if other parts of the pipeline use "recommendation" I'm not averse to renaming the service. But it might help all our long-term sanity to stick to one word or the other.

FWIW, I'm extremely uncomfortable with the current way the service implements MediaSearch suggestions - it doesn't seem to scale or cache well - so I'd be very supportive of removing that functionality from the service. IMO, if a client wants to get suggestions from something other than the IMA, the service's responsibility ends with providing pages that need images. The client can then get its own suggestions however it likes (MediaSearch or whatever). That may be inconvenient for some clients, but the point of our experiment was to learn things. One of the things I learned is that the way we handled MediaSearch results was pretty bad.

In T293808#7473320, @BPirkle wrote:

[ ... ]

@Eevans , I share your discomfort with the current approach where some suggestions lack, well, suggestions. The empty image_id thing was always a bit hacky, and if we can do better as we move from an experimental prototype phase to something more resembling a real production service, I'm all in favor of it. But it does seem to me that "pages in need of an image" is a valuable and useful set of data regardless of whether we have Image Matching Algorithm suggestions for all those pages. And some clients have specifically requested pages from that broader set, so that they can use MediaSearch suggestions rather than Image Matching Algorithm suggestions.

So, we have pages, almost any of which could have ~~recommendations~~ suggestions, even ones that already have images. Then we have (qualifying) pages that are unillustrated, for which IMA will attempt to generate suggestions. And finally, we have those for which it was successfully able to do so. If we are saying that this data set models unillustrated pages, and any corresponding image suggestions IMA was able to make, then we're OKish (the wisdom of distinguishing between the latter by a non-nil suggestion notwithstanding). If however we later make refinements to IMA, or add one or more additional suggestion algorithms, any of which is able to make suggestions for already illustrated pages, then we'll have a data model unable to make that distinction. I'm willing to cross that bridge when we come to it if everyone else is; I just wanted to point it out.

So I'd be interested in a solution that allows clients to get either of those things (pages with IMA suggestions or just pages that need images) in a pseudorandom way. I'm concerned that we may be losing that with the current proposed solution (but then again, I may be misunderstanding the proposal).

Wait, both? You want a way of pseudorandomly selecting either pages which are unillustrated, but for which there are no IMA suggestions, and pages that do have suggestions?

In T293808#7473382, @Eevans wrote:

In T293808#7473320, @BPirkle wrote:

[ ... ]

@Eevans , I share your discomfort with the current approach where some suggestions lack, well, suggestions. The empty image_id thing was always a bit hacky, and if we can do better as we move from an experimental prototype phase to something more resembling a real production service, I'm all in favor of it. But it does seem to me that "pages in need of an image" is a valuable and useful set of data regardless of whether we have Image Matching Algorithm suggestions for all those pages. And some clients have specifically requested pages from that broader set, so that they can use MediaSearch suggestions rather than Image Matching Algorithm suggestions.

So, we have pages, almost any of which could have ~~recommendations~~ suggestions, even ones that already have images. Then we have (qualifying) pages that are unillustrated, for which IMA will attempt to generate suggestions. And finally, we have those for which it was successfully able to do so. If we are saying that this data set models unillustrated pages, and any corresponding image suggestions IMA was able to make, then we're OKish (the wisdom of distinguishing between the latter by a non-nil suggestion notwithstanding). If however we later make refinements to IMA, or add one or more additional suggestion algorithms, any of which is able to make suggestions for already illustrated pages, then we'll have a data model unable to make that distinction. I'm willing to cross that bridge when we come to it if everyone else is; I just wanted to point it out.

To restate what you said (hopefully fairly), we have:

pages, almost any of which could have suggestions, even ones that already have images
qualifying unillustrated pages for which IMA will attempt to generate suggestions
pages IMA was able to generate suggestions for

I'm not sure I understand what distinction you're making between #1 and #2, so let's dig into that. The service doesn't know about pages in general, in an all-pages-on-a-wiki sense. It only knows about pages that are in the dataset provided to it. And the service doesn't know or care if these pages are unillustrated or if they already have images. I think of them as "under-illustrated" pages.

I guess what I'm not following is what issue arises with the data model if we include pages that already have images. If the IMA decides that an existing page that already has one or more images needs more, what breaks?

So I'd be interested in a solution that allows clients to get either of those things (pages with IMA suggestions or just pages that need images) in a pseudorandom way. I'm concerned that we may be losing that with the current proposed solution (but then again, I may be misunderstanding the proposal).

Wait, both? You want a way of pseudorandomly selecting either pages which are unillustrated, but for which there are no IMA suggestions, and pages that do have suggestions?

Almost. We want a way to pseudorandomly select:

pages that have IMA suggestions
under-illustrated pages, regardless of whether they have IMA suggestions or not.

The difference in what you said vs what I said is that we don't need a way to select pages WITHOUT suggestions from the IMA. We just need a way to provide the full set of under-illustrated pages to clients that want to generate their own suggestions.

Sorry that I didn't make that clear in our previous discussion.

My understanding and recollection is that IMA originally only generated #1. Then clients asked for #2, so IMA was extended to include that data in the .tsv files. And that's how we ended up with the empty image_id ~~hack~~ solution. @gmodena , do I have that right?

For that second use case, the service currently attempts to get MediaSearch suggestions, but IMO that was a Bad Idea and we should revisit how the client and service interact going forward. However, I don't think that whether the service does the MediaSearch queries or pushes those to the client impacts the data model. So I'm happy to ignore the MediaSearch bits for the purposes of this task, and negotiate that elsewhere with the affected people. I mostly mention the MediaSearch part as a real-world example of why clients requested this functionality from us. Otherwise, you'd probably (reasonably) ask "why the heck would you want both"?

In T293808#7473451, @BPirkle wrote:

In T293808#7473382, @Eevans wrote:

[ ... ]
So, we have pages, almost any of which could have ~~recommendations~~ suggestions, even ones that already have images. Then we have (qualifying) pages that are unillustrated, for which IMA will attempt to generate suggestions. And finally, we have those for which it was successfully able to do so. If we are saying that this data set models unillustrated pages, and any corresponding image suggestions IMA was able to make, then we're OKish (the wisdom of distinguishing between the latter by a non-nil suggestion notwithstanding). If however we later make refinements to IMA, or add one or more additional suggestion algorithms, any of which is able to make suggestions for already illustrated pages, then we'll have a data model unable to make that distinction. I'm willing to cross that bridge when we come to it if everyone else is; I just wanted to point it out.

To restate what you said (hopefully fairly), we have:

pages, almost any of which could have suggestions, even ones that already have images

qualifying unillustrated pages for which IMA will attempt to generate suggestions

pages IMA was able to generate suggestions for

I'm not sure I understand what distinction you're making between #1 and #2, so let's dig into that. The service doesn't know about pages in general, in an all-pages-on-a-wiki sense. It only knows about pages that are in the dataset provided to it. And the service doesn't know or care if these pages are unillustrated or if they already have images. I think of them as "under-illustrated" pages.

I guess what I'm not following is what issue arises with the data model if we include pages that already have images. If the IMA decides that an existing page that already has one or more images needs more, what breaks?

The data set contains a subset of all pages, and what defines that subset is a function of the current implementation (and it's one that seems...arbitrary, to me). If you later decide to change the criteria between any old page, versus one that qualifies for this data set, anything that made assumptions about that criteria could break. Maybe that's nothing, I don't know.

And for what it's worth, your characterization here as "under-illustrated" is wholly new to me. Thus far, everyone I have corresponded with has referred to them either as unillustrated or as articles without any images. What really prompted me to question this is the framing as records with concrete suggestions versus records for pages that just had no images. The latter sounded to me like it might be an un-modeled attribute of those records.

So I'd be interested in a solution that allows clients to get either of those things (pages with IMA suggestions or just pages that need images) in a pseudorandom way. I'm concerned that we may be losing that with the current proposed solution (but then again, I may be misunderstanding the proposal).

Wait, both? You want a way of pseudorandomly selecting either pages which are unillustrated, but for which there are no IMA suggestions, and pages that do have suggestions?

Almost. We want a way to pseudorandomly select:

pages that have IMA suggestions

under-illustrated pages, regardless of whether they have IMA suggestions or not.

The difference in what you said vs what I said is that we don't need a way to select pages WITHOUT suggestions from the IMA. We just need a way to provide the full set of under-illustrated pages to clients that want to generate their own suggestions.

Ok, let me take another stab at this then. We need:

The ability to retrieve a record from the data set by its wiki and page_id attributes
A way of pseudorandomly choosing from any of the records in the data set, with, or without suggestions (by its wiki attribute)
A way of pseudorandomly choosing records from only those that have suggestions (by its wiki attribute)
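
The three requirements above can be stated as a small in-memory sketch. This only pins down the expected behaviour; it says nothing about how the selection would be implemented efficiently against Cassandra, and the names `dataset`, `get_record`, and `random_pages` are hypothetical.

```python
import random

# Hypothetical records keyed by (wiki, page_id); an empty list means the page
# is in the data set but has no suggestions.
dataset = {
    ("enwiki", 1): ["A.jpg"],
    ("enwiki", 2): [],
    ("enwiki", 3): ["B.jpg", "C.jpg"],
    ("dewiki", 1): [],
}


def get_record(wiki, page_id):
    """Requirement 1: retrieve a record by its wiki and page_id attributes."""
    return dataset.get((wiki, page_id))


def random_pages(wiki, count, only_with_suggestions=False, rng=random):
    """Requirements 2 and 3: pseudorandomly choose page IDs for one wiki,
    optionally restricted to records that have suggestions."""
    candidates = [
        page_id
        for (w, page_id), images in dataset.items()
        if w == wiki and (images or not only_with_suggestions)
    ]
    return rng.sample(candidates, min(count, len(candidates)))


print(get_record("enwiki", 3))
print(sorted(random_pages("enwiki", 10, only_with_suggestions=True)))
```

Passing a seeded `random.Random` instance as `rng` makes the selection repeatable, which can be convenient in tests.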

nnikkhoui subscribed. Nov 3 2021, 1:43 PM

In T293808#7473553, @Eevans wrote:

And for what it's worth, your characterization here as "under-illustrated" is wholly new to me. Thus far, everyone I have corresponded with has either referred to them as unillustrated or articles without any images.

I may be the only person who thinks of it that way. I'm just trying to minimize assumptions.

Ok, let me take another stab at this then. We need:

The ability to retrieve a record from the data set by its wiki and page_id attributes

A way of pseudorandomly choosing from any of the records in the data set, with, or without suggestions (by its wiki attribute)

A way of pseudorandomly choosing records from only those that have suggestions (by its wiki attribute)

Yes.

For anyone who skipped to the bottom, the service can still support requesting suggestions by page title. But it will convert title to page_id outside the dataset, probably via the Action API. This carries with it a risk that pages may be renamed, so the page_title => page_id relationship may have changed after the dataset was generated. We'll document this consideration so callers are aware of this possibility.

In T293808#7477730, @BPirkle wrote:

[ ... ]
For anyone who skipped to the bottom, the service can still support requesting suggestions by page title. But it will convert title to page_id outside the dataset, probably via the Action API. This carries with it a risk that pages may be renamed, so the page_title => page_id relationship may have changed after the dataset was generated. We'll document this consideration so callers are aware of this possibility.

@BPirkle To be clear, are you talking about development environments here? Otherwise, using the Action API to map page_title to page_id will eliminate any risk of a mismatch occurring.

Eevans updated the task description. Nov 8 2021, 7:41 PM

lbowmaker moved this task from Work in Progress ⚙️ to QA/Review ❓ on the Generated Data Platform board. Nov 9 2021, 4:03 PM

lbowmaker mentioned this in T295405: Implement Image Suggestions Schema in Cassandra. Nov 9 2021, 8:57 PM

Eevans updated the task description. Nov 10 2021, 3:01 PM

In T293808#7479064, @Eevans wrote:

In T293808#7477730, @BPirkle wrote:

[ ... ]
For anyone who skipped to the bottom, the service can still support requesting suggestions by page title. But it will convert title to page_id outside the dataset, probably via the Action API. This carries with it a risk that pages may be renamed, so the page_title => page_id relationship may have changed after the dataset was generated. We'll document this consideration so callers are aware of this possibility.

@BPirkle To be clear, are you talking about development environments here? Otherwise, using the Action API to map page_title to page_id will eliminate any risk of a mismatch occurring.

Development environments (including beta) are the context for which this endpoint was created. But of course, it is hard to predict what people may use it for "in the wild".

Specifically, I was thinking of pathological situations that may not be a practical concern. For example, stuff like:

1. algorithm executes, and records that page_id 123, with page_title Foo, is underillustrated
2. algorithm data is loaded into Cassandra, which stores page_id 123 (but not the page_title Foo)
3. page Foo is moved to FooBar, leaving a redirect page Foo
4. redirect page Foo is edited to no longer be a redirect page, but instead have actual content
5. client requests image suggestions for page_title Foo
6. Action API tries to find a page by title Foo and "succeeds"
7. user adds, to the page now titled Foo, an image that the algorithm originally identified as appropriate for the page now titled FooBar

Note that if Foo exists as a redirect, we can follow that via the Action API and find the id of the intended original page. The above is only a (theoretical) concern if Foo no longer redirects to FooBar.
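
The title-resolution part of that sequence can be played out in a toy model. This is only a sketch: `ToyWiki` is a hypothetical in-memory stand-in, not MediaWiki or the Action API, and it assumes that a move keeps the page_id with the moved page while the leftover redirect gets a new one.

```python
from typing import Dict, Optional


class ToyWiki:
    """Hypothetical in-memory wiki, just enough to model moves and redirects."""

    def __init__(self) -> None:
        self._next_id = 1
        self.title_to_id: Dict[str, int] = {}
        self.redirects: Dict[int, str] = {}  # redirect page_id -> target title

    def create(self, title: str) -> int:
        page_id = self._next_id
        self._next_id += 1
        self.title_to_id[title] = page_id
        return page_id

    def move(self, old: str, new: str) -> None:
        """Rename a page (keeping its page_id) and leave a redirect behind."""
        self.title_to_id[new] = self.title_to_id.pop(old)
        self.redirects[self.create(old)] = new

    def replace_redirect_with_content(self, title: str) -> None:
        """Turn a redirect page into an ordinary page with its own content."""
        self.redirects.pop(self.title_to_id[title], None)

    def resolve(self, title: str) -> Optional[int]:
        """Look up a title, following a single level of redirect."""
        page_id = self.title_to_id.get(title)
        if page_id in self.redirects:
            return self.title_to_id.get(self.redirects[page_id])
        return page_id


wiki = ToyWiki()
original_id = wiki.create("Foo")          # the page the suggestion was generated for
wiki.move("Foo", "FooBar")                # Foo is now a redirect to FooBar
via_redirect = wiki.resolve("Foo")        # still reaches the original page
wiki.replace_redirect_with_content("Foo")
after_edit = wiki.resolve("Foo")          # now a different page entirely
```

While the redirect exists, `via_redirect` equals `original_id`; once Foo has its own content, `after_edit` identifies a different page than the one the suggestion was generated for.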

I can make that sequence happen on my local dev wiki, but I don't know if it actually happens in practice on the actual projects. There may also be various other pathological situations that I'm unaware of. It is also possible, of course, that no page moves happen but page Foo changes significantly via the normal editing process between suggestion generation and the time a user is presented with that suggestion. There's a reason we called these "suggestions", and I'm not overly concerned about any of this.

All I was really advocating for was documenting that lookup by page title finds the current page by that title, which is not necessarily the same page that the suggestion was generated for.

(edited for typo)

lbowmaker mentioned this in T296758: Implement Cassandra Data Loader in Airflow. Nov 30 2021, 4:38 PM

Eevans mentioned this in T293809: Define Capacity Management Process. Dec 17 2021, 1:11 AM

Cparle subscribed. Feb 10 2022, 11:50 AM

LSobanski subscribed. Feb 10 2022, 3:41 PM

Tgr subscribed. Feb 22 2022, 4:46 PM

lbowmaker moved this task from QA/Review ❓ to Work in Progress ⚙️ on the Generated Data Platform board. Feb 23 2022, 1:40 PM

Eevans renamed this task from Design Image Recommendations Schema to Design Image Suggestion Schema. Feb 24 2022, 12:53 AM

Eevans triaged this task as Medium priority.

Eevans updated the task description.

Eevans updated the task description. Feb 24 2022, 1:48 AM

Eevans updated the task description. Feb 25 2022, 7:57 PM

Eevans updated the task description. Feb 25 2022, 8:10 PM

Eevans updated the task description. Feb 25 2022, 8:40 PM

Eevans updated the task description. Mar 1 2022, 9:00 PM

Eevans updated the task description. Mar 2 2022, 8:10 PM

Eevans updated the task description. Mar 2 2022, 8:15 PM

Eevans mentioned this in T294468: [SPIKE] Decide on best approach for API access to Cassandra. Mar 2 2022, 9:50 PM

Signing off on proposal as detailed in the description after discussing with impacted teams.

Thanks for everyone's efforts in modeling this and reaching consensus.

lbowmaker moved this task from Sign-off ✔️ to Done 🎊 on the Generated Data Platform board. Mar 3 2022, 2:55 PM

Cparle mentioned this in T299885: [L] Push unillustrated articles with their suggestions, suggestion reasons and confidence scores to Cassandra. Mar 14 2022, 5:31 PM

Cparle mentioned this in T295369: Exclude biographies from image suggestions notifications. Mar 14 2022, 5:44 PM

Is the user column under the feedback table supposed to be text? The feedback event schema currently outputs a user_id instead so I'm wondering if it's supposed to be transformed into a username or if the Cassandra table needs to be updated

In T293808#7999077, @tchin wrote:

Is the user column under the feedback table supposed to be text? The feedback event schema currently outputs a user_id instead so I'm wondering if it's supposed to be transformed into a username or if the Cassandra table needs to be updated

It is text, but it could be an int (and that probably makes more sense).

In T293808#7999119, @Eevans wrote:

In T293808#7999077, @tchin wrote:

Is the user column under the feedback table supposed to be text? The feedback event schema currently outputs a user_id instead so I'm wondering if it's supposed to be transformed into a username or if the Cassandra table needs to be updated

It is text, but it could be an int (and that probably makes more sense).

Actually, let's expound on this...

What is proposed here is to change feedback.user (type text) to feedback.user_id (type int).

Is this correct? /cc @Cparle @lbowmaker ... ?

Change 805175 had a related patch set uploaded (by Eevans; author: Eevans):

[generated-data-platform/datasets/image-suggestions@main] Drop feedback.user (text), add feedback.user_id (int)

https://gerrit.wikimedia.org/r/805175

gerritbot added a project: Patch-For-Review. Jun 13 2022, 3:47 PM

Change 805175 merged by jenkins-bot:

[generated-data-platform/datasets/image-suggestions@main] Drop feedback.user (text), add feedback.user_id (int)

https://gerrit.wikimedia.org/r/805175

Eevans mentioned this in rGDISb1012ae04e8c: Drop feedback.user (text), add feedback.user_id (int). Jun 14 2022, 8:24 PM

Maintenance_bot removed a project: Patch-For-Review. Jun 14 2022, 8:30 PM

In T293808#8004000, @gerritbot wrote:

Change 805175 merged by jenkins-bot:

[generated-data-platform/datasets/image-suggestions@main] Drop feedback.user (text), add feedback.user_id (int)

https://gerrit.wikimedia.org/r/805175

This requires no deployment; it amounts to a documentation change, and the production DB has been updated accordingly.

Cparle mentioned this in T311220: title_cache endpoint for image suggestions api doesn't work. Jun 23 2022, 10:43 AM

lbowmaker closed this task as Resolved. Aug 26 2022, 2:26 PM

Eevans mentioned this in T313973: GrowthExperiments\NewcomerTasks\AddImage\ServiceImageRecommendationProvider::get Unable to decode JSON response for page {title} upstream connect error or disconnect/reset before headers. reset reason: connection termination. Aug 31 2022, 8:43 PM

	F34705143: image.png
	Oct 21 2021, 8:02 PM
