Define the lifecycle for MV-generated depicts suggestion data
Closed, Resolved | Public

Description

Machine vision-generated labels for Commons images will be requested from one or more MV providers. We need to answer the following lifecycle questions about this data:

  • How long should we retain candidates? Forever, or can they be dropped after a certain time?
    • This will depend in part on the requirements for promotion to SDC and model feedback.
  • How does this affect how candidates should be stored?
  • Should previously fetched data be refreshed, and if so, when?

Event Timeline

From Adam over email:

Are there use cases for holding onto the labels for some interval after the first human verification? My gut intuition is that we probably want to stash the latest retrieved label data indefinitely, even if the original data from the paid services is only internet-exposed until a human verification (plus some lag, maybe, to deal with the risk of vandalism?), but in addition to just needing to deal with up-front storage planning.

I'll just add to this one possible future use case we've discussed before: when/if WMF eventually has our own homegrown system for this, it could be useful to have the old labels from V1 to compare/contrast against and judge the quality of our new model(s).

Resolved that the labels should be held indefinitely?

I'm in support of permanent stashing.

It's a subtler matter, but the question of refresh frequency is orthogonal.

Mholloway claimed this task.

See T229314 regarding where to store historical suggestions data.