
Add an image: caption validation (PLACEHOLDER)
Open, Needs Triage, Public

Description

NOTE: a full validation experience will not be part of Iteration 1. This is for a future iteration. For the minimal version, see T293161: Add an image: minimal caption validation.

Placeholder task for validation rules on the caption entered by the user. These rules may include:

  • Minimum and maximum length.
  • Not allowing the filename to be included.
  • Not allowing the same caption as a previous image added by the user.
  • Checking that it is in the content language for the article.

This task will also include the user experience for displaying the warning message.

Mockup as of 2021-10-08:

image.png (1×1 px, 1 MB)

Figma: https://www.figma.com/file/ULhJr1isDstRbGE5vjYDsr/Add-images-structured-task?node-id=3050%3A9628

Event Timeline

> Minimum and maximum length.
> Not allowing the filename to be included.

These are trivial to do.
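For illustration, a minimal sketch of those two checks in Python; the length limits and warning codes here are hypothetical stand-ins, not agreed-upon values:

```python
import os.path

# Hypothetical limits -- the actual values would be decided per-wiki.
MIN_LENGTH = 5
MAX_LENGTH = 250

def validate_caption(caption: str, filename: str) -> list[str]:
    """Return a list of warning codes for a proposed image caption.

    `filename` is the image's file page title, e.g. "File:Example_photo.jpg".
    """
    warnings = []
    text = caption.strip()
    if len(text) < MIN_LENGTH:
        warnings.append('too-short')
    if len(text) > MAX_LENGTH:
        warnings.append('too-long')
    # Compare against the base filename, ignoring the namespace prefix,
    # the extension, case, and underscore/space differences.
    base = filename.split(':', 1)[-1]
    base = os.path.splitext(base)[0].replace('_', ' ').lower()
    if base and base in text.replace('_', ' ').lower():
        warnings.append('contains-filename')
    return warnings
```

For example, `validate_caption("Example photo", "File:Example_photo.jpg")` returns `['contains-filename']`.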

> Not allowing the same caption as a previous image added by the user.

Slightly more complicated because we'd have to store previous captions somewhere. Still fairly easy, but is it useful? Someone giving the exact same caption for multiple images seems unlikely.
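If we did store them, the comparison itself is simple. A sketch assuming an in-memory per-user history; the store, the size limit, and the normalization rules are all assumptions, not a decided design:

```python
class CaptionHistory:
    """Track a user's recent captions to flag exact repeats.

    In production this would live in a per-user store; here it is in-memory.
    """

    def __init__(self, max_entries: int = 50):
        self.max_entries = max_entries
        self._seen: list[str] = []

    @staticmethod
    def _normalize(caption: str) -> str:
        # Treat captions as duplicates regardless of case and extra whitespace.
        return ' '.join(caption.split()).lower()

    def is_repeat(self, caption: str) -> bool:
        return self._normalize(caption) in self._seen

    def record(self, caption: str) -> None:
        self._seen.append(self._normalize(caption))
        # Keep only the most recent entries.
        del self._seen[:-self.max_entries]
```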

> Checking that it is in the content language for the article.

Language detection is a complicated problem. It requires dictionaries, which are probably too large to do this on the client side. We'd have to find which open-source tool does a good-enough job, and set it up as a web service. Unless someone already did that for another product, this isn't really feasible IMO.
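A dictionary-free heuristic can at best catch wrong-script input (e.g. Latin text submitted on a Cyrillic-script wiki); it cannot distinguish languages that share a script, which is exactly why real detection needs the heavier tooling described above. A sketch of that limited check, purely for illustration:

```python
import unicodedata

def dominant_script_matches(caption: str, expected_script: str) -> bool:
    """Crude check: do most letters in the caption use the expected script?

    `expected_script` is a Unicode script name prefix as it appears in
    character names, e.g. 'LATIN', 'CYRILLIC', 'ARABIC'. This cannot tell
    French from English -- it only catches wrong-script input.
    """
    letters = [ch for ch in caption if ch.isalpha()]
    if not letters:
        return True  # nothing to judge
    matching = sum(
        1 for ch in letters
        if unicodedata.name(ch, '').startswith(expected_script)
    )
    return matching / len(letters) > 0.5
```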

> Minimum and maximum length.
> Not allowing the filename to be included.
>
> These are trivial to do.
>
> Not allowing the same caption as a previous image added by the user.
>
> Slightly more complicated because we'd have to store previous captions somewhere. Still fairly easy, but is it useful? Someone giving the exact same caption for multiple images seems unlikely.

I agree that disallowing previous captions doesn't seem like very useful validation.

> Checking that it is in the content language for the article.
>
> Language detection is a complicated problem. It requires dictionaries, which are probably too large to do this on the client side. We'd have to find which open-source tool does a good-enough job, and set it up as a web service. Unless someone already did that for another product, this isn't really feasible IMO.

I know that Android did this for their app version of the Add caption task; I'm not sure if it uses something Android-specific, but @Dbrant may have more info. Also, is it possible that CX uses the dictionaries we are after for V1 languages? cc @Pginer-WMF

> Also, is it possible that CX uses the dictionaries we are after for V1 languages? cc @Pginer-WMF

In Content Translation we have not been doing language detection; the focus has been on content mapping across two languages we knew in advance. In particular:

  • Finding which could be the equivalent sections across two versions of an article in different languages. For which we use a database of equivalent section titles across different languages.
  • Finding which could be the equivalent template parameters across templates in two different languages. For this, [[ https://github.com/digitalTranshumant/templatesAlignment/tree/master | a machine learning approach ]] based on multilingual fastText vectors was used.
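For context, the core of such an alignment reduces to nearest-neighbor search over cross-lingual embeddings. The toy sketch below uses hand-made 2-D vectors standing in for fastText embeddings; it is an illustration of the general idea, not the linked project's actual code:

```python
import math

def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def align_parameters(source_vecs: dict, target_vecs: dict) -> dict:
    """Map each source template parameter to the target parameter whose
    (pre-computed, cross-lingual) embedding is most similar."""
    return {
        s_name: max(target_vecs, key=lambda t: cosine(s_vec, target_vecs[t]))
        for s_name, s_vec in source_vecs.items()
    }
```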

I don't know if any of those underlying resources, or the MT services that Content Translation provides, could be useful as part of the process to support detection. @santhosh may know more about language detection. I recall talking about this functionality when T98728 was explored.

> I know that Android did this for their app version of the Add caption task; I'm not sure if it uses something Android-specific, but @Dbrant may have more info?

This was indeed specific to Android -- our language detection uses Google's ML Kit.

> Minimum and maximum length.
> Not allowing the filename to be included.
>
> These are trivial to do.
>
> Not allowing the same caption as a previous image added by the user.
>
> Slightly more complicated because we'd have to store previous captions somewhere. Still fairly easy, but is it useful? Someone giving the exact same caption for multiple images seems unlikely.
>
> I agree that disallowing previous captions doesn't seem like very useful validation.

My use case for disallowing previous captions is a user copy/pasting the same thing into each caption, e.g. "A good image for the article." I don't think I have evidence to say this will happen, but I could imagine users doing it. We'll see when we have the caption data from Iteration 1.

Not something we're currently working on, so I'm moving back to the main board.