[Java] EventGate validation error for performer language groups
Closed, Resolved | Public | 3 Estimated Story Points

Description

Coming out of Android data QA for T353680 and after release to production on 3/21, there are some EventGate validation errors related to a performer's chosen language groups.

Fix the performer language groups validation error for users who chose a large number of languages in their app preferences.

https://logstash.wikimedia.org/goto/e7c95d72da983c3e59c8f826b7ef8cd5

Example:

"language_groups":"[zh-hant, zh-hans, ja, en, zh-yue, ko, fr, de, it, es, pt, da, tr, ru, nl, sv, cs, fi, uk, el, pl, hu, vi, id, ca, mk, sl, ms, tl, avk, lt, sr-el, eu, nb, ceb, als, uz-latn, az, af, nn, et, eo, la, br, jv, io, bg, ro, nrm, pcd, tg-latn, lmo, gl, cy, sq, is, ha, gd, ku-latn, hr, lv, sk, bar, pms, lld, ga, war]"

Possible remediation steps:

  • Update the schema to allow more than 255 characters in a performer's language groups
  • Update the Java library to prevent the value from exceeding 255 characters
  • Limit the number of languages a user can add to their language group preferences in the Android app
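
To illustrate why the example above fails validation, here is a minimal sketch that checks a rendered language_groups value against the 255-character limit named in the first remediation step. The class and method names are hypothetical and are not part of the actual Java library. The bracketed, comma-separated format of the example matches what `List.toString()` produces, but that the app builds the value this way is an assumption.

```java
import java.util.List;

// Hypothetical helper, not part of the real client library.
final class LanguageGroupsLengthCheck {
    // Limit taken from the 255-character figure in this task.
    static final int MAX_LENGTH = 255;

    // The example value from this task, split across lines for readability.
    static final String EXAMPLE =
        "[zh-hant, zh-hans, ja, en, zh-yue, ko, fr, de, it, es, pt, da, tr, ru, nl, sv, cs, "
        + "fi, uk, el, pl, hu, vi, id, ca, mk, sl, ms, tl, avk, lt, sr-el, eu, nb, ceb, als, "
        + "uz-latn, az, af, nn, et, eo, la, br, jv, io, bg, ro, nrm, pcd, tg-latn, lmo, gl, cy, "
        + "sq, is, ha, gd, ku-latn, hr, lv, sk, bar, pms, lld, ga, war]";

    // Returns true when the value would pass a maximum-length check of 255.
    static boolean fitsSchema(String languageGroups) {
        return languageGroups.length() <= MAX_LENGTH;
    }

    public static void main(String[] args) {
        // A short list renders as "[en, fr, de]" and fits comfortably.
        String few = List.of("en", "fr", "de").toString();
        System.out.println(few + " fits: " + fitsSchema(few));

        // The example from the task is longer than the limit.
        System.out.println("example length: " + EXAMPLE.length()
            + ", fits: " + fitsSchema(EXAMPLE));
    }
}
```

Each two-letter code costs four characters once the separator is included, so a user needs roughly sixty or more languages before the limit is reached, which is consistent with this being a rare case.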

Acceptance Criteria

  • Resolution for how to prevent validation errors

Event Timeline

cjming created this task.

Discussed with @SNowick_WMF who will check with Android engineers about next steps.

Per Shay, this kind of anomaly in the data is not pervasive: few users choose this many languages (Android allows it, iOS does not). This data is not generally useful at the moment, but to avoid losing it in case it becomes useful in the future, we will figure out how to remediate the validation errors.

Possible remediation steps outlined in description.

Per discussion with @SNowick_WMF, we will just truncate language_groups to 255 characters (the limit set by the current schema), which should resolve the validation errors.

As discussed with @cjming, the easiest way to resolve this is to keep the character limit at 255 and truncate the value passed along with the event, keeping the data up to that limit. Users with this many languages selected are quite unusual, and the extra languages don't give us any useful information beyond the first few they selected. We mostly rely on a primary_language designation and/or which language wikis the user's events occur on. In summary, I am OK with losing some of the language codes at the end of these fields if they exceed the character limit. As long as we aren't losing any events, this seems to be the most expedient solution.
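
The truncation agreed on above could be sketched roughly as follows. This is only an illustration under stated assumptions: the class name, method name, and the point in the library where it would be called are hypothetical, and the actual fix may look different.

```java
// Hypothetical helper, not the actual library change.
final class LanguageGroupsTruncator {
    // Limit set by the current schema, per the discussion in this task.
    static final int MAX_LENGTH = 255;

    // Returns the value unchanged when it fits, otherwise its first 255 characters.
    static String truncate(String languageGroups) {
        if (languageGroups == null || languageGroups.length() <= MAX_LENGTH) {
            return languageGroups;
        }
        return languageGroups.substring(0, MAX_LENGTH);
    }

    public static void main(String[] args) {
        // Build an over-long placeholder value (made-up codes, for illustration only).
        StringBuilder sb = new StringBuilder("[");
        for (int i = 0; i < 100; i++) {
            if (i > 0) {
                sb.append(", ");
            }
            sb.append("xx");
        }
        sb.append("]");

        String truncated = truncate(sb.toString());
        System.out.println(sb.length() + " -> " + truncated.length());
    }
}
```

A plain cut like this can split the last language code and drops the closing bracket, which matches the trade-off accepted above: trailing codes may be lost, but the event itself still validates. `String.length()` counts UTF-16 code units; for ASCII language codes this is the same as the character count.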

VirginiaPoundstone raised the priority of this task from Low to High. May 3 2024, 3:23 PM