[Request] Gather the data required to filter and cluster incubator languages for experiment
Closed, Resolved, Public

Description

In the measurement plan (T367686) we outlined the criteria for filtering languages (to meet a minimum threshold) and the variables to use for clustering. This requires gathering the following data points:

  • project type
  • number of active editors in the last 3 months (by month)
  • number of edits in the last 3 months (split by: content & non-content)
  • language directionality
  • time spent by language on incubator (until 30 June 2024)
    • starting date to be considered as the average of first 5 percentile of edits
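
One possible reading of the starting-date rule, sketched in Python. This is only an illustration, not the query used in the notebook: it assumes "first 5 percentile of edits" means the earliest 5% of a language's edits by timestamp, that the cut-off is 30 June 2024 at 00:00 UTC, and the function names `incubation_start` and `days_on_incubator` are hypothetical.

```python
from datetime import datetime, timezone

# Assumed cut-off: 30 June 2024, 00:00 UTC.
CUTOFF = datetime(2024, 6, 30, tzinfo=timezone.utc)

def incubation_start(edit_times, percent=5):
    """Average timestamp of the earliest `percent` percent of edits."""
    if not edit_times:
        raise ValueError("no edits for this language")
    ordered = sorted(edit_times)
    # Integer ceiling of n * percent / 100, with at least one edit kept.
    k = max(1, (len(ordered) * percent + 99) // 100)
    earliest = ordered[:k]
    mean_epoch = sum(t.timestamp() for t in earliest) / k
    return datetime.fromtimestamp(mean_epoch, tz=timezone.utc)

def days_on_incubator(edit_times, cutoff=CUTOFF):
    """Whole days between the estimated start date and the cut-off."""
    return (cutoff - incubation_start(edit_times)).days
```

Averaging the earliest slice rather than taking the very first edit keeps a single stray early edit from stretching the estimated time on Incubator.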

Details

Other Assignee
KCVelaga_WMF
Related Changes in GitLab:
Title | Reference | Author | Source Branch | Dest Branch
Memory and filter | repos/research/incubator-data-exploration!85 | cmyrick | memory-and-filter | main
New additions (bots, pages, excl10) | repos/research/incubator-data-exploration!83 | kcvelaga | T370620_additions | main

Event Timeline

KCVelaga_WMF changed the task status from Open to In Progress. Jul 24 2024, 1:30 PM
KCVelaga_WMF assigned this task to CMyrick-WMF.
KCVelaga_WMF updated Other Assignee, added: KCVelaga_WMF.
KCVelaga_WMF moved this task from Incoming to In progress on the LPL Analytics board.

@KCVelaga_WMF

See WE_223_request_T370620.ipynb (and output tsv).

Regarding my progress on excluding editors who make administrative edits across multiple languages on Incubator:

I am in the process of writing a query that pulls all editors who edit in 5 or more languages in Incubator (see cell 8, under "Query list of editors who edit 5+ languages on Incubator"). Once this is completed, the list can be applied to all relevant queries in the notebook.

@KCVelaga_WMF Update: See WE_223_request_T370620.ipynb (and output tsv)

As requested, the tsv includes columns for

  • project type - I've also included a column for language name
  • number of active editors in the last 3 months (by month) - excluding editors noted below*
  • number of edits in the last 3 months (split by: content & non-content) - excluding edits by editors noted below*
  • language directionality - specifically rtl (right-to-left) and vertical, both of which are boolean
  • time spent by language on incubator (until 30 June 2024), with starting date to be considered as the average of first 5 percentile of edits - note that these values are duplicated across three rows, due to the table being at the level of month for many of the other variables

*Note about editors excluded:

Per cell 8 (under "Query list of editors who edit 5+ languages on Incubator"), I've queried all editors who have made 5+ monthly edits (for at least one month) across 5+ languages in the Incubator. This list of editors (N=165) was converted into a SQL tuple, {editors_to_exclude}, and was used in WHERE clauses in all relevant queries of the notebook to exclude these editors.
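
For illustration only, here is a minimal Python sketch of how such an exclusion list could be derived and turned into a SQL tuple. It is not the notebook's query. It assumes one reading of the rule (an editor counts for a language if they made 5+ edits there in at least one month, and is excluded once that holds for 5+ languages), and the row layout and function names are hypothetical.

```python
from collections import defaultdict

def find_editors_to_exclude(rows, min_monthly_edits=5, min_languages=5):
    """rows: iterable of (editor, language, month, edit_count) tuples."""
    qualifying = defaultdict(set)
    for editor, language, _month, edits in rows:
        # A language counts once the editor reaches the threshold in any month.
        if edits >= min_monthly_edits:
            qualifying[editor].add(language)
    return sorted(e for e, langs in qualifying.items()
                  if len(langs) >= min_languages)

def to_sql_tuple(names):
    """Render names as a quoted SQL tuple, e.g. for a NOT IN clause."""
    if not names:
        raise ValueError("empty list would produce invalid SQL: IN ()")
    # Double any single quotes so user names cannot break the literal.
    quoted = ", ".join("'" + n.replace("'", "''") + "'" for n in names)
    return f"({quoted})"
```

The two thresholds are parameters, so a different cut-off can be tried without rewriting the logic. The resulting string can then be substituted into a query template, for example `WHERE actor_name NOT IN {editors_to_exclude}`, where the column name is an assumption.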

CMyrick-WMF renamed this task from Gather the data required to filter and cluster incubator languages for experiment to [Request] Gather the data required to filter and cluster incubator languages for experiment. Jul 26 2024, 1:48 PM
CMyrick-WMF closed this task as Resolved.

@CMyrick-WMF I think N=165 editors who edit across 5+ languages is still quite a high number of editors to exclude. I wonder if we should switch to a defined list or consider some other criteria. Some other options to consider are:

  • 30+ monthly edits in more than 5 languages.
  • 100+ all-time edits in more than 5 languages.

@CMyrick-WMF

  • I missed mentioning pages in the original description, so I added a query for that.
  • I think excluding editors who edit in 10+ languages seems reasonable; we're excluding about 14 editors and all of LangCom.
  • We should remove bots; I added conditions for that.
  • Also, I realized the initial TSV doesn't have the time spent on incubator data, so I split the output into monthly metrics and project info (including language directionality).

I opened an MR based on your existing notebook. Please review and let me know what you think.