
Exploratory data analysis: user segmentation for article & section creation
Closed, Resolved, Public

Description

WE2.1.3: Article & section creation guidance research

As we build an understanding of editors' preferences and challenges with the different tools for creating articles, the LPL team wants to look into the data for information that can guide our research, prototyping, and experimentation efforts. This ticket explores factors beyond content availability that may lead editors to choose the Content Translation tool (CX) over writing articles from scratch. Some exploratory questions are:

  1. Do users have a preference for creating articles using CX or other tools?
  2. Do they start using CX at the beginning of their Wikipedia editing journey, or do they gradually start using CX after they are exposed to the tool?
  3. How do various entry points of CX play a role?
  4. Once they start using CX, do they prefer it over creating articles from scratch?
  5. Does their preference change as the user gains more experience?
  6. Trends based on comparative wiki size
  7. Trends based on the regional association of a language
  8. Trends based on device used
  9. What is the distribution of the proportion of articles created by a user using CX?

The main goal is to understand whether certain users (by experience, wiki, region, device, etc.) prefer using CX to create articles. The data may not explain why a pattern exists, but it should let us segment users by usage pattern to support further qualitative research (such as surveys or interviews).

Event Timeline

KCVelaga_WMF renamed this task from Exploratory data analysis on Article & Section Creation to Exploratory data analysis: user segmentation for article & section creation. Jun 16 2025, 1:35 PM
KCVelaga_WMF updated the task description.
KCVelaga_WMF added a subscriber: cchen.
KCVelaga_WMF subscribed.

per Product Analytics allocation changes

Nikerabbit changed the task status from Open to In Progress. Jul 17 2025, 11:16 AM
Nikerabbit moved this task from Backlog to In-progress on the LPL Hypothesis board.

Worked on:

  • Collected and processed article-creation data for Wikipedia (2023–2025).
  • Collected and processed content-translation event data for Wikipedia (2023–2025).
  • Analyzed data for CX entry points and translation start step
    • Breakdown by user global edit bucket
    • Breakdown by the comparative wiki size of target languages
  • Computed the proportion of articles each user created with CX (2023–2025), to check the user's preference for using CX vs. other tools.
    • Breakdown by user global edit bucket
    • Breakdown by the comparative wiki size of target languages
    • Breakdown by mobile vs. desktop
    • Also checked user tenures
  • Computed the proportion of users in each global edit bucket, to check at which level of editing experience users start to use CX.
    • Breakdown by the comparative wiki size of target languages
    • Breakdown by mobile vs. desktop
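The two per-user measures listed above can be sketched roughly as follows. This is a minimal illustration, not the code used in the report: the record layout, the function names, and the bucket boundaries (picked to line up with the 10, 100, and 1000 edit thresholds mentioned in this task) are all assumptions, and the sample data is made up.

```python
from collections import defaultdict


def cx_share_per_user(creations):
    """Map each user to the share of their created articles that used CX.

    `creations` is an iterable of (user_id, used_cx) pairs, one per article.
    This record layout is hypothetical.
    """
    totals = defaultdict(int)
    with_cx = defaultdict(int)
    for user_id, used_cx in creations:
        totals[user_id] += 1
        if used_cx:
            with_cx[user_id] += 1
    return {user: with_cx[user] / totals[user] for user in totals}


def global_edit_bucket(edit_count):
    """Assign a global edit bucket; the boundaries here are assumed."""
    if edit_count < 10:
        return "0-9"
    if edit_count < 100:
        return "10-99"
    if edit_count < 1000:
        return "100-999"
    return "1000+"


# Made-up sample data, not taken from the report.
sample_creations = [
    ("user_a", True), ("user_a", True),
    ("user_b", False),
    ("user_c", True), ("user_c", False), ("user_c", False), ("user_c", False),
]
shares = cx_share_per_user(sample_creations)
```

Each breakdown (edit bucket, wiki size, platform) would then amount to grouping these per-user shares by the relevant attribute.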

To do:

  • Look into CX by the region of target languages
    • The goal is to see whether users' preference for CX changes across different regions.
    • Will try to select the top 5 language wikis of each region and compare between regions.
  • Create all visualizations and summarize the work.

Final analysis in this report and entry_points_analysis

Summary:

CX adoption

  • Among users registered after 2023-07-01, most start using CX to create new articles at the very beginning of their editing journey. Over 80% of these users adopt CX for article creation within their first 10 edits. Very few wait until they are highly experienced to try CX for the first time.
  • These results are consistent across platforms (desktop and mobile devices) and different sizes of wikis.
  • In smaller wikis, users are more likely to use CX for their first edit. The top 5 wikis also peak at the first edit, but a specific segment of their users tends to gain some experience (10+ edits) before trying the translation tool to create new articles.
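The adoption measure described above (the share of users whose first CX article falls within their first 10 edits) could be computed along these lines. This is only a sketch under assumed inputs: the function name, the input mapping, and the sample values are hypothetical and do not come from the report.

```python
def share_adopting_within(edits_at_first_cx, threshold=10):
    """Share of CX users whose first CX article came within `threshold` edits.

    `edits_at_first_cx` maps user_id to the number of edits the user had made
    up to and including their first CX-created article (assumed layout).
    """
    if not edits_at_first_cx:
        return 0.0
    early = sum(1 for n in edits_at_first_cx.values() if n <= threshold)
    return early / len(edits_at_first_cx)


# Made-up sample data, not taken from the report.
sample_first_cx = {"user_a": 1, "user_b": 3, "user_c": 10, "user_d": 250, "user_e": 2}
early_share = share_adopting_within(sample_first_cx)
```

Running the same function per platform or per wiki-size group would give the comparisons described in the bullets above.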

User preference for using CX to create new articles

  • By looking at article creation activities between 2023-07-01 and 2025-06-30, overall, users have a strong, polarized preference. They tend to fall into two extreme camps while creating articles: creating all articles from scratch and using CX exclusively. Over 50% of users tend to create more articles from scratch compared to using CX.
  • The pattern holds across platforms (desktop and mobile devices), although desktop has more "hybrid" users than mobile web.
  • The results are consistent in large wikis, where more editors create the majority of their articles through other tools. In smaller wikis (rank >20), the distribution of the CX proportion shifts to the right: for these smaller languages, CX appears to be used more often when creating new articles.
  • The pattern is the same for users who made fewer than 100 edits. As users gain more experience, some still never use CX to create new articles, but the distribution of CX proportions becomes much more even across the spectrum, with substantial groups using CX for some, most, or all of their article creations. Among users with over 1000 edits, the number who never use CX is comparatively small.
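One way to make the "two camps plus hybrid" reading concrete is to label each user by their CX share and count the labels. The segment names and the exact cut-offs (0 and 1) below are assumptions made for illustration; the report may bin the distribution differently.

```python
from collections import Counter


def preference_segment(cx_share):
    """Label a user by the share of their created articles that used CX."""
    if cx_share == 0:
        return "scratch_only"
    if cx_share == 1:
        return "cx_only"
    return "hybrid"


def segment_counts(cx_share_by_user):
    """Count users per segment from a {user_id: cx_share} mapping."""
    return Counter(preference_segment(s) for s in cx_share_by_user.values())


# Made-up sample data, not taken from the report.
sample_shares = {"user_a": 1.0, "user_b": 0.0, "user_c": 0.25, "user_d": 0.0}
counts = segment_counts(sample_shares)
```

A polarized distribution would show up as large `scratch_only` and `cx_only` counts with a comparatively small `hybrid` group.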

CX Entry Points

  • 57% of the newcomers opened the translation dashboard by navigating from the frequent language selector, which surfaces missing languages to translate for an article.
  • As users gain experience, reliance on the Frequent Language Selector drops precipitously (only 11% for 1000+ edit users). Experienced users prefer "intentional" entry points, utilizing the Content Language Selector, Options, or Direct Access.
  • On larger Wikipedias, the frequent language selector was the most used way to reach the translation dashboard.
  • Among the top 20 Wikipedias, it was used 40% of the time to access the dashboard. On smaller Wikipedias, the content language selector and options are used more than on larger Wikipedias; these entry points become essential for finding smaller languages.
  • The usage of direct entry points is highly consistent regardless of the target language wiki's size. This relates to the observations from the user edit bucket breakdown.
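The entry-point percentages above are shares within a group (an edit bucket or a wiki-size group). A minimal sketch of that calculation follows; the event layout and the group and entry-point labels are hypothetical and are not the identifiers used in the CX event data.

```python
from collections import Counter, defaultdict


def entry_point_shares(events):
    """Return {group: {entry_point: share of dashboard opens in that group}}.

    `events` is an iterable of (group, entry_point) pairs, one per dashboard
    open. A group could be an edit bucket or a wiki-size group (assumed layout).
    """
    counts = defaultdict(Counter)
    for group, entry_point in events:
        counts[group][entry_point] += 1
    shares = {}
    for group, per_entry in counts.items():
        total = sum(per_entry.values())
        shares[group] = {ep: n / total for ep, n in per_entry.items()}
    return shares


# Made-up sample data with placeholder labels, not taken from the report.
sample_events = [
    ("newcomer", "frequent_languages"),
    ("newcomer", "frequent_languages"),
    ("newcomer", "frequent_languages"),
    ("newcomer", "direct"),
    ("1000+", "content_language_selector"),
    ("1000+", "direct"),
]
entry_shares = entry_point_shares(sample_events)
```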

Thanks @cchen for the useful analysis.

By looking at article creation activities between 2023-07-01 and 2025-06-30, overall, users have a strong, polarized preference. They tend to fall into two extreme camps while creating articles: creating all articles from scratch and using CX exclusively. Over 50% of users tend to create more articles from scratch compared to using CX.

The opportunity to translate is not always available: it depends on the user speaking multiple languages and on the content being available in one of them. I was wondering how the percentage of users not using CX breaks down between multilinguals (e.g., those who have participated in more than one language wiki, especially in languages commonly used as a translation source, such as English) and monolinguals (who may not be able to use CX in any case).
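A rough sketch of the breakdown asked about here, using participation in more than one language wiki as a proxy for being multilingual. The proxy, the input mappings, and all names are assumptions for illustration; nothing below reflects an analysis that has been run.

```python
def never_cx_rate_by_language_group(wikis_by_user, cx_share_by_user):
    """Share of users who never used CX, split by a multilingual proxy.

    `wikis_by_user` maps user_id to the set of language wikis they edited.
    `cx_share_by_user` maps user_id to the share of their articles made with CX.
    Both layouts are hypothetical.
    """
    tallies = {"multilingual": [0, 0], "monolingual": [0, 0]}  # [never_cx, total]
    for user, share in cx_share_by_user.items():
        wikis = wikis_by_user.get(user, set())
        group = "multilingual" if len(wikis) > 1 else "monolingual"
        tallies[group][1] += 1
        if share == 0:
            tallies[group][0] += 1
    # None marks a group with no users, to avoid dividing by zero.
    return {g: (never / total if total else None)
            for g, (never, total) in tallies.items()}


# Made-up sample data, not taken from the report.
sample_wikis = {
    "user_a": {"en", "te"},
    "user_b": {"en"},
    "user_c": {"en", "fi"},
    "user_d": {"fi"},
}
sample_cx_shares = {"user_a": 1.0, "user_b": 0.0, "user_c": 0.0, "user_d": 0.0}
rates = never_cx_rate_by_language_group(sample_wikis, sample_cx_shares)
```

Restricting the proxy to users active on a common source wiki such as English would be a small change to the `group` condition.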