
Newcomer tasks: undercounting of edits using EventLogging
Closed, Resolved (Public)

Description

I think that because of two issues, we have been undercounting successful suggested edits. I detail the two issues below with examples from Vietnamese Wikipedia, which will hopefully help us look into them.

I have been using the connection between the homepagemodule and editattemptstep schemas to count successful suggested edits. Basically, if a se-task-click event in homepagemodule joins to a saveSuccess event in editattemptstep on homepage_pageview_token = editing_session_id, that's a successful suggested edit. I wrote this code before we had the edit tag for "Newcomer task". Now that we have the tag, I am seeing edits in the recent changes feeds of our target wikis that are not showing up in my reporting. Digging into this, it looks like: (a) many edits are not showing up in editattemptstep and (b) the link between the two tables is not always present. This is causing us to undercount in the reporting by a substantial amount.
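As a rough illustration of the linking logic described above, here is a minimal Python sketch on toy data. The rows are made-up stand-ins for the two schemas; only the field names (homepage_pageview_token, editing_session_id) and the action values (se-task-click, saveSuccess) come from the description, and the helper name is hypothetical.

```python
# Toy rows standing in for the two EventLogging schemas (all values are made up).
homepagemodule = [
    {"action": "se-task-click", "homepage_pageview_token": "tokenA"},
    {"action": "se-task-click", "homepage_pageview_token": "tokenB"},
    {"action": "se-task-impression", "homepage_pageview_token": "tokenC"},
]
editattemptstep = [
    {"action": "saveSuccess", "editing_session_id": "tokenA", "revision_id": 1},
    {"action": "saveAttempt", "editing_session_id": "tokenB", "revision_id": None},
    {"action": "saveSuccess", "editing_session_id": "other", "revision_id": 2},
]


def successful_suggested_edits(hpm_rows, eas_rows):
    """Return revision IDs of saveSuccess events whose session ID matches a task click."""
    clicked_tokens = {
        row["homepage_pageview_token"]
        for row in hpm_rows
        if row["action"] == "se-task-click"
    }
    return [
        row["revision_id"]
        for row in eas_rows
        if row["action"] == "saveSuccess"
        and row["editing_session_id"] in clicked_tokens
    ]


linked = successful_suggested_edits(homepagemodule, editattemptstep)
print(linked)
```

In this toy data, revision 2 is a saved edit that the join cannot see because its session ID does not match any click token, which is the kind of gap issue (b) describes.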

Maybe these aren't bugs -- maybe they're a difference in how EventLogging does things vs. edit tags, but I think it is important to sort this out so that we can correctly report on the impact of the feature.

For edits that have the "Newcomer task" tag, only some of them are in editattemptstep

When I look at all the edits that have the "Newcomer task" tag in Vietnamese Wikipedia, I see 13 revisions:

select ct_rev_id from change_tag where ct_tag_id = 75 limit 100;

ct_rev_id
57066339
57099167
57163867
57206739
57227444
57227455
57274969
57322195
57322605
57401587
57401627
57426233
57426284

Then when I go to look for those revisions as "saveSuccess" events in editattemptstep, I only see six of them:

select eas.event.revision_id
from event.editattemptstep eas
where year in (2019, 2020)
  and eas.wiki = 'viwiki'
  and eas.event.action = 'saveSuccess'
  and eas.event.revision_id in (57066339,57099167,57163867,57206739,57227444,57227455,57274969,57322195,57322605,57401587,57401627,57426233,57426284)
group by eas.event.revision_id;

revision_id
57322195
57322605
57426284
57099167
57426233
57206739
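The gap between the two result sets can be made explicit with a set difference. The revision IDs below are copied from the two query results above; the variable names are only for illustration.

```python
# Revisions carrying the "Newcomer task" tag on viwiki (first query result).
tagged = {
    57066339, 57099167, 57163867, 57206739, 57227444, 57227455, 57274969,
    57322195, 57322605, 57401587, 57401627, 57426233, 57426284,
}
# Revisions found as saveSuccess events in editattemptstep (second query result).
in_eas = {57322195, 57322605, 57426284, 57099167, 57426233, 57206739}

# Tagged revisions with no saveSuccess event.
missing = sorted(tagged - in_eas)

print(f"{len(tagged)} tagged, {len(in_eas)} in EAS, {len(missing)} missing")
for rev_id in missing:
    print(rev_id)
```

This lists the seven tagged revisions that have no saveSuccess event, which are the ones to inspect first.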

For edits that are in editattemptstep, not all of them link correctly back to homepagemodule

Taking the same set of six revisions from above, and looking at their editing_session_ids in the editattemptstep schema, we can see that two of them (57322195 and 57426284) do not have the 33-character IDs that match homepage_pageview_tokens in the homepagemodule schema. Instead, they have "classic" 21-character IDs.
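A quick way to triage session IDs is to bucket them by length. This is only a sketch: the 33- and 21-character lengths come from the observation above, the sample IDs are placeholders, and a 33-character ID is not guaranteed to match a homepage_pageview_token, so the real check is still the join.

```python
HOMEPAGE_TOKEN_LENGTH = 33  # length of IDs that match homepage_pageview_token, per the description
CLASSIC_ID_LENGTH = 21      # length of the "classic" editing_session_id, per the description


def classify_session_id(session_id: str) -> str:
    """Bucket an editing_session_id by its length only (a heuristic, not a join)."""
    if len(session_id) == HOMEPAGE_TOKEN_LENGTH:
        return "homepage-token-like"
    if len(session_id) == CLASSIC_ID_LENGTH:
        return "classic"
    return "unknown"


# Placeholder IDs, not real data.
sample_ids = ["a" * 33, "b" * 21, "c" * 10]
for sid in sample_ids:
    print(len(sid), classify_session_id(sid))
```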

Event Timeline

MMiller_WMF renamed this task from "Newcomer tasks: IDs not matching across homepagemodule and editattemptstep" to "Newcomer tasks: undercounting of edits using EventLogging". Jan 4 2020, 2:50 AM
MMiller_WMF created this task.

@nettrom_WMF -- here is one of the QA tasks. If there are bugs here, I would definitely want us to fix them -- but we should also determine whether the reporting on newcomer tasks should switch from the "linked IDs" method to the edit tag method.

LGoto triaged this task as High priority. Feb 10 2020, 7:15 PM

I've dug into this using data from launch of Newcomer Tasks up until Feb 5, 2020. I've excluded known test accounts from the data gathering.

When it comes to edits that are tagged with "newcomer task" that do not also show up as "saveSuccess" events in EditAttemptStep (EAS), the proportions vary greatly by wiki:

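|            | Czech       | Korean       | Vietnamese  | Arabic       |
| ---------- | ----------- | ------------ | ----------- | ------------ |
| Not in EAS | 9 (15.5%)   | 40 (26.8%)   | 37 (40.7%)  | 82 (14.2%)   |
| In EAS     | 49 (84.5%)  | 109 (73.2%)  | 54 (59.3%)  | 497 (85.8%)  |
| Total      | 58 (100.0%) | 149 (100.0%) | 91 (100.0%) | 579 (100.0%) |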

Similarly, when it comes to tagged edits that are found in EAS, the proportion of those that have editing_session_id set to a value that matches a visit to the Homepage (specifically homepage_pageview_token in HomepageVisit, chosen because it's logged server-side while HomepageModule is client-side) also varies greatly by wiki:

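|             | Czech       | Korean       | Vietnamese  | Arabic       |
| ----------- | ----------- | ------------ | ----------- | ------------ |
| Not matched | 14 (28.6%)  | 36 (33.0%)   | 22 (40.7%)  | 242 (48.7%)  |
| Matched     | 35 (71.4%)  | 73 (67.0%)   | 32 (59.3%)  | 255 (51.3%)  |
| Total       | 49 (100.0%) | 109 (100.0%) | 54 (100.0%) | 497 (100.0%) |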
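Chaining the two tables gives a rough sense of how many tagged edits the "linked IDs" method can recover end to end. The counts below are taken from the two tables; the end-to-end figure (matched edits divided by all tagged edits) is derived here and is not reported in the tables themselves.

```python
# Counts copied from the two tables above.
tagged_total = {"Czech": 58, "Korean": 149, "Vietnamese": 91, "Arabic": 579}
in_eas = {"Czech": 49, "Korean": 109, "Vietnamese": 54, "Arabic": 497}
matched = {"Czech": 35, "Korean": 73, "Vietnamese": 32, "Arabic": 255}


def pct(part: int, whole: int) -> float:
    """Percentage of `whole` represented by `part`, rounded to one decimal."""
    return round(100 * part / whole, 1)


# Share of all tagged edits that are both in EAS and matched to a Homepage visit.
end_to_end = {wiki: pct(matched[wiki], tagged_total[wiki]) for wiki in tagged_total}

for wiki in tagged_total:
    print(
        f"{wiki}: in EAS {pct(in_eas[wiki], tagged_total[wiki])}%, "
        f"matched within EAS {pct(matched[wiki], in_eas[wiki])}%, "
        f"end to end {end_to_end[wiki]}%"
    )
```

On these counts, the linked-ID method recovers roughly 35% to 60% of tagged edits depending on the wiki.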

There are some patterns that show up when analyzing this data:

  1. Some users make multiple edits to the same article in a single edit session. Subsequent edits have a different editing_session_id (which they should because they are different edits), but only the first can be matched with HomepageVisit/HomepageModule.
  2. Some users do not allow JavaScript or block client-side EventLogging, resulting in data in EditAttemptStep but no data in HomepageModule. This is because some events in EAS are server-side, while HomepageModule is client-side.
  3. Not all se-task-click events end up in HomepageModule. I suspect this is likely because the user is navigating to another page and the browser kills all JavaScript. We can instead find a se-task-impression event matching the page the user edited.

Based on this, I have the following recommendations:

  1. Use the "newcomer task" edit tag to count edits.
  2. If these edits are to be connected to events in HomepageModule, match on event.user_id, event.action IN ('se-task-impression', 'se-task-click'), and event.action_data REGEXP "pageId={page_id}" (substitute in the right page ID, or extract the page ID for matching), preferably with a limit on time as well (because the tag is only applied within a week).
  3. Reconsider whether a week is an appropriate timespan to allow edits to articles users click on (ref T236885), because few of these edits occur a long time after the impression/click event.
  4. Consider setting up a monthly cron job to measure how many users don't allow client-side events, so we know what to expect with regards to data availability in the HomepageModule schema.
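The matching in recommendation 2 could be sketched as follows. This is a minimal illustration on toy data, not the reporting code: the row layout, the helper name, and the exact format of action_data are assumptions. The `(?!\d)` guard is also an addition of this sketch, so that page ID 123 does not match "pageId=1234".

```python
import re
from datetime import datetime, timedelta

MATCH_ACTIONS = {"se-task-impression", "se-task-click"}
MAX_AGE = timedelta(days=7)  # the tag is only applied within a week


def find_matching_events(edit, events):
    """Return HomepageModule-like events that plausibly led to the tagged edit."""
    # Assumed action_data format: contains "pageId=<digits>"; the lookahead
    # stops a shorter page ID from matching a longer one.
    page_pattern = re.compile(rf"pageId={edit['page_id']}(?!\d)")
    return [
        ev
        for ev in events
        if ev["user_id"] == edit["user_id"]
        and ev["action"] in MATCH_ACTIONS
        and page_pattern.search(ev["action_data"])
        and timedelta(0) <= edit["timestamp"] - ev["timestamp"] <= MAX_AGE
    ]


# Toy data (hypothetical values).
edit = {"user_id": 1, "page_id": 123, "timestamp": datetime(2020, 2, 1, 10, 30)}
events = [
    {"user_id": 1, "action": "se-task-click", "action_data": "pageId=123",
     "timestamp": datetime(2020, 2, 1, 10, 0)},    # matches
    {"user_id": 1, "action": "se-task-impression", "action_data": "pageId=1234",
     "timestamp": datetime(2020, 2, 1, 9, 0)},     # different page
    {"user_id": 2, "action": "se-task-click", "action_data": "pageId=123",
     "timestamp": datetime(2020, 2, 1, 10, 0)},    # different user
    {"user_id": 1, "action": "se-task-impression", "action_data": "pageId=123",
     "timestamp": datetime(2020, 1, 1, 10, 0)},    # older than a week
]

matches = find_matching_events(edit, events)
print(len(matches), [ev["action"] for ev in matches])
```

If the window in recommendation 3 is shortened, only MAX_AGE needs to change.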

IIRC some of the common blocklists used by uBlock and similar adblocker browser extensions break client-side EventLogging, so that probably affects a lot of users (and is more likely to affect experienced users).

> Not all se-task-click events end up in HomepageModule. I suspect this is likely because the user is navigating to another page and the browser kills all JavaScript.

EventLogging uses navigator.sendBeacon so on modern browsers this should only be the case in a small fraction of sessions.

> IIRC some of the common blocklists used by uBlock and similar adblocker browser extensions break client-side EventLogging, so that probably affects a lot of users (and is more likely to affect experienced users).

That's right. uBlock Origin for example blocks EventLogging by default. I imagine the other big adblockers do as well.

> Reconsider whether a week is an appropriate timespan to allow edits to articles users click on (ref T236885), because few of these edits occur a long time after the impression/click event.

What would you suggest? 1 or 2 days? Less?

> When it comes to edits that are tagged with "newcomer task" that do not also show up as "saveSuccess" events in EditAttemptStep (EAS), the proportions vary greatly by wiki:

Overall I would expect the percentages in the first two rows to be close to the share of HomepageVisit entries that have no corresponding HomepageModule entry (for viewing, not editing); do you find this to be the case? E.g. for arwiki I'd expect about 15% of events in HomepageVisit to lack a corresponding HomepageModule entry.

Alright -- @nettrom_WMF and I have corresponded in depth about this outside Phabricator, and we're ready to resolve the task, with these notes:

  • Edits can be undercounted in EventLogging for a number of reasons. We have taken reasonable measures in our reporting code to pick up as many as we can, while keeping the whole funnel based on EventLogging, and therefore apples-to-apples from top to bottom.
  • When we want to know the exact number of suggested edits completed, and the exact percentage of newcomers who do one, we should use the edit tags.
  • As far as the funnel is concerned, the percentages may be low, but that's okay because the purpose of the funnel is for us to understand whether we are improving things directionally, more than knowing the exact numbers.
  • @nettrom_WMF also recommends that we keep in mind a list of reasons that the edits can be undercounted in EventLogging as compared to the edit tag:
    • Users who block client-side EventLogging (about 10% of all users).
    • Users who have unreliable client-side EventLogging (our estimate is ~15% of users, depending on platform, device, etc.).
    • Users who make multiple edits to the same article (% unknown, subject to EAS sampling rate; only the first edit is counted).
    • Users who don't follow the expected workflow (% unknown, subject to EAS sampling rate).