
Add "did edit" field to pageview_actor
Open, Medium, Public

Description

I'm not sure how I actually feel about this but I wanted to start a conversation/task around potentially adding a field to pageview_actor that would record whether a particular actor signature had evidence of editing activity in their session.

Why:

  • In research, for privacy reasons, we often filter out editors from reader session datasets given that their pageview history can be partially reconstructed from edit history -- e.g., the covid dataset. While the logged-in parameter can help with this, that's an imperfect proxy (you can have an account but not edit; you can edit but not have an account).
  • The clause for determining "did edit" depends on webrequests that are not marked as pageviews, so the data is in the webrequests table but not in the pageview_actor table
  • Precomputing this in pageview_actor would also help to normalize this pattern and make it more accessible to folks working with session data
  • While surfacing it as a field makes it easier to identify editors in the data, I don't see this as a major privacy concern given that pageview_actor also has the same 90-day limit and access restrictions as webrequests

Potential reasons not to do it:

  • For people who want this filter, they can always just go back to working with webrequests instead

The logic that we currently use for this is:

(uri_query LIKE '%action=edit%')  -- desktop wikitext editor
OR (uri_query LIKE '%action=visualeditor%')  -- desktop and mobile visualeditor
OR (uri_query LIKE '%&intestactions=edit&intestactionsdetail=full&uiprop=options%')  -- mobile wikitext editor

Notes:

  • these clauses have to sweep through non-pageviews which is why it has to be done in the creation of pageview_actor as opposed to ad-hoc as needed afterwards.
  • I haven't checked these clauses recently but hopefully they are still correct :) Nothing prevents changes to the API calls that would break these though...
  • we don't have a clause for the apps because we traditionally leave them out of research datasets. As they grow in popularity, though, this will become less acceptable and we'll want to figure out how to include them in these clauses.
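
For anyone prototyping outside of SQL, the three clauses above can be sketched as plain substring checks. This is only an illustration: the function and constant names are made up here, and, as noted, the patterns themselves have not been re-verified against current editor requests.

```python
# Substrings copied from the LIKE clauses above; whether they still match
# what the editors send today is unverified.
EDIT_ATTEMPT_MARKERS = (
    "action=edit",          # desktop wikitext editor
    "action=visualeditor",  # desktop and mobile visualeditor
    "&intestactions=edit&intestactionsdetail=full&uiprop=options",  # mobile wikitext editor
)


def looks_like_edit_attempt(uri_query: str) -> bool:
    """Mirror `uri_query LIKE '%...%'` with case-sensitive substring checks."""
    return any(marker in uri_query for marker in EDIT_ATTEMPT_MARKERS)


print(looks_like_edit_attempt("?title=Example&action=edit"))     # True
print(looks_like_edit_attempt("?title=Example&action=history"))  # False
```

Because these are substring matches, any parameter that merely contains one of the markers (for example a value ending in `action=edit`) also counts as a match, in the SQL version as well as in this sketch.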

Event Timeline

If we were to add some info to the table, we would do it using a flag, similarly to how we flag is_pageview and is_redirect_to_pageview. I suggest we would use is_edit as field name. The 3 named flags would be mutually exclusive.
The reason for this technical approach is to avoid having to group by actor_signature when generating pageview_actor hours. Adding a few rows is computationally very cheap, while grouping by actor_signature touches every row and is expensive. The by-actor filtering of non-editor sessions would need to be done in a further step. Would that be an acceptable solution, @Isaac?
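
A rough sketch of what "flag per row, no group-by" could look like, written in Python only for illustration. The function name, the `matches_edit_clause` input, and the precedence order (pageview, then redirect, then edit) are assumptions made for this example, not the actual pageview_actor job.

```python
from typing import Optional


def flag_row(is_pageview: bool, is_redirect_to_pageview: bool,
             matches_edit_clause: bool) -> Optional[str]:
    """Pick at most one flag for a single row; None means the row is not kept.

    The precedence order is an assumption, chosen so the flags stay
    mutually exclusive.
    """
    if is_pageview:
        return "is_pageview"
    if is_redirect_to_pageview:
        return "is_redirect_to_pageview"
    if matches_edit_clause:
        return "is_edit"
    return None


# (actor_signature, is_pageview, is_redirect_to_pageview, matches_edit_clause)
rows = [
    ("actor-a", True, False, False),
    ("actor-a", False, False, True),
    ("actor-b", False, False, False),
]

# Each row is decided on its own: a map and a filter, no grouping by actor.
flagged = [(actor, flag_row(pv, rd, ed)) for actor, pv, rd, ed in rows]
kept = [(actor, flag) for actor, flag in flagged if flag is not None]
print(kept)  # [('actor-a', 'is_pageview'), ('actor-a', 'is_edit')]
```

The edit row for actor-a is simply an extra row in the output; knowing that actor-a is an editor across all of their rows is what the later by-actor step is for.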

> If we were to add some info to the table, we would do it using a flag, similarly to how we flag is_pageview and is_redirect_to_pageview. I suggest we would use is_edit as field name. The 3 named flags would be mutually exclusive.

Thanks @JAllemandou ! Yep, that makes sense to me and gets us most of the way. And then that further step would just be something like:

WITH actor_editors AS (
  SELECT DISTINCT
    actor_signature
  FROM wmf.pageview_actor
  WHERE
    is_edit
)
SELECT
  ...
FROM wmf.pageview_actor p
LEFT ANTI JOIN actor_editors e
  ON (p.actor_signature = e.actor_signature)
WHERE
  ...
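
For readers less used to anti-joins, the same two steps can be sketched on a few in-memory rows. The rows and the `page` column are made up for illustration, and the elided SELECT and WHERE parts of the query are left out rather than guessed.

```python
# Toy stand-in for pageview_actor rows carrying the proposed is_edit flag.
rows = [
    {"actor_signature": "a1", "page": "Cat", "is_edit": False},
    {"actor_signature": "a1", "page": "Cat", "is_edit": True},
    {"actor_signature": "b2", "page": "Dog", "is_edit": False},
    {"actor_signature": "b2", "page": "Fox", "is_edit": False},
]

# Step 1: SELECT DISTINCT actor_signature ... WHERE is_edit
actor_editors = {r["actor_signature"] for r in rows if r["is_edit"]}

# Step 2: LEFT ANTI JOIN, i.e. keep only rows whose actor never shows up in step 1
non_editor_rows = [r for r in rows if r["actor_signature"] not in actor_editors]

print([r["page"] for r in non_editor_rows])  # ['Dog', 'Fox']
```

Note that both of a1's rows are dropped, including the one that is not itself an edit, which is the point of filtering by actor rather than by row.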
Ottomata moved this task from Incoming to Datasets on the Analytics board.

I think this would be super useful!
We could use this information to filter out rows in reading datasets that we intend to make public, like pageviews per article per country.
Not reporting on sessions that included editing would break the bridge between the dataset and the public wiki databases, thus allowing us to publish data with more granularity!
And the amount of data that we'd be losing would be orders of magnitude smaller than the total.

Discussed today with the team: the change is cheap, and we can implement it soon. One thing we would like to see first is a coverage analysis of how many edits we capture with the described filtering (it could be documented on Wikitech). This would give us a better understanding of the validity of the approach before actually implementing and using it.
@Isaac : May we let you sync with your team if this could be done on your end?

> I think this would be super useful!

Yay! Glad to hear!

> May we let you sync with your team if this could be done on your end?

Yeah, I'm happy to take that on as I have queries lying around that already do most of what's needed. I'll try to do it in the next week or so and report back here and then we can move it to Wikitech.

odimitrijevic lowered the priority of this task from High to Medium. Jan 6 2022, 3:38 AM
odimitrijevic moved this task from Incoming (new tickets) to Datasets on the Data-Engineering board.

Just noting because I never followed up on this task. I personally would like to just decline this task for a few reasons (my thinking has changed on it) and will if I don't hear anything against declining by the 1st of December:

  • I did some analysis (query below) to try to verify that edit attempts reasonably tracked actual edits and wasn't able to show this. For instance, English Wikipedia sees about 20k edits a day but I was seeing close to 300,000 "users" attempting an edit. That means either the concept of a "user" is really bad or so many more people are attempting edits as compared to completing them that attempting an edit maybe isn't a good filter for balancing privacy and transparency/dataset quality.
  • Since 2021 when this was proposed, we've seen that our approximation of a "user" in the logs -- i.e. IP address + User agent -- is increasingly a bad proxy for an individual person accessing our sites. Internet proxies are growing in usage (putting many people on the same IP address), mobile editing is growing (putting a single editor on many different IP addresses), and user-agents are being made more generic. Thus, I wouldn't want to further hard-code in a reliance on that user proxy given its known issues.
  • Since 2021 we have also put a lot of effort into our differential privacy and dataset publication guidelines, which now feel like a much better approach to making sure we're preserving user privacy when releasing datasets.
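
To make the second point concrete, here is a small sketch of how an "IP address + user agent" fingerprint behaves in the two situations described. The hashing mirrors the general shape of a SHA-256 over the concatenated fields; the user agent string and IP addresses are made-up placeholder values.

```python
import hashlib


def actor_fingerprint(user_agent: str, client_ip: str) -> str:
    """SHA-256 hex digest over user agent + IP, used as a stand-in for a "user"."""
    return hashlib.sha256((user_agent + client_ip).encode("utf-8")).hexdigest()


GENERIC_UA = "Mozilla/5.0 (generic)"  # placeholder, not a real user agent

# Two different people behind one proxy, both with a generic user agent.
proxy_hits = [(GENERIC_UA, "192.0.2.10"), (GENERIC_UA, "192.0.2.10")]

# One mobile editor whose IP address changes between requests.
mobile_hits = [(GENERIC_UA, "198.51.100.7"), (GENERIC_UA, "203.0.113.42")]

shared = {actor_fingerprint(ua, ip) for ua, ip in proxy_hits}
roaming = {actor_fingerprint(ua, ip) for ua, ip in mobile_hits}

print(len(shared))   # 1 -> two people counted as one "user"
print(len(roaming))  # 2 -> one person counted as two "users"
```

Both errors affect any per-actor filter built on this fingerprint: the first drops readers who share a fingerprint with an editor, the second fails to link an editor's pageviews to their edit attempt.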

-- NOTE: see https://www.mediawiki.org/wiki/Manual:Recentchanges_table#rc_type for the action_type clause explanation (edits + new pages)

WITH wikipedia_projects AS (
    SELECT database_code
      FROM canonical_data.wikis
     WHERE database_group = 'wikipedia'
           AND status = 'open' AND visibility = 'public' AND editability = 'public'
),
users_who_edited AS (
    SELECT CONCAT(normalized_host.project, 'wiki') AS wiki_db,
           COUNT(DISTINCT(sha2(CONCAT(user_agent, client_ip), 256))) AS num_editors_attempts_detected
      FROM wmf.webrequest w
     WHERE webrequest_source = 'text'
           AND normalized_host.project_family = 'wikipedia'
           AND year = 2023 AND month = 10 AND day = 15
           AND agent_type = 'user'
           AND access_method <> 'mobile app'
           AND namespace_id = 0
           AND (uri_query LIKE '%action=edit%'
                OR uri_query LIKE '%action=visualeditor%'
                OR uri_query LIKE '%&intestactions=edit&intestactionsdetail=full&uiprop=options%')
     GROUP BY CONCAT(normalized_host.project, 'wiki')
),
num_editors AS (
    SELECT wiki_db,
           COUNT(DISTINCT(user_fingerprint_or_name)) AS num_editors_total
    FROM wmf.editors_daily
    WHERE month = '2023-10'
          AND NOT SIZE(user_is_bot_by) > 0
          AND date = '2023-10-15'
          AND action_type < 2
    GROUP BY wiki_db
)
SELECT database_code AS wiki_db,
       COALESCE(num_editors_total, 0) AS num_editors_total,
       COALESCE(num_editors_attempts_detected, 0) AS num_editors_attempts_detected
  FROM wikipedia_projects wp
  LEFT JOIN users_who_edited u1
       ON (wp.database_code = u1.wiki_db)
  LEFT JOIN num_editors u2
       ON (wp.database_code = u2.wiki_db)
 ORDER BY num_editors_total DESC
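
As a quick worked example of reading the output, the comparison boils down to a per-wiki ratio of the two counts. The helper name is made up, and the inputs below are only the rough English Wikipedia figures quoted in the comment above (about 20k versus close to 300,000), not exact query results.

```python
from typing import Optional


def detection_ratio(num_editors_attempts_detected: int,
                    num_editors_total: int) -> Optional[float]:
    """How many detected edit-attempt "users" there are per recorded editor.

    Returns None when a wiki has no recorded editors, to avoid dividing by zero.
    """
    if num_editors_total == 0:
        return None
    return num_editors_attempts_detected / num_editors_total


print(detection_ratio(300_000, 20_000))  # 15.0
print(detection_ratio(1_200, 0))         # None
```

A ratio near 1 would suggest edit attempts track actual editing reasonably well; a ratio around 15 is what motivates the first bullet above.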