Page MenuHomePhabricator

Investigation: Database - What is engineering recommendation for associating edit with event for MVP? [3 days]
Closed, ResolvedPublic

Description

Background:

Investigate how and where to store the association between an edit and a CampaignEvents event in MediaWiki.

When a user indicates that their edit is related to a campaign event, we need to persist this association in a way that allows efficient querying and reporting later.

We want this data to be able to surface contribution data on pages like the Watchlist, RecentChanges, etc in the future too.

More information on data usage:

What data do we want to collect on edits, both in the short-term and long-term?

  • Shorter-term:
    • Number of articles created through an event
    • Number of articles improved through an event
  • Longer-term (all maybes):
    • Words added
    • Bytes added
    • References added
    • Commons uploads
    • Number of reverts
    • Number of edits by article topic (through LiftWing)
    • Quality score of articles edited/created (through LiftWing)
    • Number of 'thanks' through event edits
    • Number of edits by newcomers (i.e., first edits were through the event)
    • Number of edits by returning editors (i.e., editors who made no edits within X period, such as 30 days or 90 days)

Where do we show the data in the UI, both in the short-term and long-term?

  • Special:EventDetails (Contribution tab - this would be the full breakdown of info)
  • Maybe the event page - I can imagine people wanting the impact data being visible on the actual event page rather than only found through EventDetails - would there be a way to transclude that tab in the future? We don't need to have an answer now, but it is something to consider.
  • Special:Watchlist - a tag/some way of identifying that the edit is from the event, and ideally it would link to the event page/contributions tab
  • Special:RecentChanges - a tag/some way of identifying that the edit is from the event, and ideally it would link to the event page/contributions tab
  • Maybe in the Impact module on the Newcomer Homepage (i.e., expand it to also show data from collaborative contributions)
  • Maybe in optional Echo notifications (for example: "X event has had 50 new articles created in the last 24 hours. Come join it!")

What kinds of reports will people want in the future related to the data?

  • Participant demographics - number of newcomers, number of experienced editors, affiliate membership, aggregate PII data, etc
  • Participant impact - % of participants who made at least one edit during the event, newcomer retention rate (i.e., what % of newcomers continued to edit 30/60/90 days after event)?, number of thanks given through event edits
  • General contribution impact - number of articles created, number of articles improved, number of words added, number of bytes added, number of references added, number of Commons uploads
  • Topical impact - number of edits/articles worked on that were 'topics for impact' (such as gender, underrepresented regions)
  • Edit quality - number of reverts, quality scores related to article quality
  • Reader impact - number of pageviews for articles edited/improved
Acceptance Criteria:
  • This task is focused on identifying:
    • The data structure required (e.g. revision ID and event ID).
    • Any relevant constraints or indexing considerations.
    • Where to store the data
  • Consider what would be the most flexible approach, since we cannot predict everything in advance
    • Consider how we can potentially support many different types of edits (such as a moderator backlog drive that has people reverting edits or deleting pages - example, NPP backlog drives)
  • Consider how we can surface contribution data on pages like the Watchlist, RecentChanges, etc, so that people can see that an edit is a part of an event
  • Consider how we may want to handle reverts
    • We want to display reverted edits rather than having them "disappear" since organizers want to be able to track both legitimate and potentially illegitimate reverts associated with their events so that they can identify gaps in editing knowledge among their participants or potential discrimination/moments to defend edits
  • We may need to check in with Editing team

Event Timeline

ifried renamed this task from Database investigation [3 days] to Investigation: Database - What is engineering recommendation for associating edit with event for MVP? [3 days].Jun 2 2025, 4:30 PM
ifried updated the task description. (Show Details)
ifried updated the task description. (Show Details)
ifried updated the task description. (Show Details)
ifried updated the task description. (Show Details)
ifried updated the task description. (Show Details)
ifried updated the task description. (Show Details)
ifried updated the task description. (Show Details)
ifried updated the task description. (Show Details)
ifried updated the task description. (Show Details)

@ifried asked for feedback on these questions, I will write some initial user stories:
Where do we expect data on collaborative contributions to live on the wikis in the future (i.e., which pages/which experiences)?

  • I would think that there is some kind of "close out" of an event, where an organizer can both publish final data AND do some analysis of it
  • I would expect that there might be a news feed of some sort either for participants, or for the foundation, that highlights these final reports so that its easy to find centrally as a movement.
  • As an organizer, I want to easily be able to embed a widgit that is easy to explore the data in a final affiliate reporting page or in something like a "Archive" of past events.

And what kind of reports do we expect organizers to want related to this collaborative contribution data?

Here are some user stories of hypothetical reports:

  • As a social media manager or communciator for an event, I want to share out a snapshot of the impact of our event that I can easy share to our community channels
  • As a participant in an event, I want to see my role in the context of a larger community event
  • As a coordinator of multiple events as part of a multievent campaign or affiliate, I would like to be able to roll up data into one final report that includes information like # of new people activated, .
  • As an organizer explaining the impact of a project to a partner, I want to show them the gap between the topic list we created together, and the progress we made on both one wiki, and across multiple wikis in the language.
  • As an organizer working on some kind of visualization for my event, I would like to be able to easily display visualizations based on data from Wikidata (i.e. pages from this event were about people from x country)

If I think of more, I will add them (and try to collect more from folks next week in our offsite).

Another user story from reports

  • From an on-wiki or online community perspective, I would like to be able to get a sense of which events have been most productive and least productive against their stated goals in a window of time. One of the challenges that organizers have with communicating their work to the international movement (and one of the challenges), is that the online community selectively sees the inneffective events they have a conflict with, instead of the swarm of effective silent events that are quitely doing good work on the wikis.

About data usage

What data do we want to collect on edits, both in the short-term and long-term?
Metrics already included in the POC and suitable for the MVP:
  • Number of articles created through an event – ✅ Implemented
  • Number of articles improved through an event – ✅ Implemented
  • Words added – It seems doable, but we need more feedback. Defining what counts as a word is hard depending on the language (e.g., Japanese does not use spaces). I was able to make it work even for Japanese using the segmentByWord method in the Language.php class, but this might not be reliable across all languages.
  • Words removed – Same as above.
  • Bytes added – ✅ Implemented
  • Bytes removed – ✅ Implemented
  • References added – Seems possible using regex to count newly added <ref> tags or citation templates. [need feedback on this from folks that know more about it]
  • References removed – Same approach as added references.
  • Commons uploads added – Possible using regex to detect added file links ([[File: or [[Commons:File:). [need feedback on this from folks that know more about it]
  • References removed – Same approach as added references.
  • Commons uploads removed – Same as above.
Metrics that could be added in MVP or early V2:
  • Number of edits by newcomers (i.e., first edits were through the event)
  • Number of edits by returning editors (i.e., editors who hadn’t edited in the last 30 or 90 days)
Metrics that can be considered for V2 or later:
  • Number of reverts
  • Number of edits by article topic (via LiftWing) - need deeper investigation
  • Quality score of articles edited/created (via LiftWing) need deeper investigation
  • Number of 'thanks' given through event edits

Where do we show the data in the UI, both in the short-term and long-term?

  • Special:EventDetails (Contributions tab) – ✅ Should be included in the MVP.
  • Event page – Could include impact data directly on the event page after MVP.
  • Transclusion of Contributions tab – Yes, we can transclude the Contributions tab later. Not needed for MVP.
  • Special:Watchlist – Could show a tag or link indicating edits from events. Needs more investigation. Not recommended for MVP but seems doable.
  • Special:RecentChanges – Similar to Watchlist, a tag linking to the event. Also possible but should come post-MVP.
  • Newcomer Homepage Impact module – Potentially expandable to include event contributions. Needs clearer definition (e.g., show contributions from last X days/events).
  • Echo notifications – e.g., “Event X created 50 articles in the last 24h.” Possible in future but not for MVP.

What kinds of reports will people want in the future related to the data?

  • Participant demographics – Number of newcomers, experienced editors, affiliate membership, etc.
    • Not directly available in Superset. Would require a Jupyter .py script to query and calculate.
  • Participant impact – % of participants who made at least one edit, newcomer retention (30/60/90 days), number of thanks.
    • Possible, but again newcomer retention would not be available in Superset.
  • General contribution impact – Number of articles created/improved, words/bytes added, references/Commons uploads.
    • Possible. Word counting might be tricky for some languages.
  • Topical impact – Edits/articles categorized under themes like gender or underrepresented regions.
    • Could be done using LiftWing in V2.
  • Edit quality – Number of reverts, article quality scores.
    • Would be feasible in V2.
  • Reader impact – Pageviews for edited/improved articles.
    • Not sure how to track this yet; needs more investigation. Could be V2 or V3.

Items in the Acceptance Criteria

This task is focused on identifying:

  • The data structure required (e.g. revision ID and event ID):
    • The basic schema would include the data below:
`
CREATE TABLE /*_*/ce_event_edit_association (
  ceea_id BIGINT UNSIGNED AUTO_INCREMENT NOT NULL,
  ceea_revision_id BIGINT UNSIGNED NOT NULL,
  ceea_page_id BIGINT UNSIGNED NOT NULL,
  ceea_event_id BIGINT UNSIGNED NOT NULL,
  ceea_user_id BIGINT UNSIGNED NOT NULL,
  ceea_is_user_first_edit SMALLINT DEFAULT 0 NOT NULL,
  ceea_last_user_edit_date VARBINARY(14) DEFAULT NULL,
  ceea_bytes_added INT DEFAULT 0 NOT NULL,
  ceea_bytes_removed INT DEFAULT 0 NOT NULL,
  ceea_words_added INT DEFAULT 0 NOT NULL,
  ceea_words_removed INT DEFAULT 0 NOT NULL,
  ceea_references_added INT DEFAULT 0 NOT NULL,
  ceea_references_removed INT DEFAULT 0 NOT NULL,
  ceea_commons_uploads_added INT DEFAULT 0 NOT NULL,
  ceea_commons_uploads_removed INT DEFAULT 0 NOT NULL,
  ceea_edit_type VARBINARY(50) NOT NULL,
  ceea_is_minor SMALLINT DEFAULT 0 NOT NULL,
  ceea_wiki VARBINARY(64) NOT NULL,
  ceea_created_at VARBINARY(14) NOT NULL,
  INDEX ce_event_edit_association_event_id (ceea_event_id),
  INDEX ce_event_edit_association_user_id (ceea_user_id),
  INDEX ce_event_edit_association_event_user (ceea_event_id, ceea_user_id),
  PRIMARY KEY(ceea_id)
) /*$wgDBTableOptions*/;
`
  • Any relevant constraints or indexing considerations:
    • The structure above contains the suggested indexes:
      • INDEX ce_event_edit_association_event_id (ceea_event_id)
        • To efficiently retrieve all contributions associated with a specific event (e.g., when viewing the "Contributions" tab for an event).
      • INDEX ce_event_edit_association_user_id (ceea_user_id)
        • To quickly query all edits a user made as part of events (e.g., for user-specific reports or watchlist tagging).
      • INDEX ce_event_edit_association_event_user (ceea_event_id, ceea_user_id)
        • To efficiently fetch a specific user's contributions within a specific event, commonly used when rendering per-user breakdowns inside a given event context.
  • Where to store the data:
    • Same database (X1) where event data is already stored.

Considerations

  • We want flexibility, since future use cases may evolve.
  • Support for different types of edits, including reverts, deletions, etc. (e.g., moderator backlog drives).
    • This would require additional UI components (e.g., modal appearing on delete/revert actions).
  • Tagging on Watchlist/RecentChanges – We could tag edits based on page ID and our event association DB. Not MVP but feasible.
  • Handling reverts – Needs deeper investigation. Reverted edits should still be shown for tracking purposes, especially to spot gaps in contributor knowledge or potential unfair reverts.

Patch Demo for Tracking Contributions

You can try the early POC I created here:

🔗 Patch demo link

Test users (password: patchdemo1):
  • Patch Demo (admin)
  • Alice
  • Bob
  • Mallory (blocked)

There’s a new “Contributions” tab that shows:

  • A summary of contributions during the event
  • A table of individual contributions

Notes:

  • “User” and “View changes” links in the table do not work (POC only)
  • “Users with first edit” is currently fake data
  • Deleted articles data is also not real in the POC

Try registering to this event, then edit or create an article. Check Special:EventDetails/1 to see updated metrics.

About a question I saw somewhere: Can organizers create worklists + then assign articles, based on it? And how/if can this be used to track activity as well?

We could implement a system where organizers define worklists and optionally assign participants to specific articles. This could be done by introducing something like these two new tables:

ce_event_edit_association_worklist

This table would store a list of target pages for each event. Each row would include:

  • event_id: the ID of the event
  • page_id: the ID of the target page
  • Optional goals per page:
    • target_bytes_min, target_bytes_max
    • target_words_min, target_words_max
    • target_references
    • target_commons_uploads
ce_event_edit_association_worklist_assignments

This table would store assignments between users and pages from the worklist. It could include:

  • event_id
  • page_id
  • user_id
  • assignment_status: (e.g., assigned, declined, completed)
  • Contribution metrics:
    • bytes_added, words_added
    • references_added, commons_uploads_added
    • goal_met: boolean or computed field
Behavior

If a participant edits a page from the worklist, the system could automatically associate the edit with the event—especially if the user was explicitly assigned to that page. This would allow organizers to:

  • Track who contributed to which page
  • See whether the assigned improvement goals were met
  • Assess the success of targeted article improvements

This feature is not part of the MVP but could be considered in a future iteration to enhance contribution tracking and event impact reporting.

Thank you!

  1. Metrics already included in the POC and suitable for the MVP:

(Posted something in T378037#11032755 about references and uploads)

Where do we show the data in the UI, both in the short-term and long-term?

[...]

  • Special:Watchlist – Could show a tag or link indicating edits from events. Needs more investigation. Not recommended for MVP but seems doable.
  • Special:RecentChanges – Similar to Watchlist, a tag linking to the event. Also possible but should come post-MVP.

Although this is for later, it might be a good idea to be clear on our plans w.r.t. integrating this feature with change tags, if we anticipate a need to backfill data. Or maybe not, I don't know.

What kinds of reports will people want in the future related to the data?

  • Participant demographics – Number of newcomers, experienced editors, affiliate membership, etc.
    • Not directly available in Superset. Would require a Jupyter .py script to query and calculate.
  • Participant impact – % of participants who made at least one edit, newcomer retention (30/60/90 days), number of thanks.
    • Possible, but again newcomer retention would not be available in Superset.

Who is "people"? Users or WMF data analysts? If the former, I don't think we could use superset here.

  • The data structure required (e.g. revision ID and event ID):
    • The basic schema would include the data below:
CREATE TABLE /*_*/ce_event_edit_association (

[...]

  • The ceea prefix is already used by the ce_event_address table
  • Since this is storing pages, it is subject to all the caveats we discussed a long time ago when designing the campaign_events table (e.g., T307358). We likely wouldn't be able to display the full page name, if it isn't in the mainspace (should be OK for Wikipedias, but likely not for e.g. Wikisource). (Also begs the question of whether contribution tracking would be enabled on all namespaces)
  • I would move the wiki column higher up, it's really important and it took me a bit to realize it was there
  • I would look into combining some of the columns. Integer quantities could use a single column, e.g. bytes_diff as a signed integer, rather than two unsigned quantities, of which one would always be 0. ceea_edit_type and ceea_is_minor could possibly be combined too.
  • INDEX ce_event_edit_association_event_id (ceea_event_id)
    • To efficiently retrieve all contributions associated with a specific event (e.g., when viewing the "Contributions" tab for an event).

This index is redundant (it's a left prefix of ce_event_edit_association_event_user).

  • INDEX ce_event_edit_association_user_id (ceea_user_id)
    • To quickly query all edits a user made as part of events (e.g., for user-specific reports or watchlist tagging).

Are there actual plans of doing it? If not, we could add the index later on (premature indices actually make performance worse).

We are now beginning the work and have made decisions about technical implementation and the scope for the MVP, so I am marking this work as done. Great job, everyone!