
Decommission the EditConflict instrument
Closed, Resolved (Public)

Description

The schema was marked as "Deprecate" on the EventLogging Schema Audit spreadsheet in September, 2021. Other than the EditConflict instrument itself, there are no references to it that I can find via Codesearch:

TODO

Event Timeline

There are no references to it in any analytics codebases.

Hey, @ori! You're listed as the (a?) maintainer of the EditConflict schema at https://meta.wikimedia.org/wiki/Schema_talk:EditConflict. Are you aware of anything using the data collected by the EditConflict instrument?

phuedx updated the task description.

@phuedx I'm not aware of anything actively using it, no, but I'm also out of the loop -- can you ask someone on the performance team to confirm?

phuedx renamed this task from Decommission the EditConflict instrument to Decommission the EditConflict instrument?. Oct 15 2022, 6:18 AM

The instrumentation was introduced in https://gerrit.wikimedia.org/r/c/mediawiki/extensions/WikimediaEvents/+/139270. It was requested by Erik (presumably Erik Moeller, our then-CPTO, as VP of Engineering and Product). I'm guessing it went through our team because it was early days for EventLogging, which was still effectively owned by Performance, and there was not yet an established practice for who owns what or for instrumenting metrics of interest directly.

Anyway, I too am not aware of any usage. I do know that, operationally, both SRE and developers like myself often monitor edit conflict stats to understand the stability of the edit APIs and of the logic that uses diff3 to merge edits automatically. But MediaWiki core already emits these stats directly, without needing fully detailed events to be stored anywhere.

phuedx renamed this task from Decommission the EditConflict instrument? to Decommission the EditConflict instrument. Oct 17 2022, 10:34 AM

@Krinkle, FYI, deleting schemas without ensuring that all data and code and config that use the schema are gone causes alerts to fire. In this case, the Hive ingestion jobs for EditConflict have been failing all weekend.

No harm done, as clearly we don't want this data, but @BTullis responded and wasn't sure why things were failing.

Change 843492 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/puppet@production] Eventlogging - Stop refining decomissioned EditConflict events

https://gerrit.wikimedia.org/r/843492

Change 843492 merged by Ottomata:

[operations/puppet@production] Eventlogging - Stop refining decomissioned EditConflict events

https://gerrit.wikimedia.org/r/843492

Change 843494 had a related patch set uploaded (by Phuedx; author: Phuedx):

[mediawiki/extensions/WikimediaEvents@master] Hooks: Remove EditConflict instrument

https://gerrit.wikimedia.org/r/843494

Change 843494 merged by jenkins-bot:

[mediawiki/extensions/WikimediaEvents@master] Hooks: Remove EditConflict instrument

https://gerrit.wikimedia.org/r/843494

phuedx claimed this task.

Being bold. Thanks everyone!

> @Krinkle, FYI, deleting schemas without ensuring that all data and code and config that use the schema are gone causes alerts to fire. In this case, the Hive ingestion jobs for EditConflict have been failing all weekend. […]

Meta-Wiki is a publicly editable wiki. My now-dated knowledge of EventLogging is that it was designed around schema revision IDs (or sometimes a name-revId pair), which events provide explicitly in the capsule for that reason. This means that major schema refactorings can be safely drafted on-wiki even if the related code has not yet gone through code review, is not yet finished, or is otherwise not yet deployed to production. That includes, e.g., intentionally creating "bad" revisions for local development or Beta cluster testing. The server only fetches schemas (by revision ID) as related to actually incoming events, and they should only affect those events that relate to a given revision ID.

Perhaps this is an old bug in the server code that we've somehow not noticed or heard about before, but it seems problematic if minor edits like this can cause such operational issues. In any event, I've created empty schema revisions like this for years for unused/archived schemas, to reduce search results and other indications that a schema is still actively referenced or used. Afaik it is not uncommon, due to Wikipedia mirrors or old app installs, that we sometimes receive events months or years after we have undeployed a schema. Hence I assumed that this has not caused issues thus far, given that those events would have come in after an empty schema revision was created on-wiki. Perhaps this is a recent regression?

I'm happy to avoid this for a while or not do it at all. But, as I said, it's a publicly editable wiki, so this will happen from time to time even if I try to remember not to.

> The server only fetches schemas (by revision ID) as related to actually incoming events, and they should only affect those events that relate to a given revision ID.

This was true when every new schema revision was also a new MariaDB table. We now use the latest version of a schema to evolve the underlying datastores so that analysts and others don't have to do complicated UNION SQL queries between different versions of the same table.
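As a rough illustration of why the latest revision now matters, here is a hypothetical Python sketch of evolving a single table from the newest schema. The function name, the schema shape, and the failure mode are assumptions; the thread only states that the Hive ingestion jobs failed, not how.

```python
# Hypothetical sketch: one table per schema, evolved from the latest revision.
def evolve_table(columns, latest_schema):
    """Return the table's columns after merging in the latest schema's fields."""
    fields = latest_schema.get("properties", {})
    if not fields:
        # Assumed failure mode: a blanked schema leaves nothing to map events onto.
        raise ValueError("latest schema revision defines no fields")
    merged = dict(columns)
    for name, spec in fields.items():
        # Only add new columns; existing columns are never dropped or retyped.
        merged.setdefault(name, spec["type"])
    return merged


table = {"pageId": "integer"}
table = evolve_table(
    table,
    {"properties": {"pageId": {"type": "integer"}, "userId": {"type": "integer"}}},
)

try:
    evolve_table(table, {})  # the schema was blanked on-wiki
except ValueError as err:
    print(f"ingestion would fail: {err}")
```

Because every ingestion run consults the newest revision rather than the one named in each event, a blanked revision affects the whole table, not just events that reference it.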

> if minor edits like this can cause such operational issues.

Indeed, and this was one of the reasons we moved schemas off-wiki in Event Platform, so that we could test them in CI and make sure breakages don't happen.

> Perhaps this is a recent regression?

I think we've encountered this before, but I'd have to dig up the times when it happened.

> I'm happy to avoid this for a while or not do it at all.

It's actually okay to do this, but only after fully decommissioning the stream and/or ingestion configs of this data.
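The ordering above could be checked with something like the following Python sketch. The function and the shape of the two config mappings are hypothetical; they do not reflect the real stream or ingestion config formats.

```python
# Hypothetical pre-flight check: is it safe to blank or delete a schema yet?
def blockers_before_blanking(schema, stream_configs, ingestion_jobs):
    """List every config that still references the schema."""
    blockers = [
        f"stream config: {stream}"
        for stream, cfg in stream_configs.items()
        if cfg.get("schema") == schema
    ]
    blockers += [
        f"ingestion job: {job}"
        for job, schemas in ingestion_jobs.items()
        if schema in schemas
    ]
    return blockers


# Made-up configs for illustration only.
streams = {"example.stream": {"schema": "ExampleSchema"}}
jobs = {"example_refine_job": {"ExampleSchema", "OtherSchema"}}

print(blockers_before_blanking("ExampleSchema", streams, jobs))
```

An empty list would mean nothing still depends on the schema, at which point blanking it should no longer trip ingestion alerts.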