
Metrics Platform Event custom_data field isn't refined correctly
Closed, Resolved | Public | 1 Estimated Story Points

Description

Background

Thursday, 28th July 2022
Friday, 29th July 2022
select
    custom_data,
    count(*) as n
from
    mediawiki_web_ui_interactions
where
    year = 2022
    and month = 7
    and day = 28
group by
    custom_data
order by
    n desc
limit 10000
;

custom_data                        n
{"data_type":null,"value":null}    339

The Issue

I reached out to @Ottomata and he pointed out that the schema for the custom_data field should be a map type but is currently a schemaless object (a struct). The schema for the field should be like:

custom_data:
  type: object
  propertyNames:
    pattern: ^[$a-z]+[a-z0-9_]*$ # "[P]roperties must be snake_case"
    minLength: 1
    maxLength: 255
  additionalProperties:
    type: object
    properties:
     data_type:
       type: string
       enum:
         - number
         - string
         - boolean
         - null
     value:
       type: string
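
To make the constraints in the proposed schema concrete, here is a minimal sketch that checks a decoded custom_data object against them using only the Python standard library. It is an illustration rather than the validator EventGate uses: the function name `custom_data_errors` and the example key `is_full_width` are hypothetical, and the sketch assumes that `data_type` and `value` are optional because the fragment above lists no required properties.

```python
import re

# Rules copied from the schema fragment above.
PROPERTY_NAME = re.compile(r"[$a-z]+[a-z0-9_]*")
ALLOWED_DATA_TYPES = ("number", "string", "boolean", None)


def custom_data_errors(custom_data):
    """Return a list of problems found; an empty list means the map is valid."""
    if not isinstance(custom_data, dict):
        return ["custom_data must be an object"]
    errors = []
    for name, entry in custom_data.items():
        # propertyNames: snake_case pattern, 1 to 255 characters.
        if not 1 <= len(name) <= 255 or not PROPERTY_NAME.fullmatch(name):
            errors.append(f"{name!r}: property name must be snake_case")
        # additionalProperties: every map value is itself an object.
        if not isinstance(entry, dict):
            errors.append(f"{name!r}: entry must be an object")
            continue
        if "data_type" in entry and entry["data_type"] not in ALLOWED_DATA_TYPES:
            errors.append(f"{name!r}: unsupported data_type")
        if "value" in entry and not isinstance(entry["value"], str):
            errors.append(f"{name!r}: value must be a string")
    return errors


# A map keyed by arbitrary snake_case names passes.
print(custom_data_errors({"is_full_width": {"data_type": "boolean", "value": "true"}}))
# The struct-shaped row from the query result does not: its values are not objects.
print(custom_data_errors({"data_type": None, "value": None}))
```

The second call shows why the refined rows look wrong under the new schema: `data_type` and `value` appear as top-level keys, where the map type expects arbitrary property names whose values carry those two fields.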

This changes the type of an existing field, which is strongly discouraged. Since there are no downstream consumers, however, we should be able to change the type of the field by:

  1. Disabling the instrument on testwiki
  2. Dropping the mediawiki_web_ui_interactions table
  3. Updating the schema
  4. Restarting EventGate
  5. Re-enabling the instrument on testwiki

TODO

Event Timeline

phuedx triaged this task as High priority.
phuedx moved this task from Backlog to Work in Progress on the Metrics-Platform board.

Change 819014 had a related patch set uploaded (by Phuedx; author: Phuedx):

[operations/mediawiki-config@master] Revert "testwiki: Add mediawiki.web_ui.interactions stream"

https://gerrit.wikimedia.org/r/819014

Change 819043 had a related patch set uploaded (by Phuedx; author: Phuedx):

[schemas/event/secondary@master] mediawiki/client/metrics_event: Make custom_data a map type

https://gerrit.wikimedia.org/r/819043

Change 819014 merged by jenkins-bot:

[operations/mediawiki-config@master] Revert "testwiki: Add mediawiki.web_ui.interactions stream"

https://gerrit.wikimedia.org/r/819014

Mentioned in SAL (#wikimedia-operations) [2022-08-02T13:15:56Z] <urbanecm@deploy1002> Synchronized wmf-config/InitialiseSettings.php: a4499e5ac23a0558bed276e2b74134590afc5c95: Revert "testwiki: Add mediawiki.web_ui.interactions stream" (T314151, T311268) (duration: 03m 19s)

I think that in order to delete the existing data we should do the following:

# Drop the hive tables from the hive CLI
drop table event.mediawiki_web_ui_interactions;
drop table event_sanitized.mediawiki_web_ui_interactions;

# Remove the underlying HDFS files
sudo -u analytics kerberos-run-command analytics hdfs dfs -rm -r /wmf/data/event/mediawiki_web_ui_interactions
sudo -u analytics kerberos-run-command analytics hdfs dfs -rm -r /wmf/data/event_sanitized/mediawiki_web_ui_interactions
sudo -u analytics kerberos-run-command analytics hdfs dfs -rm -r /wmf/data/raw/event/eqiad.mediawiki.web_ui.interactions
sudo -u analytics kerberos-run-command analytics hdfs dfs -rm -r /wmf/data/raw/event/codfw.mediawiki.web_ui.interactions

Once we have done that, we can:

  1. merge the schema change
  2. trigger a new build of eventgate using the new SHA of the secondary schema repo
  3. deploy the new eventgate image and restart all eventgate services

When those are done, the instrument can be re-enabled in the mediawiki configuration.

@Milimetric, @Ottomata - does this all look OK to you? Have I missed anything obvious?

The order is correct! However, in this case you don't need step 2, and you only need to bounce the eventgate-analytics-external service.

Since this is an analytics instrumentation stream, eventgate-analytics-external dynamically loads and caches the schemas and stream configs.

So:

  1. Drop tables and delete underlying data.
  2. Merge the schema change (and make sure Puppet runs on schema[12]00* to deploy it).
  3. Bounce eventgate-analytics-external service: https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate/Administration#Roll_restart_all_pods
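
As a rough illustration of why the service needs bouncing after an in-place schema change, the sketch below models a process that loads schemas on demand and caches them in memory. It is not EventGate's implementation: the `SchemaCache` class, the schema URI, and the repository contents are all made up for the example, and it assumes that cached entries are never refreshed while the process is running.

```python
class SchemaCache:
    """Hypothetical load-once cache; a stand-in for a long-running service."""

    def __init__(self, fetch):
        self._fetch = fetch    # callable: schema URI -> schema
        self._schemas = {}     # lives only as long as the process does

    def get(self, uri):
        # Fetch on first use, then always serve the cached copy.
        if uri not in self._schemas:
            self._schemas[uri] = self._fetch(uri)
        return self._schemas[uri]


# Stand-in for the schema repository; the URI and contents are invented.
URI = "/example/schema/1.0.0"
repository = {URI: {"custom_data": "struct"}}


def fetch(uri):
    return dict(repository[uri])


cache = SchemaCache(fetch)
before = cache.get(URI)                    # loads and caches the old schema

repository[URI] = {"custom_data": "map"}   # backwards-incompatible change in place
stale = cache.get(URI)                     # still the old schema

cache = SchemaCache(fetch)                 # a restart begins with an empty cache
fresh = cache.get(URI)                     # now sees the new schema

print(before, stale, fresh)
```

Under these assumptions the running process keeps serving the old definition until it is restarted, which matches the advice to bounce only the service that caches the schema.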

Thanks @Ottomata - Could you clarify for me please?

Since this is an analytics instrumentation stream...

As opposed to what? I don't really understand primary vs secondary here either.

Thanks.

EChetty set the point value for this task to 1. (Thu, Aug 4, 11:53 AM)

Mentioned in SAL (#wikimedia-analytics) [2022-08-04T18:31:50Z] <ottomata> dropping medawiki_web_ui_interactions hive tables and data - T314151

Change 819043 merged by Ottomata:

[schemas/event/secondary@master] mediawiki/client/metrics_event: Make custom_data a map type

https://gerrit.wikimedia.org/r/819043

Mentioned in SAL (#wikimedia-operations) [2022-08-04T19:02:15Z] <ottomata> roll-restarting eventgate-analytics-external to pick up backwards incompatible schema change - T314151

# Drop the hive tables from the hive CLI
drop table event.mediawiki_web_ui_interactions;
drop table event_sanitized.mediawiki_web_ui_interactions;

# Remove the underlying HDFS files
sudo -u analytics kerberos-run-command analytics hdfs dfs -rm -r /wmf/data/event/mediawiki_web_ui_interactions
sudo -u analytics kerberos-run-command analytics hdfs dfs -rm -r /wmf/data/event_sanitized/mediawiki_web_ui_interactions
sudo -u analytics kerberos-run-command analytics hdfs dfs -rm -r /wmf/data/raw/event/eqiad.mediawiki.web_ui.interactions
sudo -u analytics kerberos-run-command analytics hdfs dfs -rm -r /wmf/data/raw/event/codfw.mediawiki.web_ui.interactions

19:01:24 [@deploy1002:/home/otto] 1 $ cd /srv/deployment-charts/helmfile.d/services/eventgate-analytics-external

helmfile -e staging --state-values-set roll_restart=1 sync
helmfile -e codfw --state-values-set roll_restart=1 sync
helmfile -e eqiad --state-values-set roll_restart=1 sync

Done!