Page MenuHomePhabricator

Fix Hive event.development_network_probe table
Closed, ResolvedPublic

Description

The event schema change made in T398596: Consider using the alternate chain of Google Trust Services certificates encountered this problem:

https://wikitech.wikimedia.org/wiki/Event_Platform/Schemas/Guidelines#Complex_array_element_and_map_value_type_evolution_is_not_well_supported

If we had done T366487: Event Platform schemas should not support type changes to structs as array element or map value types, this would have been caught by CI.

To resolve and make sure the Hive event.development_network_probe table will continue to work, we need to do some manual changes to the table.

The exact procedure will need some investigation. Perhaps we can just drop and allow the Refine system to recover? Or rename the old one and create allow Refine to create the new table? Or manually alter the new table correctly? Or Allow the system to create the new table and then insert all the old data?

Relevant Slack thread


The applied fix was to rename the event table to event.development_network_probe_T400360_original, and allow Refine to recreate the a new event.development_network_probe table that will work with the new schema. The old data is only in the old table.

Done is

  • event.development_network_probe data can be queried correctly
  • Historical data still queryable somehow
  • event.development_network_probe_T400360_original table and data is deleted by 2025-10-27 (reminder message is scheduled for #data-engineering-team slack channel on that date).

Event Timeline

FYI this is causing data-engineering-alerts email spam every hour.

FYI, the Refine Yarn application id that evolved the event.development_network_probe table is , and the ALTER that was run is: application_1750705250302_912703

ALTER TABLE event.development_network_probe
CHANGE COLUMN `reports` `reports` ARRAY<STRUCT<`pulses`: ARRAY<STRUCT<`actual_bytes`: DOUBLE, `dns_time_ms`: DOUBLE, `duration_ms`: DOUBLE, `pulse_identifier`: STRING, `pulse_number`: DOUBLE, `redirect_time_ms`: DOUBLE, `request_time_ms`: DOUBLE, `response_time_ms`: DOUBLE, `status_code`: DOUBLE, `tcp_time_ms`: DOUBLE, `tls_time_ms`: DOUBLE, `transfer_bytes`: DOUBLE, `ttfb_time_ms`: DOUBLE, `host`: STRING COMMENT 'Hostname that handled the pulse'>>, `target_name`: STRING, `target_url`: STRING>>

I spent a little time today trying to figure out the right thing to do.

If @CDanis etc. is okay with missing some historical data, I think we should:

  • drop the event.development_network_probe table
  • re-refine all existent raw event data.

I believe this would preserve the past 60(?) days of data, as well as keep the newly added host field in 1.1.0 of the schema.

We should test that this works before we do it. To do so we need to manually run the appropriate Refine Spark command, but with a different destination Hive database on a certain hourly partition. I would try to refine /wmf/data/raw/event/eqiad.development.network.probe/year=2025/month=07/day=23/hour=11 and then run the following SQL on the test refined table:

select reports[0] from <dbname>.development_network_probe_test_T400360 where year=2025 and month=7 and day=29 and hour=0 limit 1;

Success would mean that the reports[].pulses[].status_code field is not e.g. 2.0 (probably it will be null?) and tcp_time_ms is not always/often 200.0, but some other sensible ms number.

If this works, then we can re-refine the event.development_network_probe table with all raw data.

@mforns whatcha think?

I believe this would preserve the past 60(?) days of data, as well as keep the newly added host field in 1.1.0 of the schema.

This would be fine. Even just preserving one month of history would be acceptable.

@Ottomata Sounds good! If the record is formatted correctly we can assume the re-recreation of the table from scratch will work fine.

Testing.

Ran

spark3-submit --driver-cores 1 --master 'local[1]' --conf spark.yarn.maxAppAttempts=1     --conf write.spark.accept-any-schema=true --conf spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2     --conf spark.executor.memoryOverhead=384MB --conf spark.dynamicAllocation.maxExecutors=1   --executor-cores 1 --executor-memory 2G --driver-memory 3G  --name refine_to_hive_hourly__refine_hive_dataset.evolve_and_refine_to_hive_hourly__2025072311__116_development_network_probe_TEST_T400360     --class org.wikimedia.analytics.refinery.job.refine.cli.EvolveAndRefineToHiveTable     --queue launchers --deploy-mode client hdfs:///wmf/cache/artifacts/airflow/analytics/refinery-job-0.2.67-shaded.jar     --table otto.development_network_probe_TEST_T400360_00 --schema_uri /development/network/probe/latest     --location hdfs://analytics-hadoop/user/otto/external/event/development_network_probe_T400360_00 --partition_columns     datacenter:STRING,year:LONG,month:LONG,day:LONG,hour:LONG --transform_functions     org.wikimedia.analytics.refinery.job.refine.remove_canary_events,org.wikimedia.analytics.refinery.job.refine.deduplicate,org.wikimedia.analytics.refinery.job.refine.geocode_ip,org.wikimedia.analytics.refinery.job.refine.parse_user_agent,org.wikimedia.analytics.refinery.job.refine.add_is_wmf_domain,org.wikimedia.analytics.refinery.job.refine.add_normalized_host,org.wikimedia.analytics.refinery.job.refine.normalizeFieldNamesAndWidenTypes     --input_paths /wmf/data/raw/event/eqiad.development.network.probe/year=2025/month=07/day=23/hour=11,/wmf/data/raw/event/codfw.development.network.probe/year=2025/month=07/day=23/hour=11     --partition_paths datacenter=eqiad/year=2025/month=7/day=23/hour=11,datacenter=codfw/year=2025/month=7/day=23/hour=11     --spark_job_scale small --ignore_missing_paths False
25/07/29 15:50:12 INFO HiveExtensions: Running Hive DDL statement:
CREATE EXTERNAL TABLE otto.development_network_probe_TEST_T400360_00 (
`_schema` STRING,
`ctx` MAP<STRING, STRING>,
`dt` STRING,
`http` STRUCT<`client_ip`: STRING, `has_cookies`: BOOLEAN, `method`: STRING, `protocol`: STRING, `request_headers`: MAP<STRING, STRING>, `response_headers`: MAP<STRING, STRING>, `status_code`: BIGINT>,
`meta` STRUCT<`domain`: STRING, `dt`: STRING, `id`: STRING, `request_id`: STRING, `stream`: STRING, `uri`: STRING>,
`recipe_name` STRING,
`recipe_type` STRING,
`reports` ARRAY<STRUCT<`pulses`: ARRAY<STRUCT<`actual_bytes`: DOUBLE COMMENT 'Actual data size in bytes (decodedBodySize)', `dns_time_ms`: DOUBLE COMMENT 'DNS resolution time in ms (domainLookupEnd - domainLookupStart)', `duration_ms`: DOUBLE COMMENT 'Duration of the request in ms (duration)', `host`: STRING COMMENT 'Hostname that handled the pulse', `pulse_identifier`: STRING COMMENT 'Alphanumeric string used to uniquely identify each pulse', `pulse_number`: DOUBLE COMMENT 'Pulse number', `redirect_time_ms`: DOUBLE COMMENT 'Redirection time in ms (redirectStart - redirectEnd)', `request_time_ms`: DOUBLE COMMENT 'Request time in ms (responseStart - requestStart)', `response_time_ms`: DOUBLE COMMENT 'Response time in ms (responseEnd - responseStart)', `status_code`: DOUBLE COMMENT 'Status of the request (responseStatus)', `tcp_time_ms`: DOUBLE COMMENT 'TCP connection time in ms (secureConnectionStart - connectStart)', `tls_time_ms`: DOUBLE COMMENT 'TLS handshake time in ms (connectEnd - secureConnectionStart)', `transfer_bytes`: DOUBLE COMMENT 'Transfer data size in bytes (encodedBodySize)', `ttfb_time_ms`: DOUBLE COMMENT 'Time to first byte in ms (responseStart - startTime)'>> COMMENT 'The list containing details of all pulses. A pulse represents a single request to the target_url.', `target_name`: STRING COMMENT 'Name of the target. Example eqiad, eqsin, etc', `target_url`: STRING COMMENT 'URL of the target. Example https://measure-eqiad.wikimedia.org/measure'>>,
`geocoded_data` MAP<STRING, STRING>,
`user_agent_map` MAP<STRING, STRING>,
`is_wmf_domain` BOOLEAN,
`normalized_host` STRUCT<`project_class`: STRING, `project`: STRING, `qualifiers`: ARRAY<STRING>, `tld`: STRING, `project_family`: STRING>
)
PARTITIONED BY (
`datacenter` STRING,
`year` BIGINT,
`month` BIGINT,
`day` BIGINT,
`hour` BIGINT
)
STORED AS PARQUET
LOCATION 'hdfs://analytics-hadoop/user/otto/external/event/development_network_probe_T400360_00'


...


25/07/29 15:51:55 INFO RefineHiveDataset: Successfully refined 4166 rows from WrappedArray(/wmf/data/raw/event/eqiad.development.network.probe/year=2025/month=07/day=23/hour=11, /wmf/data/raw/event/codfw.development.network.probe/year=2025/month=07/day=23/hour=11) into `otto`.`development_network_probe_test_t400360_00` datacenter="codfw",year=2025,month=7,day=23,hour=11 at hdfs://analytics-hadoop/user/otto/external/event/development_network_probe_T400360_00/development_network_probe_TEST_T400360_00/datacenter=codfw/year=2025/month=7/day=23/hour=11
select reports[0] from `otto`.`development_network_probe_test_t400360_00` where year=2025 and month=7 and day=23 and hour=11 limit 1;
-- formatted reports field

{
  "pulses": [
    {
      "actual_bytes": 0,
      "dns_time_ms": 169,
      "duration_ms": 609,
      "host": null,
      "pulse_identifier": "ea82108fccd4efa923d4_esams_0",
      "pulse_number": 0,
      "redirect_time_ms": 0,
      "request_time_ms": 143,
      "response_time_ms": 3,
      "status_code": 200,
      "tcp_time_ms": 137,
      "tls_time_ms": 149,
      "transfer_bytes": 0,
      "ttfb_time_ms": null
    },
    {
      "actual_bytes": 0,
      "dns_time_ms": 0,
      "duration_ms": 147,
      "host": null,
      "pulse_identifier": "ea82108fccd4efa923d4_esams_1",
      "pulse_number": 1,
      "redirect_time_ms": 0,
      "request_time_ms": 134,
      "response_time_ms": 2,
      "status_code": 200,
      "tcp_time_ms": 0,
      "tls_time_ms": 0,
      "transfer_bytes": 0,
      "ttfb_time_ms": null
    },
    {
      "actual_bytes": 0,
      "dns_time_ms": 0,
      "duration_ms": 154,
      "host": null,
      "pulse_identifier": "ea82108fccd4efa923d4_esams_2",
      "pulse_number": 2,
      "redirect_time_ms": 0,
      "request_time_ms": 135,
      "response_time_ms": 3,
      "status_code": 200,
      "tcp_time_ms": 0,
      "tls_time_ms": 0,
      "transfer_bytes": 0,
      "ttfb_time_ms": null
    }
  ],
  "target_name": "esams",
  "target_url": "https://measure-esams.wikimedia.org/measure"
}

Looks like it worked!

@mforns and I worked on this today, and it turned out that backfilling this data would be quite difficult. RefineToHiveDataset (the new Refine on Airflow CLI) only works with one or an explicit list of partitions at a time; it will not allow us to target a table and range of dates to backfill.

We could have scripted together a series of RefineToHiveDataset commands. Doing it hourly for 90 days would have been 2160 individual Spark jobs, or scripting together batches of commands that specified a series (daily of 24 partitions maybe?) of commands to run. We may have been able to reduce it to 90 or maybe fewer jobs, but the CLI would get quite long: every partition must be listed explicitly.

We could have used airflow somehow, but that would have been trickier. We can't backfill individual mapped tasks. We could clear and rerun mapped tasks that already existed in Airflow (since the Refine on Airflow migration was complete), but that is only a couple of weeks worth of data.

We asked Chris if he would mind if we renamed his existent table and just allowed Refine to re-create the table and start filling it with new data going forward. Old data would only exist in the renamed table. He said that was fine, so we did it!

We first tested on a fake table. The procedure we used for the real table is below.


# Move the original table file data to a new location
sudo -u analytics kerberos-run-command analytics hdfs dfs -mv /wmf/data/event/development_network_probe /wmf/data/event/development_network_probe_T400360_original
-- Rename the original table
ALTER TABLE `event`.`development_network_probe` RENAME TO `event`.`development_network_probe_T400360_original`;

-- Alter the external location of the original (now renamed) table to the new location
ALTER TABLE `event`.`development_network_probe_T400360_original` SET LOCATION 'hdfs://analytics-hadoop/wmf/data/event/development_network_probe_T400360_original';

-- Drop all partitions from the original (now renamed) table
-- NOTE: This must be run via Hive CLI; it will not work in Spark SQL.
ALTER TABLE `event`.`development_network_probe_T400360_original` DROP IF EXISTS PARTITION (year<>0); 
-- Time taken: 44.678 seconds

-- Repair the table to get partitions re-added but now pointing at the new location
MSCK REPAIR TABLE `event`.`development_network_probe_T400360_original`;
-- Time taken: 36.069 seconds, Fetched: 4353 row(s)


-- Verify that partitions are added at the new location and data can be selected
SHOW PARTITIONS `event`.`development_network_probe_T400360_original`;

-- day=10 should have good data, e.g. status_code makes sense
select reports[0] from `event`.`development_network_probe_T400360_original` where year=2025 and month=7 and day=10 and hour=11 limit 1;

-- day=23 will have 'bad data' because it was after the 1.1.0 schema change was applied
select reports[0] from `event`.`development_network_probe_T400360_original` where year=2025 and month=7 and day=23 and hour=11 limit 1;

Check that refine will create a new table with incoming data. The airflow mapped task that handled this was:
https://airflow.wikimedia.org/dags/refine_to_hive_hourly/grid?dag_run_id=scheduled__2025-07-29T16%3A00%3A00%2B00%3A00&task_id=refine_hive_dataset.evolve_and_refine_to_hive_hourly&tab=details&map_index=116

-- Check that the newly created table (with the original table name) has good data
select reports[0].pulses[0] from `event`.`development_network_probe` WHERE datacenter="codfw" AND year=2025 AND month=7 AND day=29 AND hour=16 and reports[0].pulses[0].status_code IS NOT NULL limit 1;

-- Looks good!  status_code: 200.0
{"actual_bytes":0.0,"dns_time_ms":134.0,"duration_ms":635.0,"host":null,"pulse_identifier":"6d90e6a5ca1b67ed191c_esams_0","pulse_number":0.0,"redirect_time_ms":0.0,"request_time_ms":161.0,"response_time_ms":1.0,"status_code":200.0,"tcp_time_ms":163.0,"tls_time_ms":171.0,"transfer_bytes":0.0,"ttfb_time_ms":null}

So status:

  • old data is in event.development_network_probe_T400360_original; Data in this table since the schema change was merged (2025-07-23T11:00:00 UTC ?) is bad; IIUC, the fields were ordered incorrectly which causes them to read different parquet data into the wrong fields.
  • new data since today 2025-07-29T16:00:00 UTC should be good to go in the event.development_network_probe table. This is using the new 1.1.0 schema with the host field, so @CDanis @Vgutierrez you should be good to produce data that matches your new schema.

90 days from now (or whenever @CDanis and @Vgutierrez don't need the old data anymore), we should drop the renamed table and data.

I'll update the task description, and maybe set a scheduled reminder message in slack? :)

I'll update the task description, and maybe set a scheduled reminder message in slack? :)

That sounds good, thanks Andrew.

I'd also like to know -- as users of Event Platform, is there any way we could have known this was going to be such an ordeal before changing the schema? Should we re-prioritize T366487? Are there any other followup Event Platform engineering actionables here?

Yes, we should prioritize T366487. We've just moved it back into this quarter's work.

This would have prevented you from making the schema change in CI.

Apologies for the trouble!

From an-launcher1003:

$ sudo -u analytics kerberos-run-command analytics spark3-sql
DROP TABLE `event`.`development_network_probe_T400360_original`
$ sudo -u analytics kerberos-run-command analytics hdfs dfs -rm -r /wmf/data/event/development_network_probe_T400360_original