Page MenuHomePhabricator

Android Metrics Platform Migration Data Validation - first pass - first 4 tables
Closed, ResolvedPublic

Description

First pass data validation for the following migrated schemas:

in core_interactions schema

  • event.android_product_metrics_article_link_preview_interaction (and android_metrics_platform_article_link_preview_interaction)
  • event.android_product_metrics_article_toolbar_interaction

Results will be tracked here in this ticket.

Event Timeline

Note: Found some discrepancies in schema naming convention between Hue and Superset - I am assuming these are known schema/stream name differences but recording here for reference.

in Hue schema names are:

android_metrics_platform_article_link_preview_interaction
android_metrics_platform_article_toc_interaction
android_metrics_platform_article_toolbar_interaction
android_metrics_platform_find_in_page_interaction

Screenshot 2023-12-19 at 9.29.12 AM.png (398×404 px, 52 KB)

In Superset the same schemas as above are present AND:

android_product_metrics_article_link_preview_interaction
android_product_metrics_article_toc_interaction
android_product_metrics_article_toolbar_interaction
android_product_metrics_find_in_page_interaction

This 'product_metrics' naming corresponds with the schemas/streams listed in https://phabricator.wikimedia.org/T352947 and on https://schema.wikimedia.org/#!/secondary/jsonschema/analytics/mobile_apps/product_metrics, and these were the original schemas I planned to compare to with our existing schemas for data validation. I will split up findings by table so see next comment for first comparison.

Just starting to compare data with our existing MEP data tables, using android_metrics_platform_article_link_preview_interaction, android_product_metrics_article_link_preview_interaction and the original android_article_link_preview_interaction.

  • So far both new tables have significantly less event data than the original android_article_link_preview_interaction so I'm unsure whether either of these is supposed to be live and collecting data or are both just testbeds.
  • Table android_product_metrics_article_link_preview_interaction value for agent.app_install_id is NULL for all rows.
    • Values for user_agent_map are also null
    • Data is also missing any records where event/action_subtype = disabled

First pass daily events/uniques counts data for all 3 article_link_preview_interaction tables: link to data

Can confirm that we currently have multiple active streams with

StreamSchemaTable in Hive
android.product_metrics.article_toc_interactionanalytics/mobile_apps/product_metrics/android_article_toc_interactionandroid_product_metrics_article_toc_interaction
android.metrics_platform.article_toc_interactionanalytics/mediawiki/client/metrics_eventandroid_metrics_platform_article_toc_interaction

It looks like maybe the product_metrics is the one to use and that the metrics_platform refers to the original migration attempt before Metrics Platform underwent significant changes with regards to custom data?

https://meta.wikimedia.org/w/api.php?action=streamconfigs&constraints=destination_event_service=eventgate-analytics-external&streams=android.product_metrics.article_toc_interaction|android.metrics_platform.article_toc_interaction

{
    "streams": {
        "android.product_metrics.article_toc_interaction": {
            "schema_title": "analytics/mobile_apps/product_metrics/android_article_toc_interaction",
            "destination_event_service": "eventgate-analytics-external",
            "producers": {
                "metrics_platform_client": {
                    "events": [
                        "android.metrics_platform.article_toc_interaction"
                    ],
                    "provide_values": [
                        "mediawiki_database",
                        "page_title",
                        "page_content_language",
                        "page_id",
                        "page_namespace_id",
                        "performer_is_logged_in",
                        "performer_session_id",
                        "performer_pageview_id",
                        "performer_language_groups",
                        "performer_language_primary",
                        "performer_groups"
                    ]
                }
            },
            "sample": {
                "unit": "device",
                "rate": 1
            },
            "topic_prefixes": [
                "eqiad.",
                "codfw."
            ],
            "canary_events_enabled": true,
            "consumers": {
                "analytics_hadoop_ingestion": {
                    "job_name": "event_default",
                    "enabled": true
                }
            },
            "stream": "android.product_metrics.article_toc_interaction",
            "topics": [
                "eqiad.android.product_metrics.article_toc_interaction",
                "codfw.android.product_metrics.article_toc_interaction"
            ]
        },
        "android.metrics_platform.article_toc_interaction": {
            "schema_title": "analytics/mediawiki/client/metrics_event",
            "destination_event_service": "eventgate-analytics-external",
            "producers": {
                "metrics_platform_client": {
                    "events": [
                        "android.metrics_platform.article_toc_interaction"
                    ],
                    "provide_values": [
                        "agent_app_install_id",
                        "agent_client_platform",
                        "agent_client_platform_family"
                    ]
                }
            },
            "sample": {
                "unit": "session",
                "rate": 1
            },
            "topic_prefixes": [
                "eqiad.",
                "codfw."
            ],
            "canary_events_enabled": true,
            "consumers": {
                "analytics_hadoop_ingestion": {
                    "job_name": "event_default",
                    "enabled": true
                }
            },
            "stream": "android.metrics_platform.article_toc_interaction",
            "topics": [
                "eqiad.android.metrics_platform.article_toc_interaction",
                "codfw.android.metrics_platform.article_toc_interaction"
            ]
        }
    }
}

Omitting notes on older 'metrics_platform` data - Comparing data tables for android_product_metrics_article_toolbar_interaction and MEP android_article_toolbar_interaction

  • Table android_product_metrics_article_toolbar_interaction value for agent.app_install_id is NULL for all rows.
    • Values for user_agent_map are also null
    • action_subtype values not present for categories, edit_articles, edit_history_from_article, forward,new_tab, notification, share, talk_page, talk_page_from_article, theme

First pass daily events/uniques counts data for all 3: link to data

SNowick_WMF renamed this task from Android Metrics Platform Migration Data Validation - first 4 schemas to Android Metrics Platform Migration Data Validation - first pass - first 4 tables.Dec 19 2023, 8:33 PM
SNowick_WMF updated the task description. (Show Details)

Both tables android_product_metrics_article_toc_interaction and android_product_metrics_find_in_page_interaction look to be not in use currently:

  • android_product_metrics_article_toc_interaction contains 1 row of data with missing values for action, agent.app_install_id and user_agent_map
  • android_product_metrics_find_in_page_interaction contains 2 rows of data with missing values for agent.app_install_id and user_agent_map

Will hold off on any further analysis until after holiday break to get clarification on expected data for these tables.

Adding original Legacy or MEP schema docs/links for anyone looking for events

  • event.android_product_metrics_article_link_preview_interaction

MobileWikiAppLinkPreview Legacy Schema

  • event.android_product_metrics_article_toolbar_interaction

android_article_toolbar_interaction Instrumentation Process and Spec

These 2 have current.yaml files in the mobile_apps directory:

I dug into the client-side behavior to see where/whether these events are being sent, and made the following observations:

  • There are indeed two different streams configured for the same event(s), namely the product_metrics and metrics_platform versions as mentioned above, and both streams are configured to receive the identical event name. (Is that intended to be allowed?)
  • The problem seems to be that the Metrics client library sends the event to both streams at once, but doesn't use the correct schema name when sending them. Rather, it uses the schema name from product_metrics for both streams, so the metrics_platform event is getting rejected because it's part of a different schema.
  • Further, the server behavior in this case is a bit weird:
    • The client library sends the two events as a batch to the v1/events endpoint, and the response from the server is HTTP 207, which I wasn't familiar with. Apparently it means "multi-status", which was an indication to me that one of the events might not be right.
    • And indeed, if we send just a single event to the product_metrics stream, the server returns the usual 201, and if we send a single event to the metrics_platform stream, the server returns 500 (with a helpful explanation that the schema name is incorrect.)

The slight issue I see is that the server probably shouldn't return a 2xx code (technically a "successful" response!) if one of the events is formatted incorrectly. We have logic that alerts us if we format an event incorrectly, but it only works if the server responds with an error. Can the server's behavior be changed to do that?

In order of least to most severe:


A

  • There are indeed two different streams configured for the same event(s), namely the product_metrics and metrics_platform versions as mentioned above, and both streams are configured to receive the identical event name. (Is that intended to be allowed?)

This isn't strictly an issue. The events stanza isn't required for streams but we should remove the event stanzas from the android.product_metrics_* streams as it's not required and its presence is confusing.


B

  • Table android_product_metrics_article_link_preview_interaction value for agent.app_install_id is NULL for all rows.
  • Table android_product_metrics_article_toolbar_interaction value for agent.app_install_id is NULL for all rows.

This is a misconfiguration issue. The stream definitions don't request the value be mixed in to the events being submitted.


C

  • The problem seems to be that the Metrics client library sends the event to both streams at once, but doesn't use the correct schema name when sending them. Rather, it uses the schema name from product_metrics for both streams, so the metrics_platform event is getting rejected because it's part of a different schema.

This is a serious bug, which should be written up and fixed ASAP /cc @WDoranWMF @cjming


I'll submit a patch for A and B momentarily.

The slight issue I see is that the server probably shouldn't return a 2xx code (technically a "successful" response!) if one of the events is formatted incorrectly. We have logic that alerts us if we format an event incorrectly, but it only works if the server responds with an error.

HTTP 207 fits IMO but I can see the case for 400 Bad Request. Regardless of the response code, however, you'd need to inspect the body of the response to determine which events failed validation/were erroneous. I'll write this up as well.

From https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate

the guaranteed producer type is the default for the /v1/events endpoint. To POST an event in hasty mode, set hasty=true in the request query parameters.

Where the "guaranteed" type/mode is

The guaranteed producer will block the HTTP response until the event has been validated and sent to the Kafka brokers with either an ACK response or known failure to ACK.

According to https://www.rfc-editor.org/rfc/rfc4918#section-13:

The default Multi-Status response body is a text/xml or application/xml
HTTP entity with a 'multistatus' root element.  Further elements
contain 200, 300, 400, and 500 series status codes generated during
the method invocation. 100 series status codes SHOULD NOT be recorded
in a 'response' XML element.

Although '207' is used as the overall response status code, the
recipient needs to consult the contents of the multistatus response
body for further information about the success or failure of the
method execution.  The response MAY be used in success, partial
success and also in failure situations.

Assuming that EventGate does respond with XML with details of the multiple responses and since the client is configured to send batches of events, then the logic which handles the response (and alerts the team) should probably be updated to deal with the possibility of multiple responses.

A

  • There are indeed two different streams configured for the same event(s), namely the product_metrics and metrics_platform versions as mentioned above, and both streams are configured to receive the identical event name. (Is that intended to be allowed?)

This isn't strictly an issue. The events stanza isn't required for streams but we should remove the event stanzas from the android.product_metrics_* streams as it's not required and its presence is confusing.


C

  • The problem seems to be that the Metrics client library sends the event to both streams at once, but doesn't use the correct schema name when sending them. Rather, it uses the schema name from product_metrics for both streams, so the metrics_platform event is getting rejected because it's part of a different schema.

This is a serious bug, which should be written up and fixed ASAP /cc @WDoranWMF @cjming

I misremembered. A and C are related. T354199: Java MPC shouldn't broadcast events to multiple streams tracks this issue.

Change 987159 had a related patch set uploaded (by Phuedx; author: Phuedx):

[operations/mediawiki-config@master] Add agent.app_install_id to android.product_metrics.* streams

https://gerrit.wikimedia.org/r/987159

Demo: Single

{
    "meta": {
        "stream": "test.instrumentation"
    },
    "$schema": "/analytics/test/2.0.0",
    "dt": "2024-01-02T16:32:20Z",
    "test_string": "Hello world!"
}
curl -v --json @demo_single.json https://intake-analytics.wikimedia.org/v1/events

HTTP 201

Demo: Multiple, Successful

[
    {
        "meta": {
            "stream": "test.instrumentation"
        },
        "$schema": "/analytics/test/2.0.0",
        "dt": "2024-01-02T16:32:20Z",
        "test_string": "Hello world!"
    },
    {
        "meta": {
            "stream": "test.instrumentation"
        },
        "$schema": "/analytics/test/2.0.0",
        "dt": "2024-01-02T16:32:20Z",
        "test_string": "42"
    }
]
curl -v --json @demo_multi-success.json https://intake-analytics.wikimedia.org/v1/events

HTTP 201

Demo: Multiple, Mixed

[
    {
        "meta": {
            "stream": "test.instrumentation"
        },
        "$schema": "/analytics/test/2.0.0",
        "dt": "2024-01-02T16:32:20Z",
        "test_string": "Hello world!"
    },
    {
        "meta": {
            "stream": "test.instrumentation"
        },
        "$schema": "/analytics/test/2.0.0",
        "dt": "2024-01-02T16:32:20Z",
        "test_string": 42
    }
]
curl -v --json @demo_multi-mixed.json https://intake-analytics.wikimedia.org/v1/events

HTTP 207

Ah! Good thing I checked because the response is application/json content type with the following body:

{
    "invalid": [
        {
            "status": "invalid",
            "event": {
                "meta": {
                    "stream": "test.instrumentation",
                    "id": "ff68d9e0-19e4-425a-9c18-c85a0e2667af",
                    "dt": "2024-01-02T16:43:18.196Z",
                    "request_id": "86509b09-13fa-49c0-9e04-64e1f4d019b4"
                },
                "$schema": "/analytics/test/2.0.0",
                "dt": "2024-01-02T16:32:20Z",
                "test_string": 42,
                "http": {
                    "request_headers": {
                        "user-agent": "curl/8.4.0"
                    }
                }
            },
            "context": {
                "errors": [
                    {
                        "keyword": "type",
                        "dataPath": ".test_string",
                        "schemaPath": "#/properties/test_string/type",
                        "params": {
                            "type": "string"
                        },
                        "message": "should be string"
                    }
                ],
                "errorsText": "'.test_string' should be string"
            }
        }
    ],
    "error": []
}

Hm… Again, from https://www.rfc-editor.org/rfc/rfc4918#section-13:

A Multi-Status response uses one out of two distinct formats for
 representing the status:

 1.  A 'status' element as child of the 'response' element indicates
     the status of the message execution for the identified resource
     as a whole (for instance, see Section 9.6.2).  Some method
     definitions provide information about specific status codes
     clients should be prepared to see in a response.  However,
     clients MUST be able to handle other status codes, using the
     generic rules defined in Section 10 of [RFC2616].

Which reads to me as EventGate needing a logic update for how it formats a multi-status response. @Ottomata, thoughts?

(Or we keep as-is but document that this is the format.)

Them deciding to make multi-status a 2xx instead of a 1xx makes no sense to me, but it is what it is. ¯\_(ツ)_/¯

The slight issue I see is that the server probably shouldn't return a 2xx code (technically a "successful" response!) if one of the events is formatted incorrectly. We have logic that alerts us if we format an event incorrectly, but it only works if the server responds with an error.

On second thought, I'm curious, could you describe the nature of the logic/alerting? EventGate has detailed dashboards and validation error logs. I wonder if it's possible to migrate your alerting to use those or remove it entirely.

Which reads to me as EventGate needing a logic update for how it formats a multi-status response

Interesting! Looks like someone (me) read the summary for 207 response but not the details :)

I don't think we want to make the response XML :p, but perhaps it should be reformatted to match what the spec says (with response and status and href). Or this? https://datatracker.ietf.org/doc/html/draft-nottingham-http-problem

(Or we keep as-is but document that this is the format.)

This is probably the simpler/easier/less intrusive way to go. We should update the EventGate OpenAPI/swagger spec so that 207's example value and schema doc match reality.

mpopov triaged this task as Medium priority.Jan 3 2024, 3:00 PM
mpopov edited projects, added Product-Analytics (Kanban); removed Product-Analytics.
mpopov moved this task from Next 2 weeks to Doing on the Product-Analytics (Kanban) board.

The slight issue I see is that the server probably shouldn't return a 2xx code (technically a "successful" response!) if one of the events is formatted incorrectly. We have logic that alerts us if we format an event incorrectly, but it only works if the server responds with an error.

On second thought, I'm curious, could you describe the nature of the logic/alerting? EventGate has detailed dashboards and validation error logs. I wonder if it's possible to migrate your alerting to use those or remove it entirely.

I'm just referring to how the Client Library catches errors returned by the server and logs them through the default logger, which shows up in LogCat (the Android logger) in real time when we debug our code. When we integrate some new instrumentation into a feature we're building, we will look at LogCat to see if events are being sent successfully (or if errors are being returned) as a basic check of whether things are working as expected.

I'm just referring to how the Client Library catches errors returned by the server and logs them through the default logger, which shows up in LogCat (the Android logger) in real time when we debug our code. When we integrate some new instrumentation into a feature we're building, we will look at LogCat to see if events are being sent successfully (or if errors are being returned) as a basic check of whether things are working as expected.

Hrrm. My read of https://gitlab.wikimedia.org/repos/data-engineering/metrics-platform/-/blob/main/java/src/main/java/org/wikimedia/metrics_platform/EventProcessor.java?ref_type=heads#L100-104 and https://gitlab.wikimedia.org/repos/data-engineering/metrics-platform/-/blob/main/java/src/main/java/org/wikimedia/metrics_platform/EventSenderDefault.java?ref_type=heads#L50-57 is that there should be some log lines appearing in LogCat. Maybe this is a bug /cc @cjming

Right, there will be log lines in case of an unsuccessful response (>=300), but not in the case of 207.

🤦 'course. I'll take a look.

Change 987159 merged by jenkins-bot:

[operations/mediawiki-config@master] Add agent.app_install_id to android.product_metrics.* streams

https://gerrit.wikimedia.org/r/987159

Mentioned in SAL (#wikimedia-operations) [2024-01-08T14:12:57Z] <urbanecm@deploy2002> Started scap: Backport for [[gerrit:987159|Add agent.app_install_id to android.product_metrics.* streams (T353680)]], [[gerrit:982467|Remove partial migration of EditAttemptStep instrument (T351335)]], [[gerrit:982903|Add new stream names to the config variable (T353297)]]

Mentioned in SAL (#wikimedia-operations) [2024-01-08T14:14:43Z] <urbanecm@deploy2002> urbanecm and phuedx and ksarabia and sfaci: Backport for [[gerrit:987159|Add agent.app_install_id to android.product_metrics.* streams (T353680)]], [[gerrit:982467|Remove partial migration of EditAttemptStep instrument (T351335)]], [[gerrit:982903|Add new stream names to the config variable (T353297)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)

Change 988504 had a related patch set uploaded (by Phuedx; author: Phuedx):

[operations/mediawiki-config@master] agent.app_ -> agent_app_ in android.product_metrics.* streams

https://gerrit.wikimedia.org/r/988504

Change 988504 merged by jenkins-bot:

[operations/mediawiki-config@master] agent.app_ -> agent_app_ in android.product_metrics.* streams

https://gerrit.wikimedia.org/r/988504

Mentioned in SAL (#wikimedia-operations) [2024-01-08T14:35:48Z] <urbanecm@deploy2002> Started scap: Backport for [[gerrit:987159|Add agent.app_install_id to android.product_metrics.* streams (T353680)]], [[gerrit:982467|Remove partial migration of EditAttemptStep instrument (T351335)]], [[gerrit:982903|Add new stream names to the config variable (T353297)]], [[gerrit:988504|agent.app_ -> agent_app_ in android.product_metrics.* streams (T353680)]]

Mentioned in SAL (#wikimedia-operations) [2024-01-08T14:37:14Z] <urbanecm@deploy2002> urbanecm and phuedx and ksarabia and sfaci: Backport for [[gerrit:987159|Add agent.app_install_id to android.product_metrics.* streams (T353680)]], [[gerrit:982467|Remove partial migration of EditAttemptStep instrument (T351335)]], [[gerrit:982903|Add new stream names to the config variable (T353297)]], [[gerrit:988504|agent.app_ -> agent_app_ in android.product_metrics.* streams (T353680)]] synce

Mentioned in SAL (#wikimedia-operations) [2024-01-08T14:46:11Z] <urbanecm@deploy2002> Finished scap: Backport for [[gerrit:987159|Add agent.app_install_id to android.product_metrics.* streams (T353680)]], [[gerrit:982467|Remove partial migration of EditAttemptStep instrument (T351335)]], [[gerrit:982903|Add new stream names to the config variable (T353297)]], [[gerrit:988504|agent.app_ -> agent_app_ in android.product_metrics.* streams (T353680)]] (duration: 10m 22s)

Update:

B

  • Table android_product_metrics_article_link_preview_interaction value for agent.app_install_id is NULL for all rows.
  • Table android_product_metrics_article_toolbar_interaction value for agent.app_install_id is NULL for all rows.

This is a misconfiguration issue. The stream definitions don't request the value be mixed in to the events being submitted.

The stream configurations have been updated.

T354199: Java MPC shouldn't broadcast events to multiple streams will be brought into the sprint. @Dbrant's MR was merged and will be bundled with the release that we do as part of that task.