
Druid access for view on event.editattemptstep
Closed, Resolved (Public)

Description

This is a request to provide Druid access via Superset/Turnilo to edit attempts as inferred from event.editattemptstep, particularly with the ability to identify saves (which are server-sourced events, not client-sourced like the other events in this funnel) along with some key dimensions.

It would be great to have the geo-mapped country as well, so we would be rather pleased to be a trial candidate for T208589: [HiveToDruid] Add support for ingesting subfields of map columns, but we understand we may need to wait on that part. Happy to iterate from a version 1 of this to a version 2.

Event Timeline

Assuming this is about Analytics; feel free to correct.

Change 587984 had a related patch set uploaded (by Dr0ptp4kt; owner: Dr0ptp4kt):
[operations/puppet@production] WIP: Add Druid support for event.editattemptstep

https://gerrit.wikimedia.org/r/587984
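
For readers unfamiliar with druid_load.pp: that manifest declares one ingestion job per EventLogging schema loaded into Druid. Below is a minimal sketch of the kind of entry this patch adds, assuming the define name and job_config keys match the neighboring jobs in that file; the dimension list is illustrative, not the literal patch contents:

```
# Sketch only; see the Gerrit change above for the real entry.
profile::analytics::refinery::job::eventlogging_to_druid_job { 'editattemptstep':
    job_config => {
        # Flat event fields become Druid dimensions, conventionally prefixed
        # with event_; this particular field list is an assumption.
        dimensions => 'event_action,event_editor_interface,event_integration,event_platform,event_is_oversample,wiki',
        # A v2 could add map subfields (e.g. a geocoded country) once
        # T208589 lands; no syntax for that is assumed here.
    },
}
```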

Nuria added subscribers: mforns, Nuria.

Assigning to @fdans; please work with @mforns to test ingestion, it should be pretty straightforward.

Hold on, because I think @dr0ptp4kt will be able to use Superset to view this data. @dr0ptp4kt, let us know otherwise.

Milimetric moved this task from Incoming to Ops Week on the Analytics board.
Milimetric moved this task from Next Up to Paused on the Analytics-Kanban board.

We'd like to be able to use both Turnilo and Superset.

As an aside, for Superset, what are the steps to ensure that a Presto-backed result set allows for "self-updating" dashboards? In Superset it would be ideal to avoid having to re-materialize data in order for visualizations to performantly display the latest data.

> We'd like to be able to use both Turnilo and Superset.

Then let's just go ahead with this task and ingest the data into Druid; there is no need to have two dashboards on top of the same data stream with two different data connectors.

Looked at the data and it looks good. If @dr0ptp4kt sees no issues (let's give him a day to respond), then let's merge. cc @kaldari

Looking good to me. Looking forward to the fuller data!

I'm subscribed on T208589: [HiveToDruid] Add support for ingesting subfields of map columns for any next step on a v2 of this.

@dr0ptp4kt to be super clear: the data ingested will look as it does in the link Francisco provided, as we do not have plans to work on the map-columns ingestion soon.

Looking great so far. Would it be possible to add a description for this dashboard in Turnilo (similar to the other dashboards), something like: "Sampled eventlogging of the non-API editing interfaces"? That way people can tell the difference between it and the edits_hourly dashboard. Speaking of, does anyone know where the data for the edits_hourly Turnilo dashboard comes from?

To answer the question above, it looks like the data in the edits_hourly dashboard comes from the database and mostly relies on revision_tags.

@kaldari edits_hourly is a denormalized version of the data in MediaWiki, updated monthly (the denormalization takes a couple of days of processing time). Once the data is in Turnilo it will be called event_<schema>, which is the convention for EventLogging ingestion (so this dataset would appear as event_editattemptstep). We can put a link to the schema, and every field can be explained that way.

Note that this data is only available for the last 90 days.

The longer-retained fields for this schema are here: https://github.com/wikimedia/analytics-refinery/blob/master/static_data/eventlogging/whitelist.yaml#L140

@kaldari hi! Can you confirm that the 90-day limitation works for y'all? If you require data from before that, the required fields must be part of the whitelist that @Nuria linked above, and the database should be changed to event_sanitized.

@kaldari should weigh in on the 90 day window.

@fdans, to clarify, there are two pieces here, right?

  1. Initial import: if we want anything from more than 90 days before the initial import (e.g., 2019, 2018, ...), then we'd need to update druid_load.pp to set the database field to event_sanitized, because that database has the data whereas event does not (see the sketch after this list).
  2. After the initial import: all new data would just get ingested daily, so from the point of the initial import onward, the data would keep rolling into Druid and be available in Turnilo and Superset. Is that right, or would non-whitelisted fields' values be deleted over time as well if we use the implicit event database instead of event_sanitized? It's really just the four UA fields, I think, that are of potentially longer-term historical interest (they're not in event_sanitized.editattemptstep as of today, so looking back in time at those isn't interesting anyway since we don't have the data, but going forward it would be helpful!).
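
To make item 1 above concrete, here is a hedged sketch of the kind of override that would point the job at sanitized history. The database key is an assumption about what job_config accepts, modeled on the sketch earlier in this thread, not a confirmed parameter:

```
profile::analytics::refinery::job::eventlogging_to_druid_job { 'editattemptstep':
    job_config => {
        database   => 'event_sanitized',  # assumption: the default source is 'event'
        dimensions => 'event_action,event_editor_interface,event_integration,event_platform,event_is_oversample,wiki',
    },
}
```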

All of this said, Kaldari, are there any additional fields you want picked up? Maybe webhost, in case we want to further disambiguate (e.g., because of T249944: WikiEditor records all edits as platform = desktop in EventLogging) or otherwise be able to filter or pivot the data simply that way?

@fdans - Yes, the 90 day limit works fine for me.

@dr0ptp4kt - webhost would be very useful for mitigating the T249944 bug. I can't think of any other fields that would be needed.

So, to be clear, this dataset will only have the last 90 days of data. cc @kaldari @dr0ptp4kt

Cool. I'll update the patch to add webhost, and 90 days is good.

Patch is ready for review and deploy.

Change 587984 merged by Elukey:
[operations/puppet@production] Add Druid support for event.editattemptstep

https://gerrit.wikimedia.org/r/587984

@fdans I think a manual ingest might be needed to get the whole 90-day span.

Also a ping to @dr0ptp4kt: you did not add webhost to the whitelist; it should probably be there.

@Nuria I added webhost to druid_load.pp for this narrower treatment; see https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/587984/6..8/modules/profile/manifests/analytics/refinery/job/druid_load.pp at the far right side of the diff.
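
In other words, the change amounts to appending webhost to the job's dimension list, roughly as sketched below (illustrative, not the literal diff):

```
profile::analytics::refinery::job::eventlogging_to_druid_job { 'editattemptstep':
    job_config => {
        # webhost appended as a plain top-level dimension, alongside the
        # event_-prefixed fields (field list is illustrative):
        dimensions => 'event_action,event_editor_interface,event_integration,event_platform,event_is_oversample,webhost,wiki',
    },
}
```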

I believe it's also part of event_sanitized.editattemptstep in https://github.com/wikimedia/analytics-refinery/blob/master/static_data/eventlogging/whitelist.yaml#L140 .

Do I have that right?

Ah, yes. I had misunderstood @kaldari's request. OK, the only thing left to do here is to run the reindexing, since there is data available.

Ping @fdans: we need to reindex from the beginning of the data in the event database; currently the data spans only a week.

Just ran the ingestion job for data from Feb 20 to Apr 18.