
Document Traffic Datasets in Datahub
Closed, Resolved | Public | 9 Estimated Story Points

Description

As an analyst or any other user of traffic-related datasets, I wish to be able to quickly discover and understand the data, so that I can establish whether a dataset is fit for use or can answer questions that I may have about the data.

Acceptance Criteria:

  • The traffic-related datasets documented on Wikitech are also documented in datahub.wikimedia.org
  • The documentation conforms to the style guidelines outlined in the Data Catalog Documentation Guide

Event Timeline

odimitrijevic added a project: Data-Catalog.

For a reference of datasets that have already been documented, and ones that still need to be added to the data catalog, see: https://docs.google.com/spreadsheets/d/1lyl92MVVhfFPQva_fPMUnXtSCFa3axHliM6eUaSFjNU/edit#gid=812088000

EChetty set the point value for this task to 9.
EChetty moved this task from Ready to In Progress on the Data Pipelines (Sprint 08) board.

I have documented the following datasets and I am awaiting feedback.

  • mediawiki_api_request
  • mobile apps session metrics
  • mobile apps uniques

mediawiki_api_request

  • The [[ mediawiki_api_request entry | https://datahub.wikimedia.org/dataset/urn:li:dataset:(urn:li:dataPlatform:hive,event.mediawiki_api_request,PROD)/Schema?is_lineage_mode=false ]] is lacking information for the normalized host. Not sure why that's the case, given that the rest of the structs are filled in. The other aspects look good. Who would be the best person to fill it in? Who should be assigned as the data owner?
  • Are there links to external documentation that can be added?
  • The description of the field is_wmf_domain explains how it is derived but not what the field means. Is it a boolean that indicates whether the request came from a WMF domain rather than externally, e.g. from a bot or a Toolforge tool?
  • Who should be the owner of this dataset?
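For reference, the DataHub link above embeds a dataset URN of the form `urn:li:dataset:(urn:li:dataPlatform:<platform>,<name>,<env>)`. Below is a minimal sketch of splitting such a URN into its parts and rebuilding it, assuming that shape holds for the other catalog entries as well. The names `DatasetUrn`, `parse_dataset_urn`, and `build_dataset_urn` are hypothetical helpers, not part of any DataHub library.

```python
import re
from typing import NamedTuple


class DatasetUrn(NamedTuple):
    """Parts of a dataset URN (hypothetical helper type)."""
    platform: str
    name: str
    env: str


# Assumed shape, taken from the link above:
# urn:li:dataset:(urn:li:dataPlatform:<platform>,<name>,<env>)
_URN_RE = re.compile(
    r"^urn:li:dataset:\(urn:li:dataPlatform:([^,]+),([^,]+),([^,)]+)\)$"
)


def parse_dataset_urn(urn: str) -> DatasetUrn:
    """Split a dataset URN into platform, name, and environment."""
    match = _URN_RE.match(urn)
    if match is None:
        raise ValueError(f"not a dataset URN of the assumed shape: {urn!r}")
    return DatasetUrn(*match.groups())


def build_dataset_urn(platform: str, name: str, env: str = "PROD") -> str:
    """Inverse of parse_dataset_urn: assemble the URN string."""
    return f"urn:li:dataset:(urn:li:dataPlatform:{platform},{name},{env})"


urn = "urn:li:dataset:(urn:li:dataPlatform:hive,event.mediawiki_api_request,PROD)"
parsed = parse_dataset_urn(urn)
print(parsed.platform, parsed.name, parsed.env)
```

Building the URN from the table name this way may help when checking which of the listed tables already have a catalog entry, since the dataset page URL is the URN appended to `https://datahub.wikimedia.org/dataset/`.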

**[[mobile_apps_session_metrics| https://datahub.wikimedia.org/dataset/urn:li:dataset:(urn:li:dataPlatform:hive,wmf.mobile_apps_session_metrics,PROD)/Schema?is_lineage_mode=false&schemaFilter=]]**

I had a whole bunch of comments here but then noticed it was deprecated. Is that the case? Do we have a process for deleting deprecated datasets after a period of time? I'll go ahead and add a tag. Is there an alternative dataset that should be used? @mpopov @mforns

Is mobile_apps_session_metrics_by_os also deprecated? If not, let's add it to the list of datasets to document. @mpopov @mforns Who would be the dataset owner?

If not indeed deprecated, @JEbe-WMF let's add a note on wikitech too.

mobile_apps_uniques_monthly - looks good

Documented the following datasets and added Wikitech links where applicable:

  • unique devices
  • banner-activity (druid)
  • mobile_apps_session_metrics_by_os

Also made corrections to mediawiki_api_request based on Olja's suggestions above.

Currently working to document the mobile_apps_session_metrics_by_os dataset

@odimitrijevic The Mobile Apps team was the main stakeholder of this metric, but it was us (when we were called Analytics Engineering) who designed and implemented the pipeline.

I shared some historical context in Slack but should document it here too:

The in-app instrumentation has advanced to the point where we don't need this ETL. It was created at a time when EventLogging was on MySQL, so instruments had to use low sampling rates, but that's not a problem anymore. The queries look for the presence of the app install ID, which is only sent with requests (and maybe not even all requests at this point) when the user has opted in to sharing usage data with us. Sifting through webrequests therefore doesn't give us anything that existing instrumentation doesn't already give us.

There's no point in adding these datasets to DataHub, since they're going away in T329310: Deprecate old mobile datasets

Currently only unique_devices_per_domain_monthly has the dataset description.
For unique devices, let's document all of:

  • unique_devices_per_domain_daily
  • unique_devices_per_domain_monthly
  • unique_devices_per_project_family_daily
  • unique_devices_per_project_family_monthly
  • unique_devices_project_wide_daily
  • unique_devices_project_wide_monthly

Please see the Data-Catalog channel on Slack for a question about project_wide vs project_family, as well as a suggestion to use a glossary term for shared documentation.

unique_devices_project_wide_daily and unique_devices_project_wide_monthly have no data and have been marked as deprecated. Ticket to delete: https://phabricator.wikimedia.org/T329978