Thu, Dec 7
Decommissioning EventLogging would be EPIC!
Wed, Dec 6
Tue, Dec 5
Can the header be translated into an x-analytics value?
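If the header does get folded into X-Analytics, the translation itself is mechanical. A minimal sketch, assuming the usual X-Analytics shape of semicolon-separated `key=value` pairs; the key name used here is a placeholder, not the actual header in question:

```python
def add_to_x_analytics(x_analytics: str, key: str, value: str) -> str:
    """Fold a key=value pair into an X-Analytics header string.

    Assumes X-Analytics is a semicolon-separated list of key=value
    pairs (e.g. "ns=0;page_id=123"). The key below is hypothetical;
    substitute whatever the incoming header maps to.
    """
    pairs = dict(p.split("=", 1) for p in x_analytics.split(";") if "=" in p)
    pairs[key] = value
    return ";".join(f"{k}={v}" for k, v in pairs.items())

# Hypothetical usage: tag a request with a placeholder flag.
merged = add_to_x_analytics("ns=0;page_id=123", "prefetch", "1")
```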
Sat, Dec 2
A few questions:
- While we ought to consider an upgrade for all 4 clusters, from what I understand Jumbo can be upgraded independently. Are there any concerns with that approach?
- What are the upgrade considerations for Kafka clients?
- Specifically, are there clients that publish to Kafka Jumbo directly, or do all Kafka topics get mirrored from main (possibly logging?)?
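One way to get a first answer to the direct-publish question is to look at topic names. This sketch assumes the convention that mirrored topics carry a datacenter prefix such as `eqiad.` or `codfw.`, which is an assumption to verify against the actual cluster, not a confirmed inventory of clients:

```python
# Assumption: topics mirrored from the main clusters are prefixed with a
# datacenter name; anything unprefixed would have been produced directly.
DC_PREFIXES = ("eqiad.", "codfw.")

def is_mirrored(topic: str) -> bool:
    """Heuristic: treat DC-prefixed topics as mirrored from main."""
    return topic.startswith(DC_PREFIXES)

# Example topic names (illustrative, not an actual topic listing).
topics = ["eqiad.mediawiki.api-request", "webrequest_text"]
direct = [t for t in topics if not is_mirrored(t)]
```

Topics left in `direct` would be the candidates whose producers need individual upgrade consideration.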
Fri, Dec 1
Oct 31 2023
This was delivered as part of the "documentathon": https://wikitech.wikimedia.org/wiki/Data_Engineering/Systems/DataHub/Data_Catalog_Documentation_Guide
Oct 25 2023
Oct 4 2023
Sep 22 2023
Sep 19 2023
Sep 8 2023
Sep 6 2023
Sep 5 2023
@MGerlach does the pre-fetch traffic carry headers that identify it as such when it comes through as webrequests?
Sep 1 2023
Aug 24 2023
It looks like the request is also in PyHive with the following PR still open: https://github.com/dropbox/PyHive/pull/328
Bug closed because too old, and not fixed: https://github.com/apache/superset/issues/3243
Aug 23 2023
@JAllemandou is the limitation in data formatting coming from Presto or Superset (or both :) ?
@BTullis we'll need the SRE team's help with the deployment of the event platform schema ingestion into Datahub. The deployment involves a) creating the event streams custom platform and b) deploying the ingestion code/transformer.
Aug 18 2023
A failure of this job requires a manual rerun, and a recent assessment shows this happens with some frequency (on average once daily). Let's bring this into the current sprint and continue to troubleshoot.
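While the root cause is being investigated, automatic retries could reduce the manual reruns. A minimal sketch, assuming the job is an Airflow task; the specific values are placeholders to tune, not recommendations:

```python
from datetime import timedelta

# Hypothetical Airflow default_args: retry a few times with backoff
# before a human has to intervene. Values are illustrative.
default_args = {
    "retries": 3,
    "retry_delay": timedelta(minutes=10),
    "retry_exponential_backoff": True,
}
```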
Aug 17 2023
Approving group membership
Aug 16 2023
Here are some considerations that we discussed, that we need to further explore and decide on:
- Explore creating a custom platform for Event Streams
- Add top level event schema description as the dataset documentation. TBD on how to accomplish this given import options.
- The schema import automatically adds subgroups under kafka based on the first dot segment of the schema name. In the production instance of DataHub there are also streams with names like analytics/mediawiki/web_ab_test_enrollment. Can "/" be used as a separator to designate the top-level category?
- Can we import Gobblin lineage to propagate lineage from Kafka > Hive?
- There would be value in importing the Hive event_raw database to complete the lineage of events.
- Can we add a link to the event platform schema/datahub documentation to hive tables in event and event_sanitized? Lineage would be one way to trace this. Another would be to add links in the documentation to datasets with equivalent schema both upstream and downstream. This falls into the larger consideration on how to propagate metadata between equivalent datasets stored across different platforms and refinements.
- Some of the Kafka topics are remnants of tests and misconfigurations/misnamings. Ideally we'd delete these in Kafka; otherwise there is an option to add them to an exclusion list.
- Given that the prod DataHub has the event streams' current Kafka metadata, can we delete and reimport all the Kafka metadata? If a fresh backup is not available, it would be good to have one handy.
- Is there a way to add ownership data to event schema json and import it from there? This would benefit Metrics Platform work and allow alerting the right parties about event publishing errors. Some discussion about adding this data already happened https://phabricator.wikimedia.org/T201063#4546544
- What is the best way to ingest the metadata? Datahub transformer vs airflow vs TBD?
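On the "/" separator question above: whichever ingestion mechanism we pick, the grouping logic itself is simple. A minimal sketch of deriving container subgroups from a stream name, assuming "/" as the separator; this is illustrative only and not the actual DataHub ingestion code:

```python
def subgroups(stream: str, sep: str = "/") -> list[str]:
    """Derive container subgroups from a stream name.

    The default Kafka import groups by the first dot segment; this
    sketch instead splits on "/" so that a stream such as
    "analytics/mediawiki/web_ab_test_enrollment" nests under
    "analytics" > "mediawiki", with the final segment as the dataset.
    """
    parts = stream.split(sep)
    return parts[:-1]  # all but the last segment are containers

groups = subgroups("analytics/mediawiki/web_ab_test_enrollment")
```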
@tchin as discussed today, that sounds like a good approach. Before deploying to production, let's wipe out the kafka metadata given that the original POC was imported under the kafka platform. I'll add these to the acceptance criteria.
The work related to this has been done as part of standing up the DSE K8s cluster. I will go ahead and close the ticket.
Aug 14 2023
@BTullis These are good to be removed
Aug 10 2023
Done. Are there any recovery keys to be had in case I am not able to access my phone for whatever reason?
Aug 2 2023
Aug 1 2023
@Htriedman we are picking this work up again. Is the POC that you did available in a repository on gitlab?
Thank you @jbond!
Jul 28 2023
Jul 27 2023
This dataset is no longer subscribed to. We should remove the database from the download list.
Jul 26 2023
Jul 11 2023
@BTullis do the permissions need to be removed before closing the task?
Jul 7 2023
@Antoine does this still need to be implemented?
Jul 6 2023
So gratifying to be able to close this task!
Jun 28 2023
May 19 2023
@lbowmaker now that we are on the other side of the Oozie migration, can we prioritize this for the next sprint?
Apr 26 2023
Yes, confirming the above. I approve the request.
Apr 14 2023
It turns out that the sql_labs role was missing. Unfortunately the error message did not indicate a permissions issue.
@HShaikh Can you please post the SQL statement that you are running?
Apr 3 2023
Consider implementing together with https://phabricator.wikimedia.org/T329978
Consider implementing together with https://phabricator.wikimedia.org/T329310
Related conversation: https://phabricator.wikimedia.org/T332420
Mar 21 2023
Mar 16 2023
Mar 15 2023
Mar 9 2023
Mar 3 2023
Is this the same issue reported in https://phabricator.wikimedia.org/T328127?
Feb 17 2023
unique_devices_project_wide_daily and unique_devices_project_wide_monthly have no data and have been marked as deprecated. Ticket to delete: https://phabricator.wikimedia.org/T329978
Consider doing this at the same time as https://phabricator.wikimedia.org/T329978
Feb 14 2023
Currently only unique_devices_per_domain_monthly has the dataset description.
For unique devices, let's document all of:
Feb 10 2023
Feb 8 2023
- The [[ mediawiki api request entry | https://datahub.wikimedia.org/dataset/urn:li:dataset:(urn:li:dataPlatform:hive,event.mediawiki_api_request,PROD)/Schema?is_lineage_mode=false ]] is lacking information for the normalized host. Not sure why that's the case, given that the rest of the structs are filled in. The other aspects look good. Who would be the best person to fill it in? Who should be assigned as the data owner?
- Are there links to external documentation that can be added?
- The field is_wmf_domain describes how it is derived but not what it means. Is it a boolean that indicates whether the request came from a WMF domain vs. externally (e.g. a bot or a Toolforge tool)?
- Who should be the owner of this dataset?