Data Platform Engineering Bug Report or Data Problem Form.
Please fill out the following
What kind of problem are you reporting?
- Access related problem
- Service related problem
- Data related problem
For a data related problem:
- Is this a data quality issue? Yes.
- What datasets and/or dashboards are affected? event_sanitized.mediawiki_revision_tags_change
- What are the observed vs expected results? Please include information such as location of data, any initial assessments, SQL statements, screenshots.
Observed results: querying data about tagged edits in event_sanitized.mediawiki_revision_tags_change returns a performer struct with no information about the user who made the edit; its fields are NULL instead. Historical counts (see output below) show that some edits have always been affected, at roughly 7% of tagged edits per month through September 2023. The proportion rose to about 30% in October 2023 and has exceeded 99.99% since November 2023. This problem is current and ongoing.
Expected results: the performer struct correctly reflects the information about the user who made the given edit.
Historical counts:
WITH tagged_edits AS (
    SELECT
        rev_id,
        FIRST_VALUE(substr(rev_timestamp, 1, 7)) AS log_month,
        MAX(IF(performer.user_id IS NULL, 1, 0)) AS had_null_performer
    FROM event_sanitized.mediawiki_revision_tags_change AS mert
    WHERE year = 2023
    GROUP BY rev_id
)
SELECT
    log_month,
    count(1) AS num_tagged_edits,
    SUM(had_null_performer) AS num_tag_edits_with_null_performer,
    SUM(had_null_performer) / count(1) AS prop_null
FROM tagged_edits
GROUP BY log_month

Output table from Pandas with non-2023 edits removed and sorted by log_month:

log_month | num_tagged_edits | num_tag_edits_with_null_performer | prop_null |
---|---|---|---|
2023-01 | 33133594 | 2264876 | 0.068356 |
2023-02 | 27028430 | 1939196 | 0.071747 |
2023-03 | 29962527 | 2158956 | 0.072055 |
2023-04 | 29423416 | 2021438 | 0.068702 |
2023-05 | 27712184 | 1997860 | 0.072093 |
2023-06 | 28597216 | 1989085 | 0.069555 |
2023-07 | 27921145 | 2111404 | 0.075620 |
2023-08 | 30272521 | 1993681 | 0.065858 |
2023-09 | 26427299 | 1999313 | 0.075653 |
2023-10 | 28243140 | 8352344 | 0.295730 |
2023-11 | 29146770 | 29146625 | 0.999995 |
2023-12 | 5696883 | 5696840 | 0.999992 |
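The monthly counts reported above can be re-checked without access to Hive. The following is a minimal sketch, not part of the original analysis: it recomputes the null proportion from the reported counts and finds the first month in which that proportion crosses a threshold. The names `MONTHLY_COUNTS`, `null_proportions`, and `first_month_above`, as well as the 10% threshold, are illustrative assumptions.

```python
# Counts copied from the output table in this report:
# (log_month, num_tagged_edits, num_tag_edits_with_null_performer)
MONTHLY_COUNTS = [
    ("2023-01", 33133594, 2264876),
    ("2023-02", 27028430, 1939196),
    ("2023-03", 29962527, 2158956),
    ("2023-04", 29423416, 2021438),
    ("2023-05", 27712184, 1997860),
    ("2023-06", 28597216, 1989085),
    ("2023-07", 27921145, 2111404),
    ("2023-08", 30272521, 1993681),
    ("2023-09", 26427299, 1999313),
    ("2023-10", 28243140, 8352344),
    ("2023-11", 29146770, 29146625),
    ("2023-12", 5696883, 5696840),
]


def null_proportions(rows):
    """Return (month, proportion of tagged edits with a NULL performer) pairs."""
    return [(month, nulls / total) for month, total, nulls in rows]


def first_month_above(rows, threshold):
    """Return the first month whose NULL proportion exceeds threshold, or None."""
    for month, prop in null_proportions(rows):
        if prop > threshold:
            return month
    return None


if __name__ == "__main__":
    for month, prop in null_proportions(MONTHLY_COUNTS):
        print(f"{month}  {prop:.6f}")
    # Hypothetical threshold: the baseline before the jump stays below 10%.
    print("first month above 10%:", first_month_above(MONTHLY_COUNTS, 0.10))
```

Because the counts are monthly, this only narrows the start of the regression to a calendar month. Grouping the original query by day instead of month would be needed to date the change more precisely.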
For the DE Team to fill out
Which systems does this affect?
- Hive
- Druid
- Superset
- Turnilo
- WikiDumps
- Wikistats
- Airflow
- HDFS
- Gobblin
- Sqoop
- Dashiki
- DataHub
- Spark
- Jupyter
- Modern Event Platform
- Event Logging
- Other
Impact Assessment:
Does this problem qualify as an incident?
- Yes
- No
Does this violate an SLO?
- Yes
- No
Value Calculator | Rank |
---|---|
Will this improve the efficiency of a team's workflow? | 1-3 |
Does this have an effect on our Core Metrics? | 1-3 |
Does this align with our strategic goals? | 1-3 |
Is this a blocker for another team? | 1-3 |