
Create data-problem Phabricator tag
Closed, Resolved (Public)

Description

  • Name: analytics-data-problem
  • Description: Specific issues where an analytics dataset has incorrect, missing, or malformed data or shows an anomaly which might be caused by such data. Not for general work on data quality processes or monitoring.
  • Project type: tag
  • Example tasks: T300164, T361242, T353296, T341139, T349743, T321960

Background

The Wikimedia Foundation is making an effort to improve its data governance. One facet of this is creating clear processes for handling data problems. Having a specific tag for data problems will help in this goal.

Event Timeline

nshahquinn-wmf renamed this task from Create data-bug Phabricator tag to Create data-problem Phabricator tag. Wed, Apr 17, 11:48 PM
nshahquinn-wmf updated the task description.

I don't want this created yet. First, I'm looking for feedback on the idea (primarily from Foundation folks involved in the data governance work mentioned in the description, but of course it's open to everyone).

One question is how we should handle issues where the data shows spikes or anomalies which are likely related to external factors (e.g. bot users) rather than a clear software bug on our end, like T355143 or T313114. My inclination is that such issues should also be included in this tag (rather than, for example, having a separate tag), but I'm interested in what others think.

Ty @nshahquinn-wmf for opening this task. Sharing my initial thoughts below:

  • Agree that we should tag all the tasks that report data anomalies with this tag, for now. Once we use this method and find that the proportion of bot issues surpasses software bugs, or vice versa, we can think about splitting it out.
  • I would like us to use this tag for data issues reported by the Community as well. Not sure if you already intended that, but in the examples I saw tasks opened by WMF staff, so I wanted to make it explicit here.
  • I'm wondering whether, instead of one data-problem tag, we should have functional area tags like data-problem-traffic, data-problem-readership, data-problem-contributors, data-problem-content to keep track of the areas where we have data problems. But this would mean not having tags for tables that fall outside of functional areas (if we have any). And the quintessential problem of too many tags!

> I would like us to use this tag for data issues reported by the Community as well. Not sure if you already intended that, but in the examples I saw tasks opened by WMF staff, so I wanted to make it explicit here.

Yes, I definitely agree.

> I'm wondering whether, instead of one data-problem tag, we should have functional area tags like data-problem-traffic, data-problem-readership, data-problem-contributors, data-problem-content to keep track of the areas where we have data problems. But this would mean not having tags for tables that fall outside of functional areas (if we have any). And the quintessential problem of too many tags!

I think classifying the issues by data domain will definitely be useful. As a start, we can use workboard columns to do this, unless we decide it would be more useful to use columns for status (e.g. backlog, investigation, fixing...).

+1 to this proposal. Having the ability to separately categorise and manage these issues will help tremendously both with triaging and tracking issues for Data Platform Engineering.

Okay, sounds like we have broad support, so let's go ahead and create the tag!

Are you sure you need multiple projects? A tag (yellow) project is a secondary tag: the task will already have other projects attached, and you can just search based on both project tags.

I'd like to see a more specific name (maybe an Analytics- prefix or something). Nearly everything is some kind of "data problem": for example, the HTTP 503 error when I open an image on Commons, my buggy on-wiki user script code, that image being rendered hiding some article text on a Wikipedia page... Thanks.

> Are you sure you need multiple projects? A tag (yellow) project is a secondary tag: the task will already have other projects attached, and you can just search based on both project tags.

At this point, we're not actually proposing multiple projects. But I see your point that, if we decided we did need them, we could replicate that by creating separate tags for "contribution-data", "reading-data", and "content-data" and filtering the "data-problem" workboard by those tags.
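
For anyone who wants to script the "search based on both project tags" idea, here is a minimal sketch of the request parameters. It assumes Phabricator's Conduit method `maniphest.search` and that its `projects` constraint accepts project hashtags and matches tasks carrying all listed tags; I have not verified either against this install. The token is a placeholder, and "reading-data" is a hypothetical tag that has not been created.

```python
from urllib.parse import urlencode

# Assumed endpoint: Conduit's maniphest.search on Wikimedia Phabricator.
API_URL = "https://phabricator.wikimedia.org/api/maniphest.search"


def build_search_params(api_token, project_tags):
    """Build form fields for a search over tasks tagged with the given projects."""
    params = {"api.token": api_token}
    for i, tag in enumerate(project_tags):
        # Conduit's form encoding flattens lists into indexed keys.
        params[f"constraints[projects][{i}]"] = tag
    return params


params = build_search_params(
    "api-XXXX",  # placeholder, not a real token
    ["analytics-data-problem", "reading-data"],  # second tag is hypothetical
)
body = urlencode(params)  # POST this body to API_URL to run the search
```

The snippet only builds the request body and sends nothing, so it is safe to run as-is. The same filter is available without any code through the workboard or the advanced task search.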

> I'd like to see a more specific name (maybe an Analytics- prefix or something). Nearly everything is some kind of "data problem": for example, the HTTP 503 error when I open an image on Commons, my buggy on-wiki user script code, that image being rendered hiding some article text on a Wikipedia page... Thanks.

I can see that. I think analytics-data-problem is a good choice ("analytics" being intentionally lowercase to clarify that it's a functional area, not a team). Along those lines, we could rename Analytics-Canonical-Data to analytics-canonical-data.

> I think classifying the issues by data domain will definitely be useful. As a start, we can use workboard columns to do this, unless we decide it would be more useful to use columns for status (e.g. backlog, investigation, fixing...).

Yes! Let's start with that.
@nshahquinn-wmf, do you have permission to create the yellow project tags? Or can we request to have them created in this task?

> @nshahquinn-wmf, do you have permission to create the yellow project tags? Or can we request to have them created in this task?

I don't have project creation permissions. What do you mean by the "yellow project tags"? I was thinking we would just start with the single data problem tag for now, and we could create the tags for the data domains later. Are you thinking we should create those now?

Aklapper claimed this task.
Aklapper mentioned this in Analytics-Data-Problem.

> can we request to have that created in this task?

This task is tagged with Project-Admins so that's what this task is about :)

Requested public project Analytics-Data-Problem has been created: https://phabricator.wikimedia.org/project/view/7149/

(In case you need to edit the project or project workboard itself at some point and lack permissions, please see Trusted-Contributors.)

Interested people are welcome to join the project as members, and to watch the project in order to receive notifications on task updates.

If tasks created under this new project are about a specific codebase, please make sure to also add the corresponding codebase project tags to those tasks.

Recommended practices for project and workboard management in Phabricator are available.

Feel free to bring up any questions you might have about Phabricator or about best ways to manage projects in Phabricator.

Enjoy!