
Check two variables in knowledge gaps dataset files (article_count_value and article_count_total)
Closed, Resolved (Public)

Description

For the knowledge gaps datasets in the public repository, there might be an issue with the article_count_value variable and the article_count_total variable.

The values in the article_count_value column appear to show the current article count for each wiki+category; the value is duplicated for every time bucket, rather than having time-bucket-aligned values.

Screenshot1_showing_article_count_value.png (866×1 px, 171 KB)

The values in the article_count_total column appear to show the current article count for each wiki (across categories); the value is duplicated for every time bucket, rather than having time-bucket-aligned totals.

Screenshot2_showing_article_count_total.png (896×1 px, 194 KB)

Event Timeline

The article count is incorrect, so I removed it in the interim. I also added some pointers below on how to compute the number of articles created so far in a more standard way.

The article count was at first a count of all pages "observed" in a given month, e.g. an article that received views or edits. Note that this is not the same as "pages that existed in a given month". However, the quality score for an article that isn't "observed" in a given month needed to be forward-filled with the quality from the previous month. When we added this logic, the "article count" became meaningless: we now observe every page in every month, so the number is the same for all months.
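
To make the effect concrete, here is a minimal sketch in plain Python. It is an illustration only, not the pipeline's actual code: the page names, months, quality scores, and helper functions are all hypothetical, and it assumes every page was observed in the first month. It shows how counting "observed" pages varies by month before forward filling and collapses to a constant afterwards.

```python
# Hypothetical toy data: quality scores for the months in which a page
# was "observed" (received views or edits). Not real pipeline data.
months = ["2023-01", "2023-02", "2023-03"]
observed_quality = {
    "page_a": {"2023-01": 0.4, "2023-03": 0.6},
    "page_b": {"2023-01": 0.2},
    "page_c": {"2023-01": 0.5, "2023-02": 0.5, "2023-03": 0.7},
}


def count_observed(quality_by_page, months):
    """Count the pages that have a row (a quality score) in each month."""
    return {
        month: sum(1 for scores in quality_by_page.values() if month in scores)
        for month in months
    }


def forward_fill(quality_by_page, months):
    """Carry each page's last known quality score into later months."""
    filled = {}
    for page, scores in quality_by_page.items():
        filled[page] = {}
        last_score = None
        for month in months:
            if month in scores:
                last_score = scores[month]
            if last_score is not None:
                filled[page][month] = last_score
    return filled


before = count_observed(observed_quality, months)
after = count_observed(forward_fill(observed_quality, months), months)

print(before)  # the count differs from month to month
print(after)   # every page now has a row in every month
```

In this toy data the count drops in months where pages received no views or edits, but after forward filling every page has a row in every month, so the count no longer carries any time information.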

Also, the name article_count makes more sense for a metric meaning "how many pages have been created so far", which can be implemented as a cumulative sum of the article created metric, as in Miriam's notebook here.

I also added a PySpark example to calculate a cumulative article_created column, using a Window (notebook), and created T344851 to track adding cumulative metrics to the knowledge gaps pipeline itself.
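
For readers without access to the notebooks, the cumulative-sum idea can be sketched in plain Python. This is not the notebook code: the row layout (wiki, category, time bucket, articles created) and the function name are assumptions made for illustration, and the numbers are invented. A Spark Window partitioned by wiki and category and ordered by time bucket would express the same running sum on the real data.

```python
from collections import defaultdict
from itertools import accumulate

# Hypothetical rows: (wiki, category, time_bucket, articles_created).
# The layout and the values are illustrative assumptions, not real data.
rows = [
    ("enwiki", "gender", "2023-01", 5),
    ("enwiki", "gender", "2023-02", 3),
    ("enwiki", "gender", "2023-03", 0),
    ("dewiki", "gender", "2023-01", 2),
    ("dewiki", "gender", "2023-03", 4),
]


def cumulative_article_count(rows):
    """Return (wiki, category, time_bucket, running_total) tuples.

    The running total is the sum of articles created up to and including
    each time bucket, computed separately for every wiki+category pair.
    """
    groups = defaultdict(list)
    for wiki, category, bucket, created in rows:
        groups[(wiki, category)].append((bucket, created))

    result = []
    for (wiki, category), series in groups.items():
        series.sort()  # ISO-style "YYYY-MM" buckets sort chronologically
        running_totals = accumulate(created for _, created in series)
        for (bucket, _), total in zip(series, running_totals):
            result.append((wiki, category, bucket, total))
    return sorted(result)


cumulative = cumulative_article_count(rows)
for row in cumulative:
    print(row)
```

Unlike the duplicated values described in the task, each time bucket here gets its own total, and a bucket with no new articles simply repeats the previous total.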

Thank you for reporting this, and hopefully this explains your observations.