Check two variables in knowledge gaps dataset files (article_count_value and article_count_total)
Closed, ResolvedPublic

Assigned To

Authored By

	CMyrick-WMF
	Aug 2 2023, 8:59 PM

Description

For the knowledge gaps datasets in the public repository, there might be an issue with the article_count_value and article_count_total variables.

The values in the article_count_value column appear to show the current article count for each wiki+category; the value is duplicated for every time bucket, rather than having time-bucket-aligned values.

Screenshot1_showing_article_count_value.png (866×1 px, 171 KB)

The values in the article_count_total column appear to show the current article count for each wiki (across categories); the value is duplicated for every time bucket, rather than having time-bucket-aligned totals.

Screenshot2_showing_article_count_total.png (896×1 px, 194 KB)

Related Objects

		Status	Subtype	Assigned	Task
		Resolved		fkaelin	T343067 Content gap metrics data issue
		Resolved		fkaelin	T343383 Check two variables in knowledge gaps dataset files (article_count_value and article_count_total)

Event Timeline

CMyrick-WMF created this task. Aug 2 2023, 8:59 PM

CMyrick-WMF updated the task description.

fkaelin added a parent task: T343067: Content gap metrics data issue. Aug 8 2023, 3:33 PM

fkaelin moved this task from Backlog to In Progress on the Research board. Aug 23 2023, 4:30 PM

The article count is incorrect, so I removed it in the interim. I also added some pointers below on how to compute an article count from the article created metric in a more standard way.

The article count was at first a count of all pages "observed" in a given month, e.g. an article that received views or edits. Note that this is not the same as "pages that existed in a given month". However, the quality score for an article that isn't "observed" in a given month needed to be forward filled with the quality from the previous month. When we added this logic, the "article count" became meaningless: we now observe every page for every month, and the number is thus the same for all months.
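To make the effect concrete, here is a minimal plain-Python sketch, not the pipeline's actual code. The page IDs, months, and quality scores are made up, and it assumes the fill step emits one row per page per month, as described above (with no score before a page's first observation).

```python
from collections import defaultdict

# Hypothetical observations: (month, page_id, quality).
# A page only appears in months where it received views or edits.
observed = [
    ("2023-01", "A", 0.4),
    ("2023-02", "A", 0.5),
    ("2023-02", "B", 0.2),
    ("2023-03", "C", 0.7),
]
months = ["2023-01", "2023-02", "2023-03"]


def count_per_month(rows):
    """Count distinct pages that have a row in each month."""
    pages = defaultdict(set)
    for month, page_id, _ in rows:
        pages[month].add(page_id)
    return {m: len(pages[m]) for m in months}


def fill_all_months(rows):
    """Emit one row per page per month, carrying the last known quality forward.

    Assumption: every page gets a row in every month; quality stays None
    until the page is first observed.
    """
    quality = {(m, p): q for m, p, q in rows}
    page_ids = sorted({p for _, p, _ in rows})
    filled = []
    for page_id in page_ids:
        last = None
        for month in months:
            last = quality.get((month, page_id), last)
            filled.append((month, page_id, last))
    return filled


before = count_per_month(observed)
after = count_per_month(fill_all_months(observed))
print(before)  # counts vary by month
print(after)   # same count in every month
```

Before the fill, the count tracks activity per month; after it, every month reports the total number of pages, which matches the duplicated values reported in the task description.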

Also, the name article_count makes more sense as "how many pages have been created so far", which can be implemented as a cumulative sum of the article created metric, as in Miriam's notebook here.

I also added a pyspark example to calculate a cumulative article_created column, using a Window (notebook), and created T344851 to track adding cumulative metrics to the knowledge gaps pipeline itself.
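For readers without access to the notebooks, the cumulative-sum idea can be sketched in plain Python. This is not the notebook's PySpark code: the row layout, wiki, category names, and values below are hypothetical, and the grouping by wiki+category with ordering by time bucket only mirrors what a partitioned, ordered window would do.

```python
from collections import defaultdict
from itertools import accumulate

# Hypothetical rows: (wiki, category, time_bucket, article_created)
rows = [
    ("enwiki", "category_a", "2023-02", 5),
    ("enwiki", "category_a", "2023-01", 10),
    ("enwiki", "category_a", "2023-03", 7),
    ("enwiki", "category_b", "2023-01", 20),
    ("enwiki", "category_b", "2023-02", 0),
]


def cumulative_article_count(rows):
    """Add a running total of article_created per (wiki, category)."""
    groups = defaultdict(list)
    for wiki, category, bucket, created in rows:
        groups[(wiki, category)].append((bucket, created))

    result = []
    for (wiki, category), series in sorted(groups.items()):
        # ISO-style "YYYY-MM" buckets sort chronologically as strings.
        series.sort()
        totals = accumulate(created for _, created in series)
        for (bucket, created), total in zip(series, totals):
            result.append((wiki, category, bucket, created, total))
    return result


cumulative = cumulative_article_count(rows)
for row in cumulative:
    print(row)
```

Unlike the removed article_count columns, the last field here grows with each time bucket, so it answers "how many pages have been created so far" for each wiki+category.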

Thank you for reporting this, and hopefully this explains your observations.

fkaelin closed this task as Resolved. Sep 5 2023, 1:42 PM

	F37160910: Screenshot2_showing_article_count_total.png
	Aug 2 2023, 8:59 PM

	F37160909: Screenshot1_showing_article_count_value.png
	Aug 2 2023, 8:59 PM
