Gender gap metrics data issue
Closed, Resolved | Public | BUG REPORT

Description

Steps to replicate the issue (include links if applicable):

What happens?:
When we calculate the aggregate using the latest (7/29) snapshot of the data, the totals are more than ten times lower than in the April snapshot.

April:
   gender      article_created_value
0  female      2,357,289
1  male        9,812,451
2  non-binary  12,487

June:
   gender      article_created_value
0  female      199,937
1  male        562,693
2  non-binary  1,154

What should have happened instead?:
The numbers should have increased slightly or stayed the same.

Other information (browser name/version, screenshots, etc.):
There seems to be an issue with the latest 7/29 snapshot of the gender gap data. I'm not 100% sure whether this is due to the root causes mentioned in T343067, but it needs to be investigated and fixed.
Note: for now I am adding this as a subtask of T343067 so that content gap data issues can be tracked together. We can separate them if the root cause turns out to be different.

Event Timeline

Looking at the gender content gap data in the public folder that was created on July 29th, I don't see the drop you did.

Running this snippet, based on Miriam's notebook:

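import pandas as pd

# Read the published gender content gap csv directly from its URL.
gendata = pd.read_csv("https://analytics.wikimedia.org/published/datasets/knowledge_gaps/content_gaps/knowledge_gap_index_metrics_csv/content_gap=gender/part-00000-4001ddac-9eed-47bb-b0ae-251784f815ca.c000.csv")
# Keep only wiki_db values ending in "wiki" (ad hoc filter for canonical wikis).
wikis = [w for w in gendata["wiki_db"].unique() if w.endswith("wiki")]
# .copy() avoids pandas' SettingWithCopyWarning when the new column is added below.
gendata = gendata[gendata.wiki_db.isin(wikis)].copy()
# Collapse every category other than male/female into "non-binary".
gendata["gender3category"] = [i if i in ["male", "female"] else "non-binary" for i in gendata["category"]]
# Sum article_created_value per collapsed category.
gen3snapshot = gendata.groupby(["gender3category"])["article_created_value"].sum().reset_index()
gen3snapshot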

   gender3category  article_created_value
0  female           2386030
1  male             9898221
2  non-binary       13474

This is consistent with the April data you pasted. Were you using the same file as in the snippet above?

I tried a few things -

  1. I created a new notebook today to calculate the values from the gender content gap csv, and I'm now getting values similar to yours, though slightly different.

   gender3category  article_created_value
0  female           2381837
1  male             9873151
2  non-binary       13463

  2. I re-ran the previous notebook where I got the erroneous numbers, and I'm still getting the same wrong numbers.
  3. The last thing I tried was passing the URL directly, gendata=pd.read_csv("https://analytics.wikimedia.org/published/datasets/knowledge_gaps/content_gaps/knowledge_gap_index_metrics_csv/content_gap=gender/part-00000-4001ddac-9eed-47bb-b0ae-251784f815ca.c000.csv"), instead of downloading the csv, uploading it to the Jupyter notebook, and reading it with gendata=pd.read_csv(indir+'gender-june.csv'). This also yielded the correct numbers!

In conclusion, I think it is best to pass the csv URL directly to read_csv, to avoid errors that manual handling of the csv file could introduce. Another major takeaway is that we need a way to QA the numbers we get, beyond checking whether they are in line with previous months.
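One possible form of such a QA step, independent of previous months, is a set of structural checks run on the loaded data before any aggregation. The sketch below is only an illustration: the function name validate_gap_frame and the particular checks are assumptions, and the column names are the ones used in the snippet earlier in this thread.

```python
import pandas as pd

# Columns the aggregation snippet in this thread relies on.
REQUIRED_COLUMNS = {"wiki_db", "category", "article_created_value"}

def validate_gap_frame(df):
    """Return a list of problems found in a content gap DataFrame (empty if none).

    Hypothetical helper: the checks are illustrative, not an agreed QA standard.
    """
    problems = []
    missing = REQUIRED_COLUMNS - set(df.columns)
    if missing:
        problems.append(f"missing columns: {sorted(missing)}")
        return problems  # the remaining checks need these columns
    if df.empty:
        problems.append("no rows")
    # Coerce to numbers so that stray text or blanks show up as NaN.
    values = pd.to_numeric(df["article_created_value"], errors="coerce")
    if values.isna().any():
        problems.append("non-numeric or missing article_created_value")
    if (values < 0).any():
        problems.append("negative article_created_value")
    return problems

# Toy frame with made-up values, only to show the call.
toy = pd.DataFrame({
    "wiki_db": ["enwiki", "dewiki"],
    "category": ["female", "male"],
    "article_created_value": [10, None],
})
print(validate_gap_frame(toy))
```

Checks like these would not necessarily catch every bad local copy (a truncated file can still be structurally valid), so they complement rather than replace a comparison against the published file.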

Lastly, I'm still curious why our numbers differ even though we are both using the same csv file. Any idea?

I am not sure what the problem is with reading the local versus the remote file; it seems the local file is somehow incorrect.
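One way to test whether the local copy really differs from the published file is to compare checksums and line counts of the two. This is a minimal sketch using only the standard library; the helper names sha256_of and describe are hypothetical, and it assumes both files are available on disk.

```python
import hashlib

def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def describe(path):
    """Return (sha256, line count) for a file, for side-by-side comparison."""
    with open(path, "rb") as handle:
        line_count = sum(1 for _ in handle)
    return sha256_of(path), line_count
```

For example, the published csv could be fetched to a temporary path with urllib.request.urlretrieve and then compared via describe(remote_path) == describe(indir + 'gender-june.csv'). A much smaller line count in the local file would point to a truncated download or upload.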

Regarding the difference in numbers: the numbers for a past month can change from one snapshot to another (since the underlying snapshots "change the past"). When using the same file, the difference is likely in how canonical wikis are filtered, since this is done ad hoc in the notebook (see T344845 about moving this filter step into the pipeline).
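To illustrate how an ad hoc wiki filter can shift the totals even when the input file is identical, here is a small sketch on made-up rows. The wiki_db values, the counts, and the second filter are all hypothetical; only the endswith("wiki") test comes from the snippet earlier in this thread.

```python
import pandas as pd

# Hypothetical rows; the wiki_db values and counts are made up for illustration.
toy = pd.DataFrame({
    "wiki_db": ["enwiki", "dewiki", "enwiktionary", "wikidatawiki"],
    "category": ["female", "female", "female", "female"],
    "article_created_value": [100, 40, 7, 3],
})

def total_for(df, keep):
    """Sum article_created_value over the rows whose wiki_db passes keep()."""
    mask = df["wiki_db"].map(keep)
    return int(df.loc[mask, "article_created_value"].sum())

# Filter A: the suffix test used in the snippet earlier in this thread.
total_a = total_for(toy, lambda w: w.endswith("wiki"))
# Filter B: a stricter, hypothetical variant that also excludes one project.
total_b = total_for(toy, lambda w: w.endswith("wiki") and w != "wikidatawiki")

print(total_a, total_b)  # 143 140
```

Two notebooks that differ only in this filter step would report slightly different totals from the same csv, which matches the kind of small discrepancy seen above.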

With T343368, this will hopefully become less confusing, e.g. see the csv folder.

Closing this as resolved.