
Review Active Editor skin stats
Closed, Resolved, Public


Request to review the analysis for Active editor skin stats before publishing it to the web team.

Event Timeline

nshahquinn-wmf lowered the priority of this task from High to Medium. Mar 25 2020, 1:24 PM

I'm looking forward to checking out your work! This seems like a normal priority task, but please let me know if you disagree.

Thanks @nshahquinn-wmf! I was hoping to get this reviewed by the end of this week (on the assumption that I may have more work to do on it after your feedback).
But if that is not possible, then let's keep it at medium and look at this next week.

Hey, Maya! Thanks for inviting me to take a look at your work! I've separated my comments between (1) those that affect the results and (2) those that just talk about how to most efficiently and stylishly get the same results.

Correctness and results

  • As you noticed yourself, you need to ensure that users who don't have any skin preference are treated as having Vector.
  • Users need to be deduplicated across wikis; otherwise, some users will be counted hundreds of times.
  • It's best to give the most recent data possible, since that's most relevant to stakeholders' decisions. This analysis uses data from Jan–Dec 2019, but data from Mar 2019–Feb 2020 was available. I don't think the calendar year has any particular significance here.
  • As the task description mentions, anything that does not match a valid skin needs to be replaced with Vector. Some values for defunct skins (chick and nostalgia as well as some others) were not replaced.
  • Because people will want to compare this data with the previous results (T147696), it's best to make it as comparable as possible (unless you want to conform to a more widely used standard, in which case you should explain the limited comparability when you deliver the results). There are several ways that this data is not comparable:
    • the previous analysis defined active editors as users who had made at least 30 edits of any kind during the year; this one defines it as users who made at least 5 content edits during at least one month.
    • the previous analysis defined very active editors as users who had made at least 600 edits of any kind during the year; this one defines it as users who made at least 600 content edits in at least one month.
    • the previous code treated each Wikidata edit as 1/10th of a real edit.
    • the previous code dealt with the duplication of users across wikis by picking the wiki where each user had the most edits and treating the preference there as their global preference.
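
To make the comparison concrete, here is a rough sketch of how the previous approach could be reproduced. It is not the earlier code itself: the row layout, the `VALID_SKINS` set, the function names, and the choice to pick the home wiki by raw (unweighted) edit count are all assumptions for illustration, and it assumes users are identified by a global user name rather than a per-wiki user ID.

```python
from collections import defaultdict

# Assumption: an illustrative set of valid skins; take the real list from the task description.
VALID_SKINS = {"vector", "monobook", "timeless", "minerva", "modern", "cologneblue"}

# The previous code treated each Wikidata edit as 1/10th of a real edit.
WIKIDATA_WEIGHT = 0.1


def normalize_skin(skin):
    """Treat a missing, empty, or unrecognized skin preference as Vector."""
    return skin if skin in VALID_SKINS else "vector"


def global_skin_by_user(rows, min_edits=30):
    """Return {user_name: skin} with one entry per user across all wikis.

    rows: iterable of (user_name, wiki, edits, skin) tuples, one per user
    per wiki, where edits is the yearly edit count on that wiki
    (hypothetical layout).
    min_edits: yearly threshold on weighted edits (30 for active editors,
    600 for very active editors in the previous analysis).
    """
    weighted_total = defaultdict(float)
    home = {}  # user_name -> (raw edits, skin) on the wiki with the most edits

    for user, wiki, edits, skin in rows:
        weight = WIKIDATA_WEIGHT if wiki == "wikidatawiki" else 1.0
        weighted_total[user] += edits * weight
        # Assumption: the home wiki is chosen by raw edit count.
        if user not in home or edits > home[user][0]:
            home[user] = (edits, normalize_skin(skin))

    return {
        user: home[user][1]
        for user, total in weighted_total.items()
        if total >= min_edits
    }


rows = [
    ("A", "enwiki", 40, None),          # no preference set
    ("A", "dewiki", 5, "monobook"),
    ("B", "wikidatawiki", 200, "timeless"),
    ("B", "frwiki", 5, "nostalgia"),
    ("D", "eswiki", 50, "monobook"),
    ("D", "enwiki", 10, ""),
]
print(global_skin_by_user(rows))
```

In this sample, user B falls below the threshold because 200 Wikidata edits count as only 20, and user A is counted once, with the (defaulted) Vector preference from the wiki where they edit most.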

Code style

  • When you scaled up your data collection process from a quarter to a year, you should have replaced the quarterly code with the yearly code. Duplicating so much code makes the notebook harder to read and review.
  • If something is common to all your variable names and thus provides no useful information, it can be eliminated to make the names easier to write and read. With the quarterly code eliminated, _2019 can be removed.
  • You should follow our SQL style guide in your queries!
  • The multiple calls to np.where can be replaced with a single call to DataFrame.replace:
skin_aliases = {
  "": "vector",
  2: "cologneblue",
  "myskin": "vector",
}

skin_2019 = skin_2019.replace({"skin": skin_aliases})
  • Nesting many function calls within each other makes code difficult to understand. For example, take this block:
for wiki in active_ed_2019['wiki'].unique():
        users = ','.join([str(u) for u in active_ed_2019.loc[
            active_ed_2019['wiki'] == wiki]['user_id']])
    ), wiki))

You could write it more clearly as:

wikis = active_ed_2019['wiki'].unique()
up_skin_2019 = list()

for wiki in wikis:
    user_ids = active_ed_2019[active_ed_2019['wiki'] == wiki]["user_id"]
    user_list = ','.join([str(u) for u in user_ids])
    prefs =