
Update and fix wiki segmentation dataset
Closed, ResolvedPublic

Description

Since we released the wiki segmentation dataset in July 2018, it has been used by a wide variety of people for exploration and project targeting.

We should make sure the dataset is in good shape for continued use by doing the following:

  • Update with current data (at the moment, the data is from June 2018)
  • Make some high-impact, low-effort improvements (add pageviews, fix the count of active administrators, fix some errors in wiki names)
  • Extract the code that generates the dataset of project names, families, and languages into a reusable form, so it can be used in other data projects (T184576)
  • Change the spreadsheet to pull data from a CSV file to eliminate copy-paste errors and ease updating

Event Timeline

From @Iflorez :

I've updated all of the queries in the wiki segmentation notebook and submitted a new pull request, with the updates, from my fork to the base GitHub repository which is managed by Neil and Mikhail.

I recommend a query review of those queries whose outputs differ to a large degree, and at least one from the small degree list, and a follow-up discussion. I noted this in the pull request that I just submitted.

The highest review priority is for the majority mobile editors query, mme [majority mobile editors], and the related mobile edits query. The mme query is evidently in error and I will be happy to learn how to fix and optimize this query.

Queries whose output matches last year's collected data:

  • overall SIZE rank
  • monthly unique devices
  • mobile unique devices
  • mobile web pageviews
  • mobile app pageviews
  • second month editor retention

Queries whose output differs to a small degree:

  • unique devices per editor
  • monthly editors
  • monthly active editors
  • monthly new active editors
  • bot edits

Queries whose output differs to a large degree:

  • monthly nonbot edits
  • mobile edits
  • majority mobile editors*

Follow-up discussion: Some of the outputs are not matching the outputs that were gathered last year. It appears that this difference may be due in part to the shift away from querying the MariaDB editor month and editor staging tables, and the differences in syntax therein. And some of the differences may be due to how the replicas pipeline logged bot users vs. how this is done in Hive, per ticket 218819. If the queries are correct, these output differences warrant a follow-up conversation. For example, are the outputs different enough that it is necessary to output new 2017-2018 data in addition to 2018-2019 data?

Irene downloaded the data here - Google doc - for review

(When you can, could you clarify what "output match last year's collected data" means? I assume it still allows for small variations, but then I'm not sure what "output differ to a small degree" would cover.)

Reviewed with the following changes:

  • update canonical wikis table for missing wikis
  • update queries:
    1. Monthly non-bot edits
    2. Bot Editing Proportion
    3. Anonymous Editing Proportion
    4. Mobile Editing Proportion
    5. Majority-mobile editors proportion
    6. Visual edits

@Neil_P._Quinn_WMF what do you mean by

Change the spreadsheet to pull data from a CSV file to eliminate copy-paste errors and ease updating

I'm curious about what you were considering as far as improving the way we move data to the spreadsheet.
I was thinking to download a CSV file from the notebook with all of the updated data and upload that to the master spreadsheet. But maybe you have better ideas? If a simple upload (and formatting) will do, I can do that by Monday. I just sent over a request for permission to edit the master spreadsheet.

Also, given small differences in data output from the updated queries, I recommend pulling 2019 data into a new 2019 sheets tab and also pulling 2018 data into a new 2018_updated tab.

quick update: I see on the related T199266 ticket that @Neil_P._Quinn_WMF mentioned the following method for data importing:

importing it all (possibly using the [datasets publication feature](https://wikitech.wikimedia.org/wiki/Analytics/Ad_hoc_datasets) and Google Spreadsheets' [IMPORTDATA function](https://support.google.com/docs/answer/3093335?hl=en)).

the next steps are:

  • discuss the remaining issue with Tosk from T199266 and how to update the Canonical table to update Tosk and ALS data. Discuss discrepancy with mobile edits query outputs. Also, review
wikis_list_not_working = ['alswiktionary', 'alswikibooks', 'alswikiquote', 'mowiki', 'mowiktionary']
  • run all of the queries for 2019 and create a 2019 csv (or formatted xls) file
  • run all of the queries for 2018 and create an updated 2018 csv (or formatted xls) file
  • move the two csv files to /srv/published-datasets/wiki-segmentation
  • update the wiki-segmentation sheets document using the import data function (with the url location of the two csv files)
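The publish-and-import steps above could be sketched roughly like this (a hedged sketch: the output directory comes from the list above, but the column names and the public URL pattern are assumptions to verify against the Ad_hoc_datasets page):

```python
import pandas as pd

# Sketch of the publish step, assuming the notebook ends with one DataFrame per
# snapshot. The directory is from the task; the public URL in the comment below
# is an assumption to check.
OUT_DIR = "/srv/published-datasets/wiki-segmentation"

def publish_snapshot(wikis: pd.DataFrame, name: str, out_dir: str = OUT_DIR) -> str:
    """Write one snapshot (e.g. name='2018') as a CSV; returns the path written."""
    path = f"{out_dir}/{name}.csv"
    wikis.to_csv(path, index=False)
    return path

# Each Sheets tab can then pull its CSV directly instead of copy-paste, e.g.:
# =IMPORTDATA("https://analytics.wikimedia.org/datasets/wiki-segmentation/2018.csv")
```

This removes the manual copy-paste step entirely: updating a tab becomes a matter of overwriting the CSV and letting IMPORTDATA re-fetch it.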

@Iflorez, @cchen let me know if you need anything else from me.

I like the idea of producing snapshots for every calendar year from now on; since we tend to do our annual planning in the Feb-May timeframe, that means we'd have recent data available for those planning cycles.

However, we'll probably have to think about how we'll handle changing definitions. For example, if we redefine active editors in a couple years, will we put two active editors entries in the definition tab, with some method of explaining which applies to which tab?

Hi Neil,

@Iflorez, @cchen let me know if you need anything else from me.

Hoorah! Looks like it's all set.
I'll run these and output the data over the weekend.

I like the idea of producing snapshots for every calendar year from now on; since we tend to do our annual planning in the Feb-May timeframe, that means we'd have recent data available for those planning cycles.

By calendar year, do you mean January - December?

However, we'll probably have to think about how we'll handle changing definitions. For example, if we redefine active editors in a couple years, will we put two active editors entries in the definition tab, with some method of explaining which applies to which tab?

That sounds reasonable.

I like the idea of producing snapshots for every calendar year from now on; since we tend to do our annual planning in the Feb-May timeframe, that means we'd have recent data available for those planning cycles.

By calendar year, do you mean January - December?

Yeah, that's what I meant. In this case, maybe we can just label tabs with the time period covered, so the current tab becomes "Jul 2017–Jun 2018", your updated 2018 becomes just "2018", and the new one becomes (I think) "Nov 2018–Oct 2019". Then, in January, we could produce a new "2019" snapshot and then aim to do it yearly from then on.

However, we'll probably have to think about how we'll handle changing definitions. For example, if we redefine active editors in a couple years, will we put two active editors entries in the definition tab, with some method of explaining which applies to which tab?

That sounds reasonable.

Hmm, yeah, the more I think about it, the easiest way to do it would probably be to start versioning whenever we make a substantial change. So we'd just call this hypothetical revised active editors metric "active editors v2", which would allow us to provide both for some transitional period without unnecessary confusion.

Hi @Neil_P._Quinn_WMF, for clarity, can you rename the current "June 2018" tab in the Wiki Segmentation dataset? That sheet is locked for me.
Rename it to "0717–0618_OLD" or "0717–0618_BROKEN"?
Also, I think it would be wise to change the color of the mobile edits column [column P] on this sheet, to reflect the same rosy color used on the administrators tab. This is because the measure definition (the bot identification and filtering that goes into the query) has been updated.

I added a 2017 tab but am having some errors when running the same queries for 2018. The cchen hive table query results are coming up empty.
@cchen, it appears that the editing movement cchen tables are missing data for 2018.

Hi @Neil_P._Quinn_WMF, for clarity, can you rename the current "June 2018" tab in the Wiki Segmentation dataset? That sheet is locked for me.
Rename it to "0717–0618_OLD" or "0717–0618_BROKEN"?
Also, I think it would be wise to change the color of the mobile edits column [column P] on this sheet, to reflect the same rosy color used on the administrators tab. This is because the measure definition (the bot identification and filtering that goes into the query) has been updated.

Thank you for bringing this up! This got me thinking a lot about the information architecture of this spreadsheet, so I've made a bunch of changes:

  • I changed the sharing settings so that Product Analytics (including you) can edit and anyone with the link can comment. Previously, anyone at the WMF could edit and anyone outside could comment, but the tab with the data was locked so that only Product Analytics could edit. This was pretty complicated, and was threatening to get even more complicated with the need to set up separate locking policies for each data tab.
  • I removed the "manually tagged" tab since we never used it and moved the "community contacts" tab to another spreadsheet.
  • In the Jun 2018 tab, I deleted the rows from 662 down (which corresponded to the very least active and therefore very least important wikis) and the monthly active administrators columns so that there's no broken data that needs to be highlighted or regenerated.
  • I deleted the "last updated" column from the "definitions" tab since we're going to be having multiple snapshots of the data.
  • I changed the spreadsheet name to "wiki comparison", as we discussed.
  • Since the spreadsheet is no longer publicly editable, we can't use a tab to track suggestions. I've moved those to a wiki page: mw:Product Analytics/Wiki comparison suggestions.
  • I added an introduction tab containing some context, links, and pathways for help.

Feel free to disagree with any of this!

Also, I think we can just title the tabs based on the end of the snapshot period; since all the periods will be 12 months, specifying the start of the period won't provide any useful information. We should also title it "Dec YYYY" if it covers a full calendar year just to avoid people thinking that year-labeled snapshots are different from month-labeled snapshots.

I added a 2017 tab but am having some errors when running the same queries for 2018 using an updated query_vars dict variable. Here's the update:

query_vars = dict(
    snapshot = "2019-09",
    start= "2018-01-01", 
    end="2019-01-01",
    
    pv_start = "201801",
    pv_end = "201901",
    
    ner_start = "2018-01",
    ner_end = "2019-01"
)

The Hive tables are not happy with 2018-01-01 as a date input.
Maybe you have a suggestion @cchen?

For the record, we discussed this in chat and figured out that the underlying table (cchen.editor_month) was missing the underlying 2018 data for some reason.

I'll try this again for a 2018 and a Nov 2018–Oct 2019 tab tomorrow.

Sounds good!

Also, it looks like the new editor retention query needs to be updated? Was this metric definition updated since the initial writing of the query? At present, the query looks at a one-year period, whereas Product Analytics is now using the second-month editor retention period.

with open("queries/new_editor_retention.hql") as f:
    q = f.read()

# The query's {start}/{end} placeholders are first renamed to {ner_start}/{ner_end},
# then filled in from query_vars.
ner = wmf.hive.run(
    q.format(start = "{ner_start}", end = "{ner_end}").format(**query_vars))

I believe it's all right as it is. The query gives the overall 2-month editor retention rate among all the new editors during the year-long period :)

@Iflorez, @cchen, we shouldn't call this done until we've updated the formatting to make all the sheets consistent (I suggest we follow the Jun 2018 sheet, which I spent a lot of time on!) and sent out an announcement to the interested departments (Product and the teams formerly known as Community Engagement). I'm happy to do that if you're busy right now :)

The other thing I was wondering about was the December 2017 data. The monthly unique devices data is exactly identical to the June 2018 data, which shouldn't happen, and I'm wondering what's going on there.

Hi @Neil_P._Quinn_WMF
I appreciate your eye for detail. I see two differences between the Jun 2018 tab and the others:

  • colors - alternating blue/white rows (Sheets calls this "alternating colors")
  • filter capability on each of the columns.

Are there other differences that need to be enabled?
Are there other issues? Which is the issue related to the unique devices data column?

@Neil_P._Quinn_WMF
What do you think about this idea for scripting the changes that need to be made, so that sheets created for this workbook in the future also get the missing and needed changes:

If the required changes apply to all the new tabs, and presumably will also apply to the tabs created in the new year: can you record a macro while you make the edits to one tab, title it appropriately, and then deploy that macro on the next tabs that need those changes?
Then we'd have that macro ready to deploy for future sheets/tabs.

@Iflorez, sorry for the delay. Here's the full list of issues I've noticed.

Data issues
I think we definitely need to address the first two; I would feel bad about ignoring the last two, but honestly they would only affect sites way down at the bottom, so it would probably be fine.

  • Cumulative content edits and content pages always use the latest values, even for retrospective snapshots (we would be able to address this by using the AQS API like we do for monthly metrics)
  • Columns C-F (the readership metrics) in the Dec 2017 snapshot seem to have completely identical values to the Jun 2018 snapshot
  • Retrospective snapshots include projects that didn't exist at the time (would be easy to filter them out since they will have zero or null values for almost all metrics)
  • Projects that hadn't existed for a full 12 months still use 12 month averages for all values (this was true even for the first snapshot, though).
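The filtering mentioned for the third issue could be as simple as a mask over the metric columns (a hypothetical sketch; the column names are illustrative, not the actual ones in the dataset):

```python
import pandas as pd

# Hypothetical sketch: drop wikis from a retrospective snapshot when every
# metric is zero or null, on the assumption that this means the project
# didn't exist yet. metric_cols would be the full list of metric columns.
def drop_nonexistent(wikis: pd.DataFrame, metric_cols: list) -> pd.DataFrame:
    existed = wikis[metric_cols].fillna(0).ne(0).any(axis=1)
    return wikis[existed]
```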

Export/formatting issues
Note that some of these are only issues because they're inconsistent with the first snapshot; we want everything to look as identical as possible to make moving between sheets seamless. The idea of making a macro is interesting and hadn't occurred to me, but on investigation, it seems like more trouble than it's worth; for example, if we add new metrics in the future, it will break, and I don't think it can make the columns equal widths across multiple sheets.

One option is to address them in the xlsxwriter conversion (which I imagine would work for some, like the fonts and number formats, but not all). However, my preference would just be to take the new values and paste them into a duplicated version of the Jun 2018 snapshot; I did the "Dec 2018 (reformatted)" tab as an example. It only took 5 minutes; considering how rarely we're going to do this, it doesn't seem worth the trouble of automating.

  • I assume the "Nov 2018-Nov 2019" sheet is actually "Dec 2018-Nov 2019" (which we can just call "Nov 2019")
  • Product code column is duplicated
  • Inconsistent fonts (mix of Arial and Calibri) and font sizes (the first snapshot uses 10-point Arial consistently)
  • Counts are formatted with two decimal places rather than as integers
  • Percentages are formatted as integers rather than with one decimal place (this may actually be an improvement, although it seems like doing it with xlsxwriter rather than Sheets formatting means you can't click into the cell to see the full value)
  • Black cell borders around header cells
  • Column titles are center justified rather than having the same justification as the column content
  • Wiki names (column A) are bold

Thank you @Neil_P._Quinn_WMF

I'll take a look at #2 in the Data Issues section later today and will reach back with questions on the other data issues items.

I have some questions on the Data Styling portion. I'll dig into that tomorrow.

Thank you @Neil_P._Quinn_WMF

I'll take a look at #2 in the Data Issues section later today and will reach back with questions on the other data issues items.

I have some questions on the Data Styling portion. I'll dig into that tomorrow.

Great, thank you for looking into them! Let me know if you have any questions or disagreements 😁

Hi @Neil_P._Quinn_WMF

Data Issues:
#2 - yes, 2017 readership metrics do have identical data to the June 2018 tab. I'm looking into this. For now, can you please remove that sheet from the Wiki Comparison notebook? No dates were hardcoded into the individual queries in the code repository. Instead, there's one cell at the top that defines dates. So I'm not sure how this came about, it seems unusual that one section has different data variables than another...this definitely needs to be fixed.

Related: can you please remove the Dec 2018 tab? There are now two 2018 tabs which can be confusing. The updated 2018 reformatted tab that you worked on should suffice. I do not have editing permissions on the workbook at present.

Data Issues:
#2 - yes, 2017 readership metrics do have identical data to the June 2018 tab. I'm looking into this. For now, can you please remove that sheet from the Wiki Comparison notebook? No dates were hardcoded into the individual queries in the code repository. Instead, there's one cell at the top that defines dates. So I'm not sure how this came about, it seems unusual that one section has different data variables than another...this definitely needs to be fixed.

Yeah, I had seen that in the notebook; obviously there is something weird and unexpected going on! But agreed that removing it for now (as you've already done) is the best solution.

Anyway, I think we're actually pretty close to good! What do you think about these ideas:

  • I can reformat the Nov 2019 tab as I did with the Dec 2018 one.
  • It's not the ideal solution, but removing the cumulative content edits from the Dec 2018 tab (as you did with the content pages) would take care of the issue with current data.

If we do those two things, I think that will take care of all the pressing issues! Then, we can make an announcement and call this done 😁

Hi @Neil_P._Quinn_WMF

About the data issues:
#1 I will need a little more insight into gathering cumulative content edits and content pages with the API. As for historical cumulative content pages, the Wikistats 2.0 API notes that the total article count can be derived from the Pages Created count by taking its cumulative value. However, I'm not clear on how this addresses deleted content. I looked at this in T240253 for GLOW India and ended up pulling data from https://meta.wikimedia.org/wiki/List_of_Wikipedias/Table. I would appreciate more detail on how to pull accurate data for these two measures. In the short term, I recommend having blank columns in any historical-year wiki comparison snapshots. So, for example, if we decide to create a 2017 tab, it would have blank columns for cumulative content edits and content pages.
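For reference, the cumulative derivation itself is straightforward once a monthly new-pages series is in hand; the open questions are the baseline and deletions. A sketch (the AQS endpoint path and the response keys are assumptions to check against the REST API docs, and as noted, summing new pages ignores deletions, so it will overcount):

```python
import json
import urllib.request

import pandas as pd

# Assumed Wikistats 2 / AQS endpoint for new content pages per month; verify
# the exact path and response keys against the REST API documentation.
AQS_NEW_PAGES = "https://wikimedia.org/api/rest_v1/metrics/edited-pages/new"

def new_pages_monthly(project: str, start: str, end: str) -> pd.Series:
    """Monthly new content pages, e.g. start='20170101', end='20180101'."""
    url = f"{AQS_NEW_PAGES}/{project}/all-editor-types/content/monthly/{start}/{end}"
    with urllib.request.urlopen(url) as resp:
        results = json.load(resp)["items"][0]["results"]
    return pd.Series({r["timestamp"]: r["new_pages"] for r in results})

def cumulative_pages(monthly: pd.Series, baseline: int = 0) -> pd.Series:
    """Running article count from a known baseline. Ignores deletions, so it
    overcounts relative to the live article count."""
    return monthly.cumsum() + baseline
```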

#2 yes, the 2017 readership metrics do have identical data to the June 2018 tab. I ran the notebook again, just in case there was an error in the file upload to Google Sheets, and received the same output. There are three related queries in the Data Collection notebook. Each of the queries is looking at proportions. Is it possible that the proportions did not change from 2017 to 2018 for these four measures? Is it possible that this issue will come up again in 2019, where we have proportions and the proportions match previous proportions? Should we consider renaming the columns to make it clearer that we are talking about proportions, not absolute values?

Mobile Unique Devices (see cell 62 in the repo):

SELECT
    regexp_replace(
        regexp_replace(
            regexp_replace(regexp_replace(domain, "www\\\\.", ""), "zero\\\\.", ""),
        '^m\\\\.', ''),
    '\\\\.m\\\\.', '.') AS domain_name,
    SUM(if((domain regexp '^m\\\\.' or domain regexp '\\\\.m\\\\.'), uniques_estimate, 0)) AS mobile_count,
    SUM(uniques_estimate) AS total_count,
    SUM(
        IF((domain regexp '^m\\\\.' or  domain regexp '\\\\.m\\\\.'), uniques_estimate, 0)
    ) / SUM(uniques_estimate) AS mobile_unique_devices
FROM wmf.unique_devices_per_domain_monthly
WHERE 
    CONCAT(year, month) >= "{pv_start}" AND
    CONCAT(year, month) < "{pv_end}"
GROUP BY    
    regexp_replace(
        regexp_replace(
            regexp_replace(regexp_replace(domain, "www\\\\.", ""), "zero\\\\.", ""),
        '^m\\\\.', ''),
    '\\\\.m\\\\.', '.')

Monthly Unique Devices (see cell 10 in the repo):

SELECT
    regexp_replace(
        regexp_replace(
            regexp_replace(domain, "zero\\\\.", ""),
        '^m\\\\.', ''),
    '\\\\.m\\\\.', '.') AS domain_name,
  SUM(uniques_estimate) / 12 AS monthly_unique_devices
FROM wmf.unique_devices_per_domain_monthly
WHERE 
    CONCAT(year, month) >= "{pv_start}" and
    CONCAT(year, month) < "{pv_end}"
GROUP BY    
    regexp_replace(
        regexp_replace(
            regexp_replace(domain, "zero\\\\.", ""),
        '^m\\\\.', ''),
    '\\\\.m\\\\.', '.')

Mobile Web and Mobile App Pageviews Proportion (see cell 30 in the repo):

SELECT 
    CONCAT("https://", project, ".org") AS domain_name,
    SUM(if(access_method = "mobile web", view_count, 0)) / SUM(view_count) AS mobile_web_pageviews_proportion,
    SUM(if(access_method = "mobile app", view_count, 0)) / SUM(view_count) AS mobile_app_pageviews_proportion
FROM wmf.projectview_hourly
WHERE
    agent_type = "user" 
    AND CONCAT(year, month) >= "{pv_start}" 
    AND CONCAT(year, month) < "{pv_end}" 
GROUP BY CONCAT("https://", project, ".org")
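To pin down whether the 2017 and 2018 pulls are wiki-by-wiki identical rather than just similar in aggregate, the outputs of the queries above could be diffed directly in pandas (a hedged sketch; it assumes both result frames share a domain_name key and the same metric columns):

```python
import pandas as pd

# Sketch: merge two snapshot pulls on the wiki key and report, per metric,
# how many wikis have exactly identical values. This helps distinguish
# "proportions genuinely didn't move" from "the same data got written twice".
def identical_counts(a: pd.DataFrame, b: pd.DataFrame, cols: list) -> pd.Series:
    merged = a.merge(b, on="domain_name", suffixes=("_a", "_b"))
    return pd.Series(
        {c: int((merged[f"{c}_a"] == merged[f"{c}_b"]).sum()) for c in cols}
    )
```

If every metric comes back identical for every wiki, the same data was almost certainly written twice; scattered matches would instead point at genuinely stable proportions.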

#3

Retrospective snapshots include projects that didn't exist at the time (would be easy to filter them out since they will have zero or null values for almost all metrics)

I could use more detail here. Which columns or data are you referencing when you say retrospective snapshots?
Do you mean that when we pull data for 2017, for example, there are projects with many zero values because they didn't exist at the time? Maybe there needs to be a 'birthdate' column? See item 4 below.

#4

Projects that hadn't existed for a full 12 months still use 12-month averages for all values (this was true even for the first snapshot, though).

I appreciate your catching this. It seems like these projects need additional handling in the queries' data processing. Where is a good source for identifying how long a wiki has been around? Is there a recommended Hive table? It would be good to catch this sort of thing for the next round of data pulled, the 2019 data.
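One possible shape for the fix, assuming the monthly queries can return a long table with one row per wiki per month rather than a pre-divided total (the column names here are illustrative):

```python
import pandas as pd

# Sketch: average over the months a wiki actually has data for, instead of
# dividing a 12-month total by a fixed 12. Assumes a long-format frame from
# the monthly query with one row per (domain_name, month).
def monthly_average(long_df: pd.DataFrame, value_col: str) -> pd.Series:
    grouped = long_df.groupby("domain_name")[value_col]
    return grouped.sum() / grouped.count()  # divide by months present, not 12
```

A wiki created mid-year then gets an average over the months it existed, without needing a separate 'birthdate' lookup.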

In T221566#5739287, @Neil_P._Quinn_WMF wrote:

Data Issues:
#2 - yes, 2017 readership metrics do have identical data to the June 2018 tab. I'm looking into this. For now, can you please remove that sheet from the Wiki Comparison notebook? No dates were hardcoded into the individual queries in the code repository. Instead, there's one cell at the top that defines dates. So I'm not sure how this came about, it seems unusual that one section has different data variables than another...this definitely needs to be fixed.

Yeah, I had seen that in the notebook; obviously there is something weird and unexpected going on! But agreed that removing it for now (as you've already done) is the best solution.

Anyway, I think we're actually pretty close to good! What do you think about these ideas:

  • I can reformat the Nov 2019 tab as I did with the Dec 2018 one.

Excellent! Again, I highly recommend recording a macro, as it should just take two clicks and can then be deployed with a shortcut on the next sheet. The process would be clicking

Tools > Macros > Record macro

and Sheets will put your actions into code that it saves as a macro.
But, of course, it's up to you :)

  • It's not the ideal solution, but removing the cumulative content edits from the Dec 2018 tab (as you did with the content pages) would take care of the issue with current data.

Done. Also, what do you think about renaming the 'content pages' column to 'cumulative content pages'? I made that edit in the Sheets and can put that into the repo for the next ones (or change the naming back...)

If we do those two things, I think that will take care of all the pressing issues! Then, we can make an announcement and call this done 😁

Sounds good! I made a few comments relative to the four data issues that you identified; those can be discussed and rolled out in the next iteration, for 2019 data.

Again, I highly recommend recording a macro as it should just take two clicks and then can be deployed with a shortcut on the next sheet.

I looked into it briefly, and the problem is that a macro would likely be quite brittle. If I went and actually reformatted a sheet "in place", for example, I could probably make that into a macro. But the commands that applied the proper numerical formatting to each column would likely break as soon as the arrangements of the columns changed. So, for example, adding any new metric would probably render the macro useless. And I doubt I would be able to make a macro out of what I actually did (which was to copy the Jun 2018 tab, and paste the values from a new tab into it, so that the data acquires the format from the Jun 2018 tab).

It might make sense to put up with the brittleness if we were adding one of these sheets every week. But considering that we're probably going to be doing it once a year, I don't think it does.

The more maintainable solution would be to upstream as many fixes as possible into the xlsxwriter export. For example, could we fix the font problems there? Could we change the export of the numerical columns to eliminate the distracting, overly precise fractional values? Could we change from applying the formats at fixed locations to applying formats according to a dict mapping metrics to formats?
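A rough sketch of what that dict-driven export could look like with pandas and XlsxWriter (the format strings and column names are illustrative, not the repo's actual ones):

```python
import pandas as pd

# Illustrative metric-to-format mapping; a real export would list every metric
# column here. Anything unlisted falls back to General.
NUM_FORMATS = {
    "monthly_unique_devices": "#,##0",  # counts as integers
    "mobile_unique_devices": "0.0%",    # proportions with one decimal place
}

def format_spec(col: str) -> dict:
    """10-point Arial everywhere, plus the per-metric number format."""
    return {"font_name": "Arial", "font_size": 10,
            "num_format": NUM_FORMATS.get(col, "General")}

def export_formatted(wikis: pd.DataFrame, path: str) -> None:
    # Looking formats up by column name means reordering or adding columns
    # doesn't break the export, unlike formats applied at fixed positions.
    with pd.ExcelWriter(path, engine="xlsxwriter") as writer:
        wikis.to_excel(writer, sheet_name="snapshot", index=False)
        book, sheet = writer.book, writer.sheets["snapshot"]
        for i, col in enumerate(wikis.columns):
            sheet.set_column(i, i, 16, book.add_format(format_spec(col)))
```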

Once we did that, perhaps we would be left with some things that could be put into a less brittle macro (e.g. changing the font for the whole sheet, auto-setting the column widths) or perhaps we could change the old sheets to match the new exported format.

We don't have to do any of this now, of course; I imagine you're ready to move on to other things! I'm just explaining what I think the ideal path forward would be 😊

Also, what do you think about renaming the 'content pages' column to 'cumulative content pages' ? I made that edit in the Sheets and can put that into the repo for the next ones (or change the naming back...)

Interesting idea! But, on reflection, it doesn't make sense to me. I added "cumulative" to "content edits" because the "content edits" by itself would imply the number of content edits made during some recent period, as opposed to the number of content edits since the wiki's inception. With "content pages", the obvious implication is the number of content pages currently existing on the wiki, which is in fact the definition used here and elsewhere. "Cumulative content pages" might make people think it's something different, like the number of content pages that have ever been created, including ones that have been deleted.

In T221566#5744406, @Neil_P._Quinn_WMF wrote:

And I doubt I would be able to make a macro out of what I actually did (which was to copy the Jun 2018 tab, and paste the values from a new tab into it, so that the data acquires the format from the Jun 2018 tab).

Well, it seems like it might be able to manage this! But that still leaves the problem that the macro would no longer work seamlessly if the columns change. My main point is that it's probably not a good idea to add a new technology to our stack to save five minutes every six months or so 😁

@nshahquinn-wmf The wiki comparison sheet is now updated with Dec 2019 data.
You can now add formatting magic :)

One other thing to consider: I propose that we also remove the edits per content page column.
On the Dec 2018 tab, the edits per content page column, column V, is calculated by dividing two cumulative columns:
wikis["edits_per_content_page"] = wikis["cumulative_content_edits"] / wikis["article_count"]
We removed the article_count and cumulative content edits columns from the Dec 2018 tab because the data was inaccurate: it showed the counts as of the date of the data pull (2019 in this case), not the date in the query. So I propose that we also remove the edits per content page column.

@nshahquinn-wmf The wiki comparison sheet is now updated with Dec 2019 data.
You can now add formatting magic :)

Nice—the formatting spell has been cast!

One other thing to consider: I propose that we also remove the edits per content page column.
On the Dec 2018 tab, the edits per content page column, column V, is calculated by dividing two cumulative columns:
wikis["edits_per_content_page"] = wikis["cumulative_content_edits"] / wikis["article_count"]
We removed the article_count and cumulative content edits columns from the Dec 2018 tab because the data was inaccurate: it showed the counts as of the date of the data pull (2019 in this case), not the date in the query. So I propose that we also remove the edits per content page column.

Good point! I removed it from the Dec 2018 snapshot (but not the Dec 2019 snapshot, as I think you intended).

So now the dataset is improved, up-to-date, and pretty! Nice job, @Iflorez 😊

I've filed two follow-up tasks:

  • Get Legal's okay to make the data public and then make it public again (T243197)
  • Send out an announcement of the update (T243199)