
Isaac (Isaac Johnson)
Research Scientist

User Details

User Since
Oct 1 2018, 2:19 PM (119 w, 6 d)
Availability
Available
IRC Nick
isaacj
LDAP User
Isaac Johnson
MediaWiki User
Isaac (WMF)

Recent Activity

Fri, Jan 15

Isaac moved T272175: Prototype article importance metrics from Staged to FY2020-21-Research-January-March on the Research board.
Fri, Jan 15, 6:57 PM · Research (FY2020-21-Research-January-March)
Isaac created T272175: Prototype article importance metrics.
Fri, Jan 15, 6:57 PM · Research (FY2020-21-Research-January-March)
Isaac added a comment to T272109: Assess prevalence of Wikidata infoboxes.

In Czech Wikipedia, all templates using data from Wikidata are in a category, so we can use that for the analysis. Note the category doesn't contain _only_ infoboxes, but infoboxes can be easily filtered out thanks to their specific page titles.

Adding to that, there's also a more specific category on certain wikis such as English that's not just Wikidata templates but Wikidata-infobox templates -- e.g., en:Category:Infobox_templates_using_Wikidata. This will give you a conservative estimate of pages that might be adding an image automatically via Wikidata.

Fri, Jan 15, 2:45 PM · Research (FY2020-21-Research-January-March), Image-Recommendations, Growth-Team, Wikipedia-Android-App-Backlog

Wed, Jan 13

Isaac added a comment to T271960: New anaconda-wmf release with updated packages.

Thanks @Ottomata! The three that I can think of quickly are:

  • mwapi for making easy requests to MediaWiki APIs. Obviously less relevant for worker nodes but I often use this in the same notebook as I'm running PySpark jobs for various data augmentation etc. I think only available via pip.
  • mwxml for easy access to the dumps. Again, less for worker nodes and more for being able to mix analyses that depend on the cluster with dump processing. I think only available via pip.
  • shapely for spatial analyses. If this one is a big package, it's okay to leave it off as it's probably more specific to me. Available via pip but also a conda package.
Wed, Jan 13, 8:24 PM · Analytics-Kanban, Discovery, Product-Analytics, Research, Analytics

Tue, Jan 12

Isaac added a comment to T270140: Release dataset on top search engine referrers by country, device, and language.

Regarding privacy, there are always various options for how to implement this that James will help guide. This dataset is perfect for differential privacy but unfortunately I assume we're not ready yet to apply it. In the future though, I'd love to come back to do that. In the meantime, I'm going to assume we're using our most straightforward approach of defining a threshold of data that a datapoint must exceed to be released -- e.g., at least 1000 pageviews to a given country+lang+browser+OS for that data point to be released. This is a simple thing to enforce in the data pipeline -- e.g., add HAVING COUNT(1) >= 1000 in the final clause of the example pipeline above. This is an update from an earlier analysis above when we were just considering browser family and not OS (so the data was more aggregated).
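
A minimal sketch of the release threshold described above, run against an in-memory SQLite table rather than the real pipeline. The table name `pv`, its columns, and the helper `release_rows` are hypothetical stand-ins for illustration; only the `HAVING COUNT(1) >= threshold` clause comes from the comment itself.

```python
import sqlite3


def release_rows(pageviews, threshold=1000):
    """Group raw pageview rows and keep only groups that meet the threshold.

    `pageviews` is an iterable of (country, lang, browser, os) tuples,
    one tuple per pageview. Table and column names are illustrative only.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE pv (country TEXT, lang TEXT, browser TEXT, os TEXT)")
    conn.executemany("INSERT INTO pv VALUES (?, ?, ?, ?)", pageviews)
    rows = conn.execute(
        """
        SELECT country, lang, browser, os, COUNT(1) AS views
        FROM pv
        GROUP BY country, lang, browser, os
        HAVING COUNT(1) >= ?      -- groups below the threshold are never released
        ORDER BY views DESC
        """,
        (threshold,),
    ).fetchall()
    conn.close()
    return rows
```

Because the filter sits in the final aggregation step, small groups are dropped before any output is written, which is what makes this approach simple to enforce.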

Tue, Jan 12, 11:11 PM · Patch-For-Review, Privacy Engineering, Research, Analytics
Isaac added a comment to T270140: Release dataset on top search engine referrers by country, device, and language.

I might be slightly ahead but I wanted to give an update on where we are with the queries / privacy analyses. This first post focuses on the dataset generation code. I think it's pretty straightforward (thanks to wmf.pageview_actor) and just requires a few choices to be made around privacy (future posts will address that). Basic code below and some notes:

Tue, Jan 12, 10:18 PM · Patch-For-Review, Privacy Engineering, Research, Analytics

Mon, Jan 11

Isaac closed T264455: Measure equity impact of current recommender systems, a subtask of T155541: [Epic] Article importance prediction model, as Resolved.
Mon, Jan 11, 5:00 PM · Research, Machine Learning Platform, artificial-intelligence
Isaac closed T264455: Measure equity impact of current recommender systems as Resolved.
Mon, Jan 11, 5:00 PM · Research (FY2020-21-Research-October-December)

Fri, Jan 8

Isaac created T271571: Update Image usage metric.
Fri, Jan 8, 6:15 PM · Analytics-Kanban, Patch-For-Review, Product-Analytics, Analytics
Isaac updated subscribers of T270140: Release dataset on top search engine referrers by country, device, and language.

Thanks @Milimetric -- @bmansurov will be leading the technical work on this so we're going to start work on this and greatly appreciate whatever code review / support Analytics is able to provide along the way. My assumption is that this will be a reportupdater query like the existing browser/OS stats but I'm largely ambivalent about where/how the data is generated.

Fri, Jan 8, 4:19 PM · Patch-For-Review, Privacy Engineering, Research, Analytics

Wed, Jan 6

Isaac added a comment to T267283: Evaluate a differentially private solution to release wikipedia's project-title-country data.

This looks awesome @Isaac! Can't wait to try it out.

Thanks! If there's anything that doesn't make sense, let me know and I'll try to explain. Or additional features that would be useful and I can see if they'd be easy to incorporate.

Wed, Jan 6, 9:05 PM · Analytics, Research, Privacy Engineering, Privacy, Data-release

Tue, Jan 5

Isaac added a comment to T271163: TranslationRecommendation* Schemas Event Platform Migration.

If it's not hard, I'd ask to retain the geocoded data.

It isn't hard, we can do!

Thanks!

Tue, Jan 5, 3:14 PM · Research, Analytics, Event-Platform

Mon, Jan 4

Isaac added a comment to T271163: TranslationRecommendation* Schemas Event Platform Migration.

let us know if this schema needs client IP and/or geocoded data? If not, it will be removed as part of this migration.

If it's not hard, I'd ask to retain the geocoded data. Client IP is nice for determining unique number of users (UA+IP) but there's a user token in the data that also works for that purpose. I do use the geocoded data (country specifically) for looking at geographic diversity of users for the system though so would prefer to retain it.

Mon, Jan 4, 10:25 PM · Research, Analytics, Event-Platform
Isaac updated the task description for T270140: Release dataset on top search engine referrers by country, device, and language.
Mon, Jan 4, 8:10 PM · Patch-For-Review, Privacy Engineering, Research, Analytics
Isaac updated the task description for T270140: Release dataset on top search engine referrers by country, device, and language.
Mon, Jan 4, 8:08 PM · Patch-For-Review, Privacy Engineering, Research, Analytics
leila awarded T238437: Identify and prepare a data-set for Fair Ranking Track at TREC a Love token.
Mon, Jan 4, 7:30 PM · Research (FY2020-21-Research-January-March)
Isaac added a comment to T270779: Parallelize pipeline for building groundtruth for region inference.

Update: fkaelin helped me identify that the problem was with Null values being passed to the point-in-country lookup (and causing the function to error out). With the null checking in place (WHERE lat IS NOT NULL AND lon IS NOT NULL), the code now runs (and quite fast for all Wikidata items at ~20 minutes). Results are sufficiently close to what I had through the old Python script that I think everything is working correctly (I'm working with slightly newer data and a few other small adjustments so some difference was expected).
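
The null-checking fix can be illustrated with a small, self-contained sketch. This is not the notebook's code: it swaps in a plain ray-casting test in place of the shapely point-in-polygon call, and the function names and the `countries` mapping are hypothetical. The point is only that missing coordinates are rejected before the lookup runs, mirroring the `WHERE lat IS NOT NULL AND lon IS NOT NULL` filter.

```python
def point_in_polygon(lon, lat, polygon):
    """Ray-casting test; `polygon` is a list of (lon, lat) vertices."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through the point count.
        if (y1 > lat) != (y2 > lat):
            # Longitude at which this edge crosses that horizontal line.
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside


def country_for(lat, lon, countries):
    """Return the first country containing the point, or None.

    Missing coordinates short-circuit to None instead of raising,
    which is the Python-side equivalent of the SQL null filter.
    """
    if lat is None or lon is None:
        return None
    for name, polygon in countries.items():
        if point_in_polygon(lon, lat, polygon):
            return name
    return None
```

Filtering in the query is still preferable when possible, since rows without coordinates then never reach the workers at all.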

Mon, Jan 4, 7:10 PM · Research
Isaac claimed T238437: Identify and prepare a data-set for Fair Ranking Track at TREC.

Assigning this to myself as it's clear that it's active now. Looks like TREC is interested in a task around building WikiProject worklists while taking into account equity aspects of the articles that show up in the list. I'll continue to meet with the organizers to help shape the task and will provide dataset support over the next few months.

Mon, Jan 4, 6:20 PM · Research (FY2020-21-Research-January-March)
Isaac added a comment to T267283: Evaluate a differentially private solution to release wikipedia's project-title-country data.

Hey all -- happy new year!! I had differential privacy on the mind and my feeling is that we were stalled on how to choose the parameters that determine how much "privacy" is being assured. So I made an attempt to set up a simple interface to help us see the impact of different choices of parameters for differential privacy on potential result lists. I know that we should make our decisions about privacy separate from how it affects the perceived utility of the results but I found this really useful for thinking about the different approaches / parameters and what they mean. I mostly based the tool and my corresponding thoughts on the links that @TedTed shared to differential privacy blogposts, Facebook's mobility data, and Google's search trends data (thank you!).
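
For readers unfamiliar with what the parameters control, here is a hedged sketch of one common approach, the Laplace mechanism, in which noise with scale sensitivity/epsilon is added to each count. This is an assumption for illustration and not necessarily what the tool mentioned above implements; the function names and the default sensitivity of 1 are hypothetical.

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) noise.

    The difference of two independent exponentials with mean `scale`
    follows a Laplace distribution with that scale.
    """
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def noisy_counts(counts, epsilon, sensitivity=1.0, seed=None):
    """Add Laplace noise to each count; smaller epsilon means more noise."""
    rng = random.Random(seed)
    scale = sensitivity / epsilon
    return {key: value + laplace_noise(scale, rng) for key, value in counts.items()}
```

Running `noisy_counts` on the same counts with a few different epsilon values gives a quick feel for how the privacy parameter trades off against the stability of a ranked result list.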

Mon, Jan 4, 2:59 PM · Analytics, Research, Privacy Engineering, Privacy, Data-release

Wed, Dec 23

Isaac added a comment to T264455: Measure equity impact of current recommender systems.

weekly update:

  • Equity impact complete for Suggested Edits. Summary:
The analyses demonstrated that depending largely on a random selection of content for recommendation reinforces the status quo around gender and geography -- i.e. heavy imbalance towards men, the United States, and United Kingdom -- and therefore the net effect of the recommender is to improve content about men more than women or other gender identities and content about the US/UK more than other regions. The exact regions improved depends heavily on language -- i.e. US/UK for English Wikipedia but Japan for Japanese Wikipedia or Germany for German Wikipedia -- but the trend remains that editors do not themselves seem to exert additional selection bias over the recommendations. Analogously, the gender associated with the content recommended does not seem to affect whether editors choose to make an edit or not.
  • I still would like to repeat this analysis on another recommender system. I'll likely go with Newcomer tasks because there is good data collection, it's been used quite a bit in a number of languages outside of English, and it has the added variable of maintenance templates (which might skew the potential recommendations) and topic preferences (which might skew what recommendations are actually shown). So while Suggested Edits ended up being a pretty straightforward story (biased content -> biased recommendations -> biased edits), Newcomer Tasks might be more complicated.
Wed, Dec 23, 9:34 PM · Research (FY2020-21-Research-October-December)
Isaac added a comment to T270779: Parallelize pipeline for building groundtruth for region inference.

leaving myself this link too so I don't lose it on optimizing point-in-polygon lookups. I suspect not necessary if we can get the PySpark pipeline working but a reminder that there are other possible optimizations to speed up this pipeline: https://gis.stackexchange.com/questions/120955/understanding-use-of-spatial-indexes-with-rtree/144764#144764

Wed, Dec 23, 8:36 PM · Research
Isaac closed T171635: Prototype new models to facilitate sockpuppet detection, a subtask of T171251: [Objective 3.1.2] Models for sockpuppet and toxic discussion detection, as Resolved.
Wed, Dec 23, 7:48 PM · Anti-Harassment, Epic, Research-Programs
Isaac closed T171635: Prototype new models to facilitate sockpuppet detection as Resolved.
Wed, Dec 23, 7:48 PM · Research (FY2020-21-Research-October-December), Anti-Harassment, artificial-intelligence
Isaac added a comment to T270779: Parallelize pipeline for building groundtruth for region inference.

@fkaelin no hurry on this but maybe we can walk through this code at our next meeting. My first attempt at this is not working. I can get the shapely library I'm using for point-in-polygon operations onto the worker nodes and the operation to run for test examples but it fails when running on even smallish subsets of Wikidata (e.g., just items with sitelinks to Hausa Wikipedia, which only has ~7000 articles (and therefore Wikidata items), of which not all have coordinate data). Notebook is here but also can be found on stat1008 at /home/isaacj/wiki-region-data.ipynb. I _think_ the code is technically correct and the issue is just with how much lifting the point-in-polygon operation requires on the worker nodes but I can't be certain...

Wed, Dec 23, 5:25 PM · Research
Isaac created T270779: Parallelize pipeline for building groundtruth for region inference.
Wed, Dec 23, 5:17 PM · Research

Tue, Dec 22

Isaac closed T269358: Can't use custom conda kernel in Newpyter within PySpark UDFs as Resolved.

It's easy enough, and already huge (as you noticed) so no harm in adding more packages.

Sounds good -- in the new year, I'll likely come back to this in a new task and start building a list.

Tue, Dec 22, 8:00 PM · Analytics-Kanban, Analytics
Isaac added a comment to T269358: Can't use custom conda kernel in Newpyter within PySpark UDFs.

does https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Spark#pyspark_and_external_packages help at all? You could certainly pass those args in a custom pyspark kernel.json, but perhaps you can also do the same via the SparkSession API? https://wikitech.wikimedia.org/wiki/Analytics/Systems/Jupyter#Launching_as_SparkSession_in_a_Python_Notebook

Yep, just gave this a try (zipping up my local conda environment after installing shapely and adding to the PySpark environment; PDF of notebook attached) and all seems to be working. Thanks for the pointer (I actually had no idea there was a Spark page on wikitech outside of the Jupyter page)! A few thoughts:

  • The zipped conda environment is 750MB, which seems quite large. I'm not sure if this is something that is problematic for the workers or can be improved, but I figured worth mentioning because its size surprised me. I guess it's because it contains a complete Python environment + lots of packages (and hence why it works...)
  • I of course would not object to shapely being included by default in the Anaconda environment (easier for me in most cases than configuring my own kernel) but do appreciate that I can have an environment over which I have full control of packages and versions. However, for the sake of simplicity with most of what I do, how much effort is it to add a package and at what point does the Anaconda environment become too bloated? If it's easy, I can create a task for it because I'd request at least mwapi (easy MediaWiki API calls) and mwxml (parsing XML dumps) as well, which aren't for the Spark side but for other common analyses I do in a notebook.
Tue, Dec 22, 3:36 PM · Analytics-Kanban, Analytics
Isaac added a comment to T207171: Have a way to show the most popular pages per country.

Sorry for the radio silence! I just finished up final exams, so I'm now freed up to make more progress on this.

No worries - thanks for working on this!

Tue, Dec 22, 2:34 PM · Patch-For-Review, Analytics-Wikistats, Privacy Engineering, Inuka-Team, Language-strategy, Tool-Pageviews, Analytics

Dec 18 2020

Isaac added a comment to T264455: Measure equity impact of current recommender systems.

weekly update:

Dec 18 2020, 9:20 PM · Research (FY2020-21-Research-October-December)
Isaac committed rRLPf811a21f8f8d: update header button for wikiworkshop submissions (authored by Isaac).
Dec 18 2020, 7:07 PM
Isaac added a comment to T171635: Prototype new models to facilitate sockpuppet detection.

I'm going to close this task out unless there are any objections -- my work on this has largely been complete for a while now and no issues have come up yet in the productization that would require serious rework of the approach (though plenty of improvements have been made to the stability of the prototype). Future tasks that we might open are:

  • Making updates based on Checkuser feedback
  • Further research into other types of data / modeling that could help inform the ranking.
Dec 18 2020, 4:32 PM · Research (FY2020-21-Research-October-December), Anti-Harassment, artificial-intelligence
Isaac committed rRLPcaad1dd60da5: Update events page -- new WikiWorkshop plus finally new screenshot of Showcase… (authored by Isaac).
Dec 18 2020, 3:38 PM
Isaac updated the task description for T219903: Keep research.wikimedia.org landing page updated.
Dec 18 2020, 3:18 PM · Patch-For-Review, Research

Dec 16 2020

Isaac closed T270213: No admin response for many months for research-internal listserv as Resolved.

Received -- thanks @Dzahn !

Dec 16 2020, 6:29 PM · Research, SRE, Wikimedia-Mailing-lists
Isaac added a comment to T270213: No admin response for many months for research-internal listserv.

Thanks for the quick response @Dzahn ! I emailed the list to ask for nominations and @Ladsgroup graciously volunteered so we will be the two new admins. If you could add the following as admins, that'd be much appreciated:

ijohnson@wikimedia.org
ladsgroup@gmail.com
Dec 16 2020, 2:44 PM · Research, SRE, Wikimedia-Mailing-lists

Dec 15 2020

Isaac created T270213: No admin response for many months for research-internal listserv.
Dec 15 2020, 8:17 PM · Research, SRE, Wikimedia-Mailing-lists

Dec 14 2020

Isaac added a comment to T270140: Release dataset on top search engine referrers by country, device, and language.

Some data from December 10th to help us think about privacy. Raw data can be found in isaacj.search_engine_data in Hive and data pipeline in stat1004:/home/isaacj/notebooks/Search_Engine_Traffic.ipynb. Specifically looking at how much data we'd have for each country if our daily threshold was at least 500 pageviews. I'll try to provide some additional analyses on how different privacy thresholds and k values (e.g., only reporting top 100) affect how much data is made available.

Dec 14 2020, 10:49 PM · Patch-For-Review, Privacy Engineering, Research, Analytics
Isaac renamed T270140: Release dataset on top search engine referrers by country, device, and language from Release dataset on top search engine referrers by country, OS, and language to Release dataset on top search engine referrers by country, device, and language.
Dec 14 2020, 10:36 PM · Patch-For-Review, Privacy Engineering, Research, Analytics
Isaac created T270140: Release dataset on top search engine referrers by country, device, and language.
Dec 14 2020, 10:36 PM · Patch-For-Review, Privacy Engineering, Research, Analytics
Isaac added a comment to T258419: Survey users about mediasearch on commons.

There is no quick and easy way to do either of these, but I don't think it matters for this particular survey, since it lives on the project page and is not presented to users where they wouldn't expect it.
And even if it were possible, hiding the survey still wouldn't do all that much, because the rest of the survey context (the wikitext surrounding it) would still exist.

Yeah, good point, that makes sense. Best of luck!

Dec 14 2020, 6:09 PM · SDAW-MediaSearch (MediaSearch-ReleaseCandidate), Product-Analytics, Patch-For-Review, Surveys, Structured-Data-Backlog (Current Work), Structured Data Engineering
Isaac added a comment to T269358: Can't use custom conda kernel in Newpyter within PySpark UDFs.

I've switched from custom kernels to the generic Python (not pyspark) kernel, and can install packages directly in the notebook's environment

Thanks for chiming in @awight. First off, cool notebook! Installing packages locally and using them for Python functions in the notebook also works for me -- the challenge is when I want Spark workers to also have access to the library so I can parallelize the computation. It didn't seem like you were doing that in your notebook, but let me know if I missed it.

Dec 14 2020, 5:42 PM · Analytics-Kanban, Analytics
Isaac updated subscribers of T269358: Can't use custom conda kernel in Newpyter within PySpark UDFs.

To make this more relevant, I now find myself in a situation where I do want a Python lib on the workers that is not available in the standard conda environment: shapely
It's used for doing spatial analyses and I intend to use it to parallelize this job that gathers all items with lat/lon coordinates in Wikidata and checks which countries the coordinates are in: https://github.com/geohci/wiki-region-groundtruth/blob/main/gather_wikidata_region_groundtruth.py#L169

Dec 14 2020, 5:35 PM · Analytics-Kanban, Analytics

Dec 11 2020

Isaac added a comment to T264455: Measure equity impact of current recommender systems.

weekly update:

  • wrote up notebook for gathering data on how the Suggested Edits module has actually been used. This will complement the analysis of what types of recommendations are made and indicate whether there is any bias towards skipping recommendations along gender / geography lines. For instance, since May 2020 (v4 of suggested edits), there have been 28,331 edits made via the module to images that have associated Wikipedia articles (and therefore I can directly infer gender / geography associated with those images)
  • TODO: read through results from Growth experiments to help guide impact analysis of that module
Dec 11 2020, 9:06 PM · Research (FY2020-21-Research-October-December)
Isaac added a comment to T171635: Prototype new models to facilitate sockpuppet detection.

Weekly updates:

  • Continued support of productization
  • Regenerated data through all of November -- whole pipeline was about 20 minutes start to finish from collecting all the relevant edit history from the cluster to outputting the TSV files the tool uses.
Dec 11 2020, 6:39 PM · Research (FY2020-21-Research-October-December), Anti-Harassment, artificial-intelligence

Dec 7 2020

Isaac added a comment to T266180: Request increased quota for wmf-research-tools Cloud VPS project.

There isn't; the regex is applied per-project. So you can invent whatever norms you want!

Oooh fun! In that case, instances whose name ends with -test, -build, or -prototype would be the three regexes I'd feel comfortable putting in place for this project (and recommendation-api project if you'd like too). I can confirm that currently those regexes would capture two projects that don't require backup and I'll start using them for new instances that won't require backup and try to document this for our team.
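
A small sketch of what those three name suffixes look like as a single pattern. The exact regex syntax and configuration used by the Cloud VPS backup exclusion are not shown in this thread, so the combined pattern, the helper name `needs_backup`, and the instance names in the usage note are assumptions for illustration.

```python
import re

# Matches instance names ending in -test, -build, or -prototype.
NO_BACKUP_PATTERN = re.compile(r"-(test|build|prototype)$")


def needs_backup(instance_name):
    """Return False for instances whose name marks them as disposable."""
    return NO_BACKUP_PATTERN.search(instance_name) is None
```

With this pattern, a hypothetical instance named `wikidata-test` would be excluded from backup, while `test-api` or `latest` would still be backed up because the suffix must appear at the end of the name, after a hyphen.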

Dec 7 2020, 10:33 PM · cloud-services-team (Kanban), Cloud-VPS (Quota-requests)
Isaac added a comment to T266180: Request increased quota for wmf-research-tools Cloud VPS project.

We have a few more hypervisors online now so will be granting the quota change soon.

Yay, thanks!

Dec 7 2020, 9:35 PM · cloud-services-team (Kanban), Cloud-VPS (Quota-requests)
Isaac added a comment to T269053: Turn off old surveys on the beta cluster running with non-zero coverage.

Thanks!

Dec 7 2020, 7:14 PM · QuickSurveys (Surveys), Wikimedia-Site-requests, Research, Beta-Cluster-reproducible

Dec 4 2020

Isaac added a comment to T264455: Measure equity impact of current recommender systems.

weekly update: no progress though the start of the Outreachy project on a country classifier for articles will help greatly with the geographic equity component of this work (T263646)

Dec 4 2020, 7:53 PM · Research (FY2020-21-Research-October-December)
Isaac added a comment to T171635: Prototype new models to facilitate sockpuppet detection.

Weekly update:

  • Still no feedback from Checkusers -- at this point, I believe the expectation is that we will productize it so they can access the tool directly, which should make it much easier for them to provide feedback.
  • Tool code has been moved to Gerrit: https://gerrit.wikimedia.org/r/admin/repos/mediawiki/services/similar-users
  • I'm largely just playing a consultation role right now but there has been really excellent progress on productization of the tool, which is being tracked here: T265722
  • I produced datasets of text diffs and which sections were edited by each user to explore with DD
Dec 4 2020, 7:52 PM · Research (FY2020-21-Research-October-December), Anti-Harassment, artificial-intelligence
Isaac closed T266339: Summarize feedback on first draft of knowledge gaps taxonomy, a subtask of T242172: Taxonomy of Knowledge Gaps, as Resolved.
Dec 4 2020, 7:45 PM · Research, Epic
Isaac closed T266339: Summarize feedback on first draft of knowledge gaps taxonomy as Resolved.

Update: closing this task as the summarization is complete and we have moved to make changes to the taxonomy. High-level changes to taxonomy documented here: https://docs.google.com/spreadsheets/d/1QPo8_AYHJfVBMBkAhtTi5tXvZX_Z9DrWeq376MmGqS8/edit#gid=0
Changes to the taxonomy will be tracked so that they can be linked back to the original prompts from the feedback that led to them.

Dec 4 2020, 7:45 PM · Research (FY2020-21-Research-October-December)
Isaac committed rRLPd4ef86219c0a: Update publications and team page. (authored by Isaac).
Dec 4 2020, 1:10 AM

Dec 3 2020

Isaac updated the task description for T219903: Keep research.wikimedia.org landing page updated.
Dec 3 2020, 8:47 PM · Patch-For-Review, Research
Isaac added a comment to T207171: Have a way to show the most popular pages per country.

Additional data to hopefully help show the impact that the different unique-actor privacy thresholds would have on which countries would actually be able to benefit from this data (this is based on the data from T207171#6615009). I look at 1000 vs. 500 unique actor thresholds and how many countries in each continent show up on the resulting list with at least k articles. In general, if a country only has one or two articles on a list, those are Main Page and Special:Search (so not particularly useful data for that region), and it looks to me that the list starts becoming useful at around 5 articles.
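
The comparison described above can be sketched as a small counting function. The input row shape `(country, page_title, unique_actors)` and the function name are hypothetical; the actual analysis was run on the data referenced in the comment, which is not reproduced here.

```python
from collections import Counter


def countries_with_k_articles(rows, actor_threshold, k):
    """List countries that keep at least `k` articles under a threshold.

    `rows` is an iterable of (country, page_title, unique_actors) tuples.
    An article survives only if its unique-actor count meets the threshold.
    """
    per_country = Counter(
        country for country, _title, actors in rows if actors >= actor_threshold
    )
    return sorted(country for country, n in per_country.items() if n >= k)
```

Calling this once with a threshold of 1000 and once with 500, for the same k, shows which countries only appear on the list at the lower threshold.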

Dec 3 2020, 6:25 PM · Patch-For-Review, Analytics-Wikistats, Privacy Engineering, Inuka-Team, Language-strategy, Tool-Pageviews, Analytics
Isaac created T269358: Can't use custom conda kernel in Newpyter within PySpark UDFs.
Dec 3 2020, 2:46 PM · Analytics-Kanban, Analytics

Dec 2 2020

Isaac added a comment to T269053: Turn off old surveys on the beta cluster running with non-zero coverage.

Has it ever been used in production or are there plans to use it in production?

@AlexisJazz yes -- see for more details: https://meta.wikimedia.org/wiki/Research:Surveys_on_the_gender_of_editors/Report

Dec 2 2020, 5:05 PM · QuickSurveys (Surveys), Wikimedia-Site-requests, Research, Beta-Cluster-reproducible
Isaac added a comment to T269053: Turn off old surveys on the beta cluster running with non-zero coverage.

I assume this survey is still useful for testing and intentionally undismissable?

@Jdlrobson at this point, the survey can be disabled. Sorry, didn't realize it was still around, but as you point out, it's purely for testing purposes and we are not expecting anyone to respond to it. If you are able to do it quickly, I'd much appreciate you disabling it. If not, let me know, but it may take me a while to get to as I rarely do Gerrit patches.

Dec 2 2020, 4:53 PM · QuickSurveys (Surveys), Wikimedia-Site-requests, Research, Beta-Cluster-reproducible
Isaac added a comment to T258419: Survey users about mediasearch on commons.

Just two quick notes:

  • On the "This is a bunch of text to explain what the survey is about!..." text: editors often (reasonably) assume that the survey follows them around based on their username when in fact whether it shows up or not is browser-based. You'll probably want to clarify that in that text (e.g., "If you use multiple browsers, you may see the survey multiple times and should just dismiss it if you've already taken it") or you'll likely get some questions along the lines of "I thought I already took this" or "Please make this go away".
  • I don't see a dismiss button for the survey. Apologies for not bringing this up earlier -- I think this is a basic issue with QuickSurveys but I had forgotten. If it's an easy fix, it'd be great to add that. If not, is there a button that could be easily added that is "Prefer not to say" or "Dismiss" or something like that? Otherwise, people who do not want to answer but also don't want to see the survey will get pretty frustrated (in the past, I've shared how to update their local browser cache to remove the survey but this is obviously far from ideal): https://meta.wikimedia.org/wiki/Research:Surveys_on_the_gender_of_editors/Report#QuickSurveys
Dec 2 2020, 3:51 PM · SDAW-MediaSearch (MediaSearch-ReleaseCandidate), Product-Analytics, Patch-For-Review, Surveys, Structured-Data-Backlog (Current Work), Structured Data Engineering

Dec 1 2020

Isaac added a comment to T207171: Have a way to show the most popular pages per country.

Once he completes his analysis, there should only be a few more minor design considerations to work out, and then I can start on implementation.

@lexnasser great to hear! and thank you for leading this work and taking all of these points into consideration. I'm excited to see this come to fruition!

Dec 1 2020, 7:10 PM · Patch-For-Review, Analytics-Wikistats, Privacy Engineering, Inuka-Team, Language-strategy, Tool-Pageviews, Analytics

Nov 30 2020

Isaac added a comment to T266180: Request increased quota for wmf-research-tools Cloud VPS project.

Thanks for the update @aborrero

Nov 30 2020, 5:33 PM · cloud-services-team (Kanban), Cloud-VPS (Quota-requests)

Nov 19 2020

Isaac added a comment to T266180: Request increased quota for wmf-research-tools Cloud VPS project.

@Andrew just checking in to see if we have a new expected date for these changes? Thanks!

Nov 19 2020, 3:02 PM · cloud-services-team (Kanban), Cloud-VPS (Quota-requests)

Nov 16 2020

Isaac added a comment to T207171: Have a way to show the most popular pages per country.

For the sake of consistency, I'd rather continue using page_title as identifier.

Thanks @JAllemandou for these additional details. What you say makes sense and for this dataset I'm more open to using page_title because of the dataset's clear intent to help editors and the fact that the ranking is presumed to be more valuable than the underlying pageview counts (so missing a few pageviews that came from a redirect feels like less of a concern). A few additional thoughts:

  • It goes against consistency, but another option is page_title for daily and page_id for monthly. This will handle page moves that happen mid-month, provide higher-quality (in my opinion) data for at least one of the datasets, and be far far easier to actually execute (because you can just join against the page table for the canonical title to associate with that page ID)
  • From a privacy perspective, the one thing I'll note is that preserving redirects can bring with it some implications because there will be page redirects that are only used by e.g., one external site that could have enough interest to make it onto the top articles list while still being so specific as to reveal information about exactly where those pageviews are coming from. Aggregating redirects helps with this because if an article has enough interest that it makes the list, it probably is receiving pageviews from a variety of independent sources.
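
A hedged sketch of the page_id option discussed in the bullets above: views are summed per page ID, so redirects and mid-month moves collapse into one entry, and the result is then labeled with a canonical title. The row shape and the `canonical_titles` mapping stand in for a join against the page table; both are assumptions, as is the premise that views via a redirect are already recorded under the target page's ID.

```python
from collections import defaultdict


def monthly_views_by_page_id(daily_rows, canonical_titles):
    """Aggregate views per page ID and label each total with its canonical title.

    `daily_rows` is an iterable of (page_id, title_as_viewed, views) tuples.
    `canonical_titles` maps page_id -> current title (a stand-in for the page table).
    Page IDs with no canonical title (e.g. deleted pages) are dropped.
    """
    totals = defaultdict(int)
    for page_id, _title, views in daily_rows:
        totals[page_id] += views
    return {
        canonical_titles[page_id]: views
        for page_id, views in totals.items()
        if page_id in canonical_titles
    }
```

Aggregating this way also supports the privacy point: a rarely used redirect title never appears in the output on its own.
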
Nov 16 2020, 2:58 PM · Patch-For-Review, Analytics-Wikistats, Privacy Engineering, Inuka-Team, Language-strategy, Tool-Pageviews, Analytics

Nov 10 2020

Isaac added a comment to T207171: Have a way to show the most popular pages per country.

I'll be looping in a privacy engineer very soon, and I'm interested in hearing their opinion regarding the breadth of these privacy concerns.

Excellent - glad to hear!

Nov 10 2020, 1:40 PM · Patch-For-Review, Analytics-Wikistats, Privacy Engineering, Inuka-Team, Language-strategy, Tool-Pageviews, Analytics

Nov 9 2020

Isaac added a comment to T207171: Have a way to show the most popular pages per country.

And regarding unique pageview threshold, I threw together this table of # of pages (and unique languages/projects) that would be on the list for each country for 100, 500, and 1000 unique pageviews. Example query for k = 1000 and data for all below. My takeaway is that unless we have a strong reason for k=1000, I'd push for k=500 or k=100 given that many countries are included and go from maybe 1-2 pages to 10 or more as you push k lower, which feels like a sizable jump in value for these countries. In general, I'd argue for pushing the k value as low as we feel comfortable so that more countries can be included (and would drop bucketed pageview data if that makes us feel comfortable with pushing the k lower).

Nov 9 2020, 10:12 PM · Patch-For-Review, Analytics-Wikistats, Privacy Engineering, Inuka-Team, Language-strategy, Tool-Pageviews, Analytics
Isaac added a comment to T207171: Have a way to show the most popular pages per country.

Thanks for making the table @lexnasser! A few thoughts below:

Nov 9 2020, 9:00 PM · Patch-For-Review, Analytics-Wikistats, Privacy Engineering, Inuka-Team, Language-strategy, Tool-Pageviews, Analytics
Isaac updated the task description for T219903: Keep research.wikimedia.org landing page updated.
Nov 9 2020, 7:43 PM · Patch-For-Review, Research

Nov 6 2020

Isaac added a comment to T264455: Measure equity impact of current recommender systems.

weekly update: no progress

Nov 6 2020, 9:53 PM · Research (FY2020-21-Research-October-December)
Isaac added a comment to T171635: Prototype new models to facilitate sockpuppet detection.

Weekly update:

  • Still no feedback
  • Waiting on decision around making code public. Will follow up next week.
  • Meeting with NK/EP to discuss productization in the meantime and we seem to have good agreement there.
Nov 6 2020, 9:53 PM · Research (FY2020-21-Research-October-December), Anti-Harassment, artificial-intelligence
Isaac added a comment to T266339: Summarize feedback on first draft of knowledge gaps taxonomy.

Weekly update: moving slowly but been debating with MG about what changes to recommend based on the feedback we've collected. Biggest challenges are around:

  • What is a barrier? What is a gap? Sometimes this is obvious but other times it's not.
    • For example, with internet connectivity, it feels odd to say that we're aiming to have a high diversity of readers / contributors based on internet connectivity (ideally everyone would have good internet speeds / access). Internet connectivity, however, is clearly a barrier to diversity of readers / contributors because populations of people who would provide valuable perspectives to Wikipedia are prevented from doing so due to internet connectivity.
    • But what about disabilities? Do we view physical / mental / etc. disabilities as merely barriers to access, or do we understand that individuals with these disabilities often identify in cultural communities around these disabilities, and thus it's less about it being a barrier and more that we do want a diversity of people based on how able-bodied they are because that will bring new perspectives to Wikipedia?
  • How do we center the concept of power in the taxonomy? What does it mean to e.g., have gender as a gap while at least some of sexual orientation / race / ethnicity / nationality / political orientation / religion might not be specific gaps but are clearly important for many of the same reasons (people with these different identities bring new viewpoints to Wikipedia and have been excluded from history / Wikipedia)? The current motivation is clear but feels lacking: gender is very well-studied, relatively easy to measure, and relatively universal in how it impacts representation. Race, on the other hand, while no less important, is less well-studied with regard to wikis, not well-tracked on the wikis, and highly contextual (i.e. what race means and its relationship with power varies by country and changes greatly over time). While ability to measure at a global scale is relevant to certain use cases for the taxonomy and an arguably objective way to choose which gaps are elevated and which are discussed but not central, it ignores the use-cases for the taxonomy that look to it as defining what is important to understand and work on when it comes to diversity and readers/contributors/content. There's no obvious way to solve this, but the feedback clearly is that we need to continue to think about our inclusion criteria and what it means for aspects of identity / representation that are not elevated as individual gaps.
Nov 6 2020, 9:52 PM · Research (FY2020-21-Research-October-December)
Isaac added a comment to T267283: Evaluate a differentially private solution to release wikipedia's project-title-country data.

Thanks all for this fascinating (and hopefully productive) conversation!

Nov 6 2020, 7:27 PM · Analytics, Research, Privacy Engineering, Privacy, Data-release
Isaac added a parent task for T266180: Request increased quota for wmf-research-tools Cloud VPS project: T266768: Develop 3 approaches for building article lists (ad-hoc topic models) and provide experimental API.
Nov 6 2020, 7:03 PM · cloud-services-team (Kanban), Cloud-VPS (Quota-requests)
Isaac added a subtask for T266768: Develop 3 approaches for building article lists (ad-hoc topic models) and provide experimental API: T266180: Request increased quota for wmf-research-tools Cloud VPS project.
Nov 6 2020, 7:03 PM · Research (FY2020-21-Research-October-December)
Isaac added a comment to T266180: Request increased quota for wmf-research-tools Cloud VPS project.

update: we needed to order some new hardware to get those cloudvirts online so things are delayed a bit. Hopefully not more than another week or two :(

Bummer to hear but thanks for the update and continuing to work on this!

Nov 6 2020, 2:55 PM · cloud-services-team (Kanban), Cloud-VPS (Quota-requests)

Nov 5 2020

Isaac updated the task description for T219903: Keep research.wikimedia.org landing page updated.
Nov 5 2020, 7:30 PM · Patch-For-Review, Research

Oct 30 2020

Isaac added a comment to T266768: Develop 3 approaches for building article lists (ad-hoc topic models) and provide experimental API.

A comment on the Wikidata-based approach. @diego if you weren't aware, Media Search on Commons is experimenting with using subclass-of for displaying what they are calling concept chips, which are essentially search recommendations based on Wikidata. You can see an example here and more details about how it works at: T256431. It might be that you can make some recommendations to them or see what has worked / not worked in their experiments.

Oct 30 2020, 6:46 PM · Research (FY2020-21-Research-October-December)
Isaac added a comment to T266339: Summarize feedback on first draft of knowledge gaps taxonomy.

Weekly update:

Most of the requested changes are more minor (or so big that I don't think they're worth it) but there are a few larger changes that I would argue for:

  • Add Power to taxonomy to bring together race, ethnicity, nationality, religion, politics, etc. (NOTE: this will require some additional thinking because as MG pointed out, what does it mean to have a power gap that is separate from e.g., gender?)
  • Add section on Barriers / Causes and move a number of gaps to this section
  • Make geography / language gaps consistent across all three dimensions (and other standardization where possible)
  • Better clarify upfront the scope (not metrics yet; why just reader/contributor/content, etc.) and terminology (e.g., why gaps vs. diversity)

Comments were broken up into the following categories (more details in doc):

  • Individual Gaps
    • Contributor Contextual Gaps
    • Income
    • Nationality / Race / Ethnicity / Religion Gap
    • Language
    • Geography
    • Readability
    • Multimedia
    • Structured Data
    • Policy Gaps
    • Sexual Orientation
    • Recency bias / Time Gap
    • Politics
    • Mediawiki / Tools / Bot infrastructure
    • Miscellaneous
  • Causes / Barriers
  • Meta
    • Definitions / Terminology
    • Format
    • Standardization
    • What's Missing?
    • Sources
    • Clean-up
    • Wikidata
  • Measurement / Action / Next Steps
    • Metric Definitions
    • Surveys
    • Selection vs. Extent vs. Framing
    • Internal vs. External
    • Action
Oct 30 2020, 6:38 PM · Research (FY2020-21-Research-October-December)
Isaac moved T266339: Summarize feedback on first draft of knowledge gaps taxonomy from Staged to FY2020-21-Research-October-December on the Research board.
Oct 30 2020, 6:29 PM · Research (FY2020-21-Research-October-December)
Isaac added a comment to T171635: Prototype new models to facilitate sockpuppet detection.

Weekly update:

  • Still no feedback
  • Feedback collected from AS about making code public but was requested by PE to give several more days for discussion before making a decision
Oct 30 2020, 6:29 PM · Research (FY2020-21-Research-October-December), Anti-Harassment, artificial-intelligence
Isaac added a comment to T264455: Measure equity impact of current recommender systems.

weekly update: no progress.

Oct 30 2020, 6:29 PM · Research (FY2020-21-Research-October-December)
Isaac added a comment to T207171: Have a way to show the most popular pages per country.

I agree that threshold of 100s pageviews seems small for privacy.

I agree if we're delivering raw data and using pageviews as our sole threshold. I'm more open to e.g., 100 pageviews as a minimum threshold if...

  • We're bucketing -- e.g., buckets in 100s -- i.e. 100-200 pageviews, 200-300 pageviews, ... 1000-1100 pageviews, ... I'd also be open to thousands buckets (100-1000, 1000-2000, ...) if that's deemed safer. 100 pageviews from a given country to a given article in a day is a pretty high level of traffic for the smaller language editions and I'd like to see us try to include it if possible.
  • We're reporting pageviews but using unique # of users as the threshold. This is something that @lexnasser had indicated a willingness to consider. How does this change things? Is this still feasible? I'm much more comfortable with 100 pageviews if I know that's coming from 100 different UA+IPs in a day.
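
To make the two options concrete, here is a rough sketch that combines them: rows are kept only if they meet a unique-actor threshold, and the released value is a bucket label rather than the raw count. The function names, the sample rows, and the use of a UA+IP count as the "unique actors" column are all assumptions for illustration, not the actual pipeline:

```python
def bucket_label(views, width=100):
    """Return the bucket containing views, e.g. 257 -> '200-300'."""
    low = (views // width) * width
    return f"{low}-{low + width}"


def releasable_rows(rows, k=100, width=100):
    """Keep rows with at least k unique actors; report bucketed views only."""
    return [
        (country, title, bucket_label(views, width))
        for country, title, views, unique_actors in rows
        if unique_actors >= k
    ]


# Hypothetical sample: (country, title, pageviews, unique UA+IP count).
sample = [
    ("SM", "Example_A", 257, 180),
    ("SM", "Example_B", 140, 12),
]
released = releasable_rows(sample)
```

In this toy data, Example_B clears 100 pageviews but comes from only 12 distinct actors, so it is dropped; Example_A is released as the 200-300 bucket.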
Oct 30 2020, 5:02 PM · Patch-For-Review, Analytics-Wikistats, Privacy Engineering, Inuka-Team, Language-strategy, Tool-Pageviews, Analytics

Oct 29 2020

Isaac added a comment to T263874: Outreachy Application Task: Tutorial for Wikipedia Page Protection Data.

Hey everyone! A few days left to get in those final contributions on the Outreachy site. Make sure you complete your final application there (you can do this today and still edit it up until the deadline). Diego also posted some good general feedback about notebooks at T263860#6589759 that I wanted everyone to see:

I have a general recommendation to all of you: Keep the notebook easy to read. That means:
Oct 29 2020, 8:30 PM · Outreachy (Round 21)

Oct 28 2020

Isaac closed T266405: Outreachy Proposal 21 : Create Machine Learning datasets to measure content reliability on Wikipedia. as Resolved.

Hey @Thulieblack -- thanks for putting this together. In the past, the guidance had been to create a phabricator task for feedback / application, but we're now asking that you fill out your application via the Outreachy portal (for more details, see step #11 in particular: https://www.mediawiki.org/wiki/Outreachy/Participants#Application_process_steps). As I already provided feedback on your initial notebook, I likely won't be able to give you any further feedback while I prioritize applicants who haven't submitted their notebooks yet for feedback. I'm going to resolve the task, but don't hesitate to let me know if you have any further questions.

Oct 28 2020, 5:56 PM · Outreachy (Round 21)
Isaac added a comment to T266180: Request increased quota for wmf-research-tools Cloud VPS project.

Thanks for the update @Andrew ! 1-2 weeks is fine -- we like stable instances :) I'll check back then if I haven't heard.

Oct 28 2020, 4:00 PM · cloud-services-team (Kanban), Cloud-VPS (Quota-requests)

Oct 27 2020

Isaac added a comment to T263874: Outreachy Application Task: Tutorial for Wikipedia Page Protection Data.

Everyone: I wanted to thank you for making the initial contributions. It gave us a sense of how many applicants we had. We've decided to leave both projects (T263646 and T263860) open until the normal Outreachy deadline as I know a number of you are trying to balance a lot right now.

Oct 27 2020, 3:12 PM · Outreachy (Round 21)
Isaac added a comment to T263874: Outreachy Application Task: Tutorial for Wikipedia Page Protection Data.

The API endpoint that gives the list of most viewed pages doesn't seem to be giving results anymore.

@SafiaKhaleel perhaps a temporary issue. This is working for me though: https://en.wikipedia.org/w/api.php?action=query&list=mostviewed
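
If it helps, a small sketch for building that request in a notebook. The mostviewed_url helper is hypothetical, and I'm assuming pvimlimit is the limit parameter for this module and that format=json is wanted; double-check both against the API help page:

```python
from urllib.parse import urlencode


def mostviewed_url(domain="en.wikipedia.org", limit=10):
    """Build a request URL for the list=mostviewed query module."""
    params = {
        "action": "query",
        "list": "mostviewed",
        "pvimlimit": limit,  # assumed name of the module's limit parameter
        "format": "json",
    }
    return f"https://{domain}/w/api.php?{urlencode(params)}"


url = mostviewed_url()
```

The resulting URL can be passed to whichever HTTP client you already use in the notebook.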

Oct 27 2020, 3:03 PM · Outreachy (Round 21)

Oct 26 2020

Isaac added a comment to T266375: Add timestamps of important revision events to mediawiki_history.

Thanks @nettrom_WMF for creating this ticket. I think I'm going to leave it just as Morten requested (which I agree would be useful) because I was misremembering what fields were in mediawiki_history and my ask is bigger than I had thought.

Oct 26 2020, 1:52 PM · Product-Analytics, Analytics
Isaac added a comment to T263874: Outreachy Application Task: Tutorial for Wikipedia Page Protection Data.

Everyone's discussion comments too here have been very helpful, thank you all :))

@Chiral-carbon thanks and glad to hear!

Oct 26 2020, 12:56 AM · Outreachy (Round 21)

Oct 23 2020

Isaac created T266339: Summarize feedback on first draft of knowledge gaps taxonomy.
Oct 23 2020, 2:10 PM · Research (FY2020-21-Research-October-December)
Isaac added a comment to T264455: Measure equity impact of current recommender systems.

weekly update: no progress

Oct 23 2020, 2:06 PM · Research (FY2020-21-Research-October-December)
Isaac added a comment to T171635: Prototype new models to facilitate sockpuppet detection.

Weekly updates:

  • No feedback so far on tool -- looking into ways to reduce barriers to testing with checkusers
  • Started due diligence on making tool code public -- reached out to NK, PE, AS, LZ
Oct 23 2020, 2:05 PM · Research (FY2020-21-Research-October-December), Anti-Harassment, artificial-intelligence

Oct 22 2020

Isaac added a comment to T263874: Outreachy Application Task: Tutorial for Wikipedia Page Protection Data.

which email address can I send my notebook for feedback. Can I use the one here I see on the notebook by cell 13

@Thulieblack yes: isaac@wikimedia.org. Make sure to also record an initial contribution on Outreachy.

Oct 22 2020, 5:17 PM · Outreachy (Round 21)

Oct 21 2020

Isaac added a comment to T207171: Have a way to show the most popular pages per country.

Keep in mind that per-population data is not necessarily needed (it will be great to have at some point but it feels like scope creep in this task). In the case of the small "Malaysian" bucket of "San Marino", the country population is not helping much, for example. What quantifies that the pool is too small in that case (more or less) is the <# pageviews on Malaysian in San Marino>/<# total pageviews in San Marino>

Nuria makes a very good point, and I would add that tourists would also greatly complicate interpretation of these numbers (see this list of countries where tourists greatly outnumber citizens).

Oct 21 2020, 9:22 PM · Patch-For-Review, Analytics-Wikistats, Privacy Engineering, Inuka-Team, Language-strategy, Tool-Pageviews, Analytics
Isaac added a comment to T263646: Develop an approach to infer which countries are associated with a given Wikipedia article.

How do we submit this task?

Welcome @Chelsi -- when you have completed the task (T263874), you can submit the notebook link as a contribution via the Outreachy site. There are more details in the task description though.

Oct 21 2020, 9:00 PM · Outreachy (Round 21), Outreach-Programs-Projects
Isaac added a comment to T263874: Outreachy Application Task: Tutorial for Wikipedia Page Protection Data.

Welcome to all the new applicants since I last posted a welcome! One request for everyone working on this task:

  • To get a good sense of how many people are intending to apply to each project (T263646 and/or T263860), I'd ask that you make an initial contribution on the Outreachy site with a link to your current progress in the next two days (so by end-of-day October 23rd).
Oct 21 2020, 8:32 PM · Outreachy (Round 21)
Isaac created T266180: Request increased quota for wmf-research-tools Cloud VPS project.
Oct 21 2020, 6:47 PM · cloud-services-team (Kanban), Cloud-VPS (Quota-requests)

Oct 19 2020

Isaac added a comment to T263874: Outreachy Application Task: Tutorial for Wikipedia Page Protection Data.

I think you can use pvipcontinue to extract more data

@Amamgbu @SafiaKhaleel indeed -- depending on your exact query, you can use a continue parameter to get more results or just pass a new set of pageIDs to the API to get more data.
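
A rough sketch of the continue pattern, assuming the API response includes a "continue" object whose values get merged into the next request. The query_all helper and the fake_fetch stand-in are hypothetical; in practice fetch would wrap a real HTTP call:

```python
def query_all(fetch, params):
    """Yield every response batch, following the API's 'continue' values.

    fetch is any callable that takes a params dict and returns the decoded
    JSON response as a dict (e.g. a thin wrapper around an HTTP GET).
    """
    request = dict(params)
    while True:
        response = fetch(request)
        yield response
        if "continue" not in response:
            break
        # Merge the continuation values into the original parameters.
        request = {**params, **response["continue"]}


# Hypothetical stand-in for the real API, returning two batches.
def fake_fetch(request):
    if "pvipcontinue" not in request:
        return {"batch": 1, "continue": {"pvipcontinue": "next", "continue": "||"}}
    return {"batch": 2}


batches = [r["batch"] for r in query_all(fake_fetch, {"action": "query"})]
```

Passing a new set of page IDs instead just means calling the same fetch with a different params dict.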

Oct 19 2020, 9:03 PM · Outreachy (Round 21)

Oct 17 2020

Isaac added a comment to T263874: Outreachy Application Task: Tutorial for Wikipedia Page Protection Data.

Does that mean the title of the Article cannot be edited except by the admin user(sysop) yet anyone can edit the body of the article since no edit protection exists in the “protection” key?

@Amamgbu that is correct and @Vanevela pointed to the appropriate prior discussion about this. More details: the restrictiontypes field is just what restrictions could be applied to the page, not which ones are applied -- a fuller description of what you could find in that field can be found here. For most pages, you'll see edit and move and can verify this by choosing a random page without restrictions and querying the API. I'd suggest ignoring the field as it won't tell you much.
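
As an illustration, here is a sketch that reads only the protection key and ignores restrictiontypes. The page dict is a hypothetical record shaped like what I'd expect from a prop=info&inprop=protection query, and applied_protections is a made-up helper name:

```python
def applied_protections(page):
    """Map each applied restriction type to its level, e.g. {'move': 'sysop'}."""
    return {p["type"]: p["level"] for p in page.get("protection", [])}


# Hypothetical page record; restrictiontypes lists what COULD be applied,
# while protection lists what IS applied.
page = {
    "title": "Example",
    "protection": [{"type": "move", "level": "sysop", "expiry": "infinity"}],
    "restrictiontypes": ["edit", "move"],
}

protections = applied_protections(page)
edit_is_unprotected = "edit" not in protections
```

For this record, only moves are restricted to sysops and no edit protection is applied, which matches the case described in the question.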

Oct 17 2020, 4:47 PM · Outreachy (Round 21)

Oct 15 2020

Isaac moved T171635: Prototype new models to facilitate sockpuppet detection from In Progress to FY2020-21-Research-October-December on the Research board.
Oct 15 2020, 7:55 PM · Research (FY2020-21-Research-October-December), Anti-Harassment, artificial-intelligence
Isaac moved T264455: Measure equity impact of current recommender systems from Staged to FY2020-21-Research-October-December on the Research board.
Oct 15 2020, 7:55 PM · Research (FY2020-21-Research-October-December)
Isaac added a comment to T171635: Prototype new models to facilitate sockpuppet detection.

Weekly updates:

Oct 15 2020, 7:55 PM · Research (FY2020-21-Research-October-December), Anti-Harassment, artificial-intelligence