Fri, Jan 15
In Czech Wikipedia, all templates using data from Wikidata are in a category, so we can use that for the analysis. Note the category doesn't contain _only_ infoboxes, but infoboxes can be easily filtered out thanks to their specific page titles.
Adding to that, there's also a more specific category on certain wikis such as English that's not just Wikidata templates but Wikidata-infobox templates -- e.g., en:Category:Infobox_templates_using_Wikidata. This will give you a conservative estimate of pages that might be adding an image automatically via Wikidata.
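The filtering step described above can be sketched in plain Python. This is illustrative only: it assumes the category member titles have already been fetched (e.g., via the MediaWiki API's `list=categorymembers`), and the exact title prefix that marks an infobox varies by wiki (`Template:Infobox` on English, `Šablona:Infobox` on Czech).

```python
def filter_infobox_templates(titles, prefix="Template:Infobox"):
    """Keep only the templates whose page title marks them as infoboxes.
    `prefix` is wiki-specific and given here as an assumption."""
    return [t for t in titles if t.startswith(prefix)]

# Toy category-member list (not real API output):
members = [
    "Template:Infobox person",
    "Template:Infobox settlement",
    "Template:Wikidata list",   # in the category but not an infobox
]
print(filter_infobox_templates(members))
# -> ['Template:Infobox person', 'Template:Infobox settlement']
```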
Wed, Jan 13
Thanks @Ottomata! The three that I can think of quickly are:
- mwapi for making easy requests to MediaWiki APIs. Obviously less relevant for worker nodes, but I often use this in the same notebook as I'm running PySpark jobs for various data augmentation etc. I think it's only available via pip.
- mwxml for easy access to the dumps. Again, less for worker nodes and more for being able to mix analyses that depend on the cluster with dump processing. I think it's only available via pip.
- shapely for spatial analyses. If this one is a big package, it's okay to leave it off as it's probably more specific to me. Available via pip but also as a conda package.
Tue, Jan 12
Regarding privacy, there are always various options for how to implement this that James will help guide. This dataset is perfect for differential privacy but unfortunately I assume we're not ready yet to apply it. In the future though, I'd love to come back to do that. In the meantime, I'm going to assume we're using our most straightforward approach of defining a threshold of data that a datapoint must exceed to be released -- e.g., at least 1000 pageviews to a given country+lang+browser+OS for that data point to be released. This is a simple thing to enforce in the data pipeline -- e.g., add HAVING COUNT(1) >= 1000 in the final clause of the example pipeline above. This is an update from an earlier analysis above when we were just considering browser family and not OS (so the data was more aggregated).
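The thresholding rule above (only release a datapoint once it exceeds a minimum count, equivalent to `HAVING COUNT(1) >= 1000` in SQL) can be sketched in a few lines of Python. The column names are illustrative, not the real schema:

```python
from collections import Counter

# Only release a (country, lang, browser, os) datapoint if it has >= MIN_VIEWS views.
MIN_VIEWS = 1000

# Toy pageview rows standing in for the real table:
views = Counter()
for row in [("US", "en", "Chrome", "Windows")] * 1200 + [("IS", "is", "Firefox", "Linux")] * 40:
    views[row] += 1

# Equivalent of HAVING COUNT(1) >= MIN_VIEWS -- datapoints under threshold are dropped:
released = {k: v for k, v in views.items() if v >= MIN_VIEWS}
print(released)
# -> {('US', 'en', 'Chrome', 'Windows'): 1200}
```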
I might be slightly ahead but I wanted to give an update on where we are with the queries / privacy analyses. This first post focuses on the dataset generation code. I think it's pretty straightforward (thanks to wmf.pageview_actor) and just requires a few choices to be made around privacy (future posts will address that). Basic code below and some notes:
Mon, Jan 11
Fri, Jan 8
Thanks @Milimetric -- @bmansurov will be leading the technical work on this so we're going to start work on this and greatly appreciate whatever code review / support Analytics is able to provide along the way. My assumption is that this will be a reportupdater query like the existing browser/OS stats but I'm largely ambivalent about where/how the data is generated.
Wed, Jan 6
This looks awesome @Isaac! Can't wait to try it out.
Thanks! If there's anything that doesn't make sense, let me know and I'll try to explain. Or additional features that would be useful and I can see if they'd be easy to incorporate.
Tue, Jan 5
If it's not hard, I'd ask to retain the geocoded data.
It isn't hard -- we can do that!
Mon, Jan 4
Let us know if this schema needs client IP and/or geocoded data. If not, it will be removed as part of this migration.
If it's not hard, I'd ask to retain the geocoded data. Client IP is nice for determining unique number of users (UA+IP) but there's a user token in the data that also works for that purpose. I do use the geocoded data (country specifically) for looking at geographic diversity of users for the system though so would prefer to retain it.
Update: fkaelin helped me identify that the problem was with Null values being passed to the point-in-country lookup (and causing the function to error out). With the null checking in place (WHERE lat IS NOT NULL AND lon IS NOT NULL), the code now runs (and quite fast for all Wikidata items at ~20 minutes). Results are sufficiently close to what I had through the old Python script that I think everything is working correctly (I'm working with slightly newer data and a few other small adjustments so some difference was expected).
Assigning this to myself as it's clear that it's active now. Looks like TREC is interested in a task around building WikiProject worklists while taking into account equity aspects of the articles that show up in the list. I'll continue to meet with the organizers to help shape the task and will provide dataset support over the next few months.
Hey all -- happy new year!! I had differential privacy on the mind and my feeling is that we were stalled on how to choose the parameters that determine how much "privacy" is being assured. So I made an attempt to set up a simple interface to help us see the impact of different choices of parameters for differential privacy on potential result lists. I know that we should make our decisions about privacy separate from how it affects the perceived utility of the results but I found this really useful for thinking about the different approaches / parameters and what they mean. I mostly based the tool and my corresponding thoughts on the links that @TedTed shared to differential privacy blogposts, Facebook's mobility data, and Google's search trends data (thank you!).
Wed, Dec 23
- Equity impact complete for Suggested Edits. Summary:
The analyses demonstrated that depending largely on a random selection of content for recommendation reinforces the status quo around gender and geography -- i.e. heavy imbalance towards men, the United States, and United Kingdom -- and therefore the net effect of the recommender is to improve content about men more than women or other gender identities and content about the US/UK more than other regions. The exact regions improved depends heavily on language -- i.e. US/UK for English Wikipedia but Japan for Japanese Wikipedia or Germany for German Wikipedia -- but the trend remains that editors do not themselves seem to exert additional selection bias over the recommendations. Analogously, the gender associated with the content recommended does not seem to affect whether editors choose to make an edit or not.
- I still would like to repeat this analysis on another recommender system. I'll likely go with Newcomer tasks because there is good data collection, it's been used quite a bit in a number of languages outside of English, and it has the added variable of maintenance templates (which might skew the potential recommendations) and topic preferences (which might skew what recommendations are actually shown). So while Suggested Edits ended up being a pretty straightforward story (biased content -> biased recommendations -> biased edits), Newcomer Tasks might be more complicated.
leaving myself this link too so I don't lose it on optimizing point-in-polygon lookups. I suspect not necessary if we can get the PySpark pipeline working but a reminder that there are other possible optimizations to speed up this pipeline: https://gis.stackexchange.com/questions/120955/understanding-use-of-spatial-indexes-with-rtree/144764#144764
@fkaelin no hurry on this but maybe we can walk through this code at our next meeting. My first attempt at this is not working. I can get the shapely library I'm using for point-in-polygon operations onto the worker nodes and the operation to run for test examples, but it fails when running on even smallish subsets of Wikidata (e.g., just items with sitelinks to Hausa Wikipedia, which only has ~7000 articles (and therefore Wikidata items), of which not all have coordinate data). Notebook is here but can also be found on stat1008 at /home/isaacj/wiki-region-data.ipynb. I _think_ the code is technically correct and the issue is just with how much lifting the point-in-polygon operation requires on the worker nodes, but I can't be certain...
Tue, Dec 22
It's easy enough, and already huge (as you noticed), so no harm in adding more packages.
Sounds good -- in the new year, I'll likely come back to this in a new task and start building a list.
does https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Spark#pyspark_and_external_packages help at all? You could certainly pass those args in a custom pyspark kernel.json, but perhaps you can also do the same via the SparkSession API? https://wikitech.wikimedia.org/wiki/Analytics/Systems/Jupyter#Launching_as_SparkSession_in_a_Python_Notebook
Yep, just gave this a try (zipping up my local conda environment after installing shapely and adding to the PySpark environment; PDF of notebook attached) and all seems to be working. Thanks for the pointer (I actually had no idea there was a Spark page on wikitech outside of the Jupyter page)! A few thoughts:
- The zipped conda environment is 750MB, which seems quite large. I'm not sure if this is something that is problematic for the workers or can be improved, but I figured it was worth mentioning because its size surprised me. I guess it's because it contains a complete Python environment + lots of packages (and hence why it works...)
- I of course would not object to shapely being included by default in the Anaconda environment (easier for me in most cases than configuring my own kernel) but do appreciate that I can have an environment over which I have full control of packages and versions. However, for the sake of simplicity with most of what I do, how much effort is it to add a package and at what point does the Anaconda environment become too bloated? If it's easy, I can create a task for it because I'd request at least mwapi (easy Mediawiki API calls) and mwxml (parsing XML dumps) as well, which aren't for the Spark side but for other common analyses I do in a notebook.
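For reference, the "zip up a conda environment and ship it to the workers" approach described above usually looks something like the following. This is a sketch only -- the environment name, archive path, and exact flags are placeholders, and the right invocation for the cluster is whatever the wikitech Spark page documents:

```shell
# 1. Pack the local conda environment (after pip-installing shapely etc.):
#    conda pack -n my_env -o my_env.tar.gz
# 2. Ship the archive to the YARN workers (unpacked as ./env on each executor)
#    and point PySpark at the interpreter inside it:
PYSPARK_PYTHON=./env/bin/python pyspark \
  --master yarn \
  --archives my_env.tar.gz#env
```

The `archive.tar.gz#alias` syntax is what makes the archive's contents visible to executors under a predictable directory name.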
Sorry for the radio silence! I just finished up final exams, so I'm now freed up to make more progress on this.
No worries - thanks for working on this!
Dec 18 2020
- Ran equity impact for actual edits from Suggested Edits + gender and updated Meta page: https://meta.wikimedia.org/wiki/Research:Prioritization_of_Wikipedia_Articles/Recommendation#Suggested_Edits
- Began process of adding geographic impact to equity analysis:
- Added geographic groundtruth API endpoint -- e.g., https://wiki-region.wmcloud.org/api/v1/region?qid=Q358755 for Mount Takahe or https://wiki-region.wmcloud.org/api/v1/region?qid=Q42070 for Gobi Desert
- Uploaded table with QIDs and regions to Hive
- Reached out to tgr and identified which templates Growth team uses to generate recommendations for newcomer tasks (e.g., for cswiki: https://cs.wikipedia.org/wiki/MediaWiki:NewcomerTasks.json). This means I should have everything I need to expand the analysis to Newcomer Tasks next week
I'm going to close this task out unless there are any objections -- my work on this has largely been complete for a while now and no issues have come up yet in the productization that would require serious rework of the approach (though plenty of improvements have been made to the stability of the prototype). Future tasks that we might open are:
- Making updates based on Checkuser feedback
- Further research into other types of data / modeling that could help inform the ranking.
Dec 16 2020
Received -- thanks @Dzahn !
Thanks for the quick response @Dzahn ! I emailed the list to ask for nominations and @Ladsgroup graciously volunteered so we will be the two new admins. If you could add the following as admins, that'd be much appreciated:
Dec 15 2020
Dec 14 2020
Some data from December 10th to help us think about privacy. Raw data can be found in isaacj.search_engine_data in Hive and data pipeline in stat1004:/home/isaacj/notebooks/Search_Engine_Traffic.ipynb. Specifically looking at how much data we'd have for each country if our daily threshold was at least 500 pageviews. I'll try to provide some additional analyses on how different privacy thresholds and k values (e.g., only reporting top 100) affect how much data is made available.
There is no quick and easy way to do either of these, but I don't think it matters for this particular survey, since it lives on the project page and is not presented to users where they wouldn't expect it.
And even if it were possible, hiding the survey still wouldn't do all that much, because the rest of the survey context (the wikitext surrounding it) would still exist.
Yeah, good point, that makes sense. Best of luck!
I've switched from custom kernels to the generic Python (not pyspark) kernel, and can install packages directly in the notebook's environment
Thanks for chiming in @awight. First off, cool notebook! Installing packages locally and using them for Python functions in the notebook also works for me -- the challenge is when I want Spark workers to also have access to the library so I can parallelize the computation. It didn't seem like you were doing that in your notebook, but let me know if I missed it.
To make this more relevant, I now find myself in a situation where I do want a Python lib on the workers that is not available in the standard conda environment: shapely
It's used for doing spatial analyses and I intend to use it to parallelize this job that gathers all items with lat/lon coordinates in Wikidata and checks which countries the coordinates are in: https://github.com/geohci/wiki-region-groundtruth/blob/main/gather_wikidata_region_groundtruth.py#L169
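With shapely itself, this is roughly `Polygon(country_boundary).contains(Point(lon, lat))`. As a self-contained sketch of what that check does conceptually, here is the even-odd ray-casting test in pure Python (shapely handles holes, edge cases, and spatial indexing far more robustly, so this is for intuition, not a replacement):

```python
def point_in_polygon(lon, lat, polygon):
    """Even-odd ray-casting test; polygon is a list of (lon, lat) vertices.
    Counts how many polygon edges a horizontal ray from the point crosses:
    an odd number means the point is inside."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the point's latitude?
        if (y1 > lat) != (y2 > lat):
            # Longitude where the edge crosses that latitude:
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside

# A toy square "country" spanning lon/lat 0..10:
square = [(0, 0), (10, 0), (10, 10), (0, 10)]
print(point_in_polygon(5, 5, square))   # -> True
print(point_in_polygon(15, 5, square))  # -> False
```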
Dec 11 2020
- wrote up notebook for gathering data on how the Suggested Edits module has actually been used. This will complement the analysis of what types of recommendations are made and indicate whether there is any bias towards skipping recommendations along gender / geography lines. For instance, since May 2020 (v4 of suggested edits), there have been 28,331 edits made via the module to images that have associated Wikipedia articles (and therefore I can directly infer gender / geography associated with those images)
- TODO: read through results from Growth experiments to help guide impact analysis of that module
- Continued support of productization
- Regenerated data through all of November -- whole pipeline was about 20 minutes start to finish from collecting all the relevant edit history from the cluster to outputting the TSV files the tool uses.
Dec 7 2020
There isn't; the regex is applied per-project. So you can invent whatever norms you want!
Oooh fun! In that case, instances whose name ends with -test, -build, or -prototype would be the three regexes I'd feel comfortable putting in place for this project (and recommendation-api project if you'd like too). I can confirm that currently those regexes would capture two projects that don't require backup, and I'll start using them for new instances that won't require backup and try to document this for our team.
We have a few more hypervisors online now so will be granting the quota change soon.
Dec 4 2020
weekly update: no progress though the start of the Outreachy project on a country classifier for articles will help greatly with the geographic equity component of this work (T263646)
- Still no feedback from Checkusers -- at this point, I believe the expectation is that we will productize it so they can access the tool directly, which should make it much easier for them to provide feedback.
- Tool code has been moved to Gerrit: https://gerrit.wikimedia.org/r/admin/repos/mediawiki/services/similar-users
- I'm largely just playing a consultation role right now but really excellent progress on productization of the tool as being tracked here: T265722
- I produced datasets of text diffs and which sections were edited by each user to explore with DD
Update: closing this task as the summarization is complete and we have moved to make changes to the taxonomy. High-level changes to taxonomy documented here: https://docs.google.com/spreadsheets/d/1QPo8_AYHJfVBMBkAhtTi5tXvZX_Z9DrWeq376MmGqS8/edit#gid=0
Changes to the taxonomy will be tracked so that they can be linked back to the original prompts from the feedback that led to them.
Dec 3 2020
Additional data to hopefully help show the impact that different privacy unique-actor thresholds would have on which countries would actually be able to benefit from this data (this is based on the data from T207171#6615009). I look at 1000 vs. 500 unique-actor thresholds and how many countries in each continent show up on the resulting list with at least k articles. In general, if a country only has one or two articles on a list, that means Main Page and Special:Search (so not particularly useful data for that region); it looks to me like the list starts becoming useful at around 5 articles.
Dec 2 2020
Has it ever been used in production or are there plans to use it in production?
@AlexisJazz yes -- see this page for more details: https://meta.wikimedia.org/wiki/Research:Surveys_on_the_gender_of_editors/Report
I assume this survey is still useful for testing and intentionally undismissable?
@Jdlrobson at this point, the survey can be disabled. Sorry, didn't realize it was still around, but as you point out, it's purely for testing purposes and we are not expecting anyone to respond to it. If you are able to do it quickly, I'd much appreciate you disabling it. If not, let me know, but it may take me a while to get to as I rarely do Gerrit patches.
Just two quick notes:
- On the "This is a bunch of text to explain what the survey is about!..." text: editors often (reasonably) assume that the survey follows them around based on their username when in fact whether it shows up or not is browser-based. You'll probably want to clarify that in that text (e.g., "If you use multiple browsers, you may see the survey multiple times and should just dismiss it if you've already taken it") or you'll likely get some questions along the lines of "I thought I already took this" or "Please make this go away".
- I don't see a dismiss button for the survey. Apologies for not bringing this up earlier -- I think this is a basic issue with QuickSurveys but I had forgotten. If it's an easy fix, it'd be great to add that. If not, is there a button that could be easily added that is "Prefer not to say" or "Dismiss" or something like that? Otherwise, people who do not want to answer but also don't want to see the survey will get pretty frustrated (in the past, I've shared how to update their local browser cache to remove the survey but this is obviously far from ideal): https://meta.wikimedia.org/wiki/Research:Surveys_on_the_gender_of_editors/Report#QuickSurveys
Dec 1 2020
Once he completes his analysis, there should only be a few more minor design considerations to work out, and then I can start on implementation.
@lexnasser great to hear! and thank you for leading this work and taking all of these points into consideration. I'm excited to see this come to fruition!
Nov 30 2020
Thanks for the update @aborrero
Nov 19 2020
@Andrew just checking in to see if we have a new expected date for these changes? Thanks!
Nov 16 2020
For the sake of consistency, I'd rather continue using page_title as identifier.
Thanks @JAllemandou for these additional details. What you say makes sense and for this dataset I'm more open to using page_title because of the dataset's clear intent to help editors and the fact that the ranking is presumed to be more valuable than the underlying pageview counts (so missing a few pageviews that came from a redirect feels like less of a concern). A few additional thoughts:
- It goes against consistency, but another option is page_title for daily and page_id for monthly. This will handle page moves that happen mid-month, provide higher-quality (in my opinion) data for at least one of the datasets, and be far far easier to actually execute (because you can just join against the page table for the canonical title to associate with that page ID)
- From a privacy perspective, the one thing I'll note is that preserving redirects can bring with it some implications because there will be page redirects that are only used by e.g., one external site that could have enough interest to make it onto the top articles list while still being so specific as to reveal information about exactly where those pageviews are coming from. Aggregating redirects helps with this because if an article has enough interest that it makes the list, it probably is receiving pageviews from a variety of independent sources.
Nov 10 2020
I'll be looping in a privacy engineer very soon, and I'm interested in hearing their opinion regarding the breadth of these privacy concerns.
Excellent - glad to hear!
Nov 9 2020
And regarding the unique pageview threshold, I threw together this table of # of pages (and unique languages/projects) that would be on the list for each country for 100, 500, and 1000 unique pageviews. Example query for k = 1000 and data for all below. My takeaway is that unless we have a strong reason for k=1000, I'd push for k=500 or k=100, given that more countries are included and many go from maybe 1-2 pages to 10 or more as you push k lower, which feels like a sizable jump in value for these countries. In general, I'd argue for pushing the k value as low as we feel comfortable so that more countries can be included (and would drop bucketed pageview data if that makes us feel comfortable with pushing the k lower).
Thanks for making the table @lexnasser! A few thoughts below:
Nov 6 2020
weekly update: no progress
- Still no feedback
- Waiting on decision around making code public. Will follow up next week.
- Meeting with NK/EP to discuss productization in the meantime and we seem to have good agreement there.
Weekly update: moving slowly but been debating with MG about what changes to recommend based on the feedback we've collected. Biggest challenges are around:
- What is a barrier? What is a gap? Sometimes this is obvious but other times it's not.
- For example, with internet connectivity, it feels odd to say that we're aiming to have a high diversity of readers / contributors based on internet connectivity (ideally everyone would have good internet speeds / access). Internet connectivity, however, is clearly a barrier to diversity of readers / contributors because populations of people who would provide valuable perspectives to Wikipedia are prevented from doing so due to internet connectivity.
- But what about disabilities? Do we view physical / mental / etc. disabilities as merely barriers to access, or do we understand that individuals with these disabilities often identify in cultural communities around these disabilities, and thus it's less about it being a barrier and more that we do want a diversity of people based on how able-bodied they are because that will bring new perspectives to Wikipedia?
- How do we center the concept of power in the taxonomy? What does it mean to e.g., have gender as a gap while at least some of sexual orientation / race / ethnicity / nationality / political orientation / religion might not be specific gaps but are clearly important for many of the same reasons (people with these different identities bring new viewpoints to Wikipedia and have been excluded from history / Wikipedia). The current motivation is clear but feels lacking: gender is very well-studied, relatively easy to measure, and relatively universal in how it impacts representation. Race, on the other hand, while no less important, is less well-studied with regard to wikis, not well-tracked on the wikis, and highly contextual (i.e. what race means and its relationship with power varies by country and changes greatly over time). While ability to measure at a global scale is relevant to certain use cases for the taxonomy and an arguably objective way to choose which gaps are elevated and which are discussed but not central, it ignores the use-cases for the taxonomy that look to it as defining what is important to understand and work on when it comes to diversity and readers/contributors/content. There's no obvious way to solve this, but the feedback clearly is that we need to continue to think about our inclusion criteria and what it means for aspects of identity / representation that are not elevated as individual gaps.
Thanks all for this fascinating (and hopefully productive) conversation!
update: we needed to order some new hardware to get those cloudvirts online so things are delayed a bit. Hopefully not more than another week or two :(
Bummer to hear but thanks for the update and continuing to work on this!
Nov 5 2020
Oct 30 2020
A comment on the Wikidata-based approach. @diego if you weren't aware, Media Search on Commons is experimenting with using subclass-of for displaying what they are calling concept chips, which are essentially search recommendations based on Wikidata. You can see an example here and more details about how it works at T256431. It might be that you can make some recommendations to them or see what has worked / not worked in their experiments.
- Doc where this work is happening: https://docs.google.com/document/d/10ndmVhteCbdNGiQlyqnjmruLEeDy_qmKTOQph3UYyr0/edit?usp=sharing
- All feedback so far broken down by recommended changes (outline below)
- I did an initial pass of reactions and MG has done so as well
- I'm working on summarizing how I would recommend to proceed based on this but my summary is as follows:
Most of the requested changes are minor (or so big that I don't think they're worth it), but there are a few larger changes that I would argue for:
- Add Power to the taxonomy to bring together race, ethnicity, nationality, religion, politics, etc. (NOTE: this will require some additional thinking because, as MG pointed out, what does it mean to have a power gap that is separate from e.g., gender?)
- Add a section on Barriers / Causes and move a number of gaps to this section
- Make geography / language gaps consistent across all three dimensions (and other standardization where possible)
- Better clarify upfront the scope (not metrics yet; why just reader/contributor/content, etc.) and terminology (e.g., why gaps vs. diversity)
Comments were broken up into the following categories (more details in doc):
- Individual Gaps
- Contributor Contextual Gaps
- Nationality / Race / Ethnicity / Religion Gap
- Structured Data
- Policy Gaps
- Sexual Orientation
- Recency bias / Time Gap
- Mediawiki / Tools / Bot infrastructure
- Causes / Barriers
- Definitions / Terminology
- What's Missing?
- Measurement / Action / Next Steps
- Metric Definitions
- Selection vs. Extent vs. Framing
- Internal vs. External
- Still no feedback
- Feedback collected from AS about making code public but was requested by PE to give several more days for discussion before making a decision
weekly update: no progress.
I agree that threshold of 100s pageviews seems small for privacy.
I agree if we're delivering raw data and using pageviews as our sole threshold. I'm more open to, e.g., 100 pageviews as a minimum threshold if...
- We're bucketing -- e.g., buckets in 100s -- i.e. 100-200 pageviews, 200-300 pageviews, ... 1000-1100 pageviews, ... I'd also be open to thousands buckets (100-1000, 1000-2000, ...) if that's deemed safer. 100 pageviews from a given country to a given article in a day is a pretty high level of traffic for the smaller language editions and I'd like to see us try to include it if possible.
- We're reporting pageviews but using unique # of users as the threshold. This is something that @lexnasser had indicated a willingness to consider. How does this change things? Is this still feasible? I'm much more comfortable with 100 pageviews if I know that's coming from 100 different UA+IPs in a day.
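The bucketing option in the first bullet can be sketched as a tiny helper. The function name and bucket semantics are illustrative, not the production pipeline -- the point is just that only the bucket, never the raw count, would be released:

```python
def bucket_pageviews(count, width=100):
    """Report a raw count only as its enclosing bucket, e.g. width=100 maps
    147 -> the 100-200 bucket. Wider buckets (width=1000) are coarser/safer."""
    lo = (count // width) * width
    return (lo, lo + width)

print(bucket_pageviews(147))              # -> (100, 200)
print(bucket_pageviews(1050))             # -> (1000, 1100)
print(bucket_pageviews(147, width=1000))  # -> (0, 1000)
```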
Oct 29 2020
Hey everyone! A few days left to get in those final contributions on the Outreachy site. Make sure you complete your final application there (you can do this today and still edit it up until the deadline). Diego also posted some good general feedback about notebooks at T263860#6589759 that I wanted everyone to see:
I have a general recommendation to all of you: Keep the notebook easy to read. That means:
Oct 28 2020
Hey @Thulieblack -- thanks for putting this together. In the past, the guidance had been to create a phabricator task for feedback / application, but we're now asking that you fill out your application via the Outreachy portal (see https://www.mediawiki.org/wiki/Outreachy/Participants#Application_process_steps for more details, specifically step #11). As I already provided feedback on your initial notebook, I likely won't be able to give you any further feedback while I prioritize applicants who haven't submitted their notebooks yet for feedback. I'm going to resolve the task, but don't hesitate to let me know if you have any further questions.
Thanks for the update @Andrew ! 1-2 weeks is fine -- we like stable instances :) I'll check back then if I haven't heard.
Oct 27 2020
Everyone: I wanted to thank you for making the initial contributions. It gave us a sense of how many applicants we had. We've decided to leave both projects (T263646 and T263860) open until the normal Outreachy deadline as I know a number of you are trying to balance a lot right now.
The API endpoint that gives the list of most viewed pages doesn't seem to be giving results anymore.
@SafiaKhaleel perhaps a temporary issue. This is working for me though: https://en.wikipedia.org/w/api.php?action=query&list=mostviewed
Oct 26 2020
Thanks @nettrom_WMF for creating this ticket. I think I'm going to leave it just as Morten requested (which I agree would be useful) because I was misremembering what fields were in mediawiki_history and my ask is bigger than I had thought.
Everyone's discussion comments here have been very helpful too, thank you all :))
@Chiral-carbon thanks and glad to hear!
Oct 23 2020
weekly update: no progress
- No feedback so far on tool -- looking into ways to reduce barriers to testing with checkusers
- Started due diligence on making tool code public -- reached out to NK, PE, AS, LZ
Oct 22 2020
Which email address can I send my notebook to for feedback? Can I use the one I see in the notebook by cell 13?
@Thulieblack yes: firstname.lastname@example.org. Make sure to also record an initial contribution on Outreachy.
Oct 21 2020
Keep in mind that per-population data is not necessarily needed (it would be great to have at some point, but it feels like scope creep in this task). In the case of the small "malasyan" bucket of "san marino", the country population is not helping much, for example. What quantifies that the pool is too small in that case (more or less) is the ratio <# pageviews on malasyan in san marino>/<# total pageviews in san marino>
Nuria makes a very good point, and I would add that tourists would also greatly complicate interpretation of these numbers (see this list of countries where tourists greatly outnumber citizens).
How do we submit this task?
Welcome to all the new applicants since I last posted a welcome! One request for everyone working on this task:
Oct 19 2020
I think you can use pvipcontinue to extract more data
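The general continuation pattern for MediaWiki API queries is to merge the returned `continue` parameters into the next request until no `continue` block comes back. A minimal sketch, with the HTTP call injected as a function so it runs without a network (the fake responses and the `pvimcontinue` key in them are illustrative, not real API output):

```python
def iter_api_results(fetch, params):
    """Follow MediaWiki API continuation: re-issue the query with the returned
    'continue' parameters merged in, until no 'continue' block is returned.
    `fetch` stands in for an HTTP GET against /w/api.php."""
    params = dict(params)
    while True:
        response = fetch(params)
        yield response
        if "continue" not in response:
            break
        params.update(response["continue"])

# Fake two-page response sequence for demonstration:
pages = [
    {"query": {"mostviewed": ["A", "B"]}, "continue": {"pvimcontinue": "2"}},
    {"query": {"mostviewed": ["C"]}},
]
fake_fetch = lambda params, it=iter(pages): next(it)

titles = [t for r in iter_api_results(fake_fetch, {"list": "mostviewed"})
          for t in r["query"]["mostviewed"]]
print(titles)  # -> ['A', 'B', 'C']
```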
Oct 17 2020
Does that mean the title of the article cannot be edited except by the admin user (sysop), yet anyone can edit the body of the article since no edit protection exists in the "protection" key?
@Amamgbu that is correct and @Vanevela pointed to the appropriate prior discussion about this. More details: the restrictiontypes field is just what restrictions could be applied to the page, not which ones are applied -- a fuller description of what you could find in that field can be found here. For most pages, you'll see edit and move and can verify this by choosing a random page without restrictions and querying the API. I'd suggest ignoring the field as it won't tell you much.
Oct 15 2020
- Prepared model and internal API based on co-edit history to test out with checkusers
- Email was sent to checkusers at the start of the week notifying them that they could test out the model but so far no requests
- Meta page updated: https://meta.wikimedia.org/wiki/Research:Sockpuppet_detection_in_Wikimedia_projects
- Some descriptive statistics generated on what ties together sockpuppet accounts: https://meta.wikimedia.org/wiki/Research:Sockpuppet_detection_in_Wikimedia_projects#Descriptive_Analyses
- Working on reworking text diff pipeline in PySpark and to also indicate which sections were edited by which users