
Isaac (Isaac Johnson)
Research Scientist


User Details

User Since
Oct 1 2018, 2:19 PM (140 w, 4 d)
Availability
Available
IRC Nick
isaacj
LDAP User
Isaac Johnson
MediaWiki User
Isaac (WMF) [ Global Accounts ]

Recent Activity

Yesterday

Isaac added a comment to T238437: Co-organize Fair Ranking Track at TREC.

Weekly updates:

  • Discussed fairness "goals" for a given result. This is necessary for computing the fairness aspect of the performance. For example, if 90% of the biographies in a WikiProject are about men, what is the expectation for the percentage of biographies in the results list that are about men? Is it 81% (the current distribution on enwiki)? ~50% (an ideal world where all is equal w/r/t gender)? Something in between 90% and 50%? We're leaning towards the latter (e.g., halfway between 90% and 50%), not because it's actually where we think the goal is, but because it provides a mixture of feasibility (the model can only work with the existing articles) and a strong push towards equity. There will be some exceptions of course -- e.g., WikiProject Women Scientists wouldn't be expected to be "fair" w/r/t gender.
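The "halfway between current and ideal" compromise is simple enough to sketch. A minimal illustration (the function name and the parity default are mine, not anything from the Track's actual code):

```python
def fairness_target(current_share, ideal_share=0.5):
    """Expected share of a group in a 'fair' results list: halfway
    between the current distribution and the ideal (parity) share."""
    return (current_share + ideal_share) / 2

# 90% men in the WikiProject currently, 50% ideal -> 70% target share
target = fairness_target(0.9)
```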
Fri, Jun 11, 5:28 PM · Research (FY2020-21-Research-April-June)
Isaac added a comment to T281912: Prototype misalignment API.

Weekly updates:

  • Slow week as catching up from week off but continue to gather informal feedback on model. In particular thinking and playing with a few ideas:
    • Should average misalignment across an entire wiki be 0, with the goal being to understand which topics are under/overproduced rather than to label whole wikis as over/underproduced? There is no clear definition of over/underproduced, so applying a single number to a wiki (beyond being quite reductive) is very liable to misinterpretation.
    • How to include missing content? Past research has tackled this question from a more focused recommender-system angle. Not every wiki should obviously have full coverage of all articles, but perhaps we could take the cultural-content approach from the Wikipedia Diversity Observatory (content relevant to a wiki's language/geography should be covered by that wiki). There are also questions of scalability in applying this to all wikis and the ~20 million potential Wikipedia articles from across all the languages.
    • I'll continue thinking on these questions, but they are beyond the scope of this initial prototype.
  • Remaining TODO to close this task out: build a simple API for getting individual quality, demand, and misalignment scores for any given article to complement the aggregate scores. This will probably be computed on demand (i.e., gathering data from MediaWiki APIs for scoring as opposed to extracting from a database of pre-computed scores) to simplify the space requirements.
Fri, Jun 11, 5:13 PM · Research (FY2020-21-Research-April-June)

Thu, Jun 10

Isaac committed rRLP689d38ac62b0: Add Emily to team and update IRC contact. (authored by Isaac).
Add Emily to team and update IRC contact.
Thu, Jun 10, 10:41 PM
Isaac updated the task description for T219903: Keep research.wikimedia.org landing page updated.
Thu, Jun 10, 9:14 PM · Patch-For-Review, periodic-update, Research

Mon, Jun 7

Isaac added a comment to T283285: Story idea for Blog: New Search Referral Dataset.

@srodlund ahh that's awesome -- did a pass and all looks good. Thanks for getting this out!

Mon, Jun 7, 6:00 PM · Technical-blog-posts
Isaac added a comment to T283285: Story idea for Blog: New Search Referral Dataset.

It would be great if you added the screenshot to commons! Thanks! Just share the link here when you have it up.

@srodlund Done! https://commons.wikimedia.org/wiki/File:Wikipedia_search_referrals_dashboard.png

Mon, Jun 7, 5:19 PM · Technical-blog-posts
Isaac added a comment to T283285: Story idea for Blog: New Search Referral Dataset.

This is almost ready to go -- except for the featured image. For the blog, we typically use a photo for this, rather than a graphic. Would either of the following images work?

Mon, Jun 7, 4:39 PM · Technical-blog-posts
Isaac added a comment to T283285: Story idea for Blog: New Search Referral Dataset.

For some reason, the notification for this slipped past me!

@srodlund no worries but thanks! Documentation on the tech blog was clear it could be a two-week process at least so I wasn't in a hurry.

Mon, Jun 7, 12:05 PM · Technical-blog-posts

Thu, May 27

Isaac added a comment to T238437: Co-organize Fair Ranking Track at TREC.

Weekly updates:

  • Nothing from me this week -- team's focus is on preparing the evaluation metrics and a baseline system which aren't pieces I'm responsible for
Thu, May 27, 3:00 PM · Research (FY2020-21-Research-April-June)
Isaac added a comment to T281912: Prototype misalignment API.

Weekly updates:

Thu, May 27, 2:56 PM · Research (FY2020-21-Research-April-June)

Wed, May 26

Isaac added a comment to T280385: Apache Beam go prototype code for DP evaluation.

Thanks @Nuria for weighing in!

Wed, May 26, 11:58 PM · Analytics, Research, Privacy Engineering, Privacy, Data-release
Isaac added a comment to T281317: Create a tutorial for deploying a model on toolforge.

FYI -- Research has two templates for this that might be of use:

  • API on Cloud VPS: https://github.com/wikimedia/research-api-endpoint-template
    • The Cloud VPS component may be less useful to community members (as onboarding/permissions take more time) but the underlying Flask app code was initially built with Toolforge in mind. I glanced at Chris' docker example and that looks great too. My only suggestions would be to allow CORS, as that's a common headache w/ "why is my API not working?", and perhaps to add some basic examples of validating/normalizing values passed via URL parameters (where common parameters would be pageid, pagetitle, qid, language, username, and I'm sure a few others)
    • I've had a really good experience with Flask and found it quite good for building simple apps. On Cloud VPS, we've also gotten a Node backend and a Go server backend working too.
  • API Interface on Toolforge (this has a bunch of Research branding in it but that could always be removed): https://github.com/wikimedia/research-api-interface-template
    • The interface obviously isn't required but I think having a good template promotes transparency (source code, documentation, who to contact, what logging is happening if any, etc.) and makes it far easier for users to test out the API.
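To make the URL-parameter suggestion concrete, here is a hedged sketch of validating/normalizing a few of the common parameters named above. The function names and regexes are illustrative, not code from either template repo; CORS itself is typically a one-liner with the flask_cors extension or a manual Access-Control-Allow-Origin response header.

```python
import re

# Illustrative patterns -- real wikis have a known language-code list,
# so a lookup table would be stricter than a regex.
LANG_RE = re.compile(r"^[a-z]{2,12}(-[a-z0-9]{2,12})?$")  # e.g. "en", "zh-yue"
QID_RE = re.compile(r"^Q\d+$")                            # e.g. "Q42"

def normalize_title(title):
    """MediaWiki canonical form: underscores, first letter uppercased."""
    if not title:
        return None
    title = title.strip().replace(" ", "_")
    return title[:1].upper() + title[1:]

def validate_params(lang=None, title=None, qid=None):
    """Return (normalized params, None) or (None, user-facing error)."""
    if lang is None or not LANG_RE.match(lang):
        return None, "invalid or missing 'lang' parameter"
    if qid is not None:
        qid = qid.upper()
        if not QID_RE.match(qid):
            return None, "invalid 'qid' parameter"
    return {"lang": lang, "title": normalize_title(title), "qid": qid}, None
```

Returning a clear error string (rather than a bare 500) is most of the battle for the "why is my API not working?" class of questions.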
Wed, May 26, 6:42 PM · artificial-intelligence, Lift-Wing, Machine-Learning-Team (Active Tasks)

Fri, May 21

Isaac added a comment to T281912: Prototype misalignment API.

Weekly updates:

  • Calculated demand scores based on 99th percentile pageviews + log transform + floor of 100. So e.g., if the top 1% of articles in a wiki got 10000 pageviews per month, then log10(10000) = 4 would be the max and an article with 1000 pageviews would be scored as log10(1000) / 4 = 0.75. If the top 1% of articles in a wiki only got 50 pageviews, then log10(100) = 2 would be the normalizing factor instead of log10(50).
  • For each wiki, I calculate the misalignment score for each article: quality score - demand score. Values close to 1 indicate overproduced content -- i.e. top quality but low reader interest -- and values close to -1 indicate underproduced content -- i.e. low quality but high reader interest. Then to summarize misalignment in a wiki, I calculate average misalignment as well as how many articles had extreme misalignment -- i.e. |misalignment| > 0.5. I'm still exploring the data to find the best way to summarize it, but this gives a start. Results below and a few things stand out:
    • Wikis with many bot-generated articles like cebwiki and svwiki have very high misalignment scores (content quality is way higher than demand) and many articles in extreme misalignment
    • Scores are further from zero for the very small wikis, presumably because the quality/pageview scores are based on smaller numbers and are therefore more variable.
    • English Wikipedia has one of the lowest misalignment scores of the large wikis reflecting the heavy reader demand for content (lots of articles with lots of pageviews makes it hard to keep up quality-wise). Japanese Wikipedia also reflects this, which makes sense based upon what I know. Will be curious to dig into the other larger wikis with particularly low misalignment scores.
    • French and Polish Wikipedia stand out among larger wikis as having very low misalignment scores (yay them!)
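The demand and misalignment arithmetic described above can be sketched in a few lines. This is a minimal reconstruction from the description, not the pipeline code; the function names are mine:

```python
import math

def demand_score(monthly_pageviews, p99_pageviews, floor=100):
    """Log-transformed pageviews normalized by the wiki's 99th-percentile
    pageviews, with a floor of 100 on the normalizer (so a wiki whose top
    1% only get 50 views is normalized by log10(100), not log10(50))."""
    norm = math.log10(max(p99_pageviews, floor))
    return min(math.log10(max(monthly_pageviews, 1)) / norm, 1.0)

def misalignment(quality, demand):
    """~+1: overproduced (high quality, low demand); ~-1: underproduced."""
    return quality - demand
```

So, per the example: with a 99th percentile of 10000 pageviews, an article with 1000 pageviews scores log10(1000) / log10(10000) = 0.75.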
Fri, May 21, 8:30 PM · Research (FY2020-21-Research-April-June)
Isaac updated the task description for T280369: Isaac Academic Service 2021.
Fri, May 21, 6:18 PM · Research
Isaac renamed T238437: Co-organize Fair Ranking Track at TREC from Identify and prepare a data-set for Fair Ranking Track at TREC to Co-organize Fair Ranking Track at TREC.
Fri, May 21, 2:15 PM · Research (FY2020-21-Research-April-June)
Isaac added a comment to T238437: Co-organize Fair Ranking Track at TREC.

Weekly updates:

  • No team meeting this week but I generated evaluation metadata to be used in task
  • Remaining support from me should be mostly peripheral for at least the next few months
Fri, May 21, 2:13 PM · Research (FY2020-21-Research-April-June)

Thu, May 20

Isaac created T283285: Story idea for Blog: New Search Referral Dataset.
Thu, May 20, 8:20 PM · Technical-blog-posts

Fri, May 14

Isaac added a comment to T219903: Keep research.wikimedia.org landing page updated.

Turning in for the week but weirdly only one of the four changes in this patch seems to have actually gone live. Notably, after clearing caches, the publications page is updated but Knowledge Gaps / Integrity / Foundational are not. I'll send another patch next week to hopefully push through the changes if it doesn't fix itself.

Fri, May 14, 9:49 PM · Patch-For-Review, periodic-update, Research
Isaac updated the task description for T219903: Keep research.wikimedia.org landing page updated.
Fri, May 14, 7:54 PM · Patch-For-Review, periodic-update, Research
Isaac committed rRLP1eee059a3b7d: Paper / blurb updates on TREC, COVID-19 dataset, list-building, wiki… (authored by Isaac).
Paper / blurb updates on TREC, COVID-19 dataset, list-building, wiki…
Fri, May 14, 7:52 PM
Isaac added a comment to T219903: Keep research.wikimedia.org landing page updated.

@Reedy thanks! I'll try later today then (or next week)

Fri, May 14, 4:53 PM · Patch-For-Review, periodic-update, Research
Isaac updated subscribers of T219903: Keep research.wikimedia.org landing page updated.

@Reedy I'm getting build failures on a recent set of updates to the landing page but can't figure out the source. It looks like some changes were made to the test pipeline that may be causing this new behavior, but I don't have great insight into whether that's catching an issue in the HTML that was previously missed or is just buggy in some way (https://integration.wikimedia.org/ci/job/research-landing-page-pipeline-test/28/changes). You helped us initially get the automatic testing set up, so I'm wondering if you have any quick insights into what is going on or who to ask? See: https://integration.wikimedia.org/ci/job/trigger-research-landing-page-pipeline-test/28/console

Fri, May 14, 4:42 PM · Patch-For-Review, periodic-update, Research
Isaac added a comment to T281912: Prototype misalignment API.

Weekly updates:

Fri, May 14, 4:18 PM · Research (FY2020-21-Research-April-June)
Isaac added a comment to T238437: Co-organize Fair Ranking Track at TREC.

Weekly updates:

  • Generated dataset of links to/from articles to assist in any graph-based approaches to list-building.
Fri, May 14, 4:16 PM · Research (FY2020-21-Research-April-June)

May 11 2021

Isaac added a comment to T276862: Load outlinks topic model in to KFServing.

Are the only two required parameters lang and page_title?

There should be three (also threshold -- see below). Regarding lang and page_title though: as it is currently set up, the model provides a prediction based on the current revision of a Wikipedia article. This is because it uses the pagelinks table (which only ever reflects the current state). If needed, we could extend it to extract links from old revisions of a page using the wikitext, though it'd likely be imperfect as it wouldn't gather links inserted via templates (which are a pretty large proportion of links for many stub articles) and would just be more computationally/API-intensive. With that in mind, I can imagine four options:

  • Current version of article only:
    • lang + page_title: current behavior as you point out. The reason I went with it is that it's easiest for people to play with the API. But for production, page_title is not great because the API follows redirects and therefore sometimes the results are not for the given page_title but for the page it points to (which can be confusing).
    • lang + page_id: ideal behavior from research perspective because page_id is nice and stable. Easy to update the API to support this and I'm happy to provide that code.
    • lang + QID: probably not necessary but technically an option. The Wikidata ID would just be mapped to a page ID or title then before gathering links etc.
  • Any version of the article:
    • lang + revid: how most of ORES works and allows for processing historical revisions, which is nice. It would be a larger lift though because it would require fetching wikitext, processing the wikitext, and then hitting the APIs again for Wikidata IDs associated with the links. So this approach would almost certainly have higher latency because it would process more data and make more API calls than just gathering the pagelinks directly AND it would be incomplete because it would only parse links that are found in the wikitext. But it obviously greatly expands the capabilities of the API.
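The four options above amount to a dispatch on which identifier the request supplies. A hypothetical routing sketch (mode names and the resolver strategy are mine, purely to illustrate the trade-offs; the real service would then call the MediaWiki/Wikidata APIs):

```python
def choose_lookup(params):
    """Return (lookup mode, None) for a request, or (None, error)."""
    if "lang" not in params:
        return None, "'lang' is always required"
    if "revid" in params:
        # Historical revision: must fetch and parse wikitext (slower,
        # misses template-inserted links).
        return "wikitext-parse", None
    if "page_id" in params:
        # Stable identifier; pagelinks lookup, no redirect surprises.
        return "pagelinks-by-id", None
    if "qid" in params:
        # Map the Wikidata item to a page id/title first, then as above.
        return "qid-to-page", None
    if "page_title" in params:
        # Current behavior; may silently follow redirects.
        return "pagelinks-by-title", None
    return None, "need one of revid, page_id, qid, page_title"
```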
May 11 2021, 4:57 PM · Patch-For-Review, Machine-Learning-Team (Active Tasks), Lift-Wing

May 7 2021

Isaac added a comment to T281912: Prototype misalignment API.

Weekly updates:

  • wrote up quality model description: https://meta.wikimedia.org/wiki/Research:Prioritization_of_Wikipedia_Articles/Language-Agnostic_Quality
  • tested end-to-end pipeline for generating misalignment scores for articles but am going to split it into its individual components so it's easier to extract the data from intermediate steps for other purposes
  • so far, the entire approach to quality and demand has been wiki-specific; as a result, e.g., a top-quality article in Simple Wikipedia is far smaller than a top-quality article in English Wikipedia, and a top-demand article in Simple Wikipedia gets far fewer pageviews than a top-demand article in English Wikipedia. This makes plenty of sense but reduces the comparability of the resulting quality/demand/misalignment scores across languages. However, I'm considering whether there should be some language-agnostic thresholds -- e.g., a top-quality article will have at least 10 sections regardless of language, and maybe this number can go higher in certain languages. Same for images and references. It's harder for page length because whether you need e.g., 100 bytes or 1000 bytes depends heavily on the language, but perhaps some basic minimum threshold could still be set.
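The language-agnostic-threshold idea could look roughly like the sketch below: each feature is capped at a fixed threshold regardless of wiki, then averaged. All thresholds here are made-up placeholders (only the 10-section figure comes from the comment above), and the real model's feature set and weighting may differ:

```python
def quality_score(sections, images, refs, length_bytes,
                  max_sections=10, max_images=5, max_refs=20,
                  min_length=1000):
    """Illustrative language-agnostic quality score in [0, 1]: each
    feature saturates at a global threshold, independent of wiki."""
    features = [
        min(sections / max_sections, 1.0),
        min(images / max_images, 1.0),
        min(refs / max_refs, 1.0),
        min(length_bytes / min_length, 1.0),
    ]
    return sum(features) / len(features)
```

Under this scheme a Simple Wikipedia article and an English Wikipedia article with the same raw features would get the same score, which is exactly the cross-language comparability being discussed.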
May 7 2021, 8:13 PM · Research (FY2020-21-Research-April-June)

May 6 2021

Isaac added a comment to T238437: Co-organize Fair Ranking Track at TREC.

Weekly updates:

May 6 2021, 8:08 PM · Research (FY2020-21-Research-April-June)
Isaac added a comment to T280385: Apache Beam go prototype code for DP evaluation.

I'm making this high so we can try and pick it up, but it's still behind lots of other work. The proof of concept is very useful, thanks for the good work

Thanks @Milimetric!

May 6 2021, 3:26 PM · Analytics, Research, Privacy Engineering, Privacy, Data-release

May 5 2021

Isaac added a comment to T266375: Add timestamps of important revision events to mediawiki_history.

Oh interesting. Perhaps we should capture the expiry in that stream too!

Yeah, if it's straightforward, that'd be appreciated! I actually had a use-case for this yesterday where an affiliate was interested in how to calculate how many users on their wiki had been blocked for at least one week in a given year. I suggested they change their criteria to # of blocks but duration of block would have been better as a metric (Cell 76 (three down from this header)).

May 5 2021, 8:38 PM · Product-Analytics, Analytics
Isaac added a comment to T266375: Add timestamps of important revision events to mediawiki_history.

@Ottomata thanks for the ping. Yeah, I'm aware of the table, but the challenge has always been whether you can reconstruct the page restrictions on a page at any given moment in the past, and that table unfortunately doesn't give any information about the expiration of the blocks (as can be seen e.g., in the Special/Log pages). I honestly haven't looked too deeply into it, so maybe there's another table that maintains that information or an event that triggers when the restrictions expire, but the few times I've looked into it briefly, it wasn't clear to me.

May 5 2021, 8:26 PM · Product-Analytics, Analytics

May 4 2021

Isaac added a comment to T280385: Apache Beam go prototype code for DP evaluation.

We probably do not want to install Beam on the cluster just for this experiment, so can we use Jupyter instead and run Beam on Python? https://beam.apache.org/get-started/quickstart-py/

@Nuria this is where I get out of my expertise, but my understanding is that the Beam differential privacy library is only available in Go (and thus needs the Go SDK to run). So I'm not sure that Python is an option unless the differential privacy library gets ported to Python. Hal's tool uses the Beam Go SDK so we have confirmation that it works with a local runner (SQLite backend) and now the main questions are:

  • Policy: what configuration of parameters / privacy do we use?
  • Engineering: does the Apache Beam differential privacy library work on our cluster? Hal has shown that it works using a local SQLite database, so there really aren't any more intermediate steps to test and next would be trying it with the Spark backend runner. This starts to go over my head, though I assume there are two parts to this:
    • Make sure Go Beam SDK can be installed on the cluster (or wherever this job would be run from)
    • Swap out the SQLite backend for the Spark runners and test this out on the data in HDFS (a simple choice is using wmf.pageview_actor because that will support both pageview-level privacy and user-level privacy depending on our choice)
May 4 2021, 10:27 PM · Analytics, Research, Privacy Engineering, Privacy, Data-release
Isaac moved T281912: Prototype misalignment API from Staged to FY2020-21-Research-April-June on the Research board.
May 4 2021, 6:19 PM · Research (FY2020-21-Research-April-June)
Isaac created T281912: Prototype misalignment API.
May 4 2021, 6:18 PM · Research (FY2020-21-Research-April-June)

May 3 2021

Isaac added a comment to T280385: Apache Beam go prototype code for DP evaluation.

When using the Laplace distribution, the noise doesn't consume the δ, however some of the δ is consumed by something we call partition selection.

@TedTed thanks for this update, the delta component now makes much more sense. I see that I was doing this already in the initial Python prototype -- i.e. I also was trying to determine the appropriate threshold for data release such that we didn't have to also add noise to the millions of 0-pageview pages / need a 100% accurate accounting of what pages existed on any given day. I just hadn't connected this parameter to the role of delta. Even though it's technically possible to provide Beam w/ a list of all possible pages, I think I'd advocate for the library's delta/threshold approach for its simplicity. We could always revisit that if we moved to releasing fuller differentially-private datasets for researchers looking to do large-scale quantitative analyses that required the ability to e.g., accurately compute averages over all articles. Right now I think a top-k list for editor review is still the main motivation for this work though and that aligns nicely with throwing out the long-tail of low pageview data.
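The noise-then-threshold idea being described -- release only counts whose noised value clears a threshold, so the millions of ~0-pageview pages never need to be enumerated or noised -- can be sketched in a few lines. This is a toy illustration only: real partition selection ties the threshold to delta and the library handles that accounting, which this sketch deliberately does not.

```python
import random

def dp_top_counts(counts, epsilon=1.0, threshold=20, sensitivity=1):
    """Toy sketch: add Laplace(sensitivity/epsilon) noise to each count,
    then drop anything whose noisy count is below the release threshold.
    Laplace noise is drawn as the difference of two exponentials."""
    scale = sensitivity / epsilon
    released = {}
    for page, count in counts.items():
        noise = random.expovariate(1 / scale) - random.expovariate(1 / scale)
        noisy = count + noise
        if noisy >= threshold:
            released[page] = noisy
    return released
```

Popular pages comfortably survive the threshold while long-tail pages almost never do, which matches the top-k-for-editor-review motivation.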

May 3 2021, 8:02 PM · Analytics, Research, Privacy Engineering, Privacy, Data-release

Apr 30 2021

Isaac updated the task description for T280369: Isaac Academic Service 2021.
Apr 30 2021, 5:36 PM · Research
Isaac closed T277894: Support BKC Research Sprint on Digital Self-Determination as Resolved.

Successful session! Doing some debriefing with speakers and will continue to support sprint around some of the final products it's producing, but bulk of work is complete!

Apr 30 2021, 5:36 PM · Research
Isaac updated the task description for T277894: Support BKC Research Sprint on Digital Self-Determination.
Apr 30 2021, 5:35 PM · Research
Isaac added a comment to T238437: Co-organize Fair Ranking Track at TREC.

Weekly update:

  • Checked with Legal; based on their advice, no filtering of articles, but we'll include URLs to articles as a better form of attribution and make sure to have a discussion of both the limitations of structured data and the completeness of the data (specific to Wikidata as a source of some of our fairness constructs).
  • Preparing training data set -- list of articles associated w/ each of our chosen WikiProjects, predicted quality, associated continents
  • Official description / dataset should go out shortly
Apr 30 2021, 5:35 PM · Research (FY2020-21-Research-April-June)
Isaac added a comment to T280385: Apache Beam go prototype code for DP evaluation.

Just wanted to update with some of the work that @Htriedman has done and discussions we've had off-ticket (feel free to jump in Hal to correct / add / etc.):

Apr 30 2021, 2:53 PM · Analytics, Research, Privacy Engineering, Privacy, Data-release

Apr 29 2021

Isaac added a comment to T276270: Outreachy Round 22: Use PAWS to create a series of notebook based tutorials that help users access and work with data on Wikimedia projects .

Just a heads up in case you're unaware:

Apr 29 2021, 5:29 PM · Outreachy (Round 22), Outreach-Programs-Projects
Isaac updated the task description for T270140: Release dataset on top search engine referrers by country, device, and language.
Apr 29 2021, 3:46 PM · Privacy Engineering, Research, Analytics

Apr 28 2021

Isaac added a comment to T276270: Outreachy Round 22: Use PAWS to create a series of notebook based tutorials that help users access and work with data on Wikimedia projects .

Please tell me how to format them as hyperlinks?

@Palak199 Not sure but plaintext is completely fine. Thanks

Apr 28 2021, 4:43 PM · Outreachy (Round 22), Outreach-Programs-Projects

Apr 27 2021

Isaac added a comment to T276315: Outreachy Application Task: Tutorial for Wikipedia Clickstream data.

FYI PAWS should be back, but if you continue to have intermittent issues, don't be too surprised -- just try again in another e.g., 10 minutes.

Apr 27 2021, 4:10 PM · Outreachy (Round 22)
Isaac added a comment to T276315: Outreachy Application Task: Tutorial for Wikipedia Clickstream data.

@Pikaa97 it looks like there is some maintenance work going on. Hopefully it will be up within an hour. I'll try to give an update if I hear more, but know that it's not specific to you and it should be back soon.

Apr 27 2021, 3:36 PM · Outreachy (Round 22)

Apr 23 2021

Isaac added a comment to T238437: Co-organize Fair Ranking Track at TREC.

I suggest you check about the question of which articles to potentially exclude with Legal and Security as well

Sounds good -- I'll try to check with them next week.

Apr 23 2021, 7:26 PM · Research (FY2020-21-Research-April-June)
Isaac added a comment to T238437: Co-organize Fair Ranking Track at TREC.

Weekly update:

  • Nothing concrete but discussion around how to prepare for getting an initial dataset out for participants and do the assessments that will be necessary in the summer when we build the test set -- i.e. for a WikiProject of our creation, is any given Wikipedia article relevant to its scope? NOTE: we won't actually create the WikiProject and tag articles -- this will just be an external dataset for labeling.
Apr 23 2021, 5:56 PM · Research (FY2020-21-Research-April-June)
Isaac added a comment to T276274: Outreachy Round 22 Microtask : Complete PAWS notebook tutorial.

Hi @srodlund & @Isaac, could you please confirm that you both got my email with the notebook?

@Slst2020 thanks for letting us know. It went to my spam folder and I had not checked. I will try to get you the review by Monday.

Apr 23 2021, 5:05 PM · Outreachy (Round 22)

Apr 22 2021

Isaac added a comment to T276270: Outreachy Round 22: Use PAWS to create a series of notebook based tutorials that help users access and work with data on Wikimedia projects .

Hey all -- I've gotten a few questions about the "Write a library that could work with SQL dumps" part of the outcomes so I wanted to give a few more details:

Apr 22 2021, 3:05 PM · Outreachy (Round 22), Outreach-Programs-Projects

Apr 21 2021

Isaac added a comment to T270140: Release dataset on top search engine referrers by country, device, and language.

Huge huge thanks to @JFishback_WMF for the privacy review! Everything makes sense from my side.

Apr 21 2021, 7:10 PM · Privacy Engineering, Research, Analytics
Isaac added a comment to T280385: Apache Beam go prototype code for DP evaluation.

Also, our privacy policy prevents us from keeping data at the user level, so DP notions that are user-centric will not really serve our use case. I doubt they serve the case of any service you can use while not authenticated.

@Nuria User-level privacy would not require retaining user-level data. The privacy unit just dictates how we filter the initial data, but once the filtering is done, there is no need to retain the userhashes and the final dataset will have the exact same format. For user-level privacy, we could easily work within the 90-day retention. We would probably start with pageview_actor and then just apply the filtering to arrive at the final <country, language, article, count> tuples.

Apr 21 2021, 4:15 PM · Analytics, Research, Privacy Engineering, Privacy, Data-release

Apr 20 2021

Isaac added a comment to T280029: Easy dimensional data visualization.

All of those options make sense to me long-term. For now, it's pretty easy to deploy a new Turnilo instance on Cloud VPS (all the requirements etc. are handled via a shell script) so I can always help people build their own if that's necessary before we arrive at a more general solution: https://github.com/wikimedia/research-api-endpoint-template/tree/turnilo

Apr 20 2021, 8:29 PM · Analytics
Isaac added a comment to T280385: Apache Beam go prototype code for DP evaluation.

Thanks for starting on this @Htriedman ! I wanted to elevate something that has been discussed in different places and impacts the implementation approach, which is: what is our privacy unit? It's not a blocker yet but something we will have to decide on that will affect the code. There are two main approaches that we can choose between (I'm largely pulling from @TedTed's comment in T267283#6607616 with some additional context of my own):

  • User:
    • The goal is to limit the number of pageviews any given individual contributes to a data release to some maximum threshold -- e.g., 5 pageviews per day -- so that you can't tell whether they are part of the data release. In theory this provides really strong individual privacy guarantees and in practice it's a bit dicey for Wikimedia because we don't track users so it's hard to enforce a threshold.
    • This is a pretty commonplace approach by other organizations doing differential privacy because they do track users via their accounts and so the theory matches well with the practical.
    • Providing a guarantee that no one can determine whether any given person contributed any pageviews to a data release is useful for a couple of reasons:
      • Editors have a very well-documented string of pageviews (in terms of pages they edited) and if someone was trying to identify a particular editor and whether they were from a given country, they might be able to determine this from a data release that didn't provide user-level privacy.
      • By enforcing user-level privacy, we are reducing the impact any given person should be able to have on the dataset. In certain ways, this makes the dataset more robust because it helps to ensure that it reflects many readers' interests etc. instead of just a few prolific readers. Without additional filtering, an individual e.g., on a desktop computer could contribute up to 800 pageviews per day to the dataset before they'd be labeled as automated and filtered out.
    • In practice, going with a user-level privacy unit is difficult at Wikimedia because we don't track readers in this way so it means we need a proxy for "user" that would likely be user-agent (device) + IP address (location) -- i.e. userhash. This currently works okay over the time period of e.g., 1 day, but also suffers for mobile users (IP addresses change more frequently) and could suffer further with Chrome's proposed reduction of the user-agent to a much more generic string (T242825). This introduces some hesitation into trusting that there is a nice 1:1 mapping between userhashes and individuals in our data. For example, individuals with multiple devices or no fixed IP address will have their data spread out amongst many userhashes and thus the filtering will not help them much. Conversely, shared IP proxies and Chrome's proposed changes may mean that we're combining many individuals' data together under a single userhash and thus filtering out much more than we ideally would.
    • For implementation, this would mean that you're probably starting with a table of individual pageviews with associated userhashes and then I assume Privacy on Beam does the requisite filtering / counting based on your input parameters.
  • Pageview:
    • We would do almost no filtering of the data and make no assumptions about who contributed which pageview. We will still be able to guarantee that any given pageview will be private but patterns in the aggregate might reveal information about individuals -- e.g., editors who edit many different pages would be at increased risk of having their country revealed. In practice it would be quite hard to deidentify them but there would be no guarantees of privacy for their whole pattern of pageviews.
      • NOTE: if we go this route, we could try to filter out editors from the data separately from the differential privacy aspect as they are our most at-risk group, but we still lose our formal guarantees.
    • The allure of this approach is its simplicity and transparency -- we don't have to make assumptions about our userhashes that may turn out to be incorrect. We would probably choose more conservative privacy parameters to account for the lower guarantees. Most readers still get excellent protection (assuming they don't read many pages per day) and other controls, like the threshold at which we release data, will all combine to make this still pretty privacy-protecting.
    • For implementation, you would likely follow the suggestion here from TedTed: T267283#6608103
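The user-level preprocessing described above -- cap each (approximate) user's daily contribution before aggregating -- can be sketched as follows. This is my illustration of the filtering step only, not Privacy on Beam code; the 5-pageview cap is the example figure from the text, and `userhash` stands in for the user-agent + IP proxy discussed above:

```python
from collections import Counter

def cap_user_contributions(pageview_log, max_per_user=5):
    """Keep at most `max_per_user` pageviews per userhash per day,
    then aggregate per-page counts; no userhashes are retained in
    the output. `pageview_log` yields (userhash, day, page) tuples."""
    kept = Counter()   # page -> count after capping
    seen = Counter()   # (userhash, day) -> pageviews already kept
    for userhash, day, page in pageview_log:
        if seen[(userhash, day)] < max_per_user:
            seen[(userhash, day)] += 1
            kept[page] += 1
    return kept
```

After this step the noise-addition stage sees data in which no single userhash accounts for more than the cap, which is what bounds each person's influence on the release.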
Apr 20 2021, 7:56 PM · Analytics, Research, Privacy Engineering, Privacy, Data-release
Isaac added a comment to T276274: Outreachy Round 22 Microtask : Complete PAWS notebook tutorial.

so why was simple.wikipedia required?

We went with Simple Wikipedia because the size is much smaller than English Wikipedia so it was more reasonable that you could process it via these notebooks.

Apr 20 2021, 3:53 PM · Outreachy (Round 22)
Isaac added a comment to T276274: Outreachy Round 22 Microtask : Complete PAWS notebook tutorial.

Only thing left is- how can I make this content readable, this is revision diff from Craig Noone article:

@DaneshwariK I'm not sure if this is what you're asking, but the Compare API will provide you with HTML diffs for edits -- e.g., https://en.wikipedia.org/w/api.php?action=compare&fromrev=930870273&torev=933163076
The diff that you pasted there is wikitext, which is the raw code used for writing Wikipedia articles. It must then be parsed into HTML to be "readable". You can also just screenshot diffs in the visual mode on Wikipedia if that's easier and what you want -- e.g., https://en.wikipedia.org/w/index.php?title=Craig_Noone&diff=933163076&oldid=930870273&diffmode=visual
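For reference, a stdlib-only sketch of calling the Compare API from Python. The parameters come from the URL above; `formatversion=2` and the `compare.body` field are my assumptions about the JSON shape, so double-check against a live response:

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

API_URL = "https://en.wikipedia.org/w/api.php"

def compare_params(fromrev, torev):
    """Build the query parameters for an action=compare call."""
    return {
        "action": "compare",
        "fromrev": fromrev,
        "torev": torev,
        "format": "json",
        "formatversion": 2,
    }

def html_diff(fromrev, torev):
    """Fetch the HTML diff table between two revisions."""
    url = API_URL + "?" + urlencode(compare_params(fromrev, torev))
    req = Request(url, headers={"User-Agent": "diff-example (toy script)"})
    with urlopen(req) as resp:
        return json.load(resp)["compare"]["body"]
```

e.g., `html_diff(930870273, 933163076)` should return the diff HTML for the Craig Noone edits above.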

Apr 20 2021, 12:47 PM · Outreachy (Round 22)

Apr 19 2021

Isaac added a comment to T276274: Outreachy Round 22 Microtask : Complete PAWS notebook tutorial.

Can someone help me figure out where I might be going wrong?

@Nizz009 I forgot to mention namespaces in the tutorial template, but this page will explain how they work: https://en.wikipedia.org/wiki/Wikipedia:Namespace
In short -- articles with the same "title" can exist in different namespaces, where they serve different purposes. You should focus on namespace 0, which is what we think of as traditional Wikipedia articles. Namespace is an attribute in the XML dump that you can easily access and filter on. In general, when trying to figure these things out, one trick is that you can easily see which article is associated with a page ID: if the page ID is 413520, then you can view the associated article by going to: https://simple.wikipedia.org/wiki/?curid=413520. Good catch!
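As a toy illustration of namespace filtering, using only the standard library on a simplified dump snippet (hypothetical titles/IDs; real dump files declare an XML namespace and are far too big to load at once, so in the tutorial you'd stream them and check each page's namespace attribute instead):

```python
import xml.etree.ElementTree as ET

# Toy, simplified dump snippet (made-up titles and IDs)
SAMPLE = """<mediawiki>
  <page><title>Example</title><ns>0</ns><id>1</id></page>
  <page><title>Talk:Example</title><ns>1</ns><id>2</id></page>
  <page><title>Example 2</title><ns>0</ns><id>3</id></page>
</mediawiki>"""

def article_pages(xml_text):
    """Yield (page_id, title) only for pages in the article namespace (0)."""
    root = ET.fromstring(xml_text)
    for page in root.iter("page"):
        if page.findtext("ns") == "0":
            yield int(page.findtext("id")), page.findtext("title")
```

Here the talk page is dropped and only the two namespace-0 pages come through.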

Apr 19 2021, 12:50 PM · Outreachy (Round 22)

Apr 16 2021

Isaac added a comment to T276274: Outreachy Round 22 Microtask : Complete PAWS notebook tutorial.

Is there any way we can convert an article like this one https://simple.wikipedia.org/w/index.php?title=india&action=history&year=2020 into its own independent XML dump file and use it?

@Palak199 I've never tried it so don't know if it would work, but there is an export tool that might do this: https://simple.wikipedia.org/wiki/Special:Export
If you use this in your notebook, just make sure you explain clearly how to reproduce the results.

Apr 16 2021, 7:22 PM · Outreachy (Round 22)
Isaac closed T272175: Prototype article importance metrics, a subtask of T155541: [Epic] Article importance prediction model, as Resolved.
Apr 16 2021, 5:21 PM · Research, Machine-Learning-Team, artificial-intelligence
Isaac closed T272175: Prototype article importance metrics as Resolved.

Weekly update:

  • Resolving this task. I had wanted to get a bit further down the proof-of-concept road before committing to a metric, but that's really the work of building an API (goal for Q4).
  • We submitted a CSCW paper on this project yesterday. Building on my thoughts in T272175#6894768, one of the really interesting aspects to come out of that work/discussion was that there really isn't a good way to model article importance (which is a measure of an ideal distribution of quality of the projects in line with encyclopedic values) separately from the current state, which is a reflection of editor/reader interest. Classic measures for importance like PageRank or pageviews are highly wrapped up with the current state. Evaluations of editor effort (in terms of # of edits) and article importance (in terms of Vital Articles) show some pretty big gaps, indicating that implicit data sources are always going to be an incomplete proxy and you really need explicit crowdsourced ratings from "experts" (e.g., in the form of Vital Articles or WikiProject Importance ratings) to guide what content should be prioritized to be high quality on Wikipedia. This feeds into this project not just being about developing a ranking but exposing filters for editors to narrow down content to what they see as important, better annotation systems for WikiProjects (which is outside the scope of this work), and the work to tie edit recommender systems to campaigns (which are in effect community-driven importance assessments).
  • In the absence of those crowdsourced ratings, however, reader demand is the other component. In line with past thoughts, a misalignment metric (reader interest vs. current quality) is not actually the ideal measure of article importance but it is simple and can complement importance ratings along with content filters. Warncke-Wang et al. used a binned version of this metric because they had access to quality classes. Because I'm building language-agnostic models, the reader demand and quality scores are both fixed to [0-1] so the difference is nicely scoped to [-1,1]. This can always be mapped back to qualitative classes for the purposes of explaining recommendations -- e.g., C-class article w/ FA-level article pageviews. Continuing the buildout of this into an API will occur in Q4.
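The misalignment metric described above can be sketched in a few lines (the function name and error handling are mine, not from any eventual API):

```python
def misalignment(demand, quality):
    """Reader demand minus quality, both scaled to [0, 1], so the score
    lies in [-1, 1]: positive means quality lags reader interest (e.g.,
    a C-class article with FA-level pageviews), negative the reverse."""
    if not (0.0 <= demand <= 1.0 and 0.0 <= quality <= 1.0):
        raise ValueError("demand and quality must be in [0, 1]")
    return demand - quality
```

Because both inputs are language-agnostic [0, 1] scores, the same computation applies to any wiki without class-specific binning.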
Apr 16 2021, 5:21 PM · Research (FY2020-21-Research-January-March)
Isaac added a comment to T238437: Co-organize Fair Ranking Track at TREC.

Weekly update:

  • Completed keyword generation and realized that we'll likely want to allow for manual generation of keywords as well because many projects are complicated to describe via search keywords but have a well-defined scope.
  • Put together simple code for extracting articles that can be used for generating the dataset that is presented to participants: https://public.paws.wmcloud.org/55703823/Processing%20Text%20Dumps.ipynb
Apr 16 2021, 3:45 PM · Research (FY2020-21-Research-April-June)
Isaac updated the task description for T277894: Support BKC Research Sprint on Digital Self-Determination.
Apr 16 2021, 2:50 PM · Research
Isaac created T280369: Isaac Academic Service 2021.
Apr 16 2021, 2:48 PM · Research
Isaac added a comment to T276274: Outreachy Round 22 Microtask : Complete PAWS notebook tutorial.

if I store the revision id, comments and tags for every revision would that be sufficient?

@Palak199 you're comparing the API results with the data you gathered from the dumps, so I'd just gather whatever information you stored from the dumps so that you can directly compare the two.

Apr 16 2021, 12:53 PM · Outreachy (Round 22)

Apr 15 2021

Isaac closed T203042: Output 2.2: Characterizing readership by demographics as Resolved.

With the academic paper accepted at ICWSM, closing out this epic task. Any continued work in this area will likely continue under the umbrella of the Knowledge Gaps Index.

Apr 15 2021, 6:53 PM · Research, address-knowledge-gaps, Epic
Isaac closed T230677: Share out results from demographics surveys as Resolved.

Closing this task out -- blogpost may still happen but possibly under the umbrella of Knowledge Gaps Taxonomy.

Apr 15 2021, 6:51 PM · Research
Isaac closed T230677: Share out results from demographics surveys, a subtask of T203042: Output 2.2: Characterizing readership by demographics, as Resolved.
Apr 15 2021, 6:51 PM · Research, address-knowledge-gaps, Epic
Isaac updated the task description for T230677: Share out results from demographics surveys.
Apr 15 2021, 6:51 PM · Research

Apr 14 2021

Isaac updated the task description for T219903: Keep research.wikimedia.org landing page updated.
Apr 14 2021, 8:58 PM · Patch-For-Review, periodic-update, Research
Isaac committed rRLP7ec707cd02e8: Research of the Year Award, move some unmaintained pages to redirects, change… (authored by Isaac).
Research of the Year Award, move some unmaintained pages to redirects, change…
Apr 14 2021, 8:58 PM

Apr 13 2021

Isaac committed rRLPc5161e7c9422: Add awards page and update navigation. Add Diego's ICWSM paper. (authored by Isaac).
Add awards page and update navigation. Add Diego's ICWSM paper.
Apr 13 2021, 8:58 PM
Isaac updated the task description for T277548: Improve robustness of data processing pipeline.
Apr 13 2021, 8:47 PM · Research, Platform Team Workboards (Green)
Isaac updated the task description for T219903: Keep research.wikimedia.org landing page updated.
Apr 13 2021, 8:02 PM · Patch-For-Review, periodic-update, Research
Isaac awarded T280029: Easy dimensional data visualization a Love token.
Apr 13 2021, 3:13 PM · Analytics

Apr 9 2021

Isaac added a comment to T276274: Outreachy Round 22 Microtask : Complete PAWS notebook tutorial.

All: I forgot to mention but there are currently holidays today (Friday) - Monday so responses will be a bit slower than usual. I'll do my best to check occasionally though and respond if it is quick.

Apr 9 2021, 7:51 PM · Outreachy (Round 22)

Apr 8 2021

Isaac closed T273325: Develop metrics for Geographic gaps as Resolved.

Resolving this. I also updated the prototype API for this data so that it doesn't just provide regions but also provides the aggregations to reflect the current status -- e.g., WandaVision: https://wiki-region.wmcloud.org/api/v1/region?qid=Q65980217

Apr 8 2021, 6:32 PM · Research (FY2020-21-Research-January-March)
Isaac closed T273325: Develop metrics for Geographic gaps, a subtask of T242172: Taxonomy of Knowledge Gaps, as Resolved.
Apr 8 2021, 6:32 PM · Research, Epic
Isaac added a comment to T276274: Outreachy Round 22 Microtask : Complete PAWS notebook tutorial.

Does anyone know if Wikimedia SQL dumps always contain a single table, or if there can be several in the same file?

@Slst2020 the SQL dumps should only ever have a single table in them, though obviously some tables are small while others are much larger.
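If it's useful, here's a rough sketch of pulling value tuples out of a dump's INSERT lines without a MySQL server. Caveat: this naive regex breaks on quoted values containing commas or parentheses, so for real analysis you'd want a proper parser or to load the dump into MySQL/sqlite.

```python
import re

def iter_value_tuples(sql_line):
    """Extract value tuples from one "INSERT INTO `tbl` VALUES ...;" line.
    Naive: assumes no commas or parens inside quoted values."""
    m = re.match(r"INSERT INTO `\w+` VALUES (.*);", sql_line.strip())
    if not m:
        return []  # not an INSERT line (e.g., comments, DDL)
    return [tuple(t.split(",")) for t in re.findall(r"\(([^)]*)\)", m.group(1))]
```

Each dump file holds exactly one table, so every INSERT line in a given file yields rows with the same column layout.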

Apr 8 2021, 1:23 PM · Outreachy (Round 22)

Apr 7 2021

Isaac added a comment to T270140: Release dataset on top search engine referrers by country, device, and language.

I'm so sorry, is this stuck on me?

No worries -- as you said, privacy review is still ongoing. FYI @JFishback_WMF I just discovered that we've been publishing some of this data in a highly aggregated form here for the past several years. Not sure if that helps with the privacy review at all.

Apr 7 2021, 9:23 PM · Privacy Engineering, Research, Analytics
Isaac added a comment to T238437: Co-organize Fair Ranking Track at TREC.

Weekly update:

  • Chose initial set of ~50 WikiProjects for training data. Now identifying what keywords we would provide as "queries" for each WikiProject -- e.g., WikiProject Agriculture would be associated with: agriculture, crops, livestock, forests, farming.
  • Came up with an ad-hoc way of assessing WikiProject activity (as a proxy for likely completeness). I look at how many annotations (new articles tagged, new quality assessments, or new importance assessments) were made by a given WikiProject in the last 90 days. No clear threshold between active/inactive (for example, a project might have few recent annotations but still have excellent coverage if their topic is not one that often has new articles and they did much of the tagging work years ago) but it's a good gut-check. Data: https://analytics.wikimedia.org/published/datasets/one-off/isaacj/list-building/enwiki_wikiproject_activity_2021_04_06.tsv
Apr 7 2021, 8:42 PM · Research (FY2020-21-Research-April-June)
Isaac added a comment to T276398: Experiment with on-wiki model documentation.

Where do you host the Python code?

This is a prototype so we're just running this off stat1004 using this code. There are two cronjobs: the first runs the job to generate the data and the second publishes it to the wikis. They are separate because the first accesses HDFS and so needs to kinit (which can be done automatically, but the script then has to be run from a more generic account). The second must be run under my account because it has to read the file with the credentials for publishing to the wikis, which is readable only by me. Hacky, but it has worked for almost a year! I suspect there are better solutions, but heads up that the permissions issue might arise for you too (and you might consider pursuing the creation of a more generic MLPlatformBot_(WMF) account to ensure better continuity and somewhat reduce the permissions issue). If you're curious, the crontab looks like this: slides.
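For a rough sense of the two-cronjob shape (entirely hypothetical paths and schedule, not the actual crontab -- see the linked slides for that):

```
# 1) generate the data (runs from a generic account so kinit can be automated)
0 2 * * 1 /usr/bin/python3 /path/to/generate_model_data.py >> /var/log/model_gen.log 2>&1

# 2) publish to the wikis (runs under the personal account that can read the OAuth credentials file)
0 6 * * 1 /usr/bin/python3 /path/to/publish_to_wiki.py >> /var/log/model_pub.log 2>&1
```

The gap between the two jobs leaves time for the generation step to finish before publishing picks up its output.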

Apr 7 2021, 6:41 PM · Machine-Learning-Team (Active Tasks), Documentation, artificial-intelligence
Isaac added a comment to T276274: Outreachy Round 22 Microtask : Complete PAWS notebook tutorial.

I can't believe I misread the documentation so thoroughly D:

No worries -- this is why tutorials are useful :)

Apr 7 2021, 1:24 PM · Outreachy (Round 22)

Apr 6 2021

Isaac added a comment to T276398: Experiment with on-wiki model documentation.

This is a page that gets auto-updated by a bot as new information rolls in, so there is some precedent for automating articles on mediawiki.

I can try to help if more is needed but quick overview of how this works:

  • We run a script on the servers that generates that data that we want to publish on a regular cadence
  • When the data is prepared, we use some simple Python functions to format that data into wikitext (in our case, a large table with some explanatory text): code
  • We then use the APIs to write that wikitext to the specified page. It makes the edit under my account using an owner-only OAuth token (see documentation)
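The formatting step in the second bullet can be sketched with a tiny helper like this (hypothetical -- the real code linked above is more involved):

```python
def to_wikitable(headers, rows):
    """Render rows of data as a sortable wikitext table."""
    lines = ['{| class="wikitable sortable"']
    lines.append("! " + " !! ".join(headers))
    for row in rows:
        lines.append("|-")  # start a new table row
        lines.append("| " + " || ".join(str(v) for v in row))
    lines.append("|}")
    return "\n".join(lines)
```

The returned string (plus any explanatory text) is then what gets written to the page via the edit API in the third bullet.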
Apr 6 2021, 7:05 PM · Machine-Learning-Team (Active Tasks), Documentation, artificial-intelligence
mpopov awarded T270140: Release dataset on top search engine referrers by country, device, and language a Stroopwafel token.
Apr 6 2021, 3:36 PM · Privacy Engineering, Research, Analytics
Isaac added a comment to T276315: Outreachy Application Task: Tutorial for Wikipedia Clickstream data.

Now, as per the first clickstream data dump, it's in English. But in langlink API, it's not in English.

@Tru2198 make sure that you're viewing all the results. This can be done via the continuation parameter in mwapi (documentation) or by increasing the lllimit parameter in your API call to return more results at once. Hope that helps.
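The continuation logic amounts to re-issuing the same query with the returned `continue` parameters merged in until the API stops returning them. A library-free sketch (with `fetch` standing in for an HTTP GET that returns parsed JSON):

```python
def query_all(fetch, params):
    """Yield each batch of results, following MediaWiki API continuation."""
    params = dict(params)  # don't mutate the caller's dict
    while True:
        result = fetch(params)
        yield result
        if "continue" not in result:
            break  # no more batches
        params.update(result["continue"])  # e.g., merges in llcontinue
```

mwapi's continuation support does essentially this loop for you.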

Apr 6 2021, 2:43 PM · Outreachy (Round 22)

Apr 5 2021

Isaac added a comment to T276315: Outreachy Application Task: Tutorial for Wikipedia Clickstream data.

Just dropping in a comment to say belatedly that we generally won't be responding to comments over the weekend. Looking back through the discussions, it looks like nothing requires our attention but let me know if I missed something. Thanks all!

Apr 5 2021, 3:14 PM · Outreachy (Round 22)
Isaac closed T263646: Develop an approach to infer which countries are associated with a given Wikipedia article as Resolved.

@srishakatux thanks for the ping. The Outreachy-specific work is complete -- all the other tasks are follow-on tasks for the overall project (not pieces that were expected to be completed during the internship). I'll close this task and then at some point this week move those to another parent task.

Apr 5 2021, 1:45 PM · Outreachy (Round 21), Outreach-Programs-Projects
Isaac closed T263874: Outreachy Application Task: Tutorial for Wikipedia Page Protection Data, a subtask of T263646: Develop an approach to infer which countries are associated with a given Wikipedia article, as Resolved.
Apr 5 2021, 1:45 PM · Outreachy (Round 21), Outreach-Programs-Projects
Isaac closed T263874: Outreachy Application Task: Tutorial for Wikipedia Page Protection Data as Resolved.
Apr 5 2021, 1:45 PM · Outreachy (Round 21)
Isaac added a comment to T276274: Outreachy Round 22 Microtask : Complete PAWS notebook tutorial.

This results in a JSONDecode error, possibly because an HTML (error?) page is being returned instead of JSON. Has anyone tried this successfully?

@Christalee_b the documentation isn't great for that but here's a better example of how to do continuation that will hopefully solve your problem: https://github.com/mediawiki-utilities/python-mwapi#query-with-continuation

Apr 5 2021, 1:42 PM · Outreachy (Round 22)

Apr 2 2021

Isaac added a comment to T276315: Outreachy Application Task: Tutorial for Wikipedia Clickstream data.

I had a doubt regarding recording contributions. Since we are not making formal pull requests in this project, at what points and in which form are we required to record our contributions? Do we need to submit the public link of our notebook after completing a few to-dos and that will count as a contribution?

@rachita_saha There is no formal point right now at which you are required to submit an initial contribution for this task (we might ask for it later depending on interest). It's good practice to submit one after you've got an initial draft of your notebook, as a checkpoint and to identify what improvements you want to make. When we evaluate the submissions at the end, though (when the submission period closes and you also fill out an application for this project), we'll only look at your final contribution. Hope that helps.

Apr 2 2021, 3:42 PM · Outreachy (Round 22)
Isaac added a comment to T272175: Prototype article importance metrics.

Weekly updates:

  • Looked into moving from a straight linear regression to essentially just a weighted average of the four features currently being used. That leads to some sacrifice in accuracy on English (from 0.913 to 0.867 linear correlation between predicted scores and ORES scores) but has a few nice properties:
    • Automatically bounded between 0 and 1 (because each feature is bounded between 0 and 1)
    • Very easy to interpret
  • Plots below of linear regression and "weighted average" approach (which is just a linear regression w/ no intercept and the weights normalized so they sum to 1):
    • Linear Regression:
    • Weighted-Average:
  • Started process of pulling in data from a few other languages to test generalizability of weights learned for English. Two options that I'm pursuing:
    • ORES scores are available in bulk for a few wikis on HDFS: euwiki, glwiki, and a few others. These are likelihoods for each article quality class from that wiki so would need to be mapped to a float between 0 and 1 if I was to use the same modeling approach.
    • Groundtruth data of WikiProject assessments from PageAssessments is available for Arabic and French (and Turkish and Hungarian) on MariaDB. These also would need to be mapped to floats because they are the native wiki's article quality classes.
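The "weighted average" fit described above can be sketched in a few lines of numpy (toy data here; the real model uses the four [0, 1]-bounded quality features):

```python
import numpy as np

def fit_weighted_average(X, y):
    """Fit a no-intercept linear regression, then renormalize the learned
    coefficients to sum to 1. For features in [0, 1] (and non-negative
    weights), predictions are automatically bounded in [0, 1]."""
    coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coefs / coefs.sum()

# toy example: y is exactly a 0.3/0.7 blend of two [0, 1] features
X = np.array([[0.2, 0.8], [0.9, 0.4], [0.5, 0.5], [0.1, 0.9]])
y = X @ np.array([0.3, 0.7])
weights = fit_weighted_average(X, y)
preds = X @ weights
```

Note the bound only holds when the learned coefficients come out non-negative; in this toy case the fit recovers the 0.3/0.7 weights exactly.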
Apr 2 2021, 3:39 PM · Research (FY2020-21-Research-January-March)
Isaac added a comment to T238437: Co-organize Fair Ranking Track at TREC.

Weekly updates:

  • Moving on to the process of choosing WikiProjects for training data. Ideally WikiProjects with a mixture of attributes so the fairness criteria can be reasonably applied. Luckily, the spreadsheet I put together earlier of all WikiProjects, # of articles, and details on biographies and geography will help with this process. I would like to add a measure of activity -- probably something like # of articles tagged or rated for quality in the last, e.g., 3 months -- to help identify the projects most likely to have high coverage.
Apr 2 2021, 3:29 PM · Research (FY2020-21-Research-April-June)
Isaac added a comment to T273325: Develop metrics for Geographic gaps.

Weekly update: wrapped up entry into metric schema doc for geographic gaps across readers/contributors/content. Can reasonably close out this task but some more long-term follow-up aspects:

  • Update geographic metrics based on findings from Marc's work on other metrics and feedback that GDI receives about their geographic metrics
  • Expand out model for labeling articles with relevant countries to improve content geographic metric coverage
  • Consider any improvements to geographic content approach -- e.g., also model sub-country level?
Apr 2 2021, 3:13 PM · Research (FY2020-21-Research-January-March)
Isaac added a comment to T276274: Outreachy Round 22 Microtask : Complete PAWS notebook tutorial.

Checking and not checking namespace(0) give two different figures on edits. I believe that this is a normal behavior but just want to make sure that I am not missing anything important (i.e. it is acceptable to analyze edits only in the article namespace)?

Yep -- it would not be surprising that the numbers would be different (as many editors just edit articles in namespace 0 and don't bother with talk pages or other namespaces). Feel free to just proceed w/ namespace 0 for your analysis, but if you still have the data on the difference, it could be interesting to include.

Apr 2 2021, 1:01 PM · Outreachy (Round 22)
Isaac added a comment to T276315: Outreachy Application Task: Tutorial for Wikipedia Clickstream data.

Should I record my first contribution and await the review on the Outreachy site?

@Tru2198 Recording contributions helps with our ability to track interest in the project and get a sense of where folks are at. We don't automatically review any contributions though. At some point when you feel pretty happy with the notebook and want some detailed feedback, you can send me an email with the public PAWS link and I'll try to respond with feedback within a workday or two. I would suggest doing this when your notebook is in a pretty stable state though as I can only guarantee that I will have time to do it once per applicant. In the meantime, feel free to continue to ask specific questions here though.

Apr 2 2021, 12:44 PM · Outreachy (Round 22)

Apr 1 2021

Isaac added a comment to T276315: Outreachy Application Task: Tutorial for Wikipedia Clickstream data.

for the task: Compare Reader Behavior across Languages, will the comparison between two languages suffice, as most articles are supported by only one or two languages, other than the English language that intersects with the available clickstream data and that of the langlinks API?

@Tru2198 yes, that would suffice. If the comparison ends up not being interesting, you can always look for a different article too.

Apr 1 2021, 7:15 PM · Outreachy (Round 22)
Isaac added a comment to T276274: Outreachy Round 22 Microtask : Complete PAWS notebook tutorial.

Hi everyone, my name is Zhansaya. I have a question regarding the documentation: should I explain every line of my code or is it OK to give a detailed summary of 10-15 lines? Thanks in advance!

Welcome Zhansaya! There are no strict guidelines, but generally a concise summary for a given cell in your notebook is sufficient. You can always add Markdown cells if you need more as well. Most style guides essentially say that you shouldn't describe the code -- i.e. you can assume the person accessing the tutorial knows basic Python and can figure out what each line does -- but do give a sense of the goal of the code or explain any unclear choices you might have made in the code.

Apr 1 2021, 7:08 PM · Outreachy (Round 22)
Isaac added a comment to T278551: Duplicate wikitext entries for a bunch of wikis in 2021-02 snapshot.

Thanks @JAllemandou for tracking this down and fixing it!

Apr 1 2021, 1:03 PM · Analytics-Kanban, Analytics

Mar 31 2021

Isaac added a comment to T276274: Outreachy Round 22 Microtask : Complete PAWS notebook tutorial.

Can I get an estimate of how long it should take an optimized solution? Or suggestions on how to approach this problem? I'm willing to post what I have so far if that would help.

Welcome @Christalee_b -- a few thoughts:

  • Looping through the history dump for simplewiki should take around 30 minutes.
  • It'll go a good bit faster as well (~20 minutes) if you only process pages in the article namespace (0). Every page on Wikipedia is associated with a namespace and this information is surfaced by the mwxml library so you can filter on it. This will discard pages with very long edit histories like some talk pages.
  • If it's still taking a long time, then there's probably something in your code that is slowing down the processing. My suggestion for figuring out what's happening is to calculate how long it takes to process e.g., the first 500 or 1000 pages. Then go through the code and comment out parts and rerun and compare the time. This should help you quickly identify what part of the code is so slow (there are other, more formal ways too) and then you can think about why it's slow and how you could speed it up.
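The timing suggestion in the last bullet can be sketched like this (hypothetical helper; `pages` is whatever iterator your dump library gives you and `process` is your per-page logic):

```python
import time

def time_first_n(pages, process, n=500):
    """Process only the first n pages and report elapsed seconds; re-run
    with parts of `process` commented out to localize the slow step."""
    start = time.perf_counter()
    for i, page in enumerate(pages):
        if i >= n:
            break
        process(page)
    return time.perf_counter() - start
```

A rough estimate for the full dump is then `elapsed * (total_pages / n)`.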
Mar 31 2021, 6:54 PM · Outreachy (Round 22)
Isaac added a comment to T276274: Outreachy Round 22 Microtask : Complete PAWS notebook tutorial.

Hey @Palak199 thanks for the additional details. A few thoughts that hopefully help:

  • It's the mobile edit tag in particular that I want you to focus on -- i.e. only tag ID #5, so you can ignore mobile web edit, mobile app edit, and advanced mobile edit. If you want to understand the differences between these, you can look at the descriptions on Simple Wikipedia's tag page: https://simple.wikipedia.org/wiki/Special:Tags
  • The revision IDs associated with any particular tag can be found in the file whose name is stored in the TAG_DUMP_FN parameter. You'll have to write some basic code to loop through that file and extract the revision IDs associated with the mobile edit tag. You can see details of that file earlier in the notebook to help you understand how this file is formatted.
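As a sketch of that second bullet -- this assumes a tab-separated file of (revision ID, tag name) pairs, which may not match the actual TAG_DUMP_FN layout, so check the file description earlier in the notebook first:

```python
def revisions_with_tag(lines, target_tag="mobile edit"):
    """Collect revision IDs whose tag matches target_tag exactly
    (so 'mobile web edit' etc. are not swept in)."""
    rev_ids = set()
    for line in lines:
        rev_id, tag = line.rstrip("\n").split("\t", 1)
        if tag == target_tag:
            rev_ids.add(int(rev_id))
    return rev_ids
```

The exact-match comparison is the important part, since several tag names share the "mobile" prefix.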
Mar 31 2021, 6:24 PM · Outreachy (Round 22)
Isaac added a comment to T273325: Develop metrics for Geographic gaps.

Trying to summarize with a few more details why mapping content to urban/rural distinctions is quite challenging:

  • The best work I know on "what is rural?" comes from Hardy et al. They describe how various researchers have operationalized rurality (Section 2.1), which they break into two main categories:
    • Descriptive rurals: observable/measurable features such as population size/density, distance to urban area, economic indicators
      • Most of these are quite difficult to do at scale -- in particular, population density, distance to urban area, economic indicators. These can often be gathered for a single country but interpreting what is urban and what is rural from these numbers varies greatly country-to-country and I don't know of good global datasets for this sort of categorization (see UN page, Section D for more details). This is what I did though in 2017 for my urban-rural research on the US/China and Wikipedia (paper).
      • As mentioned in the UN page, the best approach is based on population size though it is not comparable between countries. I do have a script for doing this based on place names or coordinates (description) but that greatly narrows the scope of articles being considered to just those with coordinates (or existing population data via Wikidata). At that point, given that it's a poor measure of urban/rural and would be greatly limited in what articles it could be applied to compared to the country approach, the value is not clear.
    • Sociocultural/symbolic rurals: defining rural based on the values people hold or cultural traditions
      • This captures the contextual/social nature of "urban vs. rural" that simple measurements like population density can miss. It is nearly impossible to delineate and would have to be captured by categories or similar tagging systems by Wikipedians. A quick skim of categories suggests that such tagging is far from complete and very focused on identifying descriptive rurals such as towns (e.g., enwiki categories starting w/ 'rural'; enwiki 'Rural tourism' category).
Mar 31 2021, 3:07 PM · Research (FY2020-21-Research-January-March)