This is great -- what I'm seeing here @Dibyaaaaax is that the GBC model mostly performs very similarly to the fastText model when given the same data, but its recall does suffer for low-data topics. We'll have to discuss whether fastText's slightly higher performance warrants the complexity of permanently adding the new fastText class to revscoring and making sure that it would work in production. I'll mention some other things we discussed, but jump in if you have more concrete data:
- GBC models train in >2 hours whereas fasttext trains in ~2 minutes. Makes me wonder whether the HistGradientBoostingClassifier would provide the same performance as GBC (and be super easy to implement) but train much more quickly.
- Even though you've got fastText set up for training, I'm not certain what it would look like in production if we decided the performance was worth it. It fine-tunes the word embeddings that it's provided, so it produces a second set of embeddings that are slightly different from the ones trained via mwtext. Maybe we just dump those fine-tuned embeddings to a file and reproduce how fastText works with numpy, as in T242013#6155316.
stat1004 being reimaged this week or next
@elukey just a heads up that I'm running some long-running SWAP notebooks via stat1004 but it's okay to kill those processes as part of the reimaging if they're still going when you proceed. They're long running because they run a number of sequential pyspark queries and it's easy for me to pick up from where they left off if they get killed. No need to check with me in advance.
See https://github.com/wikimedia/wikitax/pull/6 for implementation of these changes
Fri, Jul 31
This is "overall articles for all projects", correct?
It's actually just for English Wikipedia. The number from the WMDE dashboard for all Wikipedia projects is 31.99% (i.e. the inverse of the 68.01% number provided under "% of Articles that use Wikidata" in the tinier table that aggregates each project family). It varies a lot by wiki too -- vecwiki seems to have almost every article with some form of Wikidata transclusion whereas 62% of articles on Japanese Wikipedia don't have a single Wikidata-based template. This data was only recently added there (see T257962).
Update: looks likely that I'll be able to work with a contractor on the comprehensive comparison for the month of August, so I'm waiting to hear formally about that before proceeding.
Weekly update: didn't meet this week
Weekly update: met with DD and was given an overview of the model choices and future directions. Will be receiving a pointer to code / documentation in the near future. For now, though, I have a decent understanding of the current state of the project, which will hopefully be enough to make interpretation of the code relatively straightforward.
For reference, I followed up here: T249654#6352573
@Nuria: following up on T247099#6346344 here as this seems a more relevant task. I provide high-level details below regarding the nature of Wikidata transclusion on English Wikipedia. Here is a more thorough description of how I came to my conclusions regarding the importance of the different types of Wikidata transclusion that occur. @Addshore @GoranSMilovanovic @Lydia_Pintscher FYI in case you're interested, as I know you're well aware of the limits of wbc_entity_usage for measuring Wikidata transclusion in articles. I'm very open to feedback, so let me know if you see any mistaken assumptions etc.
Wed, Jul 29
@Nuria thanks for the ping -- I finally have been making progress on this and am hoping to have some early statistics in about a week. FYI, right now this is focused on enwiki to start, because the main challenge is that there isn't fine-grained data of the sort we need for really understanding Wikidata usage. I'm aiming for this initial analysis to answer the following question: for the 62% of English Wikipedia articles that supposedly transclude Wikidata content (based on wbc_entity_usage), what is the breakdown of that transclusion into the following categories: populating a metadata template, populating external links, populating an infobox, or tracking categories with no change to the page?
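Once each article's Wikidata transclusions have been coded into those categories, the breakdown itself is just a tally. A toy sketch with invented data (`transclusion_breakdown` and the article-to-categories mapping are hypothetical, not the actual analysis code):

```python
# Toy sketch of the planned breakdown: given articles coded into the
# transclusion categories above, report the share of articles with at
# least one transclusion of each kind.
from collections import Counter

CATEGORIES = ["metadata template", "external links", "infobox",
              "tracking category (no change to page)"]


def transclusion_breakdown(coded_articles):
    """coded_articles: {page title: set of category names}.
    Returns {category: fraction of articles with that transclusion type}."""
    counts = Counter(cat for cats in coded_articles.values() for cat in cats)
    n = len(coded_articles)
    return {cat: counts.get(cat, 0) / n for cat in CATEGORIES}
```

Note the shares can sum to more than 1 since a single article can transclude Wikidata in several ways (e.g. an infobox and external links).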
Tue, Jul 28
It should be possible to test this explanation. We can make QuickSurveys use `<button>` tags rather than `<a>` tags, removing the ability to right-click and open in a new tab. This should be a relatively simple change, as OOUI provides consistent styling for both tags when used as buttons.
I'm certainly interested in whether this explains the whole issue, but regardless, this would be desirable if someone has the time and it fixes the right-click issue. I don't see any drawbacks to this approach, and improving logging for QuickSurveys is pretty important to it being useful.
Mon, Jul 27
but if you want to make sure you can try to install them on stat1005/stat1008 that are already running debian 10 (just to double check that nothing explodes etc..)
Ahh good point -- done and no issues. Thanks!
Fri, Jul 24
- created simple placeholder: https://meta.wikimedia.org/wiki/Research:Knowledge_Gaps_Index
- waiting for meta templates to redesign to be more user-friendly
- Created standardized template for hosting models on Cloud VPS that handles all the setup via a simple script, so it's pretty easily extendable to other models (already using it for the link-based and wikidata-based models).
- Created UI for easily comparing models: https://wiki-topic.toolforge.org/comparison
- You can input a language + article title to compare results for specific articles or just the language (but leave title blank) to have the UI choose a random article for you
- Current model performance report card but I'd like to standardize this a bit more
- Initial pass at comparing Wikidata and link-based models but need to expand this to include ORES and be more accessible
Wed, Jul 22
Tue, Jul 21
Mon, Jul 20
+1 to this. Discussed in IRC but having this handled by default would be hugely hugely appreciated as I definitely do not trust myself to get it right!
Thanks @Nuria! Yeah, no hurry on our end to fix either, but I know we're excited about this table in Research for its potential for speeding up a lot of the querying / session-building work we do and so I want to make sure it's eventually fixed or at least clearly documented somewhere so we don't unknowingly reach wrong conclusions when we work on multilingual reading behavior that includes the apps.
Weekly update: discussed two possible directions with this work with my colleagues at UMN that would provide some insight into the new article importance metrics:
- Identify metrics that reasonably proxy the importance factor and measure how well existing approaches to importance capture these new factors -- e.g., if "political impact" is a new factor, then you might assert that one way to identify articles that would have a political impact is to find articles that are under WikiProject Politics and have page protections in place (assuming that page protection means either that the article is impactful and attracted vandalism, or that it was deemed potentially impactful and so was protected in advance of vandalism). Then you could look at how well pageviews or inlinks capture these new importance factors.
- Identify important measures of bias (taking care to define) such as gender bias and look at how the different article importance metrics would contribute to or reduce bias if used in recommender systems.
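For the first direction, a hedged sketch of the measurement step: with "under WikiProject Politics + page-protected" as a binary proxy label, ROC AUC is one reasonable way to ask how well an existing importance signal like pageviews ranks those articles. The function name and data below are invented for illustration:

```python
# Sketch: how well does an existing importance signal (pageviews) capture a
# binary proxy for a new importance factor (e.g. "political impact")?
# AUC of 1.0 = the signal ranks all proxy-positive articles first;
# 0.5 = no better than chance.
import numpy as np
from sklearn.metrics import roc_auc_score


def importance_capture(pageviews, is_impactful):
    """pageviews: per-article pageview counts (array-like);
    is_impactful: 0/1 labels from the proxy coding (array-like).
    Returns the ROC AUC of ranking articles by (log-scaled) pageviews."""
    # log1p tames the heavy-tailed pageview distribution; AUC itself is
    # rank-based, so this does not change the score, only readability.
    return roc_auc_score(is_impactful, np.log1p(pageviews))
```

The same function could be pointed at inlink counts (or any other existing metric) to compare how well each one captures the new factor.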
Weekly update: began process of systematically identifying main ways in which Wikidata is transcluded in enwiki and determining how they affect the wbc_entity_usage table. Had been inspecting the table for various examples to identify patterns but I just realized that I could probably use a sandbox page to actually verify without being disruptive. Also coding each instance with these criteria.
Weekly update: set up a meeting for next week to start the onboarding process.
Wed, Jul 15
Happily! I had done some analysis of these sorts of actor signatures a while back with app users to see how stable the signatures are (meta), and so had thought (erroneously) that accept_language wasn't stable on any device; I was glad to find out that it's just the app where it switches.
The URL seemingly points to an old (or, more accurately, out-of-date) version of the repo.
Ooof...thanks for catching that. Easy to fix. I'll start going through some of the other package information to do some of the other cleaning too.
Ok, let's go ahead and enable it then to see!
Tue, Jul 14
I note, after adding the CI stuff, some remedial work might be needed to get things to a better state before moving forward; whether fixing issues or changing the rules/setup used by the Gruntfile
@Reedy correct me if I'm wrong -- in practice, this would not noticeably change anything about our process of pushing changes to the research page? It might fail if the node10-docker has issues with it, but that would be a larger problem and very unlikely something triggered by the research landing page (and therefore likely fixed somewhat quickly because it will affect every other code-base that uses the node10-docker)?
Mon, Jul 13
I'm going to go ahead and claim this epic task as we're looking to begin work on article importance. I'm going to update the task description as well to make this a broader task for the work we're hoping to do around measuring article importance (as opposed to any specific question).
Fri, Jul 10
English Wiki has 15m articles (I believe)
a full enwiki dump is clocking in at 944gb or something insanely large
I'm pretty sure a large part of this issue comes down to how you handle redirects rather than the compression format. Enwiki has 9.3M redirects. Right now the HTML of an article is fully reproduced for a redirect (i.e., not just "redirect to [[article]]" but the full text of that article as the reader would see it). English Wikipedia has just over 6M articles in the classic sense, so reproducing the full article text in the redirects is probably what explodes it to 15M full articles and a very large file (as opposed to 6M full articles plus ~9M very tiny pages that just indicate that they are redirects).
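For concreteness, in wikitext a redirect page is just `#REDIRECT [[Target]]` -- a few bytes -- so a dump-generation pass could keep redirects as stubs rather than materializing the target's full HTML. A toy sketch assuming English-only syntax (other wikis localize the `#REDIRECT` keyword, which is ignored here):

```python
# Toy redirect check: a wikitext redirect page starts with
# "#REDIRECT [[Target]]" (case-insensitive on enwiki). Emitting just the
# target for such pages avoids duplicating the target article's full HTML.
import re

REDIRECT_RE = re.compile(r"^\s*#REDIRECT\s*\[\[([^\]|#]+)", re.IGNORECASE)


def redirect_target(wikitext):
    """Return the redirect target title, or None if the page is a real article."""
    m = REDIRECT_RE.match(wikitext)
    return m.group(1).strip() if m else None
```

A dump writer could then emit ~9M one-line stubs plus ~6M full articles instead of 15M full documents, which is where most of the size blow-up appears to come from.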
Mon, Jul 6
I should be clearer: what I meant is that sendBeacon will consistently fail if and only if the browser is ad-blocking. The failure is systematic: neither the Initiation nor the Response events would be sent, so adblock cannot explain the gap here.
Jun 25 2020
FYI I added a row in the documentation table for this. Feel free to improve
Sounds good, I'll take a look. A good reminder too that I need to update my existing API to use the template.
Resolving this task -- iteration will continue (just added the 2020 Community Insights survey!) but the full draft is complete.
Jun 19 2020
Gaps write-ups from Overleaf copied below. Still some iteration likely but at this stage, I would consider this task complete. @leila let me know if you concur.
- Full first draft completed and added to Overleaf! Will continue to iterate with the team on this for the rest of the quarter.
Weekly update: no progress. End date pushed to August 31st (Betterworks updated).
Jun 15 2020
tl;dr: I'll incorporate some of the below into the literature and metrics sections around accessibility + readership.
Jun 12 2020
Weekly update: added draft of accessibility section to Readers taxonomy on Overleaf.
- Added draft of sociodemographic gaps to taxonomy
- Began going through surveys to identify trends -- e.g., median age of editors vs. country/world population
- Updated missing Editor/Reader Survey categories to Meta to simplify the process of identifying these surveys in the future
Weekly update: no progress
Jun 11 2020
- Paper submitted!
- Will wait to hear initial response from NHB before choosing whether to upload submission to arxiv (if positive, then upload; if negative, then decision to upload depends on what we choose to do with the paper)
I wanted to preserve this info somewhere. We have discussed whether or not the Wikidata statements should be ordered by mwtext (see Examples section here). Here's my current thinking:
Jun 9 2020
Another data point that is interesting in this discussion: Youtube provides Wikipedia articles as fact-checks / context for a variety of conspiracy theories / state-sponsored broadcasting companies. For all of those Wikipedia article links, regardless of platform, they also provide a URL parameter that tells us the person is coming from Youtube. This provides a rare opportunity to compare pageviews that have Youtube as a referrer with pageviews that we know came from Youtube. On top of that, I did some self-experimentation to see how the usage of different apps / browsers affects the Youtube referrer. The summary is that 40% of referrals from Youtube are None referrers, and that this happens when the user starts in the Youtube app and switches to a mobile browser that is not Android+Chrome. This won't fully apply to every app, as they each handle referrers differently, but it does support the idea that app traffic often comes through as None referrers. Hard to know how big a slice of the pie this is, though. The None traffic is about 200 thousand pageviews per day for Youtube, and other apps presumably produce similar or higher traffic counts.
Jun 8 2020
+1 to moving forward with page IDs and addressing the Special pages when the need actually arises. It's exciting to see this functionality be added and I will pass back the information to my team!
Having it in HDFS first would allow it to be more easily used by internal WMF researchers and analysts.
Speaking personally but from the Research team, I also +1 this many many times over. There is so much text-based machine learning and analytics that would be many times easier / faster if we could access HTML in HDFS (because then we can take advantage of the SWAP system). Some recent research also recreated the full parsed HTML revision history for English Wikipedia and noted for example that over half of internal article links are only evident from the parsed article and not the raw wikitext. A few current examples of modeling etc. that would benefit that I know of:
- Parsed versions of Wikipedia articles have way more links / content in them, which can be valuable for ML models like topic classification or quality prediction
- Studying how much and what content is transcluded (has implications for patrolling etc.): T249654
- Measuring the consistency of content in different language versions of the same article: T243256
- Studying citation quality / usage, especially if templates like en:Cite Q see expanded usage in the wikis
- For link recommendation -- i.e. suggesting to a user that they should insert a wikilink into an article -- you might want to verify that the link does not already exist in the article, which would be best done against the parsed version of the article
Jun 5 2020
Template uploaded to Github: https://github.com/wikimedia/research-api-interface-template
- Paper complete. Waiting for the go-ahead from everyone to submit, along with the accompanying letter.
Weekly update: no progress
Weekly update: no progress
Weekly update: no progress
- Turned off public report -- haven't heard anything via email / talk pages
Jun 4 2020
only a statement from @Isaac that Research would prefer page ID.
I don't always stand by things I said a year ago, but in this case, yeah, I would still advocate strongly for using page ID as the preferred identifier for articles. Because QuickSurveys is language-specific, I see no value to using QID and it would add an additional place where things could go wrong (e.g., Wikidata item changes). Titles aren't stable enough as page moves would break the survey logic (pretty common in breaking news topics) and introduce all the standard issues with getting the right normalization, special characters, etc. I think the main challenge with page IDs was that Special pages do not have unique page IDs so could not be sampled under that approach.
Jun 2 2020
Jun 1 2020
Thanks @Nettrom for adding me to this -- I should have known to look for a task like this before :)
May 29 2020
Weekly update: no progress
Weekly update: no progress
- Focused on starting description of methods for executive summary / final report.
- Progress on writing -- goal to submit early next week
- Monitored talk pages / email thread but no responses yet
- Confirmed that data could continue to be collected beyond May 31st
- We will shut down the public-facing report after May 31st, though, so it's clear that it is not being maintained (we will of course still be open to feedback after that point should we hear it)
May 28 2020
The task title uses .wikipedia.org; do you mean .wikimedia.org?
Hah, yes, good catch @RhinosF1 !
May 26 2020
May 22 2020
Weekly update: no progress.
- Updated leaves of the taxonomy with the relevant surveys
- Began writing introduction to taxonomy
- Continued iteration on narrative / results with team
Weekly update: no progress.
- Did rough analysis of first two months of report
- Sent out emails to wiki-research-l + analytics-l about ending of pilot
- In the process of confirming with Privacy that the pilot could run beyond May 30th without raising any additional privacy concerns.