Wed, Nov 13
@leila I'm breaking this next update into a few. First is taking care of the smaller things that don't require reorganization. Does this team page look like what you were expecting? I'll also remove WikiCite from the events page and add the eliciting new editors blogpost. Right now we don't actually have a good spot to put the Understanding Thanks blogpost and something about Doris' Outreachy project, so I'm going to continue to think on that.
Mon, Nov 11
Not sure if we can do it in JupyterHub, but we can probably add something to the MOTD of the stat/notebook hosts, so when people ssh in they'll get instructions about what to do for Kerberos, where to find docs, etc. Nice suggestion, thanks!
- Work on this project will come to a close on December 2nd (and the students will switch to writing)
- TFIDF-based "attention" scores work almost as well as learned attention scores but are slower in training right now. Looking into the issue -- might be a function of the TFIDF scores not being normalized to sum to 1 for a given article.
- For any given language, the hope is to have model performance for the following experimental setups:
- Trained purely on that language. Aligned fastText embeddings. Randomly initialized model weights.
- Model trained on English with all weights frozen (no fine-tuning). Aligned fastText embeddings.
- Model trained on English with final layer fine-tuned to new language. Aligned fastText embeddings.
- Model trained on mixture of examples from different languages. Aligned fastText embeddings. No language-specific fine-tuning.
- For transfer learning problem (fine-tune general model to identify a specific wikiproject), examined a more difficult negative sample (positive = Human rights; negative = Politics articles) and found still quite high F1 (>0.9). Going to look into one-shot learning techniques and embedding cosine-distance as an even simpler approach.
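To make the normalization hypothesis above concrete, here is a minimal pure-Python sketch (function and token names hypothetical) of rescaling per-article TF-IDF scores so they sum to 1, making them directly comparable to learned softmax attention weights:

```python
def normalize_attention(tfidf_scores):
    """Normalize per-token TF-IDF scores so they sum to 1 for one article.
    `tfidf_scores` is a hypothetical dict of token -> raw TF-IDF score."""
    total = sum(tfidf_scores.values())
    if total == 0:
        # guard against empty/all-zero articles: fall back to uniform weights
        n = len(tfidf_scores)
        return {t: 1.0 / n for t in tfidf_scores} if n else {}
    return {t: s / total for t, s in tfidf_scores.items()}

weights = normalize_attention({"rights": 3.2, "treaty": 1.6, "the": 0.2})
# weights sum to 1.0 and preserve the relative TF-IDF ordering
```

The relative ordering of tokens is unchanged; only the scale matches what a softmax-based attention layer would produce.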
Thu, Nov 7
The option that is currently available is a keytab
Ok, that works for me. I'll avoid it but it's good to know it's an option if needed.
@elukey I played around with it and didn't run into any major issues. Thanks for the detailed notes! My only two concerns:
- It doesn't seem that there is a good way to provide a password automatically to kinit (e.g., from a protected text file) so that long-running scripts can automatically renew credentials periodically. This is not a blocker for me but it would be nice to have the ability -- do you have any suggestions?
- Is the suggested workflow for running a SWAP notebook to open a terminal window in JupyterHub and kinit before starting a PySpark kernel? That works but I wasn't sure if there was a way to kinit from the notebook itself or another suggested approach.
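On the first point, one common pattern (sketched below; the keytab path and principal are hypothetical) is to have the long-running script periodically re-run kinit against a keytab rather than a password file -- kinit's `-k -t` flags authenticate from a keytab without prompting:

```python
import subprocess
import time

def build_kinit_cmd(keytab, principal):
    # kinit -k -t <keytab> <principal>: authenticate from a keytab, no password prompt
    return ["kinit", "-k", "-t", keytab, principal]

def renew_forever(keytab, principal, interval_s=6 * 3600):
    """Re-run kinit every few hours so a long-running job keeps valid credentials."""
    while True:
        subprocess.run(build_kinit_cmd(keytab, principal), check=True)
        time.sleep(interval_s)

# e.g. renew_forever("/etc/security/keytabs/me.keytab", "me@EXAMPLE.REALM")  # hypothetical
```

This is just a sketch of the pattern, not a recommendation over whatever the admins suggest for keytab handling.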
Wed, Nov 6
@DED could you point me towards a photo that you'd like up on the website (https://research.wikimedia.org/team.html) or indicate that you want the "camera shy" default image? Easy to change later, but I should get something up there :)
Thanks @Nuria !
if your survey is happening all in desktop
The survey happens on both desktop and mobile. Unfortunately we can't tell whether the survey responses missing EventLogging were from mobile or desktop :/ But adblocking exists on both platforms.
Tue, Nov 5
@Halfak I didn't want to mess with the work you had done and in some cases didn't know how to merge my suggestions, so I just added them to the etherpad after your section. For posterity, here's what I had come up with from my own work. In general I think they complement your suggestions, but we might need some meeting time as a group at some point to merge everything.
@Jdlrobson I took a look at the survey responses we had and our ability to match them up with EventLogging. The high-level takeaway is that we're still seeing the problem where ~10% of survey responses have no corresponding EventLogging in either the QuickSurveysResponses or QuickSurveyInitiation schemas. It did seem to improve for English, but I don't trust that because it got worse for Russian/Polish. I split the survey responses up between before and after the deployment (2019/10/24 13:06 UTC per above). For the three surveys that we were running, this is what we get:
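For reference, the matching itself reduces to a set difference on instance tokens; a toy sketch (tokens hypothetical) of computing the unmatched fraction:

```python
def unmatched_responses(survey_tokens, eventlogging_tokens):
    """Survey instance tokens with no corresponding EventLogging event.
    Both arguments are hypothetical iterables of QuickSurveys instance tokens."""
    seen = set(eventlogging_tokens)
    missing = [t for t in survey_tokens if t not in seen]
    rate = len(missing) / len(survey_tokens) if survey_tokens else 0.0
    return missing, rate

missing, rate = unmatched_responses(
    ["a1", "b2", "c3", "d4", "e5", "f6", "g7", "h8", "i9", "j0"],
    ["a1", "b2", "c3", "d4", "e5", "f6", "g7", "h8", "i9"],
)
# one of ten tokens lacks EventLogging -> rate of 0.1, mirroring the ~10% observed
```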
Thu, Oct 31
- Pushed back comparison of attention to tf-idf weights to this week.
- Label-specific scores make sense: labels with low numbers of training samples are hard, as are labels like History and Society. The attention-based model does slightly better than the baseline on most labels but there is no topic where it clearly excels.
- File provided that maps page titles to page IDs (taking into account redirects). This will allow for an inlinks/outlinks-based model. An inlinks-only model is at 0.55 micro F1 score with minimal tuning, so still a ways to go (ideally above 0.7 at least)
- Very easy to fine-tune a model to a new label -- e.g., Human Rights. 95% accuracy on this binary task (replace final layer and freeze others) with 5000 positive examples and 5000 random negative examples. We're going to look into harder versions of this task that use negative samples that look more like human rights (e.g., History/Society or Politics/Government) and also how few positive examples can be used.
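As a point of comparison for the few-positive-examples question, an even simpler embedding-similarity baseline could look roughly like this (pure-Python sketch with toy 2-d vectors; real article embeddings would come from the model): score an article by the cosine similarity of its embedding to the centroid of a few positive examples.

```python
import math

def centroid(vectors):
    # element-wise mean of a list of equal-length embedding vectors
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy 2-d "article embeddings" for illustration only
positives = [[1.0, 0.1], [0.9, 0.0], [1.1, 0.2]]
c = centroid(positives)
on_topic = cosine([1.0, 0.05], c)   # article near the positive centroid
off_topic = cosine([0.0, 1.0], c)   # unrelated article
# the on-topic article scores higher than the off-topic one
```

A threshold on this similarity gives a classifier that needs no training beyond computing one centroid, which is what makes it a useful baseline against fine-tuning.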
Wed, Oct 30
Mon, Oct 28
@Halfak Current output bzipped JSON is on stat1007 at /home/isaacj/drafttopic/full_wptemplates.json.bz2
Fri, Oct 25
- LSTM + attention model is working and the attention weights make some qualitative sense (high proportions assigned to words that are intuitively linked to the predicted labels, with most words assigned ~0 probability). The question was raised whether this outperforms a simple non-learned tf-idf-based weighting, which will hopefully be verified next week.
- Baseline model was trained with English embeddings + articles and then tested with Russian embeddings + articles. It failed miserably, though the model achieved comparable performance after fine-tuning. This week we will explore whether the fine-tuning is quick or whether the embeddings provide little value in transfer to other languages.
- This was using the aligned fastText embeddings, so theoretically it should be plug and play :)
- Some data challenges were identified with the inlinks / outlinks. I talked through namespaces and we decided to start just with namespace 0 to avoid accidentally including categories/templates in the training data that link directly to WikiProjects (and therefore the labels). I'm working on providing a mapping between page IDs, titles, and Wikidata IDs (taking into account redirects) for English Wikipedia so we can make sure that all the data is of the same type.
- I'm building a dataset that we can use for transfer learning exploration -- e.g., train a model to predict generic labels and then fine-tune for a specific WikiProject.
As I go to do this analysis, what UTC day/hour should be my cut-off for when QuickSurveys would have switched from the old approach to the new approach?
Yes, putting out country-specific data is still on my TODO list. Just let me know what would be helpful. Regarding KaiOS specifically, we are unfortunately past the 90-day window from the June surveys, so I don't have any of the user-agent information that would let us split survey responses by OS. We are working to get out a Hindi Wikipedia survey though, which might provide further insight.
Thu, Oct 24
Thanks, will do -- I appreciate you being part of my learning curve!
Ahhh apologies @Gilles -- I did not think anything of those surveys because the coverage was set at 0. It looks like you have made them more explicit to avoid someone like me making this error in the future, but is there anything else I should be aware of when deploying surveys? We hope to deploy a few in Hindi (hi), Bahasa Indonesia (id), and Portuguese (pt) in the next month or so.
Wed, Oct 23
Additionally, the Performance team is running their Perceived Performance survey right now.
Thanks for pointing this out -- if I'm not mistaken, that's an internal survey so I'll still extend my external survey until next week. The missing data is only evident in external surveys where we somehow have people responding to the survey via Google Forms with reasonable-looking survey codes and responses but no associated initiation/response eventlogging. Presumably the loss of data happens in internal surveys as well, but we have no second source of data that indicates that we're missing responses.
Tue, Oct 22
Do we think those were the reason for the missing events? What are the next steps? Confirming via data?
Oct 17 2019
Yep, now able to access -- thanks! I'll do my best to test today and report back with an lgtm if no issues arise.
Oct 16 2019
@elukey I'm having trouble ssh-ing into an-tool1006.eqiad.wmnet (ssh firstname.lastname@example.org) -- it won't let me onto the server (doesn't accept my password). Is it possible that I need to be added as a user to the machine, or is it an issue on my end? Thanks!
Oct 14 2019
@diego I didn't write the disinformation lit review into the initial submission but I'll make sure to work with you to put together a slide on it with some takeaways and links for people to follow up on. It might be possible to work some of your findings into the slides on Jonathan's patrolling research too.
@elukey : also happy to help. thanks for reaching out!
- Code in place to preprocess/tokenize wikitext and perform embedding lookups
- Initial pass made at reimplementing the drafttopic model (average of an article's embeddings + simple classifier on top). This ran into a few issues:
- NaNs being returned during training -- turned out that some of the articles were missing labels/text and this was causing issues at the averaging step because of a divide by zero error
- Micro stats look pretty good but macro precision/recall/F1 is quite low: it's not yet clear what's to blame here. It could just be that the model needs more hyperparameter tuning and regularization, but some label balancing might be necessary.
- Slow training: might not be necessary to process the entire article
- Basic code written for an LSTM model: slow training and some mismatch between loss and precision/recall results so we will make sure that everyone is using the same code for computing model statistics
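On the micro-vs-macro gap noted above, a small pure-Python example (toy per-label counts, not project numbers) shows how a frequent, well-predicted label can dominate micro-F1 while a rare, poorly-predicted label drags macro-F1 down -- and why shared evaluation code matters:

```python
def f1(tp, fp, fn):
    """F1 from true-positive / false-positive / false-negative counts."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

# Toy (tp, fp, fn) counts: a frequent label predicted well, a rare one predicted badly
counts = {"STEM": (90, 5, 5), "History": (1, 4, 9)}

# macro: average the per-label F1 scores (each label counts equally)
macro = sum(f1(*c) for c in counts.values()) / len(counts)

# micro: pool the counts across labels, then compute one F1
tp = sum(c[0] for c in counts.values())
fp = sum(c[1] for c in counts.values())
fn = sum(c[2] for c in counts.values())
micro = f1(tp, fp, fn)
# micro ends up much higher than macro because the frequent label dominates the pooled counts
```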
Oct 11 2019
Adding in a few links for reference:
Definitely want to tie this to two projects from the Growth Team: Understanding First Day and Welcome Survey. Specifically, it's worth checking out Morten's reports for these two projects:
Done! Thanks for the reminder @elukey !
Oct 9 2019
Oct 7 2019
I wrote up a more complete report on topic modeling (for assigning topics to page views) and some of my recommendations / takeaways: https://meta.wikimedia.org/wiki/Research:Characterizing_Wikipedia_Reader_Behaviour/Demographics_and_Wikipedia_use_cases/Topic_Analysis
Oct 4 2019
Processing complete -- see individual subtasks for details. Focus on writing / sharing results now.
Analysis complete -- at this point the focus will shift to writing up and sharing the results (which has already been ongoing for several weeks).
Draft write-up here: https://meta.wikimedia.org/wiki/Research:Surveys_on_the_gender_of_editors/Report
- The code for this project will be in PyTorch. The students have the greatest familiarity with this library and a number of popular packages are implemented in it.
Oct 2 2019
Sep 30 2019
I proposed four separate tasks for each student. While all four tasks are focused on labeling a given Wikipedia article with appropriate topics, the tasks address different aspects of this problem:
- Multilingual word embeddings: take existing drafttopic model and explore how to expand this to use multilingual word embeddings (e.g., fastText) to expand the model to languages beyond English.
- Alternative language models: Focus on English and how more advanced language models -- e.g., recurrent neural networks, bidirectional models, or models with attention -- might improve labeling performance.
- Leveraging domain knowledge: How can we use our knowledge of Wikipedia to improve model performance while making the model more efficient -- e.g., using just links on a page or leveraging Wikidata
- Transfer learning: How can we efficiently tune a general topic model to label articles with a new set of labels.
Sep 27 2019
Sep 26 2019
Had no responses to the Village Pump posts or on the meta page so deployed the surveys this morning. The intent is for one month but we will monitor response count / feedback and adjust as needed.
Sep 23 2019
@MGerlach -- sounds good. It might be a useful introduction to Gerrit as well, so next meeting I will walk you through getting set up to make this change.
plwiki => [
    'enabled' => true,
    "name" => "reader-demographics-pl",
    "type" => "external",
    "description" => "Reader-demographics-1-description",
    "link" => "Reader-demographics-1-link",
    "question" => "Reader-demographics-1-message",
    "privacyPolicy" => "Reader-demographics-1-privacy",
    "coverage" => 0.005, // 1 out of 200
    "instanceTokenParameterName" => "entry.1791119923",
    "platforms" => [
        "desktop" => [ "stable" ],
        "mobile" => [ "stable" ],
    ],
]
Sep 20 2019
ruwiki => [
    'enabled' => true,
    "name" => "reader-demographics-ru",
    "type" => "external",
    "description" => "Reader-segmentation-1-description",
    "link" => "Reader-demographics-2-link",
    "question" => "Reader-demographics-1-message",
    "privacyPolicy" => "Reader-demographics-1-privacy",
    "coverage" => 0.00167, // 1 out of 600
    "instanceTokenParameterName" => "entry.1791119923",
    "platforms" => [
        "desktop" => [ "stable" ],
        "mobile" => [ "stable" ],
    ],
]
Sep 19 2019
enwiki => [
    'enabled' => true,
    "name" => "reader-demographics-en",
    "type" => "external",
    "description" => "Reader-demographics-1-description",
    "link" => "Reader-demographics-2-link",
    "question" => "Reader-demographics-1-message",
    "privacyPolicy" => "Reader-demographics-1-privacy",
    "coverage" => 0.000833, // 1 out of 1200
    "instanceTokenParameterName" => "entry.1791119923",
    "platforms" => [
        "desktop" => [ "stable" ],
        "mobile" => [ "stable" ],
    ],
]
Sep 13 2019
this looks great -- thanks @mforns for writing this up so clearly!
Sep 12 2019
I'm currently exploring how to expand the ORES drafttopic model to languages besides English. The approach I have taken is to build a model that predicts the ~40 categories used by the ORES drafttopic model.
Sep 11 2019
Regarding the sampling rate for Polish Wikipedia, their page views are generally about one third of those to Russian Wikipedia (https://tools.wmflabs.org/siteviews/?platform=all-access&source=pageviews&agent=user&range=this-year&sites=ru.wikipedia.org|pl.wikipedia.org) and unique devices are also a bit under one third (https://stats.wikimedia.org/v2/#/pl.wikipedia.org/reading/unique-devices/normal|line|2-year|~total|monthly). Based on this, I will recommend setting the sampling rate to three times that of Russian Wikipedia.
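The arithmetic behind that recommendation, using the ruwiki coverage from the config pasted above (1 out of 600):

```python
ru_coverage = 1 / 600          # Russian Wikipedia sampling rate (from the ruwiki config above)
pl_coverage = 3 * ru_coverage  # plwiki gets ~1/3 the traffic, so sample 3x as often
# 3/600 == 1/200, i.e. a coverage of 0.005 for Polish Wikipedia
```

That keeps the expected number of sampled page views roughly equal across the two wikis.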
Sep 10 2019
Sep 9 2019
Sep 4 2019
Sep 3 2019
Thanks for the ping @Elitre -- I'm comfortable with wrapping up this task if @Trizek-WMF is. There shouldn't be any further community relations support until we start sharing out results, but we can open a new task then if it is deemed necessary.
Aug 26 2019
My first attempt at building a language-independent means of representing article topic -- i.e., grouping article page views into categories regardless of which Wikipedia language edition that article was read in -- was to map each page view to its Wikidata item and then represent the item based on its instance-of / subclass-of properties. An example of how this might work is below. The goal is to build a set of higher-level categories to which any Wikidata item with an instance-of property can be mapped. This is similar to the 14 categories in the Wikidata Concepts Monitor but ideally with no overlap and full coverage -- i.e., all items map deterministically to a single category.
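A minimal sketch of the mapping idea, with toy graphs standing in for real Wikidata data (class names and categories here are invented for illustration): walk subclass-of edges upward from an item's instance-of value until a class with an assigned category is reached.

```python
# Toy graphs for illustration; real edges would come from Wikidata dumps.
subclass_of = {  # hypothetical child -> parent edges (P279)
    "sovereign state": "state",
    "state": "political territorial entity",
}
category_of = {  # top-level classes assigned to reader-facing categories
    "political territorial entity": "Geography",
    "human": "People",
}

def categorize(instance_of_value, max_hops=10):
    """Walk subclass-of upward from an item's instance-of value until a class
    with an assigned category is found; return None if nothing matches."""
    cls = instance_of_value
    for _ in range(max_hops):
        if cls in category_of:
            return category_of[cls]
        if cls not in subclass_of:
            return None
        cls = subclass_of[cls]
    return None

# e.g. an item that is instance-of "sovereign state" walks up to "Geography"
```

Because each walk terminates at the first categorized ancestor, every item maps deterministically to at most one category, which is the no-overlap property described above.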
I presented preliminary results at Wikimania: https://wikimania.wikimedia.org/wiki/2019:Research/Characterizing_Reader_Behavior_on_Wikipedia
Aug 23 2019
- See reader behavior features under T228285 for features that were used in debiasing.
- It was determined that a GradientBoostingClassifier performed best with respect to making the average features (e.g., average pages viewed per session) for the survey respondents match the general population for the wiki -- though LogisticRegression also worked quite well in many cases.
- Wikidata instance-of ended up being relatively uninformative so I might revisit that with drafttopic categories.
- As part of this work, a few changes were required:
- African and Worldwide surveys (English/French) were separated because I realized that weights from debiasing would not be comparable if they came from two separate models (if a single model was used for English or French, country was a very strong predictor of whether someone took the survey or not)
- I trimmed the control sessions to exactly match the survey session timespan because the survey was launched and ended mid-day; without careful control, day of week became a strong predictor of whether someone took the survey or not.
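For readers unfamiliar with the debiasing step: it amounts to inverse-propensity weighting. A classifier (e.g., the GradientBoostingClassifier above) predicts each session's probability of responding to the survey, and respondents are up-weighted by the inverse of that probability. A toy sketch (probabilities invented; clipping and normalization choices are mine, not necessarily what was used):

```python
def inverse_propensity_weights(propensities, clip=20.0):
    """Weights for survey respondents given each respondent's predicted
    probability of taking the survey. Unlikely respondents get up-weighted;
    weights are clipped and normalized to mean 1 for stability."""
    raw = [min(1.0 / p, clip) for p in propensities]
    mean = sum(raw) / len(raw)
    return [w / mean for w in raw]

weights = inverse_propensity_weights([0.5, 0.25, 0.1])
# the least-likely respondent (p=0.1) receives the largest weight
```

Averaging respondent features under these weights is what makes them "match the general population" in the sense described above.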
Aug 22 2019
Aug 18 2019
Aug 13 2019
Surveys completed -- thanks @pmiazga (and any respondents)!
Yes, I've been compiling the feedback and will make sure to include some recommendations regarding the functionality of QuickSurveys for this type of survey. In particular:
- Ability to remove survey without responding explicitly. This should be a relatively straightforward UI fix but would require someone to do the work obviously.
- Ability to opt out completely from surveys like this. This would likely be more complicated as it would have to be a change to the user settings, databases, etc. and we would have to determine if it only applied to QuickSurveys or other extensions as well.
- Confusion around survey re-appearing in new browsers (or the same browser if cookies are deleted). This is how the sampling works but this is not at all evident to respondents, so there can be confusion when someone responds and then sees the survey again. I don't believe we can change the sampling strategy to be aware of whether a user account has already taken the survey in a different browser without compromising privacy, but we should consider having a better explanation of this within the survey to head off concerns for any future deployments.
- General frustration/anger with the presence of the survey (in some part tied to the difficulty of opting out from it).
Aug 9 2019
Sufficient responses have been reached -- the plan is to undeploy these surveys in the first available SWAT deployment by Tuesday (Aug 13).
Aug 8 2019
Thanks for the notification @Jc86035
Aug 7 2019
Sounds good -- FYI I'm responding to some comments on the meta page as I see them: https://meta.wikimedia.org/wiki/Research_talk:Surveys_on_the_gender_of_editors