
MGerlach
User

User Details

User Since
Sep 9 2019, 9:50 AM (48 w, 4 d)
Availability
Available
LDAP User
MGerlach
MediaWiki User
MGerlach (WMF) [ Global Accounts ]

Recent Activity

Today

MGerlach added a comment to T258274: Code and data onboarding for link recommendation project.

Update week 2020-08-10:

Fri, Aug 14, 4:08 PM · Research (FY2020-21-Research-July-September)
MGerlach added a comment to T259363: Questionnaire for community feedback on taxonomy of knowledge gaps.

Update week 2020-08-10:

  • did one iteration on questionnaire doc based on feedback from leila, isaac, miriam
  • decided on format (meta-page and google-forms)
Fri, Aug 14, 4:04 PM · Research (FY2020-21-Research-July-September)
MGerlach added a comment to T259370: Experimental API for unsupervised topic modeling based on reading sessions.

Update week 2020-08-10:

Fri, Aug 14, 4:00 PM · Research (FY2020-21-Research-July-September)

Tue, Aug 11

MGerlach updated the task description for T255702: Research Showcase August 2020.
Tue, Aug 11, 11:12 AM · Research (FY2020-21-Research-July-September)

Mon, Aug 10

MGerlach updated the task description for T255702: Research Showcase August 2020.
Mon, Aug 10, 8:37 AM · Research (FY2020-21-Research-July-September)

Fri, Aug 7

MGerlach added a comment to T258274: Code and data onboarding for link recommendation project.

Update week 2020-08-03:

  • extensive discussion with Djellel on the model
    • clarify pipeline of the current model to generate link recommendations
    • discuss bottlenecks: specifically, mwparserfromhell is crucial for parsing wikicode to get links, but it is slow when parsing a full dump
  • started to discuss current approach to implement backtesting protocol and identify main challenges:
    • regex-heuristics might break down for some languages
    • for most languages we don't have reliable parsers (tokenizers), so we use regex- and other heuristics. this is problematic since i) they could break down for some languages, and ii) extracting sentences by splitting on "." might give spurious sentences.
    • avoid use of articles from some categories (e.g. articles missing citations)
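
The concern about spurious sentences can be illustrated with a minimal sketch. The example text and the `ABBREVIATIONS` set are made up for illustration and are not taken from the project code; the second function is only one possible heuristic, and it relies on whitespace tokenization, which itself would not work for languages written without spaces.

```python
# Made-up example text with abbreviations that contain a period.
text = "Dr. Smith was born in St. Louis. She moved to Paris in 1990."

def split_on_period(text):
    """Naive splitting on '.', as described in the update above."""
    return [part.strip() for part in text.split(".") if part.strip()]

# Hypothetical, English-only abbreviation list (an assumption for this sketch).
ABBREVIATIONS = {"Dr", "St"}

def split_with_abbreviations(text):
    """End a sentence at a token ending in '.', unless it is a known abbreviation."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        if token.endswith(".") and token[:-1] not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:  # keep trailing text that has no final period
        sentences.append(" ".join(current))
    return sentences

naive = split_on_period(text)
guarded = split_with_abbreviations(text)
print(naive)    # four fragments, two of them spurious
print(guarded)  # two sentences
```

The abbreviation list is exactly the kind of language-specific resource that is missing for most wikis, which is why such heuristics are expected to break down there.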

Next:

  • make shared repo and write documentation
Fri, Aug 7, 5:13 PM · Research (FY2020-21-Research-July-September)
MGerlach added a comment to T259363: Questionnaire for community feedback on taxonomy of knowledge gaps.

Update 2020-08-03:

  • drafted first set of questions + framing around it (questionnaire-doc)
  • need to decide on format (e.g. google-forms, meta, other)
Fri, Aug 7, 5:04 PM · Research (FY2020-21-Research-July-September)
MGerlach added a comment to T259370: Experimental API for unsupervised topic modeling based on reading sessions.

Update week 2020-08-03:

  • build pipeline to tune hyperparameters (splitting train-test, prediction and evaluation on testset)
  • need to experiment with smaller sample datasets (e.g. smaller wikis) to perform extensive grid search
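
A minimal sketch of such a tuning pipeline (train-test split, prediction, evaluation on the test set, grid search) is given here. The toy sessions, the co-occurrence model and its two hyperparameters `window` and `min_count` are assumptions chosen to keep the example small; they stand in for the actual model, which is not described in this update.

```python
import itertools
import random
from collections import Counter, defaultdict

def train_test_split(sessions, test_fraction=0.2, seed=0):
    """Shuffle sessions reproducibly and hold out a fraction for testing."""
    shuffled = sessions[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def fit(train, window, min_count):
    """Stand-in model: rank pages by how often they follow a source page."""
    counts = defaultdict(Counter)
    for session in train:
        for i, source in enumerate(session):
            for target in session[i + 1 : i + 1 + window]:
                counts[source][target] += 1
    return {
        source: [page for page, c in followers.most_common() if c >= min_count]
        for source, followers in counts.items()
    }

def recall_at_1(model, test):
    """Fraction of (page, next page) pairs where the top recommendation is correct."""
    hits = total = 0
    for session in test:
        for source, target in zip(session, session[1:]):
            total += 1
            ranked = model.get(source, [])
            if ranked and ranked[0] == target:
                hits += 1
    return hits / total if total else 0.0

# Made-up reading sessions (lists of page titles).
sessions = [
    ["A", "B", "C"], ["A", "B", "D"], ["A", "B", "C"], ["B", "C", "D"],
    ["A", "C", "D"], ["A", "B", "C"], ["B", "C", "A"], ["A", "B", "C"],
    ["C", "D", "A"], ["A", "B", "D"],
]
grid = {"window": [1, 2], "min_count": [1, 2]}  # hypothetical search space

train, test = train_test_split(sessions)
results = []
for window, min_count in itertools.product(grid["window"], grid["min_count"]):
    model = fit(train, window, min_count)
    results.append(((window, min_count), recall_at_1(model, test)))

best_params, best_score = max(results, key=lambda r: r[1])
print(best_params, best_score)
```

Because every grid point retrains the model, the cost grows with the product of the grid sizes, which is one reason to run the search on smaller sample datasets first.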
Fri, Aug 7, 5:02 PM · Research (FY2020-21-Research-July-September)

Fri, Jul 31

MGerlach added a comment to T259370: Experimental API for unsupervised topic modeling based on reading sessions.

Update week 2020-07-27:

Fri, Jul 31, 4:15 PM · Research (FY2020-21-Research-July-September)
MGerlach added a subtask for T258804: Language-Agnostic Topic Modeling: T259370: Experimental API for unsupervised topic modeling based on reading sessions.
Fri, Jul 31, 4:11 PM · Research, Epic
MGerlach added a parent task for T259370: Experimental API for unsupervised topic modeling based on reading sessions: T258804: Language-Agnostic Topic Modeling.
Fri, Jul 31, 4:11 PM · Research (FY2020-21-Research-July-September)
MGerlach created T259370: Experimental API for unsupervised topic modeling based on reading sessions.
Fri, Jul 31, 4:11 PM · Research (FY2020-21-Research-July-September)
MGerlach added a comment to T258274: Code and data onboarding for link recommendation project.

Update week 2020-07-27:

Fri, Jul 31, 3:57 PM · Research (FY2020-21-Research-July-September)
MGerlach added a comment to T259363: Questionnaire for community feedback on taxonomy of knowledge gaps.

Update week 2020-07-27:

Fri, Jul 31, 3:52 PM · Research (FY2020-21-Research-July-September)
MGerlach added a subtask for T242172: Taxonomy of Knowledge Gaps: T259363: Questionnaire for community feedback on taxonomy of knowledge gaps.
Fri, Jul 31, 3:46 PM · Research, Epic
MGerlach added a parent task for T259363: Questionnaire for community feedback on taxonomy of knowledge gaps: T242172: Taxonomy of Knowledge Gaps.
Fri, Jul 31, 3:46 PM · Research (FY2020-21-Research-July-September)
MGerlach created T259363: Questionnaire for community feedback on taxonomy of knowledge gaps.
Fri, Jul 31, 3:46 PM · Research (FY2020-21-Research-July-September)
MGerlach created T259361: Research Showcase September 2020.
Fri, Jul 31, 3:34 PM · Research (FY2020-21-Research-July-September)

Fri, Jul 17

MGerlach added a comment to T255462: Research Showcase July 2020.

Update week 2020-07-13:

  • successfully held showcase for july
  • communications team reached out to me to follow up with speakers as they were very interested to learn more from them about how wikipedia is used (outside of wikipedia)
Fri, Jul 17, 2:55 PM · Research (FY2020-21-Research-July-September)
MGerlach added a comment to T258274: Code and data onboarding for link recommendation project.

Update week 2020-07-13:

  • met for 2 longer discussions with Djellel
    • getting overview of the aim of the project (incl plan for the year to work with product)
    • general approach of the existing algorithm
    • plans for improvement during this quarter, specifically the backtesting protocol for automatic offline evaluation and possible tunings for the algorithm
  • tuning the algorithm:
    • prepared navigation embeddings as additional features for link prediction
  • backtesting protocol:
    • several possibilities came up for creating a dataset of sentences for several languages which contain links to be predicted.
    • the most crucial features we agreed on were:
      • link completeness (how to find sentences for which we are sure that they contain all or most links that should be there); options include: using sentences only from the abstract (to make sure a link is not omitted because it appeared before), sentences from featured or good articles (there exist collections of hundreds/thousands in several wikis [1],[2]); threshold on the link density (links per sentence length)
      • sampling from a diverse set of articles to capture different edge cases and to ensure the task is not too easy, so that it remains informative
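
The link-density option can be sketched as follows. This is only an illustration: the regular expressions are a simplified stand-in for a real wikicode parser such as mwparserfromhell, sentence length is assumed to be measured in whitespace-separated words, and the value of `MIN_DENSITY` is hypothetical.

```python
import re

# Simplified wikilink patterns: [[Target]] or [[Target|anchor text]].
WIKILINK = re.compile(r"\[\[([^\[\]|]+)(?:\|[^\[\]]*)?\]\]")
LINK_TO_ANCHOR = re.compile(r"\[\[(?:[^\[\]|]+\|)?([^\[\]]+)\]\]")

MIN_DENSITY = 0.15  # hypothetical threshold, not a value from the project

def link_density(sentence):
    """Number of wikilinks divided by the number of words in the plain sentence."""
    n_links = len(WIKILINK.findall(sentence))
    plain = LINK_TO_ANCHOR.sub(r"\1", sentence)  # keep only the visible anchor text
    n_words = len(plain.split())
    return n_links / n_words if n_words else 0.0

# Made-up sentences in wikicode.
sentences = [
    "[[Berlin]] is the capital of [[Germany]].",
    "It has many parks.",
    "The [[Spree (river)|Spree]] flows through it.",
]
densities = [link_density(s) for s in sentences]
kept = [s for s, d in zip(sentences, densities) if d >= MIN_DENSITY]
print(densities)
print(kept)
```

A density threshold only approximates link completeness: a sentence can pass the filter and still be missing links, so it would likely be combined with the other options listed above.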
Fri, Jul 17, 2:54 PM · Research (FY2020-21-Research-July-September)
MGerlach created T258274: Code and data onboarding for link recommendation project.
Fri, Jul 17, 2:44 PM · Research (FY2020-21-Research-July-September)

Thu, Jul 16

MGerlach added a comment to T258214: Requesting access to analytics-privatedata-users and researchers for Agaduran.

we have a new formal collaborator onboard: Alberto Garcia Duran. Alberto needs access to HDFS and stat machines for a new research project.

Thu, Jul 16, 6:21 PM · SRE-Access-Requests, Operations
MGerlach created T258214: Requesting access to analytics-privatedata-users and researchers for Agaduran.
Thu, Jul 16, 6:18 PM · SRE-Access-Requests, Operations
MGerlach closed T250356: Characterize reader behaviour: Train and evaluate one deep-learning model on reader sessions in comparison to navigation vectors as Resolved.

added results and documentation to meta-page of the project:
https://meta.wikimedia.org/wiki/Research:Characterizing_Readers_Navigation/Modeling_Reading_Sessions:_First_Round_of_Analysis

Thu, Jul 16, 12:45 PM · Research (FY2019-20-Research-April-June)

Jul 15 2020

MGerlach updated the task description for T255462: Research Showcase July 2020.
Jul 15 2020, 5:51 PM · Research (FY2020-21-Research-July-September)

Jul 13 2020

MGerlach added a comment to T255462: Research Showcase July 2020.

given the topic for this month's showcase, what do you say that we advertise for it in the following places as well?

and the internal slack channel #covid-19-data

Jul 13 2020, 9:41 AM · Research (FY2020-21-Research-July-September)

Jul 10 2020

MGerlach added a comment to T255462: Research Showcase July 2020.

@leila I think advertising to all of these places is a good idea. One concern I have with posting on the respective talk-pages is that it could be perceived as spamming since most of the discussions seem to be centered around specific articles, etc

Jul 10 2020, 1:29 PM · Research (FY2020-21-Research-July-September)

Jul 1 2020

MGerlach updated the task description for T255462: Research Showcase July 2020.
Jul 1 2020, 6:08 PM · Research (FY2020-21-Research-July-September)

Jun 29 2020

MGerlach added a comment to T250356: Characterize reader behaviour: Train and evaluate one deep-learning model on reader sessions in comparison to navigation vectors.

@leila Thanks for your feedback. These are my comments:

While from a theoretical perspective thinking about improving recall@1 can help you with learning more about the sessions, a friendly reminder here :) that for the majority of the Wikimedia related applications I can think of you will need recall@3, recall@5 or even a higher k. (I see in the documentation that you're computing those as well, which is great.)

Yes, we should also calculate recall@3 in order to compare directly to the RelatedArticles-recommendations. So far, we calculated recall@k for k = 1, 10, 50, 100 and observe a huge improvement across the full range of k, though the biggest increase seems to be for small k. This is further corroborated by the improvement in the MRR-metric (the mean reciprocal rank): instead of a binary judgement whether the target was among the top-k recommendations, it weighs the ranks (average of the inverse rank).

Can you compute/estimate the upperbound on recall@k? I generally expect recall@1 to be low, however, I wonder how far you are from what's possible to achieve?

Not sure I understand the upper bound on recall@k correctly; my interpretation is that the upper bound should be 1. recall@k here is the fraction of items in the test-set (pairs of source+target articles) for which the target-article is among the top-k recommendations based on the source article. Thus, for the upper bound to be below 1, for some pairs the target would not be among the top-k recommendations -- besides out-of-vocabulary items, what would be the rationale why a perfect algorithm would not be able to do that?
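
To make the two metrics concrete, here is a minimal sketch of recall@k and MRR as defined in this comment. The recommendation lists and test pairs are made-up toy data; the third pair has an out-of-vocabulary target, which is the case where even a perfect ranker could not score a hit.

```python
from typing import Dict, List, Sequence, Tuple

def recall_at_k(ranked: Sequence[str], target: str, k: int) -> float:
    """1.0 if the target is among the top-k recommendations, else 0.0."""
    return 1.0 if target in ranked[:k] else 0.0

def reciprocal_rank(ranked: Sequence[str], target: str) -> float:
    """1/rank of the target (ranks start at 1); 0.0 if it is not recommended."""
    for rank, page in enumerate(ranked, start=1):
        if page == target:
            return 1.0 / rank
    return 0.0

def mean(values: Sequence[float]) -> float:
    return sum(values) / len(values) if values else 0.0

# Toy ranked recommendations per source article (made-up data).
recommendations: Dict[str, List[str]] = {
    "A": ["B", "C", "D"],
    "B": ["D", "A", "C"],
    "C": ["A", "B", "D"],
}
# Test set of (source, target) pairs; "E" never appears in any ranking.
test_pairs: List[Tuple[str, str]] = [("A", "B"), ("B", "C"), ("C", "E")]

recall_1 = mean([recall_at_k(recommendations[s], t, 1) for s, t in test_pairs])
recall_3 = mean([recall_at_k(recommendations[s], t, 3) for s, t in test_pairs])
mrr = mean([reciprocal_rank(recommendations[s], t) for s, t in test_pairs])
print(recall_1, recall_3, mrr)
```

In this toy example recall@3 cannot exceed 2/3 because one of the three targets is out-of-vocabulary, while MRR additionally rewards placing the target near the top of the list.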

As a side-note: now that Carol Dunn (the VP of Product) is with us, it's good to call for a meeting with Bob, Carol, Jon Katz, Margeigh, and maybe Grant as this is a topic of interest to him, plus you and I to discuss long term plans for products around readership.

Agree. Will bring this up in our next meeting to see how to move forward along these lines.

Jun 29 2020, 9:37 AM · Research (FY2019-20-Research-April-June)

Jun 26 2020

MGerlach added a comment to T250356: Characterize reader behaviour: Train and evaluate one deep-learning model on reader sessions in comparison to navigation vectors.

Update week 2020-06-22:

  • finishing first results for evaluation of the models on different wikis (documentation)
  • Results are really impressive:
    • previous navigation vectors (word2vec) perform worse than or equal to the simple text-based heuristic (RelatedArticles) for recommending the next article in a reading session; this captures qualitative online evaluation
    • in contrast, the LSTM model yields a huge improvement in all evaluation metrics. For example recall@1 (i.e. whether the first recommendation is the actual next pageview in the session) more than doubles (!) from 0.139 to 0.280.
Jun 26 2020, 3:18 PM · Research (FY2019-20-Research-April-June)
MGerlach updated the task description for T250356: Characterize reader behaviour: Train and evaluate one deep-learning model on reader sessions in comparison to navigation vectors.
Jun 26 2020, 3:07 PM · Research (FY2019-20-Research-April-June)
MGerlach added a comment to T250349: Build the first complete draft of the taxonomy of content gaps.

Update week 2020-06-22:

  • review on metrics for content-gender-gap
    • Isaac gave thoughtful and detailed feedback on the initial draft, which I incorporated
    • put a version on meta
  • helped miriam on the writing of the taxonomy, especially the part on the content (formatting tables, etc.)
Jun 26 2020, 3:06 PM · Research (FY2019-20-Research-April-June)
MGerlach closed T243079: Run office hours WMF/Research&Analytics as Resolved.

Update week 2020-06-22:

  • ran office hours (2 participants)
Jun 26 2020, 2:59 PM · Research (FY2019-20-Research-April-June)

Jun 23 2020

MGerlach updated the task description for T255702: Research Showcase August 2020.
Jun 23 2020, 8:17 AM · Research (FY2020-21-Research-July-September)
MGerlach updated the task description for T255462: Research Showcase July 2020.
Jun 23 2020, 8:16 AM · Research (FY2020-21-Research-July-September)

Jun 19 2020

MGerlach added a comment to T250356: Characterize reader behaviour: Train and evaluate one deep-learning model on reader sessions in comparison to navigation vectors.

Update week 2020-06-15:

  • waiting for some calculations to finish due to updates on stat1005 with respect to tensorflow
Jun 19 2020, 2:22 PM · Research (FY2019-20-Research-April-June)
MGerlach updated the task description for T250349: Build the first complete draft of the taxonomy of content gaps.
Jun 19 2020, 2:19 PM · Research (FY2019-20-Research-April-June)
MGerlach added a comment to T250349: Build the first complete draft of the taxonomy of content gaps.

Update week 2020-06-15:

  • finished write-up of literature review on different metrics for the content gender gap in this document
  • worked with Miriam on different sections of the taxonomy write-up on overleaf:
    • merged the related-work section with the introduction
    • wrote a first draft of the introduction section
Jun 19 2020, 2:15 PM · Research (FY2019-20-Research-April-June)
MGerlach added a comment to T243079: Run office hours WMF/Research&Analytics.

Update week 2020-06-15:

  • sent out invitations for June office hours on mailing lists:
  • experimenting with alternating time-slots from month to month to cover a wider range of timezones
Jun 19 2020, 2:09 PM · Research (FY2019-20-Research-April-June)
MGerlach added a comment to T250352: Prepare for co-organizing the monthly Research Showcase.

Update week 2020-06-15:

  • co-moderated the june showcase
  • confirmed 2 speakers for showcase in august on readership and navigation (Taha Yasseri, Dimitar Dimitrov)
  • created phab-tickets for showcases in july (T255462) and august (T255702)
Jun 19 2020, 2:01 PM · Research (FY2019-20-Research-April-June)

Jun 17 2020

MGerlach created T255702: Research Showcase August 2020.
Jun 17 2020, 6:20 PM · Research (FY2020-21-Research-July-September)

Jun 15 2020

MGerlach created T255462: Research Showcase July 2020.
Jun 15 2020, 5:08 PM · Research (FY2020-21-Research-July-September)

Jun 12 2020

MGerlach added a comment to T250356: Characterize reader behaviour: Train and evaluate one deep-learning model on reader sessions in comparison to navigation vectors.

Update week 2020-06-08:

  • running evaluation of reader-embeddings in next-article prediction for 14 different wikis of different sizes:
  • Next:
Jun 12 2020, 5:00 PM · Research (FY2019-20-Research-April-June)
MGerlach added a comment to T250349: Build the first complete draft of the taxonomy of content gaps.

Update week 2020-06-08:

  • spent substantial time on writing for the taxonomy (overleaf)
    • made a full iteration on Section 3 on methodology; reorganized the section on the structure of the taxonomy to give consistent overview on the different levels (gap, facet, dimension).
    • added some additional references to the content-taxonomy
    • Next: make an iteration on the Related-work section
Jun 12 2020, 4:47 PM · Research (FY2019-20-Research-April-June)
MGerlach added a comment to T243079: Run office hours WMF/Research&Analytics.

Update week 2020-06-08:

  • None
Jun 12 2020, 4:39 PM · Research (FY2019-20-Research-April-June)
MGerlach added a comment to T250352: Prepare for co-organizing the monthly Research Showcase.

Update week 2020-06-08:

  • july showcase: confirmed 2 speakers for the showcase on medical content in Wikipedia (Giovanni Colavizza, Denise Smith)
  • august/september showcases:
    • in discussion with djellel, agreed on 2 themes: i) readership (Martin) and ii) censorship (djellel)
    • sent out invitations for 2 speakers for the readership showcase
Jun 12 2020, 4:38 PM · Research (FY2019-20-Research-April-June)

Jun 5 2020

MGerlach added a comment to T250356: Characterize reader behaviour: Train and evaluate one deep-learning model on reader sessions in comparison to navigation vectors.

Update week 2020-06-01:

  • extensive discussions with collaborators finalizing design choices for consistent evaluation of different algorithms on the different datasets
  • plan is to run evaluations in the next 2 weeks and document the results
Jun 5 2020, 3:13 PM · Research (FY2019-20-Research-April-June)
MGerlach added a comment to T250349: Build the first complete draft of the taxonomy of content gaps.

Update week 2020-06-01:

  • spent substantial time on writing for the taxonomy on the overleaf
    • made a full round of edits/comments for readership and content
    • wrote the section on future work (2 pages)
Jun 5 2020, 2:13 PM · Research (FY2019-20-Research-April-June)
MGerlach added a comment to T243079: Run office hours WMF/Research&Analytics.

Update week 2020-06-02:

  • None
Jun 5 2020, 12:39 PM · Research (FY2019-20-Research-April-June)
MGerlach added a comment to T250352: Prepare for co-organizing the monthly Research Showcase.

Update week 2020-06-01:

  • sent out invitations to speakers for july showcase on medical content (giovanni colavizza already accepted)
  • added showcase theme on readership with several possible speakers
Jun 5 2020, 12:24 PM · Research (FY2019-20-Research-April-June)
MGerlach updated the task description for T250352: Prepare for co-organizing the monthly Research Showcase.
Jun 5 2020, 12:21 PM · Research (FY2019-20-Research-April-June)

May 29 2020

MGerlach added a comment to T250356: Characterize reader behaviour: Train and evaluate one deep-learning model on reader sessions in comparison to navigation vectors.

Update week 2020-05-25:

  • created datasets (train, dev, test) of reading sessions for 14 different wikipedias (same as in why the world reads wikipedia) with the aim of showing that results hold across different wikis (small, large, language, etc)
  • considerably extended documentation of methods in write-up
May 29 2020, 4:16 PM · Research (FY2019-20-Research-April-June)
MGerlach added a comment to T250349: Build the first complete draft of the taxonomy of content gaps.

Update week 2020-05-25:

  • did one iteration of comments/changes on the readership-gaps in the overleaf
  • added references for the content-gaps in the overleaf
  • collected literature to consider and organize for a detailed discussion of the gender content gap here
    • several community initiatives/discussions
    • 4 tools to measure/visualize gender content gaps
    • 19 papers
May 29 2020, 4:07 PM · Research (FY2019-20-Research-April-June)
MGerlach added a comment to T243079: Run office hours WMF/Research&Analytics.

Update week 2020-05-25:

  • had successful office hours on 2020-05-27
  • fixed office hours for june:
May 29 2020, 3:58 PM · Research (FY2019-20-Research-April-June)
MGerlach added a comment to T250352: Prepare for co-organizing the monthly Research Showcase.

Update week 2020-05-25:

May 29 2020, 3:55 PM · Research (FY2019-20-Research-April-June)

May 22 2020

MGerlach added a comment to T250352: Prepare for co-organizing the monthly Research Showcase.

Update week 2020-05-18

May 22 2020, 4:24 PM · Research (FY2019-20-Research-April-June)
MGerlach updated the task description for T250352: Prepare for co-organizing the monthly Research Showcase.
May 22 2020, 4:20 PM · Research (FY2019-20-Research-April-June)
MGerlach added a comment to T243079: Run office hours WMF/Research&Analytics.

Update week 2020-05-18:

  • sent invitations for May office hours on mailing lists:
    • Wiki-research: wiki-research-l@lists.wikimedia.org
    • Wikidata: wikidata@lists.wikimedia.org
    • Analytics: analytics@lists.wikimedia.org
May 22 2020, 4:18 PM · Research (FY2019-20-Research-April-June)
MGerlach added a comment to T250356: Characterize reader behaviour: Train and evaluate one deep-learning model on reader sessions in comparison to navigation vectors.

Update week 2020-05-18:

May 22 2020, 4:15 PM · Research (FY2019-20-Research-April-June)
MGerlach added a comment to T250349: Build the first complete draft of the taxonomy of content gaps.

Week update 2020-05-18:

  • wrote first draft of section on content gap explaining content gaps in more detail here
    • what is a content gap
    • which sources we consulted
    • what are the facets and why did we include them
  • gave section to Miriam for feedback and comments
May 22 2020, 4:00 PM · Research (FY2019-20-Research-April-June)
MGerlach updated the task description for T250349: Build the first complete draft of the taxonomy of content gaps.
May 22 2020, 3:56 PM · Research (FY2019-20-Research-April-June)

May 15 2020

MGerlach added a comment to T250349: Build the first complete draft of the taxonomy of content gaps.

Week update 2020-05-11:

  • adding revised content taxonomy to unified table
  • discussed outline of document, starting to write explanation and description of facets around content-components
May 15 2020, 5:02 PM · Research (FY2019-20-Research-April-June)
MGerlach added a comment to T250353: Characterize Reader beahviour: For at least one wiki, create list of related article pages which co-occur in sessions involving articles on covid-19..

Update week 2020-05-11:

  • started to build an API to query a list of related articles; this was one feedback on my talk-page upon sharing the list of related articles wrt covid
  • Discussed with Alex Stinson to extend similar analysis to better understand social dimensions of covid related to xenophobia and gender
May 15 2020, 4:51 PM · Research (FY2019-20-Research-January-March)
MGerlach added a comment to T250356: Characterize reader behaviour: Train and evaluate one deep-learning model on reader sessions in comparison to navigation vectors.

Update week 2020-05-11:

  • discussion about onboarding Alberto (postdoc in Bob's group) to project
  • first results for LSTM-model trained on reading sessions show substantial improvement for next-page-recommendation (metric = mean-reciprocal-rank, larger is better):
    • wikispeedia sessions: MRR = 0.4 vs MRR = 0.22 from word2vec-baseline
    • simplewiki sessions: MRR = 0.3 vs MRR = 0.17 from word2vec-baseline
    • allwiki sessions: MRR = 0.12 vs MRR = 0.16 from word2vec-baseline; for LSTM we can only use a small fraction of the data for training bc of computational limitations (only 1 GPU). still figuring out how to work around.
  • Planned improvements:
    • compare to baseline from morelike-search, the currently deployed recommendation using text-based features
    • assess performance in use-case of automatic list building using data from wikiprojects
May 15 2020, 4:21 PM · Research (FY2019-20-Research-April-June)
MGerlach added a comment to T243079: Run office hours WMF/Research&Analytics.

Update week 2020-05-11:

  • None
May 15 2020, 3:31 PM · Research (FY2019-20-Research-April-June)
MGerlach added a comment to T250352: Prepare for co-organizing the monthly Research Showcase.

Update week 2020-05-11:

  • agreed on 3 possible themes for the next 2 showcases (1 backup)
    • medical content
    • bias/fairness
    • censorship
  • started to collect list of speakers for the themes here
  • planning to invite speakers starting next week
May 15 2020, 3:30 PM · Research (FY2019-20-Research-April-June)

May 7 2020

MGerlach added a comment to T250349: Build the first complete draft of the taxonomy of content gaps.

Week update 2020-05-04:

  • performed literature review on: notability, language complexity, cultural context
  • Gathering first round of feedback from team members on updates to content-taxonomy, reached out to asaf bartov to get additional comments
  • ongoing discussions on revising structure of taxonomy:
    • which policies to reflect (in particular notability since it plays such a crucial role; suggestion: as a cause)
    • language complexity in terms of accessibility and knowledge as a service (grouped with content types)
  • making structure consistent with miriam's proposal: objective, source, cause/barrier, how to measure
May 7 2020, 6:47 PM · Research (FY2019-20-Research-April-June)
MGerlach added a comment to T250356: Characterize reader behaviour: Train and evaluate one deep-learning model on reader sessions in comparison to navigation vectors.

Update week 2020-05-04:

  • starting to run LSTM-models on gpus with Akhil
  • working around computational challenges: since we have only 1 gpu, the amount of data we can use is limited (so far ~1M sessions); considering sessions from all wikis we have many more different pages, so we are undersampling (sparsity), which leads to poor accuracy. we are testing different strategies to mitigate this:
    • Reduce vocabulary by only looking at top-k pages
    • Choose single wiki, e.g. simplewiki
    • initialize LSTM-layer with word2vec-embeddings which we can train for full data
  • ongoing: evaluation metric: supervised classification task on reader sessions from reader survey (reader motivation)
  • discussing with Bob about another postdoc, Alberto from his group, joining the project.
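
The first mitigation strategy, reducing the vocabulary to the top-k pages, can be sketched as follows. The toy sessions are made up, and mapping the remaining pages to a shared `UNK` token (rather than dropping them) is an assumption of this sketch, not something stated in the update.

```python
from collections import Counter

UNK = "<UNK>"  # hypothetical placeholder for pages outside the vocabulary

def build_vocab(sessions, k):
    """Return the k most frequently viewed pages across all sessions."""
    counts = Counter(page for session in sessions for page in session)
    return {page for page, _ in counts.most_common(k)}

def reduce_sessions(sessions, vocab):
    """Replace every page outside the vocabulary with the UNK token."""
    return [[page if page in vocab else UNK for page in session] for session in sessions]

# Made-up reading sessions (lists of page titles).
sessions = [["A", "B", "C"], ["A", "B", "D"], ["A", "E"]]
vocab = build_vocab(sessions, k=2)
reduced = reduce_sessions(sessions, vocab)
print(vocab)
print(reduced)
```

With a fixed amount of training data, a smaller vocabulary means more observations per page, which addresses the sparsity problem at the cost of not being able to recommend rare pages.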
May 7 2020, 6:41 PM · Research (FY2019-20-Research-April-June)
MGerlach added a comment to T250353: Characterize Reader beahviour: For at least one wiki, create list of related article pages which co-occur in sessions involving articles on covid-19..

Update week 2020-05-04:

  • list was used by wikidata-wikiprojects for covid link
  • ongoing discussions with alex stinson about expanding approach:
    • support collaboration of Partnerships team with UN-groups organizing webinar on human rights/gender perspective of readers during covid
    • make a tool to generate such lists available to community for future events
May 7 2020, 4:13 PM · Research (FY2019-20-Research-January-March)
MGerlach updated the task description for T250353: Characterize Reader beahviour: For at least one wiki, create list of related article pages which co-occur in sessions involving articles on covid-19..
May 7 2020, 4:08 PM · Research (FY2019-20-Research-January-March)
MGerlach added a comment to T243079: Run office hours WMF/Research&Analytics.

Update week 2020-05-04:

  • considering/thinking through plan to experiment with video office hours for next time
May 7 2020, 4:08 PM · Research (FY2019-20-Research-April-June)
MGerlach added a comment to T250351: Co-mentor a Google Summer of Code student (Inequality metrics on wikistats).

Update week 2020-05-04:

  • proposal was not accepted for gsoc
May 7 2020, 4:07 PM · Research (FY2019-20-Research-April-June)
MGerlach added a comment to T250352: Prepare for co-organizing the monthly Research Showcase.

Update week 2020-05-04:

  • first discussion with Djellel to plan showcases starting from July
  • will collect a set of themes around which we plan the showcases
  • after selection of themes will look for potential speakers to invite
May 7 2020, 4:06 PM · Research (FY2019-20-Research-April-June)
MGerlach added a comment to T248964: Implement inequality metrics for WikiStats.

@Quasipodo sorry to hear that the project was not selected.
I am still happy to keep an eye on this task and help out where I can if you decide to pick up any of this anyway.

May 7 2020, 1:40 PM · Analytics-Wikistats, Analytics

Apr 30 2020

MGerlach added a comment to T250349: Build the first complete draft of the taxonomy of content gaps.

Week update 2020-04-27:

  • Performed literature review on content/information types (images, citations, structured data) and policy-related content (npov, verifiability), and general resources for deciding whether to include indicator or not (MTP, movement strategy, knowledge gap whitepaper, etc)
  • Revised content taxonomy, see latest version here
    • 3 facets: policy (npov, verifiability), content types (images, structured data), local&diverse content (gender, cultural context, topics); where the latter contains previously proposed indicators
    • proposed 4 columns for each indicator: why to include (e.g. movement strategy, scientific literature), how to measure (e.g. verifiability via citations), objective (when would the gap be closed), causes (what could be possible interventions); does not need to be filled completely, explicitly acknowledge if we lack insight.
Apr 30 2020, 9:05 AM · Research (FY2019-20-Research-April-June)
MGerlach updated the task description for T250356: Characterize reader behaviour: Train and evaluate one deep-learning model on reader sessions in comparison to navigation vectors.
Apr 30 2020, 8:57 AM · Research (FY2019-20-Research-April-June)
MGerlach added a comment to T250356: Characterize reader behaviour: Train and evaluate one deep-learning model on reader sessions in comparison to navigation vectors.

Update week 2020-04-27:

  • Defined set of evaluation metrics
    • recommendation: i) next-page prediction, ii) target (final)-article prediction, iii) session-prediction (preliminary)
    • classification of reader motivation from reader survey
  • starting to train supervised classifier for reader motivation using navigation embeddings from word2vec
  • starting to experiment with LSTM sequence model to train reader sessions
Apr 30 2020, 8:56 AM · Research (FY2019-20-Research-April-June)
MGerlach added a comment to T250353: Characterize Reader beahviour: For at least one wiki, create list of related article pages which co-occur in sessions involving articles on covid-19..

Update week 2020-04-27:

  • published list for 10 wikis and summary of method on meta
  • Alex Stinson communicated list to community, e.g. Wikidata WikiProject Covid-19
  • gathering feedback, e.g. what are possibilities to publish data or make results interactive to explore via API
Apr 30 2020, 8:50 AM · Research (FY2019-20-Research-January-March)
MGerlach added a comment to T243079: Run office hours WMF/Research&Analytics.

Update week 2020-04-27:

  • Held office hour on 2020-04-29
  • fixed date for May office hour: 2020-05-27
Apr 30 2020, 8:46 AM · Research (FY2019-20-Research-April-June)
MGerlach added a comment to T250351: Co-mentor a Google Summer of Code student (Inequality metrics on wikistats).

Update week 2020-04-27: None (waiting for acceptance on May 4)

Apr 30 2020, 8:45 AM · Research (FY2019-20-Research-April-June)
MGerlach added a comment to T250352: Prepare for co-organizing the monthly Research Showcase.

Update week 2020-04-27: Starting the planning with @DED next week.

Apr 30 2020, 8:44 AM · Research (FY2019-20-Research-April-June)

Apr 22 2020

MGerlach added a comment to T250349: Build the first complete draft of the taxonomy of content gaps.

Week update 2020-04-20:

  • started literature review for content types on use of images and citations (see here)
Apr 22 2020, 5:03 PM · Research (FY2019-20-Research-April-June)
MGerlach added a comment to T250356: Characterize reader behaviour: Train and evaluate one deep-learning model on reader sessions in comparison to navigation vectors.

Update week 2020-04-20:

  • ran grid-search for word2vec-model to tune hyperparameters in order to establish baseline
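
A grid search of this kind can be sketched as an exhaustive loop over parameter combinations. This is an illustration only, under stated assumptions: `grid_search` and `score_fn` are hypothetical names, and `score_fn` stands in for "train a word2vec model with these settings and return an evaluation score", which is not shown here.

```python
from itertools import product


def grid_search(param_grid, score_fn):
    """Try every combination in param_grid and return the best (params, score).

    param_grid: dict mapping a parameter name to a list of candidate values.
    score_fn:   hypothetical callable taking the parameters as keyword
                arguments and returning a score where higher is better.
    """
    names = sorted(param_grid)
    best_params, best_score = None, float("-inf")
    for combo in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, combo))
        score = score_fn(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score
```

For example, `grid_search({"vector_size": [50, 100], "window": [3, 5]}, score_fn)` would call `score_fn` four times; the parameter names are placeholders for whichever hyperparameters are being tuned.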
Apr 22 2020, 5:00 PM · Research (FY2019-20-Research-April-June)
MGerlach added a comment to T250353: Characterize Reader behaviour: For at least one wiki, create list of related article pages which co-occur in sessions involving articles on covid-19..

Update week 2020-04-20:

  • started writing a summary of the results for metawiki to get feedback on the methodology's potential, both for campaign building and in general
Apr 22 2020, 4:58 PM · Research (FY2019-20-Research-January-March)
MGerlach added a comment to T243079: Run office hours WMF/Research&Analytics.

Update week 2020-04-20: None

Apr 22 2020, 4:57 PM · Research (FY2019-20-Research-April-June)
MGerlach added a comment to T250351: Co-mentor a Google Summer of Code student (Inequality metrics on wikistats).

Update week 2020-04-20: None (waiting for acceptance on May 4)

Apr 22 2020, 4:55 PM · Research (FY2019-20-Research-April-June)
MGerlach added a comment to T250352: Prepare for co-organizing the monthly Research Showcase.

Update week 2020-04-20: None (short week)

Apr 22 2020, 4:54 PM · Research (FY2019-20-Research-April-June)

Apr 17 2020

MGerlach updated subscribers of T250349: Build the first complete draft of the taxonomy of content gaps.

Week update 2020-04-13:

  • sourcing relevant literature and tools that investigate the gender gap and explore different content types in Wikipedia (such as the Wikidata Concepts Monitor)
Apr 17 2020, 5:23 PM · Research (FY2019-20-Research-April-June)
MGerlach updated subscribers of T250356: Characterize reader behaviour: Train and evaluate one deep-learning model on reader sessions in comparison to navigation vectors.

Update week 2020-04-13:

  • built pipeline to run word2vec-model in order to establish a baseline for the evaluation
  • discussing with Akhil/Tiziano/Bob which metrics are informative for evaluating models: i) recommendation (e.g. next or target pageview in session) and ii) classification (external labels from survey on, e.g., reading intent)
  • obtaining GPU-access for team to run models
Apr 17 2020, 5:21 PM · Research (FY2019-20-Research-April-June)
MGerlach updated subscribers of T250353: Characterize Reader behaviour: For at least one wiki, create list of related article pages which co-occur in sessions involving articles on covid-19..
Apr 17 2020, 5:15 PM · Research (FY2019-20-Research-January-March)
MGerlach added a comment to T250353: Characterize Reader behaviour: For at least one wiki, create list of related article pages which co-occur in sessions involving articles on covid-19..

Update week 2020-04-13:

  • built pipeline to create list of related articles to covid-19 based on reading sessions using approach from navigation vectors, detailed write-up here
  • created first version of lists for 10 different wikis together with pageview statistics from current and previous week
  • got feedback from Alex Stinson and members from community (via Alex) to refine list of articles (e.g. removing disambiguation pages and overrepresentation of technical articles on viruses)
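
As a rough illustration of the idea behind these lists, the sketch below counts which pages co-occur in sessions that contain at least one seed article. It is a simplification and not the pipeline described above, which uses the navigation-vectors approach: the function name `related_by_cooccurrence`, the session format (lists of page titles), and the `exclude` parameter (e.g. for disambiguation pages, per the feedback) are all assumptions.

```python
from collections import Counter


def related_by_cooccurrence(sessions, seed_articles, exclude=(), top_n=10):
    """Rank pages by the number of sessions they share with any seed article.

    sessions:      iterable of sessions, each a list of page titles.
    seed_articles: titles that define the topic (e.g. covid-19 articles).
    exclude:       titles to drop from the result (e.g. disambiguation pages).
    """
    seeds = set(seed_articles)
    skipped = seeds | set(exclude)
    counts = Counter()
    for session in sessions:
        pages = set(session)  # count each page at most once per session
        if pages & seeds:
            counts.update(pages - skipped)
    return counts.most_common(top_n)
```

Counting each page once per session keeps a single long session from dominating the ranking.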
Apr 17 2020, 5:15 PM · Research (FY2019-20-Research-January-March)
MGerlach added a comment to T243079: Run office hours WMF/Research&Analytics.

Update week 2020-04-13:

  • iterated with team on improvements (here)
  • sent out announcement for April edition to mailing lists since we moved 1 week later in the month
Apr 17 2020, 5:09 PM · Research (FY2019-20-Research-April-June)
MGerlach updated subscribers of T250351: Co-mentor a Google Summer of Code student (Inequality metrics on wikistats).

Update week 2020-04-13:

  • iterated on proposal
  • proposal unlikely to be accepted, as there is a strict lower limit of 30 hrs/week for the time commitment (proposal states 20 hrs/week)
Apr 17 2020, 5:06 PM · Research (FY2019-20-Research-April-June)
MGerlach updated subscribers of T250352: Prepare for co-organizing the monthly Research Showcase.

Update week 2020-04-13: None

Apr 17 2020, 5:03 PM · Research (FY2019-20-Research-April-June)

Apr 16 2020

MGerlach added a comment to T248964: Implement inequality metrics for WikiStats.

@Quasipodo perhaps you already know, but I came across the Wikidata knowledge imbalance dashboard, which calculates the Gini coefficient for different Wikidata items and does a good job of putting the obtained numbers into context and explaining what they mean. It may be a useful reference and inspiration for how to present metrics.
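
For reference, a minimal sketch of how a Gini coefficient can be computed from a list of non-negative counts (e.g. edits per user). This uses the standard rank-based formula and is not taken from the dashboard's code; the function name `gini` is an assumption.

```python
def gini(values):
    """Gini coefficient of non-negative values: 0 means perfectly equal.

    Uses G = 2 * sum(i * x_i) / (n * sum(x)) - (n + 1) / n,
    where x_i are the values in ascending order and i starts at 1.
    """
    ordered = sorted(values)
    if any(v < 0 for v in ordered):
        raise ValueError("values must be non-negative")
    n = len(ordered)
    total = sum(ordered)
    if n == 0 or total == 0:
        return 0.0  # convention chosen here for empty or all-zero input
    weighted = sum(rank * v for rank, v in enumerate(ordered, start=1))
    return 2.0 * weighted / (n * total) - (n + 1) / n
```

With this formula the maximum for n values is (n - 1) / n, reached when a single entry holds everything, so small samples never reach exactly 1.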

Apr 16 2020, 10:04 AM · Analytics-Wikistats, Analytics
MGerlach created T250356: Characterize reader behaviour: Train and evaluate one deep-learning model on reader sessions in comparison to navigation vectors.
Apr 16 2020, 9:33 AM · Research (FY2019-20-Research-April-June)
MGerlach created T250353: Characterize Reader behaviour: For at least one wiki, create list of related article pages which co-occur in sessions involving articles on covid-19..
Apr 16 2020, 9:00 AM · Research (FY2019-20-Research-January-March)
MGerlach created T250352: Prepare for co-organizing the monthly Research Showcase.
Apr 16 2020, 8:54 AM · Research (FY2019-20-Research-April-June)