MGerlach (Martin Gerlach)
Research Scientist

User Details

User Since
Sep 9 2019, 9:50 AM (92 w, 6 d)
Availability
Available
IRC Nick
mgerlach
LDAP User
MGerlach
MediaWiki User
MGerlach (WMF) [ Global Accounts ]

Recent Activity

Fri, Jun 18

MGerlach added a comment to T272727: Start developing metrics for content-diversity gaps.

Update week 2021-06-14:

  • created short summary-presentation of results for final round of community engagement (to be sent out next week)
  • submitted proposal for session at Wikimania around indicators for wikimedia projects (in collaboration with work around indicators for knowledge integrity and community health)
  • plan for the remaining weeks is to finish the non-technical documentation and upload to the meta project-page
Fri, Jun 18, 3:29 PM · Research (FY2020-21-Research-April-June)
MGerlach added a comment to T272731: In-depth analysis of link-recommendation model .

Update week 2021-06-14:

  • implemented changes to the model based on volunteer-feedback (T279434)
  • retrained the model for the 4 existing wikis and 7 new wikis (T284481);
  • added results of the backtesting evaluation to the project-page on meta. results look promising: the performance is not negatively affected by the changes. model-performance in the 7 new wikis selected for deployment looks promising (no red-flags, performance similar/better to already deployed wikis).
Fri, Jun 18, 3:13 PM · Research (FY2020-21-Research-April-June)
MGerlach added a comment to T284666: Add a link: unnecessary articles on units are often suggested.

@Ankan_WMF Do you have one or two example articles where you observed the suggestions you mention above? We have made some changes to the model based on your feedback (T279434) and would like to check manually, before deploying, whether this resolves the issue or creates new problems. Thanks.

Fri, Jun 18, 3:08 PM · Patch-For-Review, Growth-Team (Current Sprint), Add-Link, Growth-Structured-Tasks
MGerlach added a comment to T279434: Add a link: algorithm improvements: Define filter for not linking specific articles types.

I trained the model with the updated filter for the 4 initial wikis (ar,bn,cs,vi), the 7 new wikis mentioned in T284481, and enwiki (for comparison). The backtesting evaluation suggests that performance is ok (precision is similar, recall decreases slightly); see detailed results on the project-page on meta.

Fri, Jun 18, 3:03 PM · Patch-For-Review, Growth-Team (Current Sprint), Growth-Team-Filtering, Growth-Structured-Tasks
MGerlach added a comment to T284481: Deploy Add a link to the second set of wikis.
  • Trained the models for each wiki on stat1008 (data should be available in the staging-db)
  • Backtesting evaluation does not raise any flags about performance of the model in these wikis; similar to other wikis, i.e. for default threshold 0.5 we get precision ~80% with recall ~40% (or better). I added the detailed results for this round of training/evaluation to the project-page on meta.
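The precision and recall figures quoted above depend on the chosen threshold. A minimal sketch of how such numbers can be computed for one held-out example is shown below; the function name, the toy scores, and the toy ground truth are all made up for illustration and are not taken from the mwaddlink code.

```python
def precision_recall(scored_links, true_links, threshold=0.5):
    """Precision and recall of the links predicted at a given threshold.

    scored_links: dict mapping a candidate link to the model's probability.
    true_links: set of links that are really present (the held-out truth).
    """
    # A candidate counts as a prediction when its score reaches the threshold.
    predicted = {link for link, p in scored_links.items() if p >= threshold}
    true_pos = len(predicted & true_links)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(true_links) if true_links else 0.0
    return precision, recall


# Toy, made-up data (not results from the evaluation).
scores = {"a": 0.9, "b": 0.7, "c": 0.6, "d": 0.55, "e": 0.3, "f": 0.2}
truth = {"a", "b", "c", "e", "f", "g", "h", "i"}

print(precision_recall(scores, truth, threshold=0.5))
print(precision_recall(scores, truth, threshold=0.8))
```

Raising the threshold in this toy example keeps fewer candidates, so precision goes up while recall goes down, which is the usual trade-off behind a default such as 0.5.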
Fri, Jun 18, 2:56 PM · Growth-Team (Current Sprint), Add-Link, CommRel-Specialists-Support (Apr-Jun-2021)
MGerlach updated the task description for T284481: Deploy Add a link to the second set of wikis.
Fri, Jun 18, 2:52 PM · Growth-Team (Current Sprint), Add-Link, CommRel-Specialists-Support (Apr-Jun-2021)

Fri, Jun 11

MGerlach added a comment to T272727: Start developing metrics for content-diversity gaps.

Update week 2021-06-07:

Fri, Jun 11, 3:22 PM · Research (FY2020-21-Research-April-June)
MGerlach added a comment to T272731: In-depth analysis of link-recommendation model .

Update week 2021-06-07:

Fri, Jun 11, 3:20 PM · Research (FY2020-21-Research-April-June)
MGerlach updated the task description for T279993: Research Showcase June 2021.
Fri, Jun 11, 3:08 PM · Research (FY2020-21-Research-April-June)
MGerlach added a comment to T279434: Add a link: algorithm improvements: Define filter for not linking specific articles types.

@kostajh ack.

Fri, Jun 11, 10:58 AM · Patch-For-Review, Growth-Team (Current Sprint), Growth-Team-Filtering, Growth-Structured-Tasks
MGerlach updated the task description for T279434: Add a link: algorithm improvements: Define filter for not linking specific articles types.
Fri, Jun 11, 9:50 AM · Patch-For-Review, Growth-Team (Current Sprint), Growth-Team-Filtering, Growth-Structured-Tasks
MGerlach added a comment to T279434: Add a link: algorithm improvements: Define filter for not linking specific articles types.

Based on the suggestions in T279434#7143095 T279434#7093628 T284666 I am adding the following entity-types to the filter (all links that are an instance of these entities are removed from the set of candidates):

Fri, Jun 11, 9:49 AM · Patch-For-Review, Growth-Team (Current Sprint), Growth-Team-Filtering, Growth-Structured-Tasks

Thu, Jun 10

MGerlach added a subtask for T279434: Add a link: algorithm improvements: Define filter for not linking specific articles types: T284666: Add a link: unnecessary articles on units are often suggested.
Thu, Jun 10, 10:55 AM · Patch-For-Review, Growth-Team (Current Sprint), Growth-Team-Filtering, Growth-Structured-Tasks
MGerlach added a parent task for T284666: Add a link: unnecessary articles on units are often suggested: T279434: Add a link: algorithm improvements: Define filter for not linking specific articles types.
Thu, Jun 10, 10:55 AM · Patch-For-Review, Growth-Team (Current Sprint), Add-Link, Growth-Structured-Tasks
MGerlach added a comment to T284666: Add a link: unnecessary articles on units are often suggested.

I guess we'd want the wikidata items for units of measurement?

Thu, Jun 10, 10:45 AM · Patch-For-Review, Growth-Team (Current Sprint), Add-Link, Growth-Structured-Tasks

Tue, Jun 8

MGerlach added a comment to T279434: Add a link: algorithm improvements: Define filter for not linking specific articles types.

Because this blocks us deploying to more wikis (T284481), we want to prioritize it. @kostajh -- is this something that Growth engineers do, or that @MGerlach does?

Tue, Jun 8, 10:19 AM · Patch-For-Review, Growth-Team (Current Sprint), Growth-Team-Filtering, Growth-Structured-Tasks
MGerlach added a comment to T279993: Research Showcase June 2021.

updates week 2021-06-07:

  • moving showcase from june 16 to june 23 due to all-hands 2021
Tue, Jun 8, 9:08 AM · Research (FY2020-21-Research-April-June)
MGerlach updated the task description for T279993: Research Showcase June 2021.
Tue, Jun 8, 9:07 AM · Research (FY2020-21-Research-April-June)

Fri, Jun 4

MGerlach added a comment to T272727: Start developing metrics for content-diversity gaps.

Update week 2021-05-31:

Fri, Jun 4, 4:22 PM · Research (FY2020-21-Research-April-June)
MGerlach added a comment to T272731: In-depth analysis of link-recommendation model .

Update week 2021-05-31:

Fri, Jun 4, 3:59 PM · Research (FY2020-21-Research-April-June)
MGerlach added a comment to T279993: Research Showcase June 2021.

updates week 2021-05-31:

Fri, Jun 4, 3:53 PM · Research (FY2020-21-Research-April-June)

Thu, Jun 3

MGerlach closed T276315: Outreachy Application Task: Tutorial for Wikipedia Clickstream data as Resolved.
Thu, Jun 3, 1:40 PM · Outreachy (Round 22)
MGerlach closed T276315: Outreachy Application Task: Tutorial for Wikipedia Clickstream data, a subtask of T275608: Outreachy Project: Build a tool for analyzing and visualizing reader navigation on wikipedia., as Resolved.
Thu, Jun 3, 1:40 PM · Research (FY2020-21-Research-April-June), Outreachy (Round 22), Outreach-Programs-Projects

Tue, Jun 1

MGerlach added a comment to T279434: Add a link: algorithm improvements: Define filter for not linking specific articles types.

Hello @MGerlach, some suggestions are for dates in cswiki. Here are some examples:

  • in "Let Korean Air 858", "27. července" was suggested to link to "27. červenec" (an article about a date), "12. listopadu" was suggested to link to "12. listopad" (an article about a date) and "18. listopadu" was suggested to link to "18. listopad"
  • in "Letiště Edinburgh", "60. let" was suggested to link to "1960-1969" (an article about a decade) and "20. století" was suggested to link to "20. století" (an article about a century)
  • in "Eduard Pagáč", "8. května" was suggested to link to "8. květen" (an article about a date) and "12. března" was suggested to link to "12. březen" (an article about a date).

Could those be excluded, please? Thanks!

Tue, Jun 1, 7:55 AM · Patch-For-Review, Growth-Team (Current Sprint), Growth-Team-Filtering, Growth-Structured-Tasks

Thu, May 27

MGerlach added a comment to T283715: Add a link in bnwiki: algorithm improvements: articles are not being suggested at their first appearance.

I can reproduce that behaviour for the first example (article: ফ্রেড_বারাট, anchor: ইনিংস) . From what I can see this comes from the difference of the strings "ইনিংস" and "ইনিংসে".

Thu, May 27, 5:10 PM · Add-Link, Growth-Structured-Tasks, Growth-Team
MGerlach added a comment to T272727: Start developing metrics for content-diversity gaps.

Update week 2021-05-24:

  • finished processing of data to generate relevant tables with scores for the different gaps
  • created a first set of visualizations for the selection and extent metric for all 5 gaps for different wikis; we will likely make some iterations since this will be important when getting final round of feedback from community
  • we are finishing the technical documentation on how to calculate the gaps (including the visualization)
  • we are preparing a final round of community feedback now that we have calculated a set of metrics for different wikis
Thu, May 27, 3:20 PM · Research (FY2020-21-Research-April-June)
MGerlach added a comment to T272731: In-depth analysis of link-recommendation model .

Update week 2021-05-24:

  • revised the paper and submitted in time for the deadline. the next step is to put a version on arxiv to share publicly.
  • started to work on improvements: T283715 mentions parsing issues in bnwiki
Thu, May 27, 3:12 PM · Research (FY2020-21-Research-April-June)
MGerlach renamed T283821: Add information about Research office hours on research landing page (events) from Add information about Research office hours to events on research landing page to Add information about Research office hours on research landing page (events).
Thu, May 27, 1:57 PM · Research
MGerlach created T283821: Add information about Research office hours on research landing page (events).
Thu, May 27, 1:57 PM · Research

Wed, May 26

MGerlach added a comment to T279434: Add a link: algorithm improvements: Define filter for not linking specific articles types.

Some examples of suggested articles on days of the year:

  1. এরোমাঙ্গা সেনসেই: suggesting ১০ নভেম্বর (November 10), ৯ এপ্রিল (April 9), ২৫ জুন (June 25)
  2. মার্গারিটা সালাস: suggesting ৩০ নভেম্বর (November 30), ৭ নভেম্বর (November 7)
  3. ডালিয়া গ্রাইবস্কেইট: suggesting ১ মার্চ (March 1)
  4. ইপ্সিতা পাটি: suggesting ১৮ জুন (June 18)

Figure from example 2:


I can provide more examples if needed.

Wed, May 26, 11:16 AM · Patch-For-Review, Growth-Team (Current Sprint), Growth-Team-Filtering, Growth-Structured-Tasks

Fri, May 21

MGerlach added a comment to T272727: Start developing metrics for content-diversity gaps.

Update week 2021-05-17:

  • we were able to finish a first iteration on processing a snapshot of the dumps: retrieving all articles of a wiki relevant to a specific gap, assigning them to one of the gap's groups, and retrieving relevant features to calculate selection and extent scores; this was done for one snapshot of all Wikipedia-projects for the 5 gaps under consideration
  • the next step is to derive a lighter table only containing the selection and extent metrics for the respective gaps to be used for generating the histograms for each project/gap
  • next priority is to finish the technical documentation for generating tables and scores for easier replication in follow-up tasks
Fri, May 21, 3:02 PM · Research (FY2020-21-Research-April-June)
MGerlach closed T279992: Research Showcase May 2021 as Resolved.

Update week 2021-05-17:

Fri, May 21, 2:49 PM · Research (FY2020-21-Research-April-June)
MGerlach updated the task description for T279992: Research Showcase May 2021.
Fri, May 21, 2:47 PM · Research (FY2020-21-Research-April-June)
MGerlach added a comment to T272731: In-depth analysis of link-recommendation model .

Update week 2021-05-17:

  • unfortunately, paper submitted to KDD was not accepted despite relatively positive reviews (3 weak accepts and no major reject)
  • currently working on revising the manuscript using feedback from reviewers; in addition, we have a chance to include more data from the recent manual evaluation with the volunteers/ambassadors T278864
  • plan is to submit to CIKM in the applied research track next week (deadline: 2021-05-26)
Fri, May 21, 2:43 PM · Research (FY2020-21-Research-April-June)

May 17 2021

MGerlach updated subscribers of T279434: Add a link: algorithm improvements: Define filter for not linking specific articles types.

All dates of the year like "12 May" have to be filtered. They are subclasses of Q14795564.

@geraki thanks for this catch.
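The filter described in this comment can be sketched as a simple set intersection. Everything below is an illustrative assumption rather than the actual mwaddlink implementation: the function name, the `types_of` mapping (taken here to already contain each article's relevant Wikidata types), and the placeholder ID `Q_EXAMPLE_AIRPORT` are hypothetical; only Q14795564 comes from the comment.

```python
# Entity types whose articles should never be suggested as link targets.
# Q14795564 is the item named in the comment above.
BLOCKED_TYPES = {"Q14795564"}


def filter_candidates(candidates, types_of, blocked=BLOCKED_TYPES):
    """Drop candidates whose Wikidata types overlap the blocked set.

    candidates: list of candidate article titles.
    types_of: dict mapping a title to the set of its Wikidata type IDs
              (hypothetical lookup; titles without an entry are kept).
    """
    return [c for c in candidates if not (types_of.get(c, set()) & blocked)]


# Toy lookup table; "Q_EXAMPLE_AIRPORT" is a made-up placeholder ID.
types_of = {
    "12 May": {"Q14795564"},
    "Edinburgh Airport": {"Q_EXAMPLE_AIRPORT"},
}

print(filter_candidates(["12 May", "Edinburgh Airport", "Unknown"], types_of))
```

Keeping the blocked types in one set makes it cheap to extend the filter as further article types are reported.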

May 17 2021, 5:12 PM · Patch-For-Review, Growth-Team (Current Sprint), Growth-Team-Filtering, Growth-Structured-Tasks

May 10 2021

MGerlach updated the task description for T279992: Research Showcase May 2021.
May 10 2021, 2:14 PM · Research (FY2020-21-Research-April-June)

May 7 2021

MGerlach added a comment to T272727: Start developing metrics for content-diversity gaps.

Update week 2021-05-04:

May 7 2021, 5:10 PM · Research (FY2020-21-Research-April-June)

May 4 2021

MGerlach added a comment to T276315: Outreachy Application Task: Tutorial for Wikipedia Clickstream data.

Hi everyone,
the final application deadline has passed. I wanted to thank you for all the hard work and effort you put into your submissions. You all did a really good job, not only in your analysis and notebooks but also in being curious, asking questions, and helping each other out!

May 4 2021, 11:15 AM · Outreachy (Round 22)

May 2 2021

MGerlach added a comment to T276315: Outreachy Application Task: Tutorial for Wikipedia Clickstream data.

Hi all,
just a reminder: if you have not done so already, don't forget to submit your final application on the Outreachy website before the deadline on Monday, May 3 at 4pm UTC (less than 24 hours).
Even if you sent your notebook to Isaac or me for feedback during the past weeks (thanks to anyone who shared their progress), you still need to submit the application on the Outreachy site.

May 2 2021, 6:13 PM · Outreachy (Round 22)

Apr 30 2021

MGerlach added a comment to T272727: Start developing metrics for content-diversity gaps.

Update week 2021-04-26:

Apr 30 2021, 4:16 PM · Research (FY2020-21-Research-April-June)
MGerlach added a comment to T276315: Outreachy Application Task: Tutorial for Wikipedia Clickstream data.

Hello mentors, @MGerlach
I am Ananya Mahato, an Outreachy applicant. Due to my semester exams as well as poor health, I wasn't able to contribute to any project. But since the contribution deadline has been extended, I wish to ask whether I can contribute to this project now, or whether there is already a sufficient number of applicants.

Also adding @Isaac here

Apr 30 2021, 7:57 AM · Outreachy (Round 22)

Apr 29 2021

MGerlach added a comment to T276315: Outreachy Application Task: Tutorial for Wikipedia Clickstream data.

Just a reminder, guys: don't forget to submit your final application on the Outreachy website. We have another 25 and a half hours. The deadline is April 30 2021, 4 PM UTC.

Apr 29 2021, 4:13 PM · Outreachy (Round 22)

Apr 23 2021

MGerlach moved T272727: Start developing metrics for content-diversity gaps from FY2020-21-Research-January-March to FY2020-21-Research-April-June on the Research board.
Apr 23 2021, 3:59 PM · Research (FY2020-21-Research-April-June)
MGerlach added a comment to T272727: Start developing metrics for content-diversity gaps.

Update week 2021-04-19:

  • continued to work on the write-up
  • prepared slides for the presentation of initial results for next tuesday-meeting
Apr 23 2021, 3:58 PM · Research (FY2020-21-Research-April-June)

Apr 22 2021

MGerlach added a comment to T276315: Outreachy Application Task: Tutorial for Wikipedia Clickstream data.

Hello @MGerlach and @Isaac, right from the start of my contribution I have broken the microtask down into further tasks based on my understanding. Can I reflect the same while recording contributions on the Outreachy website, providing links to the respective notebooks? Is there any limit on the number of contributions? TIA

Apr 22 2021, 9:39 AM · Outreachy (Round 22)

Apr 19 2021

MGerlach added a comment to T276315: Outreachy Application Task: Tutorial for Wikipedia Clickstream data.

Hello, @MGerlach and @Isaac this is the last question in the Outreachy final application :

Please work with your mentor to provide a timeline of the work you plan to accomplish on the project and what tasks you will finish at each step. Make sure to take into account any time commitments you have during the Outreachy internship round.

So I would like to know if it is ok to use the phases described in T275608 as steps in the project timeline, and if you have any suggestions or guidance about it.

Apr 19 2021, 5:39 PM · Outreachy (Round 22)
MGerlach added a comment to T280112: Fix import paths in utils.py when importing MySqlDict .

@kostajh patch for a quick fix: https://gerrit.wikimedia.org/r/c/research/mwaddlink/+/681095

Apr 19 2021, 3:24 PM · Growth-Team (Current Sprint), Growth-Team-Filtering, Add-Link
MGerlach added a comment to T280112: Fix import paths in utils.py when importing MySqlDict .

@MGerlach do you have capacity to work on a patch for this?

Apr 19 2021, 10:37 AM · Growth-Team (Current Sprint), Growth-Team-Filtering, Add-Link
MGerlach added a comment to T276315: Outreachy Application Task: Tutorial for Wikipedia Clickstream data.

Hi everyone,
welcome to everyone who joined since the last posts. Great to see the ongoing discussion.

Apr 19 2021, 10:35 AM · Outreachy (Round 22)

Apr 16 2021

MGerlach added a comment to T272727: Start developing metrics for content-diversity gaps.

Update week 2021-04-12:

  • started this week a write-up of the work in this document.
    • this focuses on 5 content gaps and lays out a strategy/proposal for measuring those gaps
    • the aim was to bring together the individual pieces into a coherent story; this will serve as a comprehensive guideline for implementation containing the justification for the choices made along the way.
    • we will probably spend the next week filling in the gaps
Apr 16 2021, 3:57 PM · Research (FY2020-21-Research-April-June)
MGerlach added a project to T275608: Outreachy Project: Build a tool for analyzing and visualizing reader navigation on wikipedia.: Research (FY2020-21-Research-April-June).
Apr 16 2021, 3:49 PM · Research (FY2020-21-Research-April-June), Outreachy (Round 22), Outreach-Programs-Projects

Apr 14 2021

MGerlach created T280112: Fix import paths in utils.py when importing MySqlDict .
Apr 14 2021, 9:01 AM · Growth-Team (Current Sprint), Growth-Team-Filtering, Add-Link

Apr 13 2021

MGerlach created T279993: Research Showcase June 2021.
Apr 13 2021, 8:47 AM · Research (FY2020-21-Research-April-June)
MGerlach created T279992: Research Showcase May 2021.
Apr 13 2021, 8:45 AM · Research (FY2020-21-Research-April-June)

Apr 8 2021

MGerlach updated the task description for T279427: Republish datasets with primary key ID column included.
Apr 8 2021, 10:45 PM · Patch-For-Review, Growth-Team (Current Sprint), Add-Link
MGerlach added a comment to T272727: Start developing metrics for content-diversity gaps.

Update week 2021-04-05:

  • discussed strategies for validating the set of articles relevant for a gap. Based on a set of labels (in most cases from Wikidata, to scale across all languages), what is the precision (the share of selected articles that have the correct label) and what is the recall (the share of relevant articles that are captured with the used labels)? The recall is important when comparing the total number of articles with respect to a gap in a project. While for the gender gap the Wikidata-labels related to gender have a high coverage of biography-articles, we know that other labels in Wikidata for identifying other gaps have a much lower and biased coverage (such as P172); this in turn would make estimates of selection unreliable.
  • reviewed literature around gaps of sexual orientation and age/recency. We discussed in more detail what is the best way to identify relevant articles
Apr 8 2021, 6:35 PM · Research (FY2020-21-Research-April-June)
MGerlach added a comment to T272731: In-depth analysis of link-recommendation model .

Update week 2021-04-05:

  • went through evaluation from volunteers T278864
    • precision of recommendations around 70-90% similar to offline backtesting evaluation (very encouraging)
    • helped fix issues around character encoding leading to poor performance in viwiki T279037
    • from comments identified possible improvements for the model in next iteration T279434 (not recommending links that are of a certain type, such as calendar dates), T279519 (avoiding links in specific sections such as "Sources"), T279521 (Improve parsing to generate anchor placement)
  • helped find possible solutions around performance issues T279411; as well as republishing all datasets T279427
Apr 8 2021, 6:17 PM · Research (FY2020-21-Research-April-June)
MGerlach updated the task description for T279427: Republish datasets with primary key ID column included.
Apr 8 2021, 2:16 PM · Patch-For-Review, Growth-Team (Current Sprint), Add-Link
MGerlach closed T279508: run-pipeline.sh fails for enwiki, a subtask of T261396: Add a link: engineering tasks for initial release, as Resolved.
Apr 8 2021, 10:09 AM · Add-Link, Growth-Structured-Tasks
MGerlach closed T279508: run-pipeline.sh fails for enwiki as Resolved.
Apr 8 2021, 10:09 AM · Growth-Team, Add-Link
MGerlach updated the task description for T279427: Republish datasets with primary key ID column included.
Apr 8 2021, 6:18 AM · Patch-For-Review, Growth-Team (Current Sprint), Add-Link

Apr 7 2021

MGerlach committed rRMWA3f5c1fdf08df: Limiting use of resources in cluster to prevent spark-jobs from failing. (authored by MGerlach).
Limiting use of resources in cluster to prevent spark-jobs from failing.
Apr 7 2021, 1:41 PM
MGerlach added a comment to T279508: run-pipeline.sh fails for enwiki.

This seems to be caused by the job taking too many resources on the cluster (and thus being killed).
According to the documentation on wikitech I should add the following arguments when starting the spark-job:

--conf spark.dynamicAllocation.maxExecutors=64

In the discussion on #wikimedia-analytics, Joseph recommended to set this to 128 (but not more).
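As an illustration of how this setting could be passed along, here is a minimal sketch that builds a `spark-submit` command line with the executor cap. The helper name and the script name `train_job.py` are hypothetical; the configuration key and the upper bound of 128 are the ones mentioned in the comment above.

```python
MAX_RECOMMENDED_EXECUTORS = 128  # upper bound recommended in the discussion


def spark_submit_args(script, max_executors=64):
    """Build a spark-submit command that caps dynamic executor allocation."""
    if not 1 <= max_executors <= MAX_RECOMMENDED_EXECUTORS:
        raise ValueError(
            f"max_executors must be between 1 and {MAX_RECOMMENDED_EXECUTORS}"
        )
    return [
        "spark-submit",
        "--conf",
        f"spark.dynamicAllocation.maxExecutors={max_executors}",
        script,
    ]


# "train_job.py" is a placeholder script name, not a file from the repository.
print(" ".join(spark_submit_args("train_job.py")))
```

The resulting list can be handed to `subprocess.run` as-is; validating the cap up front avoids accidentally requesting more executors than the cluster maintainers advise.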

Apr 7 2021, 10:28 AM · Growth-Team, Add-Link
MGerlach added a subtask for T253278: Add a link: link recommendation algorithm: T279521: Add a link: algorithm improvements: Improve parsing of text for generating anchor-text candidates.
Apr 7 2021, 10:07 AM · Growth-Structured-Tasks, Growth-Team
MGerlach added a parent task for T279521: Add a link: algorithm improvements: Improve parsing of text for generating anchor-text candidates: T253278: Add a link: link recommendation algorithm.
Apr 7 2021, 10:07 AM · Growth-Team-Filtering, Add-Link, Growth-Team
MGerlach created T279521: Add a link: algorithm improvements: Improve parsing of text for generating anchor-text candidates.
Apr 7 2021, 10:06 AM · Growth-Team-Filtering, Add-Link, Growth-Team
MGerlach added a subtask for T253278: Add a link: link recommendation algorithm: T279519: Add a link: algorithm improvements: Avoid recommending links in sections that usually don't have links.
Apr 7 2021, 9:52 AM · Growth-Structured-Tasks, Growth-Team
MGerlach added a parent task for T279519: Add a link: algorithm improvements: Avoid recommending links in sections that usually don't have links: T253278: Add a link: link recommendation algorithm.
Apr 7 2021, 9:51 AM · Growth-Team-Filtering, Add-Link, Growth-Team
MGerlach created T279519: Add a link: algorithm improvements: Avoid recommending links in sections that usually don't have links.
Apr 7 2021, 9:50 AM · Growth-Team-Filtering, Add-Link, Growth-Team
MGerlach added a subtask for T261396: Add a link: engineering tasks for initial release: T279508: run-pipeline.sh fails for enwiki.
Apr 7 2021, 9:09 AM · Add-Link, Growth-Structured-Tasks
MGerlach added a parent task for T279508: run-pipeline.sh fails for enwiki: T261396: Add a link: engineering tasks for initial release.
Apr 7 2021, 9:09 AM · Growth-Team, Add-Link
MGerlach created T279508: run-pipeline.sh fails for enwiki.
Apr 7 2021, 9:09 AM · Growth-Team, Add-Link

Apr 6 2021

MGerlach added a comment to T279434: Add a link: algorithm improvements: Define filter for not linking specific articles types.

@MGerlach -- thank you for creating this task. Could you imagine these types of filters being configured downstream at the wiki-level? For instance, if one community did want to link to dates, and another did not? Or does it need to happen upstream at model training?

Apr 6 2021, 6:02 PM · Patch-For-Review, Growth-Team (Current Sprint), Growth-Team-Filtering, Growth-Structured-Tasks
MGerlach added a comment to T279434: Add a link: algorithm improvements: Define filter for not linking specific articles types.

I dug a bit deeper into the wikidata-ontology to identify article-types related to the calendar such as years or centuries.

Apr 6 2021, 5:33 PM · Patch-For-Review, Growth-Team (Current Sprint), Growth-Team-Filtering, Growth-Structured-Tasks
MGerlach added a comment to T279130: Add a link engineering: exclude good and featured articles.

Good/featured flags can be accessed internally as Wikidata badges, so it can't be too complicated to do it directly either, although I never worked with that part of the codebase.

Not sure if this is helpful or you might already know all of this.
You can get the relevant articles via the wikidata query service.

Apr 6 2021, 5:20 PM · Add-Link, Growth-Team (Current Sprint), Growth-Structured-Tasks
MGerlach added a comment to T279434: Add a link: algorithm improvements: Define filter for not linking specific articles types.

As a potential guide, I counted how often a given article-type appears as a link. For example, the most common article-type for links in enwiki is the Wikidata-item Q5 (human) with more than 38M occurrences (i.e. biographies). There are only 639 links to articles that are an instance of "century" (Q578) at rank 9068.
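The counting described here can be sketched with a `Counter`. The function name and the toy data are illustrative assumptions, not the code or numbers behind the figures quoted above; only the item IDs Q5 and Q578 come from the comment.

```python
from collections import Counter


def count_link_types(link_targets, instance_of):
    """Count how often each Wikidata type appears among link targets.

    link_targets: iterable of linked article titles, one entry per link.
    instance_of: dict mapping a title to its list of Wikidata type IDs.
    """
    counts = Counter()
    for target in link_targets:
        # A target without known types contributes nothing to the counts.
        for qid in instance_of.get(target, ()):
            counts[qid] += 1
    return counts


# Toy data: every occurrence of a link counts, not just distinct targets.
links = ["Ada Lovelace", "Alan Turing", "20th century", "Ada Lovelace"]
instance_of = {
    "Ada Lovelace": ["Q5"],
    "Alan Turing": ["Q5"],
    "20th century": ["Q578"],
}

for rank, (qid, n) in enumerate(count_link_types(links, instance_of).most_common(), 1):
    print(rank, qid, n)
```

Sorting with `most_common()` gives the rank of each type, which is how a rarely linked type such as "century" can be spotted far down the list.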

Apr 6 2021, 3:34 PM · Patch-For-Review, Growth-Team (Current Sprint), Growth-Team-Filtering, Growth-Structured-Tasks
MGerlach archived P15181 link-recommendation entities (instance-of).
Apr 6 2021, 3:24 PM
MGerlach created P15181 link-recommendation entities (instance-of).
Apr 6 2021, 3:22 PM
MGerlach created T279434: Add a link: algorithm improvements: Define filter for not linking specific articles types.
Apr 6 2021, 2:13 PM · Patch-For-Review, Growth-Team (Current Sprint), Growth-Team-Filtering, Growth-Structured-Tasks
MGerlach added a comment to T278864: Add a link: evaluate link recommendation (Mar 30 2021).

@MMiller_WMF thanks for the summary of the evaluation.

Apr 6 2021, 9:09 AM · Growth-Team-Filtering, User-Urbanecm_WMF (Ambassador), Growth-Structured-Tasks, Growth-Team

Apr 1 2021

MGerlach added a comment to T275608: Outreachy Project: Build a tool for analyzing and visualizing reader navigation on wikipedia..

Is there any specific way to submit our work for the review and checks??

Apr 1 2021, 5:57 PM · Research (FY2020-21-Research-April-June), Outreachy (Round 22), Outreach-Programs-Projects
MGerlach added a comment to T272726: Evaluate list-building tools for ad-hoc topic modeling.

Update week 2021-03-29:

Apr 1 2021, 5:11 PM · Research (FY2020-21-Research-April-June)
MGerlach added a comment to T272727: Start developing metrics for content-diversity gaps.

Update week 2021-03-29:

  • polishing the write-up of the general procedure to identify relevant content and choosing relevant metrics
  • discussed some of the specifics around some of the gaps, specifically time/recency and cultural context. For example, there are different possibilities for how to choose which articles contain time references, or how we can assess the accuracy of identifying the set of articles we associate with a gap. For gender, both precision and recall are known to be very high; for gaps such as time or sexual orientation, we do not know. This will likely require some form of manual assessment of a smaller random subsample. Ensuring high recall (or at least an absence of systematic bias) is important for making sure that metrics such as selection are trustworthy.
  • we started to plan the work related to the prototype-implementation of metrics for 5 gaps: gender, geography, cultural background, time, and sexual orientation. the plan is to be able to calculate the relevant metrics for each gap for one Wikipedia for one snapshot. we are confident at the moment we can do this in the next 4-6 weeks.
  • Marc started to do literature review on works that tried to measure gaps in Wikipedia related to geography, time, and sexual orientation (there are not many compared to gender but we want to make sure we have not missed anything important)
Apr 1 2021, 5:08 PM · Research (FY2020-21-Research-April-June)
MGerlach added a comment to T272731: In-depth analysis of link-recommendation model .

Update week 2021-03-29:

  • started to discuss details of the processing pipeline for the model T276438, such as how often the model should be re-trained. consulted Fabian about how best to approach some decisions, but there are still many unknowns in how to best set this up with the current infrastructure.
  • the link-recommendation tool is available online for 7 languages (ar, bn, cs, vi, en, fr, simple); before deployment, there is another cycle of manual evaluation by volunteers T278864.
  • going through some early feedback surfaced serious problems in the accuracy of the suggested links in viwiki T278864#6961431. After spending some time debugging, I believe the poor performance is due to errors in character encoding in the mysql-database used in production T279037; in the backtesting evaluation (in which viwiki was one of the best-performing wikis) this error did not surface since we are using the locally-stored in-memory pickle-files, for which the encoding works without problem. Thus, it seems we can find an easy fix for this issue (hopefully).
  • an interesting observation surfaced about viwiki: upon aggregating a set of articles for evaluation, one volunteer realized that for viwiki there are many articles for which the link-recommendation does not generate a single link. qualitative observation of random articles in viwiki often yields articles with a single sentence already containing several links. As a result, there are not many possibilities to actually insert a link by the link recommendation. One potential explanation could be that viwiki seems to contain many articles created by bots (this seems to be supported when comparing the ratio of pages that never receive a single pageview). We can speculate that many of the single-sentence articles were created by bots. This suggests that these articles (and their links) were created in a very structured pattern which could explain why the link recommendation model shows the highest performance in the backtesting data (containing many of these sentences).
Apr 1 2021, 4:53 PM · Research (FY2020-21-Research-April-June)
MGerlach added a comment to T276315: Outreachy Application Task: Tutorial for Wikipedia Clickstream data.

@MGerlach @Isaac Is it necessary to use the mwapi library for accessing the Wikimedia API? I am getting some connection failure errors with mwapi but using the requests library works.

Apr 1 2021, 12:40 PM · Outreachy (Round 22)
MGerlach added a comment to T279037: Character encoding issues in MySQL anchor dictionaries for viwiki.

Apparently case insensitive collations (we use utf8mb4_unicode_ci now) are also accent-insensitive. Modern versions of MySQL / MariaDB allow setting those two flags separately but that's since 8.0 / 10.4 and we are still on 10.1 (which is roughly equivalent to 5.7 I think).

If we don't care about case insensitivity, binary is certainly the best choice for a lookup table as it is easy to reason about and also improves performance somewhat.
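The collision described above can be illustrated with a minimal sketch. It does not reproduce the actual utf8mb4_unicode_ci comparison rules; it only approximates a case- and accent-insensitive lookup by stripping combining marks with Python's unicodedata module. The helper name fold_case_and_accents and the sample anchors are assumptions made for illustration.

```python
import unicodedata


def fold_case_and_accents(text):
    """Rough stand-in for a case- and accent-insensitive comparison key."""
    decomposed = unicodedata.normalize("NFD", text)
    # Drop combining marks (tone marks, tildes, ...) left after decomposition.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()


# Hypothetical anchors that differ only in case or diacritics.
anchors = ["mã", "ma", "má", "Ma"]

# Binary-style lookup: every distinct byte string is its own key.
binary_keys = {a.encode("utf-8") for a in anchors}

# Insensitive lookup: all four anchors collapse onto the same key.
folded_keys = {fold_case_and_accents(a) for a in anchors}

print(len(binary_keys), len(folded_keys))
```

With a binary comparison the four anchors stay distinct, whereas the folded keys collapse into one, which is the kind of ambiguity that would mix up anchors in a language that relies heavily on diacritics.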

Apr 1 2021, 10:14 AM · Growth-Team (Current Sprint), Add-Link
MGerlach added a subtask for T261396: Add a link: engineering tasks for initial release: T279037: Character encoding issues in MySQL anchor dictionaries for viwiki.
Apr 1 2021, 7:26 AM · Add-Link, Growth-Structured-Tasks
MGerlach added a parent task for T279037: Character encoding issues in MySQL anchor dictionaries for viwiki: T261396: Add a link: engineering tasks for initial release.
Apr 1 2021, 7:26 AM · Growth-Team (Current Sprint), Add-Link
MGerlach created T279037: Character encoding issues in MySQL anchor dictionaries for viwiki.
Apr 1 2021, 7:26 AM · Growth-Team (Current Sprint), Add-Link

Mar 31 2021

MGerlach added a comment to T276315: Outreachy Application Task: Tutorial for Wikipedia Clickstream data.

data_frame.groupby('Destination').nunique()>=20 returns most of the values with False in Source, destination and link columns.

Yes, but if you apply the filter to the dataframe there should still be (hundreds of) thousands of destinations that fit this criterion.
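A minimal sketch of applying such a filter with pandas, assuming columns named Source and Destination as in the thread. The Count column, the toy rows, and the lowered threshold are made up so that the example stays small.

```python
import pandas as pd

# Toy stand-in for the clickstream table (values are invented).
clicks = pd.DataFrame({
    "Source": ["A", "B", "C", "A", "B"],
    "Destination": ["X", "X", "X", "Y", "Y"],
    "Count": [10, 5, 7, 3, 2],
})
min_sources = 3  # the thread uses 20; lowered here for the toy data

# Number of distinct sources per destination (a Series indexed by destination).
sources_per_destination = clicks.groupby("Destination")["Source"].nunique()

# Keep the destinations that meet the threshold, then filter the original rows.
popular = sources_per_destination[sources_per_destination >= min_sources].index
filtered = clicks[clicks["Destination"].isin(popular)]

print(filtered)
```

The comparison on its own only yields a boolean mask per destination; the rows of interest appear once that mask is used to select from the dataframe.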

Mar 31 2021, 11:25 AM · Outreachy (Round 22)
MGerlach added a comment to T276315: Outreachy Application Task: Tutorial for Wikipedia Clickstream data.

Hello, for the microtask, I am trying to convert the file to CSV with pandas and subsequently data frames, but I am getting the error where several rows have conflicting columns. So the parameter, error_bad_lines=False to ignore the troubling lines can be used. Is that option viable?
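One possible cause of such errors, stated here as an assumption rather than a diagnosis of this particular file, is that a tab-separated file contains a stray double quote, which a CSV parser with default settings treats as the start of a quoted field. The sketch below uses only the standard csv module and two invented rows to show the difference. In pandas itself, error_bad_lines was deprecated in version 1.3 in favour of on_bad_lines, and dropping lines hides the problem rather than fixing it.

```python
import csv
import io

# Two invented tab-separated rows; the first has an unbalanced double quote.
raw = 'other-search\t"Hello\texternal\t12\nMain_Page\tWorld\tlink\t7\n'

# Default quoting: the stray quote opens a quoted field that swallows
# the following tabs and the line break.
naive_rows = list(csv.reader(io.StringIO(raw), delimiter="\t"))

# QUOTE_NONE: quote characters are ordinary text, so each line keeps 4 fields.
plain_rows = list(
    csv.reader(io.StringIO(raw), delimiter="\t", quoting=csv.QUOTE_NONE)
)

print([len(row) for row in naive_rows])
print([len(row) for row in plain_rows])
```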

Mar 31 2021, 8:29 AM · Outreachy (Round 22)
MGerlach added a comment to T276315: Outreachy Application Task: Tutorial for Wikipedia Clickstream data.

@Isaac @MGerlach In the to-do where we are required to visualize the data for a chosen destination article, it says "Pull all the data in the clickstream dataset for that article (both as a source and destination)". Does this mean we need to pull the data for that article in all available languages for the month of January or only the English language?

Mar 31 2021, 8:00 AM · Outreachy (Round 22)

Mar 30 2021

MGerlach added a comment to T275608: Outreachy Project: Build a tool for analyzing and visualizing reader navigation on wikipedia..

Hi all. the application task is T276315. Don't hesitate to ask questions on that task. I will try to answer open questions but feel free to help each other out too.

Mar 30 2021, 9:02 AM · Research (FY2020-21-Research-April-June), Outreachy (Round 22), Outreach-Programs-Projects

Mar 29 2021

MGerlach added a comment to T276438: Establish processes for running the dataset pipeline.

Some thoughts from my side.

Mar 29 2021, 5:41 PM · Growth-Scaling, Growth-Team (Current Sprint), Add-Link
MGerlach added a comment to T278679: Unable to run pipeline due to permissions errors.

Yeah I think prefixing with $USER would work, /tmp/$USER/mwaddlink

ok, I will upload a patch.
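A minimal sketch of building such a per-user path, assuming the /tmp/$USER/mwaddlink convention quoted above. The function name user_scratch_dir and its parameters are hypothetical and not taken from the actual patch.

```python
import getpass
import posixpath


def user_scratch_dir(project="mwaddlink", root="/tmp", user=None):
    """Return a per-user scratch path such as /tmp/<user>/mwaddlink."""
    # Fall back to the name of the user running the process.
    user = user or getpass.getuser()
    # posixpath keeps forward slashes regardless of the local platform.
    return posixpath.join(root, user, project)


print(user_scratch_dir(user="alice"))
```

Because each user writes below their own directory, one user's run does not depend on write permissions for a directory created by another user.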

Mar 29 2021, 11:09 AM · Patch-For-Review, Growth-Team (Current Sprint), Add-Link
MGerlach added a comment to T278679: Unable to run pipeline due to permissions errors.

@kostajh
Thanks for pointing this out; I did not think this would lead to permission issues.
The path in question is simply a path in hdfs where to write an intermediate dataset which will be deleted after running the code.
which naming convention do you recommend in order to best avoid the permission error? (should the full directory be newly created?)
"/tmp/$USER" or "/tmp/$USER/mwaddlink/" ?

Mar 29 2021, 9:53 AM · Patch-For-Review, Growth-Team (Current Sprint), Add-Link

Mar 26 2021

MGerlach added a comment to T272727: Start developing metrics for content-diversity gaps.

Update week 2021-03-22:

  • we have a general framework to identify content relevant to a specific gap that we think will work for 5 content gaps: gender (biographies), geography (places), sexual orientation, cultural context/background, and time/recency. we have been discussing and coordinating with isaac and jaime around the geography gap to make this consistent with similar efforts for gaps related to editors and readers
  • for the moment we have paused discussions around "Important topics" such as medicine, as it is often not clearly defined what should be captured; we will revisit this at a later stage; this will also allow us to capture other aspects related to the narrow operationalization of the 5 content-gaps above (e.g. instead of gender-biographies, the interviews revealed interest in coverage of gender-related topics such as feminism more generally)
  • metrics: we continued discussions around metrics and it seems that selection and extent scores as mentioned above are good candidates for the first set of metrics because they: i) have a high degree of awareness among the community, ii) are mature in terms of having been used in multiple publications, iii) are actionable, iv) are straightforward to apply to all languages, v) are (mostly) straightforward to apply across the different gaps.
  • the plan is to prepare a more polished and detailed written summary of this conclusion to also identify missing steps. in the long-run, we will then aim to start working on a prototype-implementation of these metrics for the 5 gaps
Mar 26 2021, 6:42 PM · Research (FY2020-21-Research-April-June)
MGerlach added a comment to T272731: In-depth analysis of link-recommendation model .

Update week 2021-03-22:

  • added a hard-coded filter to avoid suggesting links to certain types of pages such as disambiguation pages.
  • this was suggested in earlier stages during the manual evaluation by volunteers; I now had the capacity to address some of these issues in a more general framework using wikidata
  • for each article in the set of candidate-links, we check its corresponding wikidata-item and retrieve all wikidata-items that are listed under the instance-of property. This then allows us to remove links belonging to certain instances from the anchor-dictionary such that we can make sure that these will not be suggested by the link-recommendation. Currently, we remove links that are instances of these items:
  • the list of instances to filter can be easily adapted in this framework (as long as it is encoded in wikidata); in principle, it could thus be also easily customized for each Wikipedia depending on respective style-guides
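The filtering step described in this update can be sketched as follows. This is an illustration under stated assumptions, not the mwaddlink implementation: the data layout (anchor text mapped to candidate titles with counts), the function name filter_anchor_dictionary, and the toy values are hypothetical. Q4167410 is the Wikidata item for Wikimedia disambiguation pages; the other identifiers are placeholders.

```python
# Items whose instances should never be suggested as link targets.
blocked_instances = {"Q4167410"}  # Wikimedia disambiguation page

# Hypothetical lookup: page title -> set of instance-of values.
# "Q_PLANET" and "Q_ELEMENT" are placeholders, not real item IDs.
instance_of = {
    "Mercury": {"Q4167410"},
    "Mercury (planet)": {"Q_PLANET"},
    "Mercury (element)": {"Q_ELEMENT"},
}

# Hypothetical anchor dictionary: anchor text -> {candidate title: count}.
anchors = {
    "mercury": {"Mercury": 12, "Mercury (planet)": 30, "Mercury (element)": 25},
    "mercury (disambiguation)": {"Mercury": 4},
}


def filter_anchor_dictionary(anchors, instance_of, blocked):
    """Drop candidate links whose target is an instance of a blocked item."""
    filtered = {}
    for anchor, candidates in anchors.items():
        kept = {
            title: count
            for title, count in candidates.items()
            if not (instance_of.get(title, set()) & blocked)
        }
        if kept:  # anchors left without any candidate are removed entirely
            filtered[anchor] = kept
    return filtered


filtered_anchors = filter_anchor_dictionary(anchors, instance_of, blocked_instances)
print(filtered_anchors)
```

Customizing the behaviour for a given wiki would then amount to changing the contents of blocked_instances.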
Mar 26 2021, 6:24 PM · Research (FY2020-21-Research-April-June)
MGerlach added a comment to T272726: Evaluate list-building tools for ad-hoc topic modeling.

Update week 2021-03-22:

  • dug a little deeper in this analysis. the picture that emerges is that the different methods are very complementary in how they are able to capture the articles contained in each wikiproject
  • the overlap among the different lists is very low (among the 100 items from each list, there are very few items in common); on average (over different wikiprojects) jaccard-index is ~0.05...0.1
  • the improvement of one list with respect to another is not marginal, but often times one list-building-method provides very poor coverage, while another provides very good coverage; for example, there are hundreds of wikiprojects for which the "reader-based" list yields coverage that is at least twice as good as the baseline (that is an improvement of 100% or more in the number of articles that match the articles contained in the wikiprojects)
  • there seems to be no consistent pattern in terms of whether a specific method works best when aggregating different wikiprojects into topics (e.g. the different wikiprojects related to "Biography")
  • this seems to suggest that a good strategy as a tool is to pool the results from different lists
  • discussing with Isaac, we realized that it will be good to check how these results hold for at least one other non-English wiki; Isaac already prepared the data and I should be able to repeat this analysis quickly in the next week (together with writing this up)
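The overlap and coverage measures mentioned in this update can be sketched as below. The function names and the toy lists are assumptions for illustration; the actual analysis compares lists of 100 items per method against the articles in each wikiproject.

```python
def jaccard(a, b):
    """Jaccard index: size of the intersection over size of the union."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def coverage(candidates, reference):
    """Fraction of the reference articles found in the candidate list."""
    reference = set(reference)
    return len(set(candidates) & reference) / len(reference) if reference else 0.0


def pool(*lists):
    """Merge several lists, keeping first-seen order and dropping duplicates."""
    seen = {}
    for items in lists:
        for item in items:
            seen.setdefault(item, None)
    return list(seen)


# Invented example: one wikiproject and two list-building methods.
wikiproject = {"a", "b", "c", "d"}
link_based = ["a", "x", "y"]
reader_based = ["b", "c", "z"]

overlap = jaccard(link_based, reader_based)
pooled = pool(link_based, reader_based)

print(overlap)
print(coverage(link_based, wikiproject), coverage(reader_based, wikiproject))
print(coverage(pooled, wikiproject))
```

In this toy case the two lists share no items, so pooling them covers more of the wikiproject than either list alone, which mirrors the complementarity described above.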
Mar 26 2021, 6:13 PM · Research (FY2020-21-Research-April-June)

Mar 19 2021

MGerlach added a comment to T272731: In-depth analysis of link-recommendation model .

Update week 2021-03-15

  • we had different discussions around which manual filters we should add to prevent certain links from being recommended (e.g. links to disambiguation pages), see the extended documentation, in order to accommodate different requests from volunteers who gave feedback. based on these discussions I started to add features about the entities of the links based on information in wikidata. For example, disambiguation pages (often) have as the instance_of-property the value Q4167410 (Wikimedia disambiguation page); similarly, one can identify links to dates or years, which were previously flagged as unwanted link recommendations. the plan is to add the value for the instance_of-property for all links, which makes it easy to filter a set of pages that should not be linked to. While we would set a default list (e.g. containing disambiguation-pages, dates, etc), the list could be customized for each wiki according to the style-guidelines.
  • in T277342 we are working to improve performance of the link-recommendation model (locally works well, but in the production environment there are some issues). one approach we tested was to reduce the number of database lookups by reducing the maximum length of the ngrams in the text that are considered as a link-anchor (P14882 suggests that >95% of links have anchors consisting of 5 or fewer tokens so we might not lose too much when imposing more restrictions here)
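The effect of capping the n-gram length on the number of lookups can be sketched as follows. This is an illustration only: whitespace tokenization and the function name candidate_ngrams are assumptions, and each generated n-gram stands in for one database lookup.

```python
def candidate_ngrams(tokens, max_len):
    """All contiguous n-grams of 1..max_len tokens, each a potential anchor."""
    return [
        " ".join(tokens[i:i + n])
        for n in range(1, max_len + 1)
        for i in range(len(tokens) - n + 1)
    ]


tokens = "the quick brown fox jumps over the lazy dog".split()

# One lookup per candidate: compare no cap (9 tokens) against a cap of 5.
uncapped = candidate_ngrams(tokens, max_len=len(tokens))
capped = candidate_ngrams(tokens, max_len=5)

print(len(uncapped), len(capped))
```

For a sentence of k tokens, the uncapped count grows as k(k+1)/2, while a fixed cap makes it grow roughly linearly in k, so the saving becomes larger for longer sentences.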
Mar 19 2021, 5:45 PM · Research (FY2020-21-Research-April-June)