MGerlach (Martin Gerlach)
Research Scientist

User Details

User Since
Sep 9 2019, 9:50 AM (185 w, 4 d)
Availability
Available
IRC Nick
mgerlach
LDAP User
MGerlach
MediaWiki User
MGerlach (WMF) [ Global Accounts ]

Recent Activity

Today

MGerlach added a comment to T316372: Write and submit paper on knowledge networks of Wikipedia readers.

weekly updates:

  • slowly moving towards finalizing main figures of the manuscript
  • working on robustness checks for main findings
  • moving towards starting the actual writing
Fri, Mar 31, 4:15 PM · Research (FY2022-23-Research-January-March)
MGerlach added a comment to T326730: Conduct pilot survey for measuring perception of readability.

weekly updates

  • scaling the survey to a larger sample of pairs from which we show a small subsample to each participant
    • this requires setting up some scripts that can automatically generate the corresponding logic from Limesurvey so that we don't have to add each pair manually, which is error-prone
    • testing that the Limesurvey instance does not exceed size limitations for questionnaires when sampling many pairs (we successfully added 10k pairs to the survey and uploaded it to Limesurvey)
  • the scaling of the survey caused some issues in how items are displayed to the participants; ideally we would like to display each item on a separate page to avoid distraction. this needs some debugging but should be solvable.
  • starting to look into survey privacy statement
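
A minimal sketch of the subsampling idea above, not the Limesurvey logic itself: each participant is shown a small random subset drawn from a large pool of pairs. The pool size, subset size, snippet ids, and function name are illustrative assumptions.

```python
import random
from itertools import combinations


def sample_pairs_for_participant(pairs, k, seed=None):
    """Return k distinct pairs drawn uniformly at random from the pool."""
    rng = random.Random(seed)
    return rng.sample(pairs, k)


# Hypothetical pool: all unordered pairs of 20 snippet ids (190 pairs).
snippet_ids = [f"snippet_{i}" for i in range(20)]
all_pairs = list(combinations(snippet_ids, 2))

# Each participant gets their own small subsample of the pool.
subset = sample_pairs_for_participant(all_pairs, k=10, seed=42)
```

Seeding per participant keeps the assignment reproducible, which can help when debugging how items are displayed.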
Fri, Mar 31, 4:13 PM · Research (FY2022-23-Research-January-March)
MGerlach added a comment to T322354: Run bulk analysis of readability scores on different Wikipedias.

weekly update:

  • set up a pipeline to run the language-agnostic model to get readability scores for all articles in a dump
  • however, I am putting the language-agnostic model on hold for now. With Mykola, we have finished the evaluation of an alternative language-dependent model (based on mBERT) adapting the methodology from the existing revert-risk model. the advantages of this model are that i) it significantly outperforms the language-agnostic model in all languages but one, ii) it supports many more languages than the language-agnostic model (which currently depends on dbpedia-spotlight, which supports only around 20 languages), iii) it is a single multilingual model (in contrast, the language-agnostic model requires a separate dbpedia-spotlight model for entity linking in each language).
  • given the unexpectedly strong performance of the multilingual model, I am planning to replace the language-agnostic model
Fri, Mar 31, 4:06 PM · Research (FY2022-23-Research-January-March)
MGerlach added a comment to T333701: Implement "visibility metric" as percentage of orphan articles in a category..

just added a link in the task description pointing to the previous research that identified selection, extent, and visibility as 3 of the most relevant aspects for metrics for knowledge gaps. https://meta.wikimedia.org/wiki/Research:Developing_Metrics_for_Content_Gaps_(Knowledge_Gaps_Taxonomy)#Outcomes

Fri, Mar 31, 3:53 PM · Research
MGerlach updated the task description for T333701: Implement "visibility metric" as percentage of orphan articles in a category..
Fri, Mar 31, 3:52 PM · Research
MGerlach added a comment to T323493: Upgrade the link recommendation algorithm from Spark 2 to Spark 3..

In my opinion, one of the main issues for the migration from spark2 to spark3 will be the following (there might be other issues though):
Currently, for the spark jobs the pipeline activates the anaconda-wmf environment (code). anaconda-wmf only supports spark2, so we have to switch to conda-analytics. one of the problems I think we will encounter is that the spark-jobs need some packages such as mwparserfromhell (see code). with the anaconda-wmf environment these are already installed and thus available on the spark-workers. in contrast, the conda-analytics environment is a minimal environment and doesn't contain that package by default, so it will not be available on the spark-workers. therefore, we need to build a custom environment in which we install those dependencies (most importantly mwparserfromhell). we then have to ship that environment to the spark-workers when creating the spark-session (for example here). for this we can use the wmfdata-package, which has an easy interface to start a spark-session and ship the environment to the workers (code) by setting ship_python_env=True.

Fri, Mar 31, 9:36 AM · Machine-Learning-Team

Wed, Mar 29

MGerlach added a comment to T305688: Make HTML Dumps available in hadoop.

@MGerlach Where can one find the enterprise html json files on the stat machines? I didn't read this carefully enough before experimenting a bit - it will make things easier.

Wed, Mar 29, 6:54 AM · Data-Engineering

Tue, Mar 28

MGerlach updated the task description for T333127: [Session] LLMs, ChatGPT, machine learning tools, etc.
Tue, Mar 28, 7:15 AM · Wikimedia-Hackathon-2023

Fri, Mar 24

MGerlach added a comment to T322354: Run bulk analysis of readability scores on different Wikipedias.

weekly update:

  • re-training and evaluating the model using local instances of dbpedia-spotlight
  • as a next step we can run the model on all articles in the corresponding wikis using the local instance
Fri, Mar 24, 5:30 PM · Research (FY2022-23-Research-January-March)

Fri, Mar 17

MGerlach added a comment to T322354: Run bulk analysis of readability scores on different Wikipedias.

weekly update:

  • refactoring code and re-training the model using the local instance of dbpedia-spotlight
Fri, Mar 17, 2:47 PM · Research (FY2022-23-Research-January-March)
MGerlach added a comment to T299245: Develop a model to detect sentences that need copy-editing.

weekly update:

  • obtained first results for evaluating the model to score sentences for copyediting in multiple languages
  • considering 7 languages (arwiki, bnwiki, cswiki, enwiki, eswiki, frwiki, viwiki) we obtain an accuracy of 70-80% across languages when distinguishing ground-truth sentences from those wikis obtained from the removal of copyedit-templates
  • as a next step: apply model's scores to larger dataset of sentences from Wikipedia and manually check results
Fri, Mar 17, 2:45 PM · Research (FY2022-23-Research-January-March)
MGerlach added a comment to T326730: Conduct pilot survey for measuring perception of readability.

weekly update:

  • added attention screening item to the survey
  • implemented survey in limesurvey which samples a small random subset of all pairs for participants to compare
Fri, Mar 17, 2:40 PM · Research (FY2022-23-Research-January-March)
MGerlach updated the task description for T326730: Conduct pilot survey for measuring perception of readability.
Fri, Mar 17, 2:38 PM · Research (FY2022-23-Research-January-March)

Mon, Mar 6

MGerlach added a watcher for Wikimedia-Hackathon-2023: MGerlach.
Mon, Mar 6, 9:36 AM

Fri, Mar 3

MGerlach added a comment to T326730: Conduct pilot survey for measuring perception of readability.

weekly update:

  • figured out pre-screening (link) and sociodemographic data of participants (link) via prolific
  • figured out attention screening via prolific (link)
  • figured out additional survey items (language proficiency, topic interest)
  • still working on implementing questions on limesurvey but we have some new promising ideas how to do that in the next 1-2 weeks
Fri, Mar 3, 4:23 PM · Research (FY2022-23-Research-January-March)

Feb 24 2023

MGerlach added a comment to T299245: Develop a model to detect sentences that need copy-editing.

weekly update:

  • generated dataset of sentence pairs (before/after) from edit-diffs where copyedit-template was removed for all wikis which have the copyedit-template. after some filtering, there are 30 different wikis with at least 1000 pairs of aligned sentences (before/after the removal of the copyedit template)
Feb 24 2023, 4:26 PM · Research (FY2022-23-Research-January-March)
MGerlach added a comment to T326730: Conduct pilot survey for measuring perception of readability.

weekly update:

  • finalized processing/filtering pipeline to generate clean list of snippets of equal length from articles
  • working on implementing questions in limesurvey
Feb 24 2023, 4:11 PM · Research (FY2022-23-Research-January-March)

Feb 17 2023

MGerlach added a comment to T299245: Develop a model to detect sentences that need copy-editing.

weekly update:

  • using a standard pre-trained language model, we can automatically distinguish sentence-pairs (the same sentence before and after an edit tagged as copyedit via the removal of the copyedit template) with moderate precision of ~70-80%. this suggests we might use this model to predict whether a specific sentence requires copyediting.
  • so far we only checked this for sentences from English Wikipedia. as a next step I will extract similar sentence pairs (before/after an edit where the copyedit-template was removed) from other wikis. the template exists in 83 different wikis (Q6292692)
Feb 17 2023, 3:00 PM · Research (FY2022-23-Research-January-March)
MGerlach added a comment to T316367: Evaluate the impact of de-orphanization of articles in terms of visibility.

weekly update:

  • completed regression analysis for quantifying effect of de-orphanization on number of pageviews for all wikis.
  • results hold for most wikis and are statistically significant: overall, large effect sizes of 50% (or more) increases in the number of pageviews following treatment.
  • I consider the work done as part of this task, but will keep this task open until results are documented on the respective meta-page
Feb 17 2023, 2:53 PM · Research (FY2022-23-Research-January-March)
MGerlach added a comment to T326730: Conduct pilot survey for measuring perception of readability.

weekly update:

  • solved access to technical infrastructure tools (prolific, limesurvey)
  • converging on a pipeline to extract snippets from articles that are from a single topic and are of approximately equal length
Feb 17 2023, 2:49 PM · Research (FY2022-23-Research-January-March)

Feb 14 2023

Physikerwelt awarded T325815: Understanding perception of readability in Wikipedia a Love token.
Feb 14 2023, 10:31 PM · Research

Feb 13 2023

MGerlach added a comment to T327371: Fix Armenian sentence tokenization bug in the link recommendation algorithm.

In short: I could resolve the issue when upgrading wikitextparser to version 0.51.1 (I previously used 0.45.1).

Feb 13 2023, 4:12 PM · Machine-Learning-Team, Add-Link, Growth-Team
MGerlach added a comment to T324756: Story idea for Blog: From hell to HTML: releasing a Python package to easily work with Wikimedia HTML dumps.

@apaskulin Thank you. I resolved all remaining comments and moved to "ready for publication".

Feb 13 2023, 8:39 AM · Developer-Advocacy (Jan-Mar 2023), Technical-blog-posts
MGerlach moved T324756: Story idea for Blog: From hell to HTML: releasing a Python package to easily work with Wikimedia HTML dumps from Draft in progress to Ready for publication or requires post publication follow-up on the Technical-blog-posts board.
Feb 13 2023, 8:38 AM · Developer-Advocacy (Jan-Mar 2023), Technical-blog-posts

Feb 10 2023

MGerlach added a comment to T326730: Conduct pilot survey for measuring perception of readability.

weekly update:

  • finalizing the setup of a formal collaboration so that collaborators can be granted access to the WMF-limesurvey instance and design survey questions in practice
Feb 10 2023, 2:44 PM · Research (FY2022-23-Research-January-March)
MGerlach added a comment to T322354: Run bulk analysis of readability scores on different Wikipedias.

weekly update:

  • figured out how to run a local instance of dbpedia-spotlight on the stat machines
  • next step: build a pipeline to run the full model to get readability scores of all articles of a dump
Feb 10 2023, 2:43 PM · Research (FY2022-23-Research-January-March)
MGerlach added a comment to T299245: Develop a model to detect sentences that need copy-editing.

weekly update:

  • filtering the dataset of edited sentences from copyedit-template-removal. many sentences are not changed due to grammatical/copyedit errors but seem to be changed for stylistic reasons. Focusing only on a small subset of sentences that are clearly related to grammatical errors seems to make it possible to distinguish whether a sentence needs editing using pre-trained language models.
Feb 10 2023, 2:41 PM · Research (FY2022-23-Research-January-March)

Feb 9 2023

MGerlach updated the task description for T328742: Generate list of common misspellings from wiktionary.
Feb 9 2023, 11:10 AM · Research (FY2022-23-Research-January-March)

Feb 7 2023

MGerlach updated the task description for T324701: [SPIKE] Does site visit frequency serve as a meaningful distinction between various groups of readers?.
Feb 7 2023, 10:56 AM · Editing-team (Tracking), Research (FY2022-23-Research-January-March)
MGerlach added a comment to T324701: [SPIKE] Does site visit frequency serve as a meaningful distinction between various groups of readers?.

answering question 3 (how do the above distributions change when considering whether the reader visited pages beyond the main namespace).

Feb 7 2023, 10:56 AM · Editing-team (Tracking), Research (FY2022-23-Research-January-March)

Feb 6 2023

MGerlach updated the task description for T324701: [SPIKE] Does site visit frequency serve as a meaningful distinction between various groups of readers?.
Feb 6 2023, 4:07 PM · Editing-team (Tracking), Research (FY2022-23-Research-January-March)
MGerlach added a comment to T324701: [SPIKE] Does site visit frequency serve as a meaningful distinction between various groups of readers?.

answering questions 1./2.:

  • we separate all readers into two subgroups i) people who visit a single Wikipedia page and ii) people who visit multiple Wikipedia pages on a single day.
  • for each group separately, we calculate what percentage:
    • Did not visit Wikipedia before (non-recurring)
    • Visited Wikipedia within the last month (7<d<=31)
    • Visited Wikipedia within the last week (1<d<=7)
    • Visited Wikipedia yesterday (d=1)
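
The bucketing above can be sketched as follows. This is only an illustration: the bucket names, the toy values, and the choice to treat a missing last visit (or d > 31) as non-recurring are assumptions, not taken from the analysis itself.

```python
from collections import Counter


def recency_bucket(d):
    """Map d (days since the last visit, None if no earlier visit is known) to a bucket."""
    if d is None or d > 31:
        # Assumption: no visit observed within 31 days counts as non-recurring.
        return "non-recurring"
    if d < 1:
        raise ValueError("d must be at least 1 day")
    if d == 1:
        return "yesterday"
    if d <= 7:
        return "last week"   # 1 < d <= 7
    return "last month"      # 7 < d <= 31


def bucket_shares(days_list):
    """Percentage of readers falling into each recency bucket."""
    counts = Counter(recency_bucket(d) for d in days_list)
    total = len(days_list)
    return {bucket: 100.0 * n / total for bucket, n in counts.items()}


# Made-up values for the two subgroups (single page vs. multiple pages per day).
single_page = [None, 1, 3, 20, None]
multi_page = [1, 1, 5, None]
shares = {
    "single": bucket_shares(single_page),
    "multiple": bucket_shares(multi_page),
}
```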

regularity-vs-intensity.png (352×483 px, 30 KB)

Feb 6 2023, 4:07 PM · Editing-team (Tracking), Research (FY2022-23-Research-January-March)

Feb 3 2023

MGerlach added a comment to T322354: Run bulk analysis of readability scores on different Wikipedias.

weekly update:

  • talked to folks from data engineering and got good suggestions on how to run the model on the stat-machines. - next: will try to implement those suggestions and/or get additional support
Feb 3 2023, 2:34 PM · Research (FY2022-23-Research-January-March)
MGerlach added a comment to T316367: Evaluate the impact of de-orphanization of articles in terms of visibility.

weekly update:

  • Running regression models to quantify impact in a statistically more rigorous way (this also makes it easier to include wiki-specific analysis more elegantly)
Feb 3 2023, 2:33 PM · Research (FY2022-23-Research-January-March)
MGerlach added a comment to T316372: Write and submit paper on knowledge networks of Wikipedia readers.

weekly update:

  • finished 2 figures. continuing work on figures 3-7.
Feb 3 2023, 2:31 PM · Research (FY2022-23-Research-January-March)
MGerlach added a comment to T299245: Develop a model to detect sentences that need copy-editing.

weekly update:

  • generated larger dataset of sentence pairs by looking at all edits from removal of copyedit template (not only those marked as minor). this yields 176k pairs of sentences from 34k different articles.
Feb 3 2023, 2:31 PM · Research (FY2022-23-Research-January-March)
MGerlach added a comment to T326730: Conduct pilot survey for measuring perception of readability.

weekly update:

  • defined first approach for selecting snippets for comparison. for the pilot we will select only articles from a single topic in order to reduce the effect of familiarity. we will extract snippets from the articles by selecting 5 sentences.
Feb 3 2023, 2:28 PM · Research (FY2022-23-Research-January-March)
MGerlach created T328742: Generate list of common misspellings from wiktionary.
Feb 3 2023, 9:42 AM · Research (FY2022-23-Research-January-March)
MGerlach created T328734: Grant Access to ldap/nda for AKhatun.
Feb 3 2023, 8:09 AM · SRE, LDAP-Access-Requests
MGerlach updated subscribers of T328733: Requesting access to analytics-privatedata-users for Aisha Khatun.

Hey SRE/Analytics/Legal -- we have a new contractor onboard: @AKhatun_WMF . She needs access to HDFS and the stat machines for a new research project. Don't hesitate to let me know if you need more information. I believe she has an NDA signed as part of the contracting process but she's not a formal collaborator so I'm not 100% certain.

Feb 3 2023, 8:04 AM · SRE, SRE-Access-Requests
MGerlach created T328733: Requesting access to analytics-privatedata-users for Aisha Khatun.
Feb 3 2023, 8:03 AM · SRE, SRE-Access-Requests

Feb 1 2023

MGerlach added a comment to T324756: Story idea for Blog: From hell to HTML: releasing a Python package to easily work with Wikimedia HTML dumps.

@apaskulin I revised the blogpost addressing all of your comments -- they helped a lot to improve the writing. I kept all changes in suggestion mode so you could easily identify the differences to the previous version. I also added two suggestions for illustrations. Let me know if you have any other suggestions.

Feb 1 2023, 7:40 AM · Developer-Advocacy (Jan-Mar 2023), Technical-blog-posts

Jan 27 2023

ppelberg awarded T324701: [SPIKE] Does site visit frequency serve as a meaningful distinction between various groups of readers? a Barnstar token.
Jan 27 2023, 6:59 PM · Editing-team (Tracking), Research (FY2022-23-Research-January-March)
MGerlach added a comment to T324756: Story idea for Blog: From hell to HTML: releasing a Python package to easily work with Wikimedia HTML dumps.

Great post! I've added some comments and suggestions to the Google Doc. Let me know when you've had a chance to review them.

Jan 27 2023, 4:56 PM · Developer-Advocacy (Jan-Mar 2023), Technical-blog-posts
MGerlach added a comment to T326730: Conduct pilot survey for measuring perception of readability.

weekly update

  • generated a notebook for how to use Bradley-Terry-model to infer scores from pairwise comparisons with synthetic data; as well as how to ensure there are enough pairwise comparisons such that algorithm converges
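
One way to check whether there are enough comparisons can be sketched as below. The comment does not say which check the notebook uses; this sketch assumes the standard condition that the Bradley-Terry maximum-likelihood estimate is finite when the directed "who was preferred over whom" graph is strongly connected. Function and item names are hypothetical.

```python
def is_strongly_connected(items, wins):
    """wins: iterable of (winner, loser) pairs.

    Returns True if every item can reach every other item by following
    'was preferred over' edges.
    """
    forward, backward = {}, {}
    for winner, loser in wins:
        forward.setdefault(winner, set()).add(loser)
        backward.setdefault(loser, set()).add(winner)

    def reachable(start, graph):
        # Iterative depth-first search from start.
        seen = {start}
        stack = [start]
        while stack:
            node = stack.pop()
            for nxt in graph.get(node, ()):
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return seen

    items = list(items)
    if not items:
        return True
    everything = set(items)
    start = items[0]
    # Strongly connected iff start reaches all items in the graph and in its reverse.
    return (reachable(start, forward) >= everything
            and reachable(start, backward) >= everything)


# Synthetic examples: a cycle is fine; an item that is never beaten is not.
ok = is_strongly_connected("ABC", [("A", "B"), ("B", "C"), ("C", "A")])
not_ok = is_strongly_connected("ABC", [("A", "B"), ("A", "C"), ("B", "C")])
```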
Jan 27 2023, 4:49 PM · Research (FY2022-23-Research-January-March)
MGerlach added a comment to T324701: [SPIKE] Does site visit frequency serve as a meaningful distinction between various groups of readers?.

weekly update:

  • started first exploratory analysis of a 1% sample of reading sessions from a single day (~438k sessions)
  • percentage of reading sessions with single or multiple pageviews:

intensity.png (348×454 px, 17 KB)

  • percentage of reading sessions with last-access: 1 day, 1 week (but more than 1 day), 1 month (but more than 1 week), non-returning:

regularity.png (341×475 px, 26 KB)

  • percentage of reading sessions only accessing main namespace or also accessing pages in any other namespace:

familiarity.png (340×458 px, 21 KB)

  • next step: look at combination of different facets
Jan 27 2023, 4:45 PM · Editing-team (Tracking), Research (FY2022-23-Research-January-March)
MGerlach added a comment to T316367: Evaluate the impact of de-orphanization of articles in terms of visibility.

weekly update:

  • obtained first results for impact of treated articles (de-orphanized) vs control articles (same article in another language that remained orphan)
  • on average: there is a 40% increase in the number of pageviews for treated articles; we don't see an increase for the control articles
  • next: wiki-specific analysis via regression
Jan 27 2023, 4:37 PM · Research (FY2022-23-Research-January-March)

Jan 19 2023

MGerlach added a comment to T316372: Write and submit paper on knowledge networks of Wikipedia readers.

weekly update:

  • reorganizing datasets and working on figures
Jan 19 2023, 7:31 PM · Research (FY2022-23-Research-January-March)
MGerlach added a comment to T326730: Conduct pilot survey for measuring perception of readability.

weekly update:

  • decided on survey design based on pairwise comparison of articles (see Benoit et al)
  • decided on planned infrastructure setup: recruiting via prolific and hosting via limesurvey.
  • testing analysis of data from pairwise comparison via bradley-terry-model using crowd-kit package
  • next: testing survey setup via limesurvey
Jan 19 2023, 7:29 PM · Research (FY2022-23-Research-January-March)
MGerlach updated the task description for T326730: Conduct pilot survey for measuring perception of readability.
Jan 19 2023, 7:25 PM · Research (FY2022-23-Research-January-March)
MGerlach added a comment to T322354: Run bulk analysis of readability scores on different Wikipedias.

weekly update

  • the current bottleneck in the pipeline is the call of the public API of dbpedia-spotlight. this is not scalable to score all articles of a dump.
  • instead, tested running a local instance of dbpedia-spotlight. this yields up to a 100-fold speedup, suggesting it is feasible to use this approach to score readability for all articles of a dump
  • next: pre-processing one dump and setting up a pipeline with the local instance of dbpedia-spotlight
Jan 19 2023, 7:24 PM · Research (FY2022-23-Research-January-March)

Jan 17 2023

MGerlach added a comment to T276438: Establish processes for running the dataset pipeline.

@kostajh I agree that we should re-run the pipelines after some time. If possible, updating after 6 months seems reasonable (though I don't have any quantitative insights into how quickly the model gets outdated). But the almost 1.5 years since October 2021 seem too long.

Jan 17 2023, 3:04 PM · Growth-Team, Machine-Learning-Team, Growth-Scaling, Add-Link

Jan 12 2023

MGerlach added a comment to T316372: Write and submit paper on knowledge networks of Wikipedia readers.

weekly update:

  • revised scope/framing of the paper
  • reached consensus and are now working on figures for results section and framing for introduction
Jan 12 2023, 7:15 PM · Research (FY2022-23-Research-January-March)
MGerlach added a comment to T299245: Develop a model to detect sentences that need copy-editing.

weekly update:

  • generated a new ground-truth dataset of edits to articles where the copyedit-template was removed. the rationale is that the removal of the template indicates that the edit improved the article with respect to copyediting.
  • looking at all such events in the revision history of all articles in English Wikipedia. only keeping edits for articles where: i) the template was removed only once in the revision history of the article (to avoid cases where the template is added/removed many times); ii) the edit was marked as a minor edit (to avoid edits which contain major addition/removal of content).
  • I then align sentences from the old to the new revision by matching all possible pairs of sentences via their (minimum) Levenshtein distance
  • this yields 13k pairs of sentences across 5k articles where each sentence was changed supposedly as part of copyediting due to removal of the copyedit template. One-off dataset available here.
  • in principle, the pipeline can be adapted easily to other Wikipedias which use this or similar templates.
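
The alignment step above can be sketched as follows. This is a simplified illustration, not the actual pipeline: each old sentence is matched to the new sentence with the smallest Levenshtein distance, and only pairs that differ are kept. Function names and example sentences are made up.

```python
def levenshtein(a, b):
    """Edit distance between strings a and b (insert, delete, substitute cost 1 each)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        prev = curr
    return prev[-1]


def align_changed_sentences(old_sentences, new_sentences):
    """Pair each old sentence with its closest new sentence; keep only changed pairs."""
    if not new_sentences:
        return []
    pairs = []
    for old in old_sentences:
        best = min(new_sentences, key=lambda new: levenshtein(old, new))
        if best != old:
            pairs.append((old, best))
    return pairs


old_rev = ["The cat sat on the mat.", "He go to school every day."]
new_rev = ["The cat sat on the mat.", "He goes to school every day."]
pairs = align_changed_sentences(old_rev, new_rev)
```

A sentence that was deleted outright would still be matched to some unrelated sentence here, so a distance threshold would likely be needed in practice.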
Jan 12 2023, 7:07 PM · Research (FY2022-23-Research-January-March)

Jan 11 2023

MGerlach moved T325815: Understanding perception of readability in Wikipedia from FY2022-23-Research-January-March to In Progress on the Research board.
Jan 11 2023, 1:23 PM · Research
MGerlach created T326730: Conduct pilot survey for measuring perception of readability.
Jan 11 2023, 1:22 PM · Research (FY2022-23-Research-January-March)

Jan 6 2023

MGerlach added a comment to T299245: Develop a model to detect sentences that need copy-editing.

weekly update:

  • conducted first analysis on using BERT language model to classify sentences that are grammatically correct/incorrect
  • for benchmark corpora and synthetic corpora (not Wikipedia) we obtain high accuracy showing the general applicability of this approach to score/rank copyedits. however, for the dataset of sentences from the newcomer-copyedit-task, the model cannot distinguish between sentences before or after the edit (i.e. from these examples we cannot detect systematic differences which would help us distinguish supposedly correct or incorrect sentences). since the general approach works with benchmark/synthetic corpora, the limiting factor seems to be the underlying dataset to fine-tune the model with labeled sentences from Wikipedia.
  • Therefore, as a next step, we try to obtain an alternative dataset of labeled sentences from Wikipedia that are grammatically correct/incorrect using copyedit-templates. We will adapt the approach from extracting positive and negative examples of articles with reliability issues -- instead of looking for templates indicating reliability issues (such as pov), we will look for articles with copyedit-issues (e.g. copy_edit).
Jan 6 2023, 4:54 PM · Research (FY2022-23-Research-January-March)
MGerlach added a comment to T324701: [SPIKE] Does site visit frequency serve as a meaningful distinction between various groups of readers?.

Weekly update:

  • generated first dataset of readers of English Wikipedia on a single day capturing for each actor: i) the number of pageloads to articles in the main namespace; ii) the number of days since the last access; iii) how many other namespaces (beyond main namespace) were visited.
  • will start exploratory analysis of the data in the next week(s)
Jan 6 2023, 4:42 PM · Editing-team (Tracking), Research (FY2022-23-Research-January-March)
MGerlach moved T299245: Develop a model to detect sentences that need copy-editing from FY2022-23-Research-July-September to FY2022-23-Research-January-March on the Research board.
Jan 6 2023, 4:29 PM · Research (FY2022-23-Research-January-March)

Jan 5 2023

MGerlach claimed T324701: [SPIKE] Does site visit frequency serve as a meaningful distinction between various groups of readers?.
Jan 5 2023, 7:39 PM · Editing-team (Tracking), Research (FY2022-23-Research-January-March)
MGerlach moved T325815: Understanding perception of readability in Wikipedia from Staged to FY2022-23-Research-January-March on the Research board.
Jan 5 2023, 7:38 PM · Research

Dec 22 2022

MGerlach added a comment to T316367: Evaluate the impact of de-orphanization of articles in terms of visibility.

weekly update:

  • extracted treatment/control pairs for two complementary studies on the effect of deorphanization on visibility
    • 1) treatment: article a deorphanized in month t in wiki w; control: article a remains orphan in wiki w' != w
    • 2) treatment: article a deorphanized in month t in wiki w; control: article a' != a remains orphan in wiki w
  • next: extracting timeseries and calculating difference in differences to estimate effect of treatment
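
The difference-in-differences estimate mentioned as the next step can be sketched like this. The pageview numbers are made up for illustration, and the function name is hypothetical.

```python
def mean(values):
    return sum(values) / len(values)


def diff_in_diff(treat_before, treat_after, control_before, control_after):
    """Change in the treated group minus change in the control group."""
    treated_change = mean(treat_after) - mean(treat_before)
    control_change = mean(control_after) - mean(control_before)
    return treated_change - control_change


# Made-up monthly pageviews before/after month t.
effect = diff_in_diff(
    treat_before=[10, 12, 8],     # treated articles, before deorphanization
    treat_after=[15, 16, 14],     # treated articles, after deorphanization
    control_before=[10, 11, 9],   # control articles, same months
    control_after=[11, 10, 12],
)
```

Subtracting the control group's change removes trends that affect both groups, such as seasonal shifts in traffic.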
Dec 22 2022, 7:41 PM · Research (FY2022-23-Research-January-March)
MGerlach closed T316369: Plan qualitative analysis on language-agnostic readability scores as Resolved.

weekly update:

Dec 22 2022, 1:45 PM · Research (FY2022-23-Research-October-December)
MGerlach closed T316369: Plan qualitative analysis on language-agnostic readability scores, a subtask of T293028: [EPIC] Initiate Multilingual Readability Research, as Resolved.
Dec 22 2022, 1:44 PM · Research, Epic
MGerlach created T325815: Understanding perception of readability in Wikipedia.
Dec 22 2022, 1:40 PM · Research

Dec 21 2022

MGerlach added a comment to T325521: Investigate `UnicodeEncodeError` thrown by Add-A-Link training pipeline for fywiki model.

Just want to add that this error is not unique to fywiki but will likely appear when working with other wikis as well. It has been reported for zhwiki (see the issues mentioned above) and I was able to reproduce that error.

Dec 21 2022, 2:31 PM · Machine-Learning-Team, Add-Link, Growth-Team

Dec 15 2022

MGerlach added a comment to T316367: Evaluate the impact of de-orphanization of articles in terms of visibility.

weekly update:

  • setting up more robust experiment in which we define a control group of articles that were not de-orphanized.
Dec 15 2022, 8:13 PM · Research (FY2022-23-Research-January-March)
MGerlach added a comment to T316369: Plan qualitative analysis on language-agnostic readability scores.

weekly update:

  • sketching out details on how to measure perception of readability via surveys. most promising approach is the one described by Benoit et al. Measuring and Explaining Political Sophistication through Textual Complexity
  • naively, we would just show individual articles to participants and ask how they rate how easy it is to understand the text (readability) on a pre-defined scale (say from 0 to 5 or 0 to 100).
  • however, it can be difficult for humans to rate a single trait (such as readability) on a pre-defined absolute scale; in addition, there are some concerns about the validity of the result when measuring a trait directly with a single question
  • therefore, the authors propose a simpler setup in which raters are only shown pairs of two texts at a time and asked to rate which of the two presented texts is easier to read/understand. by collecting many of such pairwise ratings, we can calculate a readability score for each individual text using the Bradley-Terry model (this is similar to the Elo-rating in chess, where a skill score of an individual player is calculated from only comparing pairs of players). the resulting readability score is not a direct measurement but a latent measurement of "readability".
  • one of the main advantages of this setup is that it can be applied to measure perception of readability in all languages, i.e. it does not rely on the availability of labeled data where we show participants two versions of the same text.
  • in principle, this approach should be straightforward to apply to the case of articles in Wikipedia. I am currently sketching out the details of how to adapt. There are some open questions such as how to sample the set of articles which will be shown to raters (e.g. how many articles, should we sample them randomly or should they match some representativeness in terms of topics, which part of the article to show, how long should the text-snippet be, etc)
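
To make the scoring step concrete, here is a minimal sketch of fitting a Bradley-Terry model with the standard iterative (minorization-maximization) update. It is not the implementation used in the project; the function name, iteration count, and toy comparisons are assumptions.

```python
from collections import defaultdict


def bradley_terry(comparisons, n_iter=200):
    """comparisons: list of (winner, loser) with winner != loser.

    Returns a dict item -> latent score, normalized to sum to 1.
    """
    wins = defaultdict(int)      # number of comparisons each item won
    n_pair = defaultdict(int)    # number of comparisons per unordered pair
    items = set()
    for winner, loser in comparisons:
        wins[winner] += 1
        n_pair[frozenset((winner, loser))] += 1
        items.update((winner, loser))

    p = {i: 1.0 / len(items) for i in items}
    for _ in range(n_iter):
        new_p = {}
        for i in items:
            denom = 0.0
            for pair, n in n_pair.items():
                if i in pair:
                    (j,) = pair - {i}
                    denom += n / (p[i] + p[j])
            # Update: p_i <- wins_i / sum_j n_ij / (p_i + p_j)
            new_p[i] = wins[i] / denom
        total = sum(new_p.values())
        p = {i: v / total for i, v in new_p.items()}
    return p


# Toy data: text A is usually judged easier than B, and B easier than C.
toy = ([("A", "B")] * 3 + [("B", "A")] +
       [("B", "C")] * 3 + [("C", "B")] +
       [("A", "C")] * 3 + [("C", "A")])
scores = bradley_terry(toy)
```

Under the model, the probability that text i is preferred over text j is p_i / (p_i + p_j), so only ratios of scores matter; the normalization is a convention.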
Dec 15 2022, 8:09 PM · Research (FY2022-23-Research-October-December)
MGerlach moved T316369: Plan qualitative analysis on language-agnostic readability scores from FY2022-23-Research-July-September to FY2022-23-Research-October-December on the Research board.
Dec 15 2022, 7:35 PM · Research (FY2022-23-Research-October-December)
MGerlach closed T315086: Create a list with examples of automatically suggested copyedits for manual evaluation as Resolved.

weekly update:

  • current tasks have been solved
  • there are no further improvements planned at the moment for manual evaluation
  • therefore closing the task, feel free to re-open in the future if the work will be picked up again
Dec 15 2022, 7:34 PM · Growth-Copy-Edit, Research (FY2022-23-Research-July-September)
MGerlach closed T315086: Create a list with examples of automatically suggested copyedits for manual evaluation, a subtask of T293034: [EPIC] Research support for Copyediting as a structured tasks, as Resolved.
Dec 15 2022, 7:34 PM · Research, Epic
MGerlach closed T315086: Create a list with examples of automatically suggested copyedits for manual evaluation, a subtask of T315096: [Epic] Growth: Copy edit, as Resolved.
Dec 15 2022, 7:34 PM · Growth-Team (Current Sprint), Growth-Copy-Edit

Dec 9 2022

MGerlach added a comment to T316367: Evaluate the impact of de-orphanization of articles in terms of visibility.

weekly update:

  • completed first analysis on number of clicks to newly deorphanized articles from clickstream data.
  • found a statistically significant increase in the number of clicks in the month after de-orphanization: the average increases from 0 to 32; 17% of the de-orphanizing links received more than 10 clicks per month, which is much higher than for the average link (Dimitrov et al. write "on Wikipedia only around 4% of all existing links are clicked by visitors more frequently than 10 times within a month")
  • we will likely refine this analysis to include a more robust control group against which to compare the increase in visibility
Dec 9 2022, 10:18 PM · Research (FY2022-23-Research-January-March)
MGerlach added a comment to T316369: Plan qualitative analysis on language-agnostic readability scores.

weekly update:

  • synthesizing literature on measuring perception of readability. identified two main approaches: i) rating which of a pair of texts is simpler/more difficult (e.g. Benoit et al. 2019); ii) items from a "readability" scale (e.g. Graefe et al. 2016). will sketch in detail how we would adapt these two approaches to our case. we will likely start with option i) as it is more well-defined.
  • contacted GDI about options to run surveys with WMF's LimeSurvey instance
Dec 9 2022, 10:08 PM · Research (FY2022-23-Research-October-December)

Dec 8 2022

MGerlach created T324756: Story idea for Blog: From hell to HTML: releasing a Python package to easily work with Wikimedia HTML dumps.
Dec 8 2022, 2:18 PM · Developer-Advocacy (Jan-Mar 2023), Technical-blog-posts

Dec 2 2022

MGerlach added a comment to T316369: Plan qualitative analysis on language-agnostic readability scores.

weekly update:

  • collecting literature on how to measure readability in surveys
  • getting an overview of the different options of doing surveys (quicksurvey, limesurvey, prolific)
Dec 2 2022, 9:23 PM · Research (FY2022-23-Research-October-December)
MGerlach added a comment to T316372: Write and submit paper on knowledge networks of Wikipedia readers.

weekly update:

  • analysing variation of knowledge network across the 14 different wikis. investigating the effect of the underlying wiki (size etc)
Dec 2 2022, 9:21 PM · Research (FY2022-23-Research-January-March)

Nov 25 2022

MGerlach added a comment to T316369: Plan qualitative analysis on language-agnostic readability scores.

weekly update:

  • first meeting with Indira and others. agreed on a general research plan to conduct surveys on the readability of Wikipedia articles and on whether our automatic scores from the language-agnostic model capture it across different languages
  • we will spend the next weeks refining the details of the research plan, specifically focusing on how we should set up the survey around readability
  • we will start with revisiting the literature around constructs of readability and how we can measure them via surveys
Nov 25 2022, 7:00 PM · Research (FY2022-23-Research-October-December)
MGerlach added a comment to T315086: Create a list with examples of automatically suggested copyedits for manual evaluation.

weekly update:

  • went through feedback from ambassadors about sources of errors. the main issue seemed to be about text that was directly quoted (often text in foreign language or from hundreds of years ago) thus throwing many false positives
  • added improvements to the pre- and post-processing of the errors (both for LanguageTool and from custom list of misspellings)
    • ignoring text paragraphs that indicate blockquotes (example) or multicolumn tables often capturing parallel multilingual texts of poems or lyrics (example) (commit)
    • better handling of text that appears in explicit quotes, especially single quotes and disambiguating them from apostrophes (example) (commit)
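
As an illustration of the quote handling described above, the following is a rough sketch and not the code from the linked commits. It assumes errors come with a character offset into the paragraph text, and it uses a simple heuristic: a single quote counts as a quotation mark only when it is not sandwiched between word characters, which is how apostrophes such as in "it's" appear. The names `quoted_spans` and `in_quote` are hypothetical.

```python
import re

# Straight double quotes around a non-empty span.
DOUBLE_QUOTED = re.compile(r'"([^"]+)"')
# Opening single quote: not preceded by a word character and followed by
# non-whitespace. Closing single quote: preceded by non-whitespace and not
# followed by a word character. Apostrophes inside words match neither role.
SINGLE_QUOTED = re.compile(r"(?<!\w)'(?=\S)(.+?)(?<=\S)'(?!\w)")


def quoted_spans(text):
    """Return sorted (start, end) character spans of quoted passages."""
    spans = [m.span() for m in DOUBLE_QUOTED.finditer(text)]
    spans += [m.span() for m in SINGLE_QUOTED.finditer(text)]
    return sorted(spans)


def in_quote(text, offset):
    """True if the character at `offset` lies inside a quoted passage."""
    return any(start <= offset < end for start, end in quoted_spans(text))
```

With this, an error found at an offset inside 'teh cat' would be dropped, while the same misspelling after "it's" would be kept. The heuristic is deliberately simple: it ignores typographic quotes and can misread a trailing possessive apostrophe (e.g. dogs') as a closing quote.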
Nov 25 2022, 6:57 PM · Growth-Copy-Edit, Research (FY2022-23-Research-July-September)
MGerlach added a comment to T316367: Evaluate the impact of de-orphanization of articles in terms of visibility.

weekly update:

  • retrieving the timeseries of clicks to deorphanized articles from clickstream before and after they were deorphanized. Resolving technical issues around pages that were renamed or became redirects (since clickstream only contains page titles) in order to track the number of clicks from month to month.
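
One way to picture the renaming issue is to key the time series by a stable page ID rather than by title. The sketch below is hypothetical and not the actual pipeline: it assumes a lookup `title_to_page_id` from (month, title) to page ID has already been built from some other source, and that clickstream rows have been reduced to (month, target title, click count).

```python
from collections import defaultdict


def clicks_by_page(rows, title_to_page_id):
    """Sum clicks per (page_id, month).

    rows: iterable of (month, target_title, n_clicks) tuples (assumed shape).
    title_to_page_id: dict mapping (month, title) -> page_id, so that a page
    keeps the same key even if its title differs from month to month.
    """
    totals = defaultdict(int)
    for month, title, n_clicks in rows:
        page_id = title_to_page_id.get((month, title))
        if page_id is not None:  # skip titles we cannot resolve
            totals[(page_id, month)] += n_clicks
    return dict(totals)
```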
Nov 25 2022, 6:46 PM · Research (FY2022-23-Research-January-March)
MGerlach added a comment to T316372: Write and submit paper on knowledge networks of Wikipedia readers.

weekly update:

  • implemented model to predict scores on curiosity (specifically the deprivation sensitivity facet) from metrics of knowledge networks (code)
Nov 25 2022, 6:44 PM · Research (FY2022-23-Research-January-March)

Nov 10 2022

MGerlach added a comment to T322354: Run bulk analysis of readability scores on different Wikipedias.

weekly update:

  • created slightly adapted model which maps the readability score from the language-agnostic model to a grade level (roughly the number of years of education needed to understand a text) since that is more interpretable. with this we can map the distribution over (a random subset of) articles of a wiki.

Screenshot from 2022-11-10 18-50-27.png (1×457 px, 78 KB)

Nov 10 2022, 5:55 PM · Research (FY2022-23-Research-January-March)
MGerlach added a comment to T316372: Write and submit paper on knowledge networks of Wikipedia readers.

weekly update:

  • created datasets for knowledge networks for 14 wikis (the same we considered in the why we read wikipedia studies)
Nov 10 2022, 5:47 PM · Research (FY2022-23-Research-January-March)
MGerlach added a comment to T315086: Create a list with examples of automatically suggested copyedits for manual evaluation.

weekly updates:

  • Kirsten shared results from 2nd round of manual evaluation (spreadsheet)
  • refinements for LanguageTool substantially improve the accuracy. my guess is that filtering certain types of rules (https://community.languagetool.org/rule/list?lang=en) is the main driver of this improvement.
  • surfacing copyedits from a list of common misspellings seems very effective in terms of surfacing high-precision copyedits. the main challenge will then be to curate such lists such that there are enough copyedits and that there is a sufficient diversity of copyedits (and not just one particular misspelling again and again)
Nov 10 2022, 5:46 PM · Growth-Copy-Edit, Research (FY2022-23-Research-July-September)
MGerlach added a comment to T299245: Develop a model to detect sentences that need copy-editing.

weekly update:

  • discussed with Djellel; we refined the plan for development of the model in terms of features, training data and test data
  • planning to start implementing in the next week(s)
Nov 10 2022, 5:40 PM · Research (FY2022-23-Research-January-March)
MGerlach added a comment to T316369: Plan qualitative analysis on language-agnostic readability scores.

weekly update:

  • reached consensus around doing a survey on readability with readers
  • will be setting up a meeting with Indira and their team to discuss details and next steps.
Nov 10 2022, 5:38 PM · Research (FY2022-23-Research-October-December)

Nov 3 2022

MGerlach added a comment to T322354: Run bulk analysis of readability scores on different Wikipedias.

weekly update:

  • Calculated readability scores for a random subset of 1000 articles in 17 wikis (cawiki, dawiki, dewiki, enwiki/simplewiki, eswiki, fiwiki, frwiki, huwiki, itwiki, nlwiki, nowiki, ptwiki, rowiki, ruwiki, svwiki, trwiki)
    • Ideally, we would like to get readability scores for every article in a project. However, when calculating the language-agnostic scores, we need to call the DBPedia-spotlight API to get the language-agnostic features. This step is time-consuming. Therefore, in a first iteration, we only get a representative sample
    • Similarly, we focus on 17 wikis since those are the languages which are supported by the DBPedia-spotlight API
  • We see noticeable differences in the distribution of readability scores (low = easier to read, high = harder to read); specifically, simplewiki has much lower scores than enwiki

Screenshot from 2022-11-03 18-32-11.png (1×448 px, 80 KB)

Nov 3 2022, 5:33 PM · Research (FY2022-23-Research-January-March)
MGerlach created T322354: Run bulk analysis of readability scores on different Wikipedias.
Nov 3 2022, 5:23 PM · Research (FY2022-23-Research-January-March)
MGerlach added a comment to T316367: Evaluate the impact of de-orphanization of articles in terms of visibility.

weekly update:

  • refined analysis setup. the aim is to show that articles that are de-orphanized by editors show an increase in visibility
    • dataset: all articles that were de-orphanized in a given month
    • metric: we operationalize visibility of the target article (the orphan that gets de-orphanized) in three different ways: i) number of pageviews to the source articles, i.e. all articles linking to the target article; ii) number of clicks from the source articles to the target article; iii) number of previews from the source articles to the target article.
    • in each case the visibility before de-orphanization is X- = 0 and after de-orphanization is X+ >= 0. therefore, we aim for the following two analyses: i) is there an increase in visibility, i.e. is X+ statistically significantly different from 0; ii) how big is the increase in visibility, i.e. comparing the increase from X- to X+ to the average increase in visibility of other articles that received new inlinks.
  • next step: implement analysis
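
Because X- is 0 by construction, analysis i) reduces to a one-sample question about X+. As a hedged illustration (the task does not specify which test is planned), the sketch below uses a percentile bootstrap for the mean of X+ and also computes the share of values above a threshold. The function names, the bootstrap itself, and the example numbers in the usage note are assumptions for illustration only.

```python
import random
from statistics import mean


def bootstrap_mean_ci(values, n_boot=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(values)
    # Resample with replacement and record the mean of each resample.
    means = sorted(mean(rng.choices(values, k=n)) for _ in range(n_boot))
    lower = means[int((alpha / 2) * n_boot)]
    upper = means[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


def share_above(values, threshold=10):
    """Fraction of values strictly greater than `threshold`."""
    return sum(v > threshold for v in values) / len(values)
```

For made-up monthly counts such as [0, 3, 12, 40, 7, 0, 25, 60, 1, 15], a lower bound above 0 would support an increase in visibility, and share_above gives the fraction with more than 10 clicks. Analysis ii) would additionally need the same quantities for a comparison group of articles that received new inlinks.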
Nov 3 2022, 5:18 PM · Research (FY2022-23-Research-January-March)
MGerlach moved T316372: Write and submit paper on knowledge networks of Wikipedia readers from FY2022-23-Research-July-September to FY2022-23-Research-October-December on the Research board.
Nov 3 2022, 5:09 PM · Research (FY2022-23-Research-January-March)
MGerlach moved T316367: Evaluate the impact of de-orphanization of articles in terms of visibility from FY2022-23-Research-July-September to FY2022-23-Research-October-December on the Research board.
Nov 3 2022, 5:09 PM · Research (FY2022-23-Research-January-March)
MGerlach added a comment to T316372: Write and submit paper on knowledge networks of Wikipedia readers.

Weekly update

  • adding analysis around how knowledge networks differ with the topic of the articles
  • running analysis to quantify variation of knowledge networks with country and language-version
Nov 3 2022, 5:09 PM · Research (FY2022-23-Research-January-March)
MGerlach updated the task description for T322070: Onboard Nazia onto analytics infrastructure.
Nov 3 2022, 1:49 PM · Research

Nov 2 2022

MGerlach created T322222: Grant Access to ldap/nda for Appledora.
Nov 2 2022, 2:51 PM · SRE, LDAP-Access-Requests

Oct 28 2022

MGerlach added a comment to T316369: Plan qualitative analysis on language-agnostic readability scores.

weekly update:

  • organized my thoughts around next steps on analysis on readability. there are three main aspects to pursue next:
    • improve the model: i) extend the number of supported languages, ii) explore the effect of the topic or the quality on the readability score
    • understand perception of readability of Wikipedia articles through surveys
    • measure the readability of all articles in a project. this will allow us to compare the expected number of years of education required to understand the articles (e.g. the Flesch-Kincaid reading level roughly maps to grade level) with the actual number of years of education of Wikipedia's readers (e.g. from recent reader-demographics survey)
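
For reference, the Flesch-Kincaid grade level mentioned above is a fixed formula over word, sentence and syllable counts. The sketch below implements the standard English formula only; it is not the language-agnostic model, and the counts are taken as inputs because syllable counting is language-specific.

```python
def flesch_kincaid_grade(n_words, n_sentences, n_syllables):
    """Flesch-Kincaid grade level (standard English formula).

    Roughly the number of years of schooling needed to understand the text.
    """
    words_per_sentence = n_words / n_sentences
    syllables_per_word = n_syllables / n_words
    return 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59
```

For example, a text with 100 words, 5 sentences and 150 syllables gets 0.39 * 20 + 11.8 * 1.5 - 15.59 = 9.91, i.e. roughly a tenth-grade reading level.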
Oct 28 2022, 2:32 PM · Research (FY2022-23-Research-October-December)
MGerlach added a comment to T315086: Create a list with examples of automatically suggested copyedits for manual evaluation.

Weekly update:

  • created a more balanced sample of copyedits from lists of common misspellings (spreadsheet: copyedits_v2_common-misspellings-balanced)
    • some misspellings occur very often while many/most occur rarely which leads to an overrepresentation of a few misspellings
    • I parse the whole dump of the respective Wikipedias (ar, bn, cs, es, en) and collect all occurrences of each misspelling
    • I then keep at most 5 occurrences of each misspelling
  • more generally, this suggests we would need to figure out ways to extend the list of common misspellings in order to surface a more diverse set of misspellings
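
The balancing step can be sketched as follows. This is an illustration rather than the script that was used: the `(misspelling, location)` input format is hypothetical, and choosing the kept occurrences at random is an assumption, since the update only states that at most 5 occurrences per misspelling are kept.

```python
import random
from collections import defaultdict


def balanced_sample(occurrences, max_per_misspelling=5, seed=0):
    """Keep at most `max_per_misspelling` occurrences of each misspelling.

    occurrences: iterable of (misspelling, location) tuples, where location
    is whatever identifies the hit (e.g. article title and paragraph).
    """
    rng = random.Random(seed)
    by_word = defaultdict(list)
    for word, location in occurrences:
        by_word[word].append(location)

    sample = []
    for word, locations in by_word.items():
        k = min(max_per_misspelling, len(locations))
        # Assumption: pick the kept occurrences uniformly at random.
        sample.extend((word, loc) for loc in rng.sample(locations, k))
    return sample
```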
Oct 28 2022, 2:24 PM · Growth-Copy-Edit, Research (FY2022-23-Research-July-September)

Oct 20 2022

MGerlach added a comment to T316372: Write and submit paper on knowledge networks of Wikipedia readers.

Weekly update

  • generated additional dataset to analyze how topics affect the knowledge networks
  • generated additional dataset to analyze how knowledge networks might differ across countries; planning to similarly check how knowledge networks might differ across different languages (so far we only checked enwiki)
Oct 20 2022, 4:42 PM · Research (FY2022-23-Research-January-March)
MGerlach added a comment to T315086: Create a list with examples of automatically suggested copyedits for manual evaluation.

Weekly update

  • compiled a new sample of copyedit errors in this spreadsheet (v2) https://docs.google.com/spreadsheets/d/1ponuT-jwEM4KF9XCG1Q86mkDYEGm8-F66PaG1U0XQ3M/edit#gid=0
    • following the same approach as before, I used the 20220801 snapshot of the HTML-dumps. I then selected the first 100 articles for which there is at least one error left after the filtering. We only consider articles that have not been edited in at least 1 year. For each article I picked only one error randomly such that we have 100 errors from 100 different articles.
    • I improved some of the details of the pre-processing of the text and the post-processing of the errors.
      • pre-processing: we now keep the paragraph-id from the HTML where the error was found. This will make it easier to re-localize the error downstream in the application
      • post-processing: improved the filtering of errors. we now filter out: i) errors where the only correction is a hyphen (previous corrections seemed too strict and debatable); ii) errors which are marked by sic; iii) errors which appear in quotes; iv) errors which relate to proper nouns (filter words where the first letter is capitalized)
  • List of common misspellings: I looked for misspelled words in ar, bn, cs, es, en using the list of misspelled words compiled by the ambassadors (thank you!). These lists each contain 20 misspelled words. Roughly 1 out of 1000 articles has at least one of these spelling mistakes (an indication we might need to extend this list of common spelling mistakes in some cases)
    • ar: 100 articles with errors after checking 94,994 articles
    • bn: 100 articles with errors after checking 54,189 articles
    • cs: 100 articles with errors after checking 124,892 articles
    • es: 56 articles with errors after checking all 1,732,127 articles in the dump
    • en: 100 articles with errors after checking 1,277,730 articles (for this I actually used a much longer list of common spelling mistakes, https://en.wikipedia.org/wiki/Wikipedia:Lists_of_common_misspellings/For_machines#The_Machine-Readable_List, since the original list did not yield errors in any article)
  • LanguageTool: I looked for errors from LanguageTool in ar, es, en (unfortunately not supported in bn and cs). I filtered a set of error-categories/types/rules based on the most common false positives from the evaluation in the previous round (v1). Note that, in principle, these rules can be added/removed to the filter separately for each wiki
    • PUNCTUATION (category): feedback mentioned many false positives related to adding commas
    • STYLE (category): feedback mentioned many false positives where the error isn't clearly wrong or right
    • REDUNDANCY (category): feedback mentioned many false positives where the error isn't clearly wrong or right
    • HUNSPELL_RULE_AR (rule): rule related to typos in Arabic which caused many false positives
    • MORFOLOGIK_RULE_ES (rule): rule related to typos in Spanish which caused many false positives
    • UPPERCASE_SENTENCE_START (rule): rule requiring capital letter at sentence start; this yielded many false positives due to abbreviations causing wrong identification of the beginning of a sentence
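
The filter described above can be sketched as a simple predicate over LanguageTool matches. The category and rule IDs are the ones listed; the shape of a match (a dict with `rule["id"]` and `rule["category"]["id"]`) is an assumption modelled on LanguageTool's JSON output, and `keep_match` is a hypothetical helper, not the code used for the evaluation.

```python
# IDs taken from the list above; in principle each wiki could use its own sets.
EXCLUDED_CATEGORIES = {"PUNCTUATION", "STYLE", "REDUNDANCY"}
EXCLUDED_RULES = {
    "HUNSPELL_RULE_AR",
    "MORFOLOGIK_RULE_ES",
    "UPPERCASE_SENTENCE_START",
}


def keep_match(match, categories=EXCLUDED_CATEGORIES, rules=EXCLUDED_RULES):
    """Return True if a match survives the category and rule filters.

    Assumed match shape: {"rule": {"id": ..., "category": {"id": ...}}}.
    """
    rule = match["rule"]
    return rule["id"] not in rules and rule["category"]["id"] not in categories
```

Applying it is a one-liner, e.g. `[m for m in matches if keep_match(m)]`, and passing different sets per wiki keeps the filter configurable as noted above.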
Oct 20 2022, 4:33 PM · Growth-Copy-Edit, Research (FY2022-23-Research-July-September)

Oct 13 2022

MGerlach added a comment to T316369: Plan qualitative analysis on language-agnostic readability scores.

Weekly update:

Oct 13 2022, 5:08 PM · Research (FY2022-23-Research-October-December)
MGerlach added a comment to T315086: Create a list with examples of automatically suggested copyedits for manual evaluation.

Weekly update:

Oct 13 2022, 5:04 PM · Growth-Copy-Edit, Research (FY2022-23-Research-July-September)