
MGerlach (Martin Gerlach)
Research Scientist

User Details

User Since
Sep 9 2019, 9:50 AM (147 w, 3 d)
Availability
Available
IRC Nick
mgerlach
LDAP User
MGerlach
MediaWiki User
MGerlach (WMF) [ Global Accounts ]

Recent Activity

Fri, Jul 1

MGerlach updated the task description for T305258: Exploratory analysis of readers’ knowledge networks (Q4).
Fri, Jul 1, 4:12 PM · Research (FY2021-22-Research-April-June)
MGerlach added a comment to T305258: Exploratory analysis of readers’ knowledge networks (Q4).

Update week 2022-06-27:

  • writing up results of first round of exploratory analysis for meta
Fri, Jul 1, 4:12 PM · Research (FY2021-22-Research-April-June)
MGerlach added a comment to T299091: Implement language-agnostic approach for readability (Q3, Q4).

Update week 2022-06-27:

  • finished the analysis of the first version of the language-agnostic model for readability. In the supervised classification of articles with annotated readability levels (easy vs. hard), it performs better than standard readability formulas when applied to other languages without fine-tuning or re-training
  • finishing the write-up of the results in the next 1-2 weeks to be added to meta
Fri, Jul 1, 4:10 PM · Research (FY2021-22-Research-April-June)
MGerlach updated the task description for T299091: Implement language-agnostic approach for readability (Q3, Q4).
Fri, Jul 1, 4:05 PM · Research (FY2021-22-Research-April-June)
MGerlach added a comment to T299092: Evaluation of link recommendations to orphan articles via link translation (Q3/Q4) .

Update week 2022-06-27:

  • figured out how to query morelike and findlink-recommendations as baselines
  • started summarizing and writing up the results on the evaluation of the link recommendation for the meta-page
Fri, Jul 1, 4:03 PM · Research (FY2021-22-Research-April-June)

Fri, Jun 24

MGerlach added a comment to T305258: Exploratory analysis of readers’ knowledge networks (Q4).

Update week 2022-06-20:

  • debugging the calculation of some metrics for the knowledge networks
  • summarizing results and discussing next steps
Fri, Jun 24, 4:02 PM · Research (FY2021-22-Research-April-June)
MGerlach added a comment to T299091: Implement language-agnostic approach for readability (Q3, Q4).

Update week 2022-06-20:

  • doing some robustness checks and improvements to the model, e.g. improving the sentence tokenization in German (code). nltk's default sentence tokenizer could not handle the heavy use of abbreviations (such as "bzw." or "z. B."), which led to artificially short sentences in German Wikipedia and likely hurt the performance of the language-agnostic model
  • summarizing the results and discussing next steps
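
The abbreviation problem described above can be illustrated with a minimal sketch. This is not the project's code and does not use nltk: it is a naive splitter that breaks on sentence-final punctuation followed by whitespace, unless the token ending there is in a small abbreviation list. The list and the function name are assumptions made for illustration only.

```python
import re

# Hypothetical, hand-picked abbreviation list; a real list would be much longer.
GERMAN_ABBREVIATIONS = {"bzw.", "z.", "b.", "usw."}

def split_sentences(text, abbreviations=GERMAN_ABBREVIATIONS):
    """Split at '.', '!' or '?' followed by whitespace, skipping abbreviations."""
    sentences, start = [], 0
    for match in re.finditer(r"[.!?]\s+", text):
        # The token that ends at this punctuation mark, e.g. "bzw." or "B."
        candidate = text[start:match.end()].split()[-1].lower()
        if candidate in abbreviations:
            continue  # treat as an abbreviation, not a sentence boundary
        sentences.append(text[start:match.end()].strip())
        start = match.end()
    if text[start:].strip():
        sentences.append(text[start:].strip())
    return sentences

example = "Das gilt z. B. für Hunde bzw. Katzen. Es regnet."
with_list = split_sentences(example)
without_list = split_sentences(example, abbreviations=set())
```

With the abbreviation list the example yields two sentences; without it, the same text is cut into five fragments, which is the kind of artificially short sentence described in the update. A single-letter entry such as "b." will also suppress genuine boundaries (e.g. after "Plan B."), so a production tokenizer needs more context than this sketch uses.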
Fri, Jun 24, 4:01 PM · Research (FY2021-22-Research-April-June)

Fri, Jun 17

MGerlach added a comment to T299245: Develop a model to detect sentences that need copy-editing (Q3+).

Update week 2022-06-13:
Discussing with Djellel we clarified the goal:

  • the evaluation of LanguageTool (T305180) showed that it can surface meaningful copyedits (and suggestions for improvement) in Wikipedia articles, but that we have to deal with the challenge that it might surface many false positives. Therefore, we would like to build a model that can assign a confidence score to LanguageTool's copyedits in Wikipedia articles. Then we can work through the list from top to bottom, highlighting the copyedits about which we are most confident.
  • The first step is to get a labeled dataset of copyedits in Wikipedia articles. A promising candidate (thanks to Marshall's suggestion) is the set of suggested edits in which newcomers are guided (without actual recommendations) to do copyedits. We can extract the corresponding edits in all deployed languages via the revision_tag "newcomer task: copyedit" (see an example in enwiki). Analysing the corresponding diffs should give us a multilingual labeled dataset to train a model that can rank LanguageTool's errors.
Fri, Jun 17, 3:41 PM · Research (FY2021-22-Research-April-June)
MGerlach added a comment to T299091: Implement language-agnostic approach for readability (Q3, Q4).

Update week 2022-06-13:

  • comparing the supervised model across different languages. The language-agnostic model outperforms the standard readability formulas in all languages other than English (on which the model was explicitly trained). The language-agnostic model yields comparable results across all languages without any re-training or fine-tuning (which will be the typical case when applying it to other languages for which there is no training data).
  • we will be running some robustness checks of the model (hyperparameter grid-search etc.) and will try to add some more language-agnostic features before writing a summary report.
Fri, Jun 17, 3:27 PM · Research (FY2021-22-Research-April-June)
MGerlach updated the task description for T299091: Implement language-agnostic approach for readability (Q3, Q4).
Fri, Jun 17, 3:22 PM · Research (FY2021-22-Research-April-June)
MGerlach added a comment to T305258: Exploratory analysis of readers’ knowledge networks (Q4).

Update week 2022-06-13:

  • added a dataset of knowledge networks from targeted navigation (wikispeedia-data) for comparison. The corresponding networks are substantially different from those of Wikipedia readers' navigation (specifically, they have shorter characteristic path lengths)
  • compiled different results from the first exploratory analysis. We will prepare a summary and have started to discuss next steps for the analysis.
Fri, Jun 17, 3:22 PM · Research (FY2021-22-Research-April-June)
MGerlach closed T305180: Evaluate copyedits from LanguageTool (Q4), a subtask of T293034: [EPIC] Research support for Copyediting as a structured tasks, as Resolved.
Fri, Jun 17, 3:17 PM · Research, Epic
MGerlach closed T305180: Evaluate copyedits from LanguageTool (Q4) as Resolved.
Fri, Jun 17, 3:16 PM · Research (FY2021-22-Research-April-June)

Fri, Jun 10

MGerlach added a comment to T305258: Exploratory analysis of readers’ knowledge networks (Q4).

Update week 2022-06-06:

  • continuing the analysis of knowledge networks of readers. The main principal component seems to be associated with the hunter and busybody types described in Lydon-Staley et al. Performing robustness checks of the results by calculating additional network metrics and comparing with random-walk models.
Fri, Jun 10, 5:04 PM · Research (FY2021-22-Research-April-June)
MGerlach added a comment to T299091: Implement language-agnostic approach for readability (Q3, Q4).

Update week 2022-06-06:

  • comparing the language-agnostic with the language-specific approach in the unsupervised task. The non-customized (per language) readability formulas (such as Flesch-Reading-ease) perform better for English but the language-agnostic approach generalizes better for the other languages
  • starting evaluation of the supervised model in non-English data
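
For reference, the Flesch Reading Ease score mentioned above is 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words). The sketch below implements that formula with a deliberately crude, English-only syllable heuristic (counting vowel groups); the helper names are illustrative and this is not the code used in the analysis.

```python
import re

def count_syllables(word):
    """Rough English heuristic: number of vowel groups, at least 1."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_reading_ease(text):
    """Flesch Reading Ease; higher scores indicate easier text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return (206.835
            - 1.015 * (len(words) / len(sentences))
            - 84.6 * (syllables / len(words)))

easy = flesch_reading_ease("The cat sat on the mat. The dog ran.")
hard = flesch_reading_ease(
    "Institutional heterogeneity complicates comparative evaluation "
    "of organizational performance."
)
```

Both the word pattern and the syllable heuristic are tied to English orthography, which illustrates why such a formula needs per-language customization before it can be applied elsewhere.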
Fri, Jun 10, 4:58 PM · Research (FY2021-22-Research-April-June)
MGerlach added a comment to T305180: Evaluate copyedits from LanguageTool (Q4).

Updates week 2022-06-06:

  • presented results to the Growth Team (slidedeck). They will discuss how best to proceed and get back to us
  • started some additional analysis comparing performance of spellcheckers with that of LanguageTool, will add results next week
Fri, Jun 10, 4:53 PM · Research (FY2021-22-Research-April-June)

Wed, Jun 8

MGerlach added a comment to T310120: Add a link: per language word exclusion list.

I suppose this could work the same way as the exclusion sections T304150: Allow communities to configure which sections are excluded from link suggestion generation / T279519: Add a link: algorithm improvements: Avoid recommending links in sections that usually don't have links. It would probably be a "medium" level of effort to implement this on a per-word level.

When we have identified an anchor-candidate, we do some filtering here (e.g. same anchor is already linked in the article), so that might be a good place to check whether the anchor-candidate is in the word-exclusion list.
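
A minimal sketch of where such a check could sit, assuming the existing filter and the proposed exclusion list are both simple set lookups. The function and parameter names are hypothetical and do not come from the Add-Link codebase.

```python
def keep_anchor_candidate(anchor, linked_anchors, excluded_words):
    """Return True if the candidate anchor passes both filters.

    linked_anchors: lower-cased anchors already linked in the article.
    excluded_words: lower-cased per-language word-exclusion list (hypothetical).
    """
    normalized = anchor.strip().lower()
    if normalized in linked_anchors:    # existing filter: already linked
        return False
    if normalized in excluded_words:    # proposed filter: excluded word
        return False
    return True

linked = {"berlin"}
excluded = {"monday", "2022"}
candidates = ["Berlin", "Monday", "Spree"]
kept = [c for c in candidates if keep_anchor_candidate(c, linked, excluded)]
```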

Wed, Jun 8, 9:28 AM · Growth-Team, Add-Link

Jun 3 2022

MGerlach added a comment to T305258: Exploratory analysis of readers’ knowledge networks (Q4).

Update week 2022-05-30:

  • given the similarity in the knowledge networks of Wikipedia readers and KNOT-study, we have been trying to reproduce the findings that there are 2 main types of readers characterized by hunter- and busybody-style of curiosity.
Jun 3 2022, 4:29 PM · Research (FY2021-22-Research-April-June)
MGerlach added a comment to T299091: Implement language-agnostic approach for readability (Q3, Q4).

Update week 2022-06-03

  • implemented language-agnostic features for measuring readability in different languages
  • started evaluation in datasets from 6 languages (ru,es,fr,en,de,it)
Jun 3 2022, 3:43 PM · Research (FY2021-22-Research-April-June)
MGerlach updated the task description for T299091: Implement language-agnostic approach for readability (Q3, Q4).
Jun 3 2022, 3:40 PM · Research (FY2021-22-Research-April-June)
MGerlach added a comment to T305180: Evaluate copyedits from LanguageTool (Q4).

Updates week 2022-06-03:

  • evaluated performance of LanguageTool for detecting copyedits in more than 20 different language versions of Wikipedia using the approach of comparing error-rates in articles with featured-article-badge and articles with copyedit-template
  • I showed how we can apply an additional post-processing step in which we filter errors using the rich annotations of the text contained in the HTML-version of the articles which leads to a substantial improvement in the performance in almost all wikis
  • I added a detailed write-up of the evaluation of LanguageTool to meta: https://meta.wikimedia.org/wiki/Research:Copyediting_as_a_structured_task/LanguageTool
Jun 3 2022, 2:20 PM · Research (FY2021-22-Research-April-June)
MGerlach updated the task description for T305180: Evaluate copyedits from LanguageTool (Q4).
Jun 3 2022, 2:16 PM · Research (FY2021-22-Research-April-June)
MGerlach added a comment to T309353: When suggesting a link, this link shouldn't be preceding or following an existing link.

Thanks for flagging this.
The reported behavior is expected to happen (sometimes) as the current model does not check for already existing links next to the anchor of the recommended link.
In principle, it is possible to control for that. When iterating through the text to find candidate anchors for new links (code), we would have to keep track of what comes before and what comes after and check whether it is a link or not. However, currently we only iterate through text-nodes of mwparserfromhell's output (ignoring any other nodes such as links) since these are the only type of elements where we want to place new links. Thus, we would have to slightly change the way we iterate through the nodes to be able to check whether a preceding or following node (directly next to a candidate anchor) is of type WikiLink.
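
The neighbor check can be sketched as follows. To keep the example self-contained it uses a simplified stand-in for parser nodes (a `Node` dataclass with a `kind` field) rather than mwparserfromhell's actual node classes; all names here are assumptions for illustration.

```python
from dataclasses import dataclass

@dataclass
class Node:
    kind: str    # "text" or "wikilink" (simplified stand-in for parser node types)
    value: str

def anchor_touches_link(nodes, index, start, end):
    """True if the anchor nodes[index].value[start:end] sits directly next to a link.

    The anchor is adjacent to a link when only whitespace separates it from
    the edge of its text node and the neighboring node is a wikilink.
    """
    text = nodes[index].value
    before, after = text[:start], text[end:]
    prev_is_link = index > 0 and nodes[index - 1].kind == "wikilink"
    next_is_link = index + 1 < len(nodes) and nodes[index + 1].kind == "wikilink"
    return (prev_is_link and not before.strip()) or (next_is_link and not after.strip())

nodes = [Node("wikilink", "Berlin"), Node("text", " Wall was built in "), Node("wikilink", "1961")]
text = nodes[1].value
wall = text.index("Wall")
built = text.index("built")
```

Here the candidate "Wall" directly follows the link to "Berlin" and would be rejected, while "built" has ordinary text on both sides and would be kept.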

Jun 3 2022, 8:22 AM · Research, Growth-Team, Add-Link

Jun 1 2022

MGerlach added a comment to T309263: Inspect jawiki and aswiki "add a link" models to improve their performance.

Comment on jawiki: I believe the problem comes from the tokenization used to generate the n-grams that serve as anchor-candidates. Our current approach (here) uses NLTK's standard tokenizer. Among other things, it relies on whitespace to identify word boundaries. However, as I understand it, this does not work for Japanese, which is written (mostly) without whitespace (see here or here). Thus, we need to improve the tokenizer for Japanese (and potentially for other languages with similar properties). A starting point might be this blogpost, which gives an overview of different techniques for tokenization.
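
The effect can be shown with a small sketch that does not use NLTK: a whitespace-based n-gram generator next to a character n-gram generator. The character n-grams are only a crude stand-in for a proper Japanese tokenizer, and both function names are illustrative.

```python
def whitespace_ngrams(text, max_n=3):
    """n-grams over whitespace-separated tokens."""
    tokens = text.split()
    return [" ".join(tokens[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(tokens) - n + 1)]

def character_ngrams(text, max_n=3):
    """n-grams over non-whitespace characters (crude fallback, not a real tokenizer)."""
    chars = [c for c in text if not c.isspace()]
    return ["".join(chars[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(chars) - n + 1)]

english = whitespace_ngrams("New York City")
japanese_ws = whitespace_ngrams("東京は日本の首都です")
japanese_char = character_ngrams("東京は日本の首都です")
```

For the English string the whitespace approach produces usable candidates such as "New York". For the Japanese sentence it returns the whole sentence as a single candidate, so an anchor like "東京" is never proposed; the character n-grams do contain it, at the cost of many meaningless candidates.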

Jun 1 2022, 10:28 AM · Growth-Team, Machine-Learning-Team (Active Tasks), CommRel-Specialists-Support (Apr-Jun-2022), Add-Link

May 25 2022

leila awarded T302237: Outreachy Project (Round 24): Build Python library to work with html-dumps a Love token.
May 25 2022, 10:30 PM · Research (FY2021-22-Research-April-June), Outreach-Programs-Projects, Outreachy (Round 24)
MGerlach added a comment to T305180: Evaluate copyedits from LanguageTool (Q4).

Updates week 2022-05-23:

  • extending evaluation on Wikipedia to other languages (i.e. getting featured articles and articles with copyedit-templates)
May 25 2022, 12:56 PM · Research (FY2021-22-Research-April-June)
MGerlach added a comment to T304548: Deploy "add a link" to 4th round of wikis.

I'm keeping the two wikis in my freezer.

@MGerlach, when could we start checking on the models?

May 25 2022, 10:36 AM · Machine-Learning-Team (Active Tasks), User-notice, Growth-Team (Current Sprint), Add-Link

May 24 2022

MGerlach added a comment to T304548: Deploy "add a link" to 4th round of wikis.

Since @MGerlach mentioned that some heuristics are failing when working with the skipped wikis, I think the next steps would be to manually inspect the models with users who have experience with these languages or use google-translate as the link-recommendation algorithm is iteratively improved until the models pass the backtesting evaluation.

May 24 2022, 7:49 AM · Machine-Learning-Team (Active Tasks), User-notice, Growth-Team (Current Sprint), Add-Link

May 23 2022

MGerlach closed T302242: Outreachy Application Task (Round 24): Build Python library to work with html-dumps as Resolved.
May 23 2022, 8:15 AM · Research (FY2021-22-Research-April-June), Outreach-Programs-Projects, Outreachy (Round 24)
MGerlach closed T302242: Outreachy Application Task (Round 24): Build Python library to work with html-dumps, a subtask of T302237: Outreachy Project (Round 24): Build Python library to work with html-dumps, as Resolved.
May 23 2022, 8:15 AM · Research (FY2021-22-Research-April-June), Outreach-Programs-Projects, Outreachy (Round 24)

May 20 2022

MGerlach added a comment to T305258: Exploratory analysis of readers’ knowledge networks (Q4).

Update week 2022-05-16:

May 20 2022, 4:02 PM · Research (FY2021-22-Research-April-June)
MGerlach added a comment to T299092: Evaluation of link recommendations to orphan articles via link translation (Q3/Q4) .

Update week 2022-05-16:

  • investigated different heuristic approaches for baseline comparison. I ended up with 3 heuristics we can use almost out-of-the-box
    • link-reciprocity: suggest any of the existing outgoing links as new incoming link (can be obtained easily from the pagelinks-table)
    • morelike: recommend similar articles from CirrusSearch's morelike (the top-3 suggestions are shown in the mobile version as "related articles"). Can be easily obtained by querying the MediaWiki API
    • findlink-tool: this tool suggests articles to link from based on text-search. it is listed in the orphan-template as a way to find suggestions how to de-orphanize articles.
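
Two of these heuristics are simple enough to sketch. The first function computes link-reciprocity candidates from sets of outgoing and incoming links (assumed to be already extracted, e.g. from the pagelinks table). The second only builds a MediaWiki API search URL using the `morelike:` keyword and performs no network request. Function names and defaults are illustrative, not taken from the evaluation code.

```python
from urllib.parse import urlencode

def link_reciprocity_candidates(outlinks, inlinks):
    """Articles the target links to that do not yet link back to it."""
    return sorted(set(outlinks) - set(inlinks))

def morelike_query_url(title, lang="en", limit=10):
    """Build a MediaWiki API search URL using CirrusSearch's morelike keyword."""
    params = {
        "action": "query",
        "list": "search",
        "srsearch": f"morelike:{title}",
        "srlimit": limit,
        "format": "json",
    }
    return f"https://{lang}.wikipedia.org/w/api.php?" + urlencode(params)

candidates = link_reciprocity_candidates(["A", "B", "C"], ["B"])
url = morelike_query_url("Ada Lovelace")
```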
May 20 2022, 3:05 PM · Research (FY2021-22-Research-April-June)
MGerlach added a comment to T299091: Implement language-agnostic approach for readability (Q3, Q4).

Update week 2022-05-16:

  • adding more features to the language-agnostic model to improve accuracy: i) document-level features such as the type-token ratio; ii) distance features such as the number of words between mentions of the same entity.
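
A minimal sketch of both kinds of feature, assuming the article is already available as a flat sequence of items (tokens or entity identifiers). How that sequence is produced is left open, and the function names are hypothetical.

```python
def type_token_ratio(tokens):
    """Number of distinct items divided by total number of items."""
    if not tokens:
        return 0.0
    return len({t.lower() for t in tokens}) / len(tokens)

def mention_gaps(sequence, entity):
    """Number of items between consecutive mentions of the same entity."""
    positions = [i for i, item in enumerate(sequence) if item == entity]
    return [b - a - 1 for a, b in zip(positions, positions[1:])]

ttr = type_token_ratio("the cat saw the dog".split())
gaps = mention_gaps(["Q1", "a", "b", "Q1", "c", "Q1"], "Q1")
```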
May 20 2022, 2:59 PM · Research (FY2021-22-Research-April-June)
MGerlach added a comment to T305180: Evaluate copyedits from LanguageTool (Q4).

Updates week 2022-05-16

May 20 2022, 2:22 PM · Research (FY2021-22-Research-April-June)

May 19 2022

MGerlach added a comment to T308146: Integrate the model training and the deployment of "Add a link" to new Wikipedias exiting the Incubator.

@kostajh I don't think there is a well-defined minimum. In principle, you can train on anything, though fewer articles will mean less training data. The question is then whether this is enough for the model to learn meaningful patterns. I honestly don't have a well-informed answer for that. We should try in any case for these wikis. We could track the performance of the backtesting evaluation for wikis of different sizes and check if there is a significant drop when the number of articles becomes too small.

May 19 2022, 2:03 PM · Machine-Learning-Team, incubator.wikimedia.org, Growth-Team, Add-Link

May 18 2022

MGerlach added a comment to T304548: Deploy "add a link" to 4th round of wikis.

@kevinbazira great.
Most of the languages look fine but there are some red flags:

  • jawiki with extremely low recall (0.01) and very low precision at 0.32
  • aswiki with much lower precision than other wikis (0.57)
May 18 2022, 9:11 AM · Machine-Learning-Team (Active Tasks), User-notice, Growth-Team (Current Sprint), Add-Link

May 13 2022

MGerlach added a comment to T305258: Exploratory analysis of readers’ knowledge networks (Q4).

Update week 2022-05-09:

  • looking at descriptive statistics of network metrics and comparing with statistics from KNOT-data
May 13 2022, 3:31 PM · Research (FY2021-22-Research-April-June)
MGerlach added a comment to T299092: Evaluation of link recommendations to orphan articles via link translation (Q3/Q4) .

Update week 2022-05-09:

  • evaluating the baselines for comparison: i) simple heuristic based on reciprocating outgoing links; ii) nearest neighbors in embeddings of link network (currently training the embedding)
May 13 2022, 3:30 PM · Research (FY2021-22-Research-April-June)
MGerlach added a comment to T305180: Evaluate copyedits from LanguageTool (Q4).

Update week 2022-05-09:

  • Evaluated error-detection of LanguageTool in Wikipedia articles focusing on the problem of False Positives
  • The main problem is that we don't have an annotated dataset with copyedit errors in Wikipedia articles that we could use as ground truth. Therefore, I focused on "featured articles" (the highest quality class), assuming that these articles are free of errors (at least in the sense that we should highlight these errors to be corrected). In this line of thinking, any error thrown by LanguageTool is considered a false positive. This will give us an upper bound on the false-positive rate since some of these errors might still be genuine.
  • The dataset consists of 6090 featured articles in enwiki with 1,192,369 sentences. (code)
  • when using the language-code "en-US", we get 0.804 errors per sentence, i.e. almost every sentence will yield one false positive. this is consistent with the qualitative observations reported in T284550#7802765 when using LanguageTool's web interface which uses "en-US" as a default.
  • in contrast, when using language-code "en", we get only 0.065 errors per sentence, i.e. only about 1 false positive in every 15 sentences. This means that we have a more than 10-fold reduction in the number of false positives when switching from "en-US" to "en".
  • Thus, using the "en" language-code for LanguageTool will substantially reduce false positives in the context of Wikipedia. when using the LanguageTool-website (instead of the API) the default choice is en-US. In fact, there is no option to select "en" but only other variants such as en-GB. This explains the anecdotal observation of many false positives when copy-pasting text from Wikipedia articles into the web-interface of LanguageTool (T284550#7802765).
  • When comparing the different choices of the language-variants for the annotated benchmark-corpus (T305180#7909253), using "en" instead of "en-US" admittedly causes LanguageTool to detect fewer errors (a reduction in recall, e.g. from 0.14 to 0.08 in A.train), but it still detects thousands of errors. Considering the reduction in the number of false positives, this trade-off seems well worth it.
  • code: https://gitlab.wikimedia.org/repos/research/copyedit/-/blob/main/run_featured-articles_run-languagetool_bulk.ipynb
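
The arithmetic behind the "more than 10-fold" statement can be checked directly from the reported rates. The snippet below only uses the figures quoted above; the variable names are illustrative.

```python
# Reported errors per sentence on enwiki featured articles
errors_per_sentence = {"en-US": 0.804, "en": 0.065}

# How many times fewer flagged errors with "en" than with "en-US"
fold_reduction = errors_per_sentence["en-US"] / errors_per_sentence["en"]

# Average number of sentences per flagged error with "en"
sentences_per_error = 1 / errors_per_sentence["en"]
```

The ratio 0.804 / 0.065 is roughly 12.4, and 1 / 0.065 is roughly 15.4, consistent with "about 1 false positive in every 15 sentences".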
May 13 2022, 3:18 PM · Research (FY2021-22-Research-April-June)
MGerlach updated the task description for T305180: Evaluate copyedits from LanguageTool (Q4).
May 13 2022, 2:52 PM · Research (FY2021-22-Research-April-June)
MGerlach added a comment to T299091: Implement language-agnostic approach for readability (Q3, Q4).

Update week 2022-05-09:

  • built a supervised model for binary classification of an article into simple/hard trained on the English articles in enwiki and simplewiki
  • Random Forest classifier yields prediction accuracy of 0.709 using language-agnostic features on a held-out test set from same corpus. in comparison, the baseline using language-specific features yields 0.788.
  • it is expected that the latter yields a slightly better performance for English because these features were specifically developed for English and the language-agnostic features are a coarser approximation. however, we expect that the model using language-agnostic features will generalize better when applied to articles from other languages without further training or fine-tuning.
  • next steps: i) improving the model by adding some additional language-agnostic features based on distance of entities and hyperparameter tuning; ii) evaluating the trained model on non-English corpus.
May 13 2022, 2:52 PM · Research (FY2021-22-Research-April-June)

May 6 2022

MGerlach added a comment to T305258: Exploratory analysis of readers’ knowledge networks (Q4).

Update week 2022-05-02:

May 6 2022, 11:29 AM · Research (FY2021-22-Research-April-June)
MGerlach added a comment to T299091: Implement language-agnostic approach for readability (Q3, Q4).

Update week 2022-05-02:

  • first results for language-agnostic readability applied to the English dataset (simplewiki vs enwiki). detailed results: https://gitlab.wikimedia.org/repos/research/readability/-/blob/main/compute_readability/score%20entity%20based%20metrics.ipynb
    • language-agnostic features can systematically distinguish between texts from simplewiki and enwiki, e.g. avg sentence length in terms of the number of entities is 1.90 (simplewiki) vs 2.57 (enwiki)
    • language-agnostic features are strongly correlated to standard readability formulas such as Flesch-Kincaid
    • language-agnostic features perform almost as well as standard readability formulas in an unsupervised task, i.e. for the same article in two versions (simplewiki and enwiki), decide which one is simpler/harder based only on the readability score
  • Next steps: train supervised model to classify articles into simpler/harder based on a combination of features.
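
The unsupervised task described above amounts to a pairwise comparison. A minimal sketch, assuming each article contributes one score for its simplewiki version and one for its enwiki version; the function name and the example scores are made up for illustration.

```python
def pairwise_accuracy(pairs, higher_means_harder=True):
    """Fraction of article pairs whose scores are ordered correctly.

    pairs: list of (score_simple_version, score_hard_version).
    higher_means_harder: set to False for scores such as Flesch Reading Ease,
    where a higher value means easier text.
    """
    correct = 0
    for simple_score, hard_score in pairs:
        if higher_means_harder:
            correct += simple_score < hard_score
        else:
            correct += simple_score > hard_score
    return correct / len(pairs)

# Hypothetical entities-per-sentence scores for three article pairs
example_pairs = [(1.9, 2.6), (2.0, 1.8), (1.5, 3.0)]
accuracy = pairwise_accuracy(example_pairs)
```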
May 6 2022, 10:22 AM · Research (FY2021-22-Research-April-June)
MGerlach added a comment to T299092: Evaluation of link recommendations to orphan articles via link translation (Q3/Q4) .

Update week 2022-05-02:

  • finished first evaluation of link recommendation for orphan articles
  • evaluation data: 68,473 orphan articles that were de-orphanized between two consecutive snapshots (2022-01 vs 2022-02), we extract the in-links that were added to these articles as ground-truth. Note that this is less than 1% of the total number of orphan articles that exist (>8M).
  • we calculate recall@k, i.e. the fraction of times the added link was among the top-k suggestions from link-translation. Some example statistics for k=10:
    wiki_db    recall   total   recall@10
    enwiki        458    7389   0.0619840303153336
    plwiki        609    2368   0.25717905405405406
    arzwiki       464    2258   0.20549158547387067
    viwiki       1541    1823   0.8453099286889743
    frwiki        310    1810   0.1712707182320442
  • next step: compare with (simple) baseline from, e.g., recommendations from standard graph-embedding
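
The recall@k metric described above can be sketched as follows, assuming each evaluation example is a pair of a ranked suggestion list and the link that was actually added. The function names are illustrative and not taken from the evaluation code.

```python
def hit_at_k(ranked_suggestions, added_link, k=10):
    """True if the link that was actually added is among the top-k suggestions."""
    return added_link in ranked_suggestions[:k]

def recall_at_k(examples, k=10):
    """Fraction of examples whose added link appears in the top-k suggestions."""
    hits = sum(hit_at_k(suggestions, added, k) for suggestions, added in examples)
    return hits / len(examples)

examples = [
    (["A", "B", "C"], "B"),   # hit at rank 2
    (["D", "E", "F"], "X"),   # miss
]
toy_recall = recall_at_k(examples, k=2)

# The enwiki figure reported above: 458 hits out of 7389 examples
enwiki_recall = 458 / 7389
```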
May 6 2022, 10:11 AM · Research (FY2021-22-Research-April-June)
MGerlach added a comment to T305180: Evaluate copyedits from LanguageTool (Q4).

Update week 2022-05-02:

May 6 2022, 9:59 AM · Research (FY2021-22-Research-April-June)
MGerlach updated the task description for T305180: Evaluate copyedits from LanguageTool (Q4).
May 6 2022, 9:43 AM · Research (FY2021-22-Research-April-June)

Apr 29 2022

MGerlach added a comment to T299091: Implement language-agnostic approach for readability (Q3, Q4).

Update week 2022-04-25:

  • Selected features for supervised classification of readability levels for Language-agnostic approach and baseline using standard NLP-approaches.
Apr 29 2022, 6:09 PM · Research (FY2021-22-Research-April-June)
MGerlach added a comment to T305180: Evaluate copyedits from LanguageTool (Q4).

Update week 2022-04-25:

  • Familiarized myself with Errant, a tool for the evaluation of grammatical error detection. Starting in English we can evaluate LanguageTool’s precision and recall on a benchmark corpus.
Apr 29 2022, 6:07 PM · Research (FY2021-22-Research-April-June)
MGerlach added a comment to T305258: Exploratory analysis of readers’ knowledge networks (Q4).

Update week 2022-04-25:

Apr 29 2022, 6:05 PM · Research (FY2021-22-Research-April-June)
MGerlach updated the task description for T305258: Exploratory analysis of readers’ knowledge networks (Q4).
Apr 29 2022, 5:58 PM · Research (FY2021-22-Research-April-June)

Apr 27 2022

MGerlach added a comment to T302242: Outreachy Application Task (Round 24): Build Python library to work with html-dumps.

Hi everyone,
the final application deadline has passed. I wanted to thank you for all the hard work and effort you put into your submissions. You all did a really good job, not only in your analysis and notebooks but also in being curious, asking questions, and helping each other out!

Apr 27 2022, 4:28 PM · Research (FY2021-22-Research-April-June), Outreach-Programs-Projects, Outreachy (Round 24)

Apr 21 2022

MGerlach added a comment to T299091: Implement language-agnostic approach for readability (Q3, Q4).

Update week 2022-04-18:

  • took a deeper look into which of the features from the language-agnostic approach we should use. We adapted the features from the original study (Stajner/Hulpus 2020) in order to avoid as much as possible any language-specific parsing (beyond the entity-linking). Specifically, we removed any features relying on counting the number of tokens as this requires language-specific tokenizers. New adapted features include, for example, the average number of entities per sentence as a language-agnostic proxy to the typical measure of sentence length in number of words/tokens/syllables.
  • starting unsupervised evaluation of whether features can systematically distinguish the easier from the more difficult version of the same text and compare with baseline features from standard readability formulas.
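
The adapted sentence-length feature can be sketched as follows, assuming an entity linker has already produced, for each sentence, the list of entities it mentions. The input format, function name, and example identifiers are assumptions for illustration.

```python
def avg_entities_per_sentence(sentences):
    """Mean number of linked entities per sentence.

    sentences: list of lists of entity identifiers, one list per sentence.
    """
    if not sentences:
        return 0.0
    return sum(len(entities) for entities in sentences) / len(sentences)

# Hypothetical entity-linker output for a three-sentence text
linked = [["Q64", "Q183"], ["Q64"], ["Q183", "Q458", "Q46"]]
feature = avg_entities_per_sentence(linked)
```

Because the feature only counts entity identifiers, it needs no language-specific word or syllable tokenizer once sentences and entities are available.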
Apr 21 2022, 6:07 PM · Research (FY2021-22-Research-April-June)
MGerlach added a comment to T305258: Exploratory analysis of readers’ knowledge networks (Q4).

Update week 2022-04-18:

  • continuing to implement different network-metrics, including the comparison with random null models.
Apr 21 2022, 5:56 PM · Research (FY2021-22-Research-April-June)

Apr 19 2022

MGerlach added a comment to T302242: Outreachy Application Task (Round 24): Build Python library to work with html-dumps.

Hi all,
just a reminder: if you have not done so already, don't forget to submit your final application on the outreachy website before the deadline on Friday, April 22 at 4pm UTC (a little bit more than 3 days from when I am posting this).
Even if you sent your notebook to Isaac or me for feedback during the past weeks (thanks to everyone who shared their progress), you still need to submit the application on the outreachy-site. Please also make sure to include the public link to your notebook (see the documentation for how to get the public link).

Apr 19 2022, 10:47 AM · Research (FY2021-22-Research-April-June), Outreach-Programs-Projects, Outreachy (Round 24)
MGerlach added a comment to T302237: Outreachy Project (Round 24): Build Python library to work with html-dumps.

Hi all,
just a reminder: if you have not done so already, don't forget to submit your final application on the outreachy website before the deadline on Friday, April 22 at 4pm UTC (a little bit more than 3 days from when I am posting this).
Even if you sent your notebook to Isaac or me for feedback during the past weeks (thanks to everyone who shared their progress), you still need to submit the application on the outreachy-site. Please also make sure to include the public link to your notebook (see the documentation for how to get the public link).

Apr 19 2022, 10:45 AM · Research (FY2021-22-Research-April-June), Outreach-Programs-Projects, Outreachy (Round 24)

Apr 16 2022

MGerlach added a comment to T302242: Outreachy Application Task (Round 24): Build Python library to work with html-dumps.

@Talika2002 @Appledora @Robot_Jelly
Some comments on the final application, especially with respect to the project timeline and community specific questions:

  • The most important aspect of the application is the notebook from the application task. We won't be able to give additional feedback on the other aspects in the final application.
  • Project timeline: try to give a rough sketch of the different steps you plan to take in order to complete the project (T302237 mentions 4 phases: becoming familiar with the dumps, writing the code, writing the documentation, and performing analysis). No need to be perfect or too detailed. The aim is to organize what things you think should be done and in which order. This is not set in stone and will likely change as you embark on the project. Feel free to also use it to identify which aspects are most interesting to you and which you would like to spend more time on, as well as any additional steps you think are pertinent to your work on this project
  • Community-specific questions: you can skip this question if you want.
Apr 16 2022, 2:05 PM · Research (FY2021-22-Research-April-June), Outreach-Programs-Projects, Outreachy (Round 24)

Apr 14 2022

MGerlach added a comment to T305258: Exploratory analysis of readers’ knowledge networks (Q4).

Update week 2022-04-11:

  • Started to implement network features for different networks:
  • Clustering, Characteristic Path Length, Small-world propensity, and core-periphery structure (code). This requires some iterations to adapt to the specific use-case (e.g. weighted networks)
  • Extract meso-scale structure from fitting blockmodels with graph-tool. This allows us to quantify modularity and compression of the network (code)
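
For orientation, two of the metrics named above can be sketched in pure Python for the simplest case of an unweighted, undirected network stored as a dict of neighbor sets. This is an illustrative sketch only: it is not the project's code, it does not use graph-tool, and it ignores the weighted case mentioned in the update.

```python
from collections import deque
from itertools import combinations

def characteristic_path_length(adj):
    """Mean shortest-path length over all connected ordered node pairs (BFS)."""
    total, pairs = 0, 0
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for nb in adj[node]:
                if nb not in dist:
                    dist[nb] = dist[node] + 1
                    queue.append(nb)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs if pairs else 0.0

def average_clustering(adj):
    """Mean local clustering coefficient; nodes with degree < 2 count as 0."""
    coeffs = []
    for node, nbs in adj.items():
        k = len(nbs)
        if k < 2:
            coeffs.append(0.0)
            continue
        links = sum(1 for a, b in combinations(nbs, 2) if b in adj[a])
        coeffs.append(2 * links / (k * (k - 1)))
    return sum(coeffs) / len(coeffs)

triangle = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b"}}
path = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}
```

In the triangle every node is one step from every other and every neighbor pair is connected; in the three-node path the two end nodes are two steps apart and no neighbor pair is connected, so the two metrics separate these structures clearly.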
Apr 14 2022, 5:48 PM · Research (FY2021-22-Research-April-June)
MGerlach added a comment to T299091: Implement language-agnostic approach for readability (Q3, Q4).

Update week 2022-04-11:

Apr 14 2022, 5:46 PM · Research (FY2021-22-Research-April-June)

Apr 13 2022

MGerlach added a comment to T302237: Outreachy Project (Round 24): Build Python library to work with html-dumps.

Welcome to all the newer applicants. There is still some time left for the microtask (i.e. the application task T302242). The deadline for the final application is April 22. See T302242#7840521 for some additional comments.

Apr 13 2022, 2:13 PM · Research (FY2021-22-Research-April-June), Outreach-Programs-Projects, Outreachy (Round 24)

Apr 8 2022

MGerlach added a comment to T299091: Implement language-agnostic approach for readability (Q3, Q4).

Update week 2022-04-04:

  • implemented 6 baseline readability scores for the English-Simple-Wikipedia dataset. All of them show statistically significant differences between articles in simplewiki and enwiki. (code)
  • started to implement the entity linker (dbpedia-spotlight for spacy) to extract language-agnostic features.
Apr 8 2022, 2:24 PM · Research (FY2021-22-Research-April-June)
MGerlach added a comment to T305180: Evaluate copyedits from LanguageTool (Q4).

Update week 2022-04-04:

  • started to evaluate the performance of LanguageTool on benchmark datasets from grammatical error correction. We started with the most commonly used one for English (BEA 2019 Shared Task). We have already identified a similar benchmark dataset which covers a few other languages (Lang-8). We will have to adapt the standard evaluation metrics, as we are mostly interested in error detection and not so much in automatic error correction (the correction will be done by the editors).
  • started to brainstorm how we could obtain a ground-truth dataset of copyedit errors in Wikipedia articles. One idea is to start from sentences within high-quality articles (e.g. featured articles) as extreme cases with no errors, which would tell us how often error-free text is flagged.
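One possible way to adapt the metrics to detection only is sketched below. It is an illustration and not the evaluation actually used: it assumes errors are given as half-open character spans (start, end), and it counts a predicted span as correct if it overlaps any gold span, ignoring the suggested correction. The function name and the span format are hypothetical.

```python
def detection_scores(predicted, gold):
    """Span-level precision and recall for error detection.

    predicted, gold: lists of (start, end) half-open character spans.
    A span counts as matched if it overlaps at least one span on the other side.
    """
    def overlaps(a, b):
        return a[0] < b[1] and b[0] < a[1]

    matched_predicted = sum(1 for p in predicted if any(overlaps(p, g) for g in gold))
    matched_gold = sum(1 for g in gold if any(overlaps(p, g) for p in predicted))
    precision = matched_predicted / len(predicted) if predicted else 0.0
    recall = matched_gold / len(gold) if gold else 0.0
    return precision, recall


# Made-up spans: one of three predictions hits a gold error, one of two gold errors is found.
precision, recall = detection_scores(
    predicted=[(0, 5), (10, 15), (40, 45)],
    gold=[(3, 8), (20, 25)],
)
```

Overlap matching is more lenient than exact-span matching; which of the two is appropriate depends on how precisely editors need the error to be located.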
Apr 8 2022, 2:20 PM · Research (FY2021-22-Research-April-June)
MGerlach added a comment to T305258: Exploratory analysis of readers’ knowledge networks (Q4).

Update week 2022-04-04:

Apr 8 2022, 2:04 PM · Research (FY2021-22-Research-April-June)
MGerlach updated the task description for T305258: Exploratory analysis of readers’ knowledge networks (Q4).
Apr 8 2022, 2:01 PM · Research (FY2021-22-Research-April-June)
MGerlach created T305688: Make HTML Dumps available in hadoop.
Apr 8 2022, 12:39 PM · Data-Engineering
MGerlach added a comment to T302242: Outreachy Application Task (Round 24): Build Python library to work with html-dumps.

@Appledora

hi @MGerlach , just for the sake of clarifications, recording contributions and making the final application is not the same right? I know that contributions can be updated, something like a version controlling mechanism. But can we also edit our applications once we send them in? Thanks.

Apr 8 2022, 11:13 AM · Research (FY2021-22-Research-April-June), Outreach-Programs-Projects, Outreachy (Round 24)
MGerlach added a comment to T302242: Outreachy Application Task (Round 24): Build Python library to work with html-dumps.

Hi everyone,
welcome everyone who joined since the last posts. Great to see the ongoing discussions.

Apr 8 2022, 10:25 AM · Research (FY2021-22-Research-April-June), Outreach-Programs-Projects, Outreachy (Round 24)

Apr 1 2022

MGerlach moved T299092: Evaluation of link recommendations to orphan articles via link translation (Q3/Q4) from FY2021-22-Research-Jan-March to FY2021-22-Research-April-June on the Research board.
Apr 1 2022, 3:56 PM · Research (FY2021-22-Research-April-June)
MGerlach added a comment to T299092: Evaluation of link recommendations to orphan articles via link translation (Q3/Q4) .

Update week 2022-03-28:

  • evaluation of the link-recommendation is ongoing and will be completed during Q4
  • we have generated a ground-truth dataset of all links that were added to orphan articles in all Wikipedias comparing two consecutive snapshots
  • We are currently implementing the evaluation in terms of the recall@k:
    • given a true link (s_true,t) that was added to an orphan article as a target (t), our model generates a ranked list of the top-k links (s_1,t)...(s_k,t) to the orphan article as a target
    • averaging over all true links, we calculate how often the true link was among the top-k recommendations
  • our link-translation model will rank the recommended links according to the number of language editions in which the link already exists
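The recall@k evaluation and the ranking by language editions described above can be sketched as follows. This is a minimal illustration, not the project's code: the function names, the dictionary layout, and the toy data are assumptions made for the example.

```python
def rank_by_language_count(candidates):
    """Rank candidate source pages for one orphan target.

    candidates: dict mapping source title -> number of language editions
    in which the link (source, target) already exists.
    Ties are broken alphabetically so the ranking is deterministic.
    """
    return sorted(candidates, key=lambda source: (-candidates[source], source))


def recall_at_k(true_links, recommendations, k):
    """Share of true links (s_true, t) whose source is among the top-k for target t.

    true_links: list of (source, target) pairs that were actually added.
    recommendations: dict mapping target -> ranked list of recommended sources.
    """
    if not true_links:
        return 0.0
    hits = sum(
        1 for source, target in true_links
        if source in recommendations.get(target, [])[:k]
    )
    return hits / len(true_links)


# Toy data (made up): two orphan targets with candidate sources and language counts.
language_counts = {
    "T1": {"A": 5, "B": 2, "C": 9},
    "T2": {"D": 1, "E": 3},
}
recommendations = {t: rank_by_language_count(c) for t, c in language_counts.items()}
true_links = [("A", "T1"), ("D", "T2"), ("B", "T1")]

recall_1 = recall_at_k(true_links, recommendations, k=1)
recall_2 = recall_at_k(true_links, recommendations, k=2)
recall_3 = recall_at_k(true_links, recommendations, k=3)
```

In this toy case none of the true sources is ranked first, two of three appear in the top 2, and all appear in the top 3.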
Apr 1 2022, 3:55 PM · Research (FY2021-22-Research-April-June)
MGerlach renamed T299092: Evaluation of link recommendations to orphan articles via link translation (Q3/Q4) from Evaluation of link recommendations to orphan articles via link translation (Q3) to Evaluation of link recommendations to orphan articles via link translation (Q3/Q4) .
Apr 1 2022, 3:42 PM · Research (FY2021-22-Research-April-June)
MGerlach created T305258: Exploratory analysis of readers’ knowledge networks (Q4).
Apr 1 2022, 3:36 PM · Research (FY2021-22-Research-April-June)
MGerlach updated the task description for T293036: [EPIC] Understanding curious and critical readers.
Apr 1 2022, 3:32 PM · Research, Epic
MGerlach moved T299245: Develop a model to detect sentences that need copy-editing (Q3+) from FY2021-22-Research-Jan-March to FY2021-22-Research-April-June on the Research board.
Apr 1 2022, 2:56 PM · Research (FY2021-22-Research-April-June)
MGerlach moved T299091: Implement language-agnostic approach for readability (Q3, Q4) from FY2021-22-Research-Jan-March to FY2021-22-Research-April-June on the Research board.
Apr 1 2022, 2:55 PM · Research (FY2021-22-Research-April-June)
MGerlach updated the task description for T299091: Implement language-agnostic approach for readability (Q3, Q4).
Apr 1 2022, 2:54 PM · Research (FY2021-22-Research-April-June)
MGerlach renamed T299091: Implement language-agnostic approach for readability (Q3, Q4) from Implement language-agnostic approach for readability (Q3) to Implement language-agnostic approach for readability (Q3, Q4).
Apr 1 2022, 2:53 PM · Research (FY2021-22-Research-April-June)
MGerlach added a comment to T299091: Implement language-agnostic approach for readability (Q3, Q4).

Update week 2022-03-28:

  • figuring out last details for the formal collaboration
  • we have started to implement baseline readability metrics from textstat, specifically Flesch Reading Ease.
  • updated the title and task description to reflect that work is ongoing and will be carried over into Q4
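For reference, the Flesch Reading Ease score is 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words), where higher scores mean easier text. The sketch below computes it with the standard library only. It is not textstat's implementation: the sentence splitting and especially the vowel-group syllable counter are crude assumptions that only work roughly for English.

```python
import re


def count_syllables(word):
    """Very rough syllable estimate: number of vowel groups, at least one."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def flesch_reading_ease(text):
    """Flesch Reading Ease using naive sentence, word, and syllable counts."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    return (
        206.835
        - 1.015 * (len(words) / len(sentences))
        - 84.6 * (syllables / len(words))
    )


# Made-up example sentences, one simple and one complex.
easy = flesch_reading_ease("The cat sat on the mat. The dog ran.")
hard = flesch_reading_ease(
    "Institutional heterogeneity necessitates comprehensive multidimensional evaluation."
)
```

Because the formula depends on syllable counts and sentence boundaries, it does not transfer directly to other languages, which is the motivation for a language-agnostic approach.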
Apr 1 2022, 2:52 PM · Research (FY2021-22-Research-April-June)
MGerlach closed T299786: Curiosity: exploratory analysis and onboarding collaborators (Q3), a subtask of T293036: [EPIC] Understanding curious and critical readers, as Resolved.
Apr 1 2022, 10:43 AM · Research, Epic
MGerlach closed T299786: Curiosity: exploratory analysis and onboarding collaborators (Q3) as Resolved.

Weekly update

  • I wrote up a summary of my first exploratory analysis on how readers interact with an article's version-history and talk-page, respectively. This is a first step to better understanding how readers on Wikipedia critically engage with the information they find. The motivation is that many teaching guides on how to assess the quality and trustworthiness of information in the context of Wikipedia recommend checking the version-history and talk-pages to see how the information was created and what the perspectives of contributors are. This analysis provides some answers to how much readers actually use this and for what types of pages.
  • In short: There is substantial engagement with version-history and talk-pages by readers, in particular not only by editors but also by readers who do not edit. In some cases (specifically desktop) engagement is at around 1:1000 pageviews which is almost as high as engagement with citations (Piccardi et al.); results suggest that the position of the corresponding button has a big influence on the engagement. For example, on mobile the talk-page is always at the top in enwiki but in other wikis only visible to logged-in users: engagement with talk-pages in enwiki is even higher on mobile than on desktop whereas in other wikis it is almost non-existent. There is some indication that the existence of reliability issues in articles (via templates) correlates with engagement with version-history and talk-pages.
  • Link to the meta-page with the details: https://meta.wikimedia.org/wiki/Research:Understanding_Curious_and_Critical_Readers/Reader_interactions_with_talk-pages_and_version-history
Apr 1 2022, 10:43 AM · Research (FY2021-22-Research-Jan-March)

Mar 31 2022

MGerlach created T305180: Evaluate copyedits from LanguageTool (Q4).
Mar 31 2022, 4:50 PM · Research (FY2021-22-Research-April-June)

Mar 29 2022

MGerlach closed T293035: Research support for Copyediting as a structured tasks (Q2) as Resolved.

Based on the above observations of the API surfacing the copyedits from LanguageTool and @Miriam's feedback after their discussion with Growth, I will focus on trying to get a quantitative estimate of the precision of copyedits from LanguageTool when applied to Wikipedia articles. This will be captured in a separate task.

Mar 29 2022, 4:55 PM · Research (FY2021-22-Research-Jan-March)
MGerlach closed T293035: Research support for Copyediting as a structured tasks (Q2), a subtask of T293034: [EPIC] Research support for Copyediting as a structured tasks, as Resolved.
Mar 29 2022, 4:55 PM · Research, Epic
MGerlach updated the task description for T293035: Research support for Copyediting as a structured tasks (Q2).
Mar 29 2022, 4:51 PM · Research (FY2021-22-Research-Jan-March)
MGerlach updated the task description for T299092: Evaluation of link recommendations to orphan articles via link translation (Q3/Q4) .
Mar 29 2022, 4:32 PM · Research (FY2021-22-Research-April-June)
MGerlach added a comment to T302242: Outreachy Application Task (Round 24): Build Python library to work with html-dumps.

@Appledora

Hi @Isaac and @MGerlach , is there any reason why this (mwparserfromhell) method considers Category as Wikilinks too?


This is the expected behavior. Categories are also Wikipedia-pages (though in a different namespace). In the article, you will see the categories as blue links at the bottom.
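If category links should be treated separately from article links, one option is to split the extracted links by their namespace prefix. The sketch below is a simplified illustration that uses a regular expression instead of mwparserfromhell; the function name, the prefix list, and the sample wikitext are assumptions, and localized category prefixes in other wikis would need to be added.

```python
import re

# Matches [[target]] and [[target|label]]; nested or malformed links are ignored.
WIKILINK_RE = re.compile(r"\[\[([^\[\]|]+)(?:\|[^\[\]]*)?\]\]")


def split_wikilinks(wikitext, category_prefixes=("category",)):
    """Return (article_links, category_links) found in a wikitext string."""
    article_links, category_links = [], []
    for match in WIKILINK_RE.finditer(wikitext):
        # A leading colon, as in [[:Category:Foo]], links to the category page.
        title = match.group(1).strip().lstrip(":")
        prefix, separator, _ = title.partition(":")
        if separator and prefix.strip().lower() in category_prefixes:
            category_links.append(title)
        else:
            article_links.append(title)
    return article_links, category_links


# Made-up sample wikitext.
sample = "[[Bisceglie]] is in [[Apulia|the Apulia region]]. [[Category:Dioceses in Italy]]"
articles, categories = split_wikilinks(sample)
```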

Mar 29 2022, 3:42 PM · Research (FY2021-22-Research-April-June), Outreach-Programs-Projects, Outreachy (Round 24)
MGerlach updated the task description for T304542: Deploy "add a link" to third round of wikis.
Mar 29 2022, 11:43 AM · User-notice, Growth-Team (Current Sprint), Add-Link
MGerlach added a comment to T304542: Deploy "add a link" to third round of wikis.

Training models for the wikis listed has been completed successfully.

We have also worked on models verification using the backtesting results shown below:
CCing @MGerlach, in case he'd like to add comments on the backtesting evaluation.

Mar 29 2022, 10:18 AM · User-notice, Growth-Team (Current Sprint), Add-Link

Mar 28 2022

MGerlach added a comment to T302242: Outreachy Application Task (Round 24): Build Python library to work with html-dumps.

@Appledora

q1. Is there any imminent deadline on this?

The end of the application period (I think April 22). If you would like to get feedback beforehand and incorporate this into your notebook, please send us your notebook 1-2 weeks before the deadline (or earlier).

q2. Will there be more supplementary microtasks?

No.

q3. Do we have to complete all these TODOs ?

Yes. (We are not expecting perfection -- just do your best and explain what you're doing and why!)

q4. Is the process iterative? i.e: develop upon feedback etc.

Yes. We will try to give feedback once for each application (If you feel you have completed your notebook, you may request feedback and we will provide high-level feedback on what is good and what is missing. To do so, send an email to your mentor with the link to your public PAWS notebook. We will try to make time to give this feedback once to anyone who would like it.) But feel free to ask questions/comments in the task and help each other out as you progress on your notebook.

Mar 28 2022, 10:41 AM · Research (FY2021-22-Research-April-June), Outreach-Programs-Projects, Outreachy (Round 24)
MGerlach added a comment to T302242: Outreachy Application Task (Round 24): Build Python library to work with html-dumps.

@ShivaniSangwan

Hi @Isaac and @MGerlach, I was going through the sample code provided and had a small doubt regarding sections extracted via

mwparserfromhell.parse().get_sections()

Is the content present inside any section has to be just the textual information present inside it, or do different tags and different table classes need to be structured individually? I went through the documentation https://www.mediawiki.org/wiki/Specs/HTML/2.4.0#Headings_and_Sections but could not find an explanation for the same.

I am not sure I understand the question. I will try to add some comments, but let me know if it is still unclear.

Mar 28 2022, 10:27 AM · Research (FY2021-22-Research-April-June), Outreach-Programs-Projects, Outreachy (Round 24)
MGerlach added a comment to T302242: Outreachy Application Task (Round 24): Build Python library to work with html-dumps.

@SakshiKaushik

Hi @Isaac and @MGerlach , I am able to do all the tasks that is mentioned in the notebook.

Great.

I had some doubts that I want to ask before I send it for your review because you have time for feedback at most once. Here are my questions

  1. If we compare the number of tags, it is significantly higher in HTML parser than wikitext parser. I am wondering if wikitext does some pre-filtering of commonly occurring tags?

No, there is no pre-filtering in wikitext. However, the content of wikitext and HTML can differ due to, e.g., transclusion of templates: wikitext only contains the template-tag without the template's content whereas HTML will contain the template's content. For more information, see Mitrevski et al. (https://arxiv.org/pdf/2001.10256.pdf) who discuss this at the example of links.

  2. Continuing on the above question, the parsed text also had some extra text like 'This article is Orphaned...' which of course, does not appear in the main text of the wikipedia page. Is there a mechanism to automatically add these extra words every-time someone writes an article? I believe we need to remove such text from ALL the HTML parsing that we want to do. I couldn't clean this up properly because I am not sure what is the standard practice.

There is no standard practice since typically we have not been working with HTML-dumps. If you think this type of text should be treated different than other types of text, simply try to explain why and how you would approach this problem. No need for the perfect solution at this point but try your best at describing your thinking around this.

Mar 28 2022, 10:14 AM · Research (FY2021-22-Research-April-June), Outreach-Programs-Projects, Outreachy (Round 24)

Mar 27 2022

MGerlach added a comment to T302237: Outreachy Project (Round 24): Build Python library to work with html-dumps.

Hey @Isaac and @MGerlach!

Hope you are doing well.

This is Radhika Saini, an outreachy applicant. I have good experience with Python, Jupyter notebooks, HTML and Really looking forward to being able to make meaningful contributions to this project. Can you please guide me on where we can begin or what to look into first?

Mar 27 2022, 11:30 AM · Research (FY2021-22-Research-April-June), Outreach-Programs-Projects, Outreachy (Round 24)

Mar 26 2022

MGerlach added a comment to T302242: Outreachy Application Task (Round 24): Build Python library to work with html-dumps.

Hi @Isaac and @MGerlach,
Hope you are doing well.
While forking the notebook when I have added ?format=raw at the end of the URL, It is not giving me the option to download in the .ipynb format. Will it be okay if I firstly download it in pdf format and then convert it into .ipynb via an online converter or if there is any other way to download it please let me know.
Thank you in advance!!

Mar 26 2022, 10:30 AM · Research (FY2021-22-Research-April-June), Outreach-Programs-Projects, Outreachy (Round 24)
MGerlach added a comment to T302237: Outreachy Project (Round 24): Build Python library to work with html-dumps.

Hi all. If you would like to contribute to this project during the application period, please see the application task T302242.
Don't hesitate to ask questions there. I will try to answer open questions but feel free to help each other out too.

Mar 26 2022, 10:18 AM · Research (FY2021-22-Research-April-June), Outreach-Programs-Projects, Outreachy (Round 24)

Mar 25 2022

MGerlach changed the visibility for T302242: Outreachy Application Task (Round 24): Build Python library to work with html-dumps.
Mar 25 2022, 5:42 PM · Research (FY2021-22-Research-April-June), Outreach-Programs-Projects, Outreachy (Round 24)
MGerlach added a comment to T299786: Curiosity: exploratory analysis and onboarding collaborators (Q3).
  • We solved access for collaborators to stat-machines, including use of jupyter notebooks.
  • We discussed the analysis steps to generate and analyze knowledge networks from an existing set of reading sessions, which we will begin next week. We agreed that the following information will likely be relevant to capture in the generation of the networks: i) existing hyperlinks, ii) transition by the reader, iii) similarity between all pairs of nodes based on text (weighted, undirected). We agreed on a set of network metrics (such as clustering) to measure those networks. We will compare metrics with several null models based on random walks to have a reference point that takes into account variation in, e.g., size of the networks.
  • I started to extend analysis on reader interactions with talk-pages and version-history beyond enwiki to other languages. The motivation is that placement of the corresponding buttons differs across languages (e.g. the mobile-version talk-page button is at the top in enwiki but at the bottom of the article in eswiki) and it will be interesting to see how this affects click-through rates. The challenge was to get the language-specific namespaces and aliases (e.g. mobile version-history calls in enwiki use “Special:History/<page-title>” vs. eswiki “Especial:History”; see code to automatically extract these for all languages) and filter the webrequest-logs in each language for version-history and talk-page calls from readers. I managed to extract the data for all languages and am planning next week to analyze a few languages to compare corresponding click-through rates.
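As an illustration of the clustering metric mentioned for the knowledge networks, the sketch below computes the average local clustering coefficient of an undirected, unweighted graph in plain Python. It is a simplified stand-in and not the analysis code: the function names and the toy edge list are assumptions, and the weighted and directed variants discussed above are not covered.

```python
from itertools import combinations


def build_adjacency(edges):
    """Undirected adjacency sets from (u, v) pairs; self-loops are ignored."""
    adjacency = {}
    for u, v in edges:
        if u == v:
            continue
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    return adjacency


def local_clustering(adjacency, node):
    """Fraction of pairs of neighbours of `node` that are themselves connected."""
    neighbours = adjacency[node]
    degree = len(neighbours)
    if degree < 2:
        return 0.0
    connected_pairs = sum(1 for a, b in combinations(neighbours, 2) if b in adjacency[a])
    return 2.0 * connected_pairs / (degree * (degree - 1))


def average_clustering(adjacency):
    """Mean of the local clustering coefficients over all nodes."""
    if not adjacency:
        return 0.0
    return sum(local_clustering(adjacency, n) for n in adjacency) / len(adjacency)


# Toy reading session (made up): a triangle A-B-C plus a dangling page D.
adjacency = build_adjacency([("A", "B"), ("B", "C"), ("A", "C"), ("C", "D")])
clustering = average_clustering(adjacency)
```

Comparing such a value against the same metric on networks generated by a random-walk null model gives the reference point described above.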
Mar 25 2022, 5:13 PM · Research (FY2021-22-Research-Jan-March)
MGerlach updated the task description for T299786: Curiosity: exploratory analysis and onboarding collaborators (Q3).
Mar 25 2022, 5:07 PM · Research (FY2021-22-Research-Jan-March)

Mar 24 2022

MGerlach added a comment to T284550: Integrate LanguageTools suggestion as structured task.

@kostajh at this point it is also an anecdotal observation from my side. For example, copying the lead section of enwiki-article Roman Catholic Diocese of Bisceglie (a random article) into the LanguageTool-interface yields 7 errors in only 3 sentences -- all of which are false positives (bold are the errors from LanguageTool):

The Diocese of Bisceglie (Latin: Dioecesis Vigiliensis) was a Roman Catholic diocese located in the town of Bisceglie on the Adriatic Sea in the province of Barletta-Andria-Trani, Apulia in southern Italy. It is five miles south of Trani. In 1818, it was united with the Archdiocese of Trani to form the Archdiocese of Trani-Bisceglie.[1][2]

Not all of the errors are links; one, for example, is the italicized Latin spelling. Some false positives are recurring, e.g. Trani also appears as plain text and is marked as an error. Thus, it is not trivial to exclude them and some effort is needed to filter out the false positives.

Mar 24 2022, 9:54 AM · Growth-Team, Growth-Structured-Tasks

Mar 23 2022

MGerlach added a comment to T304542: Deploy "add a link" to third round of wikis.

In the deployment of the second round (T284481), we had a step between training and deployment to verify the performance of the trained model in the backtesting-evaluation. Specifically, in T284481#7163025 I reported the precision of the model in each language at the default threshold (0.5). Checking that this value is not too low (we typically aimed for a precision of around 0.75 or higher) helps us catch cases where the model fails completely and should not be deployed. When training the model in a new language this can happen due to issues in correctly parsing the text, as happened in the case of Bengali (bnwiki).
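The verification step can be sketched as a simple gate on the backtesting precision. This is an illustrative sketch only: the input format (a list of model scores paired with whether the link was correct), the function names, and the example numbers are assumptions, while the threshold of 0.5 and the target of 0.75 are the values mentioned above.

```python
def precision_at_threshold(scored_predictions, threshold=0.5):
    """Precision among predictions whose score is at or above the threshold.

    scored_predictions: list of (score, is_correct) pairs from backtesting.
    Returns None if no prediction reaches the threshold.
    """
    accepted = [is_correct for score, is_correct in scored_predictions if score >= threshold]
    if not accepted:
        return None
    return sum(accepted) / len(accepted)


def ok_to_deploy(precision, min_precision=0.75):
    """True if the backtesting precision is available and high enough."""
    return precision is not None and precision >= min_precision


# Made-up backtesting results for one wiki.
backtest = [(0.9, True), (0.8, True), (0.7, False), (0.6, True), (0.4, False), (0.2, True)]
precision = precision_at_threshold(backtest)
deployable = ok_to_deploy(precision)
```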

Mar 23 2022, 5:18 PM · User-notice, Growth-Team (Current Sprint), Add-Link

Mar 18 2022

MGerlach added a comment to T284550: Integrate LanguageTools suggestion as structured task.

In order to directly query LanguageTool's suggestions for Wikipedia-articles, I have created an experimental API on toolforge. One needs to specify the language of the Wikipedia (e.g. “en”) and the page-title. Some example calls for different wikis/articles:

Mar 18 2022, 3:38 PM · Growth-Team, Growth-Structured-Tasks
MGerlach updated subscribers of T293035: Research support for Copyediting as a structured tasks (Q2).
  • solved the issues from last week and deployed an experimental API for detecting copyedits in Wikipedia articles using languagetool
  • Call to API by providing language (e.g. “en”) and page-title. Some example calls for different wikis/articles:
  • The API applies aggressive filtering to reduce the number of false positives. Using the HTML-version of the article makes it much easier to: i) identify only plain text (avoiding tables, infoboxes, or transcluded content from templates); ii) remove errors that overlap with annotated text such as links, bold, italics, etc., which often yield wrong errors.
  • Code (and documentation): https://gitlab.wikimedia.org/repos/research/copyedit-api
  • Solution was to set up an endpoint on cloud-vps running our own instance of LanguageTool (thanks @Isaac )
  • Started with evaluation of the precision and recall of LanguageTool on a ground-truth dataset in English (non-Wikipedia); the plan is to perform similar evaluation on Wikipedia articles.
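The second filtering step above (dropping errors that overlap with annotated text) can be sketched as an interval-overlap check. This is a hedged illustration and not the code in the copyedit-api repository: the `start`/`end` keys, the half-open span convention, and the sample offsets are assumptions.

```python
def spans_overlap(a_start, a_end, b_start, b_end):
    """True if the half-open spans [a_start, a_end) and [b_start, b_end) intersect."""
    return a_start < b_end and b_start < a_end


def filter_errors(errors, annotated_spans):
    """Keep only errors that do not touch any annotated span (links, bold, italics, ...).

    errors: list of dicts with hypothetical 'start' and 'end' character offsets.
    annotated_spans: list of (start, end) offsets of annotated text.
    """
    return [
        error for error in errors
        if not any(
            spans_overlap(error["start"], error["end"], start, end)
            for start, end in annotated_spans
        )
    ]


# Made-up offsets: the first two errors touch annotated text, the third does not.
errors = [{"start": 4, "end": 11}, {"start": 30, "end": 35}, {"start": 50, "end": 55}]
annotated_spans = [(0, 12), (33, 40)]
kept = filter_errors(errors, annotated_spans)
```

Dropping any error that touches annotated text trades recall for precision, which matches the aggressive filtering described above.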
Mar 18 2022, 3:08 PM · Research (FY2021-22-Research-Jan-March)
MGerlach added a comment to T299091: Implement language-agnostic approach for readability (Q3, Q4).

Update week 2022-03-14:

  • this week we discussed the possibility of doing this work as part of a formal collaboration. I am in contact with Miriam to sort out the details. Overall, this direction seems promising as there is strong alignment in the short (3-6 months) and medium (6+ months)-term research plan. I am confident that we can make a final decision next week (with work starting asap).
Mar 18 2022, 3:00 PM · Research (FY2021-22-Research-April-June)