User Details
- User Since
- Sep 9 2019, 9:50 AM (185 w, 4 d)
- Availability
- Available
- IRC Nick
- mgerlach
- LDAP User
- MGerlach
- MediaWiki User
- MGerlach (WMF) [ Global Accounts ]
Today
weekly updates:
- slowly moving towards finalizing main figures of the manuscript
- working on robustness checks for main findings
- moving towards starting the actual writing
weekly updates:
- scaling the survey to a larger sample of pairs from which we show a small subsample to each participant (see the sketch after this list)
- this requires setting up some scripts that can automatically generate the corresponding logic for Limesurvey so that we don't have to add each pair manually, which is error-prone
- testing the Limesurvey instance to make sure we don't exceed size limitations for questionnaires when sampling many pairs (we successfully added 10k pairs to the survey and uploaded it to Limesurvey)
- the scaling of the survey caused some issues in how items are displayed to the participants; ideally we would like to display each item on a separate page to avoid distraction. this needs some debugging but should be solvable.
- starting to look into survey privacy statement
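To make the per-participant sampling step concrete, here is a minimal sketch; it assumes the full set of pairs lives in a CSV file and that each participant should see a fixed-size random subsample. The file name, subsample size, and seed are placeholders, and the actual LimeSurvey logic generation is a separate step.

```
import csv
import random

def sample_pairs_for_participant(pairs, n_per_participant=20, seed=None):
    """Draw a reproducible random subsample of snippet pairs for one participant."""
    rng = random.Random(seed)
    return rng.sample(pairs, n_per_participant)

# hypothetical input file: one (snippet_a, snippet_b) pair per row
with open("pairs.csv", newline="") as f:
    pairs = [tuple(row) for row in csv.reader(f)]

# e.g. give each participant their own seed so their subsample can be regenerated
subsample = sample_pairs_for_participant(pairs, n_per_participant=20, seed=42)
```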
weekly update:
- set up a pipeline to run the language-agnostic model to get readability scores for all articles in a dump
- however, I am putting the language-agnostic model on hold for now. With Mykola, we have finished the evaluation of an alternative language-dependent model (based on mBERT), adapting the methodology from the existing revert-risk model. the advantages of this model are that i) it significantly outperforms the language-agnostic model in all but one language, ii) it supports many more languages than the language-agnostic model (which currently depends on dbpedia-spotlight, supporting only around 20 languages), iii) it is a single multilingual model (in contrast, the language-agnostic model requires a separate dbpedia-spotlight model for entity linking in each language).
- given the unexpectedly strong performance of the multilingual model, I am planning to replace the language-agnostic model
just added a link in the task description pointing to the previous research that identified selection, extent, and visibility as 3 of the most relevant aspects for metrics for knowledge gaps. https://meta.wikimedia.org/wiki/Research:Developing_Metrics_for_Content_Gaps_(Knowledge_Gaps_Taxonomy)#Outcomes
In my opinion, one of the main issues for the migration from spark2 to spark3 will be the following (there might be other issues though):
Currently, for the spark jobs the pipeline activates the anaconda-wmf environment (code). anaconda-wmf only supports spark2, so we have to switch to conda-analytics. One of the problems I think we will encounter is that we need some packages such as mwparserfromhell for the spark jobs (see code). With the anaconda-wmf environment these are already installed and thus available on the spark workers. In contrast, the conda-analytics environment is a minimal environment and doesn't contain that package by default, so it will not be available on the spark workers. Therefore, we need to build a custom environment in which we install those dependencies (most importantly mwparserfromhell). We then have to ship that environment to the spark workers when creating the spark session (for example here). For this we can use the wmfdata package, which has an easy interface for starting a spark session and shipping the environment to the workers (code) by setting ship_python_env=True.
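For reference, a sketch of what that could look like once the custom conda environment (with mwparserfromhell installed) is activated. The exact wmfdata helper name and its defaults may differ between wmfdata versions, so treat this as an assumption rather than the final call; only the ship_python_env flag is taken from above.

```
import wmfdata

# Start a Spark session and ship the currently active conda environment
# (the custom one that has mwparserfromhell installed) to the workers.
# Note: the helper name and arguments are assumptions; check the installed
# wmfdata version for the exact interface.
spark = wmfdata.spark.create_session(
    app_name="my-spark-job",   # hypothetical job name
    ship_python_env=True,      # package the active env and distribute it to the workers
)

# mwparserfromhell should then be importable inside UDFs running on the workers.
```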
Wed, Mar 29
Tue, Mar 28
Fri, Mar 24
weekly update:
- re-training and evaluating the model using local instances of dbpedia-spotlight
- as a next step we can run the model on all articles in the corresponding wikis using the local instance
Fri, Mar 17
weekly update:
- refactoring code and re-training the model using the local instance of dbpedia-spotlight
weekly update:
- obtained first results for evaluating the model to score sentences for copyediting in multiple languages
- considering 7 languages (arwiki, bnwiki, cswiki, enwiki, eswiki, frwiki, viwiki), we obtain an accuracy of 70-80% across languages in distinguishing ground-truth sentences from those wikis, obtained from the removal of copyedit-templates
- as a next step: apply model's scores to larger dataset of sentences from Wikipedia and manually check results
weekly update:
- added attention screening item to the survey
- implemented survey in limesurvey which samples a small random subset of all pairs for participants to compare
Mon, Mar 6
Fri, Mar 3
weekly update:
- figured out pre-screening (link) and sociodemographic data of participants (link) via prolific
- figured out attention screening via prolific (link)
- figured out additional survey items (language proficiency, topic interest)
- still working on implementing the questions in limesurvey but we have some promising new ideas for how to do that in the next 1-2 weeks
Feb 24 2023
weekly update:
- generated a dataset of sentence pairs (before/after) from edit-diffs where the copyedit-template was removed, for all wikis which have the copyedit-template. after some filtering, there are 30 different wikis with at least 1000 pairs of aligned sentences (before/after the removal of the copyedit template)
weekly update:
- finalized processing/filtering pipeline to generate clean list of snippets of equal length from articles
- working on implementing questions in limesurvey
Feb 17 2023
weekly update:
- using a standard pre-trained language model, we can automatically distinguish sentence-pairs (the same sentence before and after an edit tagged as copyedit via the removal of the copyedit template) with moderate precision of ~70-80%. this suggests we might use this model to predict whether a specific sentence requires copyediting.
- so far we only checked this for sentences from English Wikipedia. as a next step I will extract similar sentence pairs (before/after an edit where the copyedit-template was removed) from other wikis. the template exists in 83 different wikis (Q6292692)
weekly update:
- completed regression analysis for quantifying effect of de-orphanization on number of pageviews for all wikis.
- results hold for most wikis and are statistically significant: overall, we see large effect sizes of 50% (or more) increases in the number of pageviews following treatment.
- I consider the work done as part of this task, but will keep this task open until results are documented on the respective meta-page
weekly update:
- solved access to technical infrastructure tools (prolific, limesurvey)
- converging on a pipeline to extract snippets from articles that are from a single topic and are of approximately equal length
Feb 14 2023
Feb 13 2023
In short: I was able to resolve the issue by upgrading wikitextparser to version 0.51.1 (I previously used 0.45.1).
@apaskulin Thank you. I resolved all remaining comments and moved to "ready for publication".
Feb 10 2023
weekly update:
- working on finalizing the formal collaboration to grant collaborators access to the WMF limesurvey instance for designing the survey questions in practice
weekly update:
- figured out how to run a local instance of dbpedia-spotlight on the stat machines
- next step: build a pipeline to run the full model to get readability scores of all articles of a dump
weekly update:
- filtering the dataset of edited sentences from copyedit-template-removal. many sentences were not changed due to grammatical/copyedit errors but rather for stylistic reasons. Focusing only on a small subset of sentences that are clearly related to grammatical errors seems to make it possible to distinguish whether a sentence needs editing using pre-trained language models.
Feb 9 2023
Feb 7 2023
answering question 3 (how the above distributions change when considering readers who visited articles beyond the main namespace).
Feb 6 2023
answering questions 1 and 2:
- we separate all readers into two subgroups i) people who visit a single Wikipedia page and ii) people who visit multiple Wikipedia pages on a single day.
- for each group separately, we calculate what percentage:
- Did not visit Wikipedia before (non-recurring)
- Visited Wikipedia within the last month (7<d<=31)
- Visited Wikipedia within the last week (1<d<=7)
- Visited Wikipedia yesterday (d=1)
Feb 3 2023
weekly update:
- talked to folks from data engineering and got good suggestions on how to run the model on the stat-machines
- next: will try to implement those suggestions and/or get additional support
weekly update:
- Running regression models to quantify impact in a statistically more rigorous way (this also makes it easier to include wiki-specific analysis more elegantly)
weekly update:
- finished 2 figures. continuing work on figures 3-7.
weekly update:
- generated larger dataset of sentence pairs by looking at all edits from removal of copyedit template (not only those marked as minor). this yields 176k pairs of sentences from 34k different articles.
weekly update:
- defined first approach for selecting snippets for comparison. for the pilot we will select only articles from a single topic in order to reduce the effect of familiarity. we will extract snippets from the articles by selecting 5 sentences.
Hey SRE/Analytics/Legal -- we have a new contractor onboard: @AKhatun_WMF . She needs access to HDFS and the stat machines for a new research project. Don't hesitate to let me know if you need more information. I believe she has an NDA signed as part of the contracting process but she's not a formal collaborator so I'm not 100% certain.
Feb 1 2023
@apaskulin I revised the blogpost addressing all of your comments -- they helped a lot to improve the writing. I kept all changes in suggestion mode so you could easily identify the differences to the previous version. I also added two suggestions for illustrations. Let me know if you have any other suggestions.
Jan 27 2023
weekly update
- generated a notebook on how to use the Bradley-Terry model to infer scores from pairwise comparisons with synthetic data, as well as how to ensure there are enough pairwise comparisons such that the algorithm converges (see the sketch below)
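Not the notebook itself, but a self-contained illustration of the idea: simulate pairwise "which text is easier" judgments from hypothetical true scores and recover the ranking with the standard Bradley-Terry minorization-maximization updates. Item counts and the number of comparisons are made up.

```
import numpy as np

rng = np.random.default_rng(0)

# synthetic "true" readability scores for 5 hypothetical text snippets
true_scores = np.array([1.0, 0.8, 0.5, 0.3, 0.1])
n_items = len(true_scores)

# simulate pairwise comparisons: item i "wins" if it is judged easier than j
wins = np.zeros((n_items, n_items))
for _ in range(2000):
    i, j = rng.choice(n_items, size=2, replace=False)
    if rng.random() < true_scores[i] / (true_scores[i] + true_scores[j]):
        wins[i, j] += 1
    else:
        wins[j, i] += 1

# Bradley-Terry scores via minorization-maximization (needs a connected comparison graph)
n_comparisons = wins + wins.T
scores = np.ones(n_items)
for _ in range(100):
    total_wins = wins.sum(axis=1)
    denom = (n_comparisons / (scores[:, None] + scores[None, :] + 1e-12)).sum(axis=1)
    scores = total_wins / denom
    scores /= scores.sum()  # scores are only defined up to a constant factor

print(np.argsort(-scores))  # recovered ranking should match the true ordering
```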
weekly update:
- started first exploratory analysis of a 1% sample of reading sessions from a single day (~438k sessions)
- percentage of reading sessions with single or multiple pageviews:
- percentage of reading sessions with last-access: 1 day, 1 week (but more than 1 day), 1 month (but more than 1 week), non-returning:
- percentage of reading sessions only accessing main namespace or also accessing pages in any other namespace:
- next step: look at combination of different facets
weekly update:
- obtained first results for impact of treated articles (de-orphanized) vs control articles (same article in another language that remained orphan)
- on average: there is a 40% increase in the number of pageviews for treated articles; we don't see an increase for the control articles
- next: wiki-specific analysis via regression
Jan 19 2023
weekly update:
- reorganizing datasets and working on figures
weekly update:
- decided on survey design based on pairwise comparison of articles (see Benoit et al)
- decided on the planned infrastructure setup: recruiting via prolific and hosting via limesurvey.
- testing analysis of data from pairwise comparison via bradley-terry-model using crowd-kit package
- next: testing survey setup via limesurvey
weekly update
- the current bottleneck in the pipeline is the call to the public API of dbpedia-spotlight. this is not scalable to scoring all articles of a dump.
- instead, tested running a local instance of dbpedia-spotlight. this yields up to a 100x speedup, suggesting it is feasible to use this approach to score readability for all articles of a dump (see the sketch below)
- next: pre-processing one dump and setting up a pipeline with the local instance of dbpedia-spotlight
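For illustration, a minimal call against a local Spotlight instance, assuming it exposes the standard /rest/annotate endpoint; the port and confidence value here are placeholders and depend on how the instance was started.

```
import requests

SPOTLIGHT_URL = "http://localhost:2222/rest/annotate"  # port is an assumption

def annotate(text, confidence=0.5):
    """Return entity annotations for a text from a local DBpedia Spotlight instance."""
    resp = requests.post(
        SPOTLIGHT_URL,
        data={"text": text, "confidence": confidence},
        headers={"Accept": "application/json"},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json().get("Resources", [])

entities = annotate("Berlin is the capital of Germany.")
print([e["@URI"] for e in entities])
```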
Jan 17 2023
@kostajh I agree that we should re-run the pipelines after some time. If possible, updating after 6 months seems reasonable (though I don't have any quantitative insight into how quickly the model gets outdated). But the almost 1.5 years since October 2021 seem too long.
Jan 12 2023
weekly update:
- revised scope/framing of the paper
- reached consensus and are now working on figures for results section and framing for introduction
weekly update:
- generated a new ground-truth dataset of edits to articles where the copyedit-template was removed. the rationale is that the removal of the template indicates that the edit improved the article with respect to copyediting.
- looking at all such events in the revision history of all articles in English Wikipedia. only keeping edits for articles where: i) the template was removed only once in the revision history of the article (to avoid cases where the template is added/removed many times); ii) the edit was marked as a minor edit (to avoid edits which contain major addition/removal of content).
- I then align sentences from the old to the new revision by matching all possible pairs of sentences via their (minimum) Levenshtein distance (see the sketch after this list)
- this yields 13k pairs of sentences across 5k articles where each sentence was changed supposedly as part of copyediting due to removal of the copyedit template. One-off dataset available here.
- in principle, the pipeline can be adapted easily to other Wikipedias which use this or similar templates.
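A simplified sketch of the alignment step, assuming the two revisions have already been split into sentences. It uses a plain dynamic-programming Levenshtein distance and, for each old sentence, picks the closest changed sentence in the new revision; the relative-distance cutoff is an illustrative parameter, not necessarily the one used in the pipeline.

```
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def align_sentences(old_sentences, new_sentences, max_relative_distance=0.5):
    """For each old sentence, find the most similar (but not identical) new sentence."""
    pairs = []
    for old in old_sentences:
        candidates = [(levenshtein(old, new), new) for new in new_sentences if new != old]
        if not candidates:
            continue
        dist, best = min(candidates)
        # keep only pairs close enough to plausibly be the "same" sentence after copyediting
        if dist / max(len(old), len(best)) <= max_relative_distance:
            pairs.append((old, best))
    return pairs
```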
Jan 11 2023
Jan 6 2023
weekly update:
- conducted a first analysis on using a BERT language model to classify sentences as grammatically correct/incorrect (see the sketch after this list)
- for benchmark corpora and synthetic corpora (not Wikipedia) we obtain high accuracy, showing the general applicability of this approach to score/rank copyedits. however, for the dataset of sentences from the newcomer-copyedit-task, the model cannot distinguish between sentences before and after the edit (i.e. from these examples we cannot detect systematic differences that would help us distinguish supposedly correct from incorrect sentences). since the general approach works with benchmark/synthetic corpora, the limiting factor seems to be the underlying dataset of labeled Wikipedia sentences used to fine-tune the model.
- Therefore, as a next step, we try to obtain an alternative dataset of labeled sentences from Wikipedia that are grammatically correct/incorrect using copyedit-templates. We will adapt the approach from extracting positive and negative examples of articles with reliability issues -- instead of looking for templates indicating reliability issues (such as pov), we will look for articles with copyedit-issues (e.g. copy_edit).
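Not the exact setup used here, but a minimal sketch of how such a sentence classifier looks with Hugging Face transformers; the model name, label convention, and the fact that a fine-tuned checkpoint would be loaded are assumptions.

```
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# placeholder: in practice a checkpoint fine-tuned on labeled Wikipedia sentences would go here
MODEL_NAME = "bert-base-multilingual-cased"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)

def needs_copyedit_probability(sentence):
    """Return the model's probability that a sentence needs copyediting (label 1, by assumption)."""
    inputs = tokenizer(sentence, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    return torch.softmax(logits, dim=-1)[0, 1].item()

print(needs_copyedit_probability("This sentence are having some grammar problem."))
```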
Weekly update:
- generated first dataset of readers of English Wikipedia on a single day capturing for each actor: i) the number of pageloads to articles in the main namespace; ii) the number of days since the last access; iii) how many other namespaces (beyond main namespace) were visited.
- will start exploratory analysis of the data in the next week(s)
Jan 5 2023
Dec 22 2022
weekly update:
- extracted treatment/control pairs for two complementary studies on the effect of de-orphanization on visibility
- 1) treatment: article a deorphanized in month t in wiki w; control: article a remains orphan in wiki w' != w
- 2) treatment: article a orphanized in month t in wiki w; control: article a'!=a remains orphan in wiki w
- next: extracting time series and calculating difference-in-differences to estimate the effect of treatment (see the sketch below)
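To make the last step concrete, a hedged sketch of a difference-in-differences estimate with statsmodels; it assumes the time series have been collapsed into one row per article and period, and the column names and toy values are purely illustrative.

```
import pandas as pd
import statsmodels.formula.api as smf

# hypothetical long-format data: one row per article per period, with
# `treated` (de-orphanized vs. control) and `post` (after the treatment month)
df = pd.DataFrame({
    "log_pageviews": [1.2, 1.3, 1.1, 2.0, 0.9, 1.0, 1.0, 1.1],
    "treated":       [1,   1,   1,   1,   0,   0,   0,   0],
    "post":          [0,   0,   1,   1,   0,   0,   1,   1],
})

# the coefficient on treated:post is the difference-in-differences estimate
model = smf.ols("log_pageviews ~ treated * post", data=df).fit()
print(model.summary())
```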
weekly update:
- considering this task done as we reached a general understanding of the work that needs to be done around evaluating language-agnostic models for readability by conducting surveys to ask participants to rate the readability of articles.
- started a separate project-page on meta-wiki: https://meta.wikimedia.org/wiki/Research:Understanding_perception_of_readability_in_Wikipedia
- work on the project will be captured under this task T325815
Dec 21 2022
Just want to add that this error is not unique to fywiki but will likely appear when working with other wikis as well. It has been reported for zhwiki (see the issues mentioned above) and I was able to reproduce that error.
Dec 15 2022
weekly update:
- setting up more robust experiment in which we define a control group of articles that were not de-orphanized.
weekly update:
- sketching out details on how to measure perception of readability via surveys. most promising approach is the one described by Benoit et al. Measuring and Explaining Political Sophistication through Textual Complexity
- naively, we would just show individual articles to participants and ask them to rate how easy it is to understand the text (readability) on a pre-defined scale (say from 0 to 5 or 0 to 100).
- however, it can be difficult for humans to rate a single trait (such as readability) on a pre-defined absolute scale; in addition, there are some concerns about the validity of the result when measuring a trait directly with a single question
- therefore, the authors propose a simpler setup in which raters are only shown pairs of two texts at a time and asked to rate which of the two presented texts is easier to read/understand. by collecting many of such pairwise ratings, we can calculate a readability score for each individual text using the Bradley-Terry model (this is similar to the Elo-rating in chess, where a skill score of an individual player is calculated from only comparing pairs of players). the resulting readability score is not a direct measurement but a latent measurement of "readability".
- one of the main advantages of this setup is that it can be applied to measure perception of readability in all languages, i.e. it does not rely on the availability of labeled data where we show participants two versions of the same text.
- in principle, this approach should be straightforward to apply to the case of articles in Wikipedia. I am currently sketching out the details of how to adapt it. There are some open questions such as how to sample the set of articles which will be shown to raters (e.g. how many articles, should we sample them randomly or should they match some representativeness in terms of topics, which part of the article to show, how long should the text-snippet be, etc)
weekly update:
- current tasks have been solved
- there are no further improvements planned at the moment for manual evaluation
- therefore closing the task, feel free to re-open in the future if the work will be picked up again
Dec 9 2022
weekly update:
- completed a first analysis on the number of clicks to newly de-orphanized articles from clickstream data (see the sketch after this list).
- found a statistically significant increase in the number of clicks in the month after de-orphanization: the average increases from 0 to 32; 17% of the de-orphanizing links appeared with more than 10 clicks per month, which is much higher than for the average link (Dimitrov et al. write "on Wikipedia only around 4% of all existing links are clicked by visitors more frequently than 10 times within a month").
- we will likely refine this analysis to include a more robust control group to compare the increase in visibility
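A sketch of the aggregation over a monthly clickstream dump, assuming the standard public clickstream format (tab-separated columns prev, curr, type, n); the file name and article titles are placeholders.

```
import pandas as pd

# public clickstream dumps are TSV files with columns: prev, curr, type, n
clickstream = pd.read_csv(
    "clickstream-enwiki-2022-11.tsv.gz",  # file name is illustrative
    sep="\t",
    names=["prev", "curr", "type", "n"],
)

deorphanized = {"Some_Article", "Another_Article"}  # placeholder titles

# total clicks arriving at each de-orphanized article via internal links
clicks = (
    clickstream[clickstream["curr"].isin(deorphanized) & (clickstream["type"] == "link")]
    .groupby("curr")["n"]
    .sum()
)
print(clicks.sort_values(ascending=False))
```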
weekly update:
- synthesizing literature on measuring perception of readability. identified two main approaches: i) rating which of a pair of two texts is more simple/difficult (e.g. Benoit et al. 2019); ii) items from a "readability" scale (e.g. Graefe et al. 2016). will sketch in detail how we would adapt these two approaches to our case. we will likely start with option 1 as it is more well-defined.
- contacted GDI about options to run surveys with WMF's LimeSurvey instance
Dec 8 2022
Dec 2 2022
weekly update:
- collecting literature on how to measure readability in surveys
- getting an overview of the different options of doing surveys (quicksurvey, limesurvey, prolific)
weekly update:
- analysing variation of knowledge network across the 14 different wikis. investigating the effect of the underlying wiki (size etc)
Nov 25 2022
weekly update:
- first meeting with Indira and others. agreed on the general research plan to conduct surveys on the readability of wikipedia articles and whether our automatic scores from the language-agnostic model capture that across different languages
- we will be spending the next weeks to refine the details of the research plan specifically focusing on how we should set up the survey around readability
- we will start with revisiting the literature around constructs of readability and how we can measure them via surveys
weekly update:
- went through feedback from ambassadors about sources of errors. the main issue seemed to be text that was directly quoted (often text in a foreign language or from hundreds of years ago), thus throwing many false positives
- added improvements to the pre- and post-processing of the errors (both for LanguageTool and from custom list of misspellings)
- ignoring text paragraphs that indicate blockquotes (example) or multicolumn tables often capturing parallel multilingual texts of poems or lyrics (example) (commit)
- better handling of text that appears in explicit quotes, especially single quotes and disambiguating them from apostrophes (example) (commit)
weekly update:
- retrieving the time series of clicks to de-orphanized articles from clickstream before and after they were de-orphanized. Resolving technical issues around pages that were renamed or became redirects (since clickstream only contains page titles) to track the number of clicks from month to month.
weekly update:
- implemented model to predict scores on curiosity (specifically the deprivation sensitivity facet) from metrics of knowledge networks (code)
Nov 10 2022
weekly update:
- created slightly adapted model which maps the readability score from the language-agnostic model to a grade level (roughly the number of years of education needed to understand a text) since that is more interpretable. with this we can map the distribution over (a random subset of) articles of a wiki.
weekly update:
- created datasets for knowledge networks for 14 wikis (the same we considered in the why we read wikipedia studies)
weekly updates:
- Kirsten shared results from 2nd round of manual evaluation (spreadsheet)
- refinements for LanguageTool substantially improve the accuracy. my guess is that filtering [[ https://community.languagetool.org/rule/list?lang=en | certain types of rules ]] is the main driver of this improvement.
- surfacing copyedits from a list of common misspellings seems very effective in terms of surfacing high-precision copyedits. the main challenge will then be to curate such lists such that there are enough copyedits and that there is a sufficient diversity of copyedits (and not just one particular misspelling again and again)
weekly update:
- discussed with Djellel; we refined the plan for the development of the model in terms of features, training, and test data
- planning to start implementing in the next week(s)
weekly update:
- reached consensus around doing a survey on readability with readers
- will be setting up a meeting with Indira and their team to discuss details and next steps.
Nov 3 2022
weekly update:
- Calculated readability scores for a random subset of 1000 articles in 17 wikis (cawiki, dawiki, dewiki, enwiki/simplewiki, eswiki, fiwiki, frwiki, huwiki, itwiki, nlwiki, nowiki, ptwiki, rowiki, ruwiki, svwiki, trwiki)
- Ideally, we would like to get readability scores for every article in a project. However, when calculating the language-agnostic scores, we need to call the DBPedia-spotlight API to get the language-agnostic features. This step is time-consuming. Therefore, in a first iteration, we only get a representative sample
- Similarly, we focus on 17 wikis since those are the languages which are supported by the DBPedia-spotlight API
- We see noticeable differences in the distribution of readability scores (low=easier to read/high=harder to read); specifically, simplewiki has much lower scores than enwiki
weekly update:
- refined the analysis setup. the aim is to show that articles that are de-orphanized by editors show an increase in visibility
- dataset: all articles that were de-orphanized in a given month
- metric: we operationalize visibility of the target article (the orphan that gets de-orphanized) in three different ways: i) number of pageviews to the source articles, i.e. all articles linking to the target article; ii) number of clicks from the source articles to the target article; iii) number of previews from the source articles to the target article.
- in each case, the visibility before de-orphanization is X-=0 and after de-orphanization is X+>=0. therefore, we aim for the following two analyses: i) is there an increase in visibility, i.e. is X+ statistically significantly different from 0; ii) how big is the increase in visibility, i.e. comparing the increase from X- to X+ to the average increase in visibility of other articles that received new inlinks.
- next step: implement analysis
Weekly update
- adding analysis around how knowledge networks differ with the topic of the articles
- running analysis to quantify variation of knowledge networks with country and language-version
Nov 2 2022
Oct 28 2022
weekly update:
- organized my thoughts around next steps on analysis on readability. there are three main aspects to pursue next:
- improve the model: i) extend the number of supported languages, ii) explore the effect of the topic or the quality on the readability score
- understand perception of readability of Wikipedia articles through surveys
- measure the readability of all articles in a project. this will allow us to compare the expected number of years of education required to understand the articles (e.g. the Flesch-Kincaid reading level roughly maps to grade level; see the sketch after this list) with the actual number of years of education of Wikipedia's readers (e.g. from the recent reader-demographics survey)
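For reference, the Flesch-Kincaid grade level mentioned above is a simple function of words per sentence and syllables per word; a minimal version with a crude vowel-group syllable counter (an approximation, English-only):

```
import re

def count_syllables(word):
    """Rough English syllable estimate: count groups of consecutive vowels."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_kincaid_grade(text):
    """Flesch-Kincaid grade level: 0.39*(words/sentences) + 11.8*(syllables/words) - 15.59."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * len(words) / len(sentences) + 11.8 * syllables / len(words) - 15.59

print(flesch_kincaid_grade("The cat sat on the mat. It was happy."))
```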
Weekly update:
- created a more balanced sample of copyedits from lists of common misspellings ([[ copyedits_v2_common-misspellings-balanced | spreadsheet ]])
- some misspellings occur very often while many/most occur rarely, which leads to an overrepresentation of a few misspellings
- I parse the whole dump of the respective Wikipedias (ar, bn, cs, es, en) and collect all occurrences of each misspelling
- I then keep only at most 5 occurrences of each misspelling (see the sketch after this list)
- more generally, this suggests we would need to figure out ways to extend the list of common misspellings in order to surface a more diverse set of misspellings
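A minimal sketch of the balancing step, assuming the extracted occurrences sit in a table with one row per hit; the file and column names are illustrative.

```
import pandas as pd

# hypothetical extraction output: one row per occurrence of a misspelling in an article
occurrences = pd.read_csv("misspelling_occurrences.csv")  # columns: wiki, page_title, misspelling, snippet

# keep at most 5 occurrences per misspelling (per wiki) to avoid over-representation
balanced = occurrences.groupby(["wiki", "misspelling"]).head(5)
balanced.to_csv("misspelling_occurrences_balanced.csv", index=False)
```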
Oct 20 2022
Weekly update
- generated additional dataset to analyze how topics affect the knowledge networks
- generated additional dataset to analyze how knowledge networks might differ across countries; planning to similarly check how knowledge networks might differ across different languages (so far we only checked enwiki)
Weekly update
- compiled a new sample of copyedit errors in this spreadsheet (v2) https://docs.google.com/spreadsheets/d/1ponuT-jwEM4KF9XCG1Q86mkDYEGm8-F66PaG1U0XQ3M/edit#gid=0
- following the same approach as before, I used the 20220801 snapshot of the HTML-dumps. I then selected the first 100 articles for which there is at least one error left after the filtering. We only consider articles that have not been edited in at least 1 year. For each article I picked only one error randomly such that we have 100 errors from 100 different articles.
- I improved some of the details of the pre-processing of the text and the post-processing of the errors.
- pre-processing: we now keep the paragraph-id from the HTML where the error was found. This will make it easier to re-localize the error downstream in the application
- post-processing: improved the filtering of errors. i) errors where the only correction is a hyphen (the previous corrections seemed too strict and debatable); ii) errors which are marked by sic; iii) errors which appear in quotes; iv) errors which relate to proper nouns (filter words where the first letter is capitalized)
- List of common misspellings: I looked for misspelled words in ar, bn, cs, es, en using the list of misspelled words compiled by the ambassadors (thank you!). These lists each contain 20 misspelled words. Roughly 1 out of 1000 articles has at least one of these spelling mistakes (an indication we might need to extend this list of common spelling mistakes in some cases)
- ar: 100 articles with errors after checking 94,994 articles
- bn: 100 articles with errors after checking 54,189 articles
- cs: 100 articles with errors after checking 124,892 articles
- es: 56 articles with errors after checking all 1,732,127 articles in the dump
- en: 100 articles with errors after checking 1,277,730 articles (for this I actually used a much longer [[ https://en.wikipedia.org/wiki/Wikipedia:Lists_of_common_misspellings/For_machines#The_Machine-Readable_List | list of common spelling mistakes ]] since the original list did not yield errors in any article)
- LanguageTool: I looked for errors from LanguageTool in ar, es, en (unfortunately not supported in bn and cs). I filtered a set of error-categories/types/rules based on the most common false positives from the evaluation in the previous round (v1). Note that, in principle, these rules can be added/removed to the filter separately for each wiki
- PUNCTUATION (category): feedback mentioned many false positives related to adding commas
- STYLE (category): feedback mentioned many false positives where the error isn't clearly wrong or right
- REDUNDANCY (category): feedback mentioned many false positives where the error isn't clearly wrong or right
- HUNSPELL_RULE_AR (rule): rule related to typos in Arabic which caused many false positives
- MORFOLOGIK_RULE_ES (rule): rule related to typos in Spanish which caused many false positives
- UPPERCASE_SENTENCE_START (rule): rule requiring capital letter at sentence start; this yielded many false positives due to abbreviations causing wrong identification of the beginning of a sentence
Oct 13 2022
Weekly update:
- spent some time refactoring the code to run LanguageTool on Wikipedia articles to make it more customizable based on the ambassadors' feedback
- I adapted the changes in the exploratory-API (commit in gitlab) for visibility
- For example, when calling https://copyedit.toolforge.org/api/v1/lt?lang=en&title=Hermann%20Scheer LanguageTool yields an error about a missing comma; this error appears very often, but the first round of feedback suggested that we might not want to surface those errors. We can avoid surfacing this error by filtering errors from the punctuation category: https://copyedit.toolforge.org/api/v1/lt?lang=en&title=Hermann%20Scheer&filter_category=PUNCTUATION
- We can filter any of LanguageTool's error-categories, -types, or -rules (more details in the readme as well as LanguageTool's inventory of all rules). This option allows for easy customization of which of LanguageTool's rules should be used. Most importantly, this can be configured differently in each language.
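The same two calls from Python, using the URLs given above with only the filter parameter varying; this assumes the endpoint returns JSON.

```
import requests

BASE = "https://copyedit.toolforge.org/api/v1/lt"

# all LanguageTool errors for the article
all_errors = requests.get(BASE, params={"lang": "en", "title": "Hermann Scheer"}).json()

# same call, but suppressing everything in LanguageTool's PUNCTUATION category
filtered = requests.get(
    BASE,
    params={"lang": "en", "title": "Hermann Scheer", "filter_category": "PUNCTUATION"},
).json()
```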