MGerlach
User

User Details

User Since
Sep 9 2019, 9:50 AM (70 w, 6 d)
Availability
Available
LDAP User
MGerlach
MediaWiki User
MGerlach (WMF) [ Global Accounts ]

Recent Activity

Fri, Dec 18

MGerlach added a comment to T266768: Develop 3 approaches for building article lists (ad-hoc topic models) and provide experimental API.

update week 2020-12-18:

  • all models integrated into common interface: https://list-building.toolforge.org/
  • example queries show interesting and diverse results.
  • this concludes the work of this task
  • follow-up task will be to systematically evaluate different lists
Fri, Dec 18, 8:13 PM · Research (FY2020-21-Research-October-December)
MGerlach updated the task description for T266768: Develop 3 approaches for building article lists (ad-hoc topic models) and provide experimental API.
Fri, Dec 18, 8:11 PM · Research (FY2020-21-Research-October-December)

Dec 18 2020

MGerlach closed T266770: Research Showcase December 2020 as Resolved.

ran showcase.

Dec 18 2020, 7:55 PM · Research (FY2020-21-Research-October-December)
MGerlach updated the task description for T266770: Research Showcase December 2020.
Dec 18 2020, 7:54 PM · Research (FY2020-21-Research-October-December)

Dec 14 2020

MGerlach updated the task description for T266770: Research Showcase December 2020.
Dec 14 2020, 7:07 PM · Research (FY2020-21-Research-October-December)

Dec 11 2020

MGerlach added a comment to T266780: Start discussing and processing 2 formal collaborations for knowledge gaps metrics definition.

update week 2020-12-11:

Dec 11 2020, 8:47 PM · Research (FY2020-21-Research-October-December)
MGerlach updated the task description for T266768: Develop 3 approaches for building article lists (ad-hoc topic models) and provide experimental API.
Dec 11 2020, 8:44 PM · Research (FY2020-21-Research-October-December)
MGerlach added a comment to T266768: Develop 3 approaches for building article lists (ad-hoc topic models) and provide experimental API.

update week 2020-12-11:

  • unblocked on additional instances for cloud-vps
  • added outlinks model
Dec 11 2020, 8:44 PM · Research (FY2020-21-Research-October-December)
MGerlach added a comment to T265745: Develop automatic evaluation protocol with existing links to avoid manual evaluation and evaluate link recommendation on 4 or more languages.

update week 2020-12-07:

  • did a more thorough analysis of the model in terms of the importance of individual features
    • the navigation-based feature (distance, in an embedding obtained from reading sessions, between articles that should be linked) has very low feature importance; a model without this feature has virtually the same precision and recall on the backtesting data (results in spreadsheet) across all languages (simple, de, pt, ar, bn, cs, vi)
    • based on these insights we exclude the navigation-based feature from the model (patch in gerrit); this will make it easier to share model and data publicly
  • investigating performance of the models in different languages: bn-wiki seems to be an outlier compared to the other languages, in the sense that a precision of at least 70-80% would come with a recall of only around 10%. one possible implication is that we might be able to make good recommendations for only a few articles.
    • I will do some exploratory analysis next week to see if we can identify possible issues with the model in this particular language. discussing with marshall, this serves as a good test case of whether and how we are able to identify the underlying problem once we detect that the model is not performing well.
Dec 11 2020, 8:39 PM · Research (FY2020-21-Research-October-December)

Dec 10 2020

MGerlach updated the task description for T266770: Research Showcase December 2020.
Dec 10 2020, 12:46 PM · Research (FY2020-21-Research-October-December)

Dec 4 2020

MGerlach added a comment to T266768: Develop 3 approaches for building article lists (ad-hoc topic models) and provide experimental API.

update week 2020-12-04:
Added some smaller changes and fixes to the list-building:

  • users can vary length of list https://list-building.toolforge.org/?lang=en&qid=Q42&k=3
  • avoiding false results if the seed-article is not contained in the model (e.g. the reader-based model only contains items which received a minimum number of pageviews, thus not all items are included)
  • still blocked on hosting additional models on cloud-vps T266180; started planning to build downscaled models on toolforge as intermediate solution
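The query pattern in the example URL above can be assembled programmatically. This is a minimal sketch that assumes only the three parameters visible in that URL (lang, qid, k). The helper name build_list_query is hypothetical. The response format is not described here, so the sketch stops at building the URL.

```python
from urllib.parse import urlencode

BASE_URL = "https://list-building.toolforge.org/"  # taken from the update above


def build_list_query(lang: str, qid: str, k: int) -> str:
    """Return the query URL for a seed item (hypothetical helper).

    lang, qid and k mirror the parameters in the example URL; anything
    beyond "k varies the length of the list" is an assumption.
    """
    if k < 1:
        raise ValueError("k must be a positive integer")
    return BASE_URL + "?" + urlencode({"lang": lang, "qid": qid, "k": k})


print(build_list_query("en", "Q42", 3))
```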
Dec 4 2020, 10:06 PM · Research (FY2020-21-Research-October-December)
MGerlach added a comment to T266770: Research Showcase December 2020.

weekly update:

  • 2 confirmed speakers for december showcase and asking for title and abstract by next week
Dec 4 2020, 10:00 PM · Research (FY2020-21-Research-October-December)
MGerlach updated the task description for T266770: Research Showcase December 2020.
Dec 4 2020, 9:55 PM · Research (FY2020-21-Research-October-December)
MGerlach closed T265710: Research Showcase November 2020 as Resolved.
Dec 4 2020, 9:55 PM · Research (FY2020-21-Research-October-December)
MGerlach added a comment to T266780: Start discussing and processing 2 formal collaborations for knowledge gaps metrics definition.

update week 2020-12-04:

Dec 4 2020, 9:54 PM · Research (FY2020-21-Research-October-December)
MGerlach updated the task description for T266780: Start discussing and processing 2 formal collaborations for knowledge gaps metrics definition.
Dec 4 2020, 9:49 PM · Research (FY2020-21-Research-October-December)
MGerlach added a comment to T265745: Develop automatic evaluation protocol with existing links to avoid manual evaluation and evaluate link recommendation on 4 or more languages.

update week 2020-12-04:
Spent some time on smaller improvements to the algorithm that were on the list but not high-priority right now

  • fixing an artifact from an earlier version of building the train/test sets for training and automatic evaluation: extracted linked sentences were not split randomly between the two sets, leading to an imbalance (articles earlier in the dump, with smaller IDs, tended to end up in the training set); ensuring random shuffling increased the precision from 0.7 to 0.79 with virtually the same recall.
  • started to do some more systematic investigation of false positive in the backtesting evaluation to identify clear cases where recommendation fails (here). this revealed an error in generating candidate-anchors for n-grams involving non-alphabetic characters (common in, e.g., city names which are linked, example: "Latrobe, Pennsylvania"). while not linking that item is not problematic, it leads to artifacts where sub-ngrams will then be incorrectly recommended (in the case above a link to "Pennsylvania"). solve
  • discussion about how the feature-tables should be moved to production. one idea was to publish the data and the model, since that makes them easier to copy and would allow users to generate their own recommendations. one unsolved issue is the table-feature derived from reading sessions (equivalent to the navigation-vectors). a pragmatic solution that emerged was to not use that data in the prediction (and hence not publish it); a first test indicates that excluding this feature has only a small detrimental effect on precision and recall. this motivates a more systematic evaluation of the efficacy of each feature for the recommendation. so far I have concentrated on making the model work, ensuring a reasonable performance across a set of languages. however, it seems useful to check which features actually contribute substantially and which could be dropped; this would go hand in hand with some ideas discussed with Djellel on which other features could be added to improve the performance.
  • planning to add a more fine-grained evaluation analysis on the level of topics (instead of over all articles) since i) one might hypothesize that candidate links for some articles are easier than for others, and ii) recommendations for newcomers are generated for articles in a specific topic such that it would be good to ensure that there are no outliers.
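The train/test fix described in the first point of this update amounts to shuffling before splitting. The following is only a sketch of that idea, not the code in the repository: the function name, the test_fraction default and the fixed seed are all assumptions made for illustration.

```python
import random


def split_train_test(sentences, test_fraction=0.5, seed=0):
    """Shuffle, then split, so dump order (article ID) cannot bias either set."""
    shuffled = list(sentences)             # copy; the caller's data is untouched
    random.Random(seed).shuffle(shuffled)  # seeded for reproducible splits
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]  # (train, test)


train, test = split_train_test(range(100), test_fraction=0.2, seed=1)
print(len(train), len(test))
```

Without the shuffle, a split by position would put the first articles of the dump into one set, which is the imbalance the update describes.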
Dec 4 2020, 9:42 PM · Research (FY2020-21-Research-October-December)
MGerlach updated the task description for T265745: Develop automatic evaluation protocol with existing links to avoid manual evaluation and evaluate link recommendation on 4 or more languages.
Dec 4 2020, 9:17 PM · Research (FY2020-21-Research-October-December)

Nov 27 2020

MGerlach added a comment to T265610: Add Link engineering: Convert mwaddlink to read/write to MySQL instead of SQLite.

I think it probably makes sense to continue writing to SQLite files, then reading from and copying into MySQL tables.

Agree.

added data from SQLite files and model into MySQL tables (staging database) for 7 languages (ar, cs, de, ko, pt, simple, vi).

Nov 27 2020, 10:42 PM · Patch-For-Review, Growth-Team (Current Sprint), Add-Link, Growth-Structured-Tasks
MGerlach created T268901: Add Link engineering: Remove linkmodel.json from tmp-folder after retraining/updating link recommendation model.
Nov 27 2020, 10:33 PM · Growth-Team, Add-Link, Growth-Structured-Tasks
MGerlach added a comment to T265745: Develop automatic evaluation protocol with existing links to avoid manual evaluation and evaluate link recommendation on 4 or more languages.

update week 2020-11-23:

  • gave presentation in tuesday team-meeting: https://docs.google.com/presentation/d/1AGlNI6slw1ShCasT9OEbBik_sIsMwErFVyhJO1FE4K4/edit#slide=id.g6237f1b673_0_590
    • discussed different possibilities to improve algorithm
    • discussed experiences around productionizing research-models
  • managed to move data to mysql-databases on stats
    • this required some testing and profiling, e.g. in terms of the number of queries (potentially too many when generating recommendation for large articles; thus started some exploration on possible strategies to decrease), encoding issues with mysql (this is non-trivial but seems to be rare in most cases for the 7 languages tested)
    • integrated writing output of trained model (data+actual model) into training pipeline
Nov 27 2020, 10:13 PM · Research (FY2020-21-Research-October-December)

Nov 25 2020

MGerlach added a comment to T265610: Add Link engineering: Convert mwaddlink to read/write to MySQL instead of SQLite.

We also need to look into some edge cases where insertions fail due to what look like character encoding issues.

When we copy, I can make a quick check and count how often this happens for a few languages. If the number is not too large, this should not affect the performance too much except that we might miss a few recommendations.
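A quick check of this kind could look like the sketch below. It rests on several assumptions: the table name and the two column names (lookup, value) are hypothetical, the target is any DB-API connection, and in the demo a second in-memory SQLite database stands in for MySQL (whose drivers typically use %s rather than ? as placeholder).

```python
import sqlite3


def copy_rows(source, target, table, placeholder="%s", errors=(Exception,)):
    """Copy rows from a SQLite table into a target database.

    Returns (copied, failed) so that failed insertions can be counted.
    `table` must be a trusted name: it is interpolated into the SQL text.
    """
    insert_sql = (
        f"INSERT INTO {table} (lookup, value) "
        f"VALUES ({placeholder}, {placeholder})"
    )
    copied = failed = 0
    cursor = target.cursor()
    for row in source.execute(f"SELECT lookup, value FROM {table}"):
        try:
            cursor.execute(insert_sql, row)
            copied += 1
        except errors:
            failed += 1  # e.g. an encoding or constraint error; skip this row
    target.commit()
    return copied, failed


# Demo: the duplicate key provokes one failed insertion in the target.
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE anchors (lookup TEXT, value TEXT)")
source.executemany(
    "INSERT INTO anchors VALUES (?, ?)",
    [("berlin", "Berlin"), ("paris", "Paris"), ("berlin", "Berlin (band)")],
)
target = sqlite3.connect(":memory:")
target.execute("CREATE TABLE anchors (lookup TEXT PRIMARY KEY, value TEXT)")

copied, failed = copy_rows(source, target, "anchors", placeholder="?")
print(f"copied={copied} failed={failed}")
```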

Nov 25 2020, 11:12 PM · Patch-For-Review, Growth-Team (Current Sprint), Add-Link, Growth-Structured-Tasks

Nov 24 2020

MGerlach added a comment to T266826: Add Link engineering: Pipeline for moving MySQL database(s) from stats1008 to production MySQL server.

Before we proceed with this, @MGerlach is going to confirm that the datasets are OK to make public.

One table contains features generated according to the algorithm from the navigation vectors using reading-sessions as input (sensitive data); however, output was published previously. We are currently reviewing whether we can publish this data (and if yes in what form). An alternative would be to remove this feature from the model. Hopefully, I will have updates on this next week (slow week due to US-holidays).

Nov 24 2020, 8:00 PM · Patch-For-Review, Growth-Team (Current Sprint), Add-Link, Growth-Structured-Tasks

Nov 23 2020

MGerlach added a comment to T265610: Add Link engineering: Convert mwaddlink to read/write to MySQL instead of SQLite.

I think it probably makes sense to continue writing to SQLite files, then reading from and copying into MySQL tables.

Agree.

We need to integrate this step into the run-pipeline.sh script, though.

I can add this to the script.

We also need to look into some edge cases where insertions fail due to what look like character encoding issues.

When we copy, I can make a quick check and count how often this happens for a few languages. If the number is not too large, this should not affect the performance too much except that we might miss a few recommendations.

Nov 23 2020, 11:41 PM · Patch-For-Review, Growth-Team (Current Sprint), Add-Link, Growth-Structured-Tasks

Nov 21 2020

MGerlach added a comment to T265745: Develop automatic evaluation protocol with existing links to avoid manual evaluation and evaluate link recommendation on 4 or more languages.

update week 2020-11-16:

  • this week met with Djellel to discuss possible further developments of algorithm to improve performance
    • generating backtesting-data from only high-quality languages to improve reliability of ground truth (challenge: quality score not easily available in all languages)
    • adding additional features to link prediction model
    • how to include user-feedback on accepted/rejected links back into model-training
    • we are exploring these options (with lower priority from my side) independently of the model that is currently being moved into production. the latter seems to have at least acceptable accuracy (see results) and we want to avoid breaking the current pipeline
  • still working on moving the output of the model to mysql-databases; this is not so much a research problem as making sure the model can be moved to production in terms of the footprint of computational resources without breaking/re-writing the whole training pipeline. for example, in order to limit the number of queries to the mysql-databases, adding the possibility to set a maximum number of recommendations per article
  • attended growth-teams deep dive meeting to discuss in detail issues in implementation of the model
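The cap on recommendations per article mentioned above limits queries only if lookups stop once the cap is reached. A minimal sketch of that idea follows, assuming one table query per candidate anchor; the function name, the lookup callable and the parameter name are hypothetical and not taken from the codebase.

```python
from itertools import islice


def recommend(candidate_anchors, lookup, max_recommendations=None):
    """Return at most max_recommendations (anchor, target) pairs.

    Candidates are looked up lazily, so no further queries are issued
    once the cap is reached. None means "no cap".
    """
    def generate():
        for anchor in candidate_anchors:
            target = lookup(anchor)  # stands in for one database query
            if target is not None:
                yield anchor, target

    return list(islice(generate(), max_recommendations))
```

With a cap of 2 and candidates a, b, c, d where b has no target, the lookup runs for a, b and c only; d is never queried.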
Nov 21 2020, 12:21 AM · Research (FY2020-21-Research-October-December)

Nov 20 2020

MGerlach added a comment to T266768: Develop 3 approaches for building article lists (ad-hoc topic models) and provide experimental API.

update week 2020-11-16:

Nov 20 2020, 11:58 PM · Research (FY2020-21-Research-October-December)
MGerlach updated the task description for T266768: Develop 3 approaches for building article lists (ad-hoc topic models) and provide experimental API.
Nov 20 2020, 11:53 PM · Research (FY2020-21-Research-October-December)
MGerlach added a comment to T266780: Start discussing and processing 2 formal collaborations for knowledge gaps metrics definition.

update week 2020-11-16:

  • started discussing with Marc around possibility to collaborate on gaps related to content diversity
Nov 20 2020, 11:52 PM · Research (FY2020-21-Research-October-December)
MGerlach updated the task description for T266780: Start discussing and processing 2 formal collaborations for knowledge gaps metrics definition.
Nov 20 2020, 11:51 PM · Research (FY2020-21-Research-October-December)
MGerlach added a comment to T265710: Research Showcase November 2020.

update week 2020-11-16:

  • successfully ran showcase
Nov 20 2020, 11:48 PM · Research (FY2020-21-Research-October-December)
MGerlach updated the task description for T265710: Research Showcase November 2020.
Nov 20 2020, 11:47 PM · Research (FY2020-21-Research-October-December)
MGerlach updated the task description for T265710: Research Showcase November 2020.
Nov 20 2020, 11:47 PM · Research (FY2020-21-Research-October-December)

Nov 6 2020

MGerlach added a comment to T266768: Develop 3 approaches for building article lists (ad-hoc topic models) and provide experimental API.

Update week 2020-11-03:

  • no updates
Nov 6 2020, 6:43 PM · Research (FY2020-21-Research-October-December)
MGerlach added a comment to T266780: Start discussing and processing 2 formal collaborations for knowledge gaps metrics definition.

Update week 2020-11-03:

  • we have still been discussing how to approach the work on metrics. some consensus around metrics based on surveys and content-diversity gaps.
  • reached out to Marc-miquel to discuss potential collaboration to work on metrics related to diversity of content.
Nov 6 2020, 6:36 PM · Research (FY2020-21-Research-October-December)
MGerlach added a comment to T265745: Develop automatic evaluation protocol with existing links to avoid manual evaluation and evaluate link recommendation on 4 or more languages.

Update week 2020-11-03:

  • this week a lot of work went into working with product to move the model further towards production. I started to submit some patches to gerrit to introduce some changes to the codebase there; at the same time, I spent a substantial amount of time reviewing code, most of it related to switching from sqlite to mysql, i.e. making sure that the model doesn't break in the transition.
  • particular points:
    • we figured out a working solution to convert data-tables needed for querying the model to mysql
    • also moved and fixed the training-pipeline; this was broken after the move of the codebase to gerrit
    • added a patch to get context-window for link-recommendation; this was crucial because links are generated using wikitext but on the front-end have to be inserted in visual-editor; context-window was requested in order to avoid potential ambiguity when placing the link in visual editor
    • adding an option for the maximum number of link recommendations to query in order to reduce the number of calls to mysql-tables
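The context-window idea from the list above can be illustrated as follows. This is a sketch under assumptions: the window is measured in characters and returned as a (before, after) pair, which may differ from what the actual patch does; the function name and the width parameter are hypothetical.

```python
def context_window(text, start, end, width=10):
    """Return up to `width` characters before and after text[start:end]."""
    before = text[max(0, start - width):start]
    after = text[end:end + width]
    return before, after


text = "Douglas Adams was born in Cambridge, England."
start = text.index("Cambridge")
before, after = context_window(text, start, start + len("Cambridge"), width=5)
print(repr(before), repr(after))
```

The surrounding characters let a front-end pick the intended occurrence when the same anchor text appears more than once in a paragraph.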
Nov 6 2020, 6:31 PM · Research (FY2020-21-Research-October-December)

Oct 30 2020

MGerlach added a comment to T265745: Develop automatic evaluation protocol with existing links to avoid manual evaluation and evaluate link recommendation on 4 or more languages.

Update week 2020-10-26:

  • familiarizing myself with code-review in gerrit where repo is now hosted
  • ongoing discussions and coordination to support growth in productionizing the model:
    • adapting the output of the model to ensure conversion between wikitext (model) and visual editor (front-end)
    • converting data generated as part of training needed for model-querying from sqlite to mysql-tables (requirement for productionizing) T265610
Oct 30 2020, 6:01 PM · Research (FY2020-21-Research-October-December)
MGerlach added a comment to T266768: Develop 3 approaches for building article lists (ad-hoc topic models) and provide experimental API.

Update week 2020-10-26:

  • coordinated with Isaac and Diego this week to scope the work:
    • each of us will develop one separate approach based on existing models
    • defined model input and output in order to make it easier to build single API as wrapper to compare different models
Oct 30 2020, 5:54 PM · Research (FY2020-21-Research-October-December)
MGerlach added a comment to T266780: Start discussing and processing 2 formal collaborations for knowledge gaps metrics definition.

Update week 2020-10-26:

  • blocked on ongoing discussions on how to approach the metrics for the taxonomy (how far we want to go for each metric, how many metrics, which metrics)
Oct 30 2020, 5:51 PM · Research (FY2020-21-Research-October-December)
MGerlach updated the task description for T265710: Research Showcase November 2020.
Oct 30 2020, 12:12 PM · Research (FY2020-21-Research-October-December)

Oct 29 2020

MGerlach added a parent task for T266768: Develop 3 approaches for building article lists (ad-hoc topic models) and provide experimental API: T258804: Language-Agnostic Topic Modeling.
Oct 29 2020, 12:23 PM · Research (FY2020-21-Research-October-December)
MGerlach added a subtask for T258804: Language-Agnostic Topic Modeling: T266768: Develop 3 approaches for building article lists (ad-hoc topic models) and provide experimental API.
Oct 29 2020, 12:23 PM · Research, Epic
MGerlach added a subtask for T242172: Taxonomy of Knowledge Gaps: T266780: Start discussing and processing 2 formal collaborations for knowledge gaps metrics definition.
Oct 29 2020, 12:22 PM · Research, Epic
MGerlach added a parent task for T266780: Start discussing and processing 2 formal collaborations for knowledge gaps metrics definition: T242172: Taxonomy of Knowledge Gaps.
Oct 29 2020, 12:22 PM · Research (FY2020-21-Research-October-December)
MGerlach created T266780: Start discussing and processing 2 formal collaborations for knowledge gaps metrics definition.
Oct 29 2020, 12:20 PM · Research (FY2020-21-Research-October-December)
MGerlach created T266770: Research Showcase December 2020.
Oct 29 2020, 10:46 AM · Research (FY2020-21-Research-October-December)
MGerlach created T266768: Develop 3 approaches for building article lists (ad-hoc topic models) and provide experimental API.
Oct 29 2020, 10:40 AM · Research (FY2020-21-Research-October-December)

Oct 26 2020

MGerlach added a comment to T266446: Add Link engineering: Provide a mechanism for storing data about which link recommendations were rejected by the user.

@MGerlach do you have any thoughts on this? Specifically, on your capacity to implement a block list in research/mwaddlink in this quarter? I was thinking that when we call the link recommendation service, in addition to the wikitext, we would also provide a list of known link recommendations that we want excluded from the return value.

Oct 26 2020, 3:50 PM · DBA, Patch-For-Review, Growth-Team (Current Sprint), Add-Link, Growth-Structured-Tasks

Oct 23 2020

MGerlach updated the task description for T265745: Develop automatic evaluation protocol with existing links to avoid manual evaluation and evaluate link recommendation on 4 or more languages.
Oct 23 2020, 4:59 PM · Research (FY2020-21-Research-October-December)
MGerlach claimed T265745: Develop automatic evaluation protocol with existing links to avoid manual evaluation and evaluate link recommendation on 4 or more languages.

Update week: 2020-10-19:

  • ran backtesting on 7 wikis (simple, de, pt, ar, cs, ko, vi), results on meta
  • wrote a high-level summary of the model and put on meta
  • planned work
    • backtesting allows us to investigate false positives to identify issues with the model and to understand differences across the languages (even though at the moment, the model yields satisfactory results in all languages)
    • discussed the model/results in tuesday and thursday meeting in which there was some useful feedback on how to potentially improve the model and potential corner cases; for example: i) in order to avoid recommending existing links we parse the wikitext, but this misses some links from templates (this can be a substantial fraction of links), ii) use only high-quality articles to construct the gold-standard data for training and backtesting, iii) how to deal with red links.
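Point i) above can be made concrete with a small sketch. It uses a regular expression as a stand-in for a real wikitext parser (an assumption made for brevity), and the function name is hypothetical. It collects the targets of explicit [[...]] links so they can be excluded from recommendations. Links that only appear after template expansion are not visible to it, which is the gap described above.

```python
import re

# Matches [[Target]], [[Target|anchor text]] and [[Target#Section|anchor]].
WIKILINK = re.compile(r"\[\[([^\[\]|#]+)(?:#[^\[\]|]*)?(?:\|([^\[\]]*))?\]\]")


def existing_link_targets(wikitext):
    """Return the set of page titles already linked in the raw wikitext."""
    return {match.group(1).strip() for match in WIKILINK.finditer(wikitext)}


sample = (
    "[[Berlin]] is the capital of [[Germany|the Federal Republic]]. "
    "{{Infobox city}} See also [[History of Berlin#Early|its history]]."
)
print(sorted(existing_link_targets(sample)))
```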
Oct 23 2020, 4:58 PM · Research (FY2020-21-Research-October-December)
MGerlach closed T260206: Add a link: testing API, a subtask of T252822: [EPIC] Growth: "add a link" structured task, as Resolved.
Oct 23 2020, 3:06 PM · Growth-Structured-Tasks, Growth-Team, Epic
MGerlach closed T260206: Add a link: testing API as Resolved.

Update: we went through different iterations:

Oct 23 2020, 3:06 PM · Growth-Structured-Tasks, Growth-Team
MGerlach added a comment to T265605: Add Link engineering: Consolidate dedcode/addlink and mgerlach/mwaddlink-query into single repository.

@kostajh at the moment the gerrit-repo contains two requirements-files:

  • requirements.txt (the full environment required for training and querying)
  • requirements_query.txt (the lighter environment only for querying)
Oct 23 2020, 2:48 PM · Growth-Team (Current Sprint), Add-Link, Growth-Structured-Tasks
MGerlach added a comment to T257254: Add a link: backtesting protocol.

Update: implemented the first version of the backtesting protocol.

  • We evaluate the link recommendation on individual sentences (the first sentence in an article that has at least one link).
  • We remove all existing (true) links from the sentence. We run the model to get link recommendations for that sentence. We compare whether the recommended links match the true existing links (both the anchor-text and the link-target-page have to match)
    • precision: how many of the recommended links are true links
    • recall: how many of the true links were recommended
  • we can vary the threshold for recommending a link
    • low threshold = low precision, high recall
    • high threshold = higher precision, lower recall
  • trained the model and ran backtesting for 7 wikis (simple, de, pt, ar, cs, ko, vi) without language-specific fine-tuning
    • at a recall of ~40% we reach a precision of at least 70-80% in every language
    • for some languages such as vi, pt, precision is even higher
  • results (and some more details): https://meta.wikimedia.org/wiki/Research:Link_recommendation_model_for_add-a-link_structured_task#First_set_of_results_(2020-10)
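The effect of the threshold described above can be reproduced on a toy sentence. This is a sketch with assumed data structures: links are (anchor, target) pairs, the model output is a dict from link to score, and all scores are invented for illustration. Reporting 0.0 when a ratio is undefined is a convention chosen here.

```python
def precision_recall_at(threshold, scored, true_links):
    """Precision and recall for one sentence at a given link threshold.

    scored: dict mapping (anchor, target) -> model score (assumed in [0, 1]).
    A recommendation counts as correct only if anchor and target both match.
    """
    recommended = {link for link, score in scored.items() if score >= threshold}
    hits = len(recommended & true_links)
    precision = hits / len(recommended) if recommended else 0.0
    recall = hits / len(true_links) if true_links else 0.0
    return precision, recall


true_links = {("Berlin", "Berlin"), ("Germany", "Germany")}
scored = {
    ("Berlin", "Berlin"): 0.9,
    ("Germany", "Germany"): 0.4,
    ("capital", "Capital city"): 0.3,  # not a true link in this sentence
}
for threshold in (0.2, 0.5):
    print(threshold, precision_recall_at(threshold, scored, true_links))
```

In this toy case the lower threshold recommends all three candidates (lower precision, full recall), while the higher one keeps only the best-scored link (full precision, half the recall).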
Oct 23 2020, 2:04 PM · Growth-Structured-Tasks, Growth-Team

Oct 19 2020

MGerlach added a comment to T265605: Add Link engineering: Consolidate dedcode/addlink and mgerlach/mwaddlink-query into single repository.

@kostajh are there any naming/structuring conventions for virtual environments in production that I should follow? For example, in the solution described above there will be several requirements-files in a requirements-folder, with the requirements.txt in the main folder mirroring the production environment.

Oct 19 2020, 5:43 PM · Growth-Team (Current Sprint), Add-Link, Growth-Structured-Tasks
MGerlach added a comment to T258978: Service operations setup for Add a Link project.
  • Logging: log in json format to stdout
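One way to meet this requirement with the Python standard library is sketched below. The field names in the JSON payload and the logger name are assumptions, not the format used by the service.

```python
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record):
        payload = {
            "time": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        return json.dumps(payload)


def make_logger(name="addlink-sketch", stream=sys.stdout):
    """Return a logger that writes JSON lines to stdout (or another stream)."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]  # replace, so repeated calls do not duplicate output
    logger.propagate = False
    return logger


make_logger().info("model loaded for %s", "simplewiki")
```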
Oct 19 2020, 4:18 PM · Patch-For-Review, Add-Link, Growth-Team (Current Sprint), Product-Infrastructure-Team-Backlog, SRE, serviceops, GrowthExperiments-NewcomerTasks
MGerlach added a comment to T265605: Add Link engineering: Consolidate dedcode/addlink and mgerlach/mwaddlink-query into single repository.

Looks like the code is in a single repo (and will soon be imported to gerrit, where we should push patches), but leaving this open to implement the multiple requirements.txt approach.

Oct 19 2020, 4:16 PM · Growth-Team (Current Sprint), Add-Link, Growth-Structured-Tasks
MGerlach committed rRMWAce48455b0c98: json-logging (authored by MGerlach).
json-logging
Oct 19 2020, 2:47 PM
MGerlach committed rRMWA00f77e89d1db: backtesting-eval varying link-threshold (authored by MGerlach).
backtesting-eval varying link-threshold
Oct 19 2020, 2:47 PM
MGerlach committed rRMWA5df0c889c872: lowering default link-threshold (authored by MGerlach).
lowering default link-threshold
Oct 19 2020, 2:47 PM
MGerlach committed rRMWA04c0b957c916: adapting training-data generation with better negative examples (authored by MGerlach).
adapting training-data generation with better negative examples
Oct 19 2020, 2:47 PM
MGerlach committed rRMWA93f789a90640: typo forgot n_jobs (authored by MGerlach).
typo forgot n_jobs
Oct 19 2020, 2:47 PM
MGerlach committed rRMWA468c613abe78: move to old (authored by MGerlach).
move to old
Oct 19 2020, 2:47 PM
MGerlach committed rRMWA762629617a7c: avoid dist if norm=0 to avoid runtime-warning (authored by MGerlach).
avoid dist if norm=0 to avoid runtime-warning
Oct 19 2020, 2:47 PM
MGerlach committed rRMWA4d0c949fba2f: merging mwaddlink-query (authored by MGerlach).
merging mwaddlink-query
Oct 19 2020, 2:47 PM
MGerlach committed rRMWAe9664518e262: change order of lopps to fix bug in offset-calculation (authored by MGerlach).
change order of lopps to fix bug in offset-calculation
Oct 19 2020, 2:47 PM
MGerlach committed rRMWA4dbeb5512e10: fix bug in read_csv appearing in cswiki with quotes in anchors (authored by MGerlach).
fix bug in read_csv appearing in cswiki with quotes in anchors
Oct 19 2020, 2:47 PM
MGerlach committed rRMWAa8612f6708cb: typo (authored by MGerlach).
typo
Oct 19 2020, 2:47 PM
MGerlach committed rRMWAd6c6cf44a1bb: update readme (authored by MGerlach).
update readme
Oct 19 2020, 2:47 PM
MGerlach committed rRMWA56735347ec2e: uncomment crucial line (authored by MGerlach).
uncomment crucial line
Oct 19 2020, 2:47 PM
MGerlach committed rRMWA9a7d251db9b6: updated pipeline using sqlite-tables (memory) (authored by MGerlach).
updated pipeline using sqlite-tables (memory)
Oct 19 2020, 2:47 PM
MGerlach committed rRMWA62def55a290a: cleaned requirements for venv (authored by MGerlach).
cleaned requirements for venv
Oct 19 2020, 2:47 PM
MGerlach committed rRMWA131b4a41f9a8: inlcude backtesing to running the pipeline (authored by MGerlach).
inlcude backtesing to running the pipeline
Oct 19 2020, 2:47 PM
MGerlach committed rRMWAa954cb8fc1ae: script to run backtesting fir any lang with refactored utils (authored by MGerlach).
script to run backtesting fir any lang with refactored utils
Oct 19 2020, 2:47 PM
MGerlach committed rRMWA2a45ec038665: update interval to 100 (authored by MGerlach).
update interval to 100
Oct 19 2020, 2:47 PM
MGerlach committed rRMWA53bac2e0c9bf: keep alternative anchor dict generaotr upo date (authored by MGerlach).
keep alternative anchor dict generaotr upo date
Oct 19 2020, 2:47 PM
MGerlach committed rRMWA0358f724ccdd: backtesting with refactored utils (authored by MGerlach).
backtesting with refactored utils
Oct 19 2020, 2:47 PM
MGerlach committed rRMWAdb955f4acd40: new backtesting notebook with refactored utils (authored by MGerlach).
new backtesting notebook with refactored utils
Oct 19 2020, 2:47 PM
MGerlach committed rRMWA55612ea33432: new link prediction notebook with refactored utils (authored by MGerlach).
new link prediction notebook with refactored utils
Oct 19 2020, 2:47 PM
MGerlach committed rRMWA6c76b3173615: update uils dependency (authored by MGerlach).
update uils dependency
Oct 19 2020, 2:47 PM
MGerlach committed rRMWA88b01b12c751: remove old utils (authored by MGerlach).
remove old utils
Oct 19 2020, 2:47 PM
MGerlach committed rRMWA586d31e8e7cc: factor out new utils (authored by MGerlach).
factor out new utils
Oct 19 2020, 2:47 PM
MGerlach committed rRMWA428cc4a909ba: updated recommender to work with shelve-data (authored by MGerlach).
updated recommender to work with shelve-data
Oct 19 2020, 2:47 PM
MGerlach committed rRMWA62a0326f7958: running pipeline for agiven lang (authored by MGerlach).
running pipeline for agiven lang
Oct 19 2020, 2:47 PM
MGerlach committed rRMWA8a819f23f790: some more utils-functinos (authored by MGerlach).
some more utils-functinos
Oct 19 2020, 2:47 PM
MGerlach committed rRMWA8171b6488276: convert all datasets to shelve-format (authored by MGerlach).
convert all datasets to shelve-format
Oct 19 2020, 2:47 PM
MGerlach committed rRMWA753c160f00df: anchor dict via spark (authored by MGerlach).
anchor dict via spark
Oct 19 2020, 2:47 PM

Oct 16 2020

MGerlach added a comment to T265745: Develop automatic evaluation protocol with existing links to avoid manual evaluation and evaluate link recommendation on 4 or more languages.

Update week 2020-10-12:

  • started to build backtesting dataset: (https://github.com/dedcode/mwaddlink/blob/master/scripts/generate_backtesting_data.py)
    • for a given article, we only take the first sentence;
      • the assumption is that the first sentence is well-linked
      • we avoid the potential issue that a link is missing because it appears earlier in the article
    • we collect 100k sentences (with existing links) for each wiki
  • we will run the trained link recommendation model to recommend links for each sentence
  • evaluate micro-precision and micro-recall averaged over all sentences
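Micro-averaging, as planned in the last point, pools the counts over all sentences before taking the ratios. A minimal sketch, assuming each sentence is represented as a pair of sets of (anchor, target) links; this representation and the function name are assumptions, and the linked script may organise the data differently.

```python
def micro_precision_recall(sentences):
    """Micro-averaged precision and recall over many sentences.

    sentences: iterable of (true_links, recommended_links), each a set of
    (anchor, target) pairs. Counts are summed first, ratios taken once.
    """
    hits = n_recommended = n_true = 0
    for true_links, recommended in sentences:
        hits += len(true_links & recommended)
        n_recommended += len(recommended)
        n_true += len(true_links)
    precision = hits / n_recommended if n_recommended else 0.0
    recall = hits / n_true if n_true else 0.0
    return precision, recall


data = [
    ({("Berlin", "Berlin"), ("Germany", "Germany")}, {("Berlin", "Berlin")}),
    ({("Paris", "Paris")}, {("Paris", "Paris"), ("France", "France")}),
]
print(micro_precision_recall(data))
```

Because the counts are pooled, sentences with many links weigh more than sentences with few, unlike an average of per-sentence scores.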
Oct 16 2020, 5:10 PM · Research (FY2020-21-Research-October-December)
MGerlach created T265745: Develop automatic evaluation protocol with existing links to avoid manual evaluation and evaluate link recommendation on 4 or more languages.
Oct 16 2020, 5:03 PM · Research (FY2020-21-Research-October-December)
MGerlach created T265710: Research Showcase November 2020.
Oct 16 2020, 9:58 AM · Research (FY2020-21-Research-October-December)
MGerlach closed T264091: Research Showcase October 2020 as Resolved.

Closing this task as there are no more todos.

Oct 16 2020, 9:56 AM · Research (FY2020-21-Research-October-December)
MGerlach committed rRMWA254dc67c7e70: note for deprecation, merged with mwaddlink (authored by MGerlach).
note for deprecation, merged with mwaddlink
Oct 16 2020, 8:33 AM
MGerlach committed rRMWAe56d7526fa9e: add info on http-proxy (authored by MGerlach).
add info on http-proxy
Oct 16 2020, 8:33 AM
MGerlach committed rRMWA15e93d2bfde5: nltk.punkt paclage (authored by MGerlach).
nltk.punkt paclage
Oct 16 2020, 8:33 AM
MGerlach committed rRMWA26957f79bed6: first model (authored by MGerlach).
first model
Oct 16 2020, 8:33 AM
MGerlach committed rRMWA4e87edd8d856: first commit (authored by MGerlach).
first commit
Oct 16 2020, 8:33 AM

Oct 15 2020

MGerlach added a comment to T264091: Research Showcase October 2020.

@JLaytonWMF
we are canceling the showcase for october (speakers declined).
could you also remove the event from the staff calendar?
Thank you.

Oct 15 2020, 7:06 PM · Research (FY2020-21-Research-October-December)

Oct 13 2020

MGerlach renamed T261403: Move dedcode/mwaddlink from github to gerrit from Move mwaddlink-api from github to gerrit to Move mwaddlink-query from github to gerrit.
Oct 13 2020, 1:42 PM · serviceops, Release-Engineering-Team, Growth-Structured-Tasks, Growth-Team
MGerlach added a comment to T261403: Move dedcode/mwaddlink from github to gerrit.

@kostajh

  1. take https://github.com/martingerlach/mwaddlink-query and move utility methods from https://github.com/dedcode/mwaddlink into it
    1. maybe that involves making a small shared library between the two repos, depending on whether the model trainer also needs access to these methods? The overall goal would be to remove https://github.com/martingerlach/mwaddlink-query/blob/main/addlink-query_links.py#L8-L11

this could work as a temporary solution. the better option would probably be to have a shared library for both the training and the query-part in order to make sure the parsing is consistent across both. this will probably become more important later as we make tweaks to the training of the model (when seeing what needs improvement when applying to different languages). maybe we can deal with the more general solution later.
I also want to incorporate some of the suggestions mentioned in T258978#6532612

Oct 13 2020, 1:35 PM · serviceops, Release-Engineering-Team, Growth-Structured-Tasks, Growth-Team

Oct 9 2020

MGerlach closed T259370: Experimental API for unsupervised topic modeling based on reading sessions as Resolved.

Update week 2020-10-05:

  • added interactive user interface to API to generate custom lists of articles

https://reader.toolforge.org/

Oct 9 2020, 4:18 PM · Research (FY2020-21-Research-July-September)
MGerlach closed T259370: Experimental API for unsupervised topic modeling based on reading sessions, a subtask of T258804: Language-Agnostic Topic Modeling, as Resolved.
Oct 9 2020, 4:18 PM · Research, Epic