Isaac (Isaac Johnson)
Research Scientist

Projects

Policy-Admins
Group
Trusted-Contributors
Group

User Details

User Since: Oct 1 2018, 2:19 PM (290 w, 6 d)
Availability: Available
IRC Nick: isaacj
LDAP User: Isaac Johnson
MediaWiki User: Isaac (WMF)

Recent Activity

Fri, Apr 26

Isaac added a comment to T354559: Put together diff blogpost on AI + Wikimedia + Datasets.

Weekly update:

Presented initial ideas to Gap Team (presentation). This both helped me process my thoughts and led to some good feedback and new ideas. Writing should be far easier now.
Things I'm pondering:
- Folks appreciated my attempt to describe the different types of AI models at Wikimedia as background to why these datasets are important. This is a lot to fit into a blogpost but maybe I'll write a separate one or even just a wiki page as an explainer
- The blogpost is definitely ballooning in size. I probably don't have to decide right now what to include/exclude, but TA made the good suggestion that it could be a series instead of a single blogpost. So something like: 1. Intro / Current State; 2. Data Gaps; 3. Benchmarks

Fri, Apr 26, 5:00 PM · Research (FY2023-24-Research-April-June)

Isaac added a comment to T360572: Extend Article Quality Model to use HTML.

Weekly updates:

I started on loading the HTML dumps into HDFS (code courtesy of Fabian) -- this is working well and I tested with Arabic and was quite happy with how quickly it processed. Though loading in English is taking some time...
Destinie is working out some kinks in our HTML features

Fri, Apr 26, 4:39 PM · Research (FY2023-24-Research-April-June), Epic

Isaac added a comment to T361637: Support for topic infrastructure work.

Weekly updates:

ML Platform and Search Platform indicated that my plans were fine for the article-country hypothesis and they can support deployment. In particular, EB on Search indicated that the broader expansion of tags on Search index for recommender systems shouldn't pose any issues.
Put together basic API for using just the Wikidata properties: https://wiki-topic.toolforge.org/countries
Good meeting put together by Miriam in which we charted out that Community Growth could do some outreach to get feedback on the current topic taxonomy and we'd work to make updates based on that but then try to freeze the taxonomy.

Fri, Apr 26, 4:36 PM · Research (FY2023-24-Research-April-June)

Thu, Apr 25

Isaac updated subscribers of T363514: Requesting access to analytics-privatedata-users for YLiou_WMF (no server access).

@YLiou_WMF here's the task -- please sign L3

Thu, Apr 25, 6:54 PM · SRE, SRE-Access-Requests

Isaac created T363514: Requesting access to analytics-privatedata-users for YLiou_WMF (no server access).

Thu, Apr 25, 6:51 PM · SRE, SRE-Access-Requests

Wed, Apr 24

Isaac added a comment to T308164: Migrate Content Translation Recommendation API to Lift Wing.

Ahh this is great news @kevinbazira ! @KartikMistry is there any reason from the Content Translation side why we can't switch over to the LiftWing endpoint? My read is that the code is quite simple -- e.g., if I go to Content Translation on Spanish Wikipedia, the tool hits this endpoint:
https://recommend.wmflabs.org/types/translation/v1/articles?source=en&target=es&seed=Music%20Modernization%20Act|Felony%20disenfranchisement&search=morelike&application=CX
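
For reference, the query above can be assembled from its parts. This is a minimal sketch in Python that uses only the parameters visible in that URL; the helper name build_recommendation_url is hypothetical and nothing here is taken from the Content Translation code itself.

```python
from urllib.parse import quote, urlencode

# Base path copied from the URL quoted above.
BASE_URL = "https://recommend.wmflabs.org/types/translation/v1/articles"


def build_recommendation_url(source, target, seeds, search="morelike", application="CX"):
    """Assemble the recommendation query (hypothetical helper, illustration only)."""
    params = {
        "source": source,
        "target": target,
        "seed": "|".join(seeds),  # multiple seed titles are pipe-separated in the URL above
        "search": search,
        "application": application,
    }
    # quote_via=quote encodes spaces as %20; safe="|" leaves the separator unescaped.
    return BASE_URL + "?" + urlencode(params, quote_via=quote, safe="|")


url = build_recommendation_url(
    "en", "es", ["Music Modernization Act", "Felony disenfranchisement"]
)
print(url)
```

If the LiftWing endpoint accepts the same parameters (an assumption to confirm with the ML team), switching over would mostly mean changing BASE_URL.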

Wed, Apr 24, 2:48 PM · Language-Team, Machine-Learning-Team, Epic

Tue, Apr 23

Isaac added a comment to T308164: Migrate Content Translation Recommendation API to Lift Wing.

hey all (not sure who exactly to tag but maybe I'll start with @kevinbazira just because I know you did a lot of good work on this) -- I'm working on some planning for improvements to our recommender systems for next fiscal year around what topic filters we provide to editors. Content Translation is of special interest but Android's SuggestedEdits is important too. The recommendation logic for both of these systems is still hosted on GapFinder as far as I can tell, but deploying any improvements is going to require moving them to a proper service (LiftWing). Does anyone know why this effort to move Content Translation's recommendation API over to LiftWing (along with Android's endpoints T340854) stalled last year?

Tue, Apr 23, 8:26 PM · Language-Team, Machine-Learning-Team, Epic

Fri, Apr 19

Isaac added a comment to T354559: Put together diff blogpost on AI + Wikimedia + Datasets.

Weekly update:

Spoke with Stephanie (Enterprise) who will be attending a symposium on ML benchmarking with Wikimedia data and shared my early thoughts on the subject.
Spoke with Adam B about his 10% project around a Commons dump and some of the likely challenges / needs associated with that. It left me feeling optimistic though it won't be solved overnight.
Forgot that I was due to present on this to our team this week, so I postponed to this upcoming Thursday and will be doing some deep thinking and writing in the meantime

Fri, Apr 19, 10:28 PM · Research (FY2023-24-Research-April-June)

Isaac added a comment to T361637: Support for topic infrastructure work.

Weekly updates:

I put forth a draft hypothesis for next year related to a country-level article prediction model: If we build a country-level inference model for Wikipedia articles, we will be able to filter lists of articles to those about a specific region with >70% precision and >50% recall. I had a conversation with Fabian about this too and it'd be easy to pull in the cultural/geographic code that currently exists for inferring countries based on Wikidata properties. To take it a step further and cover articles without Wikidata items or with incomplete items or for geographic aspects that are not really covered in Wikidata -- e.g., geographic extent of flora/fauna -- I'd want to do some inference based on the country topics of the links in an article. Doing this online would be challenging (likely high latency as you'd need to evaluate many articles at once). There are ways to build a cache of predictions for articles and use that for evaluating the links but then you run into challenges with cache invalidation etc. Because the intent is to load the model predictions into the Search index as weighted tags, however, we can actually probably use the Search APIs to gather the country predictions for an article's links (analogous example for articletopic for en:Japanese_iris) and infer from there. This is nice because the Search index will always have up-to-date information and so we won't need to store this source of truth in multiple places.
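
A rough sketch of the two pieces described above: inferring an article's countries from the country tags of its links, and checking a country filter against the precision/recall targets in the hypothesis. All names, the 0.5 share threshold, and the toy data are assumptions for illustration; in the plan above the per-link tags would come from the Search index rather than an in-memory dict.

```python
from collections import Counter


def infer_countries_from_links(link_countries, min_share=0.5):
    """Return countries tagged on at least `min_share` of an article's links.

    `link_countries` maps each linked title to the set of countries predicted
    for it. The 0.5 threshold is an arbitrary placeholder, not a tuned value.
    """
    if not link_countries:
        return set()
    counts = Counter(c for countries in link_countries.values() for c in countries)
    total = len(link_countries)
    return {c for c, n in counts.items() if n / total >= min_share}


def articles_about(country, article_countries):
    """Filter a {title: countries} mapping to the titles tagged with `country`."""
    return {title for title, countries in article_countries.items() if country in countries}


def precision_recall(predicted, actual):
    """Precision and recall of a predicted set of titles against a labeled set."""
    true_pos = len(predicted & actual)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(actual) if actual else 0.0
    return precision, recall


# Toy data, invented for illustration only.
links = {"Link A": {"Japan"}, "Link B": {"Japan", "China"}, "Link C": {"Japan"}, "Link D": set()}
inferred = infer_countries_from_links(links)

predictions = {"A": {"Japan"}, "B": {"Japan"}, "C": {"China"}, "D": set()}
labels = {"A": {"Japan"}, "B": {"China"}, "C": {"China"}, "D": {"Japan"}}
precision, recall = precision_recall(
    articles_about("Japan", predictions), articles_about("Japan", labels)
)
print(inferred, precision, recall)
```

On real data the same precision_recall call, run per country over a labeled sample, would show whether the >70% precision and >50% recall targets are met.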

Fri, Apr 19, 10:25 PM · Research (FY2023-24-Research-April-June)

Isaac added a comment to T360572: Extend Article Quality Model to use HTML.

@DJames-WMF has made progress on converting the wikitext over to HTML features. We're finding that the old normalization values -- e.g., how many references are expected in a top-quality article for a given wiki -- are no longer well-aligned for a few features. This seems to be most relevant for page-length which then affects wikilinks and references as well. I'll need to look into re-generating these normalization values. A few options:

Use the APIs to fetch HTML for a random sample of articles to re-calibrate the values. Sample size could be a challenge though because we're looking at the 95th percentile so we need a large enough sample for that to be stable.
Slowly loop through the whole Enterprise HTML dump -- this would take a very long time and in my experience the article ordering is not random so we can't stop early unfortunately without biasing the result.
Load a snapshot of the HTML dumps for the relevant languages into HDFS and process in parallel -- this is probably the most sensible solution because then we can re-use the data if we ever need to come back and recompute a value.
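
One way to gauge the sample-size concern in the first option is to bootstrap the 95th percentile and see how much it moves between resamples. This is a sketch of that idea only, not something used in the task; the function names are hypothetical and the page lengths are synthetic.

```python
import math
import random


def percentile(values, pct):
    """Nearest-rank percentile for an integer pct in 1..100."""
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


def bootstrap_spread(values, pct=95, rounds=200, seed=0):
    """Smallest and largest percentile estimate across bootstrap resamples."""
    rng = random.Random(seed)
    estimates = [
        percentile(rng.choices(values, k=len(values)), pct) for _ in range(rounds)
    ]
    return min(estimates), max(estimates)


# Synthetic, skewed "page lengths" standing in for real feature values.
rng = random.Random(42)
page_lengths = [rng.lognormvariate(8, 1) for _ in range(2000)]
small_sample = page_lengths[:100]

print("n=100: ", bootstrap_spread(small_sample))
print("n=2000:", bootstrap_spread(page_lengths))
```

A wide gap between the two numbers for a given sample size suggests the normalization value would not be stable at that size.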

Fri, Apr 19, 10:19 PM · Research (FY2023-24-Research-April-June), Epic

Isaac added a comment to T219903: Keep research.wikimedia.org landing page updated.

Confirmed -- thanks @DDeSouza !

Fri, Apr 19, 11:50 AM · periodic-update, Research

Tue, Apr 16

Isaac added a comment to T343228: Changes to Research Showcase MediaWiki.

Would it be possible to add the theme of the showcase to its subpage title?

Good idea. I think it should be doable (just need to move the pages to the new titles). The downside would be it's harder to guess the page title but maybe that's not an issue. One alternative too: when we picked up this task, there was also a question about whether we wanted some sort of "summary" of our archive too. Maybe the listing of pages isn't the place to do that and instead we add a basic table to the page with each month, a link to the full description, and the theme?

Tue, Apr 16, 10:00 PM · Research, Research-outreach, Research-foundational

Isaac updated the task description for T362416: Attend ICWSM 2024 conference.

Tue, Apr 16, 9:29 PM · Research-foundational, address-knowledge-gaps, Research-outreach, Research

Isaac added a comment to T219903: Keep research.wikimedia.org landing page updated.

@DDeSouza I went ahead and made a merge request for a new paper and some small adjustments to the other papers (seemed easier than trying to explain in this case): https://gitlab.wikimedia.org/repos/sre/miscweb/research-landing-page/-/merge_requests/25

Tue, Apr 16, 3:37 PM · periodic-update, Research

Mon, Apr 15

Isaac added a comment to T343228: Changes to Research Showcase MediaWiki.

Thanks all for the patience on this -- I have now moved all the past showcases onto monthly archive pages and added the search functionality to the main page: https://www.mediawiki.org/wiki/Wikimedia_Research/Showcase#Archive

Mon, Apr 15, 7:57 PM · Research, Research-outreach, Research-foundational

Fri, Apr 12

Isaac added a comment to T361623: Swap out wikitext for HTML in training quality model.

Next steps for this notebook based on Destinie's assessment (notebook) of how well-distributed each model feature is after switching to HTML. We have three features that are poorly distributed (values all lumped together) so the model cannot learn much from them. They are:

Page length: the values are all lumped around 1 because Parsoid HTML (with all of its syntax) is far more verbose than wikitext and by definition a superset of the wikitext. We don't have any perfect way of getting back to the wikitext length but probably a more reasonable assessment of article length is how much text is in it. So instead of len(article_html), let's use the get_plaintext() function and take the length of that. That function has a bunch of settings for it to work appropriately so let's use the approach used by html_to_plaintext() in this notebook with a few small tweaks:
- Don't exclude List elements (they often have valid content from an article quality standpoint)
- Take out the if len(paragraph.strip()) > 15 clause for each paragraph (we're just counting up things so I'm okay with the occasional "weird" paragraph)
- Rather than doing the final if paragraphs: check, just use '\n'.join(paragraphs) for computing length -- this will just be an empty string (length 0) if no paragraphs.
Media: the reason they're lumping to 1s is probably because many articles have lots of little icons that aren't defined in the wikitext (transcluded via templates). These are inflating our counts of images in the article. I put together one heuristic to filter these out in the test cases and I think we can re-use that pixel-size logic here too (code). This should reduce our media counts back to where they're more evenly distributed.
Categories: Here the lumping towards 1 values likely is the result of hidden categories (usually transcluded via templates again and not in the wikitext). One way around this is to check each category returned by get_categories() to see if it was transcluded. There's an existing function in the library (example import statement) and then we can do something like len([1 for c in article.wikistew.get_categories() if not is_transcluded(c)]).
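
The three adjustments above can be sketched with stand-in data structures. The Media and Category classes below are hypothetical placeholders rather than the library's own objects, and the 32-pixel cutoff is an assumed value, not the threshold from the linked code.

```python
from dataclasses import dataclass


@dataclass
class Media:
    """Stand-in for an image extracted from the HTML (hypothetical)."""
    width: int
    height: int


@dataclass
class Category:
    """Stand-in for a category extracted from the HTML (hypothetical)."""
    title: str
    transcluded: bool


def page_length(paragraphs):
    """Length of the plaintext; an empty list yields an empty string (length 0)."""
    return len("\n".join(paragraphs))


def count_media(media, min_pixels=32):
    """Count images, skipping small icons. The 32px cutoff is an assumption."""
    return sum(1 for m in media if m.width >= min_pixels and m.height >= min_pixels)


def count_categories(categories):
    """Count only categories that were not transcluded via templates."""
    return sum(1 for c in categories if not c.transcluded)


# Toy inputs, invented for illustration.
paragraphs = ["First paragraph.", "- a list item"]
media = [Media(16, 16), Media(220, 150)]
categories = [Category("Plants", False), Category("Articles with short description", True)]

print(page_length(paragraphs), count_media(media), count_categories(categories))
```

In the real notebook the paragraphs, images, and categories would come from the parsed article rather than hand-built lists.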

Fri, Apr 12, 8:43 PM · Research (FY2023-24-Research-April-June)

Isaac added a comment to T354559: Put together diff blogpost on AI + Wikimedia + Datasets.

Weekly update: no progress

Fri, Apr 12, 6:49 PM · Research (FY2023-24-Research-April-June)

Isaac updated the task description for T361623: Swap out wikitext for HTML in training quality model.

Fri, Apr 12, 6:48 PM · Research (FY2023-24-Research-April-June)

Isaac added a comment to T361637: Support for topic infrastructure work.

Weekly updates:

Asked for input from Search on adding in the different topic tags we're considering (countries, quality, wikiprojects): https://etherpad.wikimedia.org/p/recsys-search-tags-future
Talked with Inuka Team about challenges/opportunities in this space as they consider potential projects to take on
Part of discussions with EH at Wikimedia Uruguay and others around their new templates for WikiProjects, which automatically find tasks to surface to editors: https://es.wikipedia.org/wiki/Wikiproyecto:Cambio_clim%C3%A1tico. This is an exciting replication of the infrastructure that Growth has worked on for Newcomer Homepage but by community members within the WikiProject context. It's further motivation for adding WikiProject tags to Search as well because without that, it's much harder to use our structured task filters (add-a-link; add-an-image) because there's no single query that filters by Wikiproject and task availability.
Began exploring feasibility of geography model on LiftWing. Ascertained that there could be key-value store support in the future that might be useful (if we use links to infer countries, we'll need to quickly look up the associated countries with each article link). In the meantime, it should be easy to just grab an item's Wikidata JSON and just check the country-related properties as we do with the culture metrics.
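
A minimal sketch of the Wikidata check mentioned in the last item, operating on an already-fetched entity JSON dict. The property list (P17 country, P27 country of citizenship, P495 country of origin) is an assumption for illustration; the set actually used for the culture metrics is not given here.

```python
# Assumed property list; the real set used by the culture metrics may differ.
COUNTRY_PROPERTIES = ("P17", "P27", "P495")


def country_qids(entity):
    """Collect item IDs from the country-related claims of a Wikidata entity dict."""
    qids = set()
    claims = entity.get("claims", {})
    for pid in COUNTRY_PROPERTIES:
        for statement in claims.get(pid, []):
            snak = statement.get("mainsnak", {})
            if snak.get("snaktype") != "value":
                continue  # skip "somevalue" / "novalue" statements
            value = snak.get("datavalue", {}).get("value", {})
            if isinstance(value, dict) and "id" in value:
                qids.add(value["id"])
    return qids


# Trimmed-down example in the shape of Wikidata's entity JSON (Q17 is Japan).
example_entity = {
    "claims": {
        "P17": [
            {
                "mainsnak": {
                    "snaktype": "value",
                    "property": "P17",
                    "datavalue": {
                        "value": {"entity-type": "item", "id": "Q17"},
                        "type": "wikibase-entityid",
                    },
                }
            }
        ],
        "P27": [{"mainsnak": {"snaktype": "novalue", "property": "P27"}}],
    }
}

print(country_qids(example_entity))
```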

Fri, Apr 12, 6:37 PM · Research (FY2023-24-Research-April-June)

Isaac added a comment to T346089: Investigate Isaac article quality ML model as option .

Yep, that work is happening under T360455 and T360572

Fri, Apr 12, 2:33 PM · Epic, Wikimedia Enterprise - Content Integrity, Wikimedia Enterprise

Wed, Apr 10

Isaac added a comment to T361623: Swap out wikitext for HTML in training quality model.

Added another step for the bug-fixing we're working on right now with 0-values for some of the features. I also unchecked the optional exploration -- that actually is separate from the notebook (it involves updating a README file in a code repository) so we can talk about it in a future meeting and decide whether to pick it up or not.

Wed, Apr 10, 6:49 PM · Research (FY2023-24-Research-April-June)

Isaac updated the task description for T361623: Swap out wikitext for HTML in training quality model.

Wed, Apr 10, 6:47 PM · Research (FY2023-24-Research-April-June)

Wed, Apr 3

Isaac added a comment to T318384: Put API on Cloud VPS .

Thanks! Unlikely to happen soon but when we reach a stage where we are re-training the model, I'll see if we can experiment with nudging the model away from these sorts of responses (because agreed that it's ideal to solve it via model architecture / training as opposed to post-hoc filters if possible). And please continue to share if you see other patterns in incorrect recommendations.

Wed, Apr 3, 8:43 PM · Wikipedia-Android-App-Backlog (Android Release - FY2023-24)

Isaac reassigned T361623: Swap out wikitext for HTML in training quality model from Isaac to DJames-WMF.

Wed, Apr 3, 4:31 PM · Research (FY2023-24-Research-April-June)

Isaac closed T360815: Replicate Article Quality Training Notebook as Resolved.

Excellent work @DJames-WMF ! Took a readthrough of your notebook and everything looked good. Closing this as resolved. We didn't pursue the Nepalese Wikipedia extension but that's okay -- we can always come back to it later. For now, I'd like to progress to the HTML work that you've started in T361623.

Wed, Apr 3, 4:30 PM · Research (FY2023-24-Research-April-June)

Isaac updated the task description for T360815: Replicate Article Quality Training Notebook.

Wed, Apr 3, 4:30 PM · Research (FY2023-24-Research-April-June)

Isaac closed T360815: Replicate Article Quality Training Notebook, a subtask of T360572: Extend Article Quality Model to use HTML, as Resolved.

Wed, Apr 3, 4:30 PM · Research (FY2023-24-Research-April-June), Epic

Tue, Apr 2

Isaac added a comment to T318384: Put API on Cloud VPS .

(Shouldn't this be a factor for machine learning? I mean, if matching the title produced a wrong description as a general rule, wouldn't the machine learning algorithm infer it from the training set?)

Tue, Apr 2, 6:25 PM · Wikipedia-Android-App-Backlog (Android Release - FY2023-24)

Isaac moved T360815: Replicate Article Quality Training Notebook from Backlog to FY2023-24-Research-April-June on the Research board.

Tue, Apr 2, 6:21 PM · Research (FY2023-24-Research-April-June)

Isaac moved T348329: [Stretch] Support evaluation of text summarization for potential harms from FY2023-24-Research-January-March to FY2023-24-Research-April-June on the Research board.

Tue, Apr 2, 6:19 PM · Research (FY2023-24-Research-April-June)

Isaac moved T361623: Swap out wikitext for HTML in training quality model from Backlog to FY2023-24-Research-April-June on the Research board.

Tue, Apr 2, 6:19 PM · Research (FY2023-24-Research-April-June)

Isaac moved T361637: Support for topic infrastructure work from Backlog to FY2023-24-Research-April-June on the Research board.

Tue, Apr 2, 6:16 PM · Research (FY2023-24-Research-April-June)

Isaac created T361637: Support for topic infrastructure work.

Tue, Apr 2, 6:15 PM · Research (FY2023-24-Research-April-June)

Isaac closed T354565: Edit summaries: dissemination of findings, a subtask of T293465: Edit Types Research, as Resolved.

Tue, Apr 2, 5:49 PM · Research, Epic

Isaac closed T354565: Edit summaries: dissemination of findings as Resolved.

Tue, Apr 2, 5:49 PM · Research (FY2023-24-Research-January-March)

Isaac added a comment to T354565: Edit summaries: dissemination of findings.

Closing this task out. We can re-open or create a new one in case substantial new work is required as a result of COLM etc. I'll still update with an arXiv link when available.

Tue, Apr 2, 5:49 PM · Research (FY2023-24-Research-January-March)

Isaac moved T354559: Put together diff blogpost on AI + Wikimedia + Datasets from FY2023-24-Research-January-March to FY2023-24-Research-April-June on the Research board.

Tue, Apr 2, 5:26 PM · Research (FY2023-24-Research-April-June)

Isaac moved T360572: Extend Article Quality Model to use HTML from Backlog to FY2023-24-Research-April-June on the Research board.

Tue, Apr 2, 5:26 PM · Research (FY2023-24-Research-April-June), Epic

Isaac added a comment to T361623: Swap out wikitext for HTML in training quality model.

@DJames-WMF can claim and start this task when T360815 is complete.

Tue, Apr 2, 4:33 PM · Research (FY2023-24-Research-April-June)

Isaac created T361623: Swap out wikitext for HTML in training quality model.

Tue, Apr 2, 4:32 PM · Research (FY2023-24-Research-April-June)

Isaac added a comment to T318384: Put API on Cloud VPS .

Human
3 beams:

Ethnic group
Ethnic group of humanes
Ethnic group of humans

Thanks for passing this along @Jack_who_built_the_house! I checked a number of other very high-level topics and didn't find it in Civilization or Primates but did get "Class of plants" for Plants. This sort of error seems most likely with articles about very high-level concepts (which often already have article descriptions thankfully) but would still be nice to fix obviously. We might be able to address this sort of tautological output by adding a simple string-matching check to ensure that the output doesn't contain the title itself. Before we implement anything, I'd want to think about what sort of issues this might cause though with e.g., very simple titles where text matching might introduce a bunch of false positives (and therefore not return results).
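
A sketch of what such a string-matching check could look like, under the assumption that a case-insensitive substring test is enough; the function names are hypothetical and this is not the deployed logic. The last example (an invented title/description pair) shows the false-positive risk with short titles.

```python
def contains_title(title, description):
    """True if the title appears anywhere in the description (case-insensitive)."""
    return title.casefold() in description.casefold()


def filter_beams(title, beams):
    """Drop candidate descriptions that repeat the article title."""
    return [beam for beam in beams if not contains_title(title, beam)]


# The beams reported above for the article "Human".
beams = ["Ethnic group", "Ethnic group of humanes", "Ethnic group of humans"]
print(filter_beams("Human", beams))

# Invented example of a false positive: "art" is a substring of "cartographer".
print(contains_title("Art", "Dutch cartographer"))
```

Matching on whole words instead would avoid the second case but would miss "humans" for the title "Human", which is the kind of trade-off to think through before implementing anything.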

Tue, Apr 2, 12:58 PM · Wikipedia-Android-App-Backlog (Android Release - FY2023-24)

Mar 22 2024

Isaac added a comment to T354559: Put together diff blogpost on AI + Wikimedia + Datasets.

Weekly updates:

I'm behind on this in part because my thinking is still pretty wide-ranging but I'm continuing to process what I read and mull over the different angles of this. I've been attending to more urgent aspects too with Annual Planning / mentorship / etc.
Chris A. put together a nice spreadsheet of a few knowledge integrity tasks that he, Marshall, and Maryana P. put together for benchmarking some LLMs: https://docs.google.com/spreadsheets/d/1b2eG8ZlWVJa5LQDSJMACivWzZ9DCxiYjv6oImucnc20/edit#gid=0

Mar 22 2024, 8:19 PM · Research (FY2023-24-Research-April-June)

Isaac added a comment to T354565: Edit summaries: dissemination of findings.

Paper submitted to COLM and we'll hear May 24. I'll link to arXiv paper when posted.

Mar 22 2024, 8:16 PM · Research (FY2023-24-Research-January-March)

Isaac closed T360576: Extend evaluation data to include Chinese Wikipedia as Resolved.

Notebook looks good - thanks for the hard work and patience on this!

Mar 22 2024, 7:41 PM · Chinese-Sites, Research

Isaac closed T360576: Extend evaluation data to include Chinese Wikipedia, a subtask of T360572: Extend Article Quality Model to use HTML, as Resolved.

Mar 22 2024, 7:40 PM · Research (FY2023-24-Research-April-June), Epic

Isaac created T360815: Replicate Article Quality Training Notebook.

Mar 22 2024, 7:32 PM · Research (FY2023-24-Research-April-June)

Mar 20 2024

Isaac created T360576: Extend evaluation data to include Chinese Wikipedia.

Mar 20 2024, 8:28 PM · Chinese-Sites, Research

Isaac added a parent task for T360455: Add Article Quality Model to LiftWing: T360572: Extend Article Quality Model to use HTML.

Mar 20 2024, 8:15 PM · Research, Machine-Learning-Team

Isaac added a subtask for T360572: Extend Article Quality Model to use HTML: T360455: Add Article Quality Model to LiftWing.

Mar 20 2024, 8:15 PM · Research (FY2023-24-Research-April-June), Epic

Isaac created T360572: Extend Article Quality Model to use HTML.

Mar 20 2024, 8:14 PM · Research (FY2023-24-Research-April-June), Epic

Mar 19 2024

Isaac added a comment to T360455: Add Article Quality Model to LiftWing.

Task created -- @isarantopoulos just let me know if any details are missing or anything I can do to help with next steps when you are ready!

Mar 19 2024, 5:17 PM · Research, Machine-Learning-Team

Isaac created T360455: Add Article Quality Model to LiftWing.

Mar 19 2024, 5:16 PM · Research, Machine-Learning-Team

Mar 15 2024

Isaac added a comment to T354559: Put together diff blogpost on AI + Wikimedia + Datasets.

Weekly updates:

Sent thoughts over to SB about wiki benchmark datasets so we'll see if he's available to give feedback or has suggestions for other folks to talk with.
Bellagio draft points to importance of the ideas I had of writing a history of machine translation at Wikimedia with Eli. Specifically relating to the call for: A review of currently deployed systems, including (where available) quantitative and qualitative evidence of use and impact.
https://foundation.mozilla.org/en/research/library/generative-ai-training-data/common-crawl/ brings to mind a lot of parallels between Wikimedia data for AI and Common Crawl -- e.g., neither was intended as an AI dataset but both are heavily used for that purpose, which can bring some considerations for appropriate usage. The major difference is that while there is no moderation of the internet (Common Crawl), Wikimedia does have excellent moderation and therefore the possibility for harm is far lower. In general, a good blogpost to cite when talking about why it's important to think critically about how content is used for training AI.

Mar 15 2024, 5:54 PM · Research (FY2023-24-Research-April-June)

Isaac added a comment to T354565: Edit summaries: dissemination of findings.

Weekly update:

Paper solidifying -- just style / readability aspects at this point. Should be good to submit to COLM this month. We'll upload the submitted paper to arXiv as well.

Mar 15 2024, 4:43 PM · Research (FY2023-24-Research-January-March)

Mar 8 2024

Isaac added a comment to T354559: Put together diff blogpost on AI + Wikimedia + Datasets.

Weekly updates:

Reached out to Sam Bowman who has worked on natural language benchmarks / LLM alignment so I could pick his brain over what it would mean to work on a Wikimedia benchmark and/or instruction-tuning dataset.
Connecting with Enterprise as they have some potential planned work in next FY around datasets on HuggingFace and benchmarks! They're also talking about the feasibility of a Commons dataset as there was another high-profile request for imagery data.

Mar 8 2024, 9:23 PM · Research (FY2023-24-Research-April-June)

Isaac added a comment to T354565: Edit summaries: dissemination of findings.

Weekly update:

Made Leila's requested fixes including adding a future work section that touched on different dataset filtering processes and ways of representing an edit to language models. We're a bit over 9 pages (the limit) but working on bringing it within requirements.

Mar 8 2024, 9:20 PM · Research (FY2023-24-Research-January-March)

Mar 7 2024

Isaac added a comment to T252227: Mobile redirects drop provenance parameters.

Very excited to see this gaining some traction (thanks @mpopov and @dr0ptp4kt)! Commenting on the analytics side of things (I don't know enough about Varnish to comment on implementation details):

Mar 7 2024, 1:24 PM · Data-Engineering, Data Pipelines, Traffic-Icebox, SRE

Mar 6 2024

Isaac added a comment to T343123: Migrate Machine-generated Article Descriptions from toolforge to liftwing..

This is really wonderful news! Thanks @kevinbazira for slogging through this with us and @isarantopoulos for your support as well! Those endpoints were working for me too so I'll let Android indicate what the next steps are.

Mar 6 2024, 5:10 PM · Wikipedia-Android-App-Backlog (Android Release - FY2023-24), Machine-Learning-Team

Isaac added a comment to T359340: Requesting GitLab account activation for desianabae1.

Thanks @taavi! Indeed, unblocked now

Mar 6 2024, 1:28 PM · GitLab (Account Approval), Release-Engineering-Team

Isaac created T359340: Requesting GitLab account activation for desianabae1.

Mar 6 2024, 1:18 PM · GitLab (Account Approval), Release-Engineering-Team

Mar 5 2024

Isaac added a comment to T257638: Topic Dataset : Model, Threshold, Post-processing.

Connecting to another ticket focused on producing these topic snapshots: T351118

Mar 5 2024, 1:33 PM · Movement-Insights

Mar 4 2024

Isaac updated subscribers of T358095: Outreachy Application Task: Tutorial for Wikipedia language-agnostic article quality modeling data.

Mar 4 2024, 3:25 PM · Outreachy (Round 28)

Isaac updated the task description for T358095: Outreachy Application Task: Tutorial for Wikipedia language-agnostic article quality modeling data.

Mar 4 2024, 3:15 PM · Outreachy (Round 28)

Mar 1 2024

Isaac added a comment to T354559: Put together diff blogpost on AI + Wikimedia + Datasets.

Weekly updates:

iteration on recommender system vision
began writing some of my thoughts around AI datasets though haven't gotten to benchmark data part, which I think is the piece I most need to think through (the rest is things that I've been thinking about and working on for a while now)
Read the STORM paper on generating Wikipedia articles from scratch using LLMs. There's a lot in there but in particular relevant to this work, they put together a dataset of newly-created, high-quality articles as a means of benchmarking a large-scale LLM on just new content. Good example for the benchmark datasets. I'd want to take their approach probably but combined with some details on how recent the references are perhaps (would add some complications around extracting that info unfortunately).
Watched a nice, long overview of tokenizers that would have been useful before we did the mwtokenizer project but also largely validated a lot of our thinking there and gives some more food for thought. Looking back, I think I would have dropped the sentences aspect and stuck to just paragraphs (easy) and words (hard). With increasing computational power, most things that use sentences can instead use paragraphs and that's way less prone to issues. Sentences become very important really only in the context of citations because the citation generally refers to the prior sentence and so it's useful to be able to extract that for determining entailment.

Mar 1 2024, 10:00 PM · Research (FY2023-24-Research-April-June)

Isaac added a comment to T354565: Edit summaries: dissemination of findings.

Weekly update:

Marija moved paper onto new template for COLM so going to do a pass on that next week

Mar 1 2024, 9:29 PM · Research (FY2023-24-Research-January-March)

Isaac added a comment to T305688: Make HTML Dumps available in hadoop.

I'll let others chime in but that would be my feeling about the correct scope. Going historical indeed adds a lot of complications and I think current snapshots are a huge first step. I'd coordinate with Enterprise obviously just to see if any changes are going to happen with schema etc. but hopefully relatively straightforward.

Mar 1 2024, 3:21 PM · Data-Engineering (Q4 2024 April 1st - June 30th), Structured-Data-Backlog

Isaac added a comment to T305688: Make HTML Dumps available in hadoop.

Moving this to discuss with the team. Seems reasonable to have 1 or 2 versions of this if we source it from the Enterprise dumps.

Thanks @lbowmaker for considering and @mfossati for raising! Just chiming in to add my support that having a current snapshot of Parsoid HTML from Enterprise would be very helpful. We've developed a Python library (mwparserfromhtml) that enables us to extract lots of features (references, infoboxes, plaintext, etc.) easily from the HTML so are in a good position to make use of it. Within Research, we're working on switching more of our models to using it too because the gap between wikitext and HTML is definitely growing (example with references). For example, we have an intern who will be working on converting the quality model used for knowledge gap metrics from using wikitext to HTML for this reason, so having a regular snapshot that could be used for computing article quality for all articles would be very helpful.

Mar 1 2024, 1:17 PM · Data-Engineering (Q4 2024 April 1st - June 30th), Structured-Data-Backlog

Feb 28 2024

Isaac added a comment to T358195: Investigate increased preprocessing latencies on LW of article-descriptions model.

Can we investigate reducing the computational need to just the language requested?

The model definitely benefits from some translations so "just the language requested" I would say is not the right approach. Are you suggesting capping it at 5 for example? It no longer seems to be an issue from the pre-processing perspective with Ilias' fix at least so hopefully not a blocker at this point and just a bonus for capping model latency. That said, if there's a desire to further constrain, I can look into it in the next few weeks but please don't let it be a blocker given that Android has said they're comfortable moving forward per T343123#9558740.

Feb 28 2024, 7:45 PM · Wikipedia-Android-App-Backlog, Machine-Learning-Team

Isaac added a comment to T352177: Proposal: Improve how Wiki Education Dashboard counts references added.

we currently have 4 workers consuming tasks for programs and events dashboard, and 3 for wikieducational dashboard. So I guess that the max number of concurrent requests would be 7 (if all the workers are working at the same time).

Yeah, that's quite reasonable! Thanks for looping back about it.

Feb 28 2024, 12:37 PM · Outreachy (Round 27)

Feb 23 2024

Isaac added a comment to T354559: Put together diff blogpost on AI + Wikimedia + Datasets.

Weekly updates:

Building up to this. I put together some vision around the recommender systems which is separate but complements the AI dataset work in that I see the recommender systems as a core means through which we can diversify our datasets by supporting editors (and importantly making it easier for them to discover and contribute to campaigns to help enrich and diversify content on the projects). The steps in that vision are:
- Migrate ContentTranslation / SuggestedEdits recommendation code to LiftWing
- Update code to share existing filters (seed article -> related articles; topic filters)
- Expand topical filters (country-level geographic filter; updated biography/gender approach)
- Connect in campaigns/wikiprojects by allowing for filtering to specific worklists
- Expand tasks available to cover more aspects of the editing process
I also started gathering resources related to the history of machine translation at Wikimedia. I'd been thinking about this for a bit but Amy's research showcase this week prompted Eli and me to connect over it. Beyond being a fascinating general case study of community governance, this history is very informative for thinking about how the community can also govern the use of generative AI models on the projects (as a far-less-constrained parallel of machine translation). I'm talking with Eli and will think about where to go with this and who/what to include.
Around actual AI datasets: I shared a bit with Kate Z about this in relation to Annual Planning and she pointed out that there might be some overlap with some of the goals by folks in Product Analytics.

Feb 23 2024, 6:53 PM · Research (FY2023-24-Research-April-June)

Isaac added a comment to T354565: Edit summaries: dissemination of findings.

Weekly update: no progress but will check in with team next week

Feb 23 2024, 4:06 PM · Research (FY2023-24-Research-January-March)

Isaac added a comment to T358195: Investigate increased preprocessing latencies on LW of article-descriptions model.

This puzzles me. Is that really necessary for the model to work?

Yes and no -- the model is multilingual so you can think of it as doing a mixture of finding the right phrase from the first paragraph of the article within the target language, translating over descriptions from other languages, and translating+extracting phrases from articles in other language editions too. In reality, I suspect we could come up with a simple but smart way to cap how many languages it queries without actually reducing the output quality (because I'd guess that the model has everything it really needs after probably at most 5 or so languages). But also in reality most article descriptions that are missing will be for articles that exist in at most a few languages, so this sort of optimization hasn't been tested because I think it'd be triggered pretty rarely. If we feel this is important, I can do some tests to see how much this changes things output-wise. Very simple code-wise at least: you'd still gather all the possible sitelinks but then cap them with something like this (here we take the five largest language editions):

descriptions, sitelinks, blp = await self.get_wikidata_info(lang, title)
# new code - excuse its hackiness
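# rank of each supported language, ordered from largest edition to smallest
lang_by_size = {l: i for i, l in enumerate(['en', 'de', 'nl', 'es', 'it', 'ru', 'fr', 'zh', 'ar', 'vi', 'ja', 'fi', 'ko', 'tr', 'ro', 'cs', 'et', 'lt', 'kk', 'lv', 'hi', 'ne', 'my', 'si', 'gu'])}
# keep only the five largest editions; languages missing from the list sort last
top_langs = sorted(sitelinks, key=lambda l: lang_by_size.get(l, len(lang_by_size)))[:5]
sitelinks = {l: sitelinks[l] for l in top_langs}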

Feb 23 2024, 4:06 PM · Wikipedia-Android-App-Backlog, Machine-Learning-Team

Feb 22 2024

Isaac added a comment to P57453 article-descriptions: evaluate `preprocess()` and `predict()` runtime on LiftWing.

@kevinbazira thanks for explaining -- I was unaware of the Rest Gateway etc. stuff so assumed LiftWing was using the same entrypoints to the APIs. I saw the other ticket where you're working through possibilities. I'll monitor but sounds like you all have it and thanks for digging into this!

Feb 22 2024, 2:53 PM · Machine-Learning-Team

Feb 21 2024

Isaac changed the visibility for T358095: Outreachy Application Task: Tutorial for Wikipedia language-agnostic article quality modeling data.

Feb 21 2024, 10:42 PM · Outreachy (Round 28)

Isaac added a comment to P57453 article-descriptions: evaluate `preprocess()` and `predict()` runtime on LiftWing.

Oh wow good sleuthing. To help me understand, does this sound right: the challenge with preprocess is that for every language (up to 25) that the article exists in, an API call has to be made to that language edition's page summary REST endpoint to get the first paragraph (enwiki endpoint). And as we can see, this normally is still under half a second because it's a pretty quick API and we do the calls async. But presumably something on LiftWing is preventing the up-to-25 API calls from being made async/simultaneously?
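To illustrate why it matters whether the up-to-25 calls run simultaneously, here is a minimal sketch. It makes no real API requests: `fetch_summary` is a hypothetical stand-in that just sleeps to simulate latency, and the language codes and delay are made up for the example.

```python
import asyncio
import time

async def fetch_summary(lang: str, title: str, delay: float = 0.02) -> str:
    """Hypothetical stand-in for one page-summary REST call (no network used)."""
    await asyncio.sleep(delay)  # simulates the latency of a single API call
    return f"{lang}:{title}"

async def fetch_all_concurrent(langs, title):
    # All calls start together, so total time is roughly one call's latency.
    return await asyncio.gather(*(fetch_summary(l, title) for l in langs))

async def fetch_all_sequential(langs, title):
    # Each call waits for the previous one, so latency adds up per language.
    return [await fetch_summary(l, title) for l in langs]

def timed(coro):
    """Run a coroutine to completion and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = asyncio.run(coro)
    return result, time.perf_counter() - start

langs = [f"lang{i}" for i in range(25)]  # placeholder codes for up to 25 editions
concurrent_result, concurrent_secs = timed(fetch_all_concurrent(langs, "Example"))
sequential_result, sequential_secs = timed(fetch_all_sequential(langs, "Example"))
print(f"concurrent: {concurrent_secs:.2f}s, sequential: {sequential_secs:.2f}s")
```

If something in the serving environment forces the calls into the sequential pattern, total preprocessing time grows with the number of languages rather than staying near a single call's latency.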

Feb 21 2024, 4:16 PM · Machine-Learning-Team

Feb 20 2024

Isaac placed T356641: Check home/HDFS leftovers of mhoutti up for grabs.

Checked and grabbed a few files that are important so mhoutti's home directory on stat1008 and HDFS may now be removed. Thanks!

Feb 20 2024, 8:57 PM · Data-Platform-SRE (2024.03.04 - 2024.03.24)

Isaac added a comment to T352177: Proposal: Improve how Wiki Education Dashboard counts references added.

Does it answer your question?

Getting us closer I think -- it is a batch job so possibility for a large number of requests all at once. Do you know what a maximum load might look like (doesn't have to be super specific, just a general sense to make sure it doesn't cause issues on Wikimedia end)? For instance, is it async but at most 20 concurrent requests or a sequential job that's only processing one revision at a time? There isn't necessarily a wrong answer though REST API documentation says max 200 reqs/second: https://en.wikipedia.org/api/rest_v1/. Older revisions could take some time to process as nothing would be cached on the Parsoid side.
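As a rough illustration of the "async but at most 20 concurrent requests" option, a batch job can cap its in-flight requests with a semaphore. This is only a sketch under assumptions: `fetch_html` is a hypothetical placeholder that sleeps instead of calling the API, and the cap of 20 is just the example number from above, not a recommended setting.

```python
import asyncio

MAX_CONCURRENT = 20  # illustrative cap; not an official limit

async def fetch_html(revid, semaphore, state):
    """Hypothetical stand-in for requesting one revision's HTML (no network used)."""
    async with semaphore:  # waits here once MAX_CONCURRENT requests are in flight
        state["in_flight"] += 1
        state["peak"] = max(state["peak"], state["in_flight"])
        await asyncio.sleep(0.01)  # simulates request latency
        state["in_flight"] -= 1
        return revid

async def run_batch(revids):
    semaphore = asyncio.Semaphore(MAX_CONCURRENT)
    state = {"in_flight": 0, "peak": 0}  # tracks observed concurrency
    results = await asyncio.gather(*(fetch_html(r, semaphore, state) for r in revids))
    return results, state["peak"]

results, peak = asyncio.run(run_batch(range(100)))
print(f"processed {len(results)} revisions, peak concurrency {peak}")
```

The same structure works for a fully sequential job by setting the cap to 1.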

Feb 20 2024, 6:51 PM · Outreachy (Round 27)

Feb 16 2024

Isaac added a comment to T354565: Edit summaries: dissemination of findings.

Weekly update:

No progress

Feb 16 2024, 9:23 PM · Research (FY2023-24-Research-January-March)

Isaac added a comment to T354559: Put together diff blogpost on AI + Wikimedia + Datasets.

Weekly updates:

No progress but formalized the importance of this with Miriam as it'll hopefully set much of my work for next FY

Feb 16 2024, 9:23 PM · Research (FY2023-24-Research-April-June)

Isaac added a comment to T352177: Proposal: Improve how Wiki Education Dashboard counts references added.

Sounds good -- one thing that came up when I was chatting a bit with our Parsoid folks: what's the strategy for collecting the ref counts? Would it be a batch job with a lot of concurrent API calls for the HTML (latency really could start to become a factor because old revisions are unlikely to be cached) or something a bit more spread-out / kinder to the APIs?

Feb 16 2024, 9:16 PM · Outreachy (Round 27)

Isaac added a comment to T352177: Proposal: Improve how Wiki Education Dashboard counts references added.

It should work for any project actually assuming they follow the same approach to handling citations as Wikipedia does but I haven't tested much beyond Wikipedia. You'd just switch the project in the REST API URL -- e.g., https://en.wiktionary.org/api/rest_v1/page/html/heart/76995678 for the article you used above: https://en.wiktionary.org/wiki/?oldid=76995678
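A small sketch of what switching the project looks like in code. The helper name `rest_html_url` is made up for illustration, and replacing spaces with underscores in titles is an assumption about how titles are passed in; the URL pattern itself is the one shown above.

```python
from urllib.parse import quote

def rest_html_url(domain: str, title: str, revid: int) -> str:
    """Build a rest_v1 page/html URL for a project domain, page title and revision ID."""
    # Assumption: titles use underscores for spaces and are percent-encoded.
    encoded_title = quote(title.replace(" ", "_"), safe="")
    return f"https://{domain}/api/rest_v1/page/html/{encoded_title}/{revid}"

url = rest_html_url("en.wiktionary.org", "heart", 76995678)
print(url)
```

Only the `domain` argument changes between projects, e.g. `en.wikipedia.org` versus `en.wiktionary.org`.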

Feb 16 2024, 6:38 PM · Outreachy (Round 27)

Feb 15 2024

Isaac added a comment to T352177: Proposal: Improve how Wiki Education Dashboard counts references added.

Oooh yes excellent example to think through. I think there are two potential answers to the question of how many references are in an article but they only loosely relate to ref tags vs. footnote templates. There are two things that I think are relevant to count regarding references in an article. Other people might use different terminology (English Wikipedia notes that people often use these terms interchangeably unfortunately) but this is how I'll distinguish:

Sources: how many unique references are in the article. In the screenshot below for the article in question, this is 38. This is what I call reference in the library which is perhaps confusing (sorry -- I'll have to think about whether to change this).
Citations: how many times a reference is used in an article. I think this is the current equivalent of references added in the dashboard. In the screenshot below you can count this by seeing how many footnotes link to each source. This is 1 if it's just a ^ label by the source but otherwise count up the individual letter labels. In this case, there are 48 total citations (1 for most sources and then 7 for source 11, 2 for source 18, and 4 for source 37). This is actually different than the 49 number returned by the wikitext-based APIs. I think what's happening is that there's a {{sfnref}} template also used in the article that's being counted in the footnote count but in reality is just providing some more information about a reference and not a distinct reference itself (again showing the challenges of working with wikitext).
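The arithmetic behind those two numbers can be written out as a quick check. The per-source counts are the ones quoted above (38 sources, with sources 11, 18 and 37 cited more than once); the variable names are just for illustration.

```python
NUM_SOURCES = 38  # unique sources listed in the article's reference section

# Sources cited more than once; every other source is cited exactly once.
multi_cited = {11: 7, 18: 2, 37: 4}

citations_per_source = {
    source: multi_cited.get(source, 1) for source in range(1, NUM_SOURCES + 1)
}

num_sources = len(citations_per_source)            # "sources"
num_citations = sum(citations_per_source.values())  # "citations"
print(num_sources, num_citations)
```

That is 35 sources cited once plus 7 + 2 + 4 = 13 citations for the other three, giving 48 citations across 38 sources.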

Feb 15 2024, 8:36 PM · Outreachy (Round 27)

Feb 14 2024

Isaac added a comment to T352177: Proposal: Improve how Wiki Education Dashboard counts references added.

I suspect it might be a lot slower with HTML, especially since the HTML old revisions is probably not cached and so would rely on getting rendered by MediaWiki for each query. But maybe it won't be too slow.

@Ragesoss yeah, that's a fair point but hopefully not a blocker. Another point in favor: if you switch to HTML, with relatively little additional overhead you can also add in extraction of other elements. Most of the latency would come from requesting the HTML from the API and doing the initial parsing in Python but extracting additional features would be very cheap after that. My library has implemented this already for audio, categories, externallinks, sections, images, infoboxes, lists, math elements, message boxes, navboxes, hatnotes, references (unique sources as opposed to inline citations), videos, wikilinks, and wikitables. There are likely some gaps depending on whether the feature is an explicit MediaWiki feature (e.g., images where extraction should be near perfect) or more norm-based and based on templates (e.g., infoboxes where some language communities might not follow the norm). But at least the library codifies some expectations and works for any language of Wikipedia. If there are other features you're interested in too but don't see above, always happy to discuss and figure out how feasible adding support would be.

Feb 14 2024, 10:17 PM · Outreachy (Round 27)

Isaac added a comment to T352177: Proposal: Improve how Wiki Education Dashboard counts references added.

I finally got around to doing some analysis of wikitext vs. HTML. High-level: about 90% of sources/citations in HTML are correctly identified via ref-tags in the wikitext. This varies by language. This is in existing articles though and we might see that a lot of them were e.g., initially added by bots but wikitext and HTML still match up pretty well for new edits (as would be relevant to P&E dashboard). That said, I think this is a good indication that long-term it makes sense to switch over to HTML as source for this data.

Feb 14 2024, 2:24 PM · Outreachy (Round 27)

Feb 9 2024

Isaac added a comment to T354559: Put together diff blogpost on AI + Wikimedia + Datasets.

Weekly updates:

Paper on machine translation above is a good read. Will try to find a way to extend an invite to the research showcase to have a discussion around machine translation as I think about that aspect of the impact of datasets.
Formalized SPARQL dataset task: T357178
I've finally put out a full release of mwparserfromhtml though there's one remaining piece that I want to work out with regards to how it handles media. This library enables me to advocate more clearly for using HTML moving forward as a more accurate and easy-to-handle snapshot of content (prompted by this discussion). In the meantime, Enterprise is working on fixing a few bugs with the dumps (T305407) and should be re-releasing them in the next few weeks to further bolster this push.

Feb 9 2024, 9:17 PM · Research (FY2023-24-Research-April-June)

Isaac closed T348331: Select two AI datasets for prioritization, a subtask of T341907: Release datasets in support of Wikimedia-related AI modeling, as Resolved.

Feb 9 2024, 9:13 PM · Research, Epic

Isaac closed T348331: Select two AI datasets for prioritization as Resolved.

Finally getting around to resolving. The two (SQL + SPARQL) were selected and documented (T354455 and T357178). Work will continue under T354559, which is focused on writing up the role of datasets at Wikimedia, particularly with regards to development of AI tools for supporting the projects.

Feb 9 2024, 9:13 PM · Research

Isaac created T357178: [long] Train model for auto-generating SPARQL queries.

Feb 9 2024, 9:01 PM · research-ideas

Isaac added a comment to T354565: Edit summaries: dissemination of findings.

Weekly update:

Met with team and a few takeaways:
- Going to consider submitting to COLM: https://colmweb.org/ (March 29th full submission)
- Discussed whether to train a larger model (LLaMa?) to show that we could achieve GPT-4-level performance and dispel notions that e.g., we're just bad at fine-tuning models. But low priority as I think we're still interested in what a much smaller model can do.
- Focus on thicker evaluations -- error analysis, more human eval, etc.
- Focus on expanded discussion / establishing of research area -- i.e. emphasizing no existing baseline approaches and GPT performance is incidental.
- Not going to pivot towards policy prediction -- core of this paper is still good and interesting and that work can continue in parallel
- If time, we'll consider more ways of representing the diffs for the models to see how that changes things

Feb 9 2024, 3:18 PM · Research (FY2023-24-Research-January-March)

Feb 8 2024

Isaac added a comment to T343123: Migrate Machine-generated Article Descriptions from toolforge to liftwing..

This is very useful (and exciting) data -- thank you @isarantopoulos !

Feb 8 2024, 11:27 PM · Wikipedia-Android-App-Backlog (Android Release - FY2023-24), Machine-Learning-Team

Feb 5 2024

Isaac claimed T356641: Check home/HDFS leftovers of mhoutti.

I'll aim to do this next week and give a heads up then when it can be fully removed.

Feb 5 2024, 8:48 PM · Data-Platform-SRE (2024.03.04 - 2024.03.24)

Feb 2 2024

Isaac added a comment to T354559: Put together diff blogpost on AI + Wikimedia + Datasets.

Weekly update: no concrete progress but a good conversation earlier this week with an external researcher at UMich that revolved around AI, Wikimedia, and what role the research team plays in all of that vs. the community of editors vs. outside actors. One of the pieces that came out of that was the value of how Wikimedians have approached machine translation over the past few years for understanding how they're approaching the more powerful generative AI and what sorts of options exist for taking advantage of the technology without it steering things too much. And a new paper for me to read around that: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4708614

Feb 2 2024, 9:39 PM · Research (FY2023-24-Research-April-June)

Isaac added a comment to T354565: Edit summaries: dissemination of findings.

Weekly update:

Meeting next week but I think a few possible strategies with this to consider:
- Shift from state-of-art paper to qualitative exploration through a model: that is, make it less about achieving a high-quality model and more about establishing the research space, determining what the major challenges are around data etc., explore a few different approaches, greatly expand discussion section with recommendations for how to proceed.
- Extend human evaluations: add more human evaluations to the work. This might be about showing that while the model isn't great, for the half of edit summaries with nothing, it at least produces something viable.
- Pivot towards policy prediction aspect: move away from text-changing edits to be about improving revert summaries in particular. Could cast as classification challenge or keep text-generation focus. This would start to mix with other ideas we've discussed so from a collaboration standpoint, might be difficult (different people involved) but feels like the most promising direction to me in general for this area of research.
- Status-quo: minimal changes and look for new venue. I don't love this as I think this could be a really strong paper with a few more adjustments.

Feb 2 2024, 6:45 PM · Research (FY2023-24-Research-January-March)

Isaac added a comment to T323875: Turn edit summary hashtags into change tags.

Chiming in as a major supporter of this ticket. Turning semi-structured data (hashtags) into structured data (edit tags) enables so so much more tooling / discovery. I also had some of @Trizek-WMF's concerns that converting every hashtag could create overload on database side or Special:Tags pages though perhaps making these tags hidden by default or some of those existing options would make that less of an issue at least from the UI side.

Feb 2 2024, 3:25 PM · Campaign-Tools, MediaWiki-Change-tagging

Feb 1 2024

Isaac added a comment to T305407: Stale data / missing pages in HTML ("enterprise") .

adding some more stale data examples I came across while working with a snapshot of the HTML dumps, specifically /public/dumps/public/other/enterprise_html/runs/20231201/enwiki-NS0-20231201-ENTERPRISE-HTML.json.tar.gz on PAWS:

A redirect that was briefly vandalized in July 2023 to not be a redirect: https://en.wikipedia.org/w/index.php?title=Adhirath&oldid=1167382940
This also very old briefly-an-article-now-a-redirect: https://en.wikipedia.org/w/index.php?oldid=1025016513

Feb 1 2024, 8:48 PM · Wikimedia Enterprise, Dumps-Generation

Jan 30 2024

Isaac added a comment to T346089: Investigate Isaac article quality ML model as option .

FYI the language-agnostic model is not currently on LiftWing to the best of my knowledge though this is something under consideration. The API Gateway link above is for the old ORES models which only cover a few languages.

Jan 30 2024, 8:08 PM · Epic, Wikimedia Enterprise - Content Integrity, Wikimedia Enterprise

Jan 24 2024

Isaac closed T348330: Human Rights AI Checklist Feedback, a subtask of T348328: Ethical AI Support, as Resolved.

Jan 24 2024, 10:38 PM · Research, Epic

Isaac closed T348330: Human Rights AI Checklist Feedback as Resolved.

Whoops thanks (I thought I'd closed this). Feedback passed back to team. Full feedback in this Google doc (https://docs.google.com/document/d/1SOjFUkRClpssyZoBNNoAQaAksbvyz9RegY76i9Ni-jQ/edit?usp=sharing) but amounted to mixture of question-specific wording to improve clarity, raising the question of how to handle models that are a mixture of pre-trained language model + fine-tuning, and question about how to address maintainability of model in the checklist. I also have a TODO to propose guidance around what "open-access" means for ML models to Legal so we can give more specifics to folks there too -- i.e. what elements of a model (training data, training code, outputs, etc.) need to be explicitly licensed and how. That will happen separately.

Jan 24 2024, 10:38 PM · Research

Isaac added a comment to T354559: Put together diff blogpost on AI + Wikimedia + Datasets.

Weekly updates: no progress but reminder to myself to touch base with Nadee as this crystallizes as she's coordinating some of the blogpost sharing around AI

Jan 24 2024, 10:34 PM · Research (FY2023-24-Research-April-June)

Isaac added a comment to T354565: Edit summaries: dissemination of findings.

Weekly update:

Rejected from WWW so meeting in two weeks to discuss next steps. Possibility of KDD sooner but similar venue and I suspect that's not enough time to make the changes we want to make to increase our chances.

Jan 24 2024, 10:33 PM · Research (FY2023-24-Research-January-March)
