Isaac (Isaac Johnson)
Research Scientist

User Details

User Since
Oct 1 2018, 2:19 PM (243 w, 4 d)
Availability
Available
IRC Nick
isaacj
LDAP User
Isaac Johnson
MediaWiki User
Isaac (WMF)

Recent Activity

Yesterday

Isaac added a comment to T334227: Ethical ML: Establish Initial Guidance.

Weekly updates:

  • Working on iterative feedback sessions on Human Rights Impact checklist. TODOs to help craft a few and potentially do a pilot implementation with some of Diego's models
  • Put together some patches for plugin to help improve quality of internal Wikipedia Search Results in anticipation of doing some testing of how well it does. Early indications are that it does just fine as ChatGPT is generally passing a standard list of keywords as opposed to the raw user question
Fri, Jun 2, 10:35 PM · Research (FY2022-23-Research-April-June)

Tue, May 30

Isaac updated the task description for T219903: Keep research.wikimedia.org landing page updated.
Tue, May 30, 4:00 PM · periodic-update, Research

Fri, May 26

Isaac added a comment to T334227: Ethical ML: Establish Initial Guidance.

Weekly updates:

  • Participated in Hackathon and processing outcomes from that!
  • Put in a few patches for wikigpt plugin to improve logging so we can better analyze the quality of the different search options
  • Provided feedback on AI Human Rights checklist and signed up to share out with team in two weeks
Fri, May 26, 9:39 PM · Research (FY2022-23-Research-April-June)

Tue, May 23

Isaac closed T333853: [Session] Self-hosting ML models on Cloud Services as Resolved.

Thanks @NicoleLBee for adding the notes!

Tue, May 23, 7:29 PM · Wikimedia-Hackathon-2023
Isaac added a comment to T337246: stat1008's /srv partition is getting full due to home dirs.

isaacj down from 146G -> 39G. Thanks for the nudge!

Tue, May 23, 12:24 PM · Data-Engineering

Fri, May 19

Isaac created T337019: Build dataset of Quarry queries.
Fri, May 19, 9:56 AM · Research ideas

Fri, May 12

Isaac created T336607: Make improvements to mwparserfromhtml.
Fri, May 12, 8:55 PM · Wikimedia-Hackathon-2023
Isaac added a comment to T334227: Ethical ML: Establish Initial Guidance.

Weekly updates:

  • Participated in AI + Wikimedia panel at WikiWorkshop
  • Figured out issue with Hackathon demo (cloud vps configuration) and so that is working now (current endpoint: https://wikitech-search.wmcloud.org/docs)! Working on putting together learnings now for the session.
Fri, May 12, 8:34 PM · Research (FY2022-23-Research-April-June)
Isaac added a comment to T334760: Understanding Contributors: Initial Analyses.

No updates -- hackathon next week and then will return attention here

Fri, May 12, 8:32 PM · Research (FY2022-23-Research-April-June)
Isaac added a comment to T321224: Wikidata Item Quality Model.

No updates still with prep for wikiworkshop/hackathon but after next week, hoping to get back to this!

Fri, May 12, 8:32 PM · Research, Linked-Open-Data-Network-Program, Wikidata
Isaac updated the task description for T219903: Keep research.wikimedia.org landing page updated.
Fri, May 12, 7:18 PM · periodic-update, Research

Mon, May 8

Isaac added a comment to T327936: Limit paws storage.

How about we set the limit to about 5GB? A little more than twice the RAM limit.

@rook If that would still solve your issue, that sounds great to me and unlikely to cause new problems! thanks!

Mon, May 8, 5:15 PM · PAWS
Isaac added a comment to T327936: Limit paws storage.

This makes sense to me overall with a few thoughts about how to reduce frustration on the user side. I took a look at mine and I was at ~5GB (sorry) and so removed a few larger data files that could easily be re-downloaded if needed again but found that actually almost 4GB of this was pip cache that I wasn't even aware of. Assuming this is not just me, a few thoughts:

  • The pip cache is pretty hidden -- I first did $ du -hs * and was confused because nothing was standing out as being particularly large but then I actually looked at the cache because I'd been doing some machine learning work and knew that the model files were stored there so assumed that was the issue (which luckily overlapped with the pip cache issue). All to say, you have to look for it explicitly to find it so folks who do a lot of Python work might find themselves inadvertently filling up their quota without understanding why.
  • Regarding my second guess for the issue about HuggingFace machine learning models -- it's true that PAWS has limited compute but it does have enough to make it a really nice place for showcasing how to use ML models (current example) for Wikimedia content. Might it make more sense to track storage space to at least match available RAM?
  • Personally, part of the challenge for me too is that I'm a long-time user of PAWS now so have built up a number of notebooks with small data components to them that together end up taking up a fair bit of space even if each one is quite within expected PAWS usage.
  • Potential compromises:
    • Is it possible to move some of this pip cache off the individual hosts or is that a headache / not a good security idea for some reason? If not and there's some sort of warning message that folks would get that could point them to some documentation/tips, including things like using $ du -hs ~/* and doing $ pip cache purge and checking the size of hidden folders would be useful pointers.
    • Will there be a way to request extra storage (as with Toolforge / Cloud VPS)? That would honestly solve most of my personal concerns because it sounds like from the statistics (thank you), there aren't too many of us who will be impacted.
    • Could there be a temporary space within a working session perhaps that has a much higher limit but is automatically deleted at the end of the session? That way it could be used for larger files such as huggingface models or pip libraries that are necessary within a session and fit normal use expectations but can be safely deleted and downloaded fresh in a future session -- perhaps this is by default the ~/.cache folder?
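
A minimal sketch of the kind of check described above, assuming the goal is simply to see which top-level entries in a home directory are largest, hidden folders such as ~/.cache included (the shell glob in `du -hs *` skips dotfiles by default, which is why the cache did not stand out). The function names are hypothetical and the sizes are apparent file sizes, not exact disk usage:

```python
import os
from pathlib import Path


def dir_size_bytes(path: Path) -> int:
    """Sum the sizes of regular files under path, without following symlinks."""
    total = 0
    for root, _dirs, files in os.walk(path, followlinks=False):
        for name in files:
            file_path = os.path.join(root, name)
            try:
                if not os.path.islink(file_path):
                    total += os.path.getsize(file_path)
            except OSError:
                # Skip files that vanish or cannot be read.
                pass
    return total


def largest_entries(base: Path, top_n: int = 10):
    """Return (size_bytes, name) for the largest top-level entries in base."""
    sizes = []
    for entry in base.iterdir():  # iterdir() also yields dotfiles such as .cache
        if entry.is_symlink():
            continue
        if entry.is_dir():
            size = dir_size_bytes(entry)
        else:
            size = entry.stat().st_size
        sizes.append((size, entry.name))
    return sorted(sizes, reverse=True)[:top_n]
```

Calling `largest_entries(Path.home())` lists the biggest entries first; if `.cache` dominates, `$ pip cache purge` is the cleanup step mentioned above.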
Mon, May 8, 4:53 PM · PAWS

Fri, May 5

Isaac added a comment to T333853: [Session] Self-hosting ML models on Cloud Services.

Just collecting some of our thoughts / intentions here for those who are interested:

  • Goal will be to demo a ML-backed tool for doing natural-language search of Wikitech documentation. You can see a simple demo here of the process start-to-finish on PAWS though the goal will be to host it as a webapp so folks can actually use it: https://public-paws.wmcloud.org/User:Isaac_(WMF)/hackathon-2023/wikitech-natural-language-search.ipynb
  • We'll share some of our learnings along the way about choosing models, adhering to open-source, challenges with working with some common libraries, etc.
  • Based on what the group of assembled folks is interested in, we can primarily do Q&A or some live coding / experimenting etc.
  • If folks have requests prior to the session, feel free to let us know though no promises that we'll be able to address them.
Fri, May 5, 6:52 PM · Wikimedia-Hackathon-2023
Isaac added a comment to T334227: Ethical ML: Establish Initial Guidance.

Weekly updates:

  • Reviewing first draft of Human Rights checklist
  • Reviewed some of the enwiki edits from the Android pilot and all were looking reasonable.
  • Continued work to pull together best practices / tips around hosting ML on cloud services.
Fri, May 5, 6:42 PM · Research (FY2022-23-Research-April-June)
Isaac added a comment to T334760: Understanding Contributors: Initial Analyses.

No updates this week

Fri, May 5, 5:14 PM · Research (FY2022-23-Research-April-June)

May 1 2023

Isaac created T335707: SuggestBot Paper Review.
May 1 2023, 6:50 PM · Research
Isaac updated the task description for T310379: SuggestBot Experimentation.
May 1 2023, 6:48 PM · Research

Apr 28 2023

Isaac added a comment to T334227: Ethical ML: Establish Initial Guidance.

Weekly updates:

  • Moving forward with Human Rights approach -- waiting to hear on next steps with them.
  • Android pilot seems to be going well -- been monitoring VPS instance to make sure it stays up and will do some evaluation of the edits to get familiar with any issues that are popping up in usage.
  • Started working on session for hackathon. Initial focus is on something like WikiGPT but for Wikitech Help namespaces both as a potentially useful tool for developers there and also to showcase what's possible with open-source tech. Example: https://public-paws.wmcloud.org/User:Isaac_(WMF)/hackathon-2023/wikitech-natural-language-search.ipynb
Apr 28 2023, 7:02 PM · Research (FY2022-23-Research-April-June)
Isaac added a comment to T334760: Understanding Contributors: Initial Analyses.

No updates this week.

Apr 28 2023, 6:58 PM · Research (FY2022-23-Research-April-June)
Isaac added a comment to T333127: [Session] LLMs, ChatGPT, machine learning tools, etc.

@kostajh see below:

Apr 28 2023, 4:40 PM · Wikimedia-Hackathon-2023

Apr 27 2023

Isaac added a comment to T295073: <Org-Wide Impact> Google Chrome User-Agent Deprecation Impact.

Checking in on the status of this issue. @Mayakp.wiki detected a large spike in pageviews that were being tagged as automated but look pretty clearly like human traffic (see T310846#8809323). The cause seems to be that the implementation by Chrome of the more generic user-agent seems to finally be rolling out in a substantial way (timeline) and so is breaking at least the bot detection pipelines in pretty significant ways. It seems the UA hints were dropped as it wasn't clear that we should be using them or that they would be of much benefit. Likely worth revisiting this conversation or considering alternatives though.

Apr 27 2023, 6:41 PM · Foundational Technology Requests

Apr 25 2023

Isaac added a comment to T333856: Cloud VPS open exception request.

Thanks for response and additional engagement. I don't expect the conclusion to change but some additional context / thoughts:

Apr 25 2023, 7:32 PM · cloud-services-team, Cloud-VPS

Apr 17 2023

Isaac added a comment to T333856: Cloud VPS open exception request.

Thanks for opening this ticket! I've added this to the agenda for next week's team meeting for consideration.

Thanks @nskaggs -- don't hesitate to let me know if any additional details would be useful for folks to know. FYI I'll be out the latter half of this week so if you have any clarifying questions, I might not get back to you until next week.

Apr 17 2023, 7:18 PM · cloud-services-team, Cloud-VPS

Apr 14 2023

Isaac added a comment to T334227: Ethical ML: Establish Initial Guidance.

Weekly updates:

  • Worked with Leila to generate some remaining questions for Human Rights folks about how that policy might be used to support the ethical ML space. They're out this week but hopefully responses next week that allow us to move forward.
Apr 14 2023, 6:49 PM · Research (FY2022-23-Research-April-June)
Isaac moved T334760: Understanding Contributors: Initial Analyses from Staged to FY2022-23-Research-April-June on the Research board.
Apr 14 2023, 6:31 PM · Research (FY2022-23-Research-April-June)
Isaac added a comment to T334760: Understanding Contributors: Initial Analyses.

A few starting plots as I consider the different ways to analyze/showcase the data. I think a lot of this will eventually be more useful when we have more focused questions to ask of it -- e.g., impact of a particular tool on edit types or specific use-cases to consider such as how often do new editors add a new sentence (as is being considered by the Editing team). The plots below are all based on edit data from the main article namespace in French Wikipedia in January 2023 (minus bot edits). Only the edit category chart contains reverts/reverted edits -- they are filtered out for all other charts. I have a TODO to make it easy to understand how the categories below are constructed, but for now, this code contains some details and interfaces were determined via edit tags.

Apr 14 2023, 6:31 PM · Research (FY2022-23-Research-April-June)
Isaac created T334760: Understanding Contributors: Initial Analyses.
Apr 14 2023, 6:07 PM · Research (FY2022-23-Research-April-June)

Apr 12 2023

Isaac moved T328264: NLP Tools: Word Tokenization from FY2022-23-Research-January-March to FY2022-23-Research-April-June on the Research board.
Apr 12 2023, 3:26 PM · Research (FY2022-23-Research-April-June)
Isaac moved T328260: NLP Tools: Sentence Tokenization from FY2022-23-Research-January-March to FY2022-23-Research-April-June on the Research board.
Apr 12 2023, 3:26 PM · Research (FY2022-23-Research-April-June)
Isaac moved T316941: NLP Tools for Content Gaps from FY2022-23-Research-January-March to FY2022-23-Research-April-June on the Research board.
Apr 12 2023, 3:26 PM · Research (FY2022-23-Research-April-June), Epic

Apr 11 2023

Isaac added a comment to T321224: Wikidata Item Quality Model.

From discussion with Lydia/Diego:

  • The concept of completeness feels closer to what we want than quality -- i.e. allowing for more nuance in how many statements are associated with a given item. We came up with a few ideas for how to make assessing item completeness easier (because otherwise it would require very extensive knowledge of a domain area to know how many statements should be associated with an item): I suggested providing the completeness score and quality score and asking the evaluator which was more appropriate but I like Lydia's idea better which was to just provide the completeness score and ask the evaluator if they felt that the actual score was lower, the same, or higher.
  • Putting together a dataset like this would be fairly straightforward -- the main challenge is having a nice stratified dataset and one that provides information on top of the original quality-oriented dataset. For example, for highly-extensive items, both models tend to agree that the item is A-class so collecting a lot more annotations won't tell us much. It's only for the shorter items where we begin to see discrepancies and so that's where we should probably focus our efforts. Plus because the model is very specific to the instance-of/occupation properties, we should make sure to have a diversity of items by those properties. This is my main TODO.
  • I read through the paper describing the new proposed Wikidata Property Suggester approach. My understanding of the existing item-completeness/recommender systems:
    • Existing Wikidata Property Suggester: make recommendations for properties to add based on statistics on co-occurrence of properties. Ignores values of these properties except for instance-of/subclass-of where the statistics are based on the value. Recommendations are ranked by probability of co-occurrence.
    • Recoin: similar to above but only uses instance-of property for determining missing properties and adds in refinement of which occupation the item has if it's a human.
    • Proposed Wikidata Property Suggester: more advanced system for finding likely co-occurring properties based on more fine-grained association rules -- i.e. doesn't just merge all the individual "if Property A -> Property B k% of the time" but instead does things like "if Property A and Property B and ... -> Property N k% of the time". Also takes into account instance-of/subclass-of property values like the existing suggester. This seems like a pretty reasonable enhancement and their approach is quite lightweight (~1.5GB RAM for holding data structure).
  • I am following the Recoin approach in my model though if the new Property Suggester proves successful and provides the data needed to incorporate into the model (a list of likely missing properties + confidence scores), it would be very reasonable to incorporate that in in place of the Recoin model at a later point and also solve some of the problems that @diego was considering addressing via wikidata embeddings (more nuanced recommendations of missing properties).
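
To make the pairwise "if Property A -> Property B k% of the time" idea concrete, here is a toy sketch. It is not the Property Suggester's or Recoin's actual code: the property names and data are invented, and taking the maximum over pairwise probabilities is an assumed stand-in for however the real system merges its rules.

```python
from collections import Counter
from itertools import permutations


def cooccurrence_probabilities(items):
    """Estimate P(B present | A present) from a list of property sets."""
    single = Counter()
    pair = Counter()
    for props in items:
        props = set(props)
        single.update(props)
        pair.update(permutations(props, 2))  # ordered pairs (A, B) with A != B
    return {(a, b): n / single[a] for (a, b), n in pair.items()}


def suggest(item_props, probs, top_n=3):
    """Rank missing properties by their best pairwise co-occurrence probability."""
    item_props = set(item_props)
    scores = {}
    for (a, b), p in probs.items():
        if a in item_props and b not in item_props:
            # Assumption: keep the strongest single rule rather than merging rules.
            scores[b] = max(scores.get(b, 0.0), p)
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]


# Toy data: each set holds the properties present on one hypothetical item.
toy_items = [
    {"instance-of", "date-of-birth", "occupation"},
    {"instance-of", "date-of-birth"},
    {"instance-of", "occupation"},
    {"instance-of"},
]
toy_probs = cooccurrence_probabilities(toy_items)
```

The finer-grained association rules in the proposed suggester would condition on several properties at once rather than on one property at a time as done here.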
Apr 11 2023, 7:50 PM · Research, Linked-Open-Data-Network-Program, Wikidata
Isaac added a comment to T246250: Sunset external automatic translation dashboard.

Ahh drat forgot to follow up but thanks @nshahquinn-wmf for the archiving and @mpopov for not letting this continue to remain unclear!

Apr 11 2023, 7:21 PM · Product-Analytics
Isaac added a comment to T333497: Include image/file changes in page-links-change.

Can you say more about this? IIUC, these are different kinds of links, yes? The page and image links are similar as @TheresNoTime says, since they are both internal hyperlinks. Is a link to a category or a template kind of the same, or are those very different?

@Ottomata fair question and I'll try to better explain myself: in theory, "links" cover a lot of interconnections between pages where changes might be useful to know about for an end-user. There are lots of ways to categorize them (intrawiki vs. interwiki vs. external; what syntax to use for creating them; how they're stored in Mediawiki; how they're used; etc.). Given that this link stream question depends on mediawiki code, I'll do my best to categorize them according to a mixture of what they do and how they're indexed on the backend. Apologies if I get any details wrong/missing in trying to do this quickly:

Apr 11 2023, 7:20 PM · Data-Engineering, Event-Platform Value Stream, EventStreams

Apr 7 2023

Isaac added a comment to T331401: Design event schema for ML scores/recommendations on current page state.

Q: Will it be useful to have the 'prior state' of predicted_classifications in this event?

This is very tempting but I don't personally have a super strong use-case for it and it feels reasonably expensive to get right. A few thoughts:

  • The best use-case I can think of for it is in being more kind when updating our Search indices -- e.g., for every revision, we compute the article topics and only if they're different from the previous topics do we send an update to the Search index. This would greatly reduce the updates to Search as most edits won't change an article substantially enough to change its topic. The tricky thing is that the topic model uses an article's links via the pagelinks table, so we don't currently have a way of getting a prediction for a past revision. For this to be feasible, I assume we'd need some cache of prior predictions? This is an extreme case but in general, it's not always a perfect assumption that the current model prediction for an old revision will be the same as the then-current model prediction for an old revision and that could cause issues depending on how we source the prior prediction.
  • For other use cases where we're just interested in triggering some behavior based on substantive changes to the article content as proxied by e.g., a large change in quality, my assumption is that we probably should instead focus on getting a stream enrichment that does edit types (diffs) and use that more directly. For example, if we want to flag when an article's quality decreases by a certain quantity, we're probably actually interested in edits that are removing certain types of content and we should just detect that directly with the edit types. The nice thing about the edit types library is that it would just be a direct enrichment and not a LiftWing call, so once there's a stream with previous+current wikitext in it, it's just a processing of those two strings with no additional API calls (or stream with current wikitext and we have the API call to get the previous wikitext).
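
A rough sketch of the "only update Search when the topics change" idea from the first bullet, assuming some cache of the last prediction sent per page. The class name, the 0.5 threshold, and the in-memory dict are all hypothetical; a real pipeline would need a persistent store:

```python
class TopicUpdateFilter:
    """Remember the last topic set sent per page and flag only real changes."""

    def __init__(self, threshold=0.5):
        self.threshold = threshold
        self._last_sent = {}  # page_id -> frozenset of topic labels

    def should_update(self, page_id, scores):
        """Return True if the thresholded topics differ from the last ones sent."""
        topics = frozenset(t for t, s in scores.items() if s >= self.threshold)
        if self._last_sent.get(page_id) == topics:
            return False
        self._last_sent[page_id] = topics
        return True
```

Because the comparison is against the cached prediction rather than a fresh prediction for the old revision, it sidesteps the problem that the topic model cannot score past revisions.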
Apr 7 2023, 3:45 PM · Event-Platform Value Stream (Sprint 12), Data-Engineering, Machine-Learning-Team, Research

Apr 6 2023

Isaac closed T316412: Understanding Contributors: Historical Edit Types as Resolved.

Resolving as we have now gotten to the state where I can do more large-scale analyses (many wikis across a month time period)

  • The main improvements came first via Muniza (pyspark config for handling the computation) and then via Fabian -- mwparserfromhell patch and optimization of my pipeline for computing edit types. Fabian's fix was merged into mwparserfromhell too which is nice confirmation and a valuable contribution to the larger community (though they haven't made a new release in about a year so might be some time before it's default).
  • For instance, non-bot edits from a month of French Wikipedia are aggregated below where you can see (if you remove reverted edits) that IP editors are more likely than other groups to be doing small maintenance edits and new editors (1-10 edits) are more likely than others to do content generation. The edit size plot (reverts excluded) backs this up. This can now be expanded to other wikis and also I have data on edit difficulty and whether it changed text:

Screenshot 2023-04-06 at 5.51.44 PM.png (1×1 px, 272 KB)

Screenshot 2023-04-06 at 5.57.55 PM.png (1×1 px, 218 KB)

Apr 6 2023, 9:59 PM · Research (FY2022-23-Research-January-March)
Isaac closed T316412: Understanding Contributors: Historical Edit Types, a subtask of T293465: Edit Types Research, as Resolved.
Apr 6 2023, 9:59 PM · Research, Epic
Isaac updated the task description for T219903: Keep research.wikimedia.org landing page updated.
Apr 6 2023, 9:56 PM · periodic-update, Research
Isaac added a comment to T333497: Include image/file changes in page-links-change.

@TheresNoTime thanks for explaining. I think I still lean towards separate streams all things equal then but ultimately I'm fine with whatever is decided so long as it enables your use case.

Apr 6 2023, 9:41 PM · Data-Engineering, Event-Platform Value Stream, EventStreams
Isaac moved T334227: Ethical ML: Establish Initial Guidance from Staged to FY2022-23-Research-April-June on the Research board.
Apr 6 2023, 2:56 PM · Research (FY2022-23-Research-April-June)
Isaac created T334227: Ethical ML: Establish Initial Guidance.
Apr 6 2023, 2:55 PM · Research (FY2022-23-Research-April-June)
Isaac updated subscribers of T333497: Include image/file changes in page-links-change.

What do you think?

Hmm...what's the use-case for having wikilinks to articles and images in the same stream? On one hand, assuming the stream specifies the link namespace explicitly, it simplifies things to only have one place to check for link changes. On the other hand, it could force folks to filter a lot of events just to get to the ones that interest them and opens the door to other questions like whether the intent is to also include templatelinks, categorylinks, etc. As a potential end-user, my gut feeling is to keep them separate like the mediawiki tables because personally I'm not generally working with models that use both links and images (and if I am, I'd likely prefer to just listen for the more generic page-change events because I'm probably watching for a lot of other things like references that aren't link-specific). Curious to hear other perspectives though.

Apr 6 2023, 1:10 PM · Data-Engineering, Event-Platform Value Stream, EventStreams

Apr 5 2023

Isaac closed T327830: ML Equity: Shared Priorities as Resolved.

Resolving this task -- at this point we have a clearer picture of where the Tech dept is going with this:

  • ML platform will continue to lead the way on internal ML models
  • I'm working on our guidance for 3rd-party ML and keeping that aligned with our internally-hosted ML expectations. This is currently taking several forms:
    • Engagement with community via hackathon and presumably other venues
    • Collecting of teams' experiences with 3rd-party ML to inform our strategy moving forward
    • Discussions with Human Rights team about alignment between ethical ML and human rights policy
    • Continue piloting work of more advanced ML tools like machine-assisted article descriptions to get a sense of what guardrails are useful etc.
Apr 5 2023, 9:07 PM · Research (FY2022-23-Research-January-March)
Isaac closed T327830: ML Equity: Shared Priorities, a subtask of T293516: Recommender Systems + Content Equity, as Resolved.
Apr 5 2023, 9:07 PM · Research, Epic

Apr 3 2023

Isaac created T333856: Cloud VPS open exception request.
Apr 3 2023, 4:27 PM · cloud-services-team, Cloud-VPS
Isaac created T333853: [Session] Self-hosting ML models on Cloud Services.
Apr 3 2023, 4:19 PM · Wikimedia-Hackathon-2023
Isaac added a comment to T331401: Design event schema for ML scores/recommendations on current page state.

Anyhow, they can also be merged "on the client side" later.

I think I would lean towards this. I like the simplicity of separate streams and in Diego's example, I think it might be nice to not have the multilingual model (which if I remember is higher latency) be a blocker for the language-agnostic prediction stream?

Apr 3 2023, 1:43 PM · Event-Platform Value Stream (Sprint 12), Data-Engineering, Machine-Learning-Team, Research

Mar 30 2023

Isaac added a comment to T332218: Request creation of hackathon-2023-ml VPS project.

Thanks! Indeed many models run pretty slow on CPUs but should be good enough for prototypes and we wouldn't be doing any training of models on Cloud VPS, so that bottleneck is not so awful.

Mar 30 2023, 7:25 PM · Cloud-VPS (Project-requests)
Isaac added a comment to T246250: Sunset external automatic translation dashboard.

Is that still the case?

Thanks for checking @mpopov. The relevant context is T239876 and had to do with specific fixes that might be needed for some errors in the notebook. My sense is that if someone wanted to revive it, a lot would change given the switches to airflow, updates to wmfdata, etc. so there's no reason to try to do that work unless someone picks up the larger project again.

Mar 30 2023, 6:10 PM · Product-Analytics

Mar 28 2023

Isaac added a comment to T333127: [Session] LLMs, ChatGPT, machine learning tools, etc.

@kostajh a few of us were also thinking about a session focused on some of the technical aspects of running an LLM on Toolforge/Cloud VPS infrastructure / (hopefully) demoing some LLMs that we'd set up in advance. I haven't submitted the session yet but thoughts on whether to combine efforts here or make separate?

Mar 28 2023, 8:03 PM · Wikimedia-Hackathon-2023

Mar 24 2023

Isaac added a comment to T316412: Understanding Contributors: Historical Edit Types.

Continued work with Fabian and having promising outcomes! Working now on frwiki after Fabian split up the job into separate stages and progress on enwiki. I'll be able to start analyzing the results now and be able to scale up this sort of analysis: https://public-paws.wmcloud.org/User:Isaac_(WMF)/Edit%20Diffs/Example_Edit_Analysis_French.ipynb

Mar 24 2023, 5:00 PM · Research (FY2022-23-Research-January-March)
Isaac added a comment to T327830: ML Equity: Shared Priorities.

Weekly update:

  • Presented at https://schedule.mozillafestival.org/session/SDTAZJ-1
  • Met with Santhosh to discuss their learnings from Content Translation. A lot of really good points from that that I'm using to update my learnings doc and he'll share additional background on the tool with me.
  • Also updating doc based on outcomes from MachineVision project: community feedback and response.
  • I will be submitting a Hackathon proposal around playing with Cloud VPS-hosted AI models. ML Platform will be submitting a separate one to demo WikiGPT but hopefully this should keep the number of demos from WMF staff to a minimum.
Mar 24 2023, 4:58 PM · Research (FY2022-23-Research-January-March)
Isaac added a comment to T321224: Wikidata Item Quality Model.

Updated API to be slightly more robust to instance-of-only edge cases and provide the individual features. Output for https://wikidata-quality.wmcloud.org/api/item-scores?qid=Q67559155:

{
  "item": "https://www.wikidata.org/wiki/Q67559155",
  "features": {
    "ref-completeness": 0.9055531797461024,
    "claim-completeness": 0.903502532415779,
    "label-desc-completeness": 1.0,
    "num-claims": 11
  },
  "predicted-completeness": "A",
  "predicted-quality": "C"
}

Details:

  • ref-completeness: what proportion of expected references does the item have? References that are internal to Wikimedia are only given half-credit while external links / identifiers are given full credit. Based on what proportion of claims for a given property typically have references on Wikidata. Also takes into account missing statements.
  • claim-completeness: what proportion of the expected claims does the item have. Data taken from Recoin where less common properties for a given instance-of are weighted less.
  • label-desc-completeness: what proportion of expected labels/descriptions are present. Right now the expected labels/descriptions are English plus any language for which the item has a sitelink.
  • num-claims: how many distinct properties the item has, not the total number of claims, so the name is a misnomer that I'll fix at some point (I don't give more credit for e.g., having 3 authors instead of 1 author for a scientific paper)
  • predicted-completeness: E (worst) to A (best) based on (see guidelines), which uses just the proportional *-completeness features.
  • predicted-quality: same classes but now also includes the more generic num-claims feature too.
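
For anyone consuming this output, a small sketch of reading the response shown above. It works on the sample JSON text rather than calling the API, and the helper name and the "gap" summary are illustrative assumptions, not part of the API:

```python
import json

# Sample response copied from the output above.
SAMPLE = """
{
  "item": "https://www.wikidata.org/wiki/Q67559155",
  "features": {
    "ref-completeness": 0.9055531797461024,
    "claim-completeness": 0.903502532415779,
    "label-desc-completeness": 1.0,
    "num-claims": 11
  },
  "predicted-completeness": "A",
  "predicted-quality": "C"
}
"""

GRADES = "EDCBA"  # E (worst) to A (best), so a higher index is a better class


def summarize(response_text):
    """Pull out the two predicted classes and the weakest completeness feature."""
    data = json.loads(response_text)
    features = data["features"]
    completeness = data["predicted-completeness"]
    quality = data["predicted-quality"]
    proportional = [k for k in features if k.endswith("-completeness")]
    return {
        "item": data["item"],
        "completeness": completeness,
        "quality": quality,
        # Positive gap: the item is rated more complete than its quality class.
        "class_gap": GRADES.index(completeness) - GRADES.index(quality),
        "weakest_feature": min(proportional, key=features.get),
    }


summary = summarize(SAMPLE)
```

For this item the completeness class is two steps above the quality class, which is the kind of discrepancy expected for short items that are nonetheless complete for their type.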
Mar 24 2023, 3:46 PM · Research, Linked-Open-Data-Network-Program, Wikidata

Mar 17 2023

Isaac added a comment to T316412: Understanding Contributors: Historical Edit Types.

Ok, I verified that I am using the patched mwparserfromhell library on the cluster but the French Wikipedia run still fails with messages like:

23/03/17 16:36:50 WARN TaskSetManager: Lost task 73.0 in stage 8.0 (TID 16755) (an-worker1116.eqiad.wmnet executor 87): ExecutorLostFailure (executor 87 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding physical memory limits. 20.2 GB of 20 GB physical memory used. Consider boosting spark.executor.memoryOverhead.
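
As a rough illustration of the arithmetic behind this message: on YARN the container limit is approximately the executor memory plus spark.executor.memoryOverhead, which by Spark's documented default is 10% of executor memory with a 384 MiB floor. The sketch below is simplified (it ignores YARN's rounding to allocation increments and other settings such as PySpark or off-heap memory), and the executor sizes used are illustrative assumptions, not this job's actual configuration:

```python
def yarn_container_mb(executor_memory_mb, memory_overhead_mb=None):
    """Approximate YARN container size for one Spark executor, in MiB."""
    if memory_overhead_mb is None:
        # Documented default: 10% of executor memory, at least 384 MiB.
        memory_overhead_mb = max(384, int(executor_memory_mb * 0.10))
    return executor_memory_mb + memory_overhead_mb


# Hypothetical settings: a 16 GiB executor with default vs. explicit 4 GiB overhead.
default_limit = yarn_container_mb(16 * 1024)
boosted_limit = yarn_container_mb(16 * 1024, memory_overhead_mb=4 * 1024)
```

Raising the overhead grows the container limit without changing the JVM heap, which is why the log suggests it when memory outside the heap (for example Python worker processes) is what pushes the container over.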
Mar 17 2023, 9:07 PM · Research (FY2022-23-Research-January-March)
Isaac added a comment to T327830: ML Equity: Shared Priorities.

Weekly update:

  • Working to organize session proposals for Hackathon and understand what sorts of models are feasible to self-host on Cloud VPS in preparation
  • Continuing to reach out to teams to set up meetings to hear about their experiences with 3rd party ML
  • Preparing to be on: https://schedule.mozillafestival.org/session/SDTAZJ-1
Mar 17 2023, 9:05 PM · Research (FY2022-23-Research-January-March)
Isaac added a comment to T321224: Wikidata Item Quality Model.

I still need to do some checks because I know e.g., this fails when the item lacks statements, but I put together an API for testing the model. It has two outputs: a quality class (E worst to A best) that uses the number of claims on the item as a feature (along with labels/refs/claims completeness) and corresponds very closely to ORES model outputs and the annotated data, and a completeness class (same set of labels) that does not include the number of claims as a feature and so is more a measure of how complete an item is (a la the Recoin approach).

Mar 17 2023, 9:05 PM · Research, Linked-Open-Data-Network-Program, Wikidata

Mar 15 2023

Isaac added a comment to T332218: Request creation of hackathon-2023-ml VPS project.

T312642 for similar request from last year for internal hackathon

Mar 15 2023, 7:52 PM · Cloud-VPS (Project-requests)
Isaac created T332218: Request creation of hackathon-2023-ml VPS project.
Mar 15 2023, 7:50 PM · Cloud-VPS (Project-requests)

Mar 14 2023

Isaac added a comment to T316412: Understanding Contributors: Historical Edit Types.

Thanks @fkaelin and @leila. Some relevant background to understand where we're at:

Mar 14 2023, 8:14 PM · Research (FY2022-23-Research-January-March)
Isaac created T332081: Update PAWS Public Link button to point to new public-paws.
Mar 14 2023, 7:56 PM · PAWS
Isaac updated the task description for T219903: Keep research.wikimedia.org landing page updated.
Mar 14 2023, 3:36 PM · periodic-update, Research

Mar 13 2023

Isaac added a comment to T331401: Design event schema for ML scores/recommendations on current page state.

Good point. Starting with predicted_ might be a good idea, so there are predicted_classification, predicted_embeddings and predicted_recommendations.

Makes sense to me -- the way I see it is the predicted_ fields could be the dependable/required fields for any downstream applications whereas a field like probabilities might be a bit less standardized and aimed more at research/debugging -- e.g., for a topic model with 64 classes, fine to include all probabilities it seems. If a model had 1000 classes though, maybe doesn't make so much sense to include them all.
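
A minimal sketch of that split, under stated assumptions: the event shape, the function name build_event, the 0.5 threshold and the cap of 64 entries are all hypothetical and not part of any agreed schema. It only illustrates keeping predicted_classification as the dependable field while treating probabilities as optional detail that gets truncated for models with very many classes.

```python
from typing import Dict

# Hypothetical cap on how many class probabilities an event carries.
MAX_PROBABILITIES = 64


def build_event(probabilities: Dict[str, float],
                threshold: float = 0.5,
                max_probabilities: int = MAX_PROBABILITIES) -> dict:
    """Build an illustrative classification event.

    predicted_classification is the field downstream consumers rely on;
    probabilities is research/debugging detail, limited to the top entries.
    """
    # Labels at or above the threshold, sorted so the output is stable.
    predicted = sorted(
        label for label, p in probabilities.items() if p >= threshold
    )
    # Keep only the highest-probability classes when the model has many.
    top = sorted(probabilities.items(), key=lambda kv: kv[1], reverse=True)
    return {
        "predicted_classification": predicted,
        "probabilities": dict(top[:max_probabilities]),
    }
```

With a 64-class topic model every probability survives; with a 1000-class model only the 64 highest would be kept under this assumed cap.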

Mar 13 2023, 6:00 PM · Event-Platform Value Stream (Sprint 12), Data-Engineering, Machine-Learning-Team, Research

Mar 10 2023

Isaac added a comment to T327830: ML Equity: Shared Priorities.

Weekly update:

  • To the above, I found out about the MediaModeration Extension, which checks images on Wikimedia Commons against the industry-supported PhotoDNA tool to detect extreme content. I also was reminded of an experiment using a tool called Rosette to do named-entity recognition -- i.e. text -> Wikidata item -- for figuring out new ways to connect Wikimedia content.
  • Starting to reach out to teams and Legal to understand what a process for providing guidance around these services would look like and what to be aware of.
  • Released our summary of the results of the internal ethical salons but will continue to gather feedback from staff about their questions / interests and also focus on learning what community members are thinking (there's a call setup by Partnerships with the community in late March that I'll join)
Mar 10 2023, 8:15 PM · Research (FY2022-23-Research-January-March)
Isaac added a comment to T316412: Understanding Contributors: Historical Edit Types.

Weekly updates:

  • Fuller edittypes calculation (content generation vs. maintenance etc.) failed with memory errors for a month of French Wikipedia so it does make the scaling more difficult.
  • As part of some explorations into edit summary generation, I put together a dataset of edit summaries from English Wikipedia in 2022. Trying to decide now how edit types might help us filter those summaries down to ones that would most benefit from automated edit summaries -- e.g., only ones where words/sentences/paragraphs were changed -- and how the library could potentially also help us structure the input "diff" to a language model for analysis -- e.g., aligning the before/after content as a prompt to the model. I also finished the qualitative coding of 45 edit summaries which gave the following results. Generally summaries are mostly complete and correct but they tend to focus on summarizing the edit over explaining why it was done and some of these high statistics are due to well-formulated bot edit summaries. The issues tend to come with the empty edit summaries which are still not uncommon. Edits sometimes include multiple distinct changes -- e.g., fixing typos and adding a template -- which further complicates the ability of an edit summary to capture what happened.
Mar 10 2023, 8:12 PM · Research (FY2022-23-Research-January-March)
Isaac added a comment to T321224: Wikidata Item Quality Model.

Weekly updates:

  • Discussed with Diego the challenge of whether our annotated data is really assessing what we want it to. I'll try to join the next meeting with Lydia to hear more and figure out our options.
  • Diego is also considering how embeddings might help with better missing property / out-of-date property / quality predictions for Wikidata subgraphs where we have a lot more data and the sorts of properties you might expect vary at finer-grained levels than just instance-of/occupation. For example, country of citizenship or age might further mediate which claims you'd expect. This could also be useful for fine-grained similarity to e.g., identify similar Wikidata items to use as examples or also improve.
Mar 10 2023, 8:03 PM · Research, Linked-Open-Data-Network-Program, Wikidata
Isaac added a comment to T331686: Evaluate reliability of sentence splitting approach.

I like the set of languages/scripts you already have for evaluation. I know you're already aware that it will fail for Thai given the lack of explicit punctuation there. A few suggested inclusions:

  • German (because that's where we see the most abbreviations -- i.e. likely false-positives for sentence splits).
  • Bangla and Armenian are other languages that have unique full stop punctuation that we've missed in the past and are worth checking as well.
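
To make those checks concrete, here is a rough sketch of a rule-based splitter that treats the Bangla danda (U+0964) and the Armenian full stop (U+0589) as sentence-final and skips a small abbreviation list. It is not the implementation under evaluation: the function name, the terminator set and the tiny German abbreviation list are assumptions for illustration only.

```python
import re

# Sentence-final punctuation: Latin . ! ?, Bangla danda, Armenian full stop.
TERMINATORS = ".!?\u0964\u0589"

# Illustrative abbreviations only; a real list would be per-language and longer.
ABBREVIATIONS = {"z.B.", "bzw.", "Dr.", "usw."}

# A candidate boundary is a run of terminators followed by whitespace or the end.
BOUNDARY = re.compile(rf"[{re.escape(TERMINATORS)}]+(?=\s|$)")


def split_sentences(text: str) -> list:
    """Split text at terminators unless the preceding token is an abbreviation."""
    sentences = []
    start = 0
    for match in BOUNDARY.finditer(text):
        candidate = text[start:match.end()]
        tokens = candidate.split()
        # Skip boundaries that end in a known abbreviation (likely false positive).
        if tokens and tokens[-1] in ABBREVIATIONS:
            continue
        sentences.append(candidate.strip())
        start = match.end()
    # Keep any trailing text that lacks a terminator.
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences
```

A splitter like this returns Thai text as a single sentence, since it relies entirely on explicit punctuation.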
Mar 10 2023, 6:40 PM · EditCheck, Editing-team (Kanban Board), Spike

Mar 7 2023

Isaac added a comment to T331401: Design event schema for ML scores/recommendations on current page state.

is score the best name for this field? Is that a generally used term for ML predictions?

I don't remember the logic behind score other than that's what Aaron had always used -- i.e. it used to be the Scoring Platform team before ML Platform. prediction probably is the more general term but then that feels confusingly redundant with the nested prediction field. Maybe something like model_output as top-level name to allow for different types of outputs? And then it seems we're using prediction to be the summary of the model outputs (which makes sense generally) and probability to be the full set of outputs with their associated confidence scores. In my comment above though, I suggested a few ways to abuse the probability field in ways that don't really have probabilities (ranked results; embedding vectors) so if we go that direction, I'm wondering if something more generic like details is the only consistent umbrella term? If that feels too generic, then maybe it just makes sense to have three separate schemas (one for classification models, one for embeddings, one for recommendation models)?

Mar 7 2023, 4:01 PM · Event-Platform Value Stream (Sprint 12), Data-Engineering, Machine-Learning-Team, Research
Isaac added a comment to T331401: Design event schema for ML scores/recommendations on current page state.

Do we like this scores field? We have the opportunity to do whatever we want here, so let's take some time to brainstorm and bikeshed on what would be best, so we can use it in all the various ML use cases coming up.
Q: Would it be possible to use the same event field data model for things like image-suggestions?

So I can think of a few types of models in terms of output types:

  • Classification models (topic, revert, quality, etc.) -- all of these are essentially some sort of class and associated [0-1] probability which seems well-supported.
  • Recommendation models (add-a-link; add-an-image) -- currently my understanding of these models is that they are run in batch and two types of data are produced: tags indicating if an article has a recommendation (that could easily be supported with this schema as a has-rec score with probability of 1) and then the specific recommendations themselves are stored in a Mediawiki table (example schema: T267329). These recs are a lot more complicated. The add-a-link example has a bunch of context fields (see below). Supporting these I assume would require re-introducing arbitrary maps? Or maybe recommendation outputs just would require a second schema (which makes a lot of sense to me because of how different they are from classification models).
* phrase_to_link (text)
* context_before (text -- 5 characters of text that occur before the phrase to link)
* context_after (text -- 5 characters of text that occur after the phrase to link)
* link_target (text)
* instance_occurrence (integer) -- number showing how many times the phrase to link appears in the wikitext before we arrive at the one to link
* probability (boolean)
* insertion_order (integer) -- order in which to insert the link on the page (e.g. recommendation "foo" [0] comes before recommendation "bar baz" [1], which comes before recommendation "bar" [2], etc)
    • A simpler subset of these recommendation models would be ones that require less context -- e.g., we have a model that generates potential descriptions to be added to Wikidata. For that, the output would just be a ranked list of text -- e.g., 1: Article description, 2: Description of an article, 3: Another description). The current schema presumably could be hacked to make that work. Top-ranked description in the prediction and then each description is the key in the probability part (or maybe the value and the key is the rank?).
  • Embeddings -- we currently train embeddings as part of some of the classification models but it's very reasonable to think that at some point we might want to have a model that just outputs embeddings for articles etc. every time they're edited so other tools could make use of them without having to train their own. The output format for that then is an n-dimensional vector of floats. In the current schema, it wouldn't really have a prediction but the probability field could probably be repurposed to have the key be the vector index (0,1,2, ..., n) and the value be the embedding value for that index. For my own work, I use 50-dimensional embeddings for space reasons but it's not uncommon to see embeddings on the order of 1000 dimensions, especially for media like images.
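
As a rough illustration of how those output types could be squeezed into a single prediction/probability shape, here is a hedged sketch. The function names and the exact dictionary layout are assumptions; only the prediction and probability field names come from the discussion above, and the rank-as-key choice is one of the two options mentioned for ranked text.

```python
from typing import Dict, Sequence


def classification_output(probs: Dict[str, float], threshold: float = 0.5) -> dict:
    """Classification: labels over the threshold plus per-class probabilities."""
    return {
        "prediction": [label for label, p in probs.items() if p >= threshold],
        "probability": dict(probs),
    }


def ranked_text_output(descriptions: Sequence[str]) -> dict:
    """Ranked text: top item as the prediction, rank -> text in 'probability'."""
    return {
        "prediction": list(descriptions[:1]),
        # Keys are 1-based ranks; the values are text, not probabilities.
        "probability": {
            str(rank): text for rank, text in enumerate(descriptions, start=1)
        },
    }


def embedding_output(vector: Sequence[float]) -> dict:
    """Embedding: no prediction, vector index -> value in 'probability'."""
    return {
        "prediction": [],
        "probability": {str(i): float(v) for i, v in enumerate(vector)},
    }
```

The three results share a shape, but the values under probability differ in type and meaning (floats in [0-1], strings, unbounded floats), which is the main argument for separate schemas per model type.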
Mar 7 2023, 3:47 PM · Event-Platform Value Stream (Sprint 12), Data-Engineering, Machine-Learning-Team, Research

Mar 3 2023

Isaac added a comment to T316412: Understanding Contributors: Historical Edit Types.

Weekly updates:

  • I accidentally overwrote some partitions so I'll have to redo some of the edit-type calculations. That's okay though because I wanted to scale up to not just the simple edit types but the more interesting categories (maintenance vs. annotation vs. generation, small/medium/large, etc.)
  • I started working on qualitatively evaluating existing edit summaries on Wikipedia to assess what it would be to auto-generate them and get a better sense of how the community currently uses them. Current status: https://docs.google.com/spreadsheets/d/1acuXczi9jS2WKNWGXiwi-cLqaWM5J6kGqhHgY-Osh4Q/edit?usp=sharing
Mar 3 2023, 8:51 PM · Research (FY2022-23-Research-January-March)
Isaac added a comment to T327830: ML Equity: Shared Priorities.

Weekly update: Continued going through notes from salons and brief report-out to Tech Dept. I started collecting examples of 3rd-party ML model services used by WMF to help understand the current landscape for that. What I've got so far:

  • Machine Translation
    • The Content Translation extension hosts several open-source models but also allows users to choose between several external machine translation APIs.
  • Text-to-speech
    • The Community Tech team is exploring different text-to-speech approaches for IPA rendering – i.e. pronouncing words.
  • Machine Vision (image annotation)
    • The Machine Vision extension makes it possible to use external APIs such as Google's Cloud Vision API to identify potential depicts statements for images uploaded to Wikimedia Commons.
  • OCR (image-to-text)
    • Wikisource makes use of Google's OCR API, especially for languages which are not currently supported by the otherwise standard open-source Tesseract models.
  • Plagiarism detection
    • Earwig's copyvio tool makes use of Turnitin's API for detecting plagiarism between passages added to Wikipedia and external documents.
Mar 3 2023, 8:49 PM · Research (FY2022-23-Research-January-March)
Isaac added a comment to T321224: Wikidata Item Quality Model.

I slightly tweaked the model but also experimented with adding just a simple square-root of the number of existing claims to the model and found that that is essentially all that is needed to almost match ORES quality (which is near perfect) for predicting item quality. That said, I think this is mainly an issue with the assessment data as opposed to Wikidata quality really just being about the number of statements. For example, the dataset has many Wikidata items that are for disambiguation pages and they're almost all rated E-class (lowest) because their only property is their instance-of. I'd argue though that that's perfectly acceptable for almost all disambiguation pages and these items are nearly complete even with just that one property (you can see the frequency of other properties that occur for these pages but they're pretty low: https://recoin.toolforge.org/getbyclassid.php?subject=Q4167410&n=200). So while the number of claims is a useful feature for matching human perception of quality, I think we'd actually want to leave it out to get closer to the concept of "to what degree is an item missing major information". Most disambiguation pages would do just fine under that concept, but human items that have many more statements (but also a much higher expectation) wouldn't do as well.
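
A small sketch of the feature difference being described, assuming a hypothetical item_features helper and that the three completeness scores are already computed in [0-1]. It is not the actual model code; it only shows the square-root claim-count feature being included for the quality variant and left out for the completeness variant.

```python
import math
from typing import List


def item_features(labels_completeness: float,
                  refs_completeness: float,
                  claims_completeness: float,
                  num_claims: int,
                  include_claim_count: bool = True) -> List[float]:
    """Return an illustrative feature vector for a Wikidata item.

    With include_claim_count=True the square root of the claim count is
    appended (quality variant); with False it is omitted (completeness variant).
    """
    features = [labels_completeness, refs_completeness, claims_completeness]
    if include_claim_count:
        # The square root dampens the effect of items with very many claims.
        features.append(math.sqrt(num_claims))
    return features
```

For instance, a disambiguation item with a single claim contributes sqrt(1) = 1.0 while a human item with 100 claims contributes 10.0, so the quality variant separates them sharply even if both are equally complete for their type.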

Mar 3 2023, 8:46 PM · Research, Linked-Open-Data-Network-Program, Wikidata

Mar 1 2023

Isaac added a comment to T322332: Participants can provide optional PII info when registering for an event (MVP).

Chiming in late but in support of not using the ORES topics for this particular use-case. The ORES topics are most useful as search/list filters because they're standardized (easy to design into interfaces) and my new model makes them available for all Wikipedia languages so we can offer the same functionality to all. They are quite broad as Alex pointed out, but when just being used to narrow down lists of articles that editors are reviewing, that broadness is not such a big deal because editors can easily skip over content that's not relevant to them. Ideally they're also used in an interface that easily lets editors select/unselect them so they can explore the other topic areas (as opposed to stating them at the start and then not easily being able to change them).

Mar 1 2023, 2:07 PM · Campaign-Tools (Campaign-Tools-Sprint-35), Campaigns-Design, CampaignEvents, Campaign-Registration

Feb 24 2023

Isaac committed rRWWS2146bf545d7d: Links to template, privacy statement, easychair (authored by ELescak).
Links to template, privacy statement, easychair
Feb 24 2023, 7:06 PM
Isaac added a comment to T316412: Understanding Contributors: Historical Edit Types.

Weekly updates:

  • Transferring approach from Muniza got me through English and then next set of 6 biggest wikis for calculating an entire month's worth of edits (excluding bots and reverts/reverted). Doing the remainder threw an error though so probably need to cut that in half and try again.
  • Next step will be adding to the code so it doesn't just do the simple summary (e.g., 3 templates inserted etc.) but also gathers more details as necessary to map those changes to the broader categories I care about at scale -- e.g., content generation vs. maintenance. That might pose additional scaling challenges but I will cross that bridge when I reach it.
Feb 24 2023, 6:36 PM · Research (FY2022-23-Research-January-March)
Isaac added a comment to T327830: ML Equity: Shared Priorities.

Weekly update:

  • Fourth session wrapped up including this retroboard of outstanding questions/thoughts that folks have
  • Working now on summarizing feedback from the four sessions with HT. Current synopsis:
    • Lots of questions / interest from staff.
    • Concern over this being done in-line with our values, especially around knowledge equity (avoiding digital colonialism), transparency / avoiding exploitation, producing verifiable knowledge, and doing this work in close consultation with the broader editor community. We need to make sure we move carefully in this space despite the obvious excitement too (a lot of "on one hand... on the other hand...")
    • Clear benefits to trying to craft policy/guidance around how teams should consider using third-party AI-backed tools within their products
Feb 24 2023, 6:32 PM · Research (FY2022-23-Research-January-March)
Isaac edited projects for T316412: Understanding Contributors: Historical Edit Types, added: Research (FY2022-23-Research-January-March); removed Research (FY2022-23-Research-April-June).
Feb 24 2023, 6:27 PM · Research (FY2022-23-Research-January-March)
Isaac moved T316412: Understanding Contributors: Historical Edit Types from FY2022-23-Research-October-December to FY2022-23-Research-April-June on the Research board.
Feb 24 2023, 6:26 PM · Research (FY2022-23-Research-January-March)

Feb 23 2023

Isaac added a comment to T328842: Restructure paws away from special networking.

oh wow - thanks @rook!

Feb 23 2023, 4:11 PM · PAWS
Isaac added a comment to T328842: Restructure paws away from special networking.

I appreciate the transition period for updating links. I was curious about how many links would be affected and so ran the query (at least for Meta where I expected the most links to exist on-wiki; maybe worth running a similar query for others?) and migrating existing links seems pretty doable as most are on archive pages and so can presumably be ignored: https://quarry.wmcloud.org/query/71641

Feb 23 2023, 3:20 PM · PAWS
Isaac added a comment to T328899: Add a new outlink topic stream for EventGate main.

Is it ok if I arrange a meeting (maybe next week?), including the ML team, Isaac and Ottomata to discuss the source event and output score data model for the outlink topic stream?

Agreed - thanks @achou!

Feb 23 2023, 3:15 PM · Patch-For-Review, Data-Engineering-Planning, Event-Platform Value Stream, Machine-Learning-Team

Feb 20 2023

Isaac edited projects for T316941: NLP Tools for Content Gaps, added: Research (FY2022-23-Research-January-March); removed Research (FY2022-23-Research-April-June).
Feb 20 2023, 6:10 PM · Research (FY2022-23-Research-April-June), Epic
Isaac moved T316941: NLP Tools for Content Gaps from FY2022-23-Research-January-March to FY2022-23-Research-April-June on the Research board.
Feb 20 2023, 6:10 PM · Research (FY2022-23-Research-April-June), Epic
Isaac edited projects for T316941: NLP Tools for Content Gaps, added: Research (FY2022-23-Research-January-March); removed Research (FY2022-23-Research-April-June).
Feb 20 2023, 6:10 PM · Research (FY2022-23-Research-April-June), Epic
Isaac moved T316941: NLP Tools for Content Gaps from FY2022-23-Research-October-December to FY2022-23-Research-April-June on the Research board.
Feb 20 2023, 6:09 PM · Research (FY2022-23-Research-April-June), Epic
Isaac renamed T328264: NLP Tools: Word Tokenization from Word Tokenization to NLP Tools: Word Tokenization.
Feb 20 2023, 6:08 PM · Research (FY2022-23-Research-April-June)
Isaac closed T328265: Word Tokenization: White-spaced language Tokenization, a subtask of T328264: NLP Tools: Word Tokenization, as Resolved.
Feb 20 2023, 5:25 PM · Research (FY2022-23-Research-April-June)
Isaac closed T328265: Word Tokenization: White-spaced language Tokenization as Resolved.
Feb 20 2023, 5:25 PM · Research (FY2022-23-Research-October-December)
Isaac closed T328270: Sentencepiece: all non-whitespace languages as Declined.

now tracked under T328264

Feb 20 2023, 5:24 PM · Research (FY2022-23-Research-October-December)
Isaac closed T328270: Sentencepiece: all non-whitespace languages, a subtask of T328267: Word Tokenization: Non-whitespace languages, as Declined.
Feb 20 2023, 5:24 PM · Research (FY2022-23-Research-October-December)
Isaac closed T328269: Sentencepiece: Language Family Wise training, a subtask of T328267: Word Tokenization: Non-whitespace languages, as Declined.
Feb 20 2023, 5:24 PM · Research (FY2022-23-Research-October-December)
Isaac closed T328269: Sentencepiece: Language Family Wise training as Declined.

now tracked under T328264

Feb 20 2023, 5:23 PM · Research (FY2022-23-Research-October-December)
Isaac closed T328267: Word Tokenization: Non-whitespace languages, a subtask of T328264: NLP Tools: Word Tokenization, as Declined.
Feb 20 2023, 5:22 PM · Research (FY2022-23-Research-April-June)
Isaac closed T328267: Word Tokenization: Non-whitespace languages as Declined.

now tracked under T328264

Feb 20 2023, 5:22 PM · Research (FY2022-23-Research-October-December)
Isaac renamed T328260: NLP Tools: Sentence Tokenization from Sentence Tokenization to NLP Tools: Sentence Tokenization.
Feb 20 2023, 5:20 PM · Research (FY2022-23-Research-April-June)
Isaac closed T328263: Consider abbreviations while sentence splitting, a subtask of T328260: NLP Tools: Sentence Tokenization, as Resolved.
Feb 20 2023, 5:19 PM · Research (FY2022-23-Research-April-June)
Isaac closed T328263: Consider abbreviations while sentence splitting as Resolved.
Feb 20 2023, 5:19 PM · Research (FY2022-23-Research-October-December)
Isaac closed T328272: Sentence Tokenization: Evaluation Pipeline, a subtask of T328260: NLP Tools: Sentence Tokenization, as Declined.
Feb 20 2023, 5:18 PM · Research (FY2022-23-Research-April-June)
Isaac closed T328272: Sentence Tokenization: Evaluation Pipeline as Declined.

Closing -- now tracked under T328260

Feb 20 2023, 5:18 PM · Research (FY2022-23-Research-October-December)
Isaac updated the task description for T328260: NLP Tools: Sentence Tokenization.
Feb 20 2023, 5:17 PM · Research (FY2022-23-Research-April-June)
Isaac closed T328261: Rule-based Sentence Tokenization as Resolved.
Feb 20 2023, 5:08 PM · Research (FY2022-23-Research-October-December)