
Publicize results from SDS 1.2.1 B
Open, High, Public

Description

Goal: publicize the findings from SDS 1.2.1 B (AI use-cases). The decision is to publish a longer-form piece (in the style of a journal article) to arXiv and then consider whether we should submit to a peer-reviewed venue. This will give us the space to incorporate findings from other work (Wikidata revert risk, Reference need, etc.) and highlight the patterns that we're seeing across these projects.

Note: we may wish to update the meta page as well.

Details

Due Date
Sep 30 2025, 4:00 AM

Event Timeline

Starting with a narrower group of folks subscribed; we can add more collaborators depending on the direction chosen. @diego, could you pick this up in January after the Q2 work is finalized and put together a proposal for which audience you want to target and what form the publication would take? Feel free to propose ideas outside of what I included in the initial task description.

Isaac triaged this task as High priority. Jan 13 2025, 10:58 PM
Quiddity subscribed.

Once it's ready, I believe this might be suitable for highlighting in Tech News, too. I'll just add the relevant tag for now.
The ideal end-situation would be: You provide me with 2-5 short sentences summarizing the full final public piece, and a link (whether that's on Meta-wiki or a blog), the week before it should be announced. (Tech News is written on Wed–Fri, and published on Mondays). Thanks!

Isaac set Due Date to May 16 2025, 4:00 AM.
Isaac changed Due Date from May 16 2025, 4:00 AM to May 23 2025, 4:00 AM. May 15 2025, 2:06 PM

Moving the deadline back one week -- I'm currently reviewing the paper and we'll make some decisions shortly on next steps.

Thanks @diego for your patience! I finally had some time to think a bit deeper about the current draft and put together some thoughts about potential directions. Let me know what you think.

When we last talked, I think we agreed that the central challenge right now is finding a clear narrative for what we're trying to say with this paper. The work itself is good and robust and has been valuable for our own internal learning, but how do we generalize the message so it's interesting to others? We've talked about framing it as more of a "position paper" and I still like that idea. My thinking is that we use this paper as a defense of the importance of language models for domain-specific classification: not just generic sentiment analysis or benchmark-type questions (not particularly domain-specific), and also not classification that works purely on bag-of-words or numeric features, but tasks like vandalism detection or other content moderation models where there are platform-specific trends that you'd want the model to understand. We can point to AI Strategy to show this is an important area for Wikimedia. And we can use work like Ashkinaze et al. to say that the evidence points to LLMs struggling with these tasks, and this paper looks into what challenges remain in putting together production-ready models for these use-cases. This would have a few implications:

Novelty:

  • Looking at the related work you cited around SLMs vs. LLMs, most of them call 1B+ models "small" and none actually go down to BERT-size models (110M parameters).
  • Past work often uses "open" to mean open-license or open-weights but OSI's definition is stricter (mBERT meets it).
  • Past papers often don't fine-tune for a specific task (and compare to LLMs), but many content moderation tasks are important enough to warrant their own fine-tuned model.

Potential contributions/claims/positions:

  • NLP community should be more specific about how they use "open" – while open weights are generally all that's needed for in-depth analysis within the research community, production settings may require more transparency and a stricter definition of "open".
  • NLP community should not abandon domain-specific classification as a genre of tasks. These remain very important to platforms. We show an array of results on both small and hard tasks and across a variety of languages.
  • NLP community should update their core open models. ModernBERT was created recently, but there is no mBERT replacement, so we're still largely using models from 2019 because nothing clearly better exists for production use-cases (fast/small, multilingual, truly open).
  • NLP community should continue to work on supporting multilingual models but also think more carefully about how multilingual models are tested – ideally using language-specific datasets (not just translations) and evaluating a range of different languages/scripts. It's not just about size.
  • This further strengthens the importance of very low latency, as content moderation models generally have to be applied to everything and should ideally catch things quickly.

Changes this would require:

  • While claiming novelty by saying "no one has compared truly small LMs that are also multilingual and fully open-source to LLMs" feels like reaching for novelty by adding enough dimensions to the claim, I think it's on us to explain why this is not a niche space but actually a crucial intersection of needs. We can add some related work on domain-specific classification tasks on the Wikimedia projects (expanding on what you have now about revert-risk) to better justify why they're so important to running a large online platform. Maybe pull in some of the information from Miriam's product use-cases work from Q1. Probably focus on content moderation, but we can highlight others too.
  • In general, look for more opportunities to justify why multilingual modeling and open-source are important too.
  • Maybe we can gather some simple data to support the anecdotal evidence we've been seeing around ACL etc. having fewer Wikipedia papers as everyone tries to scale their work? This would let us say that Wikimedia projects still have many unsolved NLP needs, many of which have potentially been made worse by the introduction of Generative AI, yet the research community is leaving this area behind.
  • Incorporate Ziems et al. They're also comparing fine-tuned language models to LLMs on classification tasks, though theirs are somewhat generic tasks. In their case it's roberta-large, which I think has 355M parameters, so a few times larger than mBERT and English-only.

Addendum:

  • You should read Tilman's write-up in Signpost on "ethical LLMs". The PleIAs models might be a potential mBERT update, though we'd have to evaluate performance, and they're still generative-first and larger than mBERT.
Miriam changed Due Date from May 23 2025, 4:00 AM to Sep 30 2025, 4:00 AM. Aug 26 2025, 12:35 PM

@diego I did a pass on the current draft and added some to-dos. I think before proceeding we should check in with @leila about scope and framing. I will schedule something after the CHI deadlines.

Thank you, Diego and Miriam, for the conversation today. My summary notes are below:

  1. We have made a decision to conditionally (see point 2) publish the work on arXiv (roughly by mid-October). We can decide at a future point whether we want to send the work to a peer-reviewed venue. At this point, it's lower priority to get peer review on this particular message.
  2. The two of you will brainstorm about whether there is a good discussion and future work section that can be built for the paper, i.e., whether, based on this research and exploration, we can offer anchored and meaningful recommendations or thoughts on future research in this space. I will look to you for your recommendation about whether to proceed with arXiv or stop here.
  3. We agreed that one framing that can be helpful for our audiences is if we talk about the reality of the infrastructure we work with at this point (which can expand to some extent but not in any way close to where very large tech companies can expand theirs) and our commitments to how we develop and use AI models. This can be a grounded and inspirational message and also a formal record of how we handled this moment in time.

I look forward to hearing from you and thanks.

How should we word this for Tech News, please?