
Developing the `algo-accountability` repository
Open, Needs Triage, Public

Description

I'm establishing this ticket as a place to centralize discussion and updates on creating prototypes for model cards, datasheets, etc.

Development is happening on gitlab in the algo-accountability repo: https://gitlab.wikimedia.org/htriedman/algo-accountability. In addition, we're meeting on a weekly basis (Thursdays 5:45-6:30 PM UTC) to share updates on the project — you can comment on this phab ticket if you'd like an invitation to that meeting.

At the moment, we're focused on getting to a baseline prototype for on-wiki model cards using ORES models. The focus is on minimizing technical debt, providing clear information for the community to evaluate models on, and providing a relatively simple UX for contributors (internal or external) to submit their models.

Event Timeline

Quick update — I spent some time visually charting out exactly what the infrastructure/workflow for this system might look like. I've attached an image of my proposed design, and you can make comments on the design on Google Drawings here.

(edit: updated image because I forgot to put some arrows in the first version)

WMF Algo Accountability Architecture v1 (1).png (attached image, 189 KB)

To spark a conversation: What do we gain by running a model against a test dataset here?

Take this situation: I train a model detecting harmful edits. In the process of making the model, I will inevitably have a test and training data set and therefore test and training performance metrics. Why can't the model's creator report those in the model card directly (i.e. manually)?

It seems like one of the biggest technical hurdles is running a model against a test dataset, but shouldn't we already know the test dataset performance metrics? It just seems like the test dataset would be supplied by the model creator and not change unless the model was retrained.

Many ORES models don't have test datasets, but I am worried about over-designing for ORES.

Put another way: are "make test set performance metrics for existing ORES models" and "create a programmatic way to create model cards and data cards" separate projects?

@calbon you're making a good point and this is definitely a conversation worth having.

Why can't the model's creator report those in the model card directly (i.e. manually)?

I would argue that inputting test metrics by hand has a couple of drawbacks:

  • In the case of ORES, which is the likeliest set of models to be regularly retrained, manual entry of data will be a pain and a large-scale chokepoint. If we want to use this model card approach for ORES and have a unified effort/presentation for both WMF-built and community-built models, I don't believe that manual data entry will suffice.
  • Even if we say that test set performance metrics for existing ORES models and model cards are separate projects, manual data entry also precludes auditing algorithms as they age/replicating results independently. My assumption would be that communities might want to audit the performance of algorithms as they get older — and it will be much easier to conduct an audit on top of a pre-existing infrastructure by merely collecting some crowdsourced training data and changing a single config file than to wade through more complex statistics.
  • It also might be a pain to maintain more than one toolkit for this stuff. I worry that the community contributor model card package might fall by the wayside as we spend more time/energy developing the WMF-centric tool.

Anyone else have other thoughts on this?

@Htriedman thanks for sharing this! Can I get some details about the Wikidata component to this? In what ways is Wikidata being imagined as a data repository? What would this look like? Does it create constraints on the format of the data? Why not have "Data hosting service" in the same way there is "Model hosting service"?

Why not have "Data hosting service" in the same way there is "Model hosting service"?

@Isaac I think this is the long-term goal with stuff like feature store, data catalogs, etc., but we are still a ways off from that. I think Wikidata might be really helpful in storing metadata for both models and their underlying datasets, but maybe not the actual data itself.

Why can't the model's creator report those in the model card directly (i.e. manually)?

Good point @calbon -- this makes me wonder about 'required' fields in the model card. What happens if a model creator does not input everything? Is there a mandatory set of information a model card should have?
Entering stuff manually or importing from another tool or csv file etc. might be a nice feature to have eventually.

Put another way: are "make test set performance metrics for existing ORES models" and "create a programmatic way to create model cards and data cards" separate projects?

Re: ORES test datasets - I don't think we should let them be a blocker for the initial draft. Eventually we may just want to run a pipeline to generate all that stuff on the fly, so I don't think we should spin our wheels on it too much right now. I think staying modular in design makes a lot of sense and might help us avoid becoming too ORES-specific. Maybe we can do a first pass on writing a script to handle ORES data and put that in a separate repo on wmf gitlab? That way we can experiment with running it via PipelineLib etc.

I think this is the long-term goal with stuff like feature store, data catalogs, etc., but we are still a ways off from that. I think Wikidata might be really helpful in storing metadata for both models and their underlying datasets, but maybe not the actual data itself.

Ahhh okay, that makes more sense. I thought it meant the actual data would be stored there, which raised concerns about the constraints Wikidata would introduce around what is being modeled. But I could certainly see some simple metadata being stored effectively there, plus a link to e.g., a meta page with further details to allow more free-form descriptions. There are certainly some governance issues that would have to be worked out -- i.e. likely differences between how dataset metadata pages would be maintained and how most Wikidata items are maintained. It would be a big ask, but asking for our own namespace in Wikidata (in the same way Research is a separate namespace (202) on Meta: https://meta.wikimedia.org/wiki/Research) might help make it clearer that each page reasonably does have an owner who is both accountable for and the source of knowledge about that item. Of course, the community already handles a wide range of use-cases, so maybe this wouldn't be as messy as I imagine -- e.g., scholarly papers, which have different norms than, say, an item associated with a Wikipedia article, or the "on focus list of Wikimedia project" property (P5008) that various WikiProjects use to track the content they're interested in and which, if I remember correctly, led to some intense discussions about norms between members of the Wikidata community and Wikipedians.

Many ORES models don't have test datasets, but I am worried about over-designing for ORES.

Another very reasonable case of a missing test dataset would be (unsupervised) embeddings that we release to the community as a building block for other models -- they likely lack a useful test set but should still be carefully evaluated.

Additionally, and maybe this is too early for this discussion, but I've long been interested in the question of what our "protected classes" are on Wikimedia projects (beyond language -- i.e. models should perform well across all projects). For example:

  • Content: gender and geography are two metrics that we can readily measure in a global and language-agnostic manner and tend to be a good canary for related biases that are much harder to measure -- e.g., race/ethnicity. Ideally this could be expanded, but I'd feel reasonably confident saying that any ML model that deals with Wikimedia content should have reports about how it performs w/r/t gender and geography, and that would catch most of the biggest issues with biased models. Possibly we could have a smaller, curated test set of articles for other important properties -- e.g., race -- that we also provide statistics for where possible. (A rough sketch of what such a disaggregated report could look like follows this list.)
  • Contributors: most platforms would talk about gender, race, etc. here too but we obviously do not collect that information. Some editors provide a proxy for gender (see more) but the availability of that data varies greatly by wiki and was not contributed for this purpose so I'm not sure it would actually be helpful to depend on it. Geography could still be a possibility -- i.e. country of editors based on their IP address -- but this would introduce a lot of complications as that data I believe is only retained for 90 days and is considered quite sensitive so obviously would have to be handled carefully to avoid data leaks. In past work, anons vs. newcomers vs. more-long-term editors have been separated into classes for evaluation and I think that's an important set to report on. Mobile vs. desktop might be another facet to pay attention to (as it can proxy for geography, experience, etc. too). I'm sure there are others that could be considered as well -- e.g., editor is on their home wiki? -- that would hopefully catch major issues.
  • Readers: this is the third group we might consider though historically we've done no personalization / profiling of readers and I expect that to remain so. So I think it's safe to not talk about how we might evaluate ML models that are directly impacting readers. The closest thing we have is the related pages extension, which recommends articles to readers but it's not personalized and the evaluation could reasonably focus on the content side of it -- i.e. does it bias towards e.g., articles about men or related to the United States.
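To make the content bullet a bit more concrete, here's a toy sketch of the kind of disaggregated report I'm imagining -- the dataframe columns (label, prediction, subject_gender, subject_region) are hypothetical placeholders, not anything that exists today:

```python
# Toy sketch only: per-group performance plus the gap between the best- and
# worst-served groups. Column names are hypothetical placeholders.
import pandas as pd
from sklearn.metrics import f1_score


def report_by_group(df: pd.DataFrame, group_col: str):
    """F1 per subgroup and the max-min gap across subgroups."""
    per_group = df.groupby(group_col).apply(
        lambda g: f1_score(g["label"], g["prediction"])
    )
    return per_group, per_group.max() - per_group.min()


# e.g. report_by_group(test_df, "subject_gender")
#      report_by_group(test_df, "subject_region")
```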

End of the week update: I have officially been able to run a full model card through the pipeline! (yay) Example here (sorry for the sparsity of data, it's only running on ~50 revisions to keep testing relatively quick).

This model card paradigm does the following:

  1. Reads info about the model/data locations, developer descriptions of the model/intended uses/intended users/etc. from a config file
  2. Pulls test data from a remote data source (for the moment it's just a raw json file on gitlab)
  3. Sends test data to the ORES api endpoint for scoring
  4. Receives predictions back from the endpoint
  5. Runs a scoring suite on the whole dataset
  6. Breaks the dataset up into disjoint subsets that satisfy certain conditions — anons vs. non-anons, newcomers (<1 year since account creation) vs. veterans, mobile vs. desktop
  7. Runs a scoring suite on each disjoint dataset
  8. Converts the scores for each run through the suite into wikitext
  9. Writes that wikitext to meta using pywikibot
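To make that flow concrete, here's a minimal sketch of steps 1-7 -- the helper names, config schema, and test-data format are illustrative rather than the repo's actual code, and the wikitext/pywikibot steps (8-9) are left out:

```python
# Minimal sketch of steps 1-7 above. Helper names, the config schema, and the
# test-data format are illustrative only; the real implementation lives in the
# model-card branch of the algo-accountability repo.
import requests
import yaml
from sklearn.metrics import accuracy_score, precision_score, recall_score

ORES_URL = "https://ores.wikimedia.org/v3/scores/{context}/?models={model}&revids={revids}"


def load_config(path):
    """Step 1: model/data locations, developer descriptions, intended uses, etc."""
    with open(path) as f:
        return yaml.safe_load(f)


def fetch_test_data(url):
    """Step 2: pull labeled test revisions from a remote JSON file,
    e.g. [{"rev_id": 123, "label": true, "anon": false, ...}, ...]."""
    return requests.get(url).json()


def score_revisions(context, model, rev_ids):
    """Steps 3-4: send revisions to the ORES endpoint and collect predictions.
    (Response parsing is simplified; real code needs batching/error handling.)"""
    url = ORES_URL.format(context=context, model=model,
                          revids="|".join(str(r) for r in rev_ids))
    scores = requests.get(url).json()[context]["scores"]
    return {int(rev): s[model]["score"]["prediction"] for rev, s in scores.items()}


def scoring_suite(labels, preds):
    """Steps 5 and 7: one pass of the metric suite over a (sub)set of revisions."""
    return {
        "accuracy": accuracy_score(labels, preds),
        "precision": precision_score(labels, preds),
        "recall": recall_score(labels, preds),
    }


def disjoint_subsets(rows):
    """Step 6: split the test set by editor characteristics."""
    return {
        "anon": [r for r in rows if r["anon"]],
        "non_anon": [r for r in rows if not r["anon"]],
        # ... newcomers vs. veterans, mobile vs. desktop, etc.
    }

# Steps 8-9 (rendering the scores to wikitext and publishing with pywikibot)
# are omitted from this sketch.
```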

The code is in the model-card branch of the gitlab repository. As you might notice, there are lots of todos peppered throughout the code. Some items to tackle in the future:

  • intersectional analysis — how does the model perform on intersections of characteristics
  • some cleaning up and refactoring of code

Regardless, immediate next priorities are to make all of the relevant ORES datasets and get them up onto ores-data, to run the damaging model through this pipeline, and then to run the articlequality model through this pipeline.

ps: @Isaac I'll have a better response to your comment about "sensitive features" next week, but for the moment, suffice to say I think you're looking at/thinking about the right stuff.

Hi again @Isaac! Just wanted to re-respond to your post about protected classes with a strong measure of agreement. The only feature I might add to the content analysis is geography among anonymous IP editors. At the same time, I think whether we can measure a lot of these features depends in large part on the test set being big enough to avoid bad stats and data leaks (it really doesn't have to be very large; 200-400 evenly-distributed samples would likely do just fine).

Regardless, it does seem like labeling new datasets (of revisions, of articles, of anything else) will be a constant feature of this project. I know that we already have wikilabels as an in-house solution, but @calbon and I had a discussion on Thursday about deploying something like Label Studio to serve that purpose. Would love some other thoughts on labeling solutions!

Just wanted to re-respond to your post about protected classes with a strong measure of agreement.

Yay! The beauty of the model cards and your work to automate them is that hopefully it should be relatively easy to extend them to other classes if people propose them / have a good way of operationalizing them. Home-wiki is a great example of a relatively complicated aspect -- e.g., is it where the account was created? the majority of edits? the majority of non-article edits? etc. -- so it maybe doesn't make the first cut but could be added at a later date.

Regardless, it does seem like labeling new datasets (of revisions, of articles, of anything else) will be a constant feature of this project. I know that we already have wikilabels as an in-house solution, but @calbon and I had a discussion on Thursday about deploying something like Label Studio to serve that purpose. Would love some other thoughts on labeling solutions!

@Htriedman thanks for raising this question. Ongoing labeling is certainly going to be core to building a more sustainable/equitable ML platform. Brain dump of my personal thinking below (hopefully it makes sense and I understand that nothing is getting resolved this week, month, etc.):

From my perspective, there are three basic approaches:

  • Stand-alone applications (wikilabels, Label Studio etc.): collect labels on a curated sample of edits/articles/etc. in a stand-alone interface.
    • Pros: lots of control over labeling approach etc. This is potentially quite useful for generating golden -- i.e. highly trusted -- groundtruth datasets or focusing on specific aspects -- e.g., closing a gap in the existing groundtruth.
    • Cons: generally completely outside of existing wiki workflows and so requires concerted outreach / goodwill to gather labels. As a result, generally smaller samples -- e.g., hundreds of labels -- and hard to do in an ongoing way.
    • Summary: I think it's useful to have this, especially when prototyping models, but I don't think this is a good solution for ongoing labeling needs.
  • Passively tap into existing wiki processes: use "found" groundtruth labels on the wikis. For example, reverted edits + regexes over edit summaries as groundtruth for vandalism models, or extractors to gather article quality annotations from article talk pages (that's where WikiProjects add them via a semi-complicated template system). A rough sketch of the revert-regex idea follows this list.
    • Pros: You can potentially get a lot of data for very little upfront work.
    • Cons: The initial engineering work isn't bad -- e.g., extracting data from a single wiki -- but generally because this is ad-hoc, you have to put a lot of time into supporting new languages and making sure your old code still works. Extracting groundtruth might mean going through very large history dumps, which can be a slow/expensive process.
    • Summary: This maybe doesn't work for everything but most models we build should have existing wiki processes that they would be a part of. You lose some control over what content is evaluated (as compared to a stand-alone application) but the volume of labels should hopefully be high enough that you can sample from it as needed. The real challenge is that this doesn't scale very well across languages (as each community likely has its own ad-hoc way of doing things).
  • Actively tap into existing wiki processes: This is the idea behind e.g., WikiLoop. For a given model, tap into the existing wiki process that it's intended to support, but also provide tooling that helps the community and outputs your desired labels in a very structured manner.
    • Pros: You'll get way more labels, be able to gather them in an ongoing way, and build goodwill with communities because your work will directly support their processes.
    • Cons: There might be a fair bit of engineering and community outreach work upfront for each model.
    • Summary: this to me is the most sustainable approach for gathering ongoing labels. It requires more coordination (and possibly support from other engineering teams in Product) but very directly connects the ML models we build with the community processes they are supporting.
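To illustrate the passive approach flagged above, here's roughly the kind of revert-summary heuristic I mean -- the patterns are English-only examples and where the revision/summary pairs come from is hand-waved, which is exactly the cross-language scaling problem:

```python
# Rough sketch of "found" vandalism labels via revert-like edit summaries.
# Noisy by design: reverts happen for many non-vandalism reasons too.
import re

# English-only example patterns; every language community would need its own.
REVERT_PATTERNS = [
    re.compile(r"\brevert(ed|ing)?\b", re.IGNORECASE),
    re.compile(r"\brv[vt]?\b", re.IGNORECASE),
    re.compile(r"\bundid revision \d+\b", re.IGNORECASE),
]


def looks_like_revert(edit_summary):
    return any(p.search(edit_summary or "") for p in REVERT_PATTERNS)


def passive_labels(revisions):
    """revisions: iterable of dicts with 'rev_id' and 'next_edit_summary'
    (the summary of whatever edit immediately followed it -- obtained however,
    e.g. from a history dump). Returns noisy {rev_id: was_reverted} labels."""
    return {r["rev_id"]: looks_like_revert(r["next_edit_summary"]) for r in revisions}
```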

Article quality as a case study

The existing ORES articlequality model uses approaches #1 (stand-alone) and #2 (passive extraction) above. For e.g., Basque, it depends on Wikilabels-gathered labels (makefile). For e.g., English, it extracts quality labels from article talk pages where they are stored by editors (makefile). The labels that were collected via the stand-alone approach have never been updated as far as I know and generally number just a few hundred. The passive extraction is only used on the few wikis where it could be implemented, and there's no guarantee that it'll continue to function properly if you use new snapshots of the data.

In 2016, the Community Tech team created a Mediawiki extension called PageAssessments. This extension sought to support the existing editor workflows for evaluating article quality among other things (and in practice is an example of #3 above though I don't think that was the intention). The extension data is stored in a Mediawiki table that can be easily accessed / processed. Arabic, English, French, Hungarian, and Turkish Wikipedias use the extension. This means that it is now very easy to access up-to-date and extensive groundtruth on article quality from those wikis. I have used the data from this extension in part for evaluating my work on language-agnostic quality modeling. Any wiki that adopted the extension would also require relatively little work to build a model for -- it's just a very simple SQL query to extract the data. Example of data from Climate Change English WikiProject: https://quarry.wmcloud.org/query/52210
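For reference, a hedged sketch of the kind of query involved -- I'm writing the table/column names from memory, so double-check them against the extension's schema or the Quarry example above before relying on this:

```python
# Sketch of pulling quality/importance labels from the PageAssessments tables
# on a wiki replica. Connection setup (e.g. Toolforge replica credentials) is
# omitted; table/column names should be verified against the extension docs.
import pymysql

QUALITY_QUERY = """
SELECT page.page_title,
       pa.pa_class      AS quality_class,
       pa.pa_importance AS importance
FROM page_assessments pa
JOIN page_assessments_projects pap ON pa.pa_project_id = pap.pap_project_id
JOIN page ON page.page_id = pa.pa_page_id
WHERE pap.pap_project_title = %s
"""


def fetch_quality_labels(connection, wikiproject):
    """e.g. fetch_quality_labels(conn, "Climate change") for the Quarry example."""
    with connection.cursor() as cur:
        cur.execute(QUALITY_QUERY, (wikiproject,))
        return cur.fetchall()
```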

Going forward, I think by far the most sustainable process is to use PageAssessments. For any wiki that wants a quality model, require (and support) them in adopting the PageAssessments extension so that they can produce initial (and ongoing) data for a model. This is a reasonable barrier I think and ensures both that the model can continuously be evaluated / improved and that there is an existing community workflow that it'll support.

What would this mean for other models?

I'd argue that many of the models we would want to build have existing community processes that they'd support and a ready supply of groundtruth that can easily be tapped into. You can see some more of my thinking around this here: meta:Content Tagging. Existing process entrypoints:

  • WikiProjects: article-level groundtruth for quality, importance, and topic is already supported via the PageAssessments extension. We just need to encourage more wikis to use it.
  • Wikidata: article-level groundtruth for gender, occupation, and geography. Also contains featured-article status for many wikis. This is already structured and (relatively) easy to extract, so no additional software is really needed! (See the sketch after this list for one way to pull it.)
  • Templates: article/section/sentence-level groundtruth for content reliability (NPOV violations, citation needed tags, improvements needed tags). This is more painful to extract and would likely benefit from outreach / engineering to make these templates more useable and cross-wiki. Growth and other Product teams should be interested in this too because they depend on these templates for identifying structured tasks for new editors.
  • Edit-level groundtruth is a bit messier I think -- e.g., for vandalism detection models. It's very tempting to choose the passive approach and just use reverts as training data for vandalism, but that gets messy (and potentially problematic) because edits get reverted for lots of reasons that have nothing to do with vandalism. Hence the good-faith and damaging approaches taken by Aaron in training the ORES models. Tools like WikiLoop DoubleCheck might be worth investing in -- i.e. tooling that allows editors to both patrol the wikis and contribute groundtruth data at the same time. It's possible that the ORES-powered RecentChanges filters could have feedback buttons incorporated too?
  • Recommendation models -- e.g., add-a-link, add-an-image, add-depicts-statements -- generally get their groundtruth from what is already existing on the wikis. So for these, using passive data collection for the initial model coupled with a feedback mechanism in any recommender system (to say that X is a bad recommendation) should handle them pretty well.
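As an illustration of how little tooling the Wikidata route needs, here's a sketch of pulling gender labels for English Wikipedia articles from the public SPARQL endpoint -- the query shape is illustrative and real use would need batching and rate-limit handling:

```python
# Sketch: article-level gender groundtruth from Wikidata (P21, "sex or gender")
# for enwiki articles via the public SPARQL endpoint. Illustrative only.
import requests

SPARQL_ENDPOINT = "https://query.wikidata.org/sparql"
QUERY = """
SELECT ?article ?genderLabel WHERE {
  ?article schema:about ?item ;
           schema:isPartOf <https://en.wikipedia.org/> .
  ?item wdt:P21 ?gender .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT 100
"""


def gender_labels():
    resp = requests.get(
        SPARQL_ENDPOINT,
        params={"query": QUERY, "format": "json"},
        headers={"User-Agent": "model-card-sketch/0.1 (example only)"},
    )
    rows = resp.json()["results"]["bindings"]
    return {r["article"]["value"]: r["genderLabel"]["value"] for r in rows}
```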

Hi all! Just wanted to post a quick update on some of these ORES transparency efforts — I have (mostly) compiled a repository of datasets (~10GB), model binaries (~0.5 GB), model architectures, model training performance, etc. that are used in ORES. You can check it out at the ores-data repo on Gitlab. There were some holes where datasets/models didn't compile or were otherwise corrupted somehow, but I did my best to document what didn't work for whatever reason.

Training datasets, models, and model info are all organized by project (e.g. enwiki, wikidatawiki, zhwiki...). Testing datasets are mostly nonexistent right now, but can be added under a similar schema as time goes on. In an ideal world, interested volunteer/staff developers might download the datasets, models, and model info for a specific project; develop a new model using a library supported by training wing; and then end up with a containerized model in production on liftwing.

Overall, I would say we're about 85% of the way to full coverage (of datasets and models) for the ORES service, which is more than enough to start with data sheets, initial model cards (for models lacking a test dataset), and lowering barriers to access for model development with WMF data.

Let me know what you think!

edit: fixed a typo

I have some more updates after working on the algo-accountability repo for another week:

  • Spent a lot of time refactoring/streamlining code within the model_card.py file to make it cleaner and easier to read through
  • Cleaned up the wikitext that we're posting to the model cards and made it prettier
    • Better naming for test data table
    • Added detailed model architecture, training performance, and score schema from the revscoring model_info cards that are created when the models are trained
    • Refactored the model card template into three main sections: Qualitative Analysis, Quantitative Analysis, and Model Information
    • Created some templates on metawiki to show both that these cards are automated and to clearly indicate their tier (more below)
  • Created a tier system (tiers 1, 2, and 3) to indicate the quality of the model cards and the total amount of information contained in the card
    • Tier 1: testing stats, training stats, model architecture, explanation of the model rationale, owners, creators, provenance, etc.
    • Tier 2: training stats, model architecture, explanation of the model rationale, owners, creators, provenance, etc.
    • Tier 3: explanation of the model rationale, owners, creators, provenance, etc.
  • I also ran almost all of the enwiki models through the pipeline!

I think the next step is maybe to write a little toolforge tool that can collect all of the fields you need to make a model card at different tiers and then create a yaml file? Would love to hear your thoughts
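For discussion, here's a hypothetical shape such a yaml file might take -- the field names are made up rather than the repo's current config schema, and the tier could simply be derived from which optional sections the submitter fills in:

```python
# Hypothetical model-card config; field names are illustrative placeholders.
import yaml

card = {
    "model": {
        "name": "enwiki-damaging",
        "owners": ["..."],
        "rationale": "Why the model exists and who it is intended for.",
    },
    "provenance": {"code": "...", "training_data": "..."},
    # Tier 3 cards stop here.
    "architecture": None,     # filling these in bumps the card to tier 2
    "training_stats": None,
    "test_data": None,        # and these (plus the above) make it tier 1
    "test_stats": None,
}


def tier(card):
    if card.get("test_stats"):
        return 1
    if card.get("training_stats"):
        return 2
    return 3


with open("model_card.yaml", "w") as f:
    yaml.safe_dump({**card, "tier": tier(card)}, f, sort_keys=False)
```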

It's a long paper so I haven't gotten into it yet, but it might be useful when thinking about categories of harm etc.: Ethical and social risks of harm from Language Models (DeepMind)