Supporting AI, LLM, and data models on WMCS
Open, Low, Public

Description

Given the concerns around licensing of data models raised on T333856: Cloud VPS open exception request, let's discuss how best to support and enable work in this space in support of the Wikimedia movement on Wikimedia Cloud Service Offerings. I believe it is in the health and interest of the Wikimedia movement to support this emerging field, and to utilize our existing platforms and resources, namely WMCS, to do so.

However, some questions remain:

Licensing

  • Given that many models lack clear licensing or carry licenses incompatible with OSI-approved licensing, what can users run on WMCS? Is any work being done in the data community to create OSI-compatible models? Can something like The Pile, which is free with no other apparent license, be used on WMCS?
  • According to @Isaac, T333856#8805764, "a model has at least three separate pieces that are often treated independently: the final model artifact (a bunch of numbers essentially), the code used to train the model, and the data that was fed into the model". Interpreting this, what requirements are imposed by WMCS upon each piece of the model? Must each piece used also be OSI-compatible? If not, which pieces must be? Is this scenario similar to utilizing non-free hardware and software to create an image which is then openly licensed?
  • What requirements does WMCS impose on non-code objects stored in WMCS? Are models code? If not, then what requirements should we place on them?

Hardware Requirements

  • How much disk, RAM, CPU might be needed? Can we meet those needs with our existing hardware?
  • Are GPUs required? If so, how many? How would access be controlled?

Non-blocking questions

  • What projects exist that wish to explore these fields? What goals / outcomes do they have?
  • Are multiple independent projects needed, created on request by any party? Or could the collective work be consolidated into a few primary projects?

The goal of this ticket is to discuss and collect feedback on the listed questions, and then to update Wikitech, WMCS policies, etc. as required in accordance with any decisions made.

Event Timeline

Sorry for the delay in commenting on this, but thanks for putting together the task @nskaggs! Adding @calbon too so he's aware and can give thoughts on how to possibly balance the allowances of LiftWing and Cloud Services in this space.

Around licensing: there are some ongoing discussions at OSI that I'm aware of around what "open" means for AI models, and I look forward to seeing the outcome. We'll have to decide then whether their decision meets our needs / values, but I don't want to guess at that too much just yet. cc @SSpalding-WMF so you're aware of this need for clarifying AI licensing too.

Around hardware requirements: one outcome from the Hackathon and discussions around AI on cloud services is that the main role Cloud Services can likely play in this arena is making it easy to prototype new models. Training models is pretty computationally intense and hard to imagine being well supported. Even hosting some of the mid-sized language models purely for inference/serving has gotten to a stage where you pretty much need GPUs for them to be reasonably performant. But Chris will have much more informed thoughts on what this could look like.

Around use-cases: we're working to gather some common use-cases for AI models so that we can provide them as services (rather than expecting each person who wants to use them to use an external API or figure out how to self-host). The first one is around text summarization (T342614) led by @MGerlach. Obviously not everything can be covered in this way but surfacing these use-cases and building general purpose APIs for them is a nice potential complement to providing more individual hardware.

Can something like The Pile, which is free with no other apparent license, be used on WMCS?

The Pile contains the dataset "books3", which is a collection of out-of-copyright books from Project Gutenberg but also a huge number of pirated ebooks. So at least the books3 part cannot be used.

The next question is: can we use any of the models that were trained using the Pile / books3?

fnegri subscribed.

There's no pending discussion at the moment, so I'm moving this task out of the "Needs discussion" column and back to the inbox. Feel free to leave a comment if you would like this to be prioritized.