Supporting AI, LLM, and data models on WMCS
Open, LowPublic
Actions

Assigned To

None

Authored By

	• nskaggs
	May 17 2023, 9:08 PM

Description

Given the concerns around licensing of data models raised on T333856: Cloud VPS open exception request, let's discuss how best to support and enable work in this space in support of the Wikimedia movement on Wikimedia Cloud Service Offerings. I believe it is in the health and interest of the Wikimedia movement to support this emerging field, and to utilize our existing platforms and resources, namely WMCS, to do so.

However, some questions remain:

Licensing

Given the lack of clear licensing and/or incompatible with OSI licensing, what can users run on WMCS? Is any work being done in the data community to create OSI compatible models? Can something like The Pile, which is free with no other apparent license, be used on WMCS?
According to @Isaac, T333856#8805764, "a model has at least three separate pieces that are often treated independently: the final model artifact (a bunch of numbers essentially), the code used to train the model, and the data that was fed into the model". Interpreting this, what requirements are imposed by WMCS upon each piece of the model? Must each piece used also be OSI-compatible? If not, which pieces must be? Is this scenario similar to utilizing non-free hardware and software to create an image which is then openly licensed?
- For previous discussion on licenses in the spirit of, but not TOU compliant on WMCS; T152581: Expand the Toolforge definition of "free license" to include FSF-approved and DFSG-compatible licenses
- As noted, CC-by-SA isn't OSI approved, despite being very much aligned with the Wikimedia movement (Note that it's not intended to be a software license)
What requirements does WMCS impose on non-code objects stored in WMCS? Are models code? If not, then what requirements should we place on them?

Hardware Requirements

How much disk, RAM, CPU might be needed? Can we meet those needs with our existing hardware?
Are GPUs required? If so, how many? How would access be controlled?

Non-blocking questions

What projects exist that wish to explore these fields? What goals / outcomes do they have?
Are multiple independent projects needed, created on request by any party? Or could the collective work be consolidated into a few primary projects?

The goal of this ticket is to discuss and collect feedback on the listed questions. In addition to update wikitech, WMCS policies, etc as required in accordance with any decisions made.

Related Objects

Mentioned Here: T342614: Models for text summarization using LLMs
T152581: Expand the Toolforge definition of "free license" to include FSF-approved and DFSG-compatible licenses
T333856: Cloud VPS open exception request

Event Timeline

• nskaggs created this task.May 17 2023, 9:08 PM

Restricted Application added a subscriber: Aklapper. · View Herald TranscriptMay 17 2023, 9:08 PM

• nskaggs added a project: cloud-services-team.May 17 2023, 9:09 PM

• nskaggs moved this task from Inbox to Needs discussion on the cloud-services-team board.

Slst2020 subscribed.May 23 2023, 3:57 PM

aborrero subscribed.May 31 2023, 2:41 PM

dcaro subscribed.May 31 2023, 2:41 PM

fnegri removed a project: Cloud-Services.Jul 11 2023, 10:09 AM

Zache subscribed.Jul 23 2023, 1:02 AM

Sorry for the delay on commenting on this but thanks for putting together the task @nskaggs ! Adding @calbon too so he's aware and give thoughts on how to possibly balance the allowances of LiftWing and Cloud Services in this space.

Around licensing: there's some on-going discussions at OSI that I'm aware of around what "open" means for AI models that I look forward to seeing the outcome of that. We'll have to decide then whether their decision meets our needs / values but I don't want to guess at that too much just yet. cc @SSpalding-WMF so you're aware of this need for clarifying AI licensing too.

Around hardware requirements: one outcome from the Hackathon and discussions around AI on cloud services is that the main value that Cloud Services likely can play in this arena is making it easy to prototype new models. Training models is pretty computationally-intense and hard to imagine being well-supported. Even hosting some of the mid-sized language models for purely inference/serving has gotten to a stage where you pretty much need GPUs for them to be reasonably performant. But Chris will have much more informed thoughts on what this could look like.

Around use-cases: we're working to gather some common use-cases for AI models so that we can provide them as services (rather than expecting each person who wants to use them to use an external API or figure out how to self-host). The first one is around text summarization (T342614) led by @MGerlach. Obviously not everything can be covered in this way but surfacing these use-cases and building general purpose APIs for them is a nice potential complement to providing more individual hardware.

Can something like The Pile, which is free with no other apparent license, be used on WMCS?

The Pile contains dataset "books3" which is collection of out-of-copyright books from Project gutemberg, but also a huge number of pirated ebooks. So least for books3 part it cannot be used.

Next question is that can we use any of the models which are trained using Pile / books3?

There's no pending discussion at the moment, so I'm moving this task out of "Needs discussion" column and back to the inbox. Feel free to leave a comment if you would like this to be prioritized.

fnegri triaged this task as Low priority.Mar 28 2024, 3:24 PM

aborrero added a project: User-aborrero.Apr 1 2024, 11:32 AM

aborrero moved this task from Backlog to Radar/observer on the User-aborrero board.

Supporting AI, LLM, and data models on WMCSOpen, LowPublicActions

Description

Related Objects

Event Timeline

Supporting AI, LLM, and data models on WMCS
Open, LowPublic
Actions