
Request custom instance for recommendation-api labs project
Closed, Invalid · Public


Project Name: recommendation-api
Type of quota increase requested: Custom Instance with extra disk space
Reason: need to host a database for surfacing experimental recommendations (T162912)

Current instance: experimental.recommendation-api.eqiad.wmflabs

There is currently a table with schema wikidata_id varchar plus one decimal column per wiki (<aawiki..zuwiki>), with an index built on every <aawiki..zuwiki> column.
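The wide-table layout described above can be sketched as follows. This is an illustrative reconstruction, not the actual production schema: it uses SQLite in place of PostgreSQL, shows only three of the several hundred per-wiki columns, and all table, index, and item names are hypothetical.

```python
import sqlite3

# Three stand-ins for the full <aawiki..zuwiki> column range.
wikis = ["aawiki", "enwiki", "zuwiki"]

conn = sqlite3.connect(":memory:")

# One decimal prediction column per wiki, keyed by Wikidata item ID.
cols = ", ".join(f"{w} DECIMAL" for w in wikis)
conn.execute(f"CREATE TABLE recommendations (wikidata_id VARCHAR PRIMARY KEY, {cols})")

# An index on every wiki column, so each wiki's predictions can be
# sorted without a full scan -- this is what drives the 78GB index size.
for w in wikis:
    conn.execute(f"CREATE INDEX idx_{w} ON recommendations ({w})")

conn.execute("INSERT INTO recommendations VALUES ('Q42', 0.1, 0.9, 0.3)")

# Top recommendations for one wiki, served by that wiki's index.
rows = conn.execute(
    "SELECT wikidata_id, enwiki FROM recommendations ORDER BY enwiki DESC LIMIT 10"
).fetchall()
print(rows)  # [('Q42', 0.9)]
```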

The PostgreSQL database is too large to fit on the 160GB disk of the largest instance.

Data size (csv): 62GB
Table size: 47GB
Indexes: 78GB (approx 60% complete)

Requested specs:
VCPUs: 8
Disk: 512GB

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald Transcript · Jul 5 2017, 4:13 PM
bd808 added a subscriber: bd808. · Jul 11 2017, 5:51 PM

@schana 0.5TB of disk for a single VM is a very large storage request. It looks to be about 2x your expected need based on the data sizes you have described.

I'm also wondering whether you need Postgres specifically, or could use MariaDB instead. We have shared database servers that may be able to hold your dataset. We also have a shared Postgres instance on bare metal, though it is less widely promoted.

bd808 triaged this task as Medium priority. · Jul 11 2017, 5:51 PM
bd808 moved this task from Inbox to Discussion needed on the Cloud-VPS (Quota-requests) board.

Is this going to be needed "forever" or is this a short term (3-6 month) project?

@bd808 This is not a "forever" project; it's more to surface intermediary results while developing the recommendation algorithm. The plan is to eventually have the dataset be accessible through production infrastructure.

The size of the request could probably be trimmed, but I'm not confident the data will fit in 256GB. Additional space is also needed to hold the compressed dataset while loading it into Postgres.

I'm not sure what using MariaDB entails, but there is likely to be a fair amount of re-loading and re-indexing during development. There's nothing particularly special about the dataset that necessitates Postgres.

bd808 added a comment. · Jul 11 2017, 6:54 PM

If this is ultimately going to be offered as a shared data source, MariaDB is a much more likely backend than Postgres. We have bare metal MySQL/MariaDB databases in Cloud Services that can be used by Toolforge tools and other VPS projects. The "ToolsDB" server is meant for end-user data like this that is not a wiki replica. Reindexing frequency shouldn't be too much of a problem there, although the db server is a shared resource, so pounding it incredibly hard can cause issues.

@bd808 I created a database s53132__trex_p, but it seems that tables are limited to 64 indexes. Is there a way around this limitation besides exploding the data into its relational equivalent?

@schana That would be a question for the DBAs, I guess. My first reaction is to wonder whether you really need more than 64 distinct indexes on a table. That's an awful lot of indexing.

@bd808 Yes, it's a lot of indexing, but the table has a record for every Wikidata item and a column for every wiki, with the values being predictions that a particular item should exist in a given wiki. For performance reasons, sorting is needed on every column. I could restructure the data relationally, with a table per wiki and joins, but that complicates the ingestion that will happen fairly frequently during development of the algorithm.

64 secondary indexes per table is a MySQL InnoDB limit --
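The "relational equivalent" mentioned above could take several shapes; one minimal sketch is a single long-format table, which sidesteps the per-column index limit entirely. This is an assumption about how the restructuring might look, not the approach the task actually settled on, and SQLite stands in here for MySQL/MariaDB; names are hypothetical.

```python
import sqlite3

# Long-format alternative to the wide table: one row per (wiki, item)
# pair instead of one column per wiki.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE predictions (wiki VARCHAR, wikidata_id VARCHAR, score DECIMAL)"
)

# A single composite index replaces the hundreds of per-column indexes;
# it supports 'WHERE wiki = ? ORDER BY score' for every wiki, staying
# well under InnoDB's 64-secondary-index cap.
conn.execute("CREATE INDEX idx_wiki_score ON predictions (wiki, score)")

conn.executemany(
    "INSERT INTO predictions VALUES (?, ?, ?)",
    [("enwiki", "Q42", 0.9), ("enwiki", "Q1", 0.4), ("zuwiki", "Q42", 0.3)],
)

# Top recommendations for one wiki, served by the composite index.
top = conn.execute(
    "SELECT wikidata_id, score FROM predictions "
    "WHERE wiki = 'enwiki' ORDER BY score DESC LIMIT 10"
).fetchall()
print(top)  # [('Q42', 0.9), ('Q1', 0.4)]
```

The trade-off, as noted in the discussion, is that ingestion changes: each source row must be unpivoted into one row per wiki, and the table grows to (items × wikis) rows, though rows for absent predictions can simply be omitted.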

Restricted Application added a subscriber: PokestarFan. · View Herald Transcript · Jul 27 2017, 2:17 PM
schana moved this task from Backlog to Doing on the User-schana board. · Jul 27 2017, 2:18 PM
bd808 added a comment. · Sep 5 2017, 8:28 PM

@schana Are you still blocked by the InnoDB index limit? Do you want to re-examine the need for a custom instance to host your own DB server? Disk is the most heavily oversubscribed resource in the Cloud VPS environment, so figuring out how to reduce your total requested disk would be very helpful.

@bd808 I'm moving this to paused for now, as there aren't resources available to drive the task. In the future, if we still want to load the data into MySQL, it can be restructured to avoid the index limit.

bd808 changed the task status from Open to Stalled. · Sep 6 2017, 3:08 PM
bd808 closed this task as Invalid. · Nov 28 2017, 4:18 PM

I'm going to resolve this as invalid just to get it off of the workboard. Please do reopen or open another similar task if the project becomes active again and a custom db server is needed.