
Request custom instance for recommendation-api labs project
Closed, Invalid · Public


Project Name: recommendation-api
Type of quota increase requested: Custom Instance with extra disk space
Reason: need to host a database for surfacing experimental recommendations (T162912)

Current instance: experimental.recommendation-api.eqiad.wmflabs

There is currently a table with schema wikidata_id varchar plus one decimal column per wiki (<aawiki..zuwiki>), with an index built on every <aawiki..zuwiki> column.
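The wide-table layout described above can be sketched as follows. This is an illustrative reconstruction, not the actual production schema: it uses SQLite in place of PostgreSQL, shows only three of the several hundred per-wiki columns, and all table, index, and item names are hypothetical.

```python
import sqlite3

# Three stand-ins for the full <aawiki..zuwiki> column range.
wikis = ["aawiki", "enwiki", "zuwiki"]

conn = sqlite3.connect(":memory:")

# One decimal prediction column per wiki, keyed by Wikidata item ID.
cols = ", ".join(f"{w} DECIMAL" for w in wikis)
conn.execute(f"CREATE TABLE recommendations (wikidata_id VARCHAR PRIMARY KEY, {cols})")

# An index on every wiki column, so each wiki's predictions can be
# sorted without a full scan -- this is what drives the 78GB index size.
for w in wikis:
    conn.execute(f"CREATE INDEX idx_{w} ON recommendations ({w})")

conn.execute("INSERT INTO recommendations VALUES ('Q42', 0.1, 0.9, 0.3)")

# Top recommendations for one wiki, served by that wiki's index.
rows = conn.execute(
    "SELECT wikidata_id, enwiki FROM recommendations ORDER BY enwiki DESC LIMIT 10"
).fetchall()
print(rows)  # [('Q42', 0.9)]
```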

The PostgreSQL database is too large to fit on the 160GB disk of the largest instance.

Data size (csv): 62GB
Table size: 47GB
Indexes: 78GB (approx 60% complete)

Requested specs:
VCPUs: 8
Disk: 512GB

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald Transcript · Jul 5 2017, 4:13 PM
bd808 added a subscriber: bd808. · Jul 11 2017, 5:51 PM

@schana 0.5TB of disk for a single VM is a very large storage request. It looks to be about 2x your expected need based on the data sizes you have described.

I'm also wondering whether you need Postgres specifically, or could use MariaDB instead. We have shared database servers that may be able to hold your dataset. We also have a shared Postgres instance on bare metal, though it is less widely promoted.

bd808 triaged this task as Medium priority. · Jul 11 2017, 5:51 PM
bd808 moved this task from Inbox to Discussion needed on the Cloud-VPS (Quota-requests) board.

Is this going to be needed "forever" or is this a short term (3-6 month) project?

@bd808 This is not a "forever" project; it's more to surface intermediary results while developing the recommendation algorithm. The plan is to eventually have the dataset be accessible through production infrastructure.

The size of the request could probably be trimmed, but I'm not confident the data will fit in 256GB. Additional space is also needed to hold the compressed dataset while loading it into Postgres.

I'm not sure what using MariaDB entails, but there is likely to be a fair amount of re-loading and re-indexing during development. There's nothing particularly special about the dataset that necessitates Postgres.

bd808 added a comment. · Jul 11 2017, 6:54 PM

If this is ultimately going to be offered as a shared data source, MariaDB is a much more likely backend than Postgres. We have bare metal MySQL/MariaDB databases in Cloud Services that can be used by Toolforge tools and other VPS projects. The "ToolsDB" server is meant for end-user data like this that is not a wiki replica. Reindexing frequency shouldn't be too much of a problem there, although the db server is a shared resource, so pounding it incredibly hard can cause issues.

@bd808 I created a database s53132__trex_p, but it seems that tables are limited to 64 indexes. Is there a way around this limitation besides exploding the data into its relational equivalent?

@schana That would be a question for the DBAs, I guess. My first reaction is to wonder whether you really need more than 64 distinct indexes on a table. That's an awful lot of indexing.

@bd808 Yes, it's a lot of indexing, but the table has a record for every Wikidata item and a column for every wiki, with the values being predictions that a particular item should exist in a given wiki. For performance reasons, sorting is needed on every column. I could restructure the data relationally, with a table per wiki and joins, but that complicates the ingestion that will happen fairly frequently during development of the algorithm.

64 secondary indexes per table is a MySQL InnoDB limit --
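The "relational equivalent" mentioned above could take several shapes; one minimal sketch is a single long-format table, which sidesteps the per-column index limit entirely. This is an assumption about how the restructuring might look, not the approach the task actually settled on, and SQLite stands in here for MySQL/MariaDB; names are hypothetical.

```python
import sqlite3

# Long-format alternative to the wide table: one row per (wiki, item)
# pair instead of one column per wiki.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE predictions (wiki VARCHAR, wikidata_id VARCHAR, score DECIMAL)"
)

# A single composite index replaces the hundreds of per-column indexes;
# it supports 'WHERE wiki = ? ORDER BY score' for every wiki, staying
# well under InnoDB's 64-secondary-index cap.
conn.execute("CREATE INDEX idx_wiki_score ON predictions (wiki, score)")

conn.executemany(
    "INSERT INTO predictions VALUES (?, ?, ?)",
    [("enwiki", "Q42", 0.9), ("enwiki", "Q1", 0.4), ("zuwiki", "Q42", 0.3)],
)

# Top recommendations for one wiki, served by the composite index.
top = conn.execute(
    "SELECT wikidata_id, score FROM predictions "
    "WHERE wiki = 'enwiki' ORDER BY score DESC LIMIT 10"
).fetchall()
print(top)  # [('Q42', 0.9), ('Q1', 0.4)]
```

The trade-off, as noted in the discussion, is that ingestion changes: each source row must be unpivoted into one row per wiki, and the table grows to (items × wikis) rows, though rows for absent predictions can simply be omitted.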

Restricted Application added a subscriber: PokestarFan. · View Herald Transcript · Jul 27 2017, 2:17 PM
schana moved this task from Backlog to Doing on the User-schana board. · Jul 27 2017, 2:18 PM
bd808 added a comment. · Sep 5 2017, 8:28 PM

@schana Are you still blocked by the InnoDB index limit? Do you want to re-examine the need for a custom instance to host your own DB server? Disk is the most heavily oversubscribed resource in the Cloud VPS environment, so figuring out how to reduce your total requested disk would be very helpful.

@bd808 I'm moving this to paused for now, as there aren't resources available to drive the task. In the future, if we still want to load the data into MySQL, it can be restructured to avoid the index limit.

bd808 changed the task status from Open to Stalled. · Sep 6 2017, 3:08 PM
bd808 closed this task as Invalid. · Nov 28 2017, 4:18 PM

I'm going to resolve this as invalid just to get it off of the workboard. Please do reopen or open another similar task if the project becomes active again and a custom db server is needed.