[Spike] Store article quality data inside hadoop and make AQS outputs a public API
Closed, Declined · Public · Spike

Description

Per our talk with @Halfak, I think it would be great to have this data inside hadoop (given that the data is super big and hadoop, by design, should handle these cases) and, since we have AQS, to provide a public API so that external users such as researchers can use the data.

This task is done when: we know for sure if this is possible or not and proper phab cards are in place.

Event Timeline

Could we add a bit more info here? Data we will be storing, size, update rate, privacy considerations.

Nuria moved this task from Incoming to Backlog (Later) on the Analytics board.

See T146718. There are no privacy considerations. Right now, we have a dataset we want people to be able to query publicly (preferably by joining to production MediaWiki tables). It's not too big for MySQL, but hosting it there is not easy, so we're looking at hadoop as an alternative. However, analytics does not have a public hadoop. We should have a public hadoop and host this dataset there.

Is there a public hadoop task that we could make this task block on?

Our plans for next year include hosting the edit data lake on labs so everyone can access a version of edit data in analytics-friendly storage. That might be a fit if the dataset is meant to be queried via SQL, as our candidates for this storage right now are druid and clickhouse. And, since it looks like this data needs to be joined to mw data, this idea would work too, as the data lake will give you that ability. So, no to public hadoop but yes to public storage.

Serving data through AQS is an entirely different matter: the infrastructure of AQS is in production and thus not public, so the two things this ticket mentions are unrelated. Also, storage considerations are different for an API, and how things are set up will depend on the particulars of any dataset.

mforns subscribed.

From team grooming:
Closing this task because of lack of activity. @Ladsgroup, please reopen if needed.

Restricted Application changed the subtype of this task from "Task" to "Spike". · Aug 10 2020, 4:18 PM