
Labsdbs for WMF tools and contributors: get more data, faster
Closed, ResolvedPublic

Assigned To
Authored By
jcrespo
Oct 31 2016, 8:54 PM

Description

Type of activity: Pre-scheduled session
Main topic: Building on Wikimedia services: APIs and Developer Resources

The problem

There is around 20 TB of data stored on Wikimedia's relational database infrastructure.

Access to that data is provided through tools such as Quarry and through replica access for tools. However, working with that amount of data is not necessarily easy or obvious: sometimes complex queries or tricks are needed in order to get more data, faster. Writing the right query can cut query time by 100x or 1000x.

The idea of this session is to:

  • Learn the MediaWiki database structure: where things are and why
  • Help queries run faster through careful planning and query optimization
  • Answer community questions about how to do some things better
  • Get suggestions on how to improve the labsdb service
  • Present the new labsdb infrastructure, with 5x more capacity
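
As a rough illustration of the kind of query optimization meant above, here is a minimal sketch. It is not taken from the session: it uses Python's built-in sqlite3 module as a stand-in for the MariaDB replicas, a toy `revision` table with only three columns, and a hypothetical index name `rev_timestamp_idx`. The point it tries to show is that wrapping an indexed column in a function usually forces a full scan, while an equivalent range condition lets the engine use the index.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for a replica (assumption:
# the real service runs MariaDB, whose EXPLAIN output looks different).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE revision ("
    " rev_id INTEGER PRIMARY KEY,"
    " rev_page INTEGER NOT NULL,"
    " rev_timestamp TEXT NOT NULL)"  # MediaWiki-style YYYYMMDDHHMMSS string
)
# Hypothetical index name, chosen for this example only.
conn.execute("CREATE INDEX rev_timestamp_idx ON revision (rev_timestamp)")

# Synthetic rows spread evenly over the years 2015, 2016 and 2017.
rows = [
    (i, i % 50, f"{2015 + i % 3}0101{i % 24:02d}0000")
    for i in range(1, 3001)
]
conn.executemany("INSERT INTO revision VALUES (?, ?, ?)", rows)

# Applying a function to the indexed column hides it from the index.
SLOW = (
    "SELECT COUNT(*) FROM revision "
    "WHERE substr(rev_timestamp, 1, 4) = '2016'"
)
# The same filter written as a range on the bare column can use the index.
FAST = (
    "SELECT COUNT(*) FROM revision "
    "WHERE rev_timestamp >= '20160101000000' "
    "AND rev_timestamp < '20170101000000'"
)


def plan(sql):
    """Return the human-readable query plan SQLite reports for a statement."""
    plan_rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return "; ".join(row[3] for row in plan_rows)


def count(sql):
    """Run a COUNT(*) query and return its single value."""
    return conn.execute(sql).fetchone()[0]


if __name__ == "__main__":
    print("slow plan:", plan(SLOW))
    print("fast plan:", plan(FAST))
    print("same result:", count(SLOW) == count(FAST))
```

On a table this small both queries return immediately; the difference only matters at the scale of the real tables, where it is worth checking the query plan before running anything expensive.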

Expected outcome

  • Community members (tool creators, developers, researchers, staff, etc.) understand the MediaWiki data model and its relational nature
  • Community members can use labsdb faster and get more out of it
  • Resources are used efficiently, allowing a larger number of users to utilize them
  • Feedback is given on how to improve the service
  • Feedback is provided on latest infrastructure improvements

Current status of the discussion

Needs feedback

Links

Event Timeline

Qgil subscribed.

Can you add related tags/projects, please?

I am sorry, @Qgil, are there Summit-specific tags or do you mean general phab categories/tags?

General projects/tags. The idea is that people who are watching those tags/projects are aware that this proposal has been submitted.

Soliciting SQL queries to fix/optimize on labs-l: https://lists.wikimedia.org/pipermail/labs-l/2016-December/004835.html

This session is my personal current pick for most important in the "APIs and Developer Resources" track because of the potential to reuse the recorded talk to help many Labs and Tool Labs developers.

... on the other hand this basically looks like a proposal for a presentation/training session? The description doesn't offer any points for discussion and this task has been consistently silent so far.

Video recording should not be the main reason to make this proposal the top pick of your main topic. Number of people interested at the event should be a stronger factor. There will be video recording available in other rooms as well.

In terms of space, what would be your preferred option?

  • The biggest room in theater configuration (up to 200 people, only chairs, no tables) and required video recording (meaning also that people have to wait for the mic to speak etc).
  • A big room in classroom configuration (up to 70-80 people, chairs and tables) and required video recording (meaning also that people have to wait for the mic to speak etc).
  • A big room in classroom configuration (up to 70-80 people, chairs and tables) and optional video recording (i.e. only recording the initial introduction but then relaxing things during the discussion, or no recording at all).
  • A smaller room, flexible configuration, optional video recording...

... on the other hand this basically looks like a proposal for a presentation/training session? The description doesn't offer any points for discussion and this task has been consistently silent so far.

From the abstract:

  • Answer community questions about how to do some things better
  • Get suggestions on how to improve the labsdb service

Does it have tens of questions already stacked up? No, it does not. Will it have some by the point that the conference starts? I really don't know. Ideally yes, but in the world of the wikis it seems that the only topics which draw discussion are subjects of high contention or at least subjects where there is one tireless detractor. I'd be pretty surprised to see someone challenge @jcrespo on query optimization and the architecture and advantages of the new labsdb cluster that is rolling out.

Potential question for discussion (or moving to somewhere more appropriate!) -

  • Where should volunteers go to get help, if they are having difficulties with Quarry, either due to
    • (a) not knowing where to even begin writing/adapting a query,
    • (b) their own inefficiently written queries (which just need minor tweaking), or
    • (c) due to unavoidable timeouts even with efficient queries (which would need to be run from a terminal)?

Context: I regularly see questions in various IRC channels, with people at various levels of SQL expertise, asking how to do something, or why something isn't working.
It would be best to direct these people somewhere onwiki, so that busy staffers aren't getting pinged whilst they're busy with something else, asleep, and so on. There currently doesn't seem to be an ideal place to point people towards, beyond suggesting they look through the examples given in these 2 pages:

https://quarry.wmflabs.org itself points to https://www.mediawiki.org/wiki/Talk:Quarry from "Discuss", and I see there are quite a few questions along these lines. Perhaps that page just needs to be watchlisted and regularly checked by a few more SQL experts? (And then linked to from the other relevant pages.) Then we can point people there from IRC with more confidence.

For unavoidably expensive queries, I think we need a [queue/process/thing], where non-experts can request that somebody else run a query for them, from terminal.
(I.e. IIUC, I could hypothetically follow the instructions at https://wikitech.wikimedia.org/wiki/Help:Tool_Labs#Database_access to re-run a timing-out Quarry query, but that's way outside my comfort-area, and I wouldn't know if I was about to cause problems by running a badly written or insane query. I like the safety-net of an auto-timeout! It gives me the comfort to experiment.)

To the owner of this session: Here is the link to the session guidelines page: https://www.mediawiki.org/wiki/Wikimedia_Developer_Summit/2017/Session_Guidelines. We encourage you to recruit Note-takers (2 minimum, 3 maximum), a Remote Moderator, and an Advocate (optional) on the spot before the beginning of your session. Instructions about each role player's task are outlined in the guidelines. The physical version of the role cards will be made available in all the session rooms. Good luck prepping, see you at the summit! :)

Note-taker(s) of this session: Follow the instructions here: https://www.mediawiki.org/wiki/Wikimedia_Developer_Summit/2017/Session_Guidelines#NOTE-TAKER.28S.29 After the session, DO NOT FORGET to copy the relevant notes and summary into a new wiki page following the template here: https://www.mediawiki.org/wiki/Wikimedia_Developer_Summit/2017/Your_Session and also link this from the All Session Notes page: https://www.mediawiki.org/wiki/Wikimedia_Developer_Summit/2017/All_Session_Notes. The EtherPad links are also now linked from the Schedule page (https://www.mediawiki.org/wiki/Wikimedia_Developer_Summit/2017/Schedule) for you!

I am going to resolve this as fixed, even though we have yet to finish the setup and announce the new labsdb servers (so this is just the beginning of the work). I think the scope of the talk is done. Thanks to all the note-takers and the other people who helped.

New updates will go to subtickets on T140788 and of course, on the labs list.