Wikimedia Technical Conference 2018 Session - Building our storage systems for flexibility and scale
Closed, Resolved; Public


Session Themes and Topics

  • Theme: Architecting our code for change and sustainability
  • Topic: ...

Session Leader


  • Birgit Mueller


We face many challenges around storage. Among them is a growing Revisions table, a problem exacerbated by the need to store more data-centric content such as JADE, as well as by the needs of Wikidata. What do we store in primary storage? How do we use Elastic to take some of the burden off our storage systems? Are there opportunities to integrate other storage systems and VCSs to allow more flexibility and scalability in our storage systems?

Questions to answer during this session

Each question is followed by its significance: why the question is important, and what is blocked by it remaining unanswered.

Question: What are the most pressing scalability issues with the current storage solutions?
Significance: We should enumerate the pressing scaling issues that we know of to inform any changes to our existing storage systems.

Question: What options do we have to address these issues? Changes to existing solutions? New storage types? What product decisions make this easier or harder?
Significance: We need to assess our current solutions and make sure we identify any gaps and trade-offs between them. These issues should be presented to product owners as trade-offs so we can make good decisions in our products and platforms that ease scalability when possible.

Question: How do we design storage to support curatable data/metadata in a scalable way? How do we handle the storage of granular data updates? Discuss the trade-offs of using git and other types of VCS for storing revisions (augmented by Elastic for queries). Specifically address use cases which are not page-based, i.e. metadata associated with revisions, users, diffs, etc.
Significance: We have many use cases which require storage of data and metadata, but we have been unable to find ways to store it in a scalable manner when that data requires curation. The size of the revision table has been a blocker for storing more types of (non-page) data that must be curatable (JADE). We are adding more data to the system that will increase this burden. We also now have MCR, which has the potential to add a lot of revisions and doesn’t solve for non-page-based metadata. Many types of VCS have matured over the past 15 years; we should take a fresh look at them to see if they solve our problems. Wikibase supports curation, but it is having scale issues and is also backed by Elastic and Blazegraph. Is this the architecture we want to follow? Figuring out a storage solution to support these use cases is important to implementing new types of curation features.

Question: Is there a way to change how we store content for projects, such as centralized storage, that would lessen the burden on other technology/use cases such as dependency tracking, event propagation and queries? Is sharding per project potentially “oversharding” and adding more overhead on these activities?
Significance: We know that propagating many events, efficient queries across projects, and dependency tracking across projects are key technical challenges we must solve. We should examine potential ways to architect storage as a way to ease challenges in those areas.

Question: How do developers choose between storage technologies? How do they choose to use Elastic for queries versus modeling them in primary storage?
Significance: We need good guidelines for developers picking the type of storage. We need good guidelines for which queries we should design primary storage to support and which ones should be offloaded to Elastic or Blazegraph (or similar).

Facilitator and Scribe notes

Facilitator reminders

Session Structure

  • Define session scope, clarify desired outcomes, present agenda
  • Discuss Focus Areas
    • Discuss and Adjust. Note that we are not trying to come to a final agreement, we are just prioritizing and assigning responsibilities!
    • For each proposition [add etherpad link here]
      • Decide whether there is (mostly) agreement or disagreement on the proposition(s).
      • Decide whether there is more need for discussion on the topic, and how urgent or important that is.
      • Identify any open questions that need answering from others, and from whom (product, ops, etc.)
      • Decide who will drive the further discussion/decision process, and by when (e.g. a four-month deadline)
  • Discuss additional strategy questions [add etherpad link here]. For each question:
    • Decide whether it is considered important.
    • Discuss who should answer it.
    • Decide who will follow up on it.
  • Wrap up

Session Leaders please:

  • Add more details to this task description.
  • Coordinate any pre-event discussions (here on Phab, IRC, email, hangout, etc).
  • Outline the plan for discussing this topic at the event.
  • Optionally, include what it will not try to solve.
  • Update this task with summaries of any pre-event discussions.
  • Include ways for people not attending to be involved in discussions before the event and afterwards.

Post-event Summary:

  • ...

Action items:

  • ...

Event Timeline

kchapman renamed this task from Wikimedia Technical Conference 2018 Session - What is the plan for addressing and meeting the needs for storage of new types of data? to Wikimedia Technical Conference 2018 Session - Building our storage systems for flexibility and scale. Oct 3 2018, 2:42 AM

Having discussed this with the leaders of T206076, we're going to make those two sessions part of a bigger whole focusing on storage systems, with the goal of starting to build an overview of what kinds of storage solutions work for which uses, what the requirements for these are, etc.

Hello! We are starting to ramp up on session creation for the 2019 Wikimedia Technical Conference. If there is no longer anything remaining to do here please close this task to avoid confusion.

@debt: No reply hence resolving. If there is work left in this task, feel free to either set the status of this report back to "Open" via the Add Action...Change Status dropdown and associate an active project tag to this task, or create separate followup tasks. Thanks.