
[GSoC 2020 Proposal] Making a supervisor for monitoring the status of all the Wikimedia database instances
Closed, DeclinedPublic

Description

Profile Information

Name: Amey Parundekar
IRC nickname on Freenode: thelostexplorer
Web Profile: https://www.github.com/the-lost-explorer
Location: Pune, Maharashtra, India
Typical working hours: 11am to 6pm (Indian Standard Time)

Synopsis

Wikimedia uses over 200 MariaDB instances to store content and metadata for Wikipedia and other free knowledge projects. While standard open-source tools for both monitoring and automation are used when possible, some tasks require custom development. For such monitoring tasks, there is a need to replace the existing database supervisor "Tendril" with a faster one that takes advantage of (or integrates with) modern technologies available at Wikimedia, such as Prometheus+Grafana metrics monitoring, performance_schema, pt-heartbeat, or our database backup system.

This project aims at making a supervisor (we can call it a dashboard/monitor) for monitoring the status of all the Wikimedia database instances.

  • Possible Mentor(s): @jcrespo
  • Have you contacted your mentors already? Yes

Deliverables

This project will aim to deliver a web-based application to monitor the database status.

This contains:

  1. A table view of all the database instances.
    • The table can be adjusted to show information sorted by instance as well as by server.
    • It will also have a global search feature.
    • It will include the following columns:
      • Instance
      • Section
      • Location
      • Server
      • IP address
      • Port
      • Version
      • Uptime
      • Pool
      • QPS
      • Latency
      • Lag
      • I/O status
      • SQL
      • RO
  2. All the columns will be sortable (and optionally searchable) by clicking on the heading.
  3. Clicking on any row will show a modal (pop-up) with all the information available about that server/instance.
  4. It will use iconography and color gradients on rows where necessary, so that important information can be pinpointed easily in a dense table.
    • If a server/instance is down, the row turns maroon.
    • If a server/instance has no SQL support, the row turns yellow.
    • Note: this color coding is essential for good UX.
  5. Option to export the table information to comma-separated values (CSV).
  6. Option to change the update frequency of the table. (Can be set to a custom period.)
Optional features (to be implemented if time permits)
  1. Grafana-like metrics for a quick view of the database status.
  2. An intelligent feature that analyses database instance failures, making it easier to find the source of a failure.
  3. Integrating GraphQL for ease of maintenance.
  4. Dockerizing the application for ease of deployment.
  5. A tree view of all the database instances.
    • This will show a horizontal tree view of the instances, as opposed to the top-down approach used in dbtree.
    • Each node of the tree will show information about the database instance, viz.:
      • Database instance name(identifier)
      • Lag
      • QPS
      • Version
      • Binlog
      • Read/Write permissions
      • Latency
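
Two of the table deliverables above, column sorting and CSV export, can be sketched server-side in a few lines of Python. The row fields and values here are illustrative assumptions mirroring a few of the proposed columns, not the real schema:

```python
import csv
import io

# Hypothetical instance rows, mirroring a few of the proposed table columns.
# Field names and values are illustrative assumptions, not the real schema.
rows = [
    {"instance": "db1101", "section": "s1", "qps": 4200, "lag": 0.3, "up": True},
    {"instance": "db2085", "section": "s4", "qps": 150, "lag": 12.0, "up": False},
    {"instance": "db1120", "section": "s2", "qps": 980, "lag": 0.0, "up": True},
]

def sort_rows(rows, column, descending=False):
    """Server-side counterpart of clicking a column heading."""
    return sorted(rows, key=lambda r: r[column], reverse=descending)

def to_csv(rows):
    """The 'export to CSV' deliverable: serialize the current view."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(rows[0]))
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()

# Example: sort by replication lag, worst first.
by_lag = sort_rows(rows, "lag", descending=True)
```

In practice the sorting and filtering would likely run client-side for responsiveness, but the same shape applies.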

Participation

  1. In my opinion, since we are starting from scratch, we shall make a new GitHub repository (or one on Wikimedia's own Git hosting).
  2. I shall submit PRs to this repository.
  3. For sharing status, I will use Phabricator. Shared Google Sheets can also be used.
  4. I will share my experience on Medium on a timely basis.

About Me

  • Education (completed or in progress): Final year (4th) undergraduate student at Vellore Institute of Technology, India
  • How did I hear about this program? GSOC website
  • Will I have any other time commitments, such as school work, another job, planned vacation, etc, during the duration of the program? None. No other commitments during GSOC period.
  • We advise all candidates eligible for Google Summer of Code and Outreachy to apply for both programs. Are you planning to apply to both programs and, if so, with what organization(s)? I am only applying for the Google Summer of Code with Wikimedia organization.
  • What does making this project happen mean to me? Making applications is something I have enjoyed ever since I first set sail on the application-development boat. I always strive to make things minimal, and developing useful, minimal applications gives me immense pleasure. Since I understand and cherish the power of open-source development, building an application for Wikimedia would fill me with pride for having contributed to something that helps or supports technology used by millions.

Past Experience

I am listing only application-development-related experiences. (I have experience in the field of electronics as well, but it would be irrelevant here.)

All of the projects are available on my GitHub profile.

  1. Project SHELF: A website for selling books, made solely for universities. This was built as part of my internship with Hasura (https://www.hasura.io).
  2. Project munverse: A chat application made specifically for Model United Nations. It was used by over 1000 people at an MUN in my university.
  3. Navi: An augmented-reality navigation system built with the Google Maps API and Google ARCore.
  4. DWAC (Distributed Wireless Ad-hoc Computation Simulation): Simulation software for distributed wireless ad-hoc networks, built for my wireless communication course.
  5. Lily: A tkinter-based application to find beautiful natural patterns with geometry based on origami.
  6. iClaimNet: Generation of a large-scale dataset using natural language processing for analysis of fake news on social media. Also published a paper on it.
  7. simon: A game written in assembly language and C, made to run on 8051 microcontrollers.
  8. Research and Development Internship at PTC (Product Lifecycle Management).
  9. Full Stack Development Internship at GMetri Inc. (AR/VR) - ongoing, ends in April 2020.
  10. One of the founders and board members of an AI research forum at VIT (https://ai-vithink.github.io).
  11. Teaching Assistantship at VIT for microcontrollers.
  12. Certification from the Indian Institute of Technology, Madras in modern web application development.

Timeline

IMPORTANT: The timeline allocated for the tree shall be used solely for the table if the table needs more work even after the time allocated to it is over. Prior notice will be given.

May 4, 2020 - May 14, 2020 - Community bonding period

  1. Getting to know more about Wikimedia's databases and Tendril.
  2. Surveying sysadmins to learn exactly what they want their dashboards to look like (verbally, or more conveniently via a Google Sheet).
  3. Exploring and discussing possible technologies that can be used to build the software.
  4. Designing a final mock-up for the application and getting approval from my mentor and possibly a few sysadmins.

May 15, 2020 - May 22, 2020 - Playing with the database

  1. Understand the database.
  2. Design queries to get the relevant data from the existing database:
    • For the table.
    • For the tree.
  3. Optimize queries and remove any redundancy.
  4. Identify what data the queries will provide.
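
The query-design step above can be prototyped against a toy metadata store. The schema below (an `instances` table) is purely an assumption for illustration; the real schema would be confirmed during the community bonding period:

```python
import sqlite3

# A toy, in-memory stand-in for the instance metadata store; the table
# name and columns are assumptions, not Tendril's actual schema.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE instances (
    name TEXT, server TEXT, port INTEGER, section TEXT, version TEXT)""")
conn.executemany(
    "INSERT INTO instances VALUES (?, ?, ?, ?, ?)",
    [("db1101:3306", "db1101", 3306, "s1", "10.4.13"),
     ("db1101:3311", "db1101", 3311, "s6", "10.4.13"),
     ("db2085:3306", "db2085", 3306, "s4", "10.1.44")],
)

# One query per view: the table wants a flat, sortable row list...
table_rows = conn.execute(
    "SELECT name, section, server, port, version FROM instances ORDER BY name"
).fetchall()

# ...while the tree wants rows grouped by server so they can be nested.
per_server = conn.execute(
    "SELECT server, COUNT(*) FROM instances GROUP BY server ORDER BY server"
).fetchall()
```

Keeping one query per view makes the later "optimize and remove redundancy" step a matter of inspecting two well-defined statements.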

May 23, 2020 - June 7, 2020 - Writing APIs

  1. Understanding the APIs necessary for the application.
  2. Writing APIs for the table view (access and data).
  3. Writing APIs for the tree structure (access and data).
  4. Figuring out the optimum structure for the output responses of the above APIs:
    • Roughly, the tree's API should return JSON in a breadth-first-search-friendly structure (the final endpoint will be something like getTreeData).
    • Roughly, the table's API can return a simple JSON array of table rows (the final endpoint will be something like getTableData).
  5. Polishing the API responses and adding possible authentication measures.
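
As a sketch of what those two response shapes could look like, here are hypothetical payloads for getTableData and getTreeData (the field names are assumptions), plus the breadth-first walk a client renderer would do over the tree payload:

```python
import json
from collections import deque

# Hypothetical getTableData payload: a flat JSON array of rows.
table_response = [
    {"instance": "db1101:3306", "section": "s1", "lag": 0.0},
    {"instance": "db2085:3306", "section": "s4", "lag": 1.2},
]

# Hypothetical getTreeData payload: every node carries its children inline,
# so the client can render level by level with a simple queue (BFS-friendly).
tree_response = {
    "instance": "db1100", "lag": 0.0, "children": [
        {"instance": "db1101", "lag": 0.3, "children": []},
        {"instance": "db2085", "lag": 1.2, "children": []},
    ],
}

def bfs_order(root):
    """Walk the tree breadth-first, as a client renderer would."""
    order, queue = [], deque([root])
    while queue:
        node = queue.popleft()
        order.append(node["instance"])
        queue.extend(node["children"])
    return order

# Both payloads round-trip through JSON unchanged.
encoded = json.dumps(tree_response)
```

A nested-children layout keeps the endpoint stateless: the server never needs to know how deep the client will render.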

June 8, 2020 - June 21, 2020 - Developing the Table

  1. Making a basic table component suitable for the information that we will use. (Personal choice: React components, but it can be simple HTML+CSS as per requirements.)
  2. Making the table structure available at an endpoint.
  3. Binding the table to data using the API previously made.
  4. Making modifications to the APIs if necessary.
  5. Adding searching, sorting, and the other features listed in the "Deliverables" section for the table.

June 21, 2020 - July 7, 2020 - Developing the Tree

  1. Making a basic tree component suitable for the information that we will use. (Personal choice: React components, but it can be simple HTML+CSS as per requirements.)
  2. Making the tree structure available at an endpoint.
  3. Binding the tree to data using the API previously made.
  4. Adding the features listed in the "Deliverables" section for the tree.

July 7, 2020 - July 14, 2020 - Polishing

  1. I shall use this time to remove redundancies, make the UI more user-friendly where it is not already, and identify remaining flaws.

July 15, 2020 - July 19, 2020 - Testing

  1. I shall test the application in every possible way. That includes:
    • Performance testing - number of users, types of devices.
    • Integration testing - to make sure everything works well with the existing infrastructure.

July 20, 2020 - July 27, 2020 - Implementing optional/additional features

  1. I will use this time to implement the optional features as listed in “Deliverables”.

July 27, 2020 - August 7, 2020 - Documentation

  1. This time will be used to document everything related to the application:
    • API documentation
    • Usage documentation

UI Mock-ups

I shall use the mock-ups posted in T246435 as the mock-ups for the application. (See below.)

Attachments: zarcillo.png (271 KB), dbtree.png (246 KB)

Wikimedia Contributions

Patch sent to: T248661: Fix return code handling on transfer.py error - accepted and successfully deployed to production.

Event Timeline

@jcrespo please provide some feedback as soon as possible.

Thank you. :)

Hi, Amey,

Thanks for your proposal!

There is one thing that immediately caught my eye. I have been suggesting that every student prioritize between the JavaScript-focused "tree representation" and the Python-focused "instance view and management". While I am OK with having one or the other as a "stretch goal", I personally prefer a polished version of one of the two first, rather than running out of time for both. While it may look simple enough, making things secure, extensible, simple, and with a great UI will take some time. For example, showing a static table in HTML is trivial work; making it update in real time through JavaScript, sortable by any column, and filterable client-side will take more time. Please take these as examples; it doesn't necessarily have to be exactly like this.

The same goes for a sane database design (or memcache, or another store, if one thinks it is needed), documentation, tests, etc.

In other words, I'd prefer a solid foundation with a smaller scope over an attempt to implement lots of different features. You are free, of course, to prioritize which one you prefer to focus on first. Also, if you have previous experience doing this kind of work and are confident you can finish both options, I wouldn't be against it, but we should prepare for potential schedule changes.

On the positive side, I liked how you planned to take time to gather requirements from users before starting the coding work; most people want to jump right in. Thanks for taking that into account.

One last thing: the title is a copy of the original idea's title, and that doesn't look like a good one for your proposal, as there is no automation of tasks here. I suggest changing it to match the scope: the inventory/databases/web part.

QEDK renamed this task from Create or improve a tool for monitoring or automating tasks for Wikimedia databases to [Proposal] Create or improve a tool for monitoring or automating tasks for Wikimedia databases.Mar 26 2020, 9:02 PM
QEDK moved this task from Backlog to Accepted Proposals on the Google-Summer-of-Code (2020) board.

Thanks, you asked for a small task to work on. I just created T248661; while it doesn't have much to do with Tendril/web development, it is a (hopefully) simple Python 3 task that we DBAs need. transfer.py is a small script we use to transfer and back up data from database servers. Check the ticket, and if you think you could work on it, after you have read it and played with the code, assign it to yourself and start working on it.

A patch will be required for review, using the methods and style standards mentioned on https://mediawiki.org.

Do not hesitate to ask further questions if you need more context on how to start, or to ask for an alternative ticket if that doesn't interest you or you prefer something else entirely for any reason.

Good luck!

One reminder: you haven't changed your title as I suggested in T248590#6002594.

For the mock-ups, it is 100% OK to reuse mine, but your proposal should be self-contained in that regard. I suggest you upload the images here and simply attribute that they are not yours.

L0st3xpl0r3r renamed this task from [Proposal] Create or improve a tool for monitoring or automating tasks for Wikimedia databases to [Proposal] making a supervisor (we can call it a dashboard/monitor) for monitoring the status of all the Wikimedia database instances.Mar 27 2020, 10:27 AM
L0st3xpl0r3r renamed this task from [Proposal] making a supervisor (we can call it a dashboard/monitor) for monitoring the status of all the Wikimedia database instances to [Proposal] Making a supervisor for monitoring the status of all the Wikimedia database instances.

[Proposal] Making a supervisor for monitoring the status of all the Wikimedia database instances

Thanks, that looks good, much less confusing :-)

Now go and try to work on T248590#6004482.

Hey, @jcrespo, I will definitely go through the task T248661: Fix return code handling on transfer.py error. Thank you so much!

I have added the mockups here and changed the title.

Moreover, although I am confident that I can do both the table and the tree in the stipulated time, I agree that I will stick to the table when we start off. With that in mind, regarding my timeline: I shall devote most of my time to perfecting the table, and the time assigned for the creation of the tree (see timeline) shall be given to the table if, at the table's planned completion point, we feel it is not yet a finished product. I will add a note regarding this in my proposal while keeping the tree creation in there. Would that be fine? Or do you think I should change the timeline by removing the tree creation for now?

One note, though: it is not just a single table. If you have looked at Tendril already, there are several views that now exist on Tendril (per-host view, single-server views, replication chains, insert/update/delete server, blank pages for other sections on the menu that are not in scope, etc.), as well as the development of an API, database and other storage design, etc.

can do both-the table and the tree, [...] Would that be fine?

Fine by me; I just wanted to warn you of potential schedule changes and delays on your side. If you feel confident enough, go for it. The patch review process may change your mind. :-D

Transfer.py is documented at https://wikitech.wikimedia.org/wiki/Transfer.py, including its dependencies. Probably the most complicated dependency, as it is not a standard package, is Cumin. I don't think you need a full installation for a simple fix; you should be able to test it without a fully working installation, but here are the installation documents: https://doc.wikimedia.org/cumin/master/installation.html

Every package the SRE team works on has to work on Debian (stretch or buster). It should work on other systems, but we don't test elsewhere, and they are not the target right now. Everything you do, including your GSoC website proposal, has to work in such an environment.

Even if manual testing of changes is irreplaceable, CI will allow us to verify your change works as intended, but only if you create unit tests checking it. Currently there is no unit test verifying that an exception isn't thrown on a failing execution. You should try to write a test inside https://phabricator.wikimedia.org/diffusion/OSMD/browse/master/wmfmariadbpy/test/ (not sure if it would be a unit or an integration test; it depends on your testing strategy) that fails before and passes after the patch is applied.
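
A "fails before, passes after" test of that kind could look roughly like the sketch below. The `run_remote` function here is a hypothetical stand-in for the fixed behaviour; the real function lives in wmfmariadbpy's transfer.py and its name and signature may differ:

```python
import subprocess
import unittest

# Hypothetical stand-in for the patched code path: instead of silently
# swallowing a bad return code, raise an exception the caller can see.
def run_remote(cmd):
    """Run a command, raising on a non-zero return code."""
    result = subprocess.run(cmd, capture_output=True)
    if result.returncode != 0:
        raise RuntimeError(f"command failed with rc={result.returncode}")
    return result.stdout

class TestReturnCodeHandling(unittest.TestCase):
    def test_failure_raises(self):
        # On the old behaviour a failing execution went unnoticed, so this
        # test fails before the patch and passes after it.
        with self.assertRaises(RuntimeError):
            run_remote(["false"])

    def test_success_returns_output(self):
        self.assertEqual(run_remote(["echo", "ok"]).strip(), b"ok")
```

Using the `false` and `echo` binaries keeps the test self-contained on the Debian targets mentioned above, with no production access needed.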

L0st3xpl0r3r renamed this task from [Proposal] Making a supervisor for monitoring the status of all the Wikimedia database instances to [GSoC 2020 Proposal] Making a supervisor for monitoring the status of all the Wikimedia database instances.Mar 29 2020, 7:55 AM

Congrats on your first contribution! Feel free to remark on your proposal that it was accepted and successfully deployed to production!

Thank you for your approval and I am glad that I could help. Let me know if you have anything else for me to contribute to. :-)

I genuinely liked the review process and I see myself wanting to contribute more. Irrespective of GSoC, I want to be a part of this. Please guide me. :) @jcrespo

Also, I just uploaded the updated final PDF to GSoC website. :)

Thank you for your approval and I am glad that I could help. Let me know if you have anything else for me to contribute to. :-)

I genuinely liked the review process and I see myself wanting to contribute more. Irrespective of GSoC, I want to be a part of this. Please guide me. :) @jcrespo

If, and only if, you want some extra task in the meantime (completely optional), I would suggest getting familiar, on your own, with our database infrastructure. I don't know if I have passed these links to you before, but here are some presentations and documentation with overviews of our infrastructure, which may give you interesting context:

You have already contributed to some of these, and maybe you would like to know more about them!

Don't worry, there will be an unending number of tasks to contribute to (every month 2500 tasks are created here, and there are currently 40K+ open) :-). But some of them will need more familiarity with the environment before you can contribute to them.

Hey, @jcrespo thank you so much for this information. I will read through all of it and get acquainted with the database infrastructure. :)
Looking forward to contributing more. :)
I will check out the tasks from the parent task T138562: Improve regular production database backups handling (of the one I did) to see if any of them can be done after reading through the infrastructure information. :)

I see myself wanting to contribute more

I will check out the tasks from the parent task T138562 [...] (of the one I did) to see if any of them can be done after reading through the infrastructure information. :)

There is one thing that can be done independently of production access (many things are difficult without direct access), which is packaging transfer.py (and its dependencies) into a proper Debian package. If that is something you would feel comfortable with, or at least trying to learn, feel free to create a new ticket under T138562 to work on it. Here is an example of a small utility we packaged: https://gerrit.wikimedia.org/g/operations/debs/wmf-pt-kill

Another option is thinking about the design for your proposal: which technologies and components you would need, and high-level design choices (will it use MySQL for metrics? does it need memcache? how does the data flow work?). Don't go too deep into this, as we will have feedback about it, since we will have to maintain it ourselves.

Let me stress that these are things I am only suggesting based on your questions; they are completely optional and will have no impact on your application, as they are beyond the deadline.

Pardon my late reply. I will definitely do this. It's just that I'll have to follow the guide and make sure I am on the right track while doing so, since I've never created Debian packages before. I will ask you questions if required, but it will be a great learning experience. I will create the task when I am confident that I can do it. Maybe it's very easy and straightforward; I will read and find out.
So questions:

  1. I need to create a Debian package specifically for transfer.py and not wmfmariadbpy?
  2. Once I create the package, which repository does it go to? Where do I push it? What I think is: I have to package it and push it somewhere on Gerrit. So, how will we update it every time transfer.py is updated? Is my train of thought correct?

@jcrespo

I need to create a Debian package specifically for transfer.py and not wmfmariadbpy?

We may eventually want one for all the code, but for now I think one just for transfer.py (and its dependencies), with a dependency on our Cumin package, would be cleaner and simpler.
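
A minimal `debian/control` for such a package might look like the sketch below. The source/package names, maintainer address, and dependency spellings are assumptions for illustration; in practice they would follow the wmf-pt-kill example linked above:

```
Source: transfer-py
Section: database
Priority: optional
Maintainer: Hypothetical Maintainer <maintainer@example.org>
Build-Depends: debhelper-compat (= 12), dh-python, python3-all
Standards-Version: 4.3.0

Package: transfer-py
Architecture: all
Depends: ${misc:Depends}, ${python3:Depends}, cumin
Description: transfer and back up data between database servers
 Command-line tool used by Wikimedia DBAs to move database
 files and backups between hosts.
```

The key point is the `Depends: cumin` line, expressing the dependency on the existing Cumin package rather than bundling it.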

Once I create the package, which repository does it go to?

We should create a separate repository for the deb, under operations/deb/transfer.py or something like that. However, for now, please send patches to the same repo. Once we are happy with it, we can create a new repo.

So, how will we update that every time transfer.py is updated? Is my train of thought correct?

We should package only "stable versions", so a new package would be created when a bunch of patches have been applied. The gerrit patch should make that trivial.

I realize this may be a big task, so feel free to discard it. I still believe you may prefer to work on advancing your proposal ideas (e.g., creating diagrams of information flow, thinking about storage solutions, etc.).

@L0st3xpl0r3r We are sorry to say that we could not allocate a slot for you this time. Please do not consider the rejection an assessment of your proposal. We received over 100 quality applications, and we could only accept 14 students. We were not able to give a slot to every applicant who deserved one, and these were some very tough decisions to make. Please know that you are still a valued member of our community, and we by no means want to exclude you. Many students whom we did not accept in 2019 have become Wikimedia maintainers, contractors, and even GSoC students and mentors this year!

If you would like a debrief on why your proposal was not accepted, please let me know as a reply to this comment or in the 'Feedback on Proposals' topic of the Zulip stream #gsoc20-outreachy20. I will respond to you within a week or so. :)

Your ideas and contributions to our projects are still welcome! As a next step, you could consider finishing up any pending pull requests, or informing us that someone has to take them over. Here is the recommended place to get started as a newcomer: https://www.mediawiki.org/wiki/New_Developers.

If you would still be eligible for GSoC next year, we look forward to your participation!