Page MenuHomePhabricator

Create or improve a tool for monitoring or automating tasks for Wikimedia databases
Closed, ResolvedPublic

Description

NOTE: This project has received fewer contributions and is actively seeking new contributors.

Wikimedia uses over 200 MariaDB instances to store content and metadata for Wikipedia and other free knowledge projects. While standard open source tools for both monitoring and automation are used when possible, there are some tasks that require custom development. There are 2 options for a project to choose here:

  1. Proof of concept of a web-based dashboard that will be used for inventory and/or summary of the status of the MariaDB fleet. While there is already an existing tool, Tendril also used for DBTree, it has degraded over time and doesn't take advantage of (or integrate with) modern technologies available at Wikimedia such as prometheus+grafana metrics monitoring, performance_schema, pt-heartbeat, or our database backup system. There is also the need of a d3-based (javascript) replacement for https://dbtree.wikimedia.org.

    Mockups for inspiration:
  1. Small-scope command line utilities to perform maintenance and monitoring tasks, such as: data checks between servers, data transfers, consistency checks between data and metadata, strange MariaDB process monitoring, and more.

Integration of existing libraries and tools should be preferred when possible, rather than custom solutions, as well as focusing on future extensibility and maintainability rather than a large scope.

  • Skills required: Python (preferred) or PHP, SQL

There are a few tasks related to the above project with the details, they are listed below as subtasks. These are examples of task within the scope, others are possible too, but only 1 or 2 should be chosen to work on (not all).

Related Objects

StatusSubtypeAssignedTask
Resolvedjcrespo
DeclinedL0st3xpl0r3r
DeclinedGuozr.im
ResolvedPrivacybatm
ResolvedPrivacybatm
ResolvedPrivacybatm
ResolvedPrivacybatm
ResolvedBUG REPORTPrivacybatm
ResolvedPrivacybatm
ResolvedPrivacybatm
ResolvedPrivacybatm
ResolvedPrivacybatm
Resolvedjcrespo
ResolvedPrivacybatm
ResolvedPrivacybatm
ResolvedPrivacybatm
ResolvedPrivacybatm
ResolvedPrivacybatm
ResolvedPrivacybatm
ResolvedPrivacybatm
ResolvedPrivacybatm
DeclinedZekai99

Event Timeline

jcrespo created this task.Feb 28 2020, 10:56 AM
Restricted Application added a subscriber: Aklapper. · View Herald TranscriptFeb 28 2020, 10:56 AM
jcrespo updated the task description. (Show Details)Feb 28 2020, 11:42 AM

Hello @jcrespo @Marostegui, This is John Guo from the University of Texas at Dallas. I am pursuing my M.S. degree. I am familiar with Python and SQL. And here's my GitHub link.

I think project option 1 quite interesting but also a little tricky, can you give me some instructions on how to start?

Aklapper updated the task description. (Show Details)Mar 6 2020, 10:43 AM

Thanks for getting in touch, @Guozr.im I've sent you an email through the contact details I saw on your GitHub page, trying to answer your question. I hope to hear back from you!

Hi @jcrespo @Marostegui, I'm Karan Lohaan. I'm currently pursuing my Bachelor's degree from Amrita School of Engineering in India. I am familiar with both Python and SQL. The first option seems like a better fix, though i would love to get more insight related to the project and in getting started with the subtasks.

This is my github link. And my email-id is karanlohaan0@gmail.com

i would love to get more insight related to the project and in getting started with the subtasks.

Hi and welcome @Popsauce! For general information, see https://www.mediawiki.org/wiki/Google_Summer_of_Code/2020 .
If you have specific questions, please ask them! :)

Welcome, @Popsauce, I had sent you an email saying, basically, that the subtasks are examples of problems we need to solve with option 1 or option 2, but that one should not try to solve all, they are just there to give more information as to why Tendril (code) had gotten obsolete, and that we would like a new website to replace some of its functionality. Tendril's old code may be useful to get an idea of what was it like.

Query monitoring and metrics would be out of scope because that has already "solved" with performance_schema (T195578) and Prometheus + Grafana (so no need to implement that old tendril's functionality), but we still need to list the servers (both inventory and some basic monitoring parameters, probably using some python web framework), as well as a graphical representation of its topology (probably using some javascript library based on d3).

Scope should be as small as possible, and we would prefer if student would focus on future maintenance and extensibility (simplicity), tests, good coding and security practices, so this can be deployed to production with no worries.

Out of scope, but maybe useful to understand the future usage of the web would be:

  • Subsection with performance_schema metrics
  • Subsection with grafana dashboard metrics
  • Backup/provisioning status and progress
  • Misc Reports (table and schema sizes, growth, etc.)

We would prefer solid proposals (e.g. clean code, very user friendly, sane design) with a small scope (just solving 1 problem) than those too ambitious to finish properly.

@jcrespo I am an undergrad student in my final semester from the Vellore Institute of Technology, India. I have previous work experience in making such db dashboards and I'm pretty much doing the exact same thing where I am interning right now. In fact, I just setup grafana dashboards for the same use case a few days ago but you guys seem to have already done it! It would be great to work on this in summer. Could you please send me further guidelines on how I can get involved?

GitHub
Email: parundekaramey@gmail.com

Thank you! :)

Welcome, @L0st3xpl0r3r, I have sent you an email with basically the same as already expressed at T246435#5953147.

For others: nobody so far has expressed interest on database automation (2nd options). If someone wants to stand out from the rest of potential proposals, consider thinking about it-

  • Having a way to run compare.py in an automated way over all the server's tables, given a list of servers, fix it to make it work for string-based or composite PRIMARY keys; Or making it work for MySQL 8
  • Getting a script, improving or inspired by check_mariadb.py (but better structured, even if it had to be done from scratch) to output a bunch of properties (ip, hostname, host boot time, available memory, mariadb version, mariadb start time, etc.) and status variables (is it up?, read only, QPS/queries run since startup, replication status, connection and query latency, buffer pool size, current connections, write load, read load, ...), from a mariadb instance (e.g. the parameters seen on the dashboard mockup) in json format

would catch my interest right now.

Hello @jcrespo @Marostegui, This is Nayan from the Indian Institute of Technology Kanpur. I am pursuing my Master's degree. I am familiar with Python and SQL. And here's my GitHub link.
https://github.com/Nayan-Das
I want to work on the 2nd option.
Small-scope command line utilities to perform maintenance and monitoring tasks, such as: data checks between servers, data transfers, consistency checks between data and metadata, strange MariaDB process monitoring

Could you please guide me to getting started with some task.

Hi, Nayan88das, welcome.

Relevant tickets for the second proposal are:

The way to go is to read the descriptions there and really understand the scope of work (it should be only one of those -or a single new idea of the same style-, not all of them) by cloning:

https://github.com/wikimedia/operations-software-wmfmariadbpy

And make the relevant existing tool (if any) work on your environment/study how it works.
For example, to work on T156462, one would work to study transfer.py and its dependencies and see what is missing compared to the text of the ticket: Documentation, ease of usage, tests, P2P transfers, etc.

Another idea would be to integrate switchover.py (on the same repo) with dbctl so hosts are pooled and depooled automatically on topology changes, and perform the mediawiki master switch also automatically.

I cannot tell you what you proposal should say exactly, but you should be familiar with some of those utilities (understand what they do, maybe try to use them locally on your environement), and come up with ideas for improving, based on the needs expressed on these ticket (and your own interests/abilities). What I can do is, if you share some of those ideas, guide you if they are adequate on scope (time to implement), adequate to your skills, and useful for us. And of course, answer here specific questions about them "how to use X?" "What would be needed to do Y?" "I cannot make Z work, I get this error, can you help?" "Can you tell me more about Wikipedia databases so I can set up a similar env on my laptop?"

The general idea is that we DBAs need to do a lot of tasks, and we need automation to save time. You can get more context about our work at https://wikitech.wikimedia.org/wiki/MariaDB and https://www.mediawiki.org/wiki/File:MySQL_at_Wikipedia.pdf

It is true that the 2nd option was a bit more open-ended that the 1st one, but I wanted to give also the opportunity of creative people to propose fixes for the subtasks for something that wasn't 100% explicit how to do.

I won't be available during the weekend, if you have further questions, write them here and maybe someone else will be able to answer them, or I will do what I can after Monday.

Hi @jcrespo! I am interested in this project for GSoC 2020.

Thanks, @AdityaJ for your interest.

Which one of the options would interest you better (the web dashboard or the automation tools)?

Do you have the suggested skills on the ticket? Do you have ideas on how you would approach it? Any questions I could help with?

srishakatux updated the task description. (Show Details)Mar 18 2020, 11:46 PM

Hi @jcrespo , I believe having a web based dashboard for this tool would be much easier to use. We can still have background jobs which automatically checks if a server is up and raise alerts if necessary, run compare.py in an automated way. My approach would go like this:

  1. Building APIs which collects various statistics.
  2. Building web based dashboard on top of these APIs.
  3. Building necessary background jobs which uses these APIs.

Also, do we have a list of necessary features required by this tool or it is up to the student to come up with?

I have suggested skills to complete this project. I was a GSoC'2019 student for Wikimedia. My project was based on Python, Django, MySQL and JavaScript.

Privacybatm added a subscriber: Privacybatm.EditedMar 20 2020, 6:09 AM

Hi @jcrespo, I have interest in the second idea for GSoC 2020.
I have gone through the ticket T156462 and transfer.py, In transfer.py I think we can improve copy_dir function using multithreading and also the main transfer function in case of multicasting (as a starter). I am now interested to know what we can do about the following points:

  • Be as fast as the network bandwidth allows it (Doubt: Will this be taken care by TCP itself?), with configurable throttling (Searching for ways to do this).
  • Be easy to use (take automatic decisions when safe) (Can we discuss how to go about it?)

I have good knowledge in Python, PostgreSQL, Linux (certified for Linux foundation system administrator: 2016-2018)... etc.

My Github account: https://github.com/ajupazhamayil/

(mail-id : ajupazhamayil@gmail.com)

@AdityaJ Thank you for you interest.

For the #1 option of the 2 I mentioned above, only building a website and a read-only API to expose the data would be on scope. Scraping data and how that is inserted on a cache/database/other local storage (your points 1 and 3) would be out of scope. You can see, however, some ways in which we have started to work on designing those, see for example https://phabricator.wikimedia.org/diffusion/OSMD/browse/master/wmfmariadbpy/check_health.py and https://phabricator.wikimedia.org/diffusion/OSMD/browse/master/wmfmariadbpy/replication_tree.py to get inspired on the kind of things we are interested to visualize, and of course you could provide additional input, but for option 1 we are just interested on visualization work (dashboard "table" and/or tree view of the instances). Here is an example input of these command line tools:

Scrapping data is a much more difficult problem that requires direct production database access, and that is not suitable for a GSoC project.

Also, do we have a list of necessary features required by this tool or it is up to the student to come up with?

Regarding requirements, I have only given a general idea of the project and some screenshots for inspiration. You can also download the existing PHP tool we want to replace, available at https://phabricator.wikimedia.org/diffusion/OSTD/ and play with it so you can get a sense of what is needed. Only 2 main differences: We would prefer python over PHP, and metrics graphs and query monitoring are out of scope because they have been solved in a different way.

So detailed requirements are part of the project, during community bonding and at the start of it ("analysis of requirements/kick off discussion"), but you should have publicly available all code to get yourself a list of existing needs. I don't want to specifically create a list here, because every student could do a different proposal and focus on a different part/method, so yes, in a way, the proposal for the details of a project to work on is the student's responsibility 0:-D

Of course, if you want to start working on a proposal and you paste a draft here on a phabricator task, I will be able to advice in terms of scope and amount of work needed. :-) I think some students may be confused because the project looks like "just a website" (and technically, it is), but because its high-performance, high throughput, real-time, backend and security needs it is not such a simple project. It needs to be extremely well documented, extensible, reliable and simple so it can be used for a long period of time here at Wikimedia. That is why we would prefer to have a well polished but small project that a very large one.

Hi, @Privacybatm, thanks for your interest!

I love that you got specific interest in a tool and did some reasearch in advance. transfer.py is a highly used tool here, but that was coded in a rush, so there is a lot of room for improvement!

Answering your specific questions:

Be as fast as the network bandwidth allows it (Doubt: Will this be taken care by TCP itself?)

While TCP plus our QoS configuration may take care of this, we have observed in the past that some encryption algorithms or protocols limited the bandwidth used. It would be nice to do some tests patches to see if different encryption methods or protocols other than a raw socket could be used, tuning its parameters. Right now we achieve this with a raw socket usage + precompression, which actually makes the bandwidth larger than the "tube size", around 5x times. There is still a lot of work on improving the current situation- and doing some testing for alternative transmission methods/benchmarking.

with configurable throttling (Searching for ways to do this).

One possible way is to use the 'pv' command line utility, which has a --rate-limit option. Other options would be possible, too.

Be easy to use (take automatic decisions when safe) (Can we discuss how to go about it?)

So things like which port to use, should be automatically detect which ports are available and just use the first one, rather than it being hardcoded (so the --port option is not needed).
The other thing to improve is cumin integration, I would like to transform the current method into an abstract factory driven by configuration, rather than it being hardcoded.
Other things to improve: failure detection- when network has issues and the sender fails, the receiver should timeout after some X minutes, so it doesn't stay open forever.
Progress monitoring (could be related to pv): it would be nice to have an ETA to finish the copy and some progress bar.
Finally, we need to make check summing happening at the same time than transmission (in parallel) so it doesn't increase the total execution time. df utility is used to check total size of the transfer, but that is unreliable- we need an alternative that doesn't take much time (in case we are transferring hundreds of thousand of files).

Other ideas never implemented:

  • Use bittorrent or multicast for several hosts (right now, if a file is copied to several hosts, it does that serially)

In general, improving and documenting transfer.py to make it more reliable, simple, and clean it up with better coding practices, as well as increasing its documentation would have quite a lot of work. And project regarding that would be welcome. All your suggestions look good, too. Feel free to keep asking for more feedback and/or start writing on a ticket a draft proposal where to go through.

Thank you!

@jcrespo Thank you for the information. I will do a research on all the points you said here and I will make a ticket with a draft proposal. Thank you :-)

Sahajsk removed a subscriber: Sahajsk.Mar 22 2020, 3:04 AM

Hi @jcrespo,

My name is Hassan Shahzad, I am a junior year student in NUST, Islamabad. Yes, I do have suggested skills to solve for tickets available. I am interested in the web-based dashboard. Sorry it might seem a basic question but being new to Google Summer of Code, I wanted to ask that, Are these sub tasks needed to be solved before working on this task or we have to work on it during coding of these tasks. And what are the sub tasks related to Idea 1? Can you send me details on my E-mail?

My GitHub account is: https://github.com/HassanShahzad7
My Email is: hshahzad.bscs17seecs@seecs.edu.pk

Hi, @Hassanshahzadkhan welcome and thanks for your interest on this.

You should hurry up as the submission date is getting closer and closer.

Are these sub tasks needed to be solved before working on this task or we have to work on it during coding of these tasks. And what are the sub tasks related to Idea 1?

The subtasks are related to the project in several ways:

  • Some may be actual projects to be directly working on, such as T204110 T96499 T141968
  • Some are just relevant to give more background to the proposed ideas: T234900
  • Others are not related to option 1, but to option 2

Note that the general idea is that we can propose ideas we think are good candidates for proposals, but it is up to the student to come up with a good project based on their skills and doing some research on your own.

My suggestion is to clone the current software, tendril, available at https://phabricator.wikimedia.org/diffusion/OSTD/ and research how it works (brief documentation also available at https://wikitech.wikimedia.org/wiki/Tendril ), then ask anything you don't understand. We will want something equivalent to that -inventory of databases-, except written with a Python web framework, or a replacement for the tree visualization in javascript (it is ok to reduce the scope to a minimum viable product). Metrics (graphs) and query monitoring are out of scope of the project, as they have been solved in a different way, using standardized solutions. You also have the screenshots I pasted on the proposal for inspiration.

Please note that this is not a simple project- as we will require simplicity, high standards of security procedures, careful design for future expansion, and a polished user experience; that is why we are ok with a smaller scope in terms of actual work to be done (inventory management, display of lists of servers/tree and its almost-real time metrics).

Gathering metrics (scraping content) is also out of scope.

Can you send me details on my E-mail?

We prefer to use Phabricator for communication- you should create a draft proposal as a new ticket ASAP following the template at https://www.mediawiki.org/wiki/Google_Summer_of_Code/Participants#Application_process_steps and let me know so I can provide feedback.

Let me know if that helps you understand the project better and if you have further questions.

To everyone that expressed interest here- please remember to send your proposals to Google too, in addition to Phabricator- as mentioned on step 10 of https://www.mediawiki.org/wiki/Google_Summer_of_Code/Participants#Application_process_steps Even if your proposal is not yet final, it can be drafted there too and modified before the final deadline.

Remember for the proposals as new tickets here (which is also compulsory to be taken into account), they must be marked with the tag "Google-Summer-of-Code (2020)", otherwise the organization won't be able to find them and rate them!

A reminder to everyone that Google's summer of code deadline is tomorrow, 31 at 18UTC-- make sure your Phabricator proposals AND google sent docs are up to date and final before deadline (edits won't be accepted after that time). Make sure you double-checked our recommendations at: https://www.mediawiki.org/wiki/Google_Summer_of_Code/Participants#Application_process_steps (and don't leave if for the last minute ;-))

Best wishes to everyone!

Kormat added a subscriber: Kormat.May 4 2020, 12:37 PM
jcrespo closed this task as Resolved.May 6 2020, 7:34 AM
jcrespo claimed this task.

This task was created to be a meta task for student's more concrete proposals. Now that all GSOC students have been selected, this is now "resolved", as it was just a general idea, and students will work on more concrete individual ticket proposals.