Develop a web dashboard or a command line tool to help inventory and/or monitor database and backup objects
Closed, Resolved (Public)

Description

IMPORTANT: Make sure to read the GSoC participant instructions and communication guidelines thoroughly before commenting on this task. This space is for project-specific questions, so avoid asking questions about getting started, setting up Gerrit, etc. When in doubt, ask your question on Zulip first!

Wikimedia uses over 200 MariaDB instances to store content and metadata for Wikipedia and other free knowledge projects. It also uses Bacula, mydumper and xtrabackup to back up the hundreds of terabytes of data in its infrastructure. While standard open source tools for both monitoring and automation are used when possible, some tasks require custom development. There are several project options to choose from below; you should do your own research to decide which one (only one) is the right one to complete within the 175 hours of work:

MariaDB instances inventory (zarcillo web interface)

There is an existing database, "zarcillo", that inventories the existing MariaDB instances. However, edits and checks on it are currently done using SQL directly. The aim is to replace that with a web-based CRUD interface to inventory MariaDB instances and their basic properties (servers, sections, etc.), and also to develop a web API for easy querying from external services.

Mockup:

zarcillo.png (978×1 px, 271 KB)

Database object inventory

We would like to keep track of existing objects (databases, tables, columns, indexes) across the more than 1000 databases on our over 200 MariaDB instances, and then detect (and alert on) differences with the latest version of the schema deployed, in order to:

  • Facilitate and schedule schema changes
  • Detect backups errors (e.g. missing objects on backup)
  • Provision more instances/more disks due to growth

For that, we need to 1) scrape information to keep a metadata database up to date with reality, 2) provide reports of interest (number of objects, size), and 3) regularly check for inconsistencies between the same objects on different servers (e.g. a column has a different type on 2 different servers).
There are already some tentative initiatives working in this direction:

We would like to merge all those efforts into a canonical metadata solution to monitor db objects.
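
For illustration only, the consistency check in point 3 could be sketched as below. It assumes the column definitions have already been collected per server (for example from information_schema.columns); the function name, host names and sample data are hypothetical and not part of any existing tool.

```python
from collections import defaultdict


def find_column_drift(columns_by_server):
    """Return columns whose definition differs between servers.

    columns_by_server maps a server name to a dict of
    (schema, table, column) -> column type.
    """
    seen = defaultdict(dict)  # column key -> {server: type}
    for server, columns in columns_by_server.items():
        for key, col_type in columns.items():
            seen[key][server] = col_type

    drift = {}
    for key, per_server in seen.items():
        # A column that is missing on a server is recorded as None.
        full = {server: per_server.get(server) for server in columns_by_server}
        if len(set(full.values())) > 1:
            drift[key] = full
    return drift


# Hypothetical sample data; host names and types are made up.
sample = {
    "db-a": {("wiki", "page", "page_id"): "int(10) unsigned",
             ("wiki", "page", "page_title"): "varbinary(255)"},
    "db-b": {("wiki", "page", "page_id"): "bigint(20) unsigned",
             ("wiki", "page", "page_title"): "varbinary(255)"},
}
for key, types in sorted(find_column_drift(sample).items()):
    print(".".join(key), types)
```

Keeping the comparison separate from the scraping step means the same function could be fed from a metadata database instead of live servers.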

Database backup inventory improvements

Database backups write a structured log of their output to a (MariaDB) database, including timing, success, and backed up objects, so we can check that the dozens of backups produced every day finish correctly. While there is a basic check to ensure backups are fresh, we would like to expand the usefulness of this database by providing more fine-grained info:

  • Postprocessed information on the size of each backup object (filling in the backup_objects table)
  • A web dashboard for the metadata that easily reports the status of backups and their errors
  • A web dashboard that tracks the status of ongoing backups (next scheduled run, running, postprocessing, finishing ETA, etc.)

Improve WMF Bacula monitoring

While there is already a basic check_bacula.py script that makes sure backups are being taken correctly, exports the necessary metrics to Prometheus, and provides basic CLI information, we would like to expand the existing functionality.

Some of the ways in which the script could be complemented would be:

  • Provide additional metrics regarding retention time, media storage, available disk space, etc.
  • Provide additional information on the command line that makes it simpler for engineers to query backup status
  • Improve the Grafana bacula dashboard or create additional dashboards with those more detailed statistics
  • If Grafana is not the right technology for that, improve the visibility of bacula statistics by providing a nice web dashboard summarizing the status of bacula, that helps engineers understand backup status
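
For illustration only, additional metrics could be rendered in the Prometheus text exposition format roughly as below. The metric and label names are hypothetical (they are not the ones check_bacula.py exports), and how the values are obtained from Bacula is left out.

```python
def escape_label(value):
    """Escape a label value as the text exposition format requires."""
    return (str(value).replace("\\", "\\\\")
            .replace('"', '\\"')
            .replace("\n", "\\n"))


def format_metric(name, value, labels=None):
    """Render one sample line: name{label="value",...} value"""
    if not labels:
        return "{} {}".format(name, value)
    pairs = ",".join('{}="{}"'.format(key, escape_label(val))
                     for key, val in sorted(labels.items()))
    return "{}{{{}}} {}".format(name, pairs, value)


# Hypothetical samples; names and values are made up.
samples = [
    ("bacula_job_retention_days", {"job": "example-job"}, 90),
    ("bacula_pool_free_bytes", {"pool": "example-pool"}, 1.5e12),
]
print("\n".join(format_metric(name, value, labels)
                for name, labels, value in samples))
```

The sketch uses str.format() rather than f-strings so that it would also run on Python 3.5.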

MySQL account metadata inventory

Currently, most account management for MySQL happens through the Puppet configuration management tool. However, Puppet may be far from ideal for deploying certain changes that have to be done synchronously throughout the cluster, as well as for monitoring anomalies.

That is why we want to develop a tool that helps track and monitor account configuration among different distributed instances. Such a tool should be able to:

  • Connect and read the account information as well as permissions (grants) for each instance
  • Track what the account information should be for a given instance (including different roles, usages and peculiarities of different groups of servers)
  • Alert on anomalies found, such as accounts or grants missing from a MariaDB server, passwordless accounts, or grants too open for a given account.

This will require maintaining a database inventory of accounts and instances, and a way to emit alerts (web interface or through a provided monitoring tool api).
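
For illustration only, the anomaly detection described above could be sketched as a comparison between expected and actual account data for one instance. The data layout, function name and sample accounts are hypothetical; reading the real accounts and grants from a server is not shown.

```python
def audit_accounts(expected, actual):
    """Compare expected and actual accounts for one instance.

    expected: {account: set of grants}
    actual:   {account: {"grants": set of grants, "has_password": bool}}
    Returns a sorted list of (account, problem) tuples.
    """
    problems = []
    for account, grants in expected.items():
        if account not in actual:
            problems.append((account, "account missing"))
            continue
        found = actual[account]["grants"]
        for grant in sorted(grants - found):
            problems.append((account, "missing grant: " + grant))
        for grant in sorted(found - grants):
            problems.append((account, "unexpected grant: " + grant))
    for account, info in actual.items():
        if account not in expected:
            problems.append((account, "unexpected account"))
        if not info["has_password"]:
            problems.append((account, "passwordless account"))
    return sorted(problems)


# Hypothetical sample data.
expected = {
    "app@10.%": {"SELECT ON wiki.*"},
    "repl@10.%": {"REPLICATION SLAVE ON *.*"},
}
actual = {
    "app@10.%": {"grants": {"SELECT ON wiki.*", "ALL PRIVILEGES ON *.*"},
                 "has_password": True},
    "test@%": {"grants": set(), "has_password": False},
}
for account, problem in audit_accounts(expected, actual):
    print(account, "->", problem)
```

The returned list could then be turned into alerts by whichever monitoring interface is chosen.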

Suggested microtasks

NOTE: * Skills required: Python (preferred, probably with Django or Flask), PHP, basic databases and SQL/file management knowledge
NOTE: * Mentors: Jaime Crespo aka #jynus [@jcrespo], Manuel Arostegui [@Marostegui].

Event Timeline

@DharmrajRathod98: Welcome and thanks for your interest. Can you please elaborate which specific areas / questions are unclear and require more guidance? Please see https://www.mediawiki.org/wiki/New_Developers - thanks a lot! :)

Why do you think you need read privileges?

Hi @Aklapper,

on clicking at any database it redirects me to this message,

Screenshot 2021-03-01 at 2.19.20 PM.png (1×2 px, 906 KB)

I wanted to get read access to these pages, currently, I am able to see only a single page that is: https://dbtree.wikimedia.org/
Clicking on any link on this page redirects me to the message in the screenshot.

Yes, that is expected. Dbtree is open to the public, but if you check the links, they send you to tendril, which is only available under NDA, as it contains private information, and such access won't be provided to GSoC students.

I can describe to you what tendril has (and the source code should help you understand the details), but we deliberately didn't go over these details, because the aim of the proposed project is not to reimplement all of tendril's functionality, but to do something else. Currently, tendril has a few sections:

  • A list of hosts, similar to the mockup at:
    zarcillo.png (978×1 px, 271 KB)
  • A tree of hosts, similar to https://dbtree.wikimedia.org
  • Metrics for each db instance, similar to https://grafana.wikimedia.org/d/000000273/mysql
  • A list of ongoing query activity, listing currently running queries longer than 10s on all hosts
  • Reports for all hosts, such as past slow query activity, filtered by query, host, schema and user

Having said that, we do NOT want a clone of the existing tool. We now have better tools for many of these, such as Grafana+Prometheus for metrics, orchestrator for database topology, and performance_schema for query profiling.

In fact, a proposal to implement a full clone of tendril will likely be rejected as out of scope. We only asked for a dashboard to manage the zarcillo database provided, no more.

Hi @jcrespo,

I was not planning to clone tendril, but I was looking to improve a small patch of code in the tendril that's why I was asking how to run it locally.

I believed I was supposed to do so,

As a way to value positively that you will be able to take over the web development for that particular project, sending a very small patch to fix/improve something small on the current tendril codebase (note that is PHP, which may not be your expertise), or other small python web service we SRE maintain (e.g. https://github.com/wikimedia/debmonitor ), or improving existing technical documentation about them, will prove you are familiar with Wikimedia's development process.

But as I am not able to run tendril locally, I don't feel trying to fix something in the tendril will be a good idea without testing it locally. I will try to contribute something to the debmonitor tool, at least in the form of documentation.

As a way to value positively that you will be able to take over the web development for that particular project

I feel that for this, I should try to unofficially build something similar to this dashboard using the given schema as my personal project, this will give you and me a better idea of whether I am suited to this or not.

main focus at first (for the MariaDB instances inventory project you chose) should be on understanding the use cases to solve particular to the WMF, and why things such as tendril or orchestrator doesn't cover those.

For this, I have started self-learning PHP and I am trying to understand the use cases of the tendril.

prove you are familiar with Wikimedia's development process.

For this, I am planning to send some small patches to debmonitor.

Honestly, I feel that I was getting too much hooked on the Tendril tool and PHP, thanks for not letting me drift from the main objectives of this project :P

Hi @Aklapper, @jcrespo
I want to contribute to the "Database backup inventory improvements" project idea; it seems quite interesting to me. I have read the details of this idea and I have seen the schemas of the backup tables. I have also read https://www.mediawiki.org/wiki/New_Developers. Can you guide me on setting up the environment for this specific project idea?

Thank you very much

Hi @jcrespo, @Marostegui Wikimedia has been officially accepted for GSoC 2021! https://summerofcode.withgoogle.com/organizations/5372073939042304/

As the student's applications period is from March 29 - April 13th, I want to encourage you to go through the further steps to ensure there isn't anything that you are missing: https://www.mediawiki.org/wiki/Google_Summer_of_Code/Mentors#_Before_the_program

I've added 3 microtasks related to the proposed projects that @Marostegui and I think are small but adequate to demonstrate understanding of how to contribute to Wikimedia.

@Rohitesh-Kumar-Jain

but I got the error unit: commands failed.

What exactly did the error message say? You probably had some dependency issues.

@DharmrajRathod98 the environment will depend on your proposal (cli vs web vs other). Most of the above are system-level projects, and at Wikimedia our target environment is Debian 10. Installing Debian Buster on a partition or a virtual machine will probably simplify development, although it is not required, as we may be able to provide a testing environment to connect to. Regarding the stack, most projects will require a database; we use MariaDB 10.4. The Python version on Debian 10 is 3.7; however, we cannot use f-strings because in most cases we still have to run on Debian 9 (Python 3.5). For web development, our server of choice is Apache. You will likely need some basic build utilities to create a package. We also use tox for Python testing. Some projects (but not all) use Sphinx for documentation. Regarding the framework for new web projects, that is for you to decide. We suggest Django or Flask as the safe options, but we are open to other proposals if properly justified.
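
For illustration only, the f-string restriction means string formatting has to use str.format() (or the % operator), since f-strings only exist from Python 3.6 onwards. The function name and sample values below are hypothetical.

```python
def describe_instance(host, port, section):
    """Build a short label for an instance without using f-strings."""
    # str.format() works on Python 3.5; an f-string such as
    # f"{host}:{port}" would be a SyntaxError there.
    return "{host}:{port} (section {section})".format(
        host=host, port=port, section=section
    )


print(describe_instance("db-example", 3306, "s1"))
```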

Hi @jcrespo , I am Palak, another GSoC 2021 enthusiast and found this project interesting and matching to my skillset. However I am unable to make a start on

MariaDB instances inventory (zarcillo web interface) project. Kindly assist

Hello Palak,

Thanks for your interest. Can you be a bit more specific about which problem you are facing?

I am not able to find the link to this project's repository and thus cannot understand the code base. Could you please point me to it? What am I supposed to work on? Should I start building the web app?

@Palak199: Please see the task description which links to all relevant codebases. For what to work on, please follow info in https://www.mediawiki.org/wiki/Google_Summer_of_Code/Participants . Thanks.

@jcrespo
For the "Database backup inventory improvements" project idea, a Flask web app would be better for me. Should I set up Flask with MariaDB 10.4 and Apache on my local (Ubuntu 20.04) machine? After that, for testing (with tox), do I need to clone the operations/software/wmfbackups repository? Have I understood correctly?

Thank you

@Aklapper the link to the relevant script for the MariaDB instances inventory (zarcillo web interface) project seems to be broken.

https://phabricator.wikimedia.org/source/operations-puppet/browse/production/modules/profile/files/prometheus/mysqld_exporter_config.py;blame=off
Can you please check?

This should work:
https://phabricator.wikimedia.org/source/operations-puppet/browse/production/modules/profile/files/prometheus/mysqld_exporter_config.py

For most of these project proposals, there is no existing code: a new repository will be set up so you start from zero. What exists is some related pieces of code or a database that you can read to get an understanding of the current usage. For example, for the database backup inventory, there is another repo for reading and writing to the database, and a database schema, but there is no existing code for a dashboard. You should download the .sql file and build the web dashboard around that database schema. I hope that is clear; please ask for further clarifications.

For everybody else: gaining a deeper understanding of the requirements will be part of the project time.
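
For illustration only, "building around the schema" could start from a query like the one below, which reports the most recent backup per section and type. The table here is a simplified, hypothetical stand-in (not the real schema from the .sql file), and an in-memory SQLite database stands in for MariaDB so the sketch is self-contained.

```python
import sqlite3

# Latest row per (section, type), found by joining against the
# maximum end_date of each group.
LATEST_PER_SECTION = """
SELECT b.section, b.type, b.status, b.end_date
FROM backups AS b
JOIN (SELECT section, type, MAX(end_date) AS latest
      FROM backups GROUP BY section, type) AS m
  ON b.section = m.section AND b.type = m.type AND b.end_date = m.latest
ORDER BY b.section, b.type
"""


def latest_backups(conn):
    """Return (section, type, status, end_date) of the newest backups."""
    return conn.execute(LATEST_PER_SECTION).fetchall()


# Hypothetical table and rows, for demonstration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE backups (id INTEGER PRIMARY KEY, section TEXT,"
    " type TEXT, status TEXT, end_date TEXT)"
)
conn.executemany(
    "INSERT INTO backups (section, type, status, end_date) VALUES (?, ?, ?, ?)",
    [
        ("s1", "dump", "finished", "2021-03-01 02:00:00"),
        ("s1", "dump", "failed", "2021-03-02 02:00:00"),
        ("s2", "snapshot", "finished", "2021-03-02 05:00:00"),
    ],
)
for row in latest_backups(conn):
    print(row)
```

A web view would then only need to render these rows; the choice of framework is independent of the query.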

Sorry, with so many questions sometimes I get confused. 0:-) The above comment was mostly in response to @Palak199 question about the repo, although it also partially answered a common confusion.

The requirements are vague on purpose for several reasons:

a) Students are supposed to do their own research about what they can contribute
b) We don't want to specify the exact details of the project; students should come up with their own concrete proposal.

Thank you.

Hi @jcrespo, thanks for answering my queries. As stated earlier, I want to work on MariaDB instances inventory (zarcillo web interface).
I read this Python script https://phabricator.wikimedia.org/source/operations-puppet/browse/production/modules/profile/files/prometheus/mysqld_exporter_config.py and got a basic idea of what it does. Would you please tell me whether I should start building its web interface straight away in Django or Flask, or whether there is anything else I should work on or research with respect to this project?

is there anything else I should work/research upon wrt this project

Work on the projects proposed should not start before the date, as you may not know the exact requirements. Instead, you should start working on a written proposal.

I am afraid I don't understand how do I read that, there is no information about its workflow and everything in README.md and as you mentioned earlier it can't be installed on local machine.

  • You should find a small task to contribute to

By this you mean contribute to tendril tool's existing project?

@jcrespo
Thank you very much for answering my queries. I think this much information is enough for me to understand the project idea. It would be great if you could provide a link to the other repo for reading and writing to the database. I have seen the schemas at wiki/MariaDB/Backups#Metadata (if there is anything else, please give me a link). Also, where can I download the .sql file for the Database backup inventory improvements project idea?

@DharmrajRathod98 I hope this helps: https://gerrit.wikimedia.org/r/c/operations/software/wmfbackups/+/673292 If it is clear, I will merge it.

However, the current database is very simple- you should start with the existing db, but will be able to add/change anything you would consider necessary.

The code for that project suggestion would probably entail a combination of changes needed to the existing repository at:

https://phabricator.wikimedia.org/diffusion/OSWB/

And a new repository for the extra features of the frontend/web, both only communicating through a shared database. This is only an initial idea, and you have lots of room to suggest a different way, if well justified.

@jcrespo
For now, it is clear. I will let you know if any further changes are required. I will try to add a new repository for the frontend/web.

Hi, @jcrespo && @Aklapper

You should read old tool's code, tendril: https://phabricator.wikimedia.org/diffusion/OSTD/

I have set up the tendril tool and seen the code, there are certain questions I have.
The existing tool uses LAMP stack, need I propose my idea in same tech stack?

You should find a small task to contribute to

Should I make contributions to improve tendril tool?
Like write a detailed README.md on how to run the project locally.
Or send a patch on UI improvements, if that is needed
I saw that in some database tables, there are attributes whose datatype can be BOOL but is VARCHAR. Should I work on it?

If there is some other task I must contribute to, please tell.

@jcrespo
I have added a new repository at https://phabricator.wikimedia.org/diffusion/OSWB/ named "web" for the frontend/web components. I have just added a basic Flask web app, which can be further modified as we add frontend/dashboard components. If any further changes are required, let me know. What should be the next step? Should I start integrating the .sql file with MariaDB and add some basic components?
For review:
https://gerrit.wikimedia.org/r/c/operations/software/wmfbackups/+/673693

The existing tool uses LAMP stack, need I propose my idea in same tech stack?

Our preference is a Python-based stack (we suggest frameworks such as Django or Flask if developing a website, but that is up to the student). Other languages are possible, including PHP, with which we at Wikimedia are very familiar. The reason we suggest Python is that it could integrate well, if necessary, with other existing tools we use as DBAs/backup owners. It is also the preferred language of Wikimedia Site Reliability Engineers, so you will likely get more help with it than with others. PHP would be a second option: probably OK for a website, but less justified at first unless it was intended to integrate with the MediaWiki software.

In the end, the student can propose anything, but the mentors will weigh the justification and the future maintainability of a particular option.

Like write a detailed README.md on how to run the project locally.

That would be a perfectly adequate task for a first bug, but please note that tendril is at its end of life, so your patch may end up unused. I would encourage you to look for an active project such as the ones proposed in this ticket (and just consider tendril as "read only"):

I will try to add a few more simple tasks on this task today. You can also propose some, but please mention it here and we will create a task to track them properly. Thank you.

@DharmrajRathod98 Please note that work on the actual GSoC should not start before one is selected. The patch you sent is good as a proof that you know how to contribute, but it cannot be merged/reviewed, as it is in the scope of the project itself.

If you want to see a patch of yours reviewed and merged, check one of the proposed patches suggested as "first good patch" above. I will add more suggestions later in the day.

So as a summary of things that everybody should be doing now:

  • Reading carefully the stubs of projects above
  • Understanding existing code/functionality/design related to their project, by downloading it, trying to run it, fully understanding what it does, and reading related documentation. E.g. for database-related work: How are databases managed at the WMF? For backup-related work: How are backups managed at the WMF?
  • Creating a developer account and a Phabricator account
  • Preparing the tools/environment necessary to contribute: Do I need to install a compiler? A virtual machine? A web server? A database?
  • Asking specific technical questions here about things they don't understand (e.g. "I tried running tendril on PHP 7.2 and it doesn't work, why?")
  • Proposing a patch for an improvement/bug fix from the list of good first patches, or suggesting here something new that we didn't propose so a task can be created about it (please do not work on anything before asking here and before a task has been created). For example, so far we got no questions/patches for T268258
  • Remembering that, for now, your final goal is to write a GSoC proposal that demonstrates that you will be able to develop and finish the project in the time allocated for it; writing 1 good patch and communicating clearly are good ways to do so

Apologies if we take some time to answer, it is difficult to answer the many questions we are getting at the same time.

I've added an extra good first task. It shouldn't need more than a few lines of changes, but may require more setup and research:
T253959: Check we are preparing (xtrabackup --prepare) with the same package version as the server version of which the backup was taken
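
For illustration only, a version check of this kind could be sketched as below. It assumes both version strings are already available as text and that comparing the major.minor series is sufficient, which may not be the rule the task settles on; the function names are hypothetical.

```python
import re


def major_minor(version):
    """Extract (major, minor) from a string such as '10.4.18-MariaDB'."""
    match = re.match(r"\s*(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError("unrecognised version string: {!r}".format(version))
    return int(match.group(1)), int(match.group(2))


def same_release_series(server_version, tool_version):
    """True if both versions belong to the same major.minor series."""
    return major_minor(server_version) == major_minor(tool_version)


print(same_release_series("10.4.18-MariaDB-log", "10.4.15"))
print(same_release_series("10.1.44-MariaDB", "10.4.18"))
```

Raising on an unparseable string, rather than returning False, keeps a malformed version from being mistaken for a plain mismatch.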

A heads up that the schedule for proposals will open soon (there is still plenty of time), and candidates may want to start drafting their proposal for Wikimedia projects as a subticket of this one. This is just a reminder to read https://www.mediawiki.org/wiki/Google_Summer_of_Code/Participants#Application_process_steps (step 10).

Please also do not add any personal information on Phabricator, as it will be public for everyone to see; you will add the necessary data to Google privately. We only need the technical details on Phabricator for discussion/help/advice.

Hi @jcrespo, I was drafting my proposal. I was wondering if you could please elaborate more on the features required for the web dashboard. For example, is a role-permissions module needed?

@Palak199 Your proposal is very personal and we value creativity, so I wouldn't like to tell you, or anyone, what you should do. It is OK to start a draft on Phabricator and we can comment further.

What I can give you is a few tips: keep it as small as possible, because overly ambitious projects get turned down as unrealistic. New unknown problems will arise. Please don't think that a large proposal will be rated higher than a humble one; sometimes it is the opposite. Plus there is always time for stretch goals later.

Another thing I can tell you is what we DON'T want, specifically for the project you mention:

  • We don't want permission/grant handling/authentication/authorization; that would be handled on the infrastructure side
  • We don't want a tree-like structure for databases; we already have that in orchestrator
  • We don't want slow query report retrieval; we already have that in performance_schema
  • We don't want graphs/metrics monitoring; we already have that with Grafana + Prometheus
  • Something flexible and extensible is better than something that cannot be built upon

The simplest application you can think of that covers the given proposal is likely to be chosen, as quality will be valued over number of lines of code. 180h of work is not a lot to ship a complex piece of software, given that analysis, design and testing will have to be included there.

A reminder that was just posted to zulip:

Just to put it to your attention. The application period has officially begun. You'd have to split your time between coming up with a proposal and working on the microtasks at the same time. Make sure you're subscribed to the tasks for the projects you're interested in on Phabricator. We will be updating those shortly on how to submit your microtasks for evaluation.

As for the proposal, the best way to approach it is to look at the past proposal

@jcrespo

Thank you so much for explaining it so well. It made the dos and don'ts of writing a proposal much clearer to me. I am supposed to fill in this form
https://phabricator.wikimedia.org/maniphest/task/edit/form/1/
right?

We will be updating those shortly on how to submit your microtasks for evaluation.

What if the microtask I am working on, is still open? Can it be still counted?

Yes, I think we have to create a task for the proposal, as shown in previous examples.

Check the exact link on the wiki; it has a template pre-applied. Make sure it is tagged with the right GSoC tag. You can edit it and make it a "child" of this very task, or paste it here and I can do that. This goes for @DharmrajRathod98 too.

What if the microtask I am working on, is still open? Can it be still counted?

Yes, don't worry. We should try to close them by the end of the application process, but every contribution and communication will be taken into account positively (the microtasks are an exercise, not an end goal). Reviews for database and backup changes normally take a long time, especially for new developers.

Hi @jcrespo.
For the 'MySQL account metadata inventory' project, I have a couple of questions regarding its relevant script.

1) Is the database host ("db1115.eqiad.wmnet") mentioned in the script directly accessible from the public network, or should I be using some VPN to connect to it?
2) How are entries inserted into the instances table?
3) How are new users created, i.e. how are rows inserted into mysql.users?
4) Would I get access to the db instances so that I can have a look at the other columns?
5) How do I test things I've coded? Would there be a staging setup or should I test things on my local machine?

1)Is the database host("db1115.eqiad.wmnet") mentioned in script directly accessible from the public network or should I be using some VPN to connect to it?

The host is not publicly accessible, nor will it be made accessible to the student working on this project. This is a programming project, so no special access will be provided to people working on it. This is not something special about GSoC; most employees and volunteers do not have access to the underlying infrastructure either.

You can assume that the application will have direct access, and you should be able to set up your own database for development. Exports of the several databases are made available. We might also be able to provide testing virtual machines (we did it last year) that may serve as "staging" for the project, to prove the project works in an environment identical to production.

2)How are entries inserted into instances table?

Right now, manually, using SQL. The whole point of the project is to create an interface to make that easier, both for manual operation and potentially for automatic operation too (e.g. with a REST API). What writes to it is not in scope of the project, at least not at this stage.

How are new users created i.e how are rows inserted into mysql.users?

You won't have to handle that; it is infrastructure, so out of the scope of the project. MySQL and server setup is already "solved" through our configuration management. Authentication, configuration, server management and security are not part of the scope to solve. However, one is expected to use security best practices (e.g. avoiding SQL injection, XSS, etc.).
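
For illustration only, avoiding SQL injection usually means passing values as query parameters instead of building SQL strings by hand. The sketch below uses an in-memory SQLite database and a simplified, hypothetical instances table; SQLite uses `?` placeholders, while MariaDB drivers such as pymysql use `%s`.

```python
import sqlite3


def add_instance(conn, name, server, port):
    """Insert one instance; values are bound, never interpolated."""
    conn.execute(
        "INSERT INTO instances (name, server, port) VALUES (?, ?, ?)",
        (name, server, port),
    )
    conn.commit()


def find_instances(conn, server):
    """Return (name, port) rows for the given server."""
    cursor = conn.execute(
        "SELECT name, port FROM instances WHERE server = ? ORDER BY name",
        (server,),
    )
    return cursor.fetchall()


# Hypothetical table and values, for demonstration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE instances (name TEXT PRIMARY KEY, server TEXT, port INTEGER)"
)
add_instance(conn, "db-example:3311", "db-example", 3311)

# A hostile value is treated as plain data and simply matches nothing.
hostile = "x'; DROP TABLE instances; --"
print(find_instances(conn, hostile))
print(find_instances(conn, "db-example"))
```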

4)Would I get access to the db instances so that I can have a look at the other columns?

Not sure what you mean by "other columns". A structure export has been provided already. I can provide example content if useful, but for a proposal the mockup should be enough to understand the kind of data that could be stored there. More details will be provided during the analysis/development phase. Again, no access to any production resources will be provided to the student.

5) How do I test things I've coded? Would there be a staging setup, or should I test things on my local machine?

Yes, if possible you will have a development environment on your machine. But we may be able to provide a staging environment on WMF Cloud (we did it last year by asking for a couple of VPSs), including a very similar configuration management setup.

Hi @jcrespo,
As the weekend is coming up, I have no other tasks to do, and my draft proposal is done, should I continue with the second microtask, or do you think there's anything else that I can contribute to?

Thanks for the clarifications @jcrespo .

A web dashboard that tracks the status of ongoing backups (next scheduled run, running, postprocessing, finishing ETA, etc.)

@jcrespo
I evaluated the database, but I could not find the connection between the status field of the backups table (value 'ongoing') and the states listed above (next scheduled run, running, postprocessing, finishing ETA, etc.). Can you please explain how to link them in the backups table?

can you please explain how to link them in backups table

Indeed, that is a really good question.

The backup table status has 3 possible explicit states:

  • ongoing: The backup has started, but it has not finished yet
  • failed: The backup started and finished with an error
  • finished: The backup finished with no errors (including tests of completeness)

There are, however, other states that could be inferred from other fields:

  • Backups are scheduled at a point in time. If a scheduled backup has no ongoing entry yet, it means it is queued to be executed but has not started
  • Metadata starts being gathered before postprocessing. If there is any metadata, like the total size, you know the backup is being postprocessed, is pending compression, etc.

The details are not as important for the proposal as they would be for an initial analysis afterwards; e.g. if we consider that we need more explicit states, we can add those to the original tool.
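To make the inference above concrete, here is a rough sketch of how a dashboard could map a backup row to a display state. It is one possible reading of the rules, not how the existing tool works: the function name `display_state`, its parameters, and the "scheduled" and "queued" labels are assumptions made for this example, and the real column names may differ.

```python
from datetime import datetime


def display_state(status, total_size, scheduled_at, now):
    """Map a backup row (or its absence) to a dashboard state.

    status: 'ongoing', 'failed', 'finished', or None when no row exists yet.
    total_size: gathered metadata; None while nothing has been gathered.
    scheduled_at: hypothetical datetime of the scheduled run, or None.
    now: current datetime, passed in so the function is easy to test.
    """
    # Explicit states stored in the table.
    if status in ("failed", "finished"):
        return status
    if status == "ongoing":
        # Metadata such as the total size appears before postprocessing.
        return "postprocessing" if total_size is not None else "running"
    # No entry yet: the run is either due but not started, or still in the future.
    if scheduled_at is not None and scheduled_at <= now:
        return "queued"
    return "scheduled"


now = datetime(2021, 4, 1, 12, 0)
print(display_state("ongoing", None, None, now))
print(display_state("ongoing", 1024, None, now))
print(display_state(None, None, datetime(2021, 4, 1, 10, 0), now))
print(display_state(None, None, datetime(2021, 4, 2, 10, 0), now))
```

Passing `now` explicitly keeps the function free of hidden clock reads, which makes the inferred states straightforward to unit test.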

I hope that is clear.

Both mentors took most of last week away from the computer.

I went over the list of questions and tried to answer every mention, but if we missed something, feel free to ping me again.

As a reminder, if you are interested in proposing a project related to this option for GSoC, you should already be creating a draft proposal in Phabricator following the guide/template at https://www.mediawiki.org/wiki/Google_Summer_of_Code/Participants#Application_process_steps (points 10 and 11).

That way we will be able to point out obvious mistakes in scope before it is too late!

The deadline for the Google application is April 13! Before that date, you will need to copy that draft (plus your personal information) into the final application on the Google Summer of Code website.

Hi @jcrespo . Are there any microtasks I can contribute to, in order to get a better understanding of the project - 'MYSQL account metadata inventory'?

@Ashitabattu There are 4 microtask suggestions at the end of the body text of this ticket. None have been closed yet, so you can check whether you can send a patch for any of them. I will try to review all of them, although be aware that other students have already started working on some.

For everybody: in the draft proposals here on Phabricator, and in the final proposal you should write for the Google Summer of Code website, please remember to highlight any past contribution to open source projects, be it by working on the suggested microtasks, helping others, or e.g. sending pull requests to other projects. While we will be evaluating your entire proposal, demonstrating initiative and/or past experience (even if very small) communicating with an open source project is the best way to make sure we know you will be able to complete the proposed work. Deep technical knowledge is not that important for GSoC.

I will make sure to review all proposals submitted in time, and you will be able to change those on Phabricator with no issue, but as a reminder, the submission deadline on Google (the one that will be considered the "definitive one") is April 13 18:00 UTC. Check the time so that you don't submit late or miss any last changes. You should probably have something almost 100% ready by Monday to guard against last-minute internet issues. I will try to send another reminder near the deadline.

I am assigning this task to @Marostegui because we cannot assign it to multiple people at the same time (both mentors) :-(, but that way it won't be confused with a student's proposal.

I think I have given at least some initial comments to every proposal so far. Thanks for taking the time, and of course, all are Work in Progress until the deadline, so you will have time to improve them if you consider it necessary (it was the whole point of having initial drafts).

The general theme is that people are usually very ambitious and very prone to underestimating the time needed to complete a project. As software developers, you must be very conservative with your estimates: it is all about quality, not quantity! This is especially true this year, when the project's time has been halved.

Thanks a lot

A reminder that there are just a few hours left for the official and final submissions of GSoC proposals (less than 10 hours from the moment of this comment). So make sure your proposals have been submitted to the GSoC website https://summerofcode.withgoogle.com/ and are set to final before the deadline. It is OK to make some final adjustments if necessary, but they have to be submitted to Google's website to be eligible.

hi @jcrespo and @Marostegui
Thank you for the guidance you both gave on the proposal and contributions :)
Please guide me on how I should contribute during this phase, e.g. through any microtasks related to this project or anything else.

Thanks to everybody that applied. I counted 10 successful final applications that I think were all related to this idea (or at least contained "databases" or "backups"), including everybody that had expressed interest here! Thank you for reaching the end of the application process. I think this was a very successful idea thanks to you.

Now it is time for you to relax. The Wikimedia GSoC organization will evaluate the participants and I think the results will be communicated on May 17, 2021. If you want to continue working on WIP microtasks, you are free to do so (like the many volunteers that send us patches), but that will not be taken into account in relation to Google Summer of Code selection, so if I were you, I would take this time to relax/focus on other things for now. :-)

Good luck to everybody!

hi @jcrespo and @Marostegui
I looked at the zarcillo database and think that some minor improvements could be made here and there. Where should I raise an issue to discuss them?

jcrespo claimed this task.
jcrespo removed a project: dbbackups-dashboard.

Thanks to everyone that participated! We had very good submissions this year. Hopefully we have you next year too!