Page MenuHomePhabricator

Improve filename regex in cli/recover-dump
Open, LowPublic

Description

Currently, recover-dump has a regex for validating the filename of the backup dump.

This is the code for the regex which is in the file (in the link above)

# FIXME: backups will stop working on Jan 1st 2100
DUMPNAME_REGEX = r'dump\.([a-z0-9\-]+)\.(20\d\d-[01]\d-[0123]\d\--\d\d-\d\d-\d\d)(\.tar\.gz)?'

It should accept a filename starting with dump, followed by a timestamp of the format YYYY-MM-DD--HH-MM-SS followed by the file format .tar.gz

See this example here to help analyse the regex above

In addition to the fixme there are other issues with this regex.

Currently as it stands, the regex accepts valid inputs below

dump.s3.2022-11-12--19-05-35.tar.gz
dump.s3.2021-03-19--12-05-32.tar.gz
dump.s3.2011-09-05--00-35-26.tar.gz
dump.s3.2021-03-19--12-05-32.tar.gz

It also accepts the following below as valid. All backups and recoveries can work with a directory or a tarball of such directory, based on the option set.

dump.s3.2022-11-12--19-05-35
dump.s3.2021-03-19--12-05-32

Fixing the FIXME would need the regex to accept the below (currently rejects it)

dump.s3.2105-03-19--12-05-32.tar.gz

The following below is definitely supposed to be invalid (invalid timestamp), however the regex accepts them.

dump.s3.2021-19-39--99-05-32.tar.gz
dump.s3.2021-19-19--12-99-32.tar.gz
dump.s3.2021-19-19--12-99-99.tar.gz
dump.s3.2021-19-19--12-99-99

Tasks

  • Fix existing regex so that it can accept input with a valid timestamp and also reject input with invalid timestamp. (in addition to the fixme for Jan 1st 2100 issue)
  • If regex check fails, consider adding a error message to print something like "Invalid input - regex check failed"

Tests

  • The regex is not in a function, that might make it difficult for unit testing. Figure out a way to make it unit-testable. (without breaking existing functionality - check usages of the DUMPNAME_REGEX global variable in code)
  • Ideal to write unit tests to check that the regex works correctly with varying test inputs.
NOTE: Please remember to keep the scope small- backups can get quite complicated :-)

Event Timeline

h.krishna renamed this task from Improve/add tests for filename regex in cli/recover-dump to Improve filename regex in cli/recover-dump .Mar 18 2021, 2:32 PM

It also accepts the following below as valid. This could be to compensate for backup files with missing tar.gz extension (needs clarification - is it a bug or a feature?)

dump.s3.2022-11-12--19-05-35
dump.s3.2021-03-19--12-05-32

You can amend that on the description- all backups and recoveries can work with a directory or a tarball of such directory, based on the option (compress: True/False).

As a practical decision, for WMF backups, we don't compress the logical dumps (as they use per-table compression, and would be a waste of cpu cycles), but we pre-compress the snapshots (as it saves network bandwidth when recovered). See below:

dbprov2001# grep '\(compress\|type\)' /etc/wmfbackups/backups.cnf 
type: 'dump'
compress: False

cumin1001# grep '\(compress\|type\)' /etc/wmfbackups/remote_backups.cnf 
type: 'snapshot'
compress: True

dbprov2001# ls /srv/backups/snapshots/latest/
snapshot.m5.2019-05-07--20-00-02.tar.gz  snapshot.s6.2021-03-18--04-30-42.tar.gz
snapshot.s4.2021-03-17--22-12-54.tar.gz  snapshot.s8.2021-03-17--19-00-00.tar.gz
snapshot.s5.2021-03-18--02-38-19.tar.gz

dbprov2001# ls /srv/backups/dumps/latest/
dump.s4.2021-03-16--00-00-02  dump.s6.2021-03-16--03-35-09
dump.s5.2021-03-18--09-22-30  dump.s8.2021-03-16--00-00-02

Please remember to keep the scope small- backups can get quite complicated :-)

Could someone please add a code base project tag so this task could be found? Is this maybe Data-Persistence-Backup? Or Data-Services?

h.krishna updated the task description. (Show Details)

@jcrespo I've updated the task description and have also added your advice to keep the scope small. I'm not sure about the code base project tag, so can you add that as requested above by Aklapper?

@Aklapper I've edited all tasks I am aware of with the same parent so we can make sure we can clean up/retag later, if necessary.

@Sahilgrewalhere i am working on this so, please work on some different task.

Change 675010 had a related patch set uploaded (by Sahilgrewalhere; author: Sahilgrewalhere):
[operations/software/wmfbackups@master] Improved filename regex in cli/recover-dump

https://gerrit.wikimedia.org/r/675010

Change 673693 had a related patch set uploaded (by DharmrajRathod98; author: DharmrajRathod98):
[operations/software/wmfbackups@master] regular expression fixed with only validtime stamp and ending with required extensions(.tar.gz) and give message accordingly if doesn't match.

https://gerrit.wikimedia.org/r/673693

@Aklapper As per Data persistence manager and Site Reliability Director direct, strict instructions, this is not something the Data Persistence team is working on, sorry. Please ask any doubt to either of them.

@jcrespo: Can we please have a codebase project tag then? Tasks should be findable. Thanks.

@Aklapper Hit me on IRC, it is difficult to explain.

Change 675010 had a related patch set uploaded (by Sahilgrewalhere; author: Sahilgrewalhere):
[operations/software/wmfbackups@master] Improved filename regex in cli/recover-dump

https://gerrit.wikimedia.org/r/675010

@jcrespo
dump.s3.2022-11-12--19-05-35
dump.s3.2021-03-19--12-05-32

should not accept so i have fixed.

@DharmrajRathod98 @Sahilgrewalhere as a reminder, more important than quality patches are the communication skills, which will also be evaluated. "Taking over" each others ownership of a ticket without communicating with the ticket owner or mentor will not reflect well. I have reviewed both tickets and both have lots of room for improvement. Collaboration, not competition. :-)

Feel free to do so I'd say; no announcements needed. :)

While that is the usual policy, I asked on the parent ticket to communicate the willingness to work on a ticket and ask questions first as there seems to be quite a lot of people interested and avoid coordination issues- for this ticket only.

@jcrespo i am sorry about that i will be more carefull from the next time. so should i work on this now or hand over to @Sahilgrewalhere ?

Hi everyone. Based on review comments left by jcrespo in both of your patches - I believe improving the regex is a short-term solution and I believe regex may not necessarily the best way to solve this issue, as there could be edge cases like leap year. This is something I hadn't considered whilst I wrote the ticket, so thanks for pointing this out @jcrespo

There might be other ways that are better. While I would be inclined to make a regex that would also reject invalid leap years - a better method would to use a pythonic method to validate time, using the datetime module.
I would break it down into a function that does these tasks

  1. Grab the filename using the regex above (as mentioned in ticket)
  2. Grab the time from the filename, parse it properly into the datetime module and validate it.
  3. If time/date isn't valid, you could print a message saying [FILENAME] does not pass regex, but this is up to you

If you can do it in a better way than my suggestion above then please feel free to do so :)
I would also recommend writing unit tests to test/prove the function/regex you have written works well with edge cases (such as leap year dates likes 30-02-2021 and 31-02-2021).
I am not sure about whether python honors/does leap seconds, as leap seconds are manually added by a group at random (International Earth Rotation and Reference Systems Service (IERS)) -- I'm not sure if it's possible to validate a leap second - this might be too much effort and/or impossible
If so, a regex/function that would support valid leap years should also work.
But I think datetime would be nicer and should work for invalid leap year (i-e. 30-02-2022) which the regex fails to catch

In both the patchsets (673693 Patchset v4 by @DharmrajRathod98 and 675014 Patchset v1 by @Sahilgrewalhere ), I noticed the continue statement is being replaced with a print message - not sure if this is a good idea as this would definitely break existing functionality. (the purpose of continue being to skip over any other files in the same directory I think).
It looks like there isn't a need for a message to be printed as I had originally thought, but this is up to you.

What are your thoughts on the above? The way I see it, there's two options

  1. Make the regex work with leap year (may be complicated?)
  2. Use datetime to validate time instead
  3. (Suggest a different way)

@jcrespo @h.krishna
i would suggest that to make the regex for leap year is more complicated so as you said we should use datetime module to validate. So function that validate datetime is better idea.
or another way is only validate leap year using code but in this case everytime we have to go through regex and code both. So i think better idea is to make function.

About the leap second, after validate the timestamp using datetime module we may think about more robust way to validate it. if it is necessary because it is added randomly we may think to add manually.

Let me know if the function is the better idea then i will start implementing it but before that i should have clear idea what should the function contain.

@h.krishna very good comment, that is the kind of comment why I asked you to create a ticket.

Having a bare minumum regex (e.g. ...\d\d\d\d\-\d\d\-\d\d) + a call to validate the date would be a good option.

message to be printed as I had originally thought

It would be ok to print a DEBUG or at most an INFO one, but not anything above that (the print was too worrying). There is an expectation of having other section files on the same dir.

@h.krishna very good comment, that is the kind of comment why I asked you to create a ticket.

Having a bare minumum regex (e.g. ...\d\d\d\d\-\d\d\-\d\d) + a call to validate the date would be a good option.

message to be printed as I had originally thought

It would be ok to print a DEBUG or at most an INFO one, but not anything above that (the print was too worrying). There is an expectation of having other section files on the same dir.

@jcrespo i think it is good idea. i will stick to this idea.

@jcrespo @h.krishna
for validating time we can use regex and for validating date we can call a function is this correct ? that will be very easy to understand.

your leetcode test cases seem to fail for the following inputs. I don't think it should, these inputs look fine to me
regex check failed for : dump.s3.2022-06-01--21-06-56.gz
regex check failed for : dump.s3.2021-03-19--12-05-32
You'll probably need to recheck your method a bit.

@jcrespo @h.krishna
dump.s3.2022-06-01--21-06-56.gz and dump.s3.2021-03-19--12-05-32 are failed because it is not ending with .tar.gz however, datetime is correct so, my function will return true for datetime validation but regex check failed for not ending with .tar.gz

@DharmrajRathod98 see https://phabricator.wikimedia.org/T277754#6925247
I've added a comment on your patch as well
edit: Apologies for confusion -- my bad. The .gz shouldn't be valid I think (based on old regex) -- it's just the one with the filename without a fileformat that should be also valid (as it is a tarball of a directory)

I will be checking this contribution soon as part of my job maintaining backups, and hopefully, getting it merged.