
DBA review of UrlShortener
Closed, Resolved · Public

Description

This extension is already deployed but disabled. It has one table, urlshortcodes, in the wikishared database on x1. Adding new short URLs is throttled, so the expected growth rate is not super high.

Event Timeline

Do you know if the table is public (could be 100% published as is with all its columns)?

Do you know if the table is public (could be 100% published as is with all its columns)?

The content should be public unless usc_deleted is not zero (Already mentioned in T218397)

I will talk to @Marostegui to see what the best way to proceed is. Meanwhile, I will review the queries and structure. If you have more info on the predicted usage of the table (it is ok to be wrong, a rough prediction is fine): number of new rows/s, number of reads/s, predicted size in X amount of time, percentage of "deletions"; that would be great!

First, sorry for not informing you directly; the table has already been in production for years now. I thought it had gone through DBA review.

Number of new rows/s

  • It's rate limited to 10 new rows per two minutes for IPs and newbies, and 50 per two minutes for other users. It can be rate limited harder if needed.
  • It's very unpredictable how fast it will grow; it depends on whether it gets adopted by others. For example, the apps team is very interested in this, and WDQS also wants to use it.
  • This extension used to have an option to add a link in the sidebar to get a short URL for articles. I disabled this to reduce the writes.
  • Imagining we have 1M new rows per month, that would be about 0.4 new rows/s (see the quick calculation after this list). It will decrease over time because lots of URLs will already be in the table, especially the most used ones.
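A back-of-the-envelope check of that write-rate figure; the 1M rows/month input is the assumption from the bullet above, not a measurement:

# Rough write-rate estimate: convert an assumed 1M new rows per month
# into rows per second (the 1M/month figure is a guess, not a measurement).
SECONDS_PER_MONTH = 30 * 24 * 3600  # roughly 2.6M seconds in a 30-day month

new_rows_per_month = 1_000_000
writes_per_second = new_rows_per_month / SECONDS_PER_MONTH
print(f"{writes_per_second:.2f} new rows/s")  # prints about 0.39 new rows/s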

number of reads/s

  • The reads are not many because results are cached both in Varnish and on the client.
  • The only cases are A) the first time someone makes a short URL for a given URL, and B) when someone tries to create a short URL for something that already exists, in which case we just return the old value (see the sketch after this list).
  • Sticking with the 1M new rows per month assumption, it'll be around 2 reads/sec.
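A minimal sketch of the lookup-or-create behaviour described in case B, using an in-memory SQLite stand-in for the urlshortcodes table; the SHA-256 hashing and the function name are illustrative assumptions, not the extension's actual implementation:

import hashlib
import sqlite3

# Illustrative stand-in for wikishared.urlshortcodes: the unique hash column
# lets "create" return the existing row instead of inserting a duplicate.
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE urlshortcodes (
           usc_id INTEGER PRIMARY KEY AUTOINCREMENT,
           usc_url_hash BLOB NOT NULL UNIQUE,
           usc_url BLOB NOT NULL,
           usc_deleted INTEGER NOT NULL DEFAULT 0
       )"""
)

def get_or_create_short_code(url: str) -> int:
    """Return the row id for url, inserting a new row only if it is unseen."""
    # Assumption: a 32-byte SHA-256 digest, matching the binary(32) column.
    url_hash = hashlib.sha256(url.encode("utf-8")).digest()
    row = conn.execute(
        "SELECT usc_id FROM urlshortcodes WHERE usc_url_hash = ?", (url_hash,)
    ).fetchone()
    if row is not None:
        return row[0]  # case B: already exists, return the old value
    cur = conn.execute(
        "INSERT INTO urlshortcodes (usc_url_hash, usc_url) VALUES (?, ?)",
        (url_hash, url.encode("utf-8")),
    )
    return cur.lastrowid  # case A: first time this URL is shortened

# Creating the same URL twice returns the same id instead of a new row.
assert get_or_create_short_code("https://example.org/a") == get_or_create_short_code("https://example.org/a")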

predicted size in X amount of time

  • The table is already in compressed mode.
  • Uncompressed on my localhost, AVG_ROW_LENGTH is 819 B. Let's imagine we have 1M rows per month (I don't think we'll have that many); that would be about 781 MB per month, but only in one table (see the calculation after this list).
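The same 1M rows/month assumption combined with the 819 B average row length gives that figure:

# Rough monthly growth estimate from the uncompressed average row length.
# Both inputs (819 B/row, 1M rows/month) are assumptions from the bullets above.
avg_row_bytes = 819
rows_per_month = 1_000_000

monthly_growth_mib = avg_row_bytes * rows_per_month / (1024 ** 2)
print(f"~{monthly_growth_mib:.0f} MB per month")  # prints ~781 MB per month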

percentage of "deletions" that would be great!

<0.1%; deletions only happen in severe cases such as information breaches, where URLs get hidden.

Would making the table fully private be ok for now? We can review it later and change it, but it would speed up the process for now (filtering per row may involve creating custom views, etc.).

apps team is very interested in this

I knew about WDQS; do you have more information about that one? My fear is that this may become a very popular API and won't "fit" into x1 as a single table.

Would making the table fully private be ok for now? We can review it later and change it, but it would speed up the process for now (filtering per row may involve creating custom views, etc.).

Totally. Direct use of that table by Cloud VPS tools doesn't make much sense TBH; I can't think of any use case right now.

apps team is very interested in this

I knew about WDQS; do you have more information about that one? My fear is that this may become a very popular API and won't "fit" into x1 as a single table.

It came up at the Wikimedia Summit one week ago. There's no plan to implement it in the near future, but you can ask Josh about it; I can't find his Phabricator handle :/ If the table grows too fast, we can always decrease the rate limit. And keep in mind, any new API using it will be rolled out gradually.

Just to be clear, aside from a sanity review, there are 3 things we make sure of:

  • wikireplicas filtering is in place
  • backups are produced so we can recover the data in case they get lost for some reason
  • there are enough resources to serve the new functionality (hardware like disk space, CPU, etc.)

That is why we want to be part of the "review" process, if that makes sense; we don't want to be a burden.

The table includes a blob. While I understand why (Wikidata Query URLs can be quite large), I wonder if there should be a limit in code to prevent sending 4 GB to the API. Do you know if there is any restriction in place for that on the application side?

Unrelated, but FYI:

Screenshot_20190401_150927.png (missing schema change on beta)

That is why we want to be part of the "review" process, if that makes sense; we don't want to be a burden.

I totally understand. I thought it had gone through it, and I'm sorry for that.

The table includes a blob. While I understand why (Wikidata Query URLs can be quite large), I wonder if there should be a limit in code to prevent sending 4 GB to the API. Do you know if there is any restriction in place for that on the application side?

Browsers won't accept URLs longer than about 2000 characters; I don't know if that includes the section (fragment) or not, I will check. But having a limit is a good idea regardless. I will add it.

Unrelated, but FYI:

Screenshot_20190401_150927.png (missing schema change on beta)

Yeah, I'm on it.

+1 to filter the table for now so we don't have to worry about any possible issues.
We don't replicate x1 to labs for now anyway, so adding a note in the puppet code to review this table (linking to this task) would also be good, so we can specifically review it if we decide to replicate x1.
Probably also worth noting it on {T103011}

Change 500470 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] realm.pp: Add urlshortcodes to private table

https://gerrit.wikimedia.org/r/500470

@jcrespo @Marostegui: What character limit for the blob would you find acceptable?

Change 500480 had a related patch set uploaded (by Ladsgroup; owner: Ladsgroup):
[mediawiki/extensions/UrlShortener@master] Add size limit to urls

https://gerrit.wikimedia.org/r/500480

So the issue with the length is the ~4 GB blob maximum; going by https://stackoverflow.com/questions/417142/what-is-the-maximum-length-of-a-url-in-different-browsers, 8000 bytes would be much less than 4 GB. Any other value within that order of magnitude would be ok.
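A minimal sketch of the kind of length check such a limit implies; the 8000-byte constant, function name, and error handling are illustrative assumptions, not the UrlShortener extension's actual (PHP) code:

# Illustrative length check before accepting a URL for shortening.
# MAX_URL_BYTES mirrors the ~8000-byte suggestion above; the real limit and
# its enforcement live in the UrlShortener extension, not in this sketch.
MAX_URL_BYTES = 8000

def validate_url_length(url: str) -> None:
    """Reject URLs whose UTF-8 encoding exceeds the configured byte limit."""
    encoded_length = len(url.encode("utf-8"))
    if encoded_length > MAX_URL_BYTES:
        raise ValueError(
            f"URL is {encoded_length} bytes; the maximum allowed is {MAX_URL_BYTES}"
        )

validate_url_length("https://query.wikidata.org/#SELECT%20%3Fitem%20WHERE%20%7B%7D")  # within the limit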

I guess x1 will have to do for now, but let's reevaluate if this ends up being a huge bottleneck in terms of reads and writes. Other than that I do not have further concerns (backups will be created as part of the x1 backups of wikishared), except suggesting we preemptively apply InnoDB compression to the urlshortcodes table. Doing it now, while the table is empty, would be just a single command, while later it will be much more involved.

Is InnoDB compression something you would be ok with, @Ladsgroup? If you don't know, the answer should be yes, as it will not affect the MySQL API at all, but it will reduce its future footprint.

+1 for InnoDB compression! Good idea!

I guess x1 will have to do for now, but let's reevaluate if this ends up being a huge bottleneck in terms of reads and writes. Other than that I do not have further concerns (backups will be created as part of the x1 backups of wikishared), except suggesting we preemptively apply InnoDB compression to the urlshortcodes table. Doing it now, while the table is empty, would be just a single command, while later it will be much more involved.

Thanks. Does that mean this is good to go?

Is InnoDB compression something you would be ok with, @Ladsgroup? If you don't know, the answer should be yes, as it will not affect the MySQL API at all, but it will reduce its future footprint.

When I looked at the database information, it looked like it was already compressed, but I'm definitely in favor of it.

It is not compressed (I only checked the master):

root@db1069:~# mysql wikishared -e "show create table urlshortcodes\G"
*************************** 1. row ***************************
       Table: urlshortcodes
Create Table: CREATE TABLE `urlshortcodes` (
  `usc_id` int(11) NOT NULL AUTO_INCREMENT,
  `usc_url_hash` binary(32) NOT NULL,
  `usc_url` blob NOT NULL,
  `usc_deleted` tinyint(1) NOT NULL DEFAULT '0',
  PRIMARY KEY (`usc_id`),
  UNIQUE KEY `urlshortcodes_url_hash` (`usc_url_hash`)
) ENGINE=InnoDB AUTO_INCREMENT=2 DEFAULT CHARSET=binary

I'm okay with its being compressed.

Change 500470 merged by Marostegui:
[operations/puppet@production] realm.pp: Add urlshortcodes to private table

https://gerrit.wikimedia.org/r/500470

It was empty or almost empty (we should double check on all hosts), so a single command on the master should be enough to compress it. Marostegui, will you do it or should I?

As per our last IRC chat, I will compress it :)

Mentioned in SAL (#wikimedia-operations) [2019-04-02T08:16:40Z] <marostegui@deploy1001> Synchronized wmf-config/db-eqiad.php: Depool all slaves in x1 T219777 T143763 (duration: 00m 53s)

Mentioned in SAL (#wikimedia-operations) [2019-04-02T08:20:36Z] <marostegui> Compress wikishared.urlshortcodes table on x1, directly on the master with replication (table has 1 row) - T219777

This table has been compressed everywhere:

root@cumin1001:/home/marostegui# ./section x1 | while read host port; do echo $host:$port ; mysql.py -h $host:$port wikishared -e "show create table urlshortcodes\G" | grep FORMAT ;done
dbstore2002.codfw.wmnet:3320
) ENGINE=InnoDB AUTO_INCREMENT=2 DEFAULT CHARSET=binary ROW_FORMAT=COMPRESSED KEY_BLOCK_SIZE=8
dbstore1005.eqiad.wmnet:3320
) ENGINE=InnoDB AUTO_INCREMENT=2 DEFAULT CHARSET=binary ROW_FORMAT=COMPRESSED KEY_BLOCK_SIZE=8
dbstore1001.eqiad.wmnet:3320
) ENGINE=InnoDB AUTO_INCREMENT=2 DEFAULT CHARSET=binary ROW_FORMAT=COMPRESSED KEY_BLOCK_SIZE=8
db2096.codfw.wmnet:3306
) ENGINE=InnoDB AUTO_INCREMENT=2 DEFAULT CHARSET=binary ROW_FORMAT=COMPRESSED KEY_BLOCK_SIZE=8
db2069.codfw.wmnet:3306
) ENGINE=InnoDB AUTO_INCREMENT=2 DEFAULT CHARSET=binary ROW_FORMAT=COMPRESSED KEY_BLOCK_SIZE=8
db2034.codfw.wmnet:3306
) ENGINE=InnoDB AUTO_INCREMENT=2 DEFAULT CHARSET=binary ROW_FORMAT=COMPRESSED KEY_BLOCK_SIZE=8
db2033.codfw.wmnet:3306
) ENGINE=InnoDB AUTO_INCREMENT=2 DEFAULT CHARSET=binary ROW_FORMAT=COMPRESSED KEY_BLOCK_SIZE=8
db1120.eqiad.wmnet:3306
) ENGINE=InnoDB AUTO_INCREMENT=2 DEFAULT CHARSET=binary ROW_FORMAT=COMPRESSED KEY_BLOCK_SIZE=8
db1069.eqiad.wmnet:3306
) ENGINE=InnoDB AUTO_INCREMENT=2 DEFAULT CHARSET=binary ROW_FORMAT=COMPRESSED KEY_BLOCK_SIZE=8
db1064.eqiad.wmnet:3306
) ENGINE=InnoDB AUTO_INCREMENT=2 DEFAULT CHARSET=binary ROW_FORMAT=COMPRESSED KEY_BLOCK_SIZE=8

This is what was run on the master:

ALTER TABLE urlshortcodes ROW_FORMAT=Compressed KEY_BLOCK_SIZE=8;

Is anything else left pending on this task, or can it be closed after merging https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/UrlShortener/+/500480/ ?

Mentioned in SAL (#wikimedia-operations) [2019-04-02T08:58:56Z] <marostegui@deploy1001> Synchronized wmf-config/db-eqiad.php: Repool all slaves in x1 T219777 T143763 (duration: 00m 53s)

Change 500480 merged by jenkins-bot:
[mediawiki/extensions/UrlShortener@master] Add size limit to urls

https://gerrit.wikimedia.org/r/500480

Mentioned in SAL (#wikimedia-operations) [2019-04-08T05:18:01Z] <marostegui@deploy1001> Synchronized wmf-config/db-eqiad.php: Depool all slaves in x1 T219777 T143763 (duration: 01m 30s)

^ that was not meant to be for this table