Page MenuHomePhabricator

Draft a proposal for granular blocks table schema(s), submit for DBA review
Closed, ResolvedPublic5 Story Points

Description

Data types here are illustrative, and open to change.

ipblocks_restrictions

FieldType
ir_ipb_idint
ir_typeint
ir_valueint

ir_type will be an int representing one of two values (page, namespace) and it will unlikely support anything else in a foreseeable future

ipblocks

FieldType
ipb_sitewidetinyint

A bool named sitewide will be added to ipblocks to indicate if the block is sitewide or partial.

Related Objects

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

There are some product requirements we need to iron out first, namely do we want to support multi-blocks. I'll post more on T2674

TBolliger changed the task status from Open to Stalled.May 4 2018, 10:57 PM
TBolliger changed the task status from Stalled to Open.May 9 2018, 7:01 PM

We discussed, not blocked by multi-blocks decision.

TBolliger set the point value for this task to 5.May 9 2018, 7:06 PM
dmaza claimed this task.May 15 2018, 12:32 AM
dmaza moved this task from Ready to In progress on the Anti-Harassment (AHT Sprint 21/22) board.
dmaza added subscribers: dbarratt, Niharika, kaldari, MaxSem.EditedMay 22 2018, 11:40 PM

Before I go ahead and create the SQL queries for the tables I wanted to pass this by you'all. @dbarratt, @MaxSem , @kaldari, @Niharika

I apologize for my lack of diagrams skills, hopefully the idea gets across clearly. To be more specific, the bluish entities are new tables that we would have to create. My suggestion is to go with Option A. What do you think?


Option A

This is the simplest and would be flexible enough to accommodate any other restrictions we would like to add to a particular block. I'm not a fan of EAV but in this particular case it suits us well and it is not unheard of in our current database layout.
Data types here are illustrative, and open to change.


Option B

The advantage here is that we can make use of FK constraints and "possibly" some performance gains over EAV. The disadvantage is that for every new restriction we want to add, a new table has to exists. This is also not very flexible or useful for things that are not stored in the DB (like namespaces)

In A), what does io_value do?

In A), what does io_value do?

For what we have planned so far, io_value could be a category_id, page_id or namespace.

dmaza added a comment.EditedMay 24 2018, 5:12 PM

For what is worth we are also working on T100070: Allow CheckUsers to set User agent (UA)-based IP Blocks and part of it is T195508: Update Database Schema to support User Agent blocks. This could be another column on ipblocks or it could be part of ipblocks_options.

By making use of ipblocks_options as we do with user_properties it will give us the flexibility to expand on blocks without making any significant change to the database in any future improvements we decide to work on.

dbarratt updated the task description. (Show Details)May 28 2018, 6:38 PM
dbarratt added a comment.EditedMay 28 2018, 7:10 PM

I agree with Bill Karwin in SQL Antipatterns that Option A is an antipattern with the following disadvantages (pg. 65-66):

  • You can't make mandatory attributes (as well as unique attributes)
  • You can't use SQL data types
  • You can't enforce referential integrity
  • You can't make up attribute names (as in, you must ensure that everyone (core, as well as any extension) uses the same io_type)

He concludes (pg. 68-69) by writing:

It's hard to justify using the EAV antipattern in a relational database. You have to compromise too many features that are strengths of the relational database paradigm. But that doesn't address the legitimate need in some applications to support dynamic attributes.
Most applications that need schemaless data really need it for only a few tables or even just one table. The rest of your data requirements conform to standard table designs. If you account for the extra work and risk of EAV in your project plan, it may be the lesser evil to use it sparingly. But keep in mind that experienced database consultants report that systems using EAV become unwieldy within a year.
If you have nonrelational data management needs, the best answer is to use a nonrelational technology.

He goes on to list out nonrelational technologies like MongoDB.


Since I don't think we'll be adding a nonrelational technology anytime soon (although, as we've talked about, I think there's a legitimate argument to add one), I don't think it's a good idea to give up so many features of a relational database in order to have fewer database tables, and initially a simpler query that, in the future will be expanded into a more complicated dataset (as expressed in T193449#4229434) that will result in queries that will be more complicated than the fixed queries set by Option B. Therefor, I can only support Option B, since the costs of Option A outweigh the perceived benefits.

Also, Option B adheres to the rule of least power.

Reedy added a subscriber: Reedy.May 28 2018, 7:17 PM

You can't enforce referential integrity

FWIW, we don't do DB enforced FK on mysql/mariadb in MW

FWIW, we don't do DB enforced FK on mysql/mariadb in MW

Yeah I know. :( sad day.

FWIW, I prefer option A if it seems feasible from a performance POV. Tagged for DBA input.

Honestly, A feels more manageable. One question: do you intend to have more than one options row corresponding to a block row?

dmaza added a comment.EditedMay 30 2018, 8:54 PM

Honestly, A feels more manageable. One question: do you intend to have more than one options row corresponding to a block row?

Yes, a user can be blocked from more than one page/category/namespace.

Now that you said options. They are not really options, they are restrictions I'd say. So an alternative name to Option A table would be ipblocks_restrictions. Unless we are thinking about this table as a more "all purpose" solution in which case ipb_options is a better fit.

dbarratt added a comment.EditedMay 30 2018, 8:58 PM

Now that you said options. They are not really options, they are restrictions I'd say. So an alternative name to Option A table would be ipblocks_restrictions. Unless we are thinking about this table as a more "all purpose" solution in which case ipb_options is a better fit.

Don't forget that you proposed using the the ipb_options table for the UA string T193449#4229434 :)

EDIT: Doh! just read that "all purpose" part again. :)

Discussion here T104099 makes my argument stronger for Option A

Discussion here T104099 makes my argument stronger for Option A

Why?

jcrespo added a subscriber: jcrespo.Jun 1 2018, 1:32 PM

I that you are reading and quoting my former colleague Bill Karwin. He is normally right about all of this. However, the first rule in the query optimization club is that you optimize -and normally, optimized design normally means one that creates simple queries- for the most common operation you are going to do- normally, a single type of read queries (SELECTs). I don't see which query(ies) will be the most common, and without knowing that, I cannot provide any opinion on the topic.

I don't expect having prepared code in advance, but a plan of the read operations to be performed would be a top priority. E.g., "ON edit, the following will be checked" "on edit of a category, the following attributes will be needed for user X" "on every read, all attributes for users that start with the letter A will be queried" etc. While thinking about maintenance and extensibility for the future is a wanted feature, the priority is to analyze and optimize for the former. Design decisions will be decided from that, and not the other way around.

FWIW, we don't do DB enforced FK on mysql/mariadb in MW

It is not sad, it is a result of the combination of history (MySQL 15 years ago didn't support that well) and the fact that we need to server 20000-30000 queries per second per server, and it is cheaper (performance wise) to do that on application side (multiple-table checks, or in general, safer transactional operation can create high contention under the high load we have to support; it is a non-issue for 99% of the applications out there). A 1-second row lock, for example, can bring down a server.

dmaza added a comment.Jun 1 2018, 8:46 PM

I don't expect having prepared code in advance, but a plan of the read operations to be performed would be a top priority. E.g., "ON edit, the following will be checked" "on edit of a category, the following attributes will be needed for user X" "on every read, all attributes for users that start with the letter A will be queried" etc. While thinking about maintenance and extensibility for the future is a wanted feature, the priority is to analyze and optimize for the former. Design decisions will be decided from that, and not the other way around.

Like you said, it is hard to foresee in advance exactly what queries we'll need to change. From the top of my head, block checks will remain the same and only if a block is found then we'll need to go after the restrictions. So no big changes there. FWIW block checks are selects to ipblocks filtering by ipb_address or ipb_id.
The only instance where we would need to pull everything altogether that might have a performance impact will be the Block List.
I don't know if this is good enough or you need me to come up with more detailed queries.

I defer to others with a better understanding of mediawiki to correct me here.

Do you have any rough estimations of number of writes/reads?
What is the expected growth of this(these) table(s)?

I don't expect having prepared code in advance, but a plan of the read operations to be performed would be a top priority

AFAIK, what happens currently is that, if go to the edit page (or perhaps this happens on session load?), the blocks table is queried for any matching blocks (User or IP Address). If there is a matching block, then the code goes through User::isBlockedFrom(). We'll be modifying this to determine if a user is blocked sitewide (boolean we'll add on the ipblocks table) or if they the page they are editing is on the list of pages that is blocked, or a member of a blocked category, or a part of a blocked namespace.

So we could either load all of the "things" they are blocked from when the block is loaded, or we could do another query inside User::isBlockedFrom() and only query for what we're looking for (i.e. only query for the page name since we know the name of the page they are attempting to edit).

This additional querie(s) would only happen if they have a block that is not sitewide, there aren't that many blocks, but I'll let @dmaza get you numbers. :)

Also, as he mentioned, we might be querying it for the logs and blocklist, however, I think those pages would just tell you if it's a site-wide block or not.

mark added a subscriber: mark.Jun 4 2018, 4:25 PM
dmaza added a comment.Jun 4 2018, 5:01 PM

What is the expected growth of this(these) table(s)?
Do you have any rough estimations of number of writes/reads?

Currently we have an average of ~60 new blocks per hour on enwiki. We expect that number to increase about 50% with those being granular blocks (this is an aggressive estimate). For each granular block we estimate a total of 2-3 restrictions on average (Each restriction represents a row in one of the new tables) but we also expect some users to accrue lengthy lists of granular blocks over time, potentially around 30-40 for particularly mischievous users

To summarize:

  • ipblocks could increase by 50% in number of rows
  • The new table(s) in question will get about ~75 rows per hour (With option A being in one table or 75 rows spread out into 3 tables with option B)

It is worth mentioning that blocks get deleted when they reach their expiration date. In our findings roughly 20% of blocks are indefinite on enwiki.

jcrespo added a comment.EditedJun 4 2018, 5:24 PM

What is the rough timeline to production? Any hard deadlines you know as of now (if you have those)?

It is still unclear to me what kind of information you will be retrieving from those tables, given an a user id or user_text, which rough select queries are you planning to do on those tables? Again, I don't need a perfect SELECT, just an idea of the filter and filelds used for the retrieving results, and how it relates to the existing tables- the above graph, which just a bare diagram isn't 100% clear to me.

Also, are you aware of the actor tables reformatting, which will change how tables store references to users, and probably new deployments should start with that in mind?

dmaza added a comment.Jun 5 2018, 12:06 AM

What is the rough timeline to production? Any hard deadlines you know as of now (if you have those)?

This is one of our quarterly goals. We want to get on this ASAP and DB changes are the first step.

Also, are you aware of the actor tables reformatting, which will change how tables store references to users, and probably new deployments should start with that in mind?

Unless I'm missing something, those changes are irrelevant to this. We are attaching more data to the currently existing blocks (ipblocks table).

It is still unclear to me what kind of information you will be retrieving from those tables, given an a user id or user_text, which rough select queries are you planning to do on those tables? Again, I don't need a perfect SELECT, just an idea of the filter and filelds used for the retrieving results, and how it relates to the existing tables- the above graph, which just a bare diagram isn't 100% clear to me.

Fair enough, I'll try to be more specific. I'll use the queries from the logs in my dev environment.

The first query that will run are to check if there is a valid block that can be applied to a user. These queries are already in place and will probably remain exactly as they are. They are as follow:

-- Logged out users
SELECT  ipb_id,ipb_address,ipb_timestamp,... FROM `ipblocks` 
WHERE ipb_address IN ('10.0.2.2','10.0.2.2') OR ((ipb_range_start  LIKE '0A00%' ESCAPE '`' ) AND (ipb_range_start <= '0A000202') AND (ipb_range_end >= '0A000202'))
-- Logged in users
SELECT  ipb_id,ipb_address,ipb_timestamp,... FROM `ipblocks`
WHERE ipb_address = 'Admin'

There is a third query when we already have a block ID where we will run the select against ipb_id

SELECT  ipb_id,ipb_address,ipb_timestamp,... FROM `ipblocks`
WHERE ipb_id = 1234

So far, this is how blocks currently work against the db and it is unlikely to change.


What we will be adding

If any of the previous queries return a valid block for the current user (logged in or otherwise), we will want to check if the block is a full site block or if the user is only blocked from certain pages, namespaces or categories.

There are many ways of doing this, I haven't properly thought about it but here goes one.
Lets assume a user is blocked with blockID = 1234 and they are editing a page with ID 5678

-- Using Option A
select io_type, io_value from ipblocks_options where ipb_block_id = 1234
or
select count(*) from ipblocks_options where ipb_block_id = 1234 and io_type = 'page' and io_value = 5678
-- Using Option B
select count(*) from ipblocks_page where ip_block_id = 1234 and ip_page_id = 5678

There might be cases where we could benefit from loading ipblocks and the "new tables" in one single query. This would be less frequent and it would be a select to ipblocks filtered by ipb_id with the necessary joins depending on what option we choose (Option A or Option B)

The last bit that I can think of is when loading a list of blocks like https://en.wikipedia.org/wiki/Special:BlockList. In this case we need to know if there are any granular restrictions so we can say so in the list. We probably don't need details as to what page or namespace or whatever the block is linked to, we only need to know if there is a granular restriction or not.

Let me know if there is anything else I can answer for you.

This is one of our quarterly goals. We want to get on this ASAP and DB changes are the first step.

We are supposed to be consulted before work starts on a goal that depends on us. This is the first time we were pinged about this, and we are not even mentioned as dependencies https://www.mediawiki.org/wiki/Wikimedia_Audiences/2017-18_Q4_Goals Because we have lots of other dependencies and work planned 3 months in advance with other teams, we cannot guarantee a quick response. In particular, just the schema changes like the ones you are proposing will take more than a month to deploy to production, even if they where 100% clear and set on stone- having into account we were scheduled and requested other changes before. We also have department travel coming this month, so things are going to slow down in june. Aside from that, DB production changes are normally the last step, and we require code deployed into a release to apply production schema changes.

This doesn't mean we will not help you, I just want to be clear about our scheduling and short-term commitment to the task.

There are many ways of doing this

Both options seem functionally equivalent (from a performance point of view), so the decision should come from you in terms of extensibility and maintainability. Which other functionality could be requested in the future? Think if blocks have to be applied using ids or titles (e.g. will a user be blocked from editing page id 2345, which has the title 0, "Database_design", and can be renamed to 0, "Design_of_Databases" or on title 0, "Database_design", which even if it is moved and another page created on its place, is the conflictive one. This is pure development/design (so not our realm) but will inform what database changes are needed.

Those count(*) (or in general, selection of multiple rows) can be problematic if there is many blocks per user or many rules per block, a limit should be set, to avoid large range scans per user (e.g. in the order of single digit), on every attempt of edit (so edits are not slowed down). It is also highly important that all those (fast) operations happen by primary key selection. Other things like Special pages can be done with secondary indexes. Joins are not a problem, again if they are done by PK.

From a management perspective, while multiple tables are always preferred for things that are different, I am also concerned with an explosive growth, adding 3 tables at the beginning, but later one per type if there can be much more in the future.

jcrespo added a subscriber: Anomie.Jun 5 2018, 7:23 AM

Also, are you aware of the actor tables reformatting, which will change how tables store references to users, and probably new deployments should start with that in mind?

Unless I'm missing something, those changes are irrelevant to this. We are attaching more data to the currently existing blocks (ipblocks table).

I would definitely ask @Anomie and ask if there are plans to redo ipblocks using actors rather than ips+usernames/ids so you can coordinate on changes on that table.

If any of the previous queries return a valid block for the current user (logged in or otherwise), we will want to check if the block is a full site block or if the user is only blocked from certain pages, namespaces or categories.

Those count(*) (or in general, selection of multiple rows) can be problematic if there is many blocks per user or many rules per block, a limit should be set, to avoid large range scans per user (e.g. in the order of single digit), on every attempt of edit (so edits are not slowed down). It is also highly important that all those (fast) operations happen by primary key selection. Other things like Special pages can be done with secondary indexes. Joins are not a problem, again if they are done by PK.

There is a possibility that they do not have a sitewide block nor do they have any granular blocks (See T104099). For this reason, I would add a boolean (tinyint) flag to the ipblocks table. That way we can skip the count queries completely if it is a sitewide block. It also would allow the users to switch between site-wide and granular blocks without losing the pages/categories/namespaces specified. What do you think?

From a management perspective, while multiple tables are always preferred for things that are different, I am also concerned with an explosive growth, adding 3 tables at the beginning, but later one per type if there can be much more in the future.

I don't think there would be any more in the future (the only one I can think of would be attaching multiple actors to a single block, but that seems like a terrible idea and I don't know of any interest what-so-ever to do that). Just to clarify, your preference would be three tables over a single table (in general), correct?

This doesn't mean we will not help you, I just want to be clear about our scheduling and short-term commitment to the task.

I think we can always move forward and change it if we need to later (before deployment to production). It might incur more work on our team, but I think that's better than being blocked or forcing your team to prioritize the task.

I would definitely ask @Anomie and ask if there are plans to redo ipblocks using actors rather than ips+usernames/ids so you can coordinate on changes on that table.

I think this might be outside of scope for our project (since we would be attaching this data to a block, not an actor), but that's really good to know, I didn't know that would be changing in the future.

Anomie added a comment.Jun 5 2018, 3:01 PM

Also, are you aware of the actor tables reformatting, which will change how tables store references to users, and probably new deployments should start with that in mind?

Unless I'm missing something, those changes are irrelevant to this. We are attaching more data to the currently existing blocks (ipblocks table).

I would definitely ask @Anomie and ask if there are plans to redo ipblocks using actors rather than ips+usernames/ids so you can coordinate on changes on that table.

I don't see anything here that would be affected by the actor table change. They're not proposing to alter the ipblocks table at all, just to add one or more tables tying ipb_id to other things (page_id, cat_id, namespace number).

At the moment there are no plans to replace ipb_address/ipb_user with actor table references, since ipb_address can contain ranges which aren't "actors". Although in the future replacing ipb_user with ipb_actor and using it for single-IP blocks might be a possibility.

The schema change to add the ipb_by_actor field (identifying the blocker as an actor) is already merged and completed on the cluster.

Data types here are illustrative, and open to change.

Personally, I think it would be better to avoid using an ENUM in the database. Support isn't very consistent across the dbms's we support, ordering behavior can be unexpected, and adding new values to the enum requires a schema change.

Consider using NameTableStore (with a name table) instead.

With either option, I note there's no allowance for blocking someone from a page that doesn't already exist. I've heard a proposal for a "titles" table that would replace various xx_namespace+xx_title columns throughout the schema by giving an ID to any title that was ever referenced whether or not it ever existed, although if anyone wrote a Phab task for it yet I can't find it.

There are many ways of doing this, I haven't properly thought about it but here goes one.
Lets assume a user is blocked with blockID = 1234 and they are editing a page with ID 5678

-- Using Option A
select io_type, io_value from ipblocks_options where ipb_block_id = 1234
or
select count(*) from ipblocks_options where ipb_block_id = 1234 and io_type = 'page' and io_value = 5678
-- Using Option B
select count(*) from ipblocks_page where ip_block_id = 1234 and ip_page_id = 5678

Why count(*)? It would seem these would work just as well as existence checks, e.g.

select 1 from ipblocks_options where ipb_block_id = 1234 and io_type = 'page' and io_value = 5678 LIMIT 1;
select 1 from ipblocks_page where ip_block_id = 1234 and ip_page_id = 5678 LIMIT 1;

If you get back a row they're blocked, and if you don't get a row they're not.

The PKs I imagine would be on these tables would result in the count being 0 or 1 anyway.

In particular, just the schema changes like the ones you are proposing will take more than a month to deploy to production,

Maybe I missed it, but it looks like the main proposal here is only adding tables, no changes to existing tables?

There was mention of a possible alter in T193449#4229434, although I think the ipblocks_options alternative mentioned there may be better than adding a text field to ipblocks. And there was mention of a possible alter in T193449#4257526.

^Basically, +1 what anomie says

I've heard a proposal for a "titles" table

That could be me, but I will not push for it until MCR, revision, actor and other important changes are finished.

I already mentioned that potential need for ns,title references on a previous comment. Your comment (e.g. avoiding creating a new page with a specific title) is a more concrete reason to allow that.

Thanks for your comments.

With either option, I note there's no allowance for blocking someone from a page that doesn't already exist. I've heard a proposal for a "titles" table that would replace various xx_namespace+xx_title columns throughout the schema by giving an ID to any title that was ever referenced whether or not it ever existed, although if anyone wrote a Phab task for it yet I can't find it.

We discussed this when we were in Florida and we'd rather store the page id (and have the block persist even if the page is moved) than the page title (and block creating pages of that name). The user would need to block the namespace in order to block creating pages (of any title). If I'm understanding this correctly, the ID of a page does not change if the page is moved, but the title will obviously change, correct?

Anomie added a comment.Jun 5 2018, 5:09 PM

As long as it's planned for rather than something overlooked, I don't have an issue with it.

Moves preserve the page ID, yes. The redirect left behind, if any, will have a new page ID.

dbarratt added a comment.EditedJun 5 2018, 7:03 PM

Well, as I mentioned in T193449#4237241, I prefer Option B.

It seems like the greatest value in Option A is being able to make future "schema" changes without having to involve the schema change process. While this is inherent to any schemaless solution, it feels wrong to implement an antipattern in order to avoid an organizational process.

The other value would be if we had a dynamically changing list of fields, but I don't foresee having that.

Also, as mentioned in T193449#4257526, I think we should add a boolean (tinyint) field to ipblocks to identify if the block is sitewide or not.

@dbarratt: What, in your view, is the downside of using Option A?

@dbarratt: What, in your view, is the downside of using Option A?

tl;dr: The downside of Option A is that you have to give up effectively all of the features of a relational database.


Well I don't know if I can express it better than Bill (See T193449#4237241) but in my own words:

I don't think there is anything inherently wrong with a schemaless solution (or EAV or key-value or whatever you want to call it) that is Option A. It has it's place and for certain applications I think it's a really good idea. However, I do think that schemaless solutions do not belong in relational databases. If we want to use a schemaless solution, then I'm 100% in favor of doing that, and I think we should evaluate those solutions, but as far as I know, there would be a lot of resistance to adding a non-relational database (like MongoDB) to our stack.

That ideology aside, I think the biggest problem with Option A is that:

  • It creates an ambiguous relationship between blocks and categories/pages/namespaces. I know that we don't use Foreign Keys right now, but I think a "proper" database design would allow them to be added at any point in the future. With an ambiguous relationship, there effective is no relationship. The relationship only exists within the application, rather than within the database itself.
  • The io_value will have to be a varchar rather than an int so that other things can be stored in this table (See T193449#4229434). This means there wont be any data typing for the the value and everything will be string.
  • If we store other single-value items in the table (like User Agent) and we wanted to make those things required (or unique) later, we wouldn't be able to do that on the database level (we'd only be able to do it in code).
  • It violates the rule of least power in which we are shift the "work" from the database (least power) to the application (most power).
  • It allows us to avoid a schema change in the future. I think that's a fantastic argument to have a schemaless solution, but that's not a very good reason to implement a schemaless design in a relational database.

I told @dmaza that this isn't a hill I'm going to die on. :) so if I'm the only one that is pro Option B then I can agree to move forward with Option A. I just wanted to make sure the dissent was documented. :)

dmaza added a comment.Jun 6 2018, 5:05 PM

From a management perspective, while multiple tables are always preferred for things that are different, I am also concerned with an explosive growth, adding 3 tables at the beginning, but later one per type if there can be much more in the future.

I have to agree with @dbarratt that Option B is the more solid implementation and like you said, multiple tables is preferred for things that are different. So far, we don't see any other types being added but there are a few things that we are discussing that will probably alter ipblocks in the future (probably a different discussion)

The reason I proposed Option A is because it is a common design that we currently use and it will provide for much more flexibility. It is something to consider as an alternative with all it's downsides and reputation. Option A can become useful in a way that we won't need to mess around with ipblocks if we want to add other restrictions like "block users from sending emails" or "thanking" or even "user agent blocks".

Personally, I think it would be better to avoid using an ENUM in the database. Support isn't very consistent across the dbms's we support, ordering behavior can be unexpected, and adding new values to the enum requires a schema change.
Consider using NameTableStore (with a name table) instead.

Yup, valid. io_value will not be an int either.

Why count(*)? It would seem these would work just as well as existence checks, e.g.

select 1 from ipblocks_options where ipb_block_id = 1234 and io_type = 'page' and io_value = 5678 LIMIT 1;
select 1 from ipblocks_page where ip_block_id = 1234 and ip_page_id = 5678 LIMIT 1;

If you get back a row they're blocked, and if you don't get a row they're not.
The PKs I imagine would be on these tables would result in the count being 0 or 1 anyway.

That's an oversight on my end. Thanks for correcting me.

In particular, just the schema changes like the ones you are proposing will take more than a month to deploy to production,

Maybe I missed it, but it looks like the main proposal here is only adding tables, no changes to existing tables?
There was mention of a possible alter in T193449#4229434, although I think the ipblocks_options alternative mentioned there may be better than adding a text field to ipblocks. And there was mention of a possible alter in T193449#4257526.

Yes and yes. We are only adding tables as part of this new feature.


At this point I feel like we are all in the same page regarding what we are doing and the expected functionality out of these DB changes so How do we move forward from here? How do we reach consensus and/or what do we need to do to get us there?

jcrespo added a comment.EditedJun 6 2018, 5:13 PM

I feel like we are all in the same page

This is all DBAs ask at this point, the next milestone will be when db-related production patches are available for review. We ask to be involved before db design decisions or code is done so that it is impossible for us to say "no" on review about functionality we never heard about (e.g. because major blockers that are completely no-go, like the need for extra hardware or impossible architecture)- but you don't really need us pre-approving anything. Just start coding :-)

This was asked and documented to avoid misunderstandings, there is no "hard process of approval", just an early heads up. Normally we just want to be added to ongoing conversations on tickets and we will shout if we see something really wrong. 0:-)

As anomie said, new tables are not schema changes, you just need a deployer or deployment rights to do those (note , however we still need another heads up beforehand to check no private data is going into labs for production deployment).

jcrespo moved this task from Triage to Done on the DBA board.Jun 6 2018, 5:13 PM

Yes and yes. We are only adding tables as part of this new feature.

As I mentioned before (See T193449#4257526) I think we should modify ipblocks with a boolean (tinyint) field that stores if the block is site-wide or not. That way an admin's granular blocks are not blown out when switching to a site-wide block. This is a product requirement right @TBolliger?

ipblocks should be easy to modify, but for schema changes we require a very specific workflow to avoid mistakes: https://wikitech.wikimedia.org/wiki/Schema_changes#Workflow_of_a_schema_change You will, however, be added to the priority queue of ongoing schema changes: https://phabricator.wikimedia.org/project/view/1494/

Yes and yes. We are only adding tables as part of this new feature.

As I mentioned before (See T193449#4257526) I think we should modify ipblocks with a boolean (tinyint) field that stores if the block is site-wide or not. That way an admin's granular blocks are not blown out when switching to a site-wide block. This is a product requirement right @TBolliger?

Correct — we expect there to be overlapping blocks in the near future (2018) which will allow a granular block to have a different expiration from a full site block.

Anomie added a comment.Jun 6 2018, 5:56 PM

Correct — we expect there to be overlapping blocks in the near future (2018) which will allow a granular block to have a different expiration from a full site block.

FYI: If you want overlapping blocks, you'll not only have to add the proposed boolean field but you'll also have to include it in the ipb_address unique index.

FYI: If you want overlapping blocks, you'll not only have to add the proposed boolean field but you'll also have to include it in the ipb_address unique index.

Yep. We figured that would be the case. Supporting multiple blocks is a separate project (See T194697).

I have to agree with @dbarratt that Option B is the more solid implementation and like you said, multiple tables is preferred for things that are different.

Except that they aren't actually that different. Currently all 3 proposed block option tables have the exact same schema. So according to the Rule of Three, they should probably be consolidated, i.e. Option A. Do any of the future use cases require a different schema?

Except that they aren't actually that different. Currently all 3 proposed block option tables have the exact same schema. So according to the Rule of Three, they should probably be consolidated, i.e. Option A. Do any of the future use cases require a different schema?

I think the problem is is that Option A gets vastly expanded (i.e. it will start storing User Agent strings, etc.)

dmaza added a comment.Jun 6 2018, 7:26 PM

We've decided to start the work with a modified version of Option A.

ipblocks_restrictions
+-------------+-----------+
| Field       | Type      |
+-------------+-----------+
| ir_block_id | int       |
| ir_type     | varbinary |
| ir_value    | int       |
+-------------+-----------+

ir_type will be one of three values (page, namespace, category) and it will unlikely support anything else in a foreseeable future
We'll see how this goes and make any adjustments as we progress.

Oh, and this T193449#4261799

ir_type | varbinary

No, don't do this. If anomie has a veto on enums that's fine to me, but I have a veto on varbinaries/varchars for 3-option storage, and that is a hard no for me. You were told by anomie already what is the good alternative to enums. Anything else can be done, but don't store the same string all the time- we are literally moving away from that model everywhere else right now.

dbarratt added a comment.EditedJun 7 2018, 2:20 PM

No, don't do this. If anomie has a veto on enums that's fine to me, but I have a veto on varbinaries/varchars for 3-option storage, and that is a hard no for me. You were told by anomie already what is the good alternative to enums. Anything else can be done, but don't store the same string all the time- we are literally moving away from that model everywhere else right now.

Stepping back a bit... we came to the agreement in T193449#4262252 after talking to @kaldari who provided this logic (if I am getting this correct):
Since we do not use foreign keys, then the schema in Option B is identical and should be combined.

I agreed, as long as the scope of the table was restricted too... well restrictions (i.e. ir_value should be an int).

We talked about the ir_type a bit, if we ought not use ENUM (which I'm not a fan of anyways), then the alternative would be to use a key table. However, since don't use foreign keys, then there isn't really a point for the key table to exist because it's not enforcing anything and we would never be querying it.

I would prefer to have semantic keys, however, if that's a hard no and we should use a tinyint that is fine, but I don't think the mapping for the int → key should exist in the database at all (following the same logic as above), it should only exist in code (which is the only place it's going to be used anyways).

To clarify the positions: Option B has semantic meaning in the database and the modified Option A does not. While Option B duplicates the schema in order to maintain the semantic meaning.

it should only exist in code

That is ok to me.

Anomie added a comment.Jun 7 2018, 2:29 PM

I don't claim a veto, but the cons seem to outweigh the pros. NameTableStore is basically a "manual" implementation of enums that will work the same on all five dbms's we support, and the reason for the ordering working as it does is clear.

We talked about the ir_type a bit, if we ought not use ENUM (which I'm not a fan of anyways), then the alternative would be to use a key table. However, since don't use foreign keys, then there isn't really a point for the key table to exist because it's not enforcing anything and we would never be querying it.

It takes MySQL 1 byte to store a tinyint, 2 bytes to store a smallint, and 4 to store a regular int. Storing "page" in a varbinary(<=255) column takes 5 bytes, "category" takes 9 bytes, and "namespace" takes 10 bytes. Multiply that by millions of rows and copying into indexes and it adds up. I wouldn't be surprised if ints are faster to read/write than varbinary too.

I would prefer to have semantic keys, however, if that's a hard no and we should use a tinyint that is fine, but I don't think the mapping for the int → key should exist in the database at all (following the same logic as above), it should only exist in code (which is the only place it's going to be used anyways).

Having the name table in the database keeps the semantic keys in the database, but int-valued constants in PHP work fine for a case like this where the set of possible values isn't likely to be arbitrarily extended. Go for it.

jcrespo added a comment.EditedJun 7 2018, 2:36 PM

I don't claim a veto, but the cons seem to outweigh the pros

I meant veto as a figure of speech (we highly discourage a particular implementation by default). I think you are mostly worried about the maintainability of the code and avoiding bugs, plus support of 3rd party users. I am mostly worried about the performance, maintenance and storage requirements of WMF needs. The good news is that we can have both- and I think we agree on the basics! :-)

Multiply that by millions of rows and copying into indexes and it adds up

For context for others, on top of what anomie says:

Multiply that again by 200 (number of instances we have of mysql -and double that for RAID 10- and note that the main issue is what doesn't fit on memory will go to disk, and that is around a 1000x speed up for the worse case (not a disk space problem).

dmaza added a comment.Jun 7 2018, 3:50 PM

tinyint and PHP constant it is then. Thank you all for the help and prompt response.

dmaza closed this task as Resolved.Jun 7 2018, 3:50 PM
dmaza moved this task from In progress to Done on the Anti-Harassment (AHT Sprint 23) board.
Vvjjkkii renamed this task from Draft a proposal for granular blocks table schema(s), submit for DBA review to qxdaaaaaaa.Jul 1 2018, 1:13 AM
Vvjjkkii reopened this task as Open.
Vvjjkkii removed dmaza as the assignee of this task.
Vvjjkkii triaged this task as High priority.
Vvjjkkii updated the task description. (Show Details)
Vvjjkkii removed the point value for this task.
Vvjjkkii edited subscribers, added: dmaza; removed: Aklapper.
CommunityTechBot renamed this task from qxdaaaaaaa to Draft a proposal for granular blocks table schema(s), submit for DBA review.Jul 1 2018, 2:06 PM
CommunityTechBot closed this task as Resolved.
CommunityTechBot assigned this task to dmaza.
CommunityTechBot set the point value for this task to 5.
CommunityTechBot updated the task description. (Show Details)
CommunityTechBot edited subscribers, added: Aklapper; removed: dmaza.
CommunityTechBot raised the priority of this task from High to Needs Triage.Jul 3 2018, 1:51 AM
dbarratt updated the task description. (Show Details)Jul 18 2018, 4:52 PM
dbarratt updated the task description. (Show Details)