Page MenuHomePhabricator

Create the a new DB schema to store the worklist data
Closed, ResolvedPublic

Description

AC: Create the new schema below
cc: @Iflorez

CREATE TABLE /*_*/ce_invitation_lists (
  ceil_id BIGINT UNSIGNED AUTO_INCREMENT NOT NULL,
  ceil_name VARBINARY(255) NOT NULL,
  ceil_event_id BIGINT UNSIGNED DEFAULT NULL,
  ceil_status INT NOT NULL,
  ceil_user_id INT NOT NULL,
  ceil_wiki VARBINARY(64) NOT NULL,
  ceil_created_at BINARY(14) NOT NULL,
  INDEX ce_invitation_lists_event_id (ceil_event_id),
  INDEX ce_invitation_lists_wiki (ceil_wiki),
  INDEX ce_invitation_lists_user_wiki (ceil_user_id, ceil_wiki),
  PRIMARY KEY(ceil_id)
) /*$wgDBTableOptions*/;
CREATE TABLE /*_*/ce_worklist_articles (
  cewa_id BIGINT UNSIGNED AUTO_INCREMENT NOT NULL,
  cewa_page_id INT UNSIGNED NOT NULL,
  cewa_page_title VARBINARY(255) NOT NULL,
  cewa_ceil_id BIGINT UNSIGNED NOT NULL,
  INDEX ce_worklist_articles_ceil_id (cewa_ceil_id),
  PRIMARY KEY(cewa_id)
) /*$wgDBTableOptions*/;
CREATE TABLE /*_*/ce_invitation_list_users (
  ceilu_id BIGINT UNSIGNED AUTO_INCREMENT NOT NULL,
  ceilu_user_id INT NOT NULL,
  ceilu_ceil_id BIGINT UNSIGNED NOT NULL,
  ceilu_score INT NOT NULL,
  INDEX ce_invitation_list_users_ceil_id (ceilu_ceil_id),
  INDEX ce_invitation_list_users_ceilu_user_id (ceilu_user_id),
  PRIMARY KEY(ceilu_id)
) /*$wgDBTableOptions*/;

The schema for ce_worklist_articles table stores both cewa_page_id and cewa_page_title to handle page deletions and moves effectively.

This approach ensures that:

  • Page Deletions: We can display redlinks for deleted pages using the cewa_page_title.
  • Page Moves: We track pages across moves with the cewa_page_id.

While this takes up more space, it ensures data integrity and accurate tracking.

Exclusion of Page Namespace

  • We are not storing the page namespace because we are only supporting mainspace pages for now.

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
cmelo updated the task description. (Show Details)
cmelo renamed this task from [ Draft ] Create the a new DB schema to store the worklist data to Create the a new DB schema to store the worklist data.Jun 4 2024, 6:25 AM
cmelo updated the task description. (Show Details)

Change #1039331 had a related patch set uploaded (by Cmelo; author: Cmelo):

[mediawiki/extensions/CampaignEvents@master] Add new schema to store worklists data

https://gerrit.wikimedia.org/r/1039331

Hi @Iflorez, wondering what do you think about the new DB schema to store the worklists data, does it sound good to you?

Change #1039331 merged by jenkins-bot:

[mediawiki/extensions/CampaignEvents@master] Add new schema to store worklists data

https://gerrit.wikimedia.org/r/1039331

Change #1039751 had a related patch set uploaded (by Daimona Eaytoy; author: Daimona Eaytoy):

[mediawiki/extensions/CampaignEvents@master] Reapply "Add new schema to store worklists data"

https://gerrit.wikimedia.org/r/1039751

@cmelo is this task meant to be for DBAS? There is no mention to the storage you want to use, is this request for a MariaDB database in our misc sections? If so please follow: https://wikitech.wikimedia.org/wiki/MariaDB#Database_creation_template

@cmelo is this task meant to be for DBAS? There is no mention to the storage you want to use, is this request for a MariaDB database in our misc sections? If so please follow: https://wikitech.wikimedia.org/wiki/MariaDB#Database_creation_template

Hi @Marostegui, thank you for bringing up that point.
We will store this data on X1.
I've added the DBAS tag to seek feedback on the database structure and indexes.
Apologies for not providing this information earlier.

@Ladsgroup could you review this request? Thanks!
@cmelo a quick glance shows that you are not having any single PK on any of the tables, is that correct? Or should we assume cew_id cewa_id cewuba_id would be PK for each table?
If you could provide the full "show create table" for each table, that would be clearer to review.

@Ladsgroup could you review this request? Thanks!
@cmelo a quick glance shows that you are not having any single PK on any of the tables, is that correct? Or should we assume cew_id cewa_id cewuba_id would be PK for each table?
If you could provide the full "show create table" for each table, that would be clearer to review.

Yes, you can assume cew_id cewa_id cewuba_id would be PK for each table.
Here is where you can see the "create tables":
https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CampaignEvents/+/1039751/3/db_patches/mysql/tables-generated.sql

Hi @ifried, @VPuffetMichel, in this new schema to store the worklist date, I added some columns to relate a user with articles in a worklist.

This means that we will be able to know how many articles a user appears in, on a given worklist.

There is a question in the code review to determine if this is useful or not, so I am wondering if we want this or not.

Storing data this way allows us to inform organizers how many articles a user has appeared in, on their worklist.

Do you think this is valuable?

It is valuable to understand how many times a user appears in various invitation lists.

I do not know if knowing how many times a user appears in a list of articles (a worklist) would be useful. The algorithm will compute the invitation list based on several criteria and this one is only one of them.

I might be missing something here.

+1 to what Val wrote. I think the following things are useful:

  • How many invitation lists are generated? This is useful, since we want to collect data on usage of the feature, which we have documented as a metric to track in T365292.
  • How many organizers have generated more than one Invitation List? This is useful, since we want to collect data on usage of the feature, which we have documented as a metric to track in T365292.
  • How many times does a user appear in various invitation lists? This is useful because, as we are currently discussing in T363022, we want to prevent a situation in which the same people are showing up in invitation lists all the time and are therefore getting over-invited, which could lead to frustration and burnout.

@cmelo: Can you explain a bit more about the thinking behind 'how many times a user appears in a list of articles (worklist)'? Isn't it the work of the scoring model (rather than the storage of the worklist data) to look at things like that?

Update, @cmelo: Perhaps you are asking because of this comment in T363022: "Maybe for each worklist, you also collect a small random sample of articles (e.g., 20) and filter out anyone who also appears heavily in those?"

Is this the inspiration for the question?

Thanks @VPuffetMichel and @ifried.

@cmelo: Can you explain a bit more about the thinking behind 'how many times a user appears in a list of articles (worklist)'? Isn't it the work of the scoring model (rather than the storage of the worklist data) to look at things like that?

Sure, And yes I think the score take it into account, I added this piece of information thinking that we may want to know which articles returned each users, it would stored the data like:
user_id | article_ID | worklist_ID | score

So let's say you have created a worklist with 3 articles

  • Article A
  • Article B
  • Article C

And the return was

  • user 1
  • user 2
  • user 3
  • user 4
  • user 5

In order to know which article returned each user, we would need to store the user ID and article ID, but if we do not plan to use this information we can remove it (I mean do not store the article id on the worklist_users table).

This information could be used for:

  • Organizers would be able to see which user an article returned
    • E.g:
      • Article A returned users [ user 1, user 2, user 3]
      • Article B returned users [ user 1, user 4]
      • Article C returned users [ user 1, user 5]
  • We would be able to say user 1 have edits on articles: A,B,C

All this is take into account when mounting the score, but if we want do give more details like above we would need to store it, of course it means storing more data rather than only the user id, worklist id and score, but I just want to confirm that we don't want to have this level of detail, and if we don't we can remove this article_id column and just store:
user_id | worklist_ID | score

My recomendation would be to not store it, unless we really plan to show the details I mention above.

Ahhhh, thank you for clarifying, @cmelo!

In the earlier versions of Event Invitation designs, we did show which article generated which user in the invitation list. However, after consulting with @Sadads, we heard that this would probably not be very useful, whereas more data about the users themselves (such as their edit count) would be more useful. Now, I do not know if we will get a request in the future to show which users came from which article, but we have so far discussed that it may not be very valuable.

However, there is one use case that I can think of, which is the following: Let's say that an organizer does not like their Invitation List because they think that many of the users in the list don't seem like good candidates to invite. For this reason, they may want to generate a new invitation list with a revised worklist. In this case, it would be helpful to know which users came from which articles, so they could say, "Okay, this article DID generate a list of users who I would want to invite, so I'll include that article again, but this article didn't, so I won't include this article." Then, if we later allow users to export an invitation list or to select users to email from the invitation list, they could use that finalized version of the invitation list, rather than needing to pick and choose various users from multiple lists.

This makes me think: Data on which user is from which article will not be useful for the MVP, but it could be useful for later versions, if we provide support for potential actions that an organizer may take after generating an invitation list.

What do other people think? Would also love feedback from @gonyeahialam.

Ahhhh, thank you for clarifying, @cmelo!

In the earlier versions of Event Invitation designs, we did show which article generated which user in the invitation list. However, after consulting with @Sadads, we heard that this would probably not be very useful, whereas more data about the users themselves (such as their edit count) would be more useful. Now, I do not know if we will get a request in the future to show which users came from which article, but we have so far discussed that it may not be very valuable.

OK, thank you, In this case I will remove it.

However, there is one use case that I can think of, which is the following: Let's say that an organizer does not like their Invitation List because they think that many of the users in the list don't seem like good candidates to invite. For this reason, they may want to generate a new invitation list with a revised worklist. In this case, it would be helpful to know which users came from which articles, so they could say, "Okay, this article DID generate a list of users who I would want to invite, so I'll include that article again, but this article didn't, so I won't include this article." Then, if we later allow users to export an invitation list or to select users to email from the invitation list, they could use that finalized version of the invitation list, rather than needing to pick and choose various users from multiple lists.

This makes me think: Data on which user is from which article will not be useful for the MVP, but it could be useful for later versions, if we provide support for potential actions that an organizer may take after generating an invitation list.

What do other people think? Would also love feedback from @gonyeahialam.

I think it would be nice to have, we can add this data later when and if needed.

Also, curious to hear other folks thoughts on this.

cc: @Daimona @MHorsey-WMF @gonyeahialam @VPuffetMichel

Once the discussion is over on the team side, let me know to review the schema, thank you!

Yes, @cmelo, let's not store data on which article is associated with which user in the invitation list for the MVP, but we can consider it for later if we hear requests from organizers. Thank you!

@cmelo Another thing I realized while we were discussing T356683: we need to store the wiki where the invitation list was generated. This would allow us to only display local invitation lists in Special:MyInvitationLists. The current schema only stores the wiki of the event page, but that field might be absent.

@cmelo The current schema only stores the wiki of the event page, but that field might be absent.

(Small correction: it doesn't. We're only storing the event ID, which can be used to derive the event wiki. Still, the event ID is optional, so my original comment stands)

@cmelo The current schema only stores the wiki of the event page, but that field might be absent.

(Small correction: it doesn't. We're only storing the event ID, which can be used to derive the event wiki. Still, the event ID is optional, so my original comment stands)

Yes, thank you!!!
I have just added it on gerrit, and also update the AC
Let me know what you think, and then we can mark this as ready for DBAs to review @Daimona @MHorsey-WMF

I was going through this again, and I have a few questions/thoughts:

  • AIUI, "worklist" is just the list of articles. "Invitation list" is the list of users to invite, and we're also using this term to identify the combination of worklists + users. But then, shouldn't ce_worklists actually be called ce_invitation_lists? ce_worklists_articles is maybe fine, though I'm unsure about the plural in "worklists". ce_worklists_users would presumably be ce_invitation_list_users?
  • Who should be able to view a given invitation list? I thought it would only be the person who created the list itself (not the event). If so, we should store the creator of the list in the ce_worklists table, else we don't know who each list belongs to.
  • Thinking about indices. In SpecialMyInvitationLists, we'll need to filter by user and, optionally, wiki. So, I think we'll need a composite index on user + wiki (in this order). I don't think the application needs an index on the wiki alone, but unsure about whether it's needed for analytics. The other indices seem sufficient (I think we'll only need primary keys and the indices on invitation list IDs).
  • There's also the page ID question that was discussed in code review, which should go with the DBA review.

I was going through this again, and I have a few questions/thoughts:

  • AIUI, "worklist" is just the list of articles. "Invitation list" is the list of users to invite, and we're also using this term to identify the combination of worklists + users. But then, shouldn't ce_worklists actually be called ce_invitation_lists? ce_worklists_articles is maybe fine, though I'm unsure about the plural in "worklists". ce_worklists_users would presumably be ce_invitation_list_users?
  • Who should be able to view a given invitation list? I thought it would only be the person who created the list itself (not the event). If so, we should store the creator of the list in the ce_worklists table, else we don't know who each list belongs to.
  • Thinking about indices. In SpecialMyInvitationLists, we'll need to filter by user and, optionally, wiki. So, I think we'll need a composite index on user + wiki (in this order). I don't think the application needs an index on the wiki alone, but unsure about whether it's needed for analytics. The other indices seem sufficient (I think we'll only need primary keys and the indices on invitation list IDs).
  • There's also the page ID question that was discussed in code review, which should go with the DBA review.

Thank you @Daimona these are all good points, I think I have covered them all in this patch

Thank you. Is this ready for DBA review then?

Thank you. Is this ready for DBA review then?

Yes

Once the discussion is over on the team side, let me know to review the schema, thank you!

Hi @Ladsgroup, this is ready to review now, thank you!

From the schema point of view: My only note is that cewa_page_wiki seems to be redundant. If you have FK to ce_invitation_lists and that table has the wiki, you don't need to store it again here.

Overall, I would like to know how many rows you are estimating to be there. How much read is going to happen (in terms of selects)? How much write (update/insert/delete)?

And correct me if I'm wrong, but the plan is to deploy this to x1?

@cmelo I just realized that the ce_invitation_list_users table does not currently have a primary key.

@cmelo I just realized that the ce_invitation_list_users table does not currently have a primary key.

@Daimona, yes, I removed the primary key from ce_invitation_list_users because of what you mention, and I also think it is not needed:
"I think we'll only need primary keys and the indices on invitation list IDs"

https://phabricator.wikimedia.org/T366354#9890136

FWIW, having PK is always encouraged, it helps with data integrity and replication but it doesn't need to be a dedicated auto_increment id column. If tables are getting rather big, give it a dedicated column since it helps with backups and checksum runs. If it's not, you can group a couple of integer columns that could make a unique index and call it PK. If table is going to be monstrous (like pagelinks level), then adding column costs outweighs the backup benefits. i.e.

  • If small: Group a couple of integer columns and call it PK
  • If large: Give it a dedicated auto_increment column
  • If really large: Back to number 1, group of couple of integer columns and call it PK

FWIW, having PK is always encouraged, it helps with data integrity and replication but it doesn't need to be a dedicated auto_increment id column. If tables are getting rather big, give it a dedicated column since it helps with backups and checksum runs. If it's not, you can group a couple of integer columns that could make a unique index and call it PK. If table is going to be monstrous (like pagelinks level), then adding column costs outweighs the backup benefits. i.e.

  • If small: Group a couple of integer columns and call it PK
  • If large: Give it a dedicated auto_increment column
  • If really large: Back to number 1, group of couple of integer columns and call it PK

Thank you @Ladsgroup, this table will get bigger, so I will add the auto_increment id column again

Thanks @Ladsgroup.

From the schema point of view: My only note is that cewa_page_wiki seems to be redundant. If you have FK to ce_invitation_lists and that table has the wiki, you don't need to store it again here.

Yes, thanks, the wiki id on invitation list represents the wiki where the worklist was created, the cewa_page_wiki would be used in case we allow multi-wiki worklists meaning that you could add articles from other wikis, but since we don't have plans to do it for now (I have just checked it with @ifried), I will remove it. @Daimona @MHorsey-WMF do you know if there is another reason? If not, we can remove it.

Overall, I would like to know how many rows you are estimating to be there. How much read is going to happen (in terms of selects)? How much write (update/insert/delete)?

I am checking with folks on this, I will get back to you ASAP

And correct me if I'm wrong, but the plan is to deploy this to x1?

Yes, it is!

Also wondering what is your opinion about this comment on gerrit:

@Daimona, yes, I removed the primary key from ce_invitation_list_users because of what you mention, and I also think it is not needed:
"I think we'll only need primary keys and the indices on invitation list IDs"

I think there was a misunderstanding. I meant that in terms of indices, we will only need the primary keys, and also indices on the invitation list ID, for each table.

From the schema point of view: My only note is that cewa_page_wiki seems to be redundant. If you have FK to ce_invitation_lists and that table has the wiki, you don't need to store it again here.

Yes, thanks, the wiki id on invitation list represents the wiki where the worklist was created, the cewa_page_wiki would be used in case we allow multi-wiki worklists meaning that you could add articles from other wikis, but since we don't have plans to do it for now (I have just checked it with @ifried), I will remove it. @Daimona @MHorsey-WMF do you know if there is another reason? If not, we can remove it.

Your understanding is correct. The one thing I don't (didn't?) know is how realistic the plan of allowing interwikis is. I believe we all agree it's not going to be part of the initial version, nor the first batch of improvements, but I thought we'd still do that at some point. If not, the column can be removed, and it can later be backfilled by querying the ce_invitation_lists table, if needed.

@cmelo & @Daimona: We *may* want to do interwiki worklists in the future, but we have no concrete plans. I think it is safe to assume that it is a maybe, but not a definite. The reason why it is that, while some organizers certainly do create interwiki worklists for their events, I do not yet know if/how we will want to accommodate scenarios in which we need to confirm that an organizer has the event organizer right on multiple wikis and then do the work to generate interwiki invitation lists. This is a potential feature for later, depending on the feedback we receive from users after the MVP release, the technical scoping & challenges around such work, and the priority of such a feature enhancement overall in relation to other features. There would be a lot to consider, so it's a TBD, but it's not a priority for now.

If you decide to eventually switch to interwiki articles, I suggest normalizing the wiki field via a NameTableStore backed table but we work on it once we get there (it's not much work)

I think there was a misunderstanding. I meant that in terms of indices, we will only need the primary keys, and also indices on the invitation list ID, for each table.

Ok, thank you

Overall, I would like to know how many rows you are estimating to be there. How much read is going to happen (in terms of selects)? How much write (update/insert/delete)?

Hi @Ladsgroup we expect to have about 200 invitation lists per month, but it can increase and probably will over time.
It means:

inserts

  • 200 inserts on ce_invitation_lists
  • 200 * 300 = 60,000 inserts on ce_worklist_articles (Each invitation list can have a max of 300 articles, so this would be the max scenario given 200 invitation lists per month)
  • 200 * 200 = 40,000 inserts on ce_invitation_lists_users (Each invitation list can have a max of 200 users, so this would be the max scenario given 200 invitation lists per month)

updates

  • The only update we have for now is to update the status of the invitation list from processing to ready, which should happen just once per invitation list.

delete

  • None for now

selects

  • SELECT * FROM ce_invitation_lists, to list all the invitation lists created by an organizer (with pagination), and the number of users of each invitation list, which means a COUNT on ce_invitation_list_users.
  • SELECT * FROM ce_invitation_list_users , to display the users in an invitation list when the organizer opens the invitation list results page, and also a SELECT * FROM ce_worklist_articles to show all the articles of the invitation list.

Overall this seems fine to me. Something to consider is that this means it will grow without bound and that's not a problem now but it'd be nice if you have some plans to remove really old invitation list articles (e.g. older than five years) to keep it in bound.

Overall this seems fine to me. Something to consider is that this means it will grow without bound and that's not a problem now but it'd be nice if you have some plans to remove really old invitation list articles (e.g. older than five years) to keep it in bound.

Thank you, @Ladsgroup, we do have plans to delete old invitation list after a certain amount of time, we are still defining it, but it will be something between 3 and 5 years.

Sounds good to me!

Hi @Ladsgroup we did a small change on ce_worklist_articles table, it will store cewa_page_id and cewa_page_title to handle page deletions and moves effectively, and cewa_page_namespace was removed because we will only support the main namespace for now.

This approach ensures that:

  • Page Deletions: We can display redlinks for deleted pages using the cewa_page_title.
  • Page Moves: We track pages across moves with the cewa_page_id.

While this takes up more space, it ensures data integrity and accurate tracking.

Exclusion of Page Namespace

  • We are not storing the page namespace because we are only supporting mainspace pages for now.

Hi @Ladsgroup,
Have you had a chance to review the recent changes?
If they are ok, please let me know so we can close this task.

Thanks in advance for your feedback.

Hi, sorry. We have been dealing with aftermath of two major outages. I need a bit of time. Thank you!

Hi, sorry. We have been dealing with aftermath of two major outages. I need a bit of time. Thank you!

Ok, thanks for letting us know @Ladsgroup.

I wish we could use linktarget ids instead but given their complexity and the fact that page title won't be repeated much I think it's fine for now. It's good to go.

I wish we could use linktarget ids instead but given their complexity and the fact that page title won't be repeated much I think it's fine for now. It's good to go.

Thanks!

Change #1039751 merged by jenkins-bot:

[mediawiki/extensions/CampaignEvents@master] Reapply "Add new schema to store invitation lists data"

https://gerrit.wikimedia.org/r/1039751

vaughnwalters subscribed.

ce_invitation_lists

Screenshot 2024-07-09 at 1.55.53 PM.png (380×1 px, 96 KB)

Screenshot 2024-07-09 at 1.54.16 PM.png (362×3 px, 142 KB)


ce_worklist_articles;

Screenshot 2024-07-09 at 1.56.51 PM.png (326×1 px, 82 KB)

Screenshot 2024-07-09 at 1.58.19 PM.png (268×3 px, 90 KB)


ce_invitation_list_users;

Screenshot 2024-07-09 at 1.59.34 PM.png (316×1 px, 77 KB)

Screenshot 2024-07-09 at 2.00.16 PM.png (300×3 px, 109 KB)


✅ Schema has been created according to the AC.

Sending this to product sign off.

Nothing user-facing to test and the feature has been released to the beta cluster, so I'm marking this as Done.