
Investigation: Figure out the database tables for grant metrics
Closed, ResolvedPublic3 Estimated Story Points

Description

Grant metrics tool (https://meta.wikimedia.org/wiki/Community_Tech/Grant_metrics_tool) needs databases. Let's figure out what the tables are going to look like before we start the coding stuff. We need to think about what kinds of queries we'll be doing on these tables to be sure it's well optimized.

Event Timeline

DannyH triaged this task as Medium priority. Sep 19 2017, 10:42 PM
DannyH set the point value for this task to 3.

Based on the wireframes, here's my proposed schema:

--
-- Represents a group of events, with its own title.
--
CREATE TABLE /*_*/program (
  -- Primary key.
  program_id INT unsigned NOT NULL PRIMARY KEY AUTO_INCREMENT,

  -- Title of the program. This should be unique.
  program_title VARCHAR(255) binary NOT NULL DEFAULT ''
);
CREATE UNIQUE INDEX /*i*/program_title ON /*_*/program (program_title);


--
-- Table of users who are organizing a program.
--
CREATE TABLE /*_*/organizer (
  -- Primary key.
  org_id INT unsigned NOT NULL PRIMARY KEY AUTO_INCREMENT,

  -- Foreign key to the user's ID in the `loginwiki` database.
  -- Organizers can create programs that apply to any number of wikis,
  -- and we want a way to tie back to the Wikimedia databases, so
  -- we will use the `user_id` on `loginwiki.user`.
  org_user_id INT unsigned NOT NULL,

  -- Foreign key corresponding to program_id.
  org_program_id INT unsigned NOT NULL
);

-- For looking up programs that a user organizes.
CREATE INDEX /*i*/org_user ON /*_*/organizer (org_user_id);

-- For looking up organizers of a program.
CREATE INDEX /*i*/org_program ON /*_*/organizer (org_program_id);


--
-- An event belongs to a program, and contains dedicated metrics for
-- 30-day retention, pages created and pages improved. These statistics
-- are periodically imported from MediaWiki and stored in this table.
-- The number of participants, and how many of them are new editors,
-- are computed by joining this table with `participant`.
--
CREATE TABLE /*_*/event (
  -- Primary key.
  event_id INT unsigned NOT NULL PRIMARY KEY AUTO_INCREMENT,

  -- Foreign key to program.
  event_program_id INT unsigned NOT NULL,

  -- Title of the event. This should be unique for the program.
  event_title VARCHAR(255) binary NOT NULL DEFAULT '',

  -- Start date of the event as Unix timestamp.
  event_start TIMESTAMP,

  -- End date of the event as Unix timestamp.
  event_end TIMESTAMP,

  -- Timezone as a string, corresponding to the tz database.
  -- (See https://en.wikipedia.org/wiki/Tz_database)
  event_timezone VARCHAR(64) NOT NULL DEFAULT 'UTC',

  -- Retention over a 30 day timeframe.
  event_retention INT unsigned NOT NULL DEFAULT 0,

  -- Number of pages created as part of the event.
  event_pages_created INT unsigned NOT NULL DEFAULT 0,

  -- Number of pages improved during the event.
  event_pages_improved INT unsigned NOT NULL DEFAULT 0,

  -- Last time the statistics were updated.
  event_updated_at TIMESTAMP
);

-- For finding events by time.
CREATE INDEX /*i*/event_time ON /*_*/event (event_start, event_end);

-- For finding events by title. This may be used if for example you remember
-- the event title but can't remember what program it belongs to.
CREATE INDEX /*i*/event_title ON /*_*/event (event_title);

-- For finding events within a program.
CREATE INDEX /*i*/event_program ON /*_*/event (event_program_id);

-- For finding events by title within a program, which should be unique.
CREATE UNIQUE INDEX /*i*/event_program_title ON /*_*/event (event_program_id, event_title);


--
-- Lists the wikis that an event belongs to.
--
CREATE TABLE /*_*/event_wiki (
  -- Primary key.
  ew_id INT unsigned NOT NULL PRIMARY KEY AUTO_INCREMENT,

  -- Foreign key to the event.
  ew_event_id INT unsigned NOT NULL,

  -- Database name of the wiki. Corresponds to `dbname` in `meta`.`wiki`
  ew_dbname VARCHAR(32) NOT NULL
);

-- For finding wikis by event.
CREATE INDEX /*i*/ew_event ON /*_*/event_wiki (ew_event_id);

-- For finding events by wiki.
CREATE INDEX /*i*/ew_wiki ON /*_*/event_wiki (ew_dbname);


--
-- Participants belong to an event.
--
CREATE TABLE /*_*/participant (
  -- Primary key, same as the `user_id` in `loginwiki`.`user`
  par_user_id INT unsigned NOT NULL PRIMARY KEY DEFAULT 0,

  -- Foreign key to the event.
  par_event_id INT unsigned NOT NULL,

  -- Whether or not they are considered a new editor,
  -- as of the time of the event.
  par_new_editor BOOLEAN NOT NULL DEFAULT 0
);

-- For finding participants by event.
CREATE INDEX /*i*/par_event ON /*_*/participant (par_event_id);

-- For finding events by participant.
CREATE INDEX /*i*/par_user ON /*_*/participant (par_user_id);

This does not account for "categories" and "templates". I don't know how those are supposed to work.

Also, I was thinking users should be stored by their user ID. This would mean we don't need to worry about renamed users. I figure we'll use the user_id in the loginwiki database because every user has an account there. With that, we can get their user_name, and get the necessary data from the wikis the event belongs to.

  -- Retention over a 30 day timeframe.
  event_retention INT unsigned NOT NULL DEFAULT 0,

  -- Number of pages created as part of the event.
  event_pages_created INT unsigned NOT NULL DEFAULT 0,

  -- Number of pages improved during the event.
  event_pages_improved INT unsigned NOT NULL DEFAULT 0,

In my experience, fields like these have a tendency to change over time. For example, after a while people may not see "user retention" as an important metric and want to switch to "data retention" for the data that came out of the event.
So what do you think about having this data in a separate event-data table? Something like -

CREATE TABLE /*_*/event_data (
  -- Event metric ('retention' or 'pages created' or 'pages improved' etc.)
  ed_metric VARCHAR(32) NOT NULL,

  -- Value of the given metric for the given event.
  ed_metric_value INT unsigned NOT NULL DEFAULT 0,

  -- Foreign key to the event.
  ed_event_id INT unsigned NOT NULL
);
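
For illustration, here's how this design would be read back (the event and program IDs, and the metric names, are just examples):

-- All computed metrics for one event.
SELECT ed_metric, ed_metric_value
FROM event_data
WHERE ed_event_id = 42;

-- One metric across all events in a program.
SELECT e.event_title, ed.ed_metric_value
FROM event e
JOIN event_data ed ON ed.ed_event_id = e.event_id
WHERE e.event_program_id = 7
  AND ed.ed_metric = 'retention';

Adding a new metric later would then just be a new ed_metric value, with no schema change.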

Also, I was thinking users should be stored by their user ID. This would mean we don't need to worry about renamed users. I figure we'll use the user_id in the loginwiki database because every user has an account there. With that, we can get their user_name, and get the necessary data from the wikis the event belongs to.

We'll have to join on tables in the local wikis often. For that purpose it will take longer to join with loginwiki and get local IDs/usernames and then join using that. Let's save usernames and global IDs both?

--
-- Lists the wikis that an event belongs to.
--
CREATE TABLE /*_*/event_wiki (
  -- Primary key.
  ew_id INT unsigned NOT NULL PRIMARY KEY AUTO_INCREMENT,

  -- Foreign key to the event.
  ew_event_id INT unsigned NOT NULL,

  -- Database name of the wiki. Corresponds to `dbname` in `meta`.`wiki`
  ew_dbname VARCHAR(32) NOT NULL
);

I'm wondering if it's worth having this in its own table. What if we save the list of wikis in the event table in a field in a comma-separated format? Is there an advantage/disadvantage to having it this way?
We know that the list of wikis does not get "too big". Most events are on a single wiki. I expect the list to cap at about 10 wikis maximum. We can have it as a BLOB to be safe if we do it that way.
Also note that there will be an option for "all wikis" for events, which would mean we're computing metrics across all 800+ wikis.

what do you think about having this data in a separate event-data table?

That could work!

We'll have to join on tables in the local wikis often. For that purpose it will take longer to join with loginwiki and get local IDs/usernames and then join using that. Let's save usernames and global IDs both?

It should go pretty fast, especially with the new replicas. Say we have to look up 50 users when we only have their IDs (I just did this test with the past 50 users created):

MariaDB [enwiki_p]> SELECT user_name FROM loginwiki_p.user WHERE user_id IN (25887661,25887660,25887659,25887658,25887657,25887656,25887655,25887654,25887653,25887651,25887650,25887649,25887648,25887647,25887646,25887645,25887635,25887634,25887624,25887623,25887621,25887620,25887619,25887618,25887617,25887616,25887610,25887609,25887608,25887607,25887606,25887605,25887604,25887589,25887588,25887581,25887580,25887579,25887578,25887577,25887576,25887575,25887574,25887573,25887572,25887571,25887570,25887569,25887567,25887566);

...

50 rows in set (0.00 sec)

The only reason I'm pushing for this at all is to account for user renames. Those don't happen too often, but when they do it will be a mess and yield unexpected results, throw off the metrics, etc. We could store usernames too, but we still have the same problem of needing to check if they've been renamed, and then update all the relevant rows in participant and/or organizer.

What if we save the list of wikis in the event table in a field in a comma-separated format? Is there an advantage/disadvantage to having it this way?

Yeah we could do a serialized array. The only issue I foresee is that this would make it difficult to search for events by wiki. Not sure if we'd want to do that.
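
To illustrate the trade-off (the event_wikis column here is hypothetical): with a comma-separated field, finding events by wiki means string matching that can't use an index, whereas the separate table makes it an indexed lookup:

-- Hypothetical comma-separated column: scans every row.
SELECT event_id
FROM event
WHERE FIND_IN_SET('frwiki', event_wikis) > 0;

-- Separate event_wiki table: indexed equality match.
SELECT ew_event_id
FROM event_wiki
WHERE ew_dbname = 'frwiki';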

Also note that there will be an option for "all wikis" for events, which would mean we're computing metrics across all 800+ wikis.

For that I guess the wikis column could just be NULL (or if there are no corresponding rows in event_wiki, if we go that route). It's painful to think about this... Computing metrics across all wikis will not be fun. It's slow enough just getting a single user's global contributions. Do any of the other outreach tools do this?

We'll have to join on tables in the local wikis often. For that purpose it will take longer to join with loginwiki and get local IDs/usernames and then join using that. Let's save usernames and global IDs both?

It should go pretty fast, especially with the new replicas. Say we have to look up 50 users when we only have their IDs (I just did this test with the past 50 users created):

MariaDB [enwiki_p]> SELECT user_name FROM loginwiki_p.user WHERE user_id IN (25887661,25887660,25887659,25887658,25887657,25887656,25887655,25887654,25887653,25887651,25887650,25887649,25887648,25887647,25887646,25887645,25887635,25887634,25887624,25887623,25887621,25887620,25887619,25887618,25887617,25887616,25887610,25887609,25887608,25887607,25887606,25887605,25887604,25887589,25887588,25887581,25887580,25887579,25887578,25887577,25887576,25887575,25887574,25887573,25887572,25887571,25887570,25887569,25887567,25887566);

...

50 rows in set (0.00 sec)

The only reason I'm pushing for this at all is to account for user renames. Those don't happen too often, but when they do it will be a mess and yield unexpected results, throw off the metrics, etc. We could store usernames too, but we still have the same problem of needing to check if they've been renamed, and then update all the relevant rows in participant and/or organizer.

When we ask the organizer(s) to enter participant(s), they enter usernames. It's up to them to check if the participant has been renamed. User renames are very very infrequent and I don't think it's worth the hassle here. The number of participants for an event can be pretty high and if every time the organizer wants to view participants, we join on the database, it could get slow. @kaldari - thoughts?

Another thing, the new replicas don't allow tools which have their own databases, do they? So we have to host this on the old ones? Are those guaranteed to be supported in future?

What if we save the list of wikis in the event table in a field in a comma-separated format? Is there an advantage/disadvantage to having it this way?

Yeah we could do a serialized array. The only issue I foresee is that this would make it difficult to search for events by wiki. Not sure if we'd want to do that.

Na, we won't be searching by wiki at any point.

Also note that there will be an option for "all wikis" for events, which would mean we're computing metrics across all 800+ wikis.

For that I guess the wikis column could just be NULL (or if there are no corresponding rows in event_wiki, if we go that route). It's painful to think about this... Computing metrics across all wikis will not be fun. It's slow enough just getting a single user's global contributions. Do any of the other outreach tools do this?

I think Programs Dashboard does. And Grant Metrics does it for sure.
I think the catch here is that not everyone edits all wikis. There's gotta be an API we can connect to for getting the SUL info for a user and then we will only have to query 5-10 wikis.
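
If the centralauth database is exposed on the replicas like the content wikis are (an assumption on my part), the list of attached wikis could even come straight from SQL; the username is illustrative:

-- Wikis on which this user has an attached (SUL) account.
SELECT lu_wiki
FROM centralauth_p.localuser
WHERE lu_name = 'Example user';

The meta=globaluserinfo API (with guiprop=merged) should return the same list of attached accounts, if the API route is preferable.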

I was looking at https://xtools.wmflabs.org/ec-latestglobal/en.wikipedia.org/NKohli%20%28WMF%29 -- that is cool. Is there a way we could tap into that for getting every participant's edits across all/given set of wikis? Is there an API for it?

This could especially come handy for when we generate the table for exporting data - https://meta.wikimedia.org/wiki/Community_Tech/Grant_metrics_tool#/media/File:Export_data.png

Another thing - for all participants and organizers, we should have another column for whether they are valid or not. And a similar field for the event as a whole.

Use case -

  1. Organizer creates a new event, pastes in a list of 1000 users and hits save and user starts working on something else/closes window.
  2. Validation checks need to be run in the background after user hits save. Next time they come online, we can tell them exactly which users are invalid so we don't have to check the whole list over.

For events, if anything in an event is invalid (could be a participant name or category/template name in v2), we don't let them calculate totals. So we need that info available at hand.
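
To sketch the idea against the proposed schema (these column names are hypothetical, just to make the use case concrete):

-- Hypothetical validity flags, set by the background job.
ALTER TABLE /*_*/participant ADD COLUMN par_valid BOOLEAN NOT NULL DEFAULT 0;
ALTER TABLE /*_*/event ADD COLUMN event_valid BOOLEAN NOT NULL DEFAULT 0;

-- Next time the organizer comes back, show exactly which entries failed.
SELECT par_user_id
FROM participant
WHERE par_event_id = 42 AND par_valid = 0;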

When we ask the organizer(s) to enter participant(s), they enter usernames. It's up to them to check if the participant has been renamed. User renames are very very infrequent and I don't think it's worth the hassle here. The number of participants for an event can be pretty high and if every time the organizer wants to view participants, we join on the database, it could get slow.

We'll definitely cache the results, so it won't be querying every time. It's worth mentioning that renames are more common with new users, so I would be prepared for this scenario. Not a huge deal, but I don't think fetching usernames will slow anything down noticeably. We are in the early stages, so if it doesn't work out we can change our strategy. I'm just thinking ahead, and would hate for the organizer to be unaware of renames, keep refreshing statistics, and remain completely oblivious that the stats are off. It'd be a real hassle trying to figure out who was renamed, what their new username is, manually updating the list of participants, etc., right? If we are able to do the work for them, I think we should.

Another thing, the new replicas don't allow tools which have their own databases, do they? So we have to host this on the old ones? Are those guaranteed to be supported in future?

That's my understanding. I believe we are safe to continue using tools-db for our own data, but I almost want to push for a VPS instance and our own database. We could do backups just like we do with Outreach Dashboard.

Na, we won't be searching by wiki at any point.

What about statistics? It'd be cool to say, for instance, that in the past year there were N events on frwiki, etc. Is there a disadvantage to having a separate table? To me it plays into the relational database architecture, and would allow for more things to be associated with a wiki in the future should we need it. With the index, having a separate table shouldn't slow anything down, and finally, I don't think MySQL is really meant to handle array data types.
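
For example, with the separate table that stat is a simple aggregate (the date cutoff is illustrative):

-- Number of events per wiki over the past year.
SELECT ew.ew_dbname, COUNT(*) AS events
FROM event_wiki ew
JOIN event e ON e.event_id = ew.ew_event_id
WHERE e.event_start >= '2016-10-25'
GROUP BY ew.ew_dbname
ORDER BY events DESC;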

I think the catch here is that not everyone edits all wikis. There's gotta be an API we can connect to for getting the SUL info for a user and then we will only have to query 5-10 wikis.

That makes me feel better!

That's my understanding. I believe we are safe to continue using tools-db for our own data, but I almost want to push for a VPS instance and our own database. We could do backups just like we do with Outreach Dashboard.

Why do you think we would need our own VPS for this? The more VPSs we set up, the more we have to maintain and the greater the learning curve for new contributors. I would prefer that we not set up a VPS unless we really need to.

The only reason I'm pushing for this at all is to account for user renames...

I tend to agree with Niharika on this one. One of the reasons we implemented SUL was so that we wouldn't have to do cross-database gymnastics like this. We know that usernames are consistent across all wikis (and that user renames are extremely rare), so I think we should just store the usernames.

That's my understanding. I believe we are safe to continue using tools-db for our own data, but I almost want to push for a VPS instance and our own database. We could do backups just like we do with Outreach Dashboard.

Why do you think we would need our own VPS for this? The more VPSs we set up, the more we have to maintain and the greater the learning curve for new contributors. I would prefer that we not set up a VPS unless we really need to.

Definitely don't need VPS. I understand concerns around a learning curve but if we go with the XTools setup, contributors need only to push and everything deploys on its own. Even without automation, the deployment process would be no different than Toolforge (pull, clear cache, rebuild assets). Building the environment certainly isn't as straightforward, but we will document it. If we really don't want to use VPS that's cool, but I'd like to highlight the advantages:

  • We can get our own subdomain! And perhaps even take over https://metrics.wmflabs.org/
  • We can use PHP 7. Things like Doctrine and phpunit are slowly dropping support for PHP 5.6. It'd be really cool to stay up-to-date, not to mention all the new and nifty syntax that PHP 7 offers. See also T178850
  • We can still use tools-db, meaning the data will be backed up.

There's a laundry list of arguments against it, too, so we could go either way. Just sharing my thoughts :)

The only reason I'm pushing for this at all is to account for user renames...

I tend to agree with Niharika on this one. One of the reasons we implemented SUL was so that we wouldn't have to do cross-database gymnastics like this. We know that usernames are consistent across all wikis (and that user renames are extremely rare), so I think we should just store the usernames.

  • What about organizers? If they rename, and log-in via OAuth, they are no longer tied to all the other programs they created. Would we have to manually update our database for them?
  • What happens when a participant is renamed, and the organizer updates it in the interface? The code will have to update all the rows across all the tables, knowing this was a rename and not just changing to a different user. If you have hundreds of participants, how are you meant to notice if one of the accounts was renamed?

I would have a look at meta:Special:Log/gblrename (roughly 30 a day) and note they are predominantly new users. If we want to store usernames that's fine, but we shouldn't ignore the possibility that they will change, and we should be able to update our data accordingly without manual intervention.

So it sounds like you basically want a VPS for the subdomain :) Ideally, there should be some way for us to point subdomains at toolforge directories. Maybe @bd808 would know. I'm not totally opposed to the idea of a separate VPS, but the argument for it so far seems weak.

What about organizers? If they rename, and log-in via OAuth, they are no longer tied to all the other programs they created. Would we have to manually update our database for them?

True, it might be worth it for event organizers, but I'm not convinced it would be worth it for event participants (which will be the bulk of the queries).

What happens when a participant is renamed, and the organizer updates it in the interface?

Same as if an organizer added a new participant. Doesn't seem like a big deal.

If you have hundreds of participants, how are you meant to notice if one of the accounts was renamed?

It's far more common for new users to simply abandon old accounts and create new ones, which I also don't feel like we need to solve for. If you have hundreds of participants, one of them being renamed isn't likely to affect your stats that much. Besides, these days, the emphasis for editing events is more on actual impact from the event rather than editor retention. It's been proven many times over that editing events have poor retention and aren't a cost-effective means of recruiting new editors, so adding a lot of complexity to accommodate an edge case that's only relevant for retention stats seems like a poor use of time (compared to the other things we could be working on).

So it sounds like you basically want a VPS for the subdomain :)

And PHP 7+, which I think will be more future-proof, and I'd love to see the VersionEye report in green and not yellow or red. For example, PHPUnit 5.7 is the most recent version that runs on PHP 5.6, and support for it will end in February. We should aim to live in the future, and not linger in the past :)

If you have hundreds of participants, how are you meant to notice if one of the accounts was renamed?

It's far more common for new users to simply abandon old accounts and create new ones, which I also don't feel like we need to solve for. If you have hundreds of participants, one of them being renamed isn't likely to affect your stats that much. Besides, these days, the emphasis for editing events is more on actual impact from the event rather than editor retention. It's been proven many times over that editing events have poor retention and aren't a cost-effective means of recruiting new editors, so adding a lot of complexity to accommodate an edge case that's only relevant for retention stats seems like a poor use of time (compared to the other things we could be working on).

Let's take for example https://xtools.wmflabs.org/adminstats/en.wikipedia.org:

SELECT user_name, user_id,
    SUM(IF( (log_type = 'delete'  AND log_action != 'restore'),1,0)) AS `delete`,
    SUM(IF( (log_type = 'delete'  AND log_action  = 'restore'),1,0)) AS `restore`,
    SUM(IF( (log_type = 'block'   AND log_action != 'unblock'),1,0)) AS `block`,
    SUM(IF( (log_type = 'block'   AND log_action  = 'unblock'),1,0)) AS `unblock`,
    SUM(IF( (log_type = 'protect' AND log_action != 'unprotect'),1,0)) AS `protect`,
    SUM(IF( (log_type = 'protect' AND log_action  = 'unprotect'),1,0)) AS `unprotect`,
    SUM(IF( log_type  = 'rights',1,0)) AS `rights`,
    SUM(IF( log_type  = 'import',1,0)) AS `import`,
    SUM(IF( log_type != '',1,0)) AS `total`
FROM logging_userindex
JOIN user ON user_id = log_user
WHERE log_timestamp > '20170924000000' AND log_timestamp <= '20171024235959'
  AND log_type IS NOT NULL
  AND log_action IS NOT NULL
  AND log_type IN ('block', 'delete', 'protect', 'import', 'rights')
GROUP BY user_name
HAVING `delete` > 0 OR user_id IN (
  -- List of user IDs
)
ORDER BY `total` DESC

So this does some computations on the logging table, in addition to getting the username. For enwiki it runs on around 1200 users, and finishes in ~3 seconds uncached.

We'll likely be doing something similar. Our data lives in a private database that can't be joined with the replicas, so we'll first have to fetch all the usernames, then put those in user_name IN ( ... ). It would then seem trivial to instead store user IDs on our end and do user_id IN ( ... ) like AdminStats does. That's about as complex as it gets...
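
To make that concrete, the flow would be something like this (the IDs and dates are illustrative):

-- Step 1, on our own database: collect the stored IDs for an event.
SELECT par_user_id FROM participant WHERE par_event_id = 42;

-- Step 2, on the replicas: feed those IDs into the metrics query,
-- the same way AdminStats does above.
SELECT COUNT(*) AS edits
FROM revision_userindex
WHERE rev_user IN (25887661, 25887660, 25887659)
  AND rev_timestamp BETWEEN '20170901000000' AND '20170930235959';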

For merely displaying user names on our end, consider:

MariaDB [enwiki_p]> SELECT user_name
FROM user
WHERE user_id IN (8760229,30498975,5075409,3188090,6188256, /* ...1,286 IDs in total, trimmed here for readability... */ 293907,44846,1286)
...
1286 rows in set (0.02 sec)

(1,286 users, 0.02 secs), and we could also cache this.

I assume most of our computations will be against the revision table. So let's have a look at this query from XTools that fetches revisions made by a user, including the page title and diff size:

EXPLAIN SELECT
    revs.rev_id AS id,
    revs.rev_timestamp AS timestamp,
    UNIX_TIMESTAMP(revs.rev_timestamp) AS unix_timestamp,
    revs.rev_minor_edit AS minor,
    revs.rev_deleted AS deleted,
    revs.rev_len AS length,
    (CAST(revs.rev_len AS SIGNED) - IFNULL(parentrevs.rev_len, 0)) AS length_change,
    revs.rev_parent_id AS parent_id,
    revs.rev_comment AS comment,
    revs.rev_user_text AS username,
    page.page_title,
    page.page_namespace
FROM enwiki_p.revision_userindex AS revs
    JOIN page AS page ON (rev_page = page_id)
    LEFT JOIN revision_userindex AS parentrevs ON (revs.rev_parent_id = parentrevs.rev_id)
WHERE revs.rev_user = 14882394
ORDER BY revs.rev_timestamp DESC

This scans 190,966 rows (or that's the query plan, anyway). If I instead use revs.rev_user_text = 'MusikAnimal' it scans 204,120 rows. Not a big difference; I just wanted to point out that using the ID can be faster. And to be fair, I think in some cases it's the other way around, where using the username is faster (it depends on the indexing), but for revision both rev_user_text and rev_user are indexed (usertext_timestamp and user_timestamp, respectively) -- and again, getting the usernames if needed via a subquery is very fast, as evidenced by the second example above.

I disagree that we should knowingly leave a loophole open that will sacrifice the integrity of the data, inconvenience the organizers, or worse, cause runtime errors -- just because we think it will be hard. I think the first step would be to at least try... :) If it doesn't work out, that's OK. We are in the early stages and can adjust our schema and strategy as needed.

@MusikAnimal: Yeah, but don't we also have to look-up the local user ID for each wiki on every query, which would be a completely separate look-up on a separate database?

Maybe I don't really understand your plan. For example, why would you store the user_id from loginwiki? It seems like you would want to store the global_id from the centralauth database (which would then allow you to look-up the local user IDs for the wikis they've edited on).

@MusikAnimal: Yeah, but don't we also have to look-up the local user ID for each wiki on every query, which would be a completely separate look-up on a separate database?

For sure, I was just about to hit save on this comment... My plan would be to fetch the usernames from loginwiki first, then query by username. So never mind the bit about user ID being faster (it is only marginally faster, from my tests). Nonetheless, we're looking at less than 100ms of overhead from having to first get the usernames, and that's assuming we're dealing with thousands of users.

Another thing - for all participants and organizers, we should have another column for whether they are valid or not. And a similar field for the event as a whole.

Same situation, if we're just talking about validating usernames (regardless of whether we store IDs or not). It should finish in well under a second. The concern around making sure they are valid totally makes sense, but I don't think we'll need a separate column, as we'd be able to do this validation on demand, without the need for a background job.
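
For illustration, the on-demand check could be a single IN query, with anything missing from the result set flagged as invalid (the usernames are made up):

-- Validate a batch of entered usernames in one round trip.
SELECT user_name
FROM user
WHERE user_name IN ('Example A', 'Example B', 'Example C');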

Maybe I don't really understand your plan. For example, why would you store the user_id from loginwiki? It seems like you would want to store the global_id from the centralauth database (which would then allow you to look-up the local user IDs for the wikis they've edited on).

Ah yes! I couldn't figure out where the CentralAuth data lived; it seems it would make more sense to go by that!! Whether we should query for the local user ID from there, I'm not sure, as that'd still mean a subquery alongside every other query we make. I bet it goes just as fast, though. So either way works for me, but I like the idea of using centralauth and storing the global ID more than loginwiki's user ID.
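
Concretely, that would mean lookups against globaluser (the values here are illustrative):

-- Resolve an entered username to its global account ID for storage.
SELECT gu_id
FROM centralauth_p.globaluser
WHERE gu_name = 'Example user';

-- And the reverse, for displaying stored participants.
SELECT gu_id, gu_name
FROM centralauth_p.globaluser
WHERE gu_id IN (12345, 67890);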

Another thing - for all participants and organizers, we should have another column for whether they are valid or not. And a similar field for the event as a whole.

Same situation, if we're just talking about validating usernames (regardless of whether we store IDs or not). It should finish in well under a second. The concern around making sure they are valid totally makes sense, but I don't think we'll need a separate column, as we'd be able to do this validation on demand, without the need for a background job.

When we add categories and templates per event in the next version, we'd need to ensure that those categories/templates are valid for given wikis - which may just be all wikis. I'd insist on a 'valid' column for the entire event row, if not for all participants too because we do need that data to enable/disable the "calculate totals" option. It's a boolean anyway so it's not like we're adding a major chunk of data.
Going by my experience, having more data never hurts. But if the app turns out to be "too slow", that's a real problem.

I disagree that we should knowingly leave a loophole open that will sacrifice the integrity of the data, inconvenience the organizers, or worse, cause runtime errors -- just because we think it will be hard. I think the first step would be to at least try... :) If it doesn't work out, that's OK. We are in the early stages and can adjust our schema and strategy as needed.

Well, I certainly don't think it will be "hard". I think this scenario is not very likely to occur.

What happens when a participant is renamed, and the organizer updates it in the interface? The code will have to update all the rows across all the tables, knowing this was a rename and not just changing to a different user. If you have hundreds of participants, how are you meant to notice if one of the accounts was renamed?

Typically in events like the ones we're designing for, the organizer would basically have a sign-up sheet of sorts. And most of these events are limited to under 100 participants.

There are many, many tools out there which operate under the assumption that the accounts that use them will not be renamed. Or if they are, they're treated as new. Including CopyPatrol. I have not heard of this being a problem for any of them.
I'm uncomfortable with the thought of us having to do a lookup for all participants every time someone opens the event page. Don't forget that multiple organizers for multiple events under multiple programs could be using the app simultaneously. But okay, we can go ahead with IDs only and change it to usernames if it seems too slow.

When we add categories and templates per event in the next version, we'd need to ensure that those categories/templates are valid for given wikis - which may just be all wikis. I'd insist on a 'valid' column for the entire event row, if not for all participants too because we do need that data to enable/disable the "calculate totals" option. It's a boolean anyway so it's not like we're adding a major chunk of data.
Going by my experience, having more data never hurts. But if the app turns out to be "too slow", that's a real problem.

No problem. I don't think validating titles will prove to be slow, but we could benefit from a "valid" column in one way or another. As you say, they might step away, close the tab, etc., and it'd be nice to have the invalid entries still there so they can fix them.

There are many, many tools out there which operate under the assumption that the accounts that use them will not be renamed. Or if they are, they're treated as new. Including CopyPatrol. I have not heard of this being a problem for any of them.

Totally. However, the worst that could happen with CopyPatrol was that they'd lose their place on the leaderboard. Here our computations depend on being able to reliably fetch data from the replicas for a large number of users, so to me that seemed fragile. Sorry for being so persistent; I'm admittedly obsessed with perfecting this app! Which I hope is not a bad thing...

I'm uncomfortable with the thought of us having to do a lookup for all participants every time someone opens the event page. Don't forget that multiple organizers for multiple events under multiple programs could be using the app simultaneously. But okay, we can go ahead with IDs only and change it to usernames if it seems too slow.

Worth a shot! We can definitely take advantage of caching. I doubt we'll need to, but we can also get a higher quota for concurrent db connections, as we had to do with XTools.

I will submit a revised schema, noting that it certainly isn't set in stone.

Regarding the VPS, I want to emphasize that we have very, very little time to work on this project. We basically have 2 months, which includes Christmas, Thanksgiving, and the Community Wishlist Survey. If we don't have a working prototype by the end of the quarter, it's likely we will kill the project rather than continue working on it, as we will have a slew of higher-priority work to do in January. We can't afford to just tack on extra quarters with this project like we did with XTools. I don't really see making it future-proof as a compelling concern for an MVP, and I imagine that once PHP 5.6 is no longer viable there's a good chance that Grant Metrics will have been replaced by some totally new system anyway. My vote is to keep this project simple and go ahead and build it on Toolforge as a regular project.

Regarding the user ID, I don't have a strong opinion either way, but I hope whatever we implement can be built quickly. If using the user IDs doesn't seem like it's going to slow things down, I'm OK with it. If it's going to add substantial work (or slow down the queries), I would favor using usernames instead.

"Simplicity—the art of maximizing the amount of work not done—is essential." — Principles behind the Agile Manifesto

Technically that's a hostname, not a subdomain. See also: T125589: Allow each tool to have its own subdomain for browser sandbox/cookie isolation

  • We can still use tools-db, meaning the data will be backed up.

tools.db.svc.eqiad.wmflabs (aka tools-db) is not backed up. Most databases it hosts are replicated to a warm spare server, but this is not the same thing as a backup.

Regarding the VPS, I want to emphasize that we have very, very little time to work on this project. We basically have 2 months, which includes Christmas, Thanksgiving, and the Community Wishlist Survey. If we don't have a working prototype by the end of the quarter, it's likely we will kill the project rather than continue working on it, as we will have a slew of higher-priority work to do in January.

Ouch... I was not aware of this. Are we sure this is realistic? I think I can go pretty fast, but yeah, between the Wishlist and the holidays (when I'll be abroad), I am not certain. A prototype, probably, but I wouldn't expect much more than that :/

I imagine that once PHP 5.6 is no longer viable there's a good chance that Grant Metrics will have been replaced by some totally new system anyway. My vote is to keep this project simple and go ahead and build it on Toolforge as a regular project.

This is disheartening. I was planning to put a lot of focus on longevity and code quality. I was under the impression that since we've had other iterations of similar tools, we'd aim to make this one last. If that is not the case, we should definitely scrap all the extras, long discussions on implementation approaches (mostly my fault), and get something working. Sorry, the time sensitivity was not clear to me. Using Toolforge is fine. I don't even have PHP 7+ installed on my local yet, I was just excited to build a new project from the ground up, going by the latest and greatest, best practices, aiming for 90+% test coverage, etc. I don't think all of those things will happen with the time constraints, but I'll do my best!

tools.db.svc.eqiad.wmflabs (aka tools-db) is not backed up. Most databases it hosts are replicated to a warm spare server, but this is not the same thing as a backup.

Got it, thank you. I guess the spare server is what I was thinking of. I recall an incident with tools-db becoming corrupted a year or two ago, and the Labs team spent a few days successfully restoring the data. Whether or not we'll make actual backups in addition to this, I'm not sure.

Updated schema:

--
-- Represents a group of events, with its own title.
--
CREATE TABLE /*_*/program (
  -- Primary key.
  program_id INT unsigned NOT NULL PRIMARY KEY AUTO_INCREMENT,

  -- Title of the program. This should be unique.
  program_title VARCHAR(255) binary NOT NULL DEFAULT ''
);
CREATE UNIQUE INDEX /*i*/program_title ON /*_*/program (program_title);


--
-- Table of users who are organizing a program.
--
CREATE TABLE /*_*/organizer (
  -- Primary key.
  org_id INT unsigned NOT NULL PRIMARY KEY AUTO_INCREMENT,

  -- Foreign key to the user's ID in the `centralauth` database.
  -- Organizers can create programs that apply to any number of wikis,
  -- and we want a way to tie back to the Wikimedia databases, so
  -- we will use the `gu_id` on `centralauth`.`globaluser`
  org_user_id INT unsigned NOT NULL,

  -- Foreign key corresponding to program_id.
  org_program_id INT unsigned NOT NULL
);

-- For looking up programs that a user organizes.
CREATE INDEX /*i*/org_user ON /*_*/organizer (org_user_id);

-- For looking up organizers of a program.
CREATE INDEX /*i*/org_program ON /*_*/organizer (org_program_id);


--
-- An event belongs to a program, and contains dedicated metrics for
-- 30-day retention, pages created and pages improved. These statistics
-- are periodically imported from MediaWiki and stored in this table.
-- The number of participants, and how many of them are new editors,
-- are computed by joining this table with `participant`.
--
CREATE TABLE /*_*/event (
  -- Primary key.
  event_id INT unsigned NOT NULL PRIMARY KEY AUTO_INCREMENT,

  -- Foreign key to program.
  event_program_id INT unsigned NOT NULL,

  -- Title of the event. This should be unique for the program.
  event_title VARCHAR(255) binary NOT NULL DEFAULT '',

  -- Start date of the event as Unix timestamp.
  event_start TIMESTAMP,

  -- End date of the event as Unix timestamp.
  event_end TIMESTAMP,

  -- Timezone as a string, corresponding to the tz database.
  -- (See https://en.wikipedia.org/wiki/Tz_database)
  event_timezone VARCHAR(64) NOT NULL DEFAULT 'UTC',

  -- Last time the statistics were updated.
  event_updated_at TIMESTAMP,

  -- Whether the event has passed validity checks.
  event_valid BOOLEAN NOT NULL DEFAULT 0
);

-- For finding events by time.
CREATE INDEX /*i*/event_time ON /*_*/event (event_start, event_end);

-- For finding events by title. This may be used if for example you remember
-- the event title but can't remember what program it belongs to.
CREATE INDEX /*i*/event_title ON /*_*/event (event_title);

-- For finding events within a program.
CREATE INDEX /*i*/event_program ON /*_*/event (event_program_id);

-- For finding events by title within a program, which should be unique.
CREATE UNIQUE INDEX /*i*/event_program_title ON /*_*/event (event_program_id, event_title);


--
-- Lists the wikis that an event belongs to.
--
CREATE TABLE /*_*/event_wiki (
  -- Primary key.
  ew_id INT unsigned NOT NULL PRIMARY KEY AUTO_INCREMENT,

  -- Foreign key to the event.
  ew_event_id INT unsigned NOT NULL,

  -- Database name of the wiki. Corresponds to `dbname` in `meta`.`wiki`
  ew_dbname VARCHAR(32) NOT NULL
);

-- For finding wikis by event.
CREATE INDEX /*i*/ew_event ON /*_*/event_wiki (ew_event_id);

-- For finding events by wiki.
CREATE INDEX /*i*/ew_wiki ON /*_*/event_wiki (ew_dbname);


--
-- Statistics about an event.
--
CREATE TABLE /*_*/event_stats (
  -- Primary key.
  es_id INT unsigned NOT NULL PRIMARY KEY AUTO_INCREMENT,

  -- Event metric ('retention' or 'pages created' or 'pages improved' etc.)
  es_metric VARCHAR(32) NOT NULL,

  -- Value of the given metric for the given event.
  es_metric_value INT unsigned NOT NULL DEFAULT 0,

  -- Foreign key to the event.
  es_event_id INT unsigned NOT NULL
);

-- For finding metrics that were computed for a given event.
CREATE INDEX /*i*/es_metrics ON /*_*/event_stats (es_event_id);

-- For finding all statistics about an event.
-- The combination of event ID and a metric should be unique.
CREATE UNIQUE INDEX /*i*/es_event ON /*_*/event_stats (es_event_id, es_metric);


--
-- Participants belong to an event.
--
CREATE TABLE /*_*/participant (
  -- Primary key, same as `gu_id` in `centralauth`.`globaluser`.
  par_user_id INT unsigned NOT NULL PRIMARY KEY DEFAULT 0,

  -- Foreign key to the event.
  par_event_id INT unsigned NOT NULL,

  -- Whether or not they are considered a new editor,
  -- as of the time of the event.
  par_new_editor BOOLEAN NOT NULL DEFAULT 0
);

-- For finding participants by event.
CREATE INDEX /*i*/par_event ON /*_*/participant (par_event_id);

-- For finding events by participant.
CREATE INDEX /*i*/par_user ON /*_*/participant (par_user_id);
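
One side note on event_stats: thanks to the unique (es_event_id, es_metric) index, refreshing a statistic can be done as a single upsert (the values here are illustrative):

-- Insert or update one metric for an event in a single statement.
INSERT INTO /*_*/event_stats (es_event_id, es_metric, es_metric_value)
VALUES (42, 'pages created', 150)
ON DUPLICATE KEY UPDATE es_metric_value = VALUES(es_metric_value);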

We don't really know how things will perform right now, so I personally think we should move forward and let this evolve as needed.

This is disheartening. I was planning to put a lot of focus on longevity and code quality. I was under the impression that since we've had other iterations of similar tools, we'd aim to make this one last. If that is not the case, we should definitely scrap all the extras, long discussions on implementation approaches (mostly my fault), and get something working. Sorry, the time sensitivity was not clear to me.

Yeah, that's probably my fault. One of our quarterly goals for this quarter is to have a usable MVP version of the Grant Metrics tool, but I'm not sure how realistic it is either. If it's clear that event organizers are actually interested in it (and willing to switch from the tools they are using now), we can probably justify investing more than a quarter on it, but I really don't want us to get bogged down on this.

@MusikAnimal, @Niharika: I'm wondering if we might want to support a many to many relationship between programs and organizers. In other words, allowing an organizer to be associated with multiple programs (and a program to be associated with multiple organizers). If so, we'll need a new bridge table that just records program IDs and organizer IDs. Not sure if that fits with the existing UI plan though. Does that seem like a good idea or unnecessary complication?

@MusikAnimal, @Niharika: I'm wondering if we might want to support a many to many relationship between programs and organizers. In other words, allowing an organizer to be associated with multiple programs (and a program to be associated with multiple organizers). If so, we'll need a new bridge table that just records program IDs and organizer IDs. Not sure if that fits with the existing UI plan though. Does that seem like a good idea or unnecessary complication?

Yes, that makes sense. I sort of have that now: there's a separate row in organizer for each program they manage, and each row in organizer has a program ID. There are also other attributes on program that we'd have to duplicate, so indeed a through table seems most appropriate. I'm mapping this all out in Doctrine ORM now, and discovering they have some nifty helpers to work with relationships like this, so it shouldn't be very complicated.
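
For reference, a sketch of what the through table could look like, following the naming conventions above (the table and column names are my own placeholders, not what Doctrine would generate):

--
-- Associates organizers with programs (many-to-many).
--
CREATE TABLE /*_*/organizers_programs (
  -- Foreign key to the organizer.
  op_org_id INT unsigned NOT NULL,

  -- Foreign key to the program.
  op_program_id INT unsigned NOT NULL
);

-- Each organizer/program pairing should only exist once.
CREATE UNIQUE INDEX /*i*/op_org_program ON /*_*/organizers_programs (op_org_id, op_program_id);

The org_program_id column would presumably come off organizer at the same time.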

@MusikAnimal, @Niharika: I'm wondering if we might want to support a many to many relationship between programs and organizers. In other words, allowing an organizer to be associated with multiple programs (and a program to be associated with multiple organizers). If so, we'll need a new bridge table that just records program IDs and organizer IDs. Not sure if that fits with the existing UI plan though. Does that seem like a good idea or unnecessary complication?

Seems unnecessary to me. The current organizer table has a column for program_id, so yeah, we can have the same organizer associated with multiple programs. A separate table seems like overkill.

It may seem complicated, but check out http://docs.doctrine-project.org/projects/doctrine-orm/en/latest/reference/association-mapping.html

It's a matter of describing the many-to-many relationship in the notation, then the rest is invisible. From what I can tell, we don't actually have to write any SQL -- ever. You could do something like $organizer->getPrograms() and it will just work. Pretty neat!

It may seem complicated, but check out http://docs.doctrine-project.org/projects/doctrine-orm/en/latest/reference/association-mapping.html

It's a matter of describing the many-to-many relationship in the notation, then the rest is invisible. From what I can tell, we don't actually have to write any SQL -- ever. You could do something like $organizer->getPrograms() and it will just work. Pretty neat!

I'm not sure I follow... No SQL? How does it know what tables, what columns?

I'm not sure I follow... No SQL? How does it know what tables, what columns?

We are still specifying the table and column names in the annotations, just not the join tables. It infers those and builds them based on the relationship you define. When it comes time to query them, Doctrine does the SQL for you. I'm still learning, but for example: https://doctrine-orm.readthedocs.io/en/latest/reference/working-with-associations.html#filtering-collections where $group->getUsers(); is analogous to $organizer->getPrograms(). So you don't write the JOIN clauses or whatever, or have to prepare your statements, and the "filtering API can work on the SQL level to make optimized access to large collections".

This gives me (mostly pleasant) flashbacks to the days I worked with Ruby on Rails and ActiveRecord. Some people really don't like ORM, but I think it will work well for us in dealing with our own database. For the replicas we will probably have to write some raw SQL, though.

The Doctrine-generated schema for what I have so far is at: https://github.com/wikimedia/grantmetrics/commit/585207c43d64cd9666b75697f6789e459745b8a9#diff-5f06c870def505220c94affeca73a45dR16

The rest of the code you see is very basic setup for the models, a mostly unappealing homepage, and some tests. Currently we're at 99% test coverage! :D

The schema could probably still be tweaked some, but I think we've got all the pieces in place, so if it's OK I'd like to get started.

Schema looks reasonable to me.

Looks good to me too. We can start.

Niharika moved this task from Needs Review/Feedback to Q1 2018-19 on the Community-Tech-Sprint board.
Niharika moved this task from In progress to Done on the Grant-Metrics board.