Page MenuHomePhabricator

Database Query improvements for the Newsletter extension
Open, LowPublic

Description

Review of 631882e3779 of Newsletter extension (Jan 29, 2017)

  • includes/NewsletterDb.php" line 503 - SELECT FOR UPDATE without a transaction. From what I understand this would cause an implicit transaction to be opened, with the locks being held for the duration of the request, which is not ideal. Should probably have an explicit transaction (as the code comment notes). - https://gerrit.wikimedia.org/r/#/c/354930/
  • Interaction with content handler?
  • Is deleteInactiveNewsletters actually expected to be run in Wikimedia? There's no batching, no waiting for slaves, etc.
    • Should also just use an associative array for IN (1,2,3,..) statements.
  • Not sure about having nl_description as a very large blob of text just floating there in the table (I know a long time ago it was me who suggested making it bigger. Sorry for the reversal). Should perhaps get DBA's opinion on this. Also, the table isn't really the canonical source (the wikipage is), so its kind of unnecessary to store the whole thing when the special page cuts it off after 644 bytes - thus I would perhaps suggest using a VARCHAR(644) instead. In any case, double check with a DBA about this. Perhaps its all fine since the newsletter table will be relatively small.
  • "NewsletterTablePager.php" line 61 - There is a unique constraint on nl_id, so I don't understand the use of DISTINCT nl_id here. Additionally, I don't believe that's the correct way of setting the DISTINCT option (the field name is not included when using as an "option" in the abstraction layer), so I don't think it would do anything anyways.
  • NewsletterTablePager.php line 59 - The whole subscriber count thing is of questionable scalability. Maybe ok for now if things are small. I don't know. Maybe should have caching or something. Maybe should ask DBA - after https://gerrit.wikimedia.org/r/#/c/354756/
  • NewsletterTablePager.php line 69 (The query) - `SELECT nl_name,nl_desc,nl_id,( SELECT COUNT(*) FROM nl_subscriptions WHERE nls_newsletter_id = nl_id ) AS subscribers,'1' IN (SELECT nls_subscriber_id FROM nl_subscriptions WHERE nls_newsletter_id = nl_id ) AS current_user_subscribed FROM nl_newsletters WHERE (nl_active = 1) ORDER BY current_user_subscribed DESC LIMIT 51 ;`
    • Using a subquery of the form <user_id> in (SELECT all subscribers to current newsletter) seems like a really icky way to check if the current user is subscribed. Fortunately MySQL seems to be able to optiize this correctly. Nonetheless for human sanity, I would suggest rewriting this as a left join. This also applies to the filter which could be implemented as a condition on the current_user_subscribed column instead of an identical subquery.
    • The "subscribers" subquery is probably not ok, although check with a DBA (Maybe you could get away with it on the assumption the table is fairly small).
    • sorting by the subscribers column causes this to be a full table scan filesort, which is probably not ok.
    • For the last two points, probably what needs to happen is there to be a nl_subscribers column that keeps count of how many subscribers, and is properly indexed for sorting.
  • "NewsletterTablePager.php" line 86 - Calling Newsletter::newFromID( (int)$id ); in formatValue() results in 4 db queries per every newsletter returned (For a total of 200 db queries). This seems a little excessive. At the very least it (probably the Newsletter class) should cache things so it doesn't query the same newsletter multiple times. The ideal scenario would be to reuse the results of the main special page query so that no additional queries had to be done (e.g. Some sort of non-private version of NewsletterDb::getNewslettersFromResult()). - https://gerrit.wikimedia.org/r/#/c/354749/
  • "NewsletterContent.php" line 115 - It doesn't make sense to call User::newFromName(...)->getId() on all the publishers and then call UserArray::newFromIDs. This would most likely cause a db query to be issued for each user individually, and then a db query for all the users together. Instad just user UserArray::newFromNames which should only use a single db query.

Event Timeline

Change 354636 had a related patch set uploaded (by 01tonythomas; owner: 01tonythomas):
[mediawiki/extensions/Newsletter@master] Remove bogus DISTINCT nl_id on already UNIQUE nl_id field

https://gerrit.wikimedia.org/r/354636

Change 354749 had a related patch set uploaded (by 01tonythomas; owner: 01tonythomas):
[mediawiki/extensions/Newsletter@master] Reduce the database query load by caching all newsletters locally

https://gerrit.wikimedia.org/r/354749

Change 354636 merged by Brian Wolff:
[mediawiki/extensions/Newsletter@master] Improve DB query on Special:Newsletters

https://gerrit.wikimedia.org/r/354636

Change 354749 merged by jenkins-bot:
[mediawiki/extensions/Newsletter@master] Reduce the database query load by caching all newsletters locally

https://gerrit.wikimedia.org/r/354749

Change 354751 had a related patch set uploaded (by 01tonythomas; owner: 01tonythomas):
[mediawiki/extensions/Newsletter@master] use UserArray::newFromNames instead of manually doing this

https://gerrit.wikimedia.org/r/354751

Change 354751 merged by jenkins-bot:
[mediawiki/extensions/Newsletter@master] use UserArray::newFromNames instead of manually doing this

https://gerrit.wikimedia.org/r/354751

Change 355006 had a related patch set uploaded (by 01tonythomas; owner: 01tonythomas):
[mediawiki/extensions/Newsletter@master] Use in the 'nl_subscriber_count' to compute subscriber count

https://gerrit.wikimedia.org/r/355006

Change 355006 merged by jenkins-bot:
[mediawiki/extensions/Newsletter@master] Use in the 'nl_subscriber_count' to compute subscriber count

https://gerrit.wikimedia.org/r/355006

Now that we are done with the security review, I guess the DB improvements at https://phabricator.wikimedia.org/T159083 looks like a blocker for deployment as per https://phabricator.wikimedia.org/T159081#3203545

Mmm... who decides whether this is a blocker or not? Are these database performance problems critical for a first deployment limited to new newsletters in MediaWiki.org (which is the current plan)?

Now that we are done with the security review, I guess the DB improvements at https://phabricator.wikimedia.org/T159083 looks like a blocker for deployment as per https://phabricator.wikimedia.org/T159081#3203545

Mmm... who decides whether this is a blocker or not? Are these database performance problems critical for a first deployment limited to new newsletters in MediaWiki.org (which is the current plan)?

Apparently there was one critical performance drag, and we fixed it late yesterday. The above comment is hence obsolete and what we have left should be trivial.

Qgil triaged this task as Low priority.Sep 6 2017, 9:03 PM
Qgil moved this task from Confirmed to Backlog on the MediaWiki-extensions-Newsletter board.

what we have left should be trivial.

Triaging accordingly.