
Rewrite Wishlist Survey bot
Open, High, Public

Description

Background

The WishlistSurvey task that has powered the Community Wishlist Surveys from 2016 to 2023 badly needs a rewrite. It is essentially 7 years of hacks built on top of each other and has a number of unnecessary dependencies, limitations, and other issues including but not limited to:

  • Written in Ruby (not widely used by Community Tech)
  • Must live on the musikbot tool on Toolforge
  • Requires manual runs to take advantage of various scripts
  • etc.

The bot should be rewritten so that people other than MusikAnimal can maintain it.

Acceptance criteria

  • It should be written in Node.js, sharing some logic with the intake form, such as parsing wish content
  • It should do all the things the old bot did, but match the new designs

More requirements and/or subtasks TBD as we get further into the project.

Details

Title | Reference | Author | Source Branch | Dest Branch
Allow the bot to run continuously | repos/commtech/wishlist-intake!78 | tstarling | work/tstarling/continuous-bot | main
Have the bot make a table of recent wishes for the dashboard | repos/commtech/wishlist-intake!72 | tstarling | work/tstarling/recent-wishes-bot | main
update-indexes: RC and caching | repos/commtech/wishlist-intake!48 | tstarling | work/tstarling/update-indexes | main
OOP CLI commands | repos/commtech/wishlist-intake!35 | tstarling | work/tstarling/cli-refactor | main
Add more wishlist pages, templates and basic bot run | repos/commtech/wishlist-intake!22 | musikanimal | templates | main
Add basic entrypoint for the bot | repos/commtech/wishlist-intake!20 | musikanimal | bot | main
Use wiki page when editing wishes, add centralized config and Util class | repos/commtech/wishlist-intake!17 | musikanimal | util | main

Event Timeline

musikanimal merged https://gitlab.wikimedia.org/repos/commtech/wishlist-intake/-/merge_requests/17

Use wiki page when editing wishes, add centralized config and Util class

Proposed implementation plan:

  • Use sqlite3 for data persistence, using the filesystem to avoid the need for an SSH tunnel to a Cloud Services database (or local MySQL-like dependency)
  • Bot runs every 10 minutes, and we keep track of the last run in the db (because sometimes Toolforge jobs fail)
  • On each run, it loops through all wishes, finding those that have been edited since the last run (see the sketch after this list)
  • Parse each of the new proposals, updating the db
  • Update each of the index pages by running the appropriate query and building the output
    • Maybe have a placeholder, e.g. <div id="wishlist-intake-wish-table">, so the bot knows where it should do its editing.
    • Utilize on-wiki templates as much as possible. E.g. the "Focus area" cards or each row in the "Wishes" table would be a template.
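
To make that loop concrete, here's a rough sketch assuming better-sqlite3 for the persistence piece; fetchWishesEditedSince(), parseWish(), upsertWish() and rebuildIndexPages() are hypothetical helpers, not existing code:

```js
// Sketch only; assumes the better-sqlite3 package. Helper functions are hypothetical.
const Database = require('better-sqlite3');

const db = new Database('wishlist.db');
db.exec("CREATE TABLE IF NOT EXISTS meta (key TEXT PRIMARY KEY, value TEXT)");

async function runOnce(api) {
  // Capture the start time first, so edits made during the run aren't skipped next time.
  const thisRun = new Date().toISOString();
  const row = db.prepare("SELECT value FROM meta WHERE key = 'last_run'").get();
  const lastRun = row ? row.value : '1970-01-01T00:00:00Z';

  // Re-parse every wish edited since the last successful run, updating the db.
  for (const wish of await fetchWishesEditedSince(api, lastRun)) {
    upsertWish(db, parseWish(wish));
  }

  // Regenerate the index pages from the db.
  await rebuildIndexPages(api, db);

  // Only advance the marker once everything above has succeeded,
  // so a failed Toolforge job is retried from the same point.
  db.prepare(
    "INSERT INTO meta (key, value) VALUES ('last_run', ?) " +
    "ON CONFLICT(key) DO UPDATE SET value = excluded.value"
  ).run(thisRun);
}
```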

My original plan was to save the database work for later, but given how the index pages are laid out, it's almost easier to put everything in a db than to do all the sorting and whatnot in JavaScript.

Going a step further, maybe the gadget should run the queries and update the index pages? This would require us to have a Vite server for the backend with a simple CORS-compliant API. That's more work, but moves presentation logic to the frontend where it belongs (as Vue templates).

Thoughts? cc @tstarling

> Proposed implementation plan:
>
>   • Use sqlite3 for data persistence, using the filesystem to avoid the need for an SSH tunnel to a Cloud Services database (or local MySQL-like dependency)

I would prefer MySQL over SQLite+NFS. I agree that it's good to avoid the need for an SSH tunnel, but a local database with a few cache tables and no wiki views should be easy enough to set up. The SQLite FAQ warns against running it over NFS, since if two copies of the job are somehow run, the database file could be permanently corrupted, requiring human intervention.

> Going a step further, maybe the gadget should run the queries and update the index pages? This would require us to have a Vite server for the backend with a simple CORS-compliant API. That's more work, but moves presentation logic to the frontend where it belongs (as Vue templates).

I thought we were going to allow changes to the proposal pages that were not done by the gadget. Like API and no-JS edits.

> I would prefer MySQL over SQLite+NFS. I agree that it's good to avoid the need for an SSH tunnel, but a local database with a few cache tables and no wiki views should be easy enough to set up. The SQLite FAQ warns against running it over NFS, since if two copies of the job are somehow run, the database file could be permanently corrupted, requiring human intervention.

Great point about NFS! That makes SQLite pretty much a no-go. Unless we jump the gun and go for VPS? There's zero need for that with a simple Node.js bot, but if we start offering an API for the gadget to read from (see below), then we might eventually stretch our limits on Toolforge and have to move to VPS – as is the story of many CommTech tools.

I'm fine with MySQL/MariaDB too if that makes more sense, and that's VPS future-proof as well. In credentials.json, each entry could have its own db credentials, such that you could always use tools-db if you wanted, etc. Sounds nice!
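
For illustration, an entry might look something like this; every key and value here is a hypothetical placeholder rather than an existing schema (JSON allows no comments, so the hedging has to live here):

```json
{
  "metawiki": {
    "apiUrl": "https://meta.wikimedia.org/w/api.php",
    "username": "WishlistBot",
    "password": "********",
    "db": {
      "host": "tools.db.svc.wikimedia.cloud",
      "database": "s00000__wishlist",
      "username": "s00000",
      "password": "********"
    }
  }
}
```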

>> Going a step further, maybe the gadget should run the queries and update the index pages? This would require us to have a Vite server for the backend with a simple CORS-compliant API. That's more work, but moves presentation logic to the frontend where it belongs (as Vue templates).

> I thought we were going to allow changes to the proposal pages that were not done by the gadget. Like API and no-JS edits.

I meant more for the "index" pages (i.e. the dashboard), not individual proposals. Manual edits to proposals are exactly why we need a bot! What I'm questioning, I guess, is the time investment in making an API that the gadget can read from when it builds the index pages. We can certainly save that effort for later, at least.

The index pages are going to be one per focus area, aren't they? And the focus areas will be dynamically definable by e.g. creating a page with a {{Community Wishlist/Focus area}} template (with other metadata such as descriptions)? All of which is pretty much like the old categories system.

And that all sounds great for focus areas, because I assume we'll aim to never have more than a dozen proposals in each or whatever, but will we at some point end up with index pages (e.g. the list of focus areas) that are too large for a single page? In that situation, do we build a manual pagination system (akin to archive pages) that the bot can manage?

We had talked about having a dynamic index page, that looks up the data live from a toolforge API, but I'm guessing that's out of scope now?

(Sorry, maybe that's all just premature optimisation for when there are a thousand proposals!)

Using MariaDB sounds good, with the connection config living alongside the wiki API details in credentials.json. Another advantage of having the data in a toolsdb is that (after T348407) it'll be queryable in Quarry (or maybe Superset soon; I thought there was a Community Wishlist Survey task for that but I can't find it now).

> The index pages are going to be one per focus area, aren't they? And the focus areas will be dynamically definable by e.g. creating a page with a {{Community Wishlist/Focus area}} template (with other metadata such as descriptions)? All of which is pretty much like the old categories system.

That sounds correct, but FYI by "index page" I also refer to the "Wishes" page that lists all wishes.

> And that all sounds great for focus areas, because I assume we'll aim to never have more than a dozen proposals in each or whatever, but will we at some point end up with index pages (e.g. the list of focus areas) that are too large for a single page? In that situation, do we build a manual pagination system (akin to archive pages) that the bot can manage?

TBD. For now I say we just throw it all on one page; it will be a while before it becomes a problem.

> We had talked about having a dynamic index page, that looks up the data live from a toolforge API, but I'm guessing that's out of scope now?

If we're storing the data in MariaDB, then it's not too far off to go ahead and make a lightweight server. I am looking into using Vite for this, which would also replace Rollup and possibly the need for a dedicated Node server. If that all works out, it should be relatively simple to add an Express API app to serve the data to wishlist intake.
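
If that pans out, the API app could stay tiny. A sketch assuming the express, cors, and mysql2 packages, with made-up route and column names:

```js
// Sketch only; the route and column names are illustrative, not a final API.
const express = require('express');
const cors = require('cors');
const mysql = require('mysql2/promise');

const app = express();
// Only the wikis the gadget runs on should be allowed origins.
app.use(cors({ origin: 'https://meta.wikimedia.org' }));

// Connection settings would come from credentials.json, as discussed above.
const pool = mysql.createPool({ host: 'localhost', database: 'wishlist' });

app.get('/wishes', async (req, res) => {
  const [rows] = await pool.query(
    'SELECT wish_page, wish_title, wish_area FROM wishes ORDER BY wish_updated DESC'
  );
  res.json(rows);
});

app.listen(process.env.PORT || 8000);
```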

I suggest renaming bot.js to update-indexes.js.

Also, if it were up to me, it would all be OOP; I wouldn't put the bulk of the logic in the file scope. I'd just have a couple of lines of boilerplate in bin/ and reorganise src/ so that the bot-related classes can go there. Reorganising is going to be painful when we have 4 people editing the same files every day, but maybe it's not too bad if we aggressively merge all outstanding work into the main branch. Keep in mind that the main branch doesn't have to be production quality at this stage.
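
To illustrate the kind of split I mean (file and class names are placeholders):

```js
// bin/update-indexes.js -- just boilerplate; the real logic lives in a src/ class.
const UpdateIndexes = require('../src/bot/UpdateIndexes.js');

new UpdateIndexes().run().catch((err) => {
  console.error(err);
  process.exit(1);
});
```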

Regarding the main query: currently the script uses generator=categorymembers feeding into prop=revisions. It would be possible instead to use generator=recentchanges feeding into prop=categories with clcategories=Category:Community_wishlist. Meta gets about 7 RC rows per minute in total, so if you run the bot once per minute, generator=recentchanges becomes the more efficient query strategy once you have more than 7 wishes.

However, generator=recentchanges requires some extra work to process removal from a category, page moves and deletion.

Either way, you need to follow up with a prop=revisions query to get the changed page text.
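
Roughly, assuming an API client with a get(params) method and a lastRunTimestamp from the cache DB, the two requests would look like this (parameter values are illustrative):

```js
// Step 1: pages changed since the last run, with their membership in the wishlist
// category (recentchanges lists newest first, so the older bound goes in grcend).
const changed = await api.get({
  action: 'query',
  generator: 'recentchanges',
  grcend: lastRunTimestamp,
  grclimit: 'max',
  prop: 'categories',
  clcategories: 'Category:Community_wishlist',
});

// Keep only pages still in the category; 'categories' is absent otherwise.
const pageIds = Object.values(changed.query.pages)
  .filter((p) => p.categories)
  .map((p) => p.pageid);

// Step 2: follow up with prop=revisions for the changed page text.
const content = await api.get({
  action: 'query',
  pageids: pageIds.join('|'),
  prop: 'revisions',
  rvprop: 'content|timestamp',
  rvslots: 'main',
});
```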

I would lean towards regenerating the whole index page each time the metadata suggests there will be a change in it, rather than trying to identify and preserve unchanged wikitext.

In the cache DB I imagine there will be the broken-down template parameters from WishlistTemplate.getData(), and also any necessary metadata such as the latest timestamp.
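
For instance, something along these lines; apart from wish_area, the column names are guesses rather than a settled schema:

```sql
-- Sketch: one row per wish with the broken-down template parameters,
-- plus a small metadata table for things like the latest timestamp.
CREATE TABLE wish (
  wish_page    VARCHAR(255) PRIMARY KEY,  -- permanent subpage name
  wish_title   VARCHAR(255),
  wish_area    VARCHAR(255),
  wish_updated VARBINARY(14)              -- MW-style timestamp
);

CREATE TABLE meta (
  meta_key   VARCHAR(32) PRIMARY KEY,     -- e.g. 'last_run'
  meta_value BLOB
);
```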

With !59 I'm proposing the following for bot-automated pages:

  • Template:Community Requests/Wishes/All – Contains rows for all wishes (bot already does this; see the example after this list)
  • Template:Community Requests/Wishes/Recent – last N proposals sorted by date
  • Template:Community Requests/Focus areas – Contains overview cards of all focus areas, populated with templates similar to what we do for Wishes/All
  • Template:Community Requests/Focus areas/Num_votes – Contains the number of votes across all focus areas. This is needed in the i18n messages, as it seems they're likely to live on-wiki at least initially (T365471)
  • Template:Community Requests/Focus areas/<focus-area-slug> – Contains rows for wishes in a focus area.
  • Template:Community Requests/Focus areas/<focus-area-slug>/Num_votes – Contains the number of votes in a focus area, again for use in i18n messages.
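
For example, Wishes/All could end up as one template call per row, along these lines (the row template and its parameters are illustrative only):

```
{{Community Requests/Wish row |page=... |title=... |area=... |votes=...}}
{{Community Requests/Wish row |page=... |title=... |area=... |votes=...}}
```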

Random tech debt thing -- we have a few terms for whatever the subpage name for a wish/focus area is. Is slug a good name and if so maybe the db and backend things could be renamed? Or vice versa :)

> With !59 I'm proposing the following for bot-automated pages:

I should mention, the other option of course is to build a wrapper API and let the gadget talk to it. But between CSP warnings and whatever else, I'm assuming the old-fashioned way will do fine for now.

> Random tech debt thing -- we have a few terms for whatever the subpage name for a wish/focus area is. Is slug a good name and if so maybe the db and backend things could be renamed? Or vice versa :)

Is this WordPress jargon? To be honest I had to look it up. In the template we have "area", which is in the database as wish_area. I would just call it the area name. The URL form of the area name would have underscores and percent-encoding per normal MW conventions. "Focus" seems undecided since the renaming page calls them "problem areas" -- that's the same thing, right?

>> Random tech debt thing -- we have a few terms for whatever the subpage name for a wish/focus area is. Is slug a good name and if so maybe the db and backend things could be renamed? Or vice versa :)
>
> Is this WordPress jargon? To be honest I had to look it up. In the template we have "area", which is in the database as wish_area. I would just call it the area name. The URL form of the area name would have underscores and percent-encoding per normal MW conventions. "Focus" seems undecided since the renaming page calls them "problem areas" -- that's the same thing, right?

Eh I don't know, surprised you hadn't heard of it! I've used the term extensively over the years, and I've never worked on anything WordPress.

Anyway, by "slug" I just mean the unique identifier in the URL path. One thing we're doing from previous surveys is we're not moving pages when wishes (or focus areas) are renamed. Instead, there's a permanent page title. The subpage name of that is what I'm referring to by "slug". So the "area" name in that case would need to be the subpage name, and not the "area" title. I just figured calling it "area_slug" or something (and likewise for the wish table) would keep things consistent, but it doesn't really matter much.

"Focus" seems undecided since the renaming page calls them "problem areas" -- that's the same thing right?

Yes, focus area, topic area, and problem area are all the same thing. Naming won't be finalized until the rename vote has finished.