
Add Link engineering: Provide a mechanism for storing data about which link recommendations were rejected by the user
Closed, Resolved · Public

Description

There are a few concerns in this task.

From an instrumentation perspective, we will log information about number of rejected / approved / skipped link suggestions when a user makes an edit.

At a more granular level, it would be useful to know which links were rejected so that the algorithm can be improved. The rejected link and the rejection reason will only need to be stored in EventLogging, because the software will not need those details in order to show the user anything. The software just needs to be able to know how many links were rejected by the user in a given article. The only exception is the edit summary (T269657), which shows an accounting of which links the user rejected and skipped.

For the first version of "add a link", we do not need to worry about showing a second user a link suggestion that was rejected by a previous user. We will rely on the small number of users and large number of articles to minimize collisions. This is something we will want to analyze and decide whether to address in the future, so we should plan to be able to query how many articles get re-suggested after receiving rejected links.

Event Timeline

@MMiller_WMF I think this needs some more product requirements before we can discuss ideas for technical implementation. If a single user says "No" to a suggestion does that mean it should not show again? If we require multiple "No" votes, how many? Do we require the user to save or otherwise complete the session before registering that "No"? It seems like this could get pretty complicated fairly quickly.

@MGerlach do you have any thoughts on this? Specifically, on your capacity to implement a block list in research/mwaddlink in this quarter? I was thinking that when we call the link recommendation service, in addition to the wikitext, we would also provide a list of known link recommendations that we want excluded from the return value.

On the other hand, we could also filter out link recommendations we want excluded after we get results back from the link recommendation service, so you wouldn't necessarily need to do anything on your end.

As mentioned in T261411#6578439, one idea for recording the rejected/skipped/accepted set is to use the existing row in the growthexperiments_link_recommendations table. Rather than removing this row when an article in the queue is edited via a link suggestion, we could update it to include the set of rejected/accepted/skipped link recommendations. Then, if the same article is selected in the future for new link recommendation generation, we could pass the set of rejected data to the link recommendation service, where it could be included in the list of links to exclude from suggestions.

> @MGerlach do you have any thoughts on this? Specifically, on your capacity to implement a block list in research/mwaddlink in this quarter? I was thinking that when we call the link recommendation service, in addition to the wikitext, we would also provide a list of known link recommendations that we want excluded from the return value.

This is possible. When generating recommendations for an individual article, we already have an exclude-list of links consisting of i) the already existing links extracted from the current wikitext in order to avoid recommending existing links, ii) the already recommended links in order to avoid recommending the same link several times (i.e. we grow the exclude-list as we generate recommendations).
Thus, if we were able to query a table of already rejected links, we could simply add those to the exclude-list. This would ensure we do not recommend those links again.
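The exclude-list behavior described above can be sketched as follows; the function and parameter names are illustrative, not the actual research/mwaddlink code:

```python
def generate_recommendations(candidates, existing_links, rejected_links):
    """Yield link recommendations, skipping excluded targets.

    candidates: iterable of (anchor_text, target) pairs in document order.
    existing_links: targets already linked in the current wikitext.
    rejected_links: targets previously rejected by users (e.g. queried
        from a rejection table, as proposed above).
    """
    # Seed the exclude-list with both kinds of known-bad targets.
    exclude = set(existing_links) | set(rejected_links)
    recommendations = []
    for anchor, target in candidates:
        if target in exclude:
            continue
        recommendations.append((anchor, target))
        # Grow the exclude-list as we go, so the same target is not
        # recommended several times in one article.
        exclude.add(target)
    return recommendations
```

Rejected links queried from storage are treated exactly like pre-existing links here: both simply seed the exclude-list before generation begins.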

The high-level decision here is whether to use EventGate (easy to do, storage size is not an issue, easy to integrate with analytics tooling, not possible to use in MediaWiki e.g. to suppress recommendations in real time) or the MediaWiki DB (more effort, storage size is a concern although not a huge one, not sure about analytics, can be reused immediately). Either works for instrumentation, and for creating periodic dumps for machine learning feedback, but only a DB-based solution works for immediately removing recommendations after they get rejected.

I guess EventGate could also feed the data into some standalone database that's used by the service for filtering. I don't have a good grasp of how difficult that would be (it sounds easy, but that might just be my ignorance talking).

In the MediaWiki DB, I can think of three options:

  • logging table. This would imply that the data is public. It's a flat structure so it'd probably have to be one row per link added/rejected; doesn't seem like a good choice.
  • JADE. Not quite sure of the future of that component with the changes in Scoring team, we should probably check in with them.
  • custom table(s).

One option is to have some kind of rejection counter: (page_id, target_title, rejection_count). (I assume we only care about the target of rejected links, and not their link text.) Another is to try to handle this and T266473: Add Link engineering: Provide a mechanism for recording credit to a user if they review all link recommendations with "no" or "skip" in the same table - we'd then end up with something like (page_id, user_id, target_title, user_action) and an index on (user_action, page_id, target_title) for blacklisting and (user_id, page_id) for contributions. Assuming an average of 10 recommendations / page and 10x the current scale, the latter would be about 10M rows per year. That's significant but not tragic, I think.
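The combined-table idea and its two indexes can be made concrete with a runnable sketch (sqlite3 stands in for MariaDB here; the column names follow the comment, and the action strings are assumed, not a shipped schema):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE link_review (
  page_id INTEGER NOT NULL,
  user_id INTEGER NOT NULL,
  target_title TEXT NOT NULL,
  user_action TEXT NOT NULL  -- assumed codes: 'accepted' / 'rejected' / 'skipped'
);
-- For blacklisting: which targets were rejected on a given page?
CREATE INDEX idx_action_page_target
  ON link_review (user_action, page_id, target_title);
-- For contributions: which pages has a given user reviewed?
CREATE INDEX idx_user_page ON link_review (user_id, page_id);
""")

conn.executemany(
    "INSERT INTO link_review VALUES (?, ?, ?, ?)",
    [(1, 100, "Cat", "rejected"),
     (1, 101, "Cat", "rejected"),
     (1, 100, "Dog", "accepted")],
)

# Blacklisting query: distinct rejected targets on page 1.
rejected = [t for (t,) in conn.execute(
    "SELECT DISTINCT target_title FROM link_review "
    "WHERE user_action = 'rejected' AND page_id = 1")]
print(rejected)  # ['Cat']
```

Both query patterns are leftmost-prefix scans of their respective index, which is what keeps them cheap even at ~10M rows per year.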

> The high-level decision here is whether to use EventGate (easy to do, storage size is not an issue, easy to integrate with analytics tooling, not possible to use in MediaWiki e.g. to suppress recommendations in real time) or the MediaWiki DB (more effort, storage size is a concern although not a huge one, not sure about analytics, can be reused immediately). Either works for instrumentation, and for creating periodic dumps for machine learning feedback, but only a DB-based solution works for immediately removing recommendations after they get rejected.
>
> I guess EventGate could also feed the data into some standalone database that's used by the service for filtering. I don't have a good grasp of how difficult that would be (it sounds easy, but that might just be my ignorance talking).
>
> In the MediaWiki DB, I can think of three options:
>
>   • logging table. This would imply that the data is public. It's a flat structure so it'd probably have to be one row per link added/rejected; doesn't seem like a good choice.
>   • JADE. Not quite sure of the future of that component with the changes in Scoring team, we should probably check in with them.
>   • custom table(s).

I was thinking we would have both, since we have two different use cases. I was thinking that we would use EventGate for our analytics: when users reject / skip / accept suggestions we'd log that data alongside the rest of the things we've instrumented in the suggested edits flow.

And then for keeping track of the rejected links on a page, I was thinking that we could reuse the table that is proposed in the patch for T261410. We would add one more column, like:

{
	"name": "gelr_rejected_links",
	"comment": "Link recommendation data rejected links as an arbitrary JSON object.",
	"type": "binary",
	"options": { "length": 16777215, "notnull": true }
}

Then the workflow would look like:

  1. User saves an edit with some accepted and some rejected links
  2. We update the row in the growthexperiments_link_recommendations table with the JSON shaped rejected links.

Then, if this article is selected again by the maintenance script (or via job queue, if we decide to do that) for regenerating links, the maintenance script checks to see if there is an entry in growthexperiments_link_recommendations with rejected links data, and it sends those along in the POST request to the link recommendation service.

Previously we talked about dropping rows from growthexperiments_link_recommendations table when an article is edited. We would still do that if there is no data in the gelr_rejected_links column; if there is data in the column we should retain it. We would not be at risk of showing outdated link recommendations to users since 1) the search index knows which articles have up-to-date recommendations and 2) we can cross reference the revision_id to double-check that we are looking at the latest revision.
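The workflow above could look roughly like this; the row shape and the payload keys are assumptions for illustration, not the actual GrowthExperiments code:

```python
def on_suggested_edit_saved(row):
    """After an article in the queue is edited via a link suggestion, decide
    whether to drop its growthexperiments_link_recommendations row or retain
    it for the sake of the stored rejected-links data."""
    return "retain" if row.get("gelr_rejected_links") else "drop"

def build_regeneration_request(wikitext, row):
    """Build a hypothetical POST payload for the link recommendation service,
    forwarding previously rejected link targets as an exclusion list."""
    rejected = row.get("gelr_rejected_links") or []
    return {
        "wikitext": wikitext,
        "exclude_links": [r["target"] for r in rejected],
    }
```

The key point is that the drop-or-retain decision depends only on whether the gelr_rejected_links column is populated, so rows without rejections are still cleaned up as before.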

Do we need to differentiate between links rejected by N users (which are blacklisted forever, for a given page) and links rejected by the current user (which we don't want to show to that user again)? Or will we never give the same user the same article again, so it doesn't matter?

I would avoid mixing cache data and primary data in the same table; there is little benefit (both are identified by page/revision, so it's easy to denormalize), and it makes migrations, model changes and the like a lot more complicated. (Which reminds me, we should probably put a model version somewhere in T261410. Not sure if it needs to be a column or just a JSON field.) I'm not sure we really need to preserve the recommendation for the user rejection info to make sense, either. All the ML system really needs is a teaching set (list of links labeled with good/bad), right?

Change 653651 had a related patch set uploaded (by Gergő Tisza; owner: Gergő Tisza):
[mediawiki/extensions/GrowthExperiments@master] Add Link: API endpoint for submitting the user's choices

https://gerrit.wikimedia.org/r/653651

Change 654034 had a related patch set uploaded (by Gergő Tisza; owner: Gergő Tisza):
[mediawiki/extensions/GrowthExperiments@master] Add Link: Store user reviews of link recommendations

https://gerrit.wikimedia.org/r/654034

Uses a new DB table:

CREATE TABLE /*_*/growthexperiments_link_submissions (
  gels_revision INT UNSIGNED NOT NULL,
  gels_target INT UNSIGNED NOT NULL,
  gels_page INT UNSIGNED NOT NULL,
  gels_user INT UNSIGNED NOT NULL,
  gels_feedback VARCHAR(1) NOT NULL,
  INDEX gels_page_feedback_target (
    gels_page, gels_feedback, gels_target
  ),
  PRIMARY KEY(gels_revision, gels_target)
);

Takes about 20 bytes per row, we'll have about 10 links per task, we have about 3000 suggested edits a week, and hope to scale up about 10x over time. So that's about 300MB per year, pretty negligible. I used page IDs instead of title strings to keep size low but maybe that was overkill.
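Quick arithmetic behind the estimate above (all inputs are the figures quoted in the comment):

```python
BYTES_PER_ROW = 20       # ~4 INT columns plus a VARCHAR(1)
LINKS_PER_TASK = 10
TASKS_PER_WEEK = 3_000
GROWTH_FACTOR = 10       # hoped-for scale-up
WEEKS_PER_YEAR = 52

rows_per_year = LINKS_PER_TASK * TASKS_PER_WEEK * GROWTH_FACTOR * WEEKS_PER_YEAR
size_mb = rows_per_year * BYTES_PER_ROW / 1_000_000
print(rows_per_year, size_mb)  # 15600000 rows, ~312 MB: "about 300MB per year"
```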

@MMiller_WMF @nettrom_WMF are there product or analytics reasons to store specific rejection reasons for links (from T269647) in the database? I imagine we will record those rejection reasons in our event logging data, so we would be duplicating that data if we stored it in the growthexperiments_link_submissions database table, but we could do it if there's a useful reason to have it there.

@kostajh -- I read back through the whole task, and I noticed that, back in October, the comments were talking about whether/how we would suppress rejected links from future recommendations. I think we decided outside of Phabricator that we would not worry about that functionality for now because the randomness in the system would make it rare that a user bumped into the same article more than once. That's correct, right? If so, I will update the task description to say so.

Regarding the question that you asked about storing rejection reasons: I think you're saying that if we only want them for asynchronous analytics purposes, EventLogging should suit us. But if there are going to be some features that require knowledge of rejection reasons (e.g. if the user were to see a summary of the rejection reasons they've given), they would need to be in the database. Is that correct? If so, no, there are no parts of the feature set that require the rejection reasons. They are just for our storage and analytics.

I will also update the task description when I hear back from you.

The probability of no three users ever getting the same page, given t finished add link tasks and p task pages, is something like e^(-t^3 / (6p^2)) (it's a generalized birthday problem). In other words, it starts getting significant when the number of finished tasks nears the cube root of the square of the number of pages. For a wiki with a million pages, that's ten thousand tasks - not that much.
(I'm using three users because the specification said if two users rejected a link the third should not see it.)
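Plugging numbers into the approximation above (a minimal sketch; the formula is the one given in the comment):

```python
import math

def p_no_triple(tasks, pages):
    """Approximate probability that no page is reviewed by three users,
    P ~ exp(-t^3 / (6 p^2)), per the generalized birthday problem."""
    return math.exp(-tasks**3 / (6 * pages**2))

pages = 1_000_000
for tasks in (1_000, 10_000, 50_000):
    print(tasks, round(1 - p_no_triple(tasks, pages), 4))
# At t = 10,000 (the cube root of p^2), the chance that some page already has
# three reviewers is roughly 15%; by 50,000 tasks a collision is near-certain.
```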

> @kostajh -- I read back through the whole task, and I noticed that, back in October, the comments were talking about whether/how we would suppress rejected links from future recommendations. I think we decided outside of Phabricator that we would not worry about that functionality for now because the randomness in the system would make it rare that a user bumped into the same article more than once. That's correct, right? If so, I will update the task description to say so.

Yes, I think that's what we talked about.

> Regarding the question that you asked about storing rejection reasons: I think you're saying that if we only want them for asynchronous analytics purposes, EventLogging should suit us. But if there are going to be some features that require knowledge of rejection reasons (e.g. if the user were to see a summary of the rejection reasons they've given), they would need to be in the database. Is that correct? If so, no, there are no parts of the feature set that require the rejection reasons. They are just for our storage and analytics.

Yes, that's an accurate summary of what I'm saying. OK, so we will not store the specific rejection reasons in the database, but we will include this in event logging.

> The probability of no three users ever getting the same page, given t finished add link tasks and p task pages, is something like e^(-t^3 / (6p^2)) (it's a generalized birthday problem). In other words, it starts getting significant when the number of finished tasks nears the cube root of the square of the number of pages. For a wiki with a million pages, that's ten thousand tasks - not that much.
> (I'm using three users because the specification said if two users rejected a link the third should not see it.)

Well, it's not just no three users ever getting the same page, it's users getting the same page that had rejected links. We don't know what percentage of tasks in practice will contain rejected links. My inclination is to run some queries after the feature is live so we can see how often this occurs in practice, then decide how to prioritize the work needed to keep this scenario from coming up.

It doesn't seem like it would be too difficult to include some code for this now in the service and in our maintenance script, I'd just prefer to push that until after the initial release. @MMiller_WMF please let me know what your preference is.

Change 653651 merged by jenkins-bot:
[mediawiki/extensions/GrowthExperiments@master] Add Link: API endpoint for submitting the user's choices

https://gerrit.wikimedia.org/r/653651

> It doesn't seem like it would be too difficult to include some code for this now in the service and in our maintenance script, I'd just prefer to push that until after the initial release. @MMiller_WMF please let me know what your preference is.

The current patch does already include it for the maintenance script.

Flagging this for DBA review. The table is proposed in https://gerrit.wikimedia.org/r/c/mediawiki/extensions/GrowthExperiments/+/654034 (json, sql). T266446#6718421 has a size estimate. The table would exist for the same wikis as the one discussed in T261410: Add a link engineering: Create MySQL table for caching link recommendations (and presumably be in the same DB cluster?).

Change 655807 had a related patch set uploaded (by Gergő Tisza; owner: Gergő Tisza):
[mediawiki/extensions/WikimediaMaintenance@master] Add growthexperiments_link_submissions.sql table for GrowthExperiments

https://gerrit.wikimedia.org/r/655807

> It doesn't seem like it would be too difficult to include some code for this now in the service and in our maintenance script, I'd just prefer to push that until after the initial release. @MMiller_WMF please let me know what your preference is.
>
> The current patch does already include it for the maintenance script.

Ah, well that makes it an easy choice then :) I hadn't gotten to that patch before commenting.

> Flagging this for DBA review. The table is proposed in https://gerrit.wikimedia.org/r/c/mediawiki/extensions/GrowthExperiments/+/654034 (json, sql). T266446#6718421 has a size estimate. The table would exist for the same wikis as the one discussed in T261410: Add a link engineering: Create MySQL table for caching link recommendations (and presumably be in the same DB cluster?).

Thanks for the ping @Tgr and also for the size estimates - I agree, 300MB per year is pretty tiny. I would assume this table won't get many reads either, no?

> I would assume this table won't get many reads either, no?

Indeed. It will be used in three ways:

  • Prevent duplicate submissions. This is a straightforward primary key prefix check, just before the table is written (so maybe a few tens of thousands of times a week).
  • Prevent links which have already been rejected a certain number of times from being suggested again. That's one query per generated recommendation; recommendation generation will probably happen a few times a second, in a controlled manner on a job runner. The query is a range scan along the gels_page_feedback_target index; I expect the scanned range (the number of reviewed links per page) to be very small.
  • Display a user's contributions in some compact form (e.g. count the number of links they have reviewed). This is not yet specified and won't be in the initial version. It will be some kind of range scan over all contributions of the user. We'll have a lot of freedom to make sure this doesn't cause performance problems - it is a caching-friendly use case, and we can probably use limits liberally, as our focus is supporting new users who don't have many edits yet.
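The second read pattern can be illustrated against the proposed schema (sqlite3 is used as a stand-in for MariaDB, and the single-character feedback codes, e.g. 'r' for rejected, are an assumption):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE growthexperiments_link_submissions (
  gels_revision INTEGER NOT NULL,
  gels_target INTEGER NOT NULL,
  gels_page INTEGER NOT NULL,
  gels_user INTEGER NOT NULL,
  gels_feedback TEXT NOT NULL,
  PRIMARY KEY (gels_revision, gels_target)
);
CREATE INDEX gels_page_feedback_target
  ON growthexperiments_link_submissions (gels_page, gels_feedback, gels_target);
""")

conn.executemany(
    "INSERT INTO growthexperiments_link_submissions VALUES (?, ?, ?, ?, ?)",
    [(500, 7, 42, 1, 'r'),   # user 1 rejected target 7 on page 42
     (501, 7, 42, 2, 'r'),   # user 2 rejected it too
     (500, 9, 42, 1, 'a')],  # target 9 was accepted
)

# Rejection counts per target on page 42: a short range scan over the
# (gels_page, gels_feedback, gels_target) index.
rejections = dict(conn.execute(
    "SELECT gels_target, COUNT(*) FROM growthexperiments_link_submissions "
    "WHERE gels_page = ? AND gels_feedback = 'r' GROUP BY gels_target", (42,)))
print(rejections)  # {7: 2}
```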

Thanks @Tgr for the detailed explanations. I am OK with that, and we can further explore the user contributions feature when the time arrives; given how small the table is, I am not sure we'll see performance problems there anyway, but we can check the optimizer trace then.
Thanks again

Change 654034 merged by jenkins-bot:
[mediawiki/extensions/GrowthExperiments@master] Add Link: Store user reviews of link recommendations

https://gerrit.wikimedia.org/r/654034

Change 656834 had a related patch set uploaded (by Gergő Tisza; owner: Gergő Tisza):
[mediawiki/extensions/GrowthExperiments@master] [WIP] Record link offset on link recommendation submission

https://gerrit.wikimedia.org/r/656834

Change 656834 merged by jenkins-bot:
[mediawiki/extensions/GrowthExperiments@master] Record link offset on link recommendation submission

https://gerrit.wikimedia.org/r/656834

I don't think there is anything to QA here, but leaving open for now so @Etonkovidova can decide.

Change 655807 merged by jenkins-bot:
[mediawiki/extensions/WikimediaMaintenance@master] Add growthexperiments_link_submissions.sql table for GrowthExperiments

https://gerrit.wikimedia.org/r/655807

@Tgr, @kostajh - if the info below doesn't require further actions, the task may be closed, although I expected that information on users' rejection reasons would be stored in the betalabs DB.

Checked in betalabs.
The table has been created, but it's empty. Users' actions on Add links are not stored there.

MariaDB [cswiki]> show table status from cswiki where name='growthexperiments_link_submissions'\G
*************************** 1. row ***************************
            Name: growthexperiments_link_submissions
          Engine: InnoDB
         Version: 10
      Row_format: Compact
            Rows: 0
  Avg_row_length: 0
     Data_length: 16384
 Max_data_length: 0
    Index_length: 16384
       Data_free: 0
  Auto_increment: NULL
     Create_time: 2021-03-10 15:52:48
     Update_time: NULL
      Check_time: NULL
       Collation: binary
        Checksum: NULL
  Create_options: 
         Comment: 
Max_index_length: 0
       Temporary: N
1 row in set (0.00 sec)
MariaDB [cswiki]> SELECT  table_name AS `Table`, table_schema AS `wiki`, round(((data_length + index_length) / 1024 / 1024), 2) `Size in MB`  FROM information_schema.TABLES  WHERE  table_name in ( "growthexperiments_link_recommendations", "growthexperiments_link_submissions");
+----------------------------------------+------------+------------+
| Table                                  | wiki       | Size in MB |
+----------------------------------------+------------+------------+
| growthexperiments_link_recommendations | cswiki     |       0.41 |
| growthexperiments_link_submissions     | cswiki     |       0.03 |
| growthexperiments_link_recommendations | hiwiki     |       0.03 |
| growthexperiments_link_submissions     | hiwiki     |       0.03 |
| growthexperiments_link_recommendations | cawiki     |       0.03 |
| growthexperiments_link_recommendations | ruwiki     |       0.03 |
| growthexperiments_link_submissions     | ruwiki     |       0.03 |
| growthexperiments_link_recommendations | hewiki     |       0.03 |
| growthexperiments_link_submissions     | hewiki     |       0.03 |

Submitting reviews of recommendations is not implemented yet.

> Submitting reviews of recommendations is not implemented yet.

Thanks! If it's part of this task, should the task stay in QA or be moved?

The backend part, which this task is about, is implemented, but the frontend part is not (T269657: Add a link: edit summary and publish). I'd leave it in QA if that's OK with you.

Re-checked betalabs growthexperiments_link_submissions tables - they are still empty.

growthexperiments_link_submissions is not a public table, so I cannot check it on Quarry. Querying by "Suggested: add links" gives info only about submitted links; only the GrowthExperiments log will show the overall number of links and how many were accepted, rejected, or skipped. The follow-up task T269657: Add a link: edit summary and publish has been reviewed and marked as done. Closing this task as Resolved.

mysql:research@dbstore1005.eqiad.wmnet [cswiki]> select count(*) from growthexperiments_link_submissions;
+----------+
| count(*) |
+----------+
|     6763 |
+----------+
1 row in set (0.002 sec)

Not sure if it's broken on beta or just no one is submitting anything there. Seems to be working in production though.