
Add a link engineering: Determine format for accessing and storing link recommendations
Closed, Resolved · Public

Description

This task informs T261410: Add a link engineering: Create MySQL table for caching link recommendations.

As noted there, we might be storing:

  • the raw wikitext as is
  • annotated wikitext that distinguishes the algorithm-added links from the already existing links
  • a diff of the old wikitext and new wikitext

It's possible we could coordinate with @MGerlach to get more structured output from the mwaddlink tool; for example, while it currently just provides a new version of the article in wikitext with additional links added, the tool could provide a structured response with:

  • number of links added
  • a list of new proposed links (the text)
  • annotated wikitext to highlight what the recommended links are

So this task should encompass both tuning the output of the mwaddlink tool and specifying how we'll store the output in our MySQL table.

Event Timeline

Does this really block T261410? We'll likely end up with a TEXT or BLOB field for the recommendation, no matter which option we choose.

As for choosing,

  • The raw wikitext doesn't seem useful. (How would we tell what the recommendations are?) The recommender tool needs to be updated to return something useful.
  • Annotated wikitext seems straightforward. mwaddlink replaces foo bar baz with foo <link-recommendation target="Bar (unit)" score="0.21348575">bar</link-recommendation> baz. (mwparserfromhell can handle custom extension tags so this is a trivial change.) We then either declare it as a proper wikitext tag and use Parsoid or the core parser to turn it into the appropriate HTML, or (if we end up with a UX that's completely detached from editing and the user submits wikitext manipulated by us) we can easily parse it with regular expressions, an XML parser, or even the preprocessor's AST.
  • A diff is just a compression mechanism for one of the above options. It doesn't really affect anything (you need to load the old revision to be able to uncompress it, but that's not much of a constraint). So it's nice for cutting storage costs but overall a minor detail.
  • Something diff-like but machine readable (the most straightforward would be a (word start offset, word end offset, link target) triplet, as Roan recommended somewhere) would be convenient for the non-VE-based workflow: it can be used directly to generate the dialogs, it can be applied to the wikitext to generate the annotated HTML, and it can be piecewise applied to the wikitext to add the links the user has approved (sketched below). (For the VE-based workflow, it seems roughly equivalent to the annotated format, but more hassle to generate.) mwparserfromhell doesn't seem to provide offsets, so generating this format would not be straightforward, but possible (insert unique tokens and search for them, or some ugly hack like that).
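
As an illustration of that last option, here is a rough Python sketch (not actual GrowthExperiments code) of applying approved (start offset, end offset, link target) triplets back to the wikitext:

# Apply approved recommendations, given as (start, end, target) triplets with
# offsets into the original wikitext. Applying from the end backwards keeps
# earlier offsets valid. Illustrative sketch only.
def apply_approved(wikitext, approved):
    for start, end, target in sorted(approved, reverse=True):
        anchor = wikitext[start:end]
        link = "[[%s]]" % anchor if target == anchor else "[[%s|%s]]" % (target, anchor)
        wikitext = wikitext[:start] + link + wikitext[end:]
    return wikitext

# apply_approved("foo bar baz", [(4, 7, "Bar (unit)")]) -> "foo [[Bar (unit)|bar]] baz"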

Does this really block T261410?

"Block" is a strong word, what I meant was more like "inform", in that the various fields mentioned in T261410 are somewhat dependent on what kind of output is returned by the tool.

The raw wikitext doesn't seem useful. (How would we tell what the recommendations are?) The recommender tool needs to be updated to return something useful.

You could store the revision ID used for generating the recommendations (although in theory if you're finding it in the table, you know it's the latest one) and you could diff the full content in the table with that revision.

I don't think this is necessarily a good approach but wanted to list it as a possibility.
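
If we did go that route, recovering the additions would be a diff operation; a purely illustrative Python sketch using difflib (not something we are committing to):

import difflib

# Recover what the tool added by diffing the stored wikitext against the
# revision it was generated from. Illustrative only.
def added_spans(base_wikitext, stored_wikitext):
    matcher = difflib.SequenceMatcher(None, base_wikitext, stored_wikitext, autojunk=False)
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op in ("insert", "replace"):
            # i1 is the offset in the base revision; the slice is the added text.
            yield i1, stored_wikitext[j1:j2]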

Something diff-like but machine readable (the most straightforward would be a (word start offset, word end offset, link target) triplet as Roan recommended somewhere) would be convenient for the non-VE-based workflow: it can be used directly to generate the dialogs, it can be applied to the wikitext to generate the annotated HTML, and it can be piecewise applied to the wikitext to add the links the user has approved. (For the VE based workflow, it seems roughly equivalent to the annotated format, but more hassle to generate.) mwparserfromhell doesn't seem to provide offsets, so generating this format would not be straightforward, but possible (insert unique tokens and search for them, or some ugly hack like that).

@MGerlach do you have thoughts on the feasibility of doing this?

Joe added a subscriber: Joe.

A couple clarifying questions:

  • your idea is to store this in external storage?
  • did you estimate the amount of storage required?

If you plan not to use external storage for this, then I'm a bit wary of the idea of storing wikitext for some information that can be easily normalized.

Adding DBA so that this task gets proper visibility.

(Edited for brainfart: I wrote about parsercache instead of ES).

In addition to @Joe's questions I would like to know a bit more about how this table would be used, especially if the thought is to use external store for it - in particular I am interested in the amount of writes and reads you'd be expecting.
Further, what would be the write patterns for this table if stored on external store given that external store is append only (we only do INSERTs there)?

your idea is to store this in external storage?

I'm not sure external storage is the right solution here. For notifications / StructuredDiscussions this seems to make sense since you want to be able to access that content "forever" but the link recommendations can change as revision content changes, so we'd end up with a bunch of out of date content stored.

did you estimate the amount of storage required?

  • For each wiki our extension is deployed to (currently <20 but we want that to be 100 by end of fiscal year):

Based on the discussion so far, it sounds like instead of storing raw wikitext we would be better off storing a small snippet containing just the essential data -- word start offset, word end offset, link target ID. E.g. for https://addlink-simple.toolforge.org/api/v1/addlink?title=Fernand_L%C3%A9ger (demo API), we'd get something like:

[
    {
        "startOffset": 898
        "endOffset": 906
        "linkTarget": 119786 // Or we could use the DB key
        "probability": 0.9999547
    },
    // More entries
]

Either the mw-addlink tool, which generates the link recommendations, could output this format, or we could take the raw wikitext returned from the API and format it ourselves on the MediaWiki side before storing it.
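
For illustration, here is a rough sketch of the second option, in Python for brevity (the real code would live in the tool or on the MediaWiki side), assuming (hypothetically) that the tool marks the links it added with a pr= parameter inside the link; the function name and regex are placeholders:

import re

# Added links are assumed to look like [[Target|anchor text|pr=0.1234]];
# existing links keep the normal [[...]] syntax and are ignored.
ADDED_LINK = re.compile(r"\[\[([^|\]]+)\|([^|\]]+)\|pr=([0-9.]+)\]\]")

def to_offsets(annotated_wikitext):
    """Convert annotated wikitext into the offset-based structure above.

    Offsets refer to the original (unannotated) wikitext, i.e. the text with
    every added-link marker replaced by its plain anchor text.
    """
    recommendations = []
    plain_parts = []
    plain_len = 0  # length of the reconstructed original text so far
    last_end = 0   # end of the previous match in the annotated text
    for match in ADDED_LINK.finditer(annotated_wikitext):
        target, anchor, probability = match.groups()
        between = annotated_wikitext[last_end:match.start()]
        plain_parts.append(between)
        plain_len += len(between)
        recommendations.append({
            "startOffset": plain_len,
            "endOffset": plain_len + len(anchor),
            "linkTarget": target,  # DB key here; could be mapped to a page ID
            "probability": float(probability),
        })
        plain_parts.append(anchor)
        plain_len += len(anchor)
        last_end = match.end()
    plain_parts.append(annotated_wikitext[last_end:])
    return recommendations, "".join(plain_parts)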

I'm kind of wondering if we should just use cache for this, given that it's not a huge amount of data and it's OK if it could get lost. But using MySQL would probably be nicer for reliability because we are also updating the search index with metadata on which articles have link recommendations, and if the search index says that an article has link recommendations, we would want the storage to reliably return them, which wouldn't be guaranteed with the cache.

I would like to know a bit more about how this table would be used, especially if the thought is to use external store for it - in particular I am interested in the amount of writes and reads you'd be expecting.

There is a sketch of how this table would be used at https://wikitech.wikimedia.org/wiki/Add_Link; in particular, the document linked at the bottom provides some more details. But briefly, the idea would be to initially populate data for each wiki with a maintenance script (so there would be a high volume of writes happening up front), then data would be written to this table periodically in order to maintain a certain minimum number of link recommendation articles per ORES topic; that will depend on editing patterns for the wiki, so I'm not really sure what the churn will be like. But we could initially populate the table with articles that haven't been edited in a few weeks in order to increase the likelihood that the link recommendations remain relevant for longer.

As for reads, it will be pretty low for a while, at least. The link recommendations will be seen by a percentage of new users on the wikis we deploy GrowthExperiments to (currently <20 but will scale to 100 wikis); those users are prompted to visit Special:Homepage, and from there link recommendations will be one of several suggested tasks they can choose from. Depending on how the design is finalized, we might need to query the table for the link recommendation content on Special:Homepage, or we might not need it until the user visits the article that they want to add link recommendations to. We currently have ~6000 visitors to Special:Homepage per week across all wikis, and about ~1,000 visitors who interact in some way with the suggested edits module on Special:Homepage (where we will present the link recommendation tasks to users).

Further, what would be the write patterns for this table if stored on external store given that external store is append only (we only do INSERTs there)?

As commented on earlier, I don't think external store makes sense for this data given the lack of a delete/update ability, but if I'm missing something, please let me know :)

Thanks for the additional info. Some comments inline

your idea is to store this in external storage?

I'm not sure external storage is the right solution here. For notifications / StructuredDiscussions this seems to make sense since you want to be able to access that content "forever" but the link recommendations can change as revision content changes, so we'd end up with a bunch of out of date content stored.

did you estimate the amount of storage required?

  • For each wiki our extension is deployed to (currently <20 but we want that to be 100 by end of fiscal year):

So is that a static number, or will it grow/shrink, or would you just update the existing rows? If so, how often would you do that?
I guess it will be done from a maintenance script?

If so:

  • What if the script doesn't run/fails?
  • How often would that run?
  • How would you decide which rows to update on every run?

Based on the discussion so far, it sounds like instead of storing raw wikitext we would be better off storing a small snippet containing just the essential data -- word start offset, word end offset, link target ID. E.g. for https://addlink-simple.toolforge.org/api/v1/addlink?title=Fernand_L%C3%A9ger (demo API), we'd get something like:

[
    {
        "startOffset": 898
        "endOffset": 906
        "linkTarget": 119786 // Or we could use the DB key
        "probability": 0.9999547
    },
    // More entries
]

Either the mw-addlink tool, which generates the link recommendations, could output this format, or we could take the raw wikitext returned from the API and format it ourselves on the MediaWiki side before storing it.

I'm kind of wondering if we should just use cache for this, given that it's not a huge amount of data and it's OK if it could get lost. But using MySQL would probably be nicer for reliability because we are also updating the search index with metadata on which articles have link recommendations, and if the search index says that an article has link recommendations, we would want the storage to reliably return them, which wouldn't be guaranteed with the cache.

My personal preference would be not to store volatile things in MySQL, as that can create even more problems, e.g.: what if the cache and the data get out of sync? We could be showing different results depending on which source you are hitting.

I would like to know a bit more about how this table would be used, especially if the thought is to use external store for it - in particular I am interested in the amount of writes and reads you'd be expecting.

There is a sketch of how this table would be used at https://wikitech.wikimedia.org/wiki/Add_Link; in particular, the document linked at the bottom provides some more details. But briefly, the idea would be to initially populate data for each wiki with a maintenance script (so there would be a high volume of writes happening up front), then data would be written to this table periodically in order to maintain a certain minimum number of link recommendation articles per ORES topic; that will depend on editing patterns for the wiki, so I'm not really sure what the churn will be like. But we could initially populate the table with articles that haven't been edited in a few weeks in order to increase the likelihood that the link recommendations remain relevant for longer.

Can you provide more details on what "periodically" means? Once a day? Once an hour?
If this finally goes to MySQL, we need to make sure the initial population script works in chunks, waits for replication, and takes all the usual measures to avoid putting excessively high pressure on the database and the replication channel (to avoid creating lag).

As for reads, it will be pretty low for a while, at least. The link recommendations will be seen by a percentage of new users on the wikis we deploy GrowthExperiments to (currently <20 but will scale to 100 wikis); those users are prompted to visit Special:Homepage, and from there link recommendations will be one of several suggested tasks they can choose from. Depending on how the design is finalized, we might need to query the table for the link recommendation content on Special:Homepage, or we might not need it until the user visits the article that they want to add link recommendations to. We currently have ~6000 visitors to Special:Homepage per week across all wikis, and about ~1,000 visitors who interact in some way with the suggested edits module on Special:Homepage (where we will present the link recommendation tasks to users).

Further, what would be the write patterns for this table if stored on external store given that external store is append only (we only do INSERTs there)?

As commented on earlier, I don't think external store makes sense for this data given the lack of a delete/update ability, but if I'm missing something, please let me know :)

Yeah, I don't think external store would be the place to store this.
But again, as mentioned earlier, if this can be considered volatile, I am not sure MySQL is the right place to keep this as a "second cache". But even if it is, what would you have in mind? wikishared on x1?

Thanks for the additional info. Some comments inline
So is that a static number, or will it grow/shrink, or would you just update the existing rows? If so, how often would you do that?
I guess it will be done from a maintenance script?

Yes, it should be a static number for the initial release in Q2. Right now, we are trying to validate the concept from a product perspective. If it proves successful, then there will probably be a desire to scale out the storage to include link recommendation data for every possible article in the wiki, rather than just a few hundred per ORES topic. (As a side note, these two early prototypes might help with visualizing the user-facing product: https://g4rp9j.axshare.com/ and https://hnagph.axshare.com/)

The maintenance script would probably run every hour. It would iterate over each ORES topic (60 of them), perform a search query like articletopic:{topic} hasrecommendations:links, and see if the count is below the threshold we have defined for that wiki. Let's assume the threshold is 300. If there are 290 results from the search index, then the script would select a set of articles based on some heuristics (not yet defined, but I could imagine things like: not edited in the last 4 weeks; not over a certain size; not too short; not in certain excluded categories; etc.) and, for each of those articles, query the link recommendation service. The link recommendation service won't return results for all articles, because it might not have anything to recommend. If recommendations are returned, they are processed (e.g. transformed from the raw wikitext the service currently provides into a JSON-serializable array with word start/end offsets and link target IDs) and then stored "somewhere", either MySQL or the cache. When we have added 10 (we started with 290 and want 300), we finish with that topic and move on to the next one.

Also, after writing, we generate an event that the Search Platform team's pipeline will listen to and use to update the ElasticSearch document for the article, so that queries for "hasrecommendations:links" can locate it.

So, that will all be running regularly, let's say hourly, via a maintenance script. On top of that, on article edit we are thinking of enqueuing a job that will 1) query the link recommendation service to generate fresh link recommendations, 2) process and place the data in storage, and 3) generate an event to update the search index. And finally, on article deletion we would just remove the data from storage.
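
To make the flow concrete, here is a rough Python-style sketch of that hourly refill loop; every helper used here (search_count, select_candidate_articles, fetch_recommendations, store_recommendations, emit_search_index_event, all_ores_topics) is a hypothetical placeholder, not an existing API:

TARGET_PER_TOPIC = 300  # per-wiki threshold discussed above

def refill_topic(topic):
    # How many articles with stored recommendations does the search index see?
    current = search_count("articletopic:%s hasrecommendations:links" % topic)
    needed = TARGET_PER_TOPIC - current
    if needed <= 0:
        return
    for article in select_candidate_articles(topic):  # apply the heuristics above
        recommendations = fetch_recommendations(article)
        if not recommendations:
            continue  # the service had nothing to recommend for this article
        store_recommendations(article, recommendations)  # MySQL or cache
        emit_search_index_event(article)  # lets ElasticSearch pick it up
        needed -= 1
        if needed == 0:
            break

def run_hourly():
    for topic in all_ores_topics():
        refill_topic(topic)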

If so:

  • What if the script doesn't run/fails?

Then we will have some problems :) We will probably have some kind of alerts or a way to know if we don't have enough link recommendations stored for each ORES topic. For example eventually we'll be adding some kind of monitoring in T249987 but we might need alerts so we don't have to manually check each wiki for this.

  • How often would that run?

Currently I am thinking hourly, but maybe it doesn't need to be so frequent.

  • How would you decide which rows to update on every run?

The maintenance script should only be doing inserts. Updates would happen via the job queue on article edit.

I'm kind of wondering if we should just use cache for this, given that it's not a huge amount of data and it's OK if it could get lost. But using MySQL would probably be nicer for reliability because we are also updating the search index with metadata on which articles have link recommendations, and if the search index says that an article has link recommendations, we would want the storage to reliably return them, which wouldn't be guaranteed with the cache.

My personal preference would be not to store things that are volatile on MySQL. As that can create even more problems, ie: what if either the cache and/or the data are out of sync? We could be showing different results depending on what source you are hitting.

That seems OK for the initial rollout with ~18,000 entries (300 tasks times 60 ORES topics), but would that scale OK if later in the year we expand the link recommendation storage to include every possible article on the wiki?

Yeah, I don't think external store would be the place to store this.
But again, as mentioned earlier, if this can be considered volatile, I am not sure MySQL is the right place to keep this as a "second cache". But even if it is, what would you have in mind? wikishared on x1?

I don't know enough about this. Where could I read about wikishared (I don't see any obvious looking documentation pages on wikitech)? The link recommendation data is local to each language wiki; whether there are advantages to using wikishared versus the local wiki's DB is unclear to me. Do you have advice here?

Yeah, I don't think external store would be the place to store this.
But again, as mentioned earlier, if this can be considered volatile, I am not sure MySQL is the right place to keep this as a "second cache". But even if it is, what would you have in mind? wikishared on x1?

I don't know enough about this. Where could I read about wikishared (I don't see any obvious looking documentation pages on wikitech)? The link recommendation data is local to each language wiki; whether there are advantages to using wikishared versus the local wiki's DB is unclear to me. Do you have advice here?

Since the data is per-wiki, I don't think wikishared makes sense, but using x1 could. That's what we do for the Echo tables: per-wiki databases, but on the x1 cluster.

There is also the object stash (which on the Wikimedia cluster resolves to Redis), but I guess we want to use that as little as possible? Conceptually, it seems appropriate - it is secondary data, we can recover from losing it but would prefer if it did not expire, and we only need direct key lookups.

In an RDBMS, we would have a few tens of thousands of rows per wiki, a modest amount of reads (the feature is for recently registered logged-in editors doing a kind of training workflow) and in the very worst case one extra write per content edit (by default a content edit is something like ten writes), so it seems to me like this could fit fine into the normal wiki DB as well, assuming some compact format. Since it's throwaway data, migrating to a different storage implementation later won't be much hassle.

If so:

  • What if the script doesn't run/fails?

Then we will have some problems :) We will probably have some kind of alerts or a way to know if we don't have enough link recommendations stored for each ORES topic. For example eventually we'll be adding some kind of monitoring in T249987 but we might need alerts so we don't have to manually check each wiki for this.

I think we need to be ready for this script to fail, and in fact plan for this scenario. The script can fail at any time, and for lots of different reasons that we cannot control, e.g.:

  • the server that runs it crashes
  • the code gets broken and we don't notice
  • the database being unavailable (maintenance, outages, network glitches, etc.)

My personal preference would be not to store volatile things in MySQL, as that can create even more problems, e.g.: what if the cache and the data get out of sync? We could be showing different results depending on which source you are hitting.

That seems OK for the initial rollout with ~18,000 entries (300 tasks times 60 ORES topics), but would that scale OK if later in the year we expand the link recommendation storage to include every possible article on the wiki?

Not sure I get your question. You mean that you'd feel better having this data stored on MySQL for the future growth but not for the initial rollout?

Yeah, I don't think external store would be the place to store this.
But again, as mentioned earlier, if this can be considered volatile, I am not sure MySQL is the right place to keep this as a "second cache". But even if it is, what would you have in mind? wikishared on x1?

I don't know enough about this. Where could I read about wikishared (I don't see any obvious looking documentation pages on wikitech)? The link recommendation data is local to each language wiki; whether there are advantages to using wikishared versus the local wiki's DB is unclear to me. Do you have advice here?

I meant x1 (wikishared lives there) - see below. My mind played tricks on me :)

Since the data is per-wiki, I don't think wikishared makes sense, but using x1 could. That's what we do for the Echo tables: per-wiki databases, but on the x1 cluster.

Sorry yeah, my brain translated x1 into wikishared automatically, but yeah, I was wondering about x1.
However, I still think that if this data is going to be volatile, it doesn't make much sense to have it in MySQL just in case. If it can be rebuilt easily, I don't see the point in storing it there.

There is also the object stash (which on the Wikimedia cluster resolves to Redis), but I guess we want to use that as little as possible? Conceptually, it seems appropriate - it is secondary data, we can recover from losing it but would prefer if it did not expire, and we only need direct key lookups.

In an RDBMS, we would have a few tens of thousands of rows per wiki, a modest amount of reads (the feature is for recently registered logged-in editors doing a kind of training workflow) and in the very worst case one extra write per content edit (by default a content edit is something like ten writes), so it seems to me like this could fit fine into the normal wiki DB as well, assuming some compact format. Since it's throwaway data, migrating to a different storage implementation later won't be much hassle.

My worry is that if we finally decide to go for MySQL, we'd never migrate away, I reckon :-)
What I don't really see is us storing this throwaway data on MySQL. What is the benefit of doing it if it can be reconstructed "anytime"?
By doing that we're introducing another problem, which is what I mentioned at T261411#6479294: what if the cache and what is in MySQL get out of sync (for whatever reason)? To me it seems we are introducing unnecessary complexity by storing this on MySQL.

Thanks everyone!

What I don't really see is us storing this throwaway data on MySQL. What is the benefit of doing it if it can be reconstructed "anytime"?

There isn't really any other place for storing throwaway data that's time-consuming to generate and is needed with low latency during some pageviews. That's why pretty much all such data is in MySQL, e.g. the parser cache or page props. The only currently existing alternatives are Redis (as I said above, it would work for us, but I doubt it would make ops happier), RESTBase (used as a semi-permanent cache for a number of similar features, e.g. article extracts, but we are moving away from it) or having a caching webservice with its own storage instead of a lambda service (what ORES did for a similar use case, but that was the central piece of software for their team, while this is just one of many smaller features for us, so it's really not a place where we'd want to be). I think MySQL is far superior to all of those. Maybe the new parser cache service will provide a place for data that's derived from the page content asynchronously (it would have to, if it wants to fully replace RESTBase), and at that point we can migrate there, but that's still far away.

By doing that we're introducing another problem, which is what I mentioned at T261411#6479294: what if the cache and what is in MySQL get out of sync (for whatever reason)?

For revision-derived data that's not really a problem: you just store the revision id, and discard the data if the id doesn't match. Revisions don't change, so there is no way the data could be out of sync (well, there is revdelete, but it doesn't affect our use case).
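
A minimal sketch of that read-side check (hypothetical helper names, just to illustrate the idea):

# The stored row carries the revision ID it was generated from; if the page
# has a newer revision, treat the stored data as absent. Illustrative only.
def get_recommendations(article_id):
    row = load_stored_row(article_id)        # hypothetical storage lookup
    if row is None:
        return None
    if row["revision_id"] != latest_revision_id(article_id):
        return None                          # stale; regenerate or skip
    return row["recommendations"]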

What I don't really see is us storing this throwaway data on MySQL. What is the benefit of doing it if it can be reconstructed "anytime"?

There isn't really any other place for storing throwaway data that's time-consuming to generate and is needed with low latency during some pageviews. That's why pretty much all such data is in MySQL, e.g. the parser cache or page props. The only currently existing alternatives are Redis (as I said above, it would work for us, but I doubt it would make ops happier), RESTBase (used as a semi-permanent cache for a number of similar features, e.g. article extracts, but we are moving away from it) or having a caching webservice with its own storage instead of a lambda service (what ORES did for a similar use case, but that was the central piece of software for their team, while this is just one of many smaller features for us, so it's really not a place where we'd want to be). I think MySQL is far superior to all of those. Maybe the new parser cache service will provide a place for data that's derived from the page content asynchronously (it would have to, if it wants to fully replace RESTBase), and at that point we can migrate there, but that's still far away.

@Joe do you have any thoughts on the above?

By doing that we're introducing another problem, which is what I mentioned at T261411#6479294: what if the cache and what is in MySQL get out of sync (for whatever reason)?

For revision-derived data that's not really a problem: you just store the revision id, and discard the data if the id doesn't match. Revisions don't change, so there is no way the data could be out of sync (well, there is revdelete, but it doesn't affect our use case).

Got it - thank you!

@MGerlach and I spoke on Friday and he will adjust the output of his tool to provide structured data: instead of raw wikitext with the new links added, the output will be a JSON string containing character offset start, character offset end, link target ID, and probability score.

So instead of

His boldly simplified treatment of modern subject [[Matter|matter|pr=0.99998546]]

We would have:

[
  {
    "charOffsetStart": 20,
    "charOffsetEnd": 40,
    "linkTargetID": 123456,
    "probability": 0.99998546
  },
  // more link recommendations for the article
]

If we use the cache, we would use the article ID in generating the cache key, and each value would contain the array of data.

If we use MySQL, I think we'd want to have a table with fields for: articleID, charOffsetStart, charOffsetEnd, linkTargetID, probability. Then each link recommendation would form a row in the table.
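
To make the two layouts concrete, here is a small Python sketch; the cache key format and column order are illustrative, not a final design:

import json

def cache_entry(article_id, recommendations):
    """Cache layout: one entry per article, keyed by article ID."""
    key = "growthexperiments:linkrecommendations:%d" % article_id  # illustrative key
    return key, json.dumps(recommendations)

def normalized_rows(article_id, recommendations):
    """Normalized MySQL layout: one row per individual recommendation."""
    return [
        (article_id, r["charOffsetStart"], r["charOffsetEnd"],
         r["linkTargetID"], r["probability"])
        for r in recommendations
    ]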

One advantage of storing data this way is that it opens up more options for how the product is implemented; currently we plan to show users on article A that there is text we recommend linking to articles B, C and D; but in the future we might want to tell users who are on article D that we recommend they visit article A and link some text back to article D. That would be easily done with the normalized approach with MySQL but not with the cache storage approach as we've defined it here.

@Joe @Marostegui could you please let us know your thoughts on this (T261411#6497293) sometime in the next week?

@MGerlach and I spoke on Friday and he will adjust the output of his tool to provide structured data: instead of raw wikitext with the new links added, the output will be a JSON string containing character offset start, character offset end, link target ID, and probability score.

So instead of

His boldly simplified treatment of modern subject [[Matter|matter|pr=0.99998546]]

We would have:

[
  {
    "charOffsetStart": 20,
    "charOffsetEnd": 40,
    "linkTargetID": 123456,
    "probability": 0.99998546
  },
  // more link recommendations for the article
]

If we use the cache, we would use the article ID in generating the cache key, and each value would contain the array of data.

If we use MySQL, I think we'd want to have a table with fields for: articleID, charOffsetStart, charOffsetEnd, linkTargetID, probability. Then each link recommendation would form a row in the table.

Are we still talking about 18-20k rows in total for this table here, as you mentioned at T261411#6479145?

@Joe @Marostegui could you please let us know your thoughts on this (T261411#6497293) sometime in the next week?

I would like to hear @Joe's opinion on the comment about RESTBase made at T261411#6492401, as I don't have a full picture of the infra there.

Are we still talking about 18-20k rows in total for this table here, as you mentioned at T261411#6479145?

If we use cache (or treat MySQL like a cache, by storing all link recommendations for a given article as a JSON string in a single row in a table), we will have ~20k entries per wiki (and in Q2 we plan to go live with 4 wikis).

If we create a row for each individual link recommendation entry, then we are looking at N times 20,000. I don't know what the value for N would be (@MGerlach might have some idea of the average number of link recommendations across a batch of randomly selected articles, but maybe not), but assuming it was something like 20 link recommendations per article, then we'd end up with 400,000 rows in the table (for four wikis).

We don't yet have a product need for the normalized approach, so maybe we should just start with a single entry per article, where the data is stored as a JSON string. MySQL might have some advantages here because we could store some metadata about the link recommendations (like the revision ID they were generated from, or the number of link recommendations present in the data), although if we need to use WANObjectCache instead, I suppose we could use the revision ID to make up part of the cache key.

Are we still talking about 18-20k rows in total for this table here, as you mentioned at T261411#6479145?

If we use cache (or treat MySQL like a cache, by storing all link recommendations for a given article as a JSON string in a single row in a table), we will have ~20k entries per wiki (and in Q2 we plan to go live with 4 wikis).

And the plan is to roll that out everywhere after Q2?

If we create a row for each individual link recommendation entry, then we are looking at N times 20,000. I don't know what the value for N would be (@MGerlach might have some idea of the average number of link recommendations across a batch of randomly selected articles, but maybe not), but assuming it was something like 20 link recommendations per article, then we'd end up with 400,000 rows in the table (for four wikis).

That's quite a lot of rows for just 4 wikis; if we keep rolling that out, that will become a potential scalability issue.
Having large tables is very problematic in MySQL:

  • Hard to operate (altering them, backing them up...)
  • Hard to keep them in MySQL's memory cache (so they stay warm)
  • And this is probably the worst: the optimizer can do unexpected things at any time, like choosing the wrong index and producing a very costly query plan (we've seen this many times with large tables in our infra; we even have a tag for it: https://phabricator.wikimedia.org/project/view/4313/).

And the plan is to roll that out everywhere after Q2?

Well, we currently have 16 Wikipedias using GrowthExperiments, but we have a goal to scale to 100 Wikipedias (T247507). So: end of Q2, we want to go live with four wikis; sometime in Q3 this would go out to the Wikipedias where we have GrowthExperiments enabled (today it is 16, by Q3 maybe it will be more like 30?), and in theory by Q4 we are trying to reach 100 Wikipedias.

If we create a row for each individual link recommendation entry, then we are looking at N times 20,000. I don't know what the value for N would be (@MGerlach might have some idea of the average number of link recommendations across a batch of randomly selected articles, but maybe not), but assuming it was something like 20 link recommendations per article, then we'd end up with 400,000 rows in the table (for four wikis).

That's quite a lot of rows for just 4 wikis; if we keep rolling that out, that will become a potential scalability issue.

Good to know. Do we have the same problem if we use the denormalized approach, where there is an upper limit of 20,000 rows per wiki x 100 wikipedias = 2,000,000 rows?

From a product perspective there is also a desire to have link recommendations generated and cached for more articles (currently we are generating only 500 per ORES topic; this is point 4 from https://wikitech.wikimedia.org/wiki/Add_Link#High-level_summary). I don't think we have to solve this problem now but if this is a successful feature then we will need to figure out what a storage solution looks like for many more articles.

And the plan is to roll that out everywhere after Q2?

Well, we currently have 16 Wikipedias using GrowthExperiments, but we have a goal to scale to 100 Wikipedias (T247507). So: end of Q2, we want to go live with four wikis; sometime in Q3 this would go out to the Wikipedias where we have GrowthExperiments enabled (today it is 16, by Q3 maybe it will be more like 30?), and in theory by Q4 we are trying to reach 100 Wikipedias.

That means, from the above numbers, 40M rows, which is quite a lot, and we could start running into the issues mentioned in my previous comment.

If we create a row for each individual link recommendation entry, then we are looking at N times 20,000. I don't know what the value for N would be (@MGerlach might have some idea of the average number of link recommendations across a batch of randomly selected articles, but maybe not), but assuming it was something like 20 link recommendations per article, then we'd end up with 400,000 rows in the table (for four wikis).

That's quite a lot of rows for just 4 wikis; if we keep rolling that out, that will become a potential scalability issue.

Good to know. Do we have the same problem if we use the denormalized approach, where there is an upper limit of 20,000 rows per wiki x 100 wikipedias = 2,000,000 rows?

This would be a lot better, and I would feel a lot more comfortable with 2M rows for 100 wikipedias. 2M rows is a much healthier number for a MySQL table.

I know this is quite far away, but is 100 the max number of wikipedias this would be enabled on?

Good to know. Do we have the same problem if we use the denormalized approach, where there is an upper limit of 20,000 rows per wiki x 100 wikipedias = 2,000,000 rows?

This would be a lot better, and I would feel a lot more comfortable with 2M rows for 100 wikipedias. 2M rows is a much healthier number for a MySQL table.

I know this is quite far away, but is 100 the max number of wikipedias this would be enabled on?

Theoretically it might be enabled on all of them, but I have no idea if/when that would happen. For completeness, I will mention that although the product we are building is currently only on Wikipedias, in theory the link recommendation component could be applicable to any wiki (Wiktionaries, for example). However, we don't have any plans at the moment for that, and I think this is pretty far out into the future, if it ever happens.

Thanks for the comment. Then I reckon we do need to treat the denormalized approach as the only approach we could take considering the figures provided - the other one means too many rows to handle with confidence.
With the numbers you've provided and the worst-case scenario (enabling it on 900 wikis), the denormalized approach would mean:

2M rows for 100 wikipedias
18M rows for 900 wikis.

Obviously those numbers are very approximate, but they give us a rough sense of what we'd be looking at in case we go all in with this feature at some point - 18M rows for all the wikis is reasonable, I would say.

We are going to have separate tables for each wiki, right? And, at least with the current approach, the size of the tables doesn't grow with the size of the wiki. Half a million rows is not too large for a table. Also, we won't have the kind of queries the optimizer tends to stumble on - primary key prefix lookups only.

That said, I'd go with the JSON approach (or something similar, but preferably less verbose); it's less effort to set up and modify.

We are going to have separate tables for each wiki, right?

That wasn't entirely clear to me; if that is the case, that'd put us in a much better position.

And, at least with the current approach, the size of the tables doesn't grow with the size of the wiki. Half a million rows is not too large for a table. Also, we won't have the kind of queries the optimizer tends to stumble on - primary key prefix lookups only.

We've even seen the optimizer being silly when the PK was the obvious choice, but with 0.5M rows that'd be weird indeed.

We are going to have separate tables for each wiki, right?

That wasn't entirely clear to me; if that is the case, that'd put us in a much better position.

Ah, sorry I was unclear about that.

And, at least with the current approach, the size of the tables doesn't grow with the size of the wiki. Half a million rows is not too large for a table.

To clarify, the table wouldn't be dependent on the size of the wiki, but on the number of articles we decide we want to have link recommendations generated for. At the moment we are targeting 19,500 articles per wiki (39 ORES topics * 500 articles per topic). Further, the product rule is that we would display to the end user up to 10 link recommendations per article (I should have realized this earlier when I was commenting in T261411#6532492, sorry).

With the normalized approach, we would end up with a maximum of 195,000 rows per wiki (10 rows per article * 500 articles per topic * 39 ORES topics), in per-wiki tables on x1. Assuming we arrive at deployment to 100 wikis, we would have 19.5 million rows (100 distinct tables * 195,000 rows). (However maybe by that point the new ML platform will have its own storage and we would use that instead of a MediaWiki DB table.) These numbers could change if a) we want to store more than 10 recommendations per article (seems unlikely) or b) we want to generate link recommendations for more articles per topic (more likely, but not currently planned and we would coordinate this change)

So to summarize:

Normalized

  1. Each wiki gets its own MySQL table on x1
  2. Table will have columns for:
    • revision_id (numeric)
    • article_id (numeric)
    • timestamp (? maybe not needed)
    • phrase_to_link (text)
    • context_before (text -- 5 characters of text that occur before the phrase to link)
    • context_after (text -- 5 characters of text that occur after the phrase to link)
    • link_target (text)
    • instance_occurrence (integer) -- number showing how many times the phrase to link appears in the editable content region before we arrive at the one to link
    • probability (float)
    • insertion_order (integer) -- order in which to insert the link on the page (e.g. recommendation "foo" [0] comes before recommendation "bar baz" [1], which comes before recommendation "bar" [2], etc)
  3. Total number of rows per wiki: 195,000
  4. Total number of rows across x1 in Q3: 3,900,000 (~20 wikis x 195,000 rows)
  5. Total number of rows across x1 in Q4: 19.5 million (100 wikis x 195,000 rows)

Denormalized

  1. Each wiki gets its own MySQL table on x1
  2. The table will have columns for:
    • revision_id (numeric)
    • article_id (numeric)
    • recommendations (serialized recommendations -- JSON, stored as BLOB or TEXT?); it would contain the data listed in the Normalized section above (see the sketch after this list)
    • maybe one or two other fields for metadata like the number of recommendations, timestamp of generation, or info about the probability scores in the serialized recommendation data; not sure about any of those now, and in theory we could place some metadata into the search index as well (we are already using the search index to find articles which have link recommendations)
  3. Total number of rows per wiki: 19,500
  4. Total number of rows across x1 in Q3: 390,000 (~20 wikis x 19,500 rows)
  5. Total number of rows across x1 in Q4: 1.95 million (100 wikis x 19,500 rows)
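
For illustration, here is what a single denormalized row's serialized recommendations blob might contain, reusing the field names from the Normalized list above; all values are invented for the example:

# Illustrative only: one denormalized row, with the blob holding the per-link
# fields from the Normalized layout. Values are made up.
example_row = {
    "revision_id": 987654321,
    "article_id": 12345,
    "recommendations": [
        {
            "phrase_to_link": "matter",
            "context_before": "dern ",   # 5 characters before the phrase
            "context_after": " in h",    # 5 characters after the phrase
            "link_target": "Matter",
            "instance_occurrence": 0,    # which occurrence of the phrase to link
            "probability": 0.99998546,
            "insertion_order": 0,
        },
        # ... up to ~10 recommendations per article
    ],
}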

Given the lack of a strong product need for the normalized approach, it looks like the denormalized setup is a safer bet in terms of resources and scalability. @Marostegui / @Tgr / @Catrope does that sound OK to you?

LSobanski triaged this task as Medium priority. Oct 19 2020, 1:56 PM

Thanks for the detailed explanation @kostajh! That helps a lot, much appreciated.
Having one table per wiki addresses most of my concerns about the possible scalability issues, as 195k is a very affordable number of rows for a table on x1, so that is great.
At first glance, it looks like the normalized approach might have slightly smaller rows than the denormalized approach, but again, given the tiny number of rows the denormalized approach would have, I would still prefer to go for that option, as it is way easier to work with smaller tables (even if the rows aren't super small).

Tgr edited projects, added Growth-Team (Current Sprint); removed Growth-Team.

I think this is done, per the last two comments (please correct me if I'm overinterpreting). We'll go with MySQL, x1, one table per wiki, one row per page ("denormalized" format). Thanks all!
We'll file a separate task for the actual schema change.

I think this is done, per the last two comments (please correct me if I'm overinterpreting). We'll go with MySQL, x1, one table per wiki, one row per page ("denormalized" format). Thanks all!
We'll file a separate task for the actual schema change.

Thanks @Tgr - keep in mind that the table creation isn't something handled by DBAs, as it can be done during the normal deployment windows (https://wikitech.wikimedia.org/wiki/Schema_changes#What_is_not_a_schema_change)

Thanks @Tgr - keep in mind that the table creation isn't something handled by DBAs, as it can be done during the normal deployment windows (https://wikitech.wikimedia.org/wiki/Schema_changes#What_is_not_a_schema_change)

Creating tables for all wikis is usually handled by DBAs, I think? Granted, we are only live on twenty wikis or so - not sure what's simpler: creating an empty table everywhere one time, or creating tables whenever we deploy to a new wiki (which currently just takes a config patch)?

That reminds me: there is a chance this table would have to be in a separate extension (and thus with a different prefix), as the GrowthExperiments extension is really just a client for the functionality it provides, and other clients (the mobile apps) might want to use it too. I assumed it wouldn't be a big deal to rename the tables in that case, or just drop and recreate them (the data in them is just cache and it's OK to lose it), so we can ignore that eventuality for now. Do you see any problem with that?

If that needs to happen, we'd need to add some sleep between iterations to avoid replication lag: even if each table is small, there will be one per wiki, so there will be lots of them, and that can cause some replication lag on x1.

I think this is done, per the last two comments (please correct me if I'm overinterpreting). We'll go with MySQL, x1, one table per wiki, one row per page ("denormalized" format). Thanks all!
We'll file a separate task for the actual schema change.

One more thing I wanted to add. It is possible that we'll decide we want to keep the rows in the table indefinitely (or for some longer period of time, like one year) rather than deleting a row as part of cache invalidation for one specific use case, which is when a link recommendation task for an article is done. In that case, we might update the row with data about which link recommendations were accepted and which were rejected/skipped by the user.

In that case, my statement about the rows in the table being limited to 19,500 (number of ORES topics = 39) * (number of articles we want per topic = 500) wouldn't be true; the table would grow as users complete the "add link" suggested task. We are currently doing about ~1,200 suggested edits tasks per week across all wikis, and this new task type would fit in alongside four other existing task types. So I don't think this would grow the table in a way that is alarming, but I wanted to give you a heads up about this possibility @Marostegui.

@kostajh thanks for the heads up.
This makes me wonder if the statement about the data not being "important" (meaning it can be deleted/regenerated) is still true given that you are exploring the possibility of keeping it "forever" rather than using it as a cache?

@kostajh thanks for the heads up.
This makes me wonder if the statement about the data not being "important" (meaning it can be deleted/regenerated) is still true given that you are exploring the possibility of keeping it "forever" rather than using it as a cache?

Yeah, disregard my previous comment please :)

As @Tgr noted in T266446#6581724, we shouldn't mix cache data with primary data in this table. We do intend for the table we're talking about here (specifically in this patch) to be a cache, i.e. the rows can be dropped if needed.

Sounds like this is done (though I don't totally understand the work here). Thank you!