
Implement 50kb limit on file text indexing to reduce increasing commonswiki_file on-disk size
Closed, Resolved · Public · 5 Estimated Story Points

Description

Per the new data collection we put together in December, commonswiki_file has grown by 40% in the last 30 days. We need to understand the source of this growth so we can plan appropriately.


Summary of findings

The size of the search index has been growing rapidly -- 55% between 26 Nov 2020 and 12 Jan 2021 -- beyond our capacity to store this data on current hardware. Left unaddressed, the index will continue to grow and break or hinder search functionality: search services will start going intermittently offline, p95 latency will increase to >2s, and we will reject hundreds of requests per second. We are already at capacity (all of the above happened months ago), and the Search team is spending a lot of time finding ways to free up storage space to absorb the rapid index growth, which means we can't focus on new features or other bug fixes.

A large number of PDF documents -- especially those made available during Public Domain Day -- are being uploaded to Commons with OCR text. Search indexing currently looks at all of this associated text and as a result is growing rapidly. Besides introducing a massive amount of file text, OCR is imperfect and often produces misidentified text, so nontrivial amounts of junk text end up indexed -- we suspect both factors degrade search quality and create performance issues given current storage constraints.


Plan of action:
Place a default 50kb maximum limit on the amount of file text (including OCR text, but excluding metadata and wikitext) that is indexed for search.
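
For illustration, here is a minimal sketch of the limit described above, assuming the cap is applied to the extracted text before it is sent for indexing. The helper name and default constant are hypothetical; the real change is a small patch to the CirrusSearch extension (PHP), not this Python code.

```python
# A minimal sketch of the truncation rule, assuming a 50 kB default cap and a
# hypothetical helper name; the production change lands in CirrusSearch (PHP).

MAX_FILE_TEXT_BYTES = 50 * 1024  # assumed default cap; 0 would mean "index no file text"


def truncate_file_text(file_text: str, max_bytes: int = MAX_FILE_TEXT_BYTES) -> str:
    """Cap extracted file text (e.g. OCR output) at max_bytes of UTF-8."""
    if max_bytes <= 0:
        return ""
    encoded = file_text.encode("utf-8")
    if len(encoded) <= max_bytes:
        return file_text
    # Cut on the byte budget, then drop any partially cut multibyte character.
    return encoded[:max_bytes].decode("utf-8", errors="ignore")
```

Setting the limit to 0 corresponds to not indexing extracted file text at all, which is one of the alternatives discussed in the comments below.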

Event Timeline

Restricted Application added a subscriber: Aklapper. · Jan 7 2021, 11:17 PM

I've imported the dumps for 2020-12-07 and 2021-01-04 into Hadoop so we can compare what has changed over this time span. @dcausse noticed IA books, an "extremely large book scan upload project", in the RC feeds that is uploading scanned books to commonswiki. To see if this was a likely source of our recent growth in data size, I have pulled some stats that show strong growth in the amount of text we store that is extracted from files, and in the number of files with more than 1MB of content. I haven't had a chance to look any deeper yet, but clearly file text content is significant and growing very quickly.

                             2020-12-07    2021-01-04    Δ
total docs                   66.67M        67.56M        1%
sum(length(source_text))     62.2G         63.0G         1%
sum(length(text))            10.0G         10.1G         1%
sum(length(file_text))       388.2G        480.3G        24%
docs with file_text > 4kb    1.30M         1.38M         6%
docs with file_text > 1mb    98.1k         123k          25%
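
As a rough illustration of how numbers like these can be pulled from the imported dumps, here is a hedged PySpark sketch. The table name cirrus_dump and the snapshot partition column are assumptions about the import; the field names match the rows above.

```python
# Hedged sketch: aggregate per-field text sizes and large-document counts from
# an imported cirrussearch dump. Table/partition names are assumptions.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
docs = spark.table("cirrus_dump").where(F.col("snapshot") == "2021-01-04")

stats = docs.agg(
    F.count("*").alias("total_docs"),
    F.sum(F.length("source_text")).alias("source_text_bytes"),
    F.sum(F.length("text")).alias("text_bytes"),
    F.sum(F.length("file_text")).alias("file_text_bytes"),
    F.sum((F.length("file_text") > 4 * 1024).cast("long")).alias("docs_over_4kb"),
    F.sum((F.length("file_text") > 1024 * 1024).cast("long")).alias("docs_over_1mb"),
)
stats.show(truncate=False)
```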

Ran some more stats through the data, considering only the category Scans from the Internet Archive. Overall file_text grew by 92GB in this time span; 84GB of that can be attributed to this category. The category has sub-categories, so it seems plausible that the files making up most of the remainder are found there. I think we can fairly conclusively say a significant portion of our growth in commonswiki can be attributed to the hundreds of GB of file_text. We will have to decide whether we buy hardware to support this use case or restrict the amount of content we index from files.

                             2020-12-07    2021-01-04
total docs                   0.92M         1.12M
sum(length(source_text))     0.92G         1.18G
sum(length(text))            16M           40M
sum(length(file_text))       327G          411G
docs with file_text > 4kb    0.84M         1.0M
docs with file_text > 1mb    86k           108k
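
The category-restricted numbers would come from the same aggregation with a filter on the dump's category field. A short sketch under the same assumptions as above (the presence and name of the `category` array field in the imported schema is itself an assumption):

```python
# Same assumptions as the previous sketch; filter to the Internet Archive scans
# category mentioned in the comment before aggregating file_text sizes.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
docs = spark.table("cirrus_dump").where(F.col("snapshot") == "2021-01-04")
ia = docs.where(F.array_contains("category", "Scans from the Internet Archive"))
ia.agg(F.sum(F.length("file_text")).alias("ia_file_text_bytes")).show(truncate=False)
```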

@MPhamWMF and I just chatted and he's going to put together a doc with a proposed communication to the community, and a list of questions we need answered (by them and/or by ourselves) before we can make a proposal to leadership on next steps. He's going to try to have that ready for Thursday.

EBernhardson added a comment. Edited · Jan 12 2021, 9:05 PM

Some information from our discussion on IRC today. Note that the various sizes below are not all measuring the same thing: some are the size of the gzip'd content, some are the number of uncompressed bytes in a single field, and others are the on-disk size of the search indexes holding that data.

  • Old books (old words) + OCR (less than 100% accuracy) = index sizes much larger than an equal number of bytes of wiki articles
  • In the last 30 days commonswiki_file on-disk size has increased from 8.5TB to 11TB
  • Over the last 47 days, growth in on-disk size is 54.95% https://www.irccloud.com/pastebin/Wiv3d5Au/
  • The on-disk size for all content we index combined is 25TB
  • We have to shard the index based on the amount of data, and we need 1 CPU per shard to answer a query. The next step for commonswiki is 60 shards, or 2 full servers required for each concurrent commonswiki query. For comparison, enwiki, our busiest index, uses only 7 shards.
  • Based on current query rates we need to keep between 10 and 15% of the commonswiki index in memory to avoid latency spikes across all wikis. Today that means keeping between 1 and 2TB of commonswiki_file resident in the file cache. We only have 96GB per server, meaning we are likely using about 15 servers' (the full cluster is 36) worth of memory to keep this index going; rough arithmetic is sketched below.
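
The memory estimate in the last bullet follows from simple arithmetic on the figures quoted in this comment (11TB on disk, 10-15% kept resident, 96GB of RAM per server); a quick sketch:

```python
# Back-of-the-envelope version of the memory estimate above, using only the
# figures from this comment.
index_tb = 11.0
resident_low, resident_high = 0.10, 0.15
ram_per_server_gb = 96

resident_gb_low = index_tb * 1024 * resident_low     # ~1.1 TB
resident_gb_high = index_tb * 1024 * resident_high   # ~1.7 TB

servers_low = resident_gb_low / ram_per_server_gb    # ~12 servers
servers_high = resident_gb_high / ram_per_server_gb  # ~18 servers
print(f"{resident_gb_low:.0f}-{resident_gb_high:.0f} GB resident, "
      f"~{servers_low:.0f}-{servers_high:.0f} of the 36 servers")
```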

Index size over time. We don't have direct measurements; unfortunately ops infrastructure only has server disk usage going back ~6 months. The CirrusSearch dumps used to be uploaded to the Internet Archive though, so we can at least see those file sizes. This is the size of the gzip'd dump; it is not directly comparable to the on-disk index size, but the trend line should be the same. The dumps stopped being uploaded after Feb 2018, and we only retain the last 8 weeks ourselves, so we don't have any further information.

Feb 27, 2017    30.3GB
Jul 31, 2017    38.7GB
Jan 29, 2018    41.2GB
Feb 26, 2018    41.5GB
Nov 9, 2020     203.9GB
Jan 4, 2021     264.2GB

Number of bytes of extracted file_text content per MIME type; unlisted types have less than 100MB. This is the uncompressed number of bytes of file_text.

application/pdf    459.5GB
image/vnd.djvu     20.5GB
image/jpeg         0.3GB
EBernhardson added a comment. Edited · Jan 13 2021, 7:33 PM

Possible mitigations (some copied from above):

  • Buy more hardware. Even if we restrict file_text content, we started considering back in September whether we should have a separate elasticsearch cluster for wikidata and commonswiki. These are structured differently than the wikis focused on written language and have very different expected growth patterns. This also accelerates our ability to put the largest multi-TB indices on servers with 10G networking that can better handle those sizes. T264053#6507156
  • Restrict the amount of file_text we index. When indexing a file we ask MediaWiki for the text content, if it knows how to extract it. When it gives us 1MB of plaintext for a PDF we could cut it back to the first N bytes (50kb? unsure). This is partly a product question, but also probably our go-to mitigation if data sizes grow enough to start re-triggering the latency spikes before we've decided on a longer term solution. We have a process that continually reindexes old data, so deploying this would start reducing data sizes immediately.
  • Reduce the amount of information we store about the file_text field. We have indexed copies of the source file_text in the file_text, file_text.plain, all and all.plain fields. We could stop copying file_text to the all field and instead change the spots that query the all field to hit both (see the query sketch after this list). This would trade more work at query time for fewer copies of a very large data set. Overall that might be a win, but I'm not entirely sure how to estimate it. We currently store both positions and norms. Positions allow search queries to take proximity of terms into account; this is generally important for search quality but can be turned off for relatively significant size savings (no clue how much, exactly -- we could import dumps to relforge and find out). We can also turn norms off, but if I'm reading right the cost of norms is 1 byte per field per document. We need to save on the order of hundreds of GBs to a few TBs, and 1 byte per field per document isn't going to get there.
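
To make the last bullet more concrete, here is a hedged sketch of the query-time side as a plain Elasticsearch query body expressed as a Python dict: if file_text were no longer copied into the catch-all fields, a full-text query would have to target them explicitly. The actual queries CirrusSearch builds are far more elaborate than this, and the example query string is made up.

```python
# Hedged sketch of querying the catch-all fields plus the (no longer copied)
# file_text fields directly, trading query-time fan-out for index size.
query_body = {
    "query": {
        "multi_match": {
            "query": "public domain day",
            "fields": ["all", "all.plain", "file_text", "file_text.plain"],
        }
    }
}
```

The trade-off is the one described above: one fewer indexed copy of a very large field, paid for by fanning each full-text query out over more fields.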

Possible mitigations (some copied from above):

  • Buy more hardware. Even if we restrict file_text content, we started considering back in September whether we should have a separate elasticsearch cluster for wikidata and commonswiki. These are structured differently than the wikis focused on written language and have very different expected growth patterns. This also accelerates our ability to put the largest multi-TB indices on servers with 10G networking that can better handle those sizes. T264053#6507156

It sounds like we may need to buy more hardware to support the separate cluster regardless of the actions we take for this ticket, and in fact we already have a ticket for that (T265621).

  • Restrict the amount of file_text we index. When indexing a file we ask MediaWiki for the text content, if it knows how to extract it. When it gives us 1MB of plaintext for a PDF we could cut it back to the first N bytes (50kb? unsure). This is partly a product question, but also probably our go-to mitigation if data sizes grow enough to start re-triggering the latency spikes before we've decided on a longer term solution. We have a process that continually reindexes old data, so deploying this would start reducing data sizes immediately.

As a short term solution this could work to mitigate the emergency, but I don't like it as a long term solution because it would add what is essentially random content to the index unnecessarily. I'd much prefer to stop indexing text from files on Commons altogether, and let the index instead solely contain the metadata and wikitext on the file pages. I don't see a real use case for it, but this is obviously pending a discussion with the community that @MPhamWMF is putting together.

  • Reduce the amount of information we store about the file_text field. We have indexed copies of the source file_text in the file_text, file_text.plain, all and all.plain fields. We could stop copying file_text to the all field and instead change the spots that query the all field to hit both. This would trade more work at query time for fewer copies of a very large data set. Overall that might be a win, but I'm not entirely sure how to estimate it. We currently store both positions and norms. Positions allow search queries to take proximity of terms into account; this is generally important for search quality but can be turned off for relatively significant size savings (no clue how much, exactly -- we could import dumps to relforge and find out). We can also turn norms off, but if I'm reading right the cost of norms is 1 byte per field per document. We need to save on the order of hundreds of GBs to a few TBs, and 1 byte per field per document isn't going to get there.

I'm not 100% sure I understand this option, but it doesn't seem like we need the text content indexed at all as I mentioned above, so this doesn't seem like it would be worth it as a middle ground.

EBernhardson added a comment. Edited · Jan 13 2021, 9:34 PM

Possible mitigations (some copied from above):

  • Buy more hardware. Even if we restrict file_text content, we started considering back in September whether we should have a separate elasticsearch cluster for wikidata and commonswiki. These are structured differently than the wikis focused on written language and have very different expected growth patterns. This also accelerates our ability to put the largest multi-TB indices on servers with 10G networking that can better handle those sizes. T264053#6507156

It sounds like we may need to buy more hardware to support the separate cluster regardless of the actions we take for this ticket, and in fact we already have a ticket for that (T265621).

The need for an additional cluster comes from the same issue being dealt with here, from T264053. In that ticket we recognized that data growth, primarily in commonswiki but also wikidata, is the underlying cause of latency spikes and intermittent service availability. Mitigations provided some runway, and we identified the need to understand where this growth is coming from. Our planned upgrade path could not keep up with the growth we were seeing, suggesting the need for a new path. I guess this is a long-winded way of saying that if we resolve the now-identified file_text problem, removing multiple terabytes of index and bringing commonswiki's growth back down to previously expected levels, we may no longer need to go outside the already planned upgrade paths. We are already in the process of doubling the amount of memory the existing cluster has available, but this process takes 4 years and we are only in year 1. While it's possible we will need both mitigations, it's far from certain. We might still split off a separate cluster to simplify operations, but whether that needs more hardware or just re-defining roles of existing hardware depends on our capacity.

  • Restrict the amount of file_text we index. When indexing a file we ask MediaWiki for the text content, if it knows how to extract it. When it gives us 1MB of plaintext for a PDF we could cut it back to the first N bytes (50kb? unsure). This is partly a product question, but also probably our go-to mitigation if data sizes grow enough to start re-triggering the latency spikes before we've decided on a longer term solution. We have a process that continually reindexes old data, so deploying this would start reducing data sizes immediately.

As a short term solution this could work to mitigate the emergency, but I don't like it as a long term solution because it would add what is essentially random content to the index unnecessarily. I'd much prefer to stop indexing text from files on Commons altogether, and let the index instead solely contain the metadata and wikitext on the file pages. I don't see a real use case for it, but this is obviously pending a discussion with the community that @MPhamWMF is putting together.

I wasn't aware that removing the field entirely was one of the options you were considering; good to know. Operationally, a 50kb cap and removing the field entirely are similar. The 50kb cap removes roughly 77% of the text content of commonswiki. Removing the field entirely brings that up to 87%.

  • Reduce the amount of information we store about the file_text field. We have indexed copies of the source file_text in the file_text, file_text.plain, all and all.plain fields. We could stop copying file_text to the all field and instead change the spots that query the all field to hit both. This would trade more work at query time for fewer copies of a very large data set. Overall that might be a win, but I'm not entirely sure how to estimate it. We currently store both positions and norms. Positions allow search queries to take proximity of terms into account; this is generally important for search quality but can be turned off for relatively significant size savings (no clue how much, exactly -- we could import dumps to relforge and find out). We can also turn norms off, but if I'm reading right the cost of norms is 1 byte per field per document. We need to save on the order of hundreds of GBs to a few TBs, and 1 byte per field per document isn't going to get there.

I'm not 100% sure I understand this option, but it doesn't seem like we need the text content indexed at all as I mentioned above, so this doesn't seem like it would be worth it as a middle ground.

Indeed, this is the option for how we might keep hundreds of gigabytes of minimally referenced content searchable while reducing the cost.

@EBernhardson, how feasible and/or useful would it be to selectively index a restricted amount of file_text by something other than first N bytes? e.g. don't index out of vocab words; grab N random words instead of first N.

My gut feeling is this would be a lot of work/difficulty for probably not any improvement in search functionality.

EBernhardson added a comment. Edited · Jan 14 2021, 7:41 PM

@EBernhardson, how feasible and/or useful would it be to selectively index a restricted amount of file_text by something other than first N bytes? e.g. don't index out of vocab words; grab N random words instead of first N.

My gut feeling is this would be a lot of work/difficulty for probably not any improvement in search functionality.

Technically speaking, there are 100% ways to go about this. There was a public search engine (Cliqz, ended operations in mid-2020) based around the idea that they could pre-process web pages and extract the set of queries each page should be a result for. They then only index the queries and not the bulk content. That's a very complex NLP task and not something I think we'll be doing though, especially if it is primarily for these PDFs, which are not the bulk of content end users are looking for. Applying NLP of that sort to OCR'd public domain content from 100+ years ago is probably an even harder problem than what Cliqz was dealing with.

It's certainly possible to do something much more naive. We could randomly throw out words. @dcausse might have better ideas, but I suspect trying to decide what to index based on index statistics is going to be hard. My intuition is that if we aren't indexing words because their statistics are too low, we won't know that we threw out the same word in 10k different places.

Ok, it sounds like there are a lot more questions than answers in the direction of selective indexing. Given the urgency of the issue, let's treat the selective indexing option as first N bytes/words for now, because that is most straightforward. If we decide to implement this, we can decide later how to improve selective indexing for better performance/accuracy, and treat that as an iterative improvement.

I don't know of anything we could use out of the box to clean up the results of bad OCR. There were discussions and some tools in very old versions of Lucene, but nothing very compelling, and that work was removed anyway. I think truncating the text is the easiest solution to put in place (haven't thought yet about how to migrate though) and is easily explainable. Removing the text completely does not seem acceptable to me, as this feature is very valuable judging by the number of questions we receive on mw.org from owners of third-party MediaWiki installations about how to index PDF content.

Ltrlg added a subscriber: Ltrlg. · Jan 15 2021, 1:45 PM
MPhamWMF added a comment. Edited · Jan 15 2021, 2:30 PM

@dcausse, thanks for adding that -- could you provide links to any of those conversations?

Some other questions:

  1. Also, from a platform perspective, how possible would it be to disable this (OCR) text indexing for Commons, but allow third-party MediaWiki installers to turn it on if they want?
  2. How much time/effort are we estimating for putting in place a 50kb file text limit, removing all other text, and reindexing?
  3. How much time/effort if we remove the full text data from the index and then later decide to buy more hardware and re-add all of it back into the index?

Sure, here are a few discussions taken from the CirrusSearch talk page related to searching PDF content (or file content in general):

I think it would be sane to add a system parameter that limits the amount of text that is accepted from the file_text field. This parameter would allow anyone installing CirrusSearch to decide how much text they want absorbed from media files into the search engine.

As for the right amount of text to keep (on Commons), in T271493#6745981 @EBernhardson mentioned:

The 50kb cap removes roughly 77% of the text content of commonswiki. Removing the field entirely brings that up to 87%.

  1. Also, from a platform perspective, how possible would it be to disable this (OCR) text indexing for Commons, but allow third-party MediaWiki installers to turn it on if they want?
  2. How much time/effort are we estimating for putting in place a 50kb file text limit, removing all other text, and reindexing?

I think 1 and 2 are roughly the same functionality, in that with a parameter controlling the size to keep, you could set it to 0 to remove all the text content. The work involved is, I think, very minimal (a couple of lines of code to add in the CirrusSearch extension).

  3. How much time/effort if we remove the full text data from the index and then later decide to buy more hardware and re-add all of it back into the index?

It is still unclear to me, but I don't think it will be a huge effort on our side; we have a process that continuously checks the consistency of the index that we could perhaps reuse/adapt to take care of this. It would certainly take a lot of time to process (several months).

  1. Also, from a platform perspective, how possible would it be to disable this (OCR) text indexing for Commons, but allow third-party MediaWiki installers to turn it on if they want?
  2. How much time/effort are we estimating for putting in place a 50kb file text limit, removing all other text, and reindexing?

I think 1 and 2 are roughly the same functionality, in that with a parameter controlling the size to keep, you could set it to 0 to remove all the text content. The work involved is, I think, very minimal (a couple of lines of code to add in the CirrusSearch extension).

  3. How much time/effort if we remove the full text data from the index and then later decide to buy more hardware and re-add all of it back into the index?

It is still unclear to me, but I don't think it will be a huge effort on our side; we have a process that continuously checks the consistency of the index that we could perhaps reuse/adapt to take care of this. It would certainly take a lot of time to process (several months).

I agree, this is probably a patch of less than 100 lines, and much of that will be docs for the new config. Once we deploy it, the growth in index size should stop immediately, and the background process will reindex 1-2% of pages per day, giving us a continued reduction in index size over the next 8-10 weeks.
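
For a sense of the timeline implied by that reindex rate, a quick calculation (the daily percentages are taken from the comment above; the exact rate varies):

```python
# At 1-2% of pages reindexed per day, a full pass over the index takes roughly
# 7-14 weeks; the 8-10 week estimate above sits toward the faster end of that.
for pct_per_day in (1.0, 1.5, 2.0):
    days = 100.0 / pct_per_day
    print(f"{pct_per_day}%/day -> ~{days:.0f} days (~{days / 7:.0f} weeks)")
```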

MPhamWMF updated the task description. · Jan 15 2021, 7:49 PM
CBogen updated the task description.

This actually has an interesting side effect. Sometimes I try to search for a word on Wikipedia and need to search in all namespaces. Result: everything is filled with useless results from the File namespace (all PDF/DjVu), and I immediately remove File from the list of namespaces. It creates a lot of noise for me, TBH.

This actually has an interesting side effect. Sometimes I try to search for a word on Wikipedia and need to search in all namespaces. Result: everything is filled with useless results from the File namespace (all PDF/DjVu), and I immediately remove File from the list of namespaces. It creates a lot of noise for me, TBH.

This seems to be related to what is suggested in https://www.mediawiki.org/wiki/Topic:V9ddlo1a6cgrxmt4: having fine-grained control over which content a search query applies to. That probably deserves its own ticket, but there's an easy workaround that may help in these circumstances: prefixing your search query with local: will entirely skip Commons from your search results.

MPhamWMF renamed this task from "Determine source of increasing commonswiki_file on-disk size" to "Implement 50kb limit on file text indexing to reduce increasing commonswiki_file on-disk size". · Jan 19 2021, 2:31 PM
dcausse set the point value for this task to 5.

Noting here that we will be announcing that we plan to go live with this change on Monday, January 25th.

Change 657160 had a related patch set uploaded (by DCausse; owner: DCausse):
[mediawiki/extensions/CirrusSearch@master] Add an option to limit the size of the file_text field

https://gerrit.wikimedia.org/r/657160

Change 657160 merged by jenkins-bot:
[mediawiki/extensions/CirrusSearch@master] Add an option to limit the size of the file_text field

https://gerrit.wikimedia.org/r/657160

Change 658249 had a related patch set uploaded (by DCausse; owner: DCausse):
[mediawiki/extensions/CirrusSearch@wmf/1.36.0-wmf.27] Add an option to limit the size of the file_text field

https://gerrit.wikimedia.org/r/658249

Change 658240 had a related patch set uploaded (by DCausse; owner: DCausse):
[operations/mediawiki-config@master] [cirrus] set 50kb limit on file text indexing for commons

https://gerrit.wikimedia.org/r/658240

Change 658249 merged by jenkins-bot:
[mediawiki/extensions/CirrusSearch@wmf/1.36.0-wmf.27] Add an option to limit the size of the file_text field

https://gerrit.wikimedia.org/r/658249

Mentioned in SAL (#wikimedia-operations) [2021-01-25T15:20:49Z] <dcausse@deploy1001> Synchronized php-1.36.0-wmf.27/extensions/CirrusSearch/: Add an option to limit the size of the file_text field: T271493 (duration: 00m 58s)

Mentioned in SAL (#wikimedia-operations) [2021-01-25T15:23:45Z] <dcausse@deploy1001> Synchronized php-1.36.0-wmf.27/extensions/CirrusSearch/: revert: Add an option to limit the size of the file_text field: T271493 (duration: 01m 05s)

Change 658240 merged by jenkins-bot:
[operations/mediawiki-config@master] [cirrus] set 50kb limit on file text indexing for commons

https://gerrit.wikimedia.org/r/658240

Mentioned in SAL (#wikimedia-operations) [2021-01-28T12:12:50Z] <dcausse@deploy1001> Synchronized wmf-config/InitialiseSettings.php: T271493: [cirrus] set 50kb limit on file text indexing for commons (duration: 01m 09s)

Mentioned in SAL (#wikimedia-operations) [2021-01-28T12:32:38Z] <dcausse@deploy1001> Synchronized php-1.36.0-wmf.27/extensions/CirrusSearch/: Add an option to limit the size of the file_text field: T271493 (duration: 01m 09s)

Gehel closed this task as Resolved. · Mon, Feb 1, 1:46 PM