Some mw-ocg-service logs fail to index and are being dropped
Open, Needs Triage, Public

Description

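Elasticsearch, the storage service behind Logstash, rejects any single indexed term whose UTF-8 encoding is longer than 32766 bytes (about 32 kB); for a keyword field, the term is the entire field value. Some documents coming from mw-ocg-service (a few hundred a day) carry fields larger than that, so those events are being dropped. In the example below, the details_log.raw field is 142602 bytes.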

Example error:

[2017-05-30T06:54:17,289][WARN ][logstash.outputs.elasticsearch] Failed action. {:status=>400, :action=>["index", {:_id=>nil, :_index=>"logstash-2017.05.30", :_type=>"mw-ocg-service", :_routing=>nil}, 2017-05-30T06:54:16.828Z ocg1001 Error: Latex Error! Check your latex string
    at /srv/deployment/ocg/ocg/node_modules/gammalatex/app.js:72:12
    at CB (/srv/deployment/ocg/ocg/node_modules/gammalatex/node_modules/rimraf/rimraf.js:68:5)
    at Object.oncomplete (fs.js:107:15)], :response=>{"index"=>{"_index"=>"logstash-2017.05.30", "_type"=>"mw-ocg-service", "_id"=>"AVxYIjp7i3IhbcQGmtbq", "status"=>400, "error"=>{"type"=>"illegal_argument_exception", "reason"=>"Document contains at least one immense term in field=\"details_log.raw\" (whose UTF8 encoding is longer than the max length 32766), all of which were skipped.  Please correct the analyzer to not produce such terms.  The prefix of the first immense term is: '[84, 104, 105, 115, 32, 105, 115, 32, 88, 101, 84, 101, 88, 44, 32, 86, 101, 114, 115, 105, 111, 110, 32, 51, 46, 49, 52, 49, 53, 57]...', original message: bytes can be at most 32766 in length; got 142602", "caused_by"=>{"type"=>"max_bytes_length_exceeded_exception", "reason"=>"max_bytes_length_exceeded_exception: bytes can be at most 32766 in length; got 142602"}}}}}
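
To read the error, it can help to decode the byte prefix it reports and to check a value against the limit by hand. The following is a minimal Python sketch for illustration only; the names `PREFIX`, `MAX_TERM_BYTES`, `decode_prefix` and `exceeds_limit` are made up here and are not part of Logstash, Elasticsearch or mw-ocg-service.

```python
# Byte prefix copied verbatim from the "first immense term" in the error above.
PREFIX = [84, 104, 105, 115, 32, 105, 115, 32, 88, 101, 84, 101, 88, 44, 32,
          86, 101, 114, 115, 105, 111, 110, 32, 51, 46, 49, 52, 49, 53, 57]

# Limit quoted in the error message ("bytes can be at most 32766 in length").
MAX_TERM_BYTES = 32766


def decode_prefix(byte_values):
    """Turn the list of byte values from the error into readable text."""
    return bytes(byte_values).decode("utf-8", errors="replace")


def exceeds_limit(text, limit=MAX_TERM_BYTES):
    """Return True if the UTF-8 encoding of text is longer than limit bytes."""
    return len(text.encode("utf-8")) > limit


if __name__ == "__main__":
    print(decode_prefix(PREFIX))  # prints: This is XeTeX, Version 3.14159
```

The decoded prefix suggests that the oversized field holds the full XeTeX run log. Note that the limit applies to encoded bytes, not characters, so a string with non-ASCII characters can exceed it with fewer than 32766 characters.
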
Restricted Application added a subscriber: Aklapper. (May 30 2017, 7:17 PM)
EBernhardson updated the task description. (May 30 2017, 7:18 PM)

If this data is particularly important, we could explicitly map this field so that it does not use the keyword type, which would allow storing values larger than 32 kB. In general, though, we have let auto-mapping handle all but a very small set of shared fields. It might also be possible to come up with a mapping that automatically truncates to 32 kB rather than erroring out, but that requires some investigation.
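
As a rough illustration of what truncating to the limit would involve, here is a minimal Python sketch. It is an assumption about one possible approach, not how Logstash or any Elasticsearch mapping does it; `truncate_utf8` and `MAX_TERM_BYTES` are hypothetical names. The main subtlety is that cutting at a fixed byte offset can split a multi-byte UTF-8 character, so the partial trailing character has to be dropped.

```python
# Limit quoted in the Elasticsearch error: bytes can be at most 32766 in length.
MAX_TERM_BYTES = 32766


def truncate_utf8(text, limit=MAX_TERM_BYTES):
    """Return text cut so that its UTF-8 encoding is at most limit bytes.

    If the cut lands inside a multi-byte character, that partial character
    is dropped, so the result may be slightly shorter than limit bytes.
    """
    encoded = text.encode("utf-8")
    if len(encoded) <= limit:
        return text
    # errors="ignore" discards the incomplete byte sequence left at the end.
    return encoded[:limit].decode("utf-8", errors="ignore")


if __name__ == "__main__":
    oversized = "x" * 142602  # same size as the field in the example error
    print(len(truncate_utf8(oversized).encode("utf-8")))  # prints: 32766
```

Truncation keeps the event indexable at the cost of losing the tail of the log, so whether it is acceptable depends on how much of details_log is actually needed for debugging.
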

As already announced in Tech News, OfflineContentGenerator (OCG) will not be used anymore after October 1st, 2017 on Wikimedia sites. OCG will be replaced by Electron. You can read more on mediawiki.org.