Index out of range in revert risk multi-lingual
Closed, Resolved (Public)

Description

From https://github.com/SWViewer/swviewer-service/issues/1

Repro:

curl https://api.wikimedia.org/service/lw/inference/v1/models/revertrisk-multilingual:predict -X POST -d '{"lang": "en", "rev_id": 1162517490}'

On the Kserve side:

Traceback (most recent call last):
  File "/opt/lib/python/site-packages/uvicorn/protocols/http/h11_impl.py", line 373, in run_asgi
    result = await app(self.scope, self.receive, self.send)
  File "/opt/lib/python/site-packages/uvicorn/middleware/proxy_headers.py", line 75, in __call__
    return await self.app(scope, receive, send)
  File "/opt/lib/python/site-packages/fastapi/applications.py", line 270, in __call__
    await super().__call__(scope, receive, send)
  File "/opt/lib/python/site-packages/starlette/applications.py", line 124, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/opt/lib/python/site-packages/starlette/middleware/errors.py", line 184, in __call__
    raise exc
  File "/opt/lib/python/site-packages/starlette/middleware/errors.py", line 162, in __call__
    await self.app(scope, receive, _send)
  File "/opt/lib/python/site-packages/timing_asgi/middleware.py", line 68, in __call__
    await self.app(scope, receive, send_wrapper)
  File "/opt/lib/python/site-packages/starlette/middleware/exceptions.py", line 79, in __call__
    raise exc
  File "/opt/lib/python/site-packages/starlette/middleware/exceptions.py", line 68, in __call__
    await self.app(scope, receive, sender)
  File "/opt/lib/python/site-packages/fastapi/middleware/asyncexitstack.py", line 21, in __call__
    raise e
  File "/opt/lib/python/site-packages/fastapi/middleware/asyncexitstack.py", line 18, in __call__
    await self.app(scope, receive, send)
  File "/opt/lib/python/site-packages/starlette/routing.py", line 706, in __call__
    await route.handle(scope, receive, send)
  File "/opt/lib/python/site-packages/starlette/routing.py", line 276, in handle
    await self.app(scope, receive, send)
  File "/opt/lib/python/site-packages/starlette/routing.py", line 66, in app
    response = await func(request)
  File "/opt/lib/python/site-packages/fastapi/routing.py", line 235, in app
    raw_response = await run_endpoint_function(
  File "/opt/lib/python/site-packages/fastapi/routing.py", line 161, in run_endpoint_function
    return await dependant.call(**values)
  File "/opt/lib/python/site-packages/kserve/protocol/rest/v1_endpoints.py", line 69, in predict
    response, response_headers = await self.dataplane.infer(model_name=model_name, body=body, headers=headers)
  File "/opt/lib/python/site-packages/kserve/protocol/dataplane.py", line 276, in infer
    response = await model(body, headers=headers)
  File "/opt/lib/python/site-packages/kserve/model.py", line 117, in __call__
    else self.predict(payload, headers)
  File "/srv/revert-risk-model/model-server/model.py", line 102, in predict
    result = KI_module.classify(self.model, request["revision"])
  File "/opt/lib/python/site-packages/knowledge_integrity/models/revertrisk_multilingual/model.py", line 407, in classify
    intermediate_features = extract_features(revision, CLASSIFIER_INTERMEDIATE_FEATURES)
  File "/opt/lib/python/site-packages/knowledge_integrity/models/revertrisk_multilingual/model.py", line 381, in extract_features
    return get_features(feature_sources, features, _transformed_revertrisk_features)
  File "/opt/lib/python/site-packages/knowledge_integrity/featureset.py", line 320, in get_features
    all_features.update(transformer(all_features))
  File "/opt/lib/python/site-packages/knowledge_integrity/models/revertrisk_multilingual/model.py", line 343, in _transformed_revertrisk_features
    edit_info = get_edit_info(
  File "/opt/lib/python/site-packages/knowledge_integrity/models/revertrisk_multilingual/model.py", line 297, in get_edit_info
    actions = et.get_diff()
  File "/opt/lib/python/site-packages/mwedittypes/mwedittypes.py", line 18, in get_diff
    self.tree_diff = get_diff(self.prev_wikitext, self.curr_wikitext, lang=self.lang,
  File "/opt/lib/python/site-packages/mwedittypes/tree_differ.py", line 19, in get_diff
    result = diff.post_process(prev_tree.secname_to_text, curr_tree.secname_to_text, lang=lang)
  File "/opt/lib/python/site-packages/mwedittypes/tree_differ.py", line 525, in post_process
    self._section_mapping(sections_prev, sections_curr)
  File "/opt/lib/python/site-packages/mwedittypes/tree_differ.py", line 608, in _section_mapping
    p_to_c[prev[i]] = curr[i]
IndexError: list index out of range

Event Timeline

Since this issue comes from the mwedittypes package, we'll need to either fix it in https://github.com/geohci/edit-types or add some error handling in the knowledge_integrity package. I am digging a bit more to see how we can improve this.
The specific issue occurs in the _section_mapping function, which maps the sections between two revisions.

This code segment is error-prone because it matches the previous revision's sections to the current revision's sections by index.
Whenever the previous revision has more sections than the current one, it will always result in an IndexError.
Lines 607-610 in tree_differ.py:

for i in range(len(prev)):
    p_to_c[prev[i]] = curr[i]
    c_to_p[curr[i]] = prev[i]
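
To make the failure mode concrete, here is a minimal sketch that reproduces it outside the library. The function names and section lists are made up for illustration: map_sections_unsafe mirrors the quoted loop, and map_sections_guarded is only one possible defensive variant, not the fix that mwedittypes uses.

```python
def map_sections_unsafe(prev, curr):
    """Mirror of the quoted loop: pairs sections purely by position."""
    p_to_c, c_to_p = {}, {}
    for i in range(len(prev)):
        p_to_c[prev[i]] = curr[i]  # IndexError once i >= len(curr)
        c_to_p[curr[i]] = prev[i]
    return p_to_c, c_to_p


def map_sections_guarded(prev, curr):
    """Hypothetical defensive variant: zip() stops at the shorter list,
    so surplus sections on either side are simply left unmapped."""
    p_to_c, c_to_p = {}, {}
    for p, c in zip(prev, curr):
        p_to_c[p] = c
        c_to_p[c] = p
    return p_to_c, c_to_p


prev_sections = ["Lead", "History", "Honours", "References"]
curr_sections = ["Lead", "History"]  # fewer sections than before

try:
    map_sections_unsafe(prev_sections, curr_sections)
except IndexError as exc:
    print(f"unsafe mapping failed: {exc}")

print(map_sections_guarded(prev_sections, curr_sections))
```

Note that the guarded variant only avoids the crash. It silently leaves the extra sections unmapped, which may hide a real mismatch, so it is a stopgap rather than a fix.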

thanks for digging this up @isarantopoulos -- I'll take a look and see if there's an easy fix in the next week

@isarantopoulos The revision mentioned in the description (diff) is a very tricky one. Among other things, the edit introduced an open bold text-formatting syntax (* '''[[European Cup Winners' Cup]]). Mediawiki seems to handle this fine but mwparserfromhell does something weird: it seems to keep looking for the closing tag (''') and while it still can detect the remaining headings in the article, it doesn't consider them to be new sections presumably because it thinks they're nested within text-formatting. I unfortunately made what felt like a very reasonable assumption in the edittypes code that every section has a heading (except for the lead section) and so treat the two as equivalent. The index error then is because I'm using mwparserfromhell's concept of sections for the section names but the library's concept of headings when it comes to ensuring that the two section mappings line up even if there are changes to the sections.

Potential next steps:

  • Do nothing because this is super complicated and maybe a rare edge case?
  • Try to remove my assumption that section == heading. This would be a huge refactoring so I'd rather not go down this path.
  • Try to introduce the fix into mwparserfromhell -- this seems to be a known issue for a few years though so not sure how easy fixing it would be.
  • The above issue led me to a potentially viable workaround of telling mwparserfromhell to ignore text formatting (skip_style_tags as described here). I'm going to look into this -- it would make it harder to distinguish between an edit that was changing bold/italics from one that was editing text but given that this is just about ''' and '' syntax, I think the workaround shouldn't be too hard.

I think I'll give the last one a try next week and if it seems to work, I'll push a new version that you all can use that will hopefully fix this.

Miriam triaged this task as Medium priority.

More logs (161) if you need (30.06-03.07):

[2023-06-29T18:21:09.472]: {error: 'IndexError : list index out of range'} | enwiki: 1162529317
[2023-06-29T19:08:53.805]: {error: 'IndexError : list index out of range'} | enwiki: 1162535933 
[2023-06-29T19:46:11.196]: {error: 'IndexError : list index out of range'} | enwiki: 116254183
[2023-06-29T19:50:25.503]: {error: 'IndexError : list index out of range'} | mswiki: 5892132 
[2023-06-29T20:04:01.898]: {error: 'IndexError : list index out of range'} | ruwiki: 131358889 
[2023-06-29T20:10:06.992]: {error: 'IndexError : list index out of range'} | enwiki: 1162545372 
[2023-06-29T21:31:22.545]: {error: 'IndexError : list index out of range'} | enwiki: 1162557253 
[2023-06-30T09:04:38.401]: {error: 'IndexError : list index out of range'} | frwiki: 25594107
[2023-06-30T09:29:00.995]: {error: 'IndexError : list index out of range'} | frwiki: 25594577
[2023-06-30T09:51:57.295]: {error: 'IndexError : list index out of range'} | eowiki: 8184372 
[2023-06-30T11:09:35.793]: {error: 'IndexError : list index out of range'} | enwiki: 1162649617 
[2023-06-30T11:09:39.396]: {error: 'IndexError : list index out of range'} | enwiki: 116264963
[2023-06-30T11:11:49.633]: {error: 'IndexError : list index out of range'} | itwiki: 134239636 
[2023-06-30T11:40:57.352]: {error: 'IndexError : list index out of range'} | enwiki: 1162653228 
[2023-06-30T12:02:13.499]: {error: 'IndexError : list index out of range'} | enwiki: 1162655713 
[2023-06-30T12:03:30.976]: {error: 'IndexError : list index out of range'} | itwiki: 13424550
[2023-06-30T12:13:31.507]: {error: 'IndexError : list index out of range'} | itwiki: 13424737
[2023-06-30T12:15:14.610]: {error: 'IndexError : list index out of range'} | dewiki: 23560637
[2023-06-30T12:31:06.606]: {error: 'IndexError : list index out of range'} | dewiki: 23560984
[2023-06-30T13:13:16.386]: {error: 'IndexError : list index out of range'} | enwiki: 116266488
[2023-06-30T13:14:13.512]: {error: 'IndexError : list index out of range'} | enwiki: 116266523
[2023-06-30T13:34:26.362]: {error: 'IndexError : list index out of range'} | enwiki: 1162668236 
[2023-06-30T13:41:40.862]: {error: 'IndexError : list index out of range'} | frwiki: 25599941
[2023-06-30T14:16:35.036]: {error: 'IndexError : list index out of range'} | enwiki: 1162674387 
[2023-06-30T14:38:18.724]: {error: 'IndexError : list index out of range'} | ruwiki: 13137386
[2023-06-30T14:57:51.215]: {error: 'IndexError : list index out of range'} | enwiki: 116268101
[2023-06-30T15:19:09.605]: {error: 'IndexError : list index out of range'} | enwiki: 116268458
[2023-06-30T15:46:37.471]: {error: 'IndexError : list index out of range'} | enwiki: 116268938
[2023-06-30T15:48:17.288]: {error: 'IndexError : list index out of range'} | enwiki: 1162689298 
[2023-06-30T15:48:40.630]: {error: 'IndexError : list index out of range'} | enwiki: 1162689355 
[2023-06-30T15:49:52.342]: {error: 'IndexError : list index out of range'} | enwiki: 1162689535 
[2023-06-30T16:05:04.812]: {error: 'IndexError : list index out of range'} | enwiki: 116269185
[2023-06-30T17:02:24.429]: {error: 'IndexError : list index out of range'} | frwiki: 25604888
[2023-06-30T18:53:42.461]: {error: 'IndexError : list index out of range'} | eswiki: 152187813 
[2023-06-30T20:26:48.762]: {error: 'IndexError : list index out of range'} | eswiki: 152189475 
[2023-06-30T20:30:59.260]: {error: 'IndexError : list index out of range'} | frwiki: 25609124
[2023-06-30T20:46:27.905]: {error: 'IndexError : list index out of range'} | eswiki: 152189858 
[2023-06-30T23:52:49.500]: {error: 'IndexError : list index out of range'} | jawiki: 95833146 
[2023-06-30T23:58:50.767]: {error: 'IndexError : list index out of range'} | frwiki: 25613769
[2023-06-30T23:58:58.729]: {error: 'IndexError : list index out of range'} | frwiki: 25613771
[2023-07-01T00:19:20.883]: {error: 'IndexError : list index out of range'} | jawiki: 95833384 
[2023-07-01T01:39:29.470]: {error: 'IndexError : list index out of range'} | enwiki: 1162767781 
[2023-07-01T02:06:41.186]: {error: 'IndexError : list index out of range'} | jawiki: 9583456
[2023-07-01T02:11:31.509]: {error: 'IndexError : list index out of range'} | enwiki: 1162771787 
[2023-07-01T02:44:29.965]: {error: 'IndexError : list index out of range'} | jawiki: 95834979 
[2023-07-01T02:44:39.844]: {error: 'IndexError : list index out of range'} | jawiki: 95834982 
[2023-07-01T02:47:10.071]: {error: 'IndexError : list index out of range'} | jawiki: 9583518
[2023-07-01T02:55:18.913]: {error: 'IndexError : list index out of range'} | jawiki: 95835135 
[2023-07-01T03:50:39.641]: {error: 'IndexError : list index out of range'} | jawiki: 95835782 
[2023-07-01T06:14:11.420]: {error: 'IndexError : list index out of range'} | bnwiki: 6719613 
[2023-07-01T06:27:01.676]: {error: 'IndexError : list index out of range'} | itwiki: 13425323
[2023-07-01T07:11:04.014]: {error: 'IndexError : list index out of range'} | jawiki: 9583785
[2023-07-01T07:59:49.530]: {error: 'IndexError : list index out of range'} | jawiki: 95838443 
[2023-07-01T08:55:32.438]: {error: 'IndexError : list index out of range'} | thwiki: 1843954
[2023-07-01T08:58:39.985]: {error: 'IndexError : list index out of range'} | thwiki: 1843960
[2023-07-01T09:00:34.580]: {error: 'IndexError : list index out of range'} | thwiki: 1843963
[2023-07-01T09:07:22.014]: {error: 'IndexError : list index out of range'} | thwiki: 1843970
[2023-07-01T09:45:44.887]: {error: 'IndexError : list index out of range'} | jawiki: 95839651 
[2023-07-01T11:02:14.020]: {error: 'IndexError : list index out of range'} | jawiki: 9584473
[2023-07-01T11:25:57.332]: {error: 'IndexError : list index out of range'} | enwiki: 116283884
[2023-07-01T11:32:02.005]: {error: 'IndexError : list index out of range'} | eswiki: 15221705
[2023-07-01T11:37:01.895]: {error: 'IndexError : list index out of range'} | frwiki: 25624019
[2023-07-01T11:40:57.972]: {error: 'IndexError : list index out of range'} | dewiki: 23587088
[2023-07-01T11:47:33.026]: {error: 'IndexError : list index out of range'} | jawiki: 9584998
[2023-07-01T11:47:36.400]: {error: 'IndexError : list index out of range'} | enwiki: 116284130
[2023-07-01T11:48:05.836]: {error: 'IndexError : list index out of range'} | frwiki: 25624241
[2023-07-01T12:16:15.381]: {error: 'IndexError : list index out of range'} | thwiki: 1844313
[2023-07-01T12:17:35.385]: {error: 'IndexError : list index out of range'} | eswiki: 15222304
[2023-07-01T12:29:56.190]: {error: 'IndexError : list index out of range'} | jawiki: 95841481 
[2023-07-01T13:02:19.541]: {error: 'IndexError : list index out of range'} | jawiki: 95841963 
[2023-07-01T13:29:21.172]: {error: 'IndexError : list index out of range'} | jawiki: 95842322 
[2023-07-01T13:51:03.180]: {error: 'IndexError : list index out of range'} | enwiki: 116285588
[2023-07-01T14:16:29.054]: {error: 'IndexError : list index out of range'} | thwiki: 1844728
[2023-07-01T14:19:19.275]: {error: 'IndexError : list index out of range'} | eswiki: 15223467
[2023-07-01T14:30:16.092]: {error: 'IndexError : list index out of range'} | jawiki: 95843282 
[2023-07-01T14:54:05.074]: {error: 'IndexError : list index out of range'} | jawiki: 9584367
[2023-07-01T14:56:15.693]: {error: 'IndexError : list index out of range'} | jawiki: 95843645 
[2023-07-01T17:29:07.684]: {error: 'IndexError : list index out of range'} | enwiki: 1162883441 
[2023-07-01T18:20:30.882]: {error: 'IndexError : list index out of range'} | bgwiki: 11846645 
[2023-07-01T18:46:08.452]: {error: 'IndexError : list index out of range'} | euwiki: 9328141 
[2023-07-01T19:17:07.659]: {error: 'IndexError : list index out of range'} | plwiki: 7767438
[2023-07-01T19:52:00.136]: {error: 'IndexError : list index out of range'} | enwiki: 116291896
[2023-07-01T20:00:46.778]: {error: 'IndexError : list index out of range'} | hrwiki: 667589
[2023-07-01T23:49:44.482]: {error: 'IndexError : list index out of range'} | jawiki: 95847459 
[2023-07-02T01:05:59.657]: {error: 'IndexError : list index out of range'} | mswiki: 5894323 
[2023-07-02T02:31:32.219]: {error: 'IndexError : list index out of range'} | jawiki: 9584956
[2023-07-02T02:36:32.151]: {error: 'IndexError : list index out of range'} | jawiki: 95849111 
[2023-07-02T02:50:18.250]: {error: 'IndexError : list index out of range'} | enwiki: 1162955847 
[2023-07-02T03:13:47.418]: {error: 'IndexError : list index out of range'} | jawiki: 9584951
[2023-07-02T05:12:46.757]: {error: 'IndexError : list index out of range'} | thwiki: 1845737
[2023-07-02T05:15:48.715]: {error: 'IndexError : list index out of range'} | thwiki: 1845741
[2023-07-02T06:06:22.214]: {error: 'IndexError : list index out of range'} | jawiki: 95851365 
[2023-07-02T06:12:02.014]: {error: 'IndexError : list index out of range'} | itwiki: 13427944
[2023-07-02T06:36:53.229]: {error: 'IndexError : list index out of range'} | jawiki: 95851711 
[2023-07-02T07:25:55.450]: {error: 'IndexError : list index out of range'} | hiwiki: 5897646 
[2023-07-02T08:26:20.072]: {error: 'IndexError : list index out of range'} | enwiki: 1162989885 
[2023-07-02T08:36:22.622]: {error: 'IndexError : list index out of range'} | thwiki: 1846671
[2023-07-02T08:53:28.634]: {error: 'IndexError : list index out of range'} | ruwiki: 13147340
[2023-07-02T09:31:05.147]: {error: 'IndexError : list index out of range'} | thwiki: 1846817
[2023-07-02T09:31:05.783]: {error: 'IndexError : list index out of range'} | jawiki: 9585411
[2023-07-02T09:32:35.272]: {error: 'IndexError : list index out of range'} | ruwiki: 13148304
[2023-07-02T10:49:53.765]: {error: 'IndexError : list index out of range'} | mswiki: 5894865 
[2023-07-02T10:58:37.686]: {error: 'IndexError : list index out of range'} | ruwiki: 13141016
[2023-07-02T11:12:03.016]: {error: 'IndexError : list index out of range'} | ruwiki: 13141267
[2023-07-02T11:31:19.149]: {error: 'IndexError : list index out of range'} | svwiki: 53394835 
[2023-07-02T11:48:59.582]: {error: 'IndexError : list index out of range'} | thwiki: 1847098
[2023-07-02T11:50:57.400]: {error: 'IndexError : list index out of range'} | thwiki: 1847102
[2023-07-02T12:07:39.538]: {error: 'IndexError : list index out of range'} | mswiki: 5894956 
[2023-07-02T12:07:48.600]: {error: 'IndexError : list index out of range'} | eswiki: 152221558 
[2023-07-02T12:26:10.487]: {error: 'IndexError : list index out of range'} | itwiki: 134275622 
[2023-07-02T14:12:35.936]: {error: 'IndexError : list index out of range'} | ruwiki: 131413551 
[2023-07-02T15:33:16.672]: {error: 'IndexError : list index out of range'} | nlwiki: 6461689
[2023-07-02T15:50:14.345]: {error: 'IndexError : list index out of range'} | nlwiki: 6461798
[2023-07-02T16:04:18.914]: {error: 'IndexError : list index out of range'} | svwiki: 5339877
[2023-07-02T16:07:19.407]: {error: 'IndexError : list index out of range'} | thwiki: 1847807
[2023-07-02T17:06:34.673]: {error: 'IndexError : list index out of range'} | eswiki: 152225331 
[2023-07-02T17:26:59.599]: {error: 'IndexError : list index out of range'} | eswiki: 152225584 
[2023-07-02T17:34:07.404]: {error: 'IndexError : list index out of range'} | jawiki: 9586187
[2023-07-02T18:12:15.014]: {error: 'IndexError : list index out of range'} | mswiki: 5895355 
[2023-07-02T18:14:14.764]: {error: 'IndexError : list index out of range'} | ruwiki: 13141834
[2023-07-02T18:32:48.637]: {error: 'IndexError : list index out of range'} | svwiki: 53399884 
[2023-07-02T18:41:05.347]: {error: 'IndexError : list index out of range'} | mswiki: 5895363 
[2023-07-02T19:11:17.901]: {error: 'IndexError : list index out of range'} | ruwiki: 131419878 
[2023-07-02T19:15:46.384]: {error: 'IndexError : list index out of range'} | ruwiki: 131419956 
[2023-07-02T20:52:40.615]: {error: 'IndexError : list index out of range'} | eswiki: 152229247 
[2023-07-02T21:23:25.888]: {error: 'IndexError : list index out of range'} | eswiki: 152229888 
[2023-07-02T22:31:22.077]: {error: 'IndexError : list index out of range'} | enwiki: 116390458
[2023-07-02T22:34:22.699]: {error: 'IndexError : list index out of range'} | enwiki: 116390815
[2023-07-02T22:42:52.898]: {error: 'IndexError : list index out of range'} | mswiki: 5895416 
[2023-07-02T23:01:03.133]: {error: 'IndexError : list index out of range'} | ruwiki: 131423161 
[2023-07-02T23:17:39.161]: {error: 'IndexError : list index out of range'} | enwiki: 116394811
[2023-07-03T01:42:28.993]: {error: 'IndexError : list index out of range'} | jawiki: 95862676 
[2023-07-03T02:51:58.976]: {error: 'IndexError : list index out of range'} | jawiki: 9586336
[2023-07-03T03:42:38.304]: {error: 'IndexError : list index out of range'} | itwiki: 134286127 
[2023-07-03T04:09:07.066]: {error: 'IndexError : list index out of range'} | enwiki: 1163129961 
[2023-07-03T04:36:26.935]: {error: 'IndexError : list index out of range'} | thwiki: 1848353
[2023-07-03T04:38:59.723]: {error: 'IndexError : list index out of range'} | thwiki: 1848360
[2023-07-03T04:47:11.394]: {error: 'IndexError : list index out of range'} | thwiki: 1848377
[2023-07-03T04:51:45.831]: {error: 'IndexError : list index out of range'} | thwiki: 1848386
[2023-07-03T04:52:52.323]: {error: 'IndexError : list index out of range'} | thwiki: 1848389
[2023-07-03T05:05:13.114]: {error: 'IndexError : list index out of range'} | hiwiki: 5898468 
[2023-07-03T05:29:53.440]: {error: 'IndexError : list index out of range'} | thwiki: 1848500
[2023-07-03T05:44:23.604]: {error: 'IndexError : list index out of range'} | jawiki: 95864784 
[2023-07-03T06:58:25.351]: {error: 'IndexError : list index out of range'} | jawiki: 9586545
[2023-07-03T07:31:41.321]: {error: 'IndexError : list index out of range'} | jawiki: 9586584
[2023-07-03T07:33:08.271]: {error: 'IndexError : list index out of range'} | zhwiki: 77927165 
[2023-07-03T08:27:31.112]: {error: 'IndexError : list index out of range'} | jawiki: 95866385 
[2023-07-03T09:31:08.122]: {error: 'IndexError : list index out of range'} | frwiki: 25678533
[2023-07-03T09:41:59.288]: {error: 'IndexError : list index out of range'} | frwiki: 25678778
[2023-07-03T09:59:45.327]: {error: 'IndexError : list index out of range'} | thwiki: 1849095
[2023-07-03T10:11:46.563]: {error: 'IndexError : list index out of range'} | frwiki: 25679463
[2023-07-03T10:29:44.526]: {error: 'IndexError : list index out of range'} | jawiki: 95867617 
[2023-07-03T11:39:39.855]: {error: 'IndexError : list index out of range'} | mswiki: 5896152 
[2023-07-03T11:40:06.252]: {error: 'IndexError : list index out of range'} | mswiki: 5896153 
[2023-07-03T11:47:50.784]: {error: 'IndexError : list index out of range'} | frwiki: 25681455
[2023-07-03T12:20:17.879]: {error: 'IndexError : list index out of range'} | dewiki: 235146288 
[2023-07-03T12:25:03.313]: {error: 'IndexError : list index out of range'} | jawiki: 9586887
[2023-07-03T12:29:53.553]: {error: 'IndexError : list index out of range'} | jawiki: 95868936 
[2023-07-03T13:12:30.571]: {error: 'IndexError : list index out of range'} | jawiki: 95869664 
[2023-07-03T13:18:22.281]: {error: 'IndexError : list index out of range'} | frwiki: 25683810
[2023-07-03T13:21:52.093]: {error: 'IndexError : list index out of range'} | frwiki: 25683907
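
For anyone who wants to replay these revisions, a rough sketch of turning the log lines into request bodies like the one in the task description could look as follows. The regular expression is written against the log format shown above, and the assumption that the language code is the wiki name minus the "wiki" suffix (enwiki -> en) is mine; the helper names are hypothetical.

```python
import json
import re

# Matches lines such as:
# [2023-06-29T18:21:09.472]: {error: 'IndexError : list index out of range'} | enwiki: 1162529317
LOG_LINE = re.compile(
    r"^\[(?P<ts>[^\]]+)\]: \{error: '(?P<error>[^']*)'\} \| "
    r"(?P<lang>[a-z_]+)wiki: (?P<rev_id>\d+)\s*$"
)


def parse_log_line(line):
    """Return (lang, rev_id) for a matching log line, or None otherwise."""
    match = LOG_LINE.match(line)
    if match is None:
        return None
    return match.group("lang"), int(match.group("rev_id"))


def build_payload(lang, rev_id):
    """Build a JSON body shaped like the one in the curl repro."""
    return json.dumps({"lang": lang, "rev_id": rev_id})


sample = [
    "[2023-06-29T18:21:09.472]: {error: 'IndexError : list index out of range'} | enwiki: 1162529317",
    "[2023-06-29T19:50:25.503]: {error: 'IndexError : list index out of range'} | mswiki: 5892132 ",
]
for line in sample:
    parsed = parse_log_line(line)
    if parsed is not None:
        print(build_payload(*parsed))
```

The sketch deliberately stops at building the payload, so sending it to the endpoint (and any rate limiting) is left to the caller.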

New version released that should address this: https://pypi.org/project/mwedittypes/2.1.0/

The fix for this (using the skip_style_tags option) triggered a slew of other changes. A few things to be aware of:

  • I noticed that the StructuredEditTypes differ was frequently recomputing a property that is actually static in the context of this library so instead cached that and this led to a very noticeable bump in speed, especially for large diffs. Yay!
  • The behavior should otherwise be exactly the same minus no longer getting this error and getting more reasonable diffs for these types of revisions too. There is one small caveat that with the new approach for text-formatting detection, if the only change is a text-formatting span change across text -- e.g., ''bold'' text -> ''bold text'', this won't trigger any detected changes in StructuredEditTypes. I consider this an acceptable if imperfect outcome though will try to address in the future.
  • I made some other changes too but almost all should be invisible. There are some signature changes but they're deep in the library, so if you're using the top-level exposed functions in mwedittypes, this should be a 1:1 swap code-wise. If you're calling functions from more deeply in the library, you might have to make minor adjustments.
  • Given that the only changes in outputs are in these rare diffs where text-formatting was set improperly in the wikitext, I assume the drift caused by training the model on an earlier version of this library vs. running inference with the updated library is minimal but will leave that to you all to decide.
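
As an illustration of the caching point in the first bullet, here is a generic sketch of computing a static property once per object instead of on every access. The SectionTree class and its section_names property are invented for this example and are not the actual mwedittypes code; functools.cached_property is just one standard way to get this behaviour.

```python
from functools import cached_property


class SectionTree:
    """Toy stand-in for a parsed revision; not the mwedittypes class."""

    def __init__(self, wikitext):
        self.wikitext = wikitext
        self.compute_count = 0  # instrumentation for this sketch only

    @cached_property
    def section_names(self):
        # Pretend this is expensive; with cached_property it runs once per instance.
        self.compute_count += 1
        return [
            line.strip("= ")
            for line in self.wikitext.splitlines()
            if line.startswith("==")
        ]


tree = SectionTree("Intro text\n== History ==\nSome text\n== Honours ==\nMore text")
for _ in range(1000):
    names = tree.section_names  # computed on first access, then reused
print(names, tree.compute_count)
```

This kind of caching is only safe when the underlying value really is static for the lifetime of the object, which is the condition described in the bullet above.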

Please switch over and let me know if everything is working as expected and then I'll close this task. Thanks!

@diego @MunizaA Hi! IIUC we'd need to bump mwedittypes to 2.1.0 in knowledge_integrity, do you have time to do it?

@Isaac really nice work! Thanks!

Thanks a lot @Isaac for tackling this! Would this also solve the other revertrisk issues that crash in tree_differ.py of mwedittypes or should they be tackled independently?
I am talking about https://phabricator.wikimedia.org/T340812 and https://phabricator.wikimedia.org/T340813

Hi @elukey, the dependency constraint we have for mwedittypes in KI is "1.2.1", so unfortunately this new version is not a drop-in replacement. There are some minor API changes but, more importantly, the diff processing code in get_edit_info will have to be modified to adapt to this new version. I can look into making these modifications, but looking at the changelog for mwedittypes, it seems there have also been some changes to how certain types of edits are captured since v1.2.1, and I'm not sure how this would impact the performance of the model. I'm discussing these changes with @Trokhymovych and will make the switch as soon as we're sure about its impact.

Would this also solve the other revertrisk issues that crash in tree_differ.py of mwedittypes or should they be tackled independently? I am talking about https://phabricator.wikimedia.org/T340812 and https://phabricator.wikimedia.org/T340813

Yep, they seem to be working now too! FYI if folks are curious to test on other revisions, I have an API and UI setup for the library that I have bumped to the new version. The UI is accessible via: https://wiki-topic.toolforge.org/diff-tagging. You just provide it the language code and revision ID and then select "Simple" for "Simple" or "Detailed" for "Structured". It's this latter one that revert risk is using. And if you want raw outputs, the API can give you Simple and Detailed/Structured in one place so you can compare. Example call: https://edit-types.wmcloud.org/diff-details?lang=en&revid=1162517490

@MunizaA if you're going to make the upgrade, let's work together to simplify down too. Most of the changes to outputs are around Text Formatting bugs so should hopefully still be relatively minor in terms of impact on the model minus fewer errors and hopefully speed improvements. Also, because it looks like you're using the library mainly for the text diffing, I think I can help you extract that info more efficiently and possibly even build in some of that functionality. For example: https://public-paws.wmcloud.org/User:Isaac_(WMF)/Edit%20Diffs/Text-Diffing.ipynb

@Isaac thanks so much for the pointers! It seems like this model is also using node edit info for some of the features but in any case, we should be able to simplify the text diffing code using functionality from mwedittypes.

Also, I tested the 161 revisions from @Iluvatar 's logs using the latest version and I was able to successfully get diffs for 128 of them! However, I still see IndexError: list index out of range for a small number of revisions:

ms: 5892132. Reason: list index out of range
eo: 8184372. Reason: list index out of range
en: 1162655713. Reason: list index out of range
en: 1162668236. Reason: list index out of range
en: 1162689298. Reason: list index out of range
en: 1162689535. Reason: list index out of range
ja: 95833146. Reason: list index out of range
en: 1162771787. Reason: list index out of range
ja: 95834979. Reason: list index out of range
ja: 95834982. Reason: list index out of range
ja: 95835135. Reason: list index out of range
ja: 95835782. Reason: list index out of range
ja: 95838443. Reason: list index out of range
ja: 95839651. Reason: list index out of range
ja: 95841481. Reason: list index out of range
ja: 95841963. Reason: list index out of range
ja: 95842322. Reason: list index out of range
ja: 95843282. Reason: list index out of range
ja: 95843645. Reason: list index out of range
en: 1162883441. Reason: list index out of range
bg: 11846645. Reason: list index out of range
ja: 95849111. Reason: list index out of range
ja: 95851365. Reason: list index out of range
ja: 95851711. Reason: list index out of range
hi: 5897646. Reason: list index out of range
sv: 53394835. Reason: list index out of range
ru: 131419878. Reason: list index out of range
ru: 131419956. Reason: list index out of range
ru: 131423161. Reason: list index out of range
ja: 95862676. Reason: list index out of range
hi: 5898468. Reason: list index out of range
ja: 95868936. Reason: list index out of range
ja: 95869664. Reason: list index out of range

I'm getting the diff by doing:

import mwedittypes
# prev_wikitext / curr_wikitext: wikitext of the previous and current revision
et = mwedittypes.StructuredEditTypes(prev_wikitext, curr_wikitext, lang=lang, timeout=True)
et.get_diff()

I'm able to get diffs for most of these revisions if I disable timeout though, so it seems like it's being caused by a different bug? I'm also able to get results for most of them using this endpoint https://edit-types.wmcloud.org/diff-details?lang=en&revid=1162668236 so I was wondering if this API also has timeout disabled? If not then the problem could just be with my test setup.
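
For reference, a check like the one above can be sketched as a small harness that runs a diff function over a list of revisions and reports failures in the same "lang: rev_id. Reason: ..." format. The get_diff callable is a hypothetical parameter; in a real run it would fetch both wikitexts and call StructuredEditTypes(...).get_diff(), while here a fake stand-in keeps the sketch runnable offline.

```python
def check_revisions(revisions, get_diff):
    """Run get_diff(lang, rev_id) for each revision and collect failures.

    `get_diff` is a hypothetical callable supplied by the caller.
    """
    failures = []
    for lang, rev_id in revisions:
        try:
            get_diff(lang, rev_id)
        except Exception as exc:  # broad on purpose: every failure gets reported
            failures.append((lang, rev_id, str(exc)))
    return failures


def format_failure(lang, rev_id, reason):
    """Format one failure like the list above."""
    return f"{lang}: {rev_id}. Reason: {reason}"


def fake_get_diff(lang, rev_id):
    # Stand-in that fails for one revision so the sketch runs without network access.
    if rev_id == 5892132:
        raise IndexError("list index out of range")
    return []


revisions = [("en", 1162517490), ("ms", 5892132)]
for failure in check_revisions(revisions, fake_get_diff):
    print(format_failure(*failure))
```

Running the same harness once with timeout enabled and once with it disabled would make it easy to see which failures depend on that setting.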

so I was wondering if this API also has timeout disabled? If not then the problem could just be with my test setup.

Yeah, the API has timeout disabled. I prefer that setting but understand it's nice to not trigger diffs that could potentially hang for a long time.

I'm able to get diffs for most of these revisions if I disable timeout though, so it seems like it's being caused by a different bug?

No, thanks for raising this again -- it had slipped my mind to look into this bug caused by the timeout implementation because I got sidetracked with the others. A few options:

  • If you'd like a quick solution, I can push a very simple change that seems to handle this (I retain headings when I prune so I can better track section inserts/removals): https://github.com/geohci/edit-types/pull/80/files
  • With the set you provided, there's still one revision that throws an error though because someone added a new heading in the middle of a table (why people why?!?!), which is seen as a heading but not a section in mwparserfromhell so the mismatch remains. Fixing this one would take longer I think but I can look into it and see how fixable it is. Might take several weeks though depending on how complex and when I can make time. You're welcome to take a swing at it too but I definitely apologize and don't necessarily recommend because the library could use a lot more code comments to explain the logic and this bug spans a lot of different interconnected parts.

It seems like this model is also using node edit info for some of the features

Oh nice! There's still probably a solution where you use the Simple form of the differ for those nodes and then do the text separately. I think it'll probably still be faster, especially in the worst case, than running through the full Structured Differ.

@MunizaA -- I managed to separate the headers from sections in the code so it's much cleaner now and seems to run fine for your list of revisions with timeout set to True. We're now at version 2.1.1. If you get a chance, let me know if it's working for you though I recognize there's still the broader refactoring to happen.

I managed to separate the headers from sections in the code so it's much cleaner now and seems to run fine for your list of revisions with timeout set to True.

@Isaac that's awesome, thank you!

If you get a chance, let me know if it's working for you though I recognize there's still the broader refactoring to happen.

I rewrote some parts of RRML earlier this week to replace StructuredEditTypes with SimpleEditTypes in this MR, using pointers from your comments above. The changes are still being tested by @Trokhymovych, the original author of this code, to make sure there isn't a significant drift in the predictions made by the model but if you have some time, I'd really appreciate it if you could look it over and let us know if we're getting the info that we need using mwedittypes in the most efficient way possible. Thanks!

@MunizaA -- I took a quick pass and left some comments, but it's generally looking good. I didn't test locally, but it should hopefully give you some substantial speed-ups with very minimal drift.

I have checked the proposed changes (MR: https://gitlab.wikimedia.org/repos/research/knowledge_integrity/-/merge_requests/17).

For the test, I used a random sample test dataset of 10K revisions from 47 languages, processed asynchronously in batches of 20 revisions. I compared four models: Multilingual changed (corresponds to the MR), the Language Agnostic model, a stable version of Multilingual, and an experimental simplified version of Multilingual (corresponds to the MR but without BERT features). I also added ORES scores for reference.
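A rough sketch of what "async in batches of 20" could look like, assuming each revision is scored by an awaitable call. This is not the script used for the numbers in this comment: `score_in_batches`, `error_rate`, and the stub `fake_score` are hypothetical names, and the stub simply fails on every tenth revision ID to show how an error rate would be tallied.

```python
import asyncio


async def score_in_batches(rev_ids, score_one, batch_size=20):
    """Score revisions concurrently, one batch at a time.

    Failures are kept as exception objects so they can be counted later
    instead of aborting the whole run.
    """
    results = {}
    for start in range(0, len(rev_ids), batch_size):
        batch = rev_ids[start:start + batch_size]
        outcomes = await asyncio.gather(
            *(score_one(rev_id) for rev_id in batch),
            return_exceptions=True,
        )
        results.update(zip(batch, outcomes))
    return results


def error_rate(results):
    """Fraction of revisions whose scoring raised an exception."""
    if not results:
        return 0.0
    failures = sum(isinstance(v, Exception) for v in results.values())
    return failures / len(results)


async def fake_score(rev_id):
    # Stub standing in for a real model call (assumption for illustration).
    await asyncio.sleep(0)
    if rev_id % 10 == 0:
        raise ValueError(f"could not process revision {rev_id}")
    return 0.5


results = asyncio.run(score_in_batches(list(range(1, 51)), fake_score))
```

Running one batch at a time caps the number of in-flight requests at the batch size, which keeps load on the service predictable during an evaluation run.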

Results are presented in the attached table.

[Image: image.png, results table]

Summary:

  1. We observe a slight drop in the model's performance after the proposed changes. This is expected and acceptable, since we slightly changed the model input without retraining. We can proceed with those changes, then later re-collect data with the proper preprocessing procedure and retrain the model (a fairly long process).
  2. As a result of the MR, we observe a drop in error rate (after refactoring it is now almost equal for LA and Multilingual) and an efficiency improvement.
  3. (Optional) We can observe the alternative simplified version of the multilingual model (without MLM features), which shows a significant boost in performance for anonymous users and efficiency compared to LA. It requires further analysis, and some insights can be reused for improving the LA model for IP edits.

CC: @MunizaA

Great news @Trokhymovych, and thanks for sharing this results table! If there are any specific revids that were erroring out, don't hesitate to let me know which ones, plus what the error was if it's in the mwedittypes library. The error rate seems low enough that it's maybe not worth fixing for this round, but it's always good to know what edge cases are out there.

I rewrote some parts of RRML earlier this week to replace StructuredEditTypes with SimpleEditTypes in this MR, using pointers from your comments above. The changes are still being tested by @Trokhymovych, the original author of this code, to make sure there isn't a significant drift in the predictions made by the model

This has now been tested and merged. Changes can be pulled in by installing the latest release. This would also require a new model file for Revert Risk Multilingual (Version 4) which @Trokhymovych mentioned he'll share soon. Thanks all!

@MunizaA @Isaac thanks so much for this! Just checking if the task is resolved?

It is almost done! Let's resolve when the new model binary is uploaded and deployed to k8s :)

Change 947875 had a related patch set uploaded (by AikoChou; author: AikoChou):

[machinelearning/liftwing/inference-services@main] revert-risk: upgrade knowledge_integrity to v0.3.0

https://gerrit.wikimedia.org/r/947875

Change 947875 merged by jenkins-bot:

[machinelearning/liftwing/inference-services@main] revert-risk: upgrade knowledge_integrity to v0.3.0

https://gerrit.wikimedia.org/r/947875

Change 948103 had a related patch set uploaded (by AikoChou; author: AikoChou):

[operations/deployment-charts@master] ml-services: update revert-risk images and model binary

https://gerrit.wikimedia.org/r/948103

Change 948103 merged by jenkins-bot:

[operations/deployment-charts@master] ml-services: update revert-risk images and model binary

https://gerrit.wikimedia.org/r/948103

The new model binary has been uploaded and deployed. Thank you all for working on this! The task is resolved. :)