
Issues with Reference Need and Reference Risk models
Closed, Resolved · Public · 3 Estimated Story Points

Description

Enterprise is seeing some issues with the ReferenceRisk and ReferenceNeed models.

I'm opening this task to document the steps taken to resolve these issues, for future reference. Details can be found here.

Reference-risk

  • Unexpected issue: The expected response, described in the public model card, says the model should return the URL domain that was parsed and assessed. It is not doing so.

To return additional reference metadata, set extended_output to True in the input (see the API doc; a request sketch follows below).
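
For illustration, a request of that shape might look like the following (the endpoint follows the usual Lift Wing convention, but the model name and the rev_id/lang payload fields are assumptions, so treat the API doc as authoritative):

import requests

# Sketch only: query the reference-risk model with extended_output enabled.
url = "https://api.wikimedia.org/service/lw/inference/v1/models/reference-risk:predict"
payload = {"rev_id": 1271268369, "lang": "en", "extended_output": True}

resp = requests.post(url, json=payload, timeout=30)
resp.raise_for_status()
print(resp.json())  # with extended_output, this should include the parsed URL domains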

  • Unexpected issue: The model is failing on multiple languages (listed in the chart below) that the model card says it supports. The model card also says that the model should not be used in "Evaluating a domain that has no previous usage history in the target language Wikipedia".

This is a config issue explained in T384172#10475450. Muniza has opened an MR for this.
UPDATE: Merged and deployed. The model supports all wikis now.

Reference-need

  • Receiving 500 errors ("AttributeError: 'Error' object has no attribute 'comment'")

This issue occurs because the model server didn't handle the 'Error' object returned from knowledge_integrity. (fix)
UPDATE: Merged and deployed. This kind of error now returns a proper error code and message.
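
The shape of the fix, as a sketch (all names here are stand-ins, not the actual inference-services code; the real change is the Gerrit patch linked below):

from dataclasses import dataclass

@dataclass
class Error:  # stand-in for the Error type knowledge_integrity returns
    code: str
    message: str

def to_response(result) -> tuple[int, dict]:
    # Detect an Error result and map it to an explicit error payload instead
    # of letting downstream code access prediction attributes (e.g. .comment)
    # that an Error object does not have.
    if isinstance(result, Error):
        return 400, {"error": result.code, "detail": result.message}
    return 200, {"prediction": result}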

  • Expected Issue: The model is running slow. We're seeing anywhere between 500 ms and 8 s. Out of 10 calls, we already see 4 above 2 seconds, and 1 even at 8 seconds.

According to the previous load testing results, the model is running with an average latency of ~400ms on the test set. I'll do additional load tests and report back on this.
UPDATE: T384172#10499785

Event Timeline

Looking at the supported_wikis property in the reference risk model, the supported languages are {'lt', 'ru', 'uk', 'fa', 'sv', 'vi', 'zh', 'fr', 'ne', 'tr', 'en', 'pt'}

Change #1112698 had a related patch set uploaded (by AikoChou; author: AikoChou):

[machinelearning/liftwing/inference-services@main] reference-quality: catch Error objects returned from knowledge_integrity

https://gerrit.wikimedia.org/r/1112698

> Looking at the supported_wikis property in the reference risk model, the supported languages are {'lt', 'ru', 'uk', 'fa', 'sv', 'vi', 'zh', 'fr', 'ne', 'tr', 'en', 'pt'}

Hi, this is because the default list of wikis that we generate reference-risk features for consists of all wikis that have a perennial sources list. The Airflow DAG for this pipeline currently does not override the default, but I suppose it should override it with all wikis in canonical_data.wikis to match the model card (sketched below). I'll open an MR for this shortly.
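
Roughly what that override amounts to, as a sketch (the canonical_data.wikis column name and the function names are assumptions; the actual change is in the MR):

from pyspark.sql import SparkSession

PERENNIAL_SOURCES_WIKIS = {"enwiki", "frwiki", "ruwiki"}  # illustrative subset

def target_wikis(use_all_wikis: bool = True) -> list[str]:
    # Wikis to generate reference-risk features for.
    if not use_all_wikis:
        # pipeline default: only wikis that maintain a perennial sources list
        return sorted(PERENNIAL_SOURCES_WIKIS)
    spark = SparkSession.builder.getOrCreate()
    rows = spark.sql("SELECT database_code FROM canonical_data.wikis").collect()
    return sorted(row.database_code for row in rows)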

Change #1112698 merged by jenkins-bot:

[machinelearning/liftwing/inference-services@main] reference-quality: catch Error objects returned from knowledge_integrity

https://gerrit.wikimedia.org/r/1112698

isarantopoulos set the point value for this task to 3.
isarantopoulos moved this task from Unsorted to In Progress on the Machine-Learning-Team board.

Change #1113742 had a related patch set uploaded (by AikoChou; author: AikoChou):

[operations/deployment-charts@master] ml-services: update reference-quality docker image

https://gerrit.wikimedia.org/r/1113742

Change #1113742 merged by jenkins-bot:

[operations/deployment-charts@master] ml-services: update reference-quality docker image

https://gerrit.wikimedia.org/r/1113742

I updated the DagProperties for this dag with the changes in this MR so that we can re-run the pipeline and generate features for all wikis (though checking in those changes makes sure that they stick around even if the airflow variable gets deleted). The November '24 metadata snapshot in swift (feature-sets/reference-risk/20241101/features.db) now contains features for all wikis listed here and this is reflected via the supported_wikis property on the ReferenceRiskModel:

>>> from knowledge_integrity.models.reference_risk.model import ReferenceRiskModel
>>> model = ReferenceRiskModel(domain_metadata_path="reference-risk/20241101/features.db")
>>> sorted(model.supported_wikis)
['ab', 'ace', 'ady', 'af', 'als', 'alt', 'am', 'ami', 'an', 'ang', 'anp', 'ar', 'arc', 'ary', 'arz', 'as', 'ast', 'atj', 'av', 'avk', 'awa', 'ay', 'az', 'azb', 'ba', 'ban', 'bar', 'bat_smg', 'bbc', 'bcl', 'be', 'be_x_old', 'bg', 'bh', 'bi', 'bjn', 'blk', 'bm', 'bn', 'bo', 'bpy', 'br', 'bs', 'bug', 'bxr', 'ca', 'cbk_zam', 'cdo', 'ce', 'ceb', 'ch', 'chr', 'chy', 'ckb', 'co', 'cr', 'crh', 'cs', 'csb', 'cu', 'cv', 'cy', 'da', 'dag', 'de', 'dga', 'din', 'diq', 'dsb', 'dty', 'dv', 'dz', 'ee', 'el', 'eml', 'en', 'eo', 'es', 'et', 'eu', 'ext', 'fa', 'fat', 'ff', 'fi', 'fiu_vro', 'fj', 'fo', 'fon', 'fr', 'frp', 'frr', 'fur', 'fy', 'ga', 'gag', 'gan', 'gcr', 'gd', 'gl', 'glk', 'gn', 'gom', 'gor', 'got', 'gpe', 'gu', 'guc', 'gur', 'guw', 'gv', 'ha', 'hak', 'haw', 'he', 'hi', 'hif', 'hr', 'hsb', 'ht', 'hu', 'hy', 'hyw', 'ia', 'id', 'ie', 'ig', 'ik', 'ilo', 'inh', 'io', 'is', 'it', 'iu', 'ja', 'jam', 'jbo', 'jv', 'ka', 'kaa', 'kab', 'kbd', 'kbp', 'kcg', 'kg', 'ki', 'kk', 'kl', 'km', 'kn', 'ko', 'koi', 'krc', 'ks', 'ksh', 'ku', 'kv', 'kw', 'ky', 'la', 'lad', 'lb', 'lbe', 'lez', 'lfn', 'lg', 'li', 'lij', 'lld', 'lmo', 'ln', 'lo', 'lt', 'ltg', 'lv', 'mad', 'mai', 'map_bms', 'mdf', 'mg', 'mhr', 'mi', 'min', 'mk', 'ml', 'mn', 'mni', 'mnw', 'mr', 'mrj', 'ms', 'mt', 'mwl', 'my', 'myv', 'mzn', 'nah', 'nap', 'nds', 'nds_nl', 'ne', 'new', 'nia', 'nl', 'nn', 'no', 'nov', 'nqo', 'nrm', 'nso', 'nv', 'ny', 'oc', 'olo', 'om', 'or', 'os', 'pa', 'pag', 'pam', 'pap', 'pcd', 'pcm', 'pdc', 'pfl', 'pi', 'pih', 'pl', 'pms', 'pnb', 'pnt', 'ps', 'pt', 'pwn', 'qu', 'rm', 'rmy', 'rn', 'ro', 'roa_rup', 'roa_tara', 'ru', 'rue', 'rw', 'sa', 'sah', 'sat', 'sc', 'scn', 'sco', 'sd', 'se', 'sg', 'sh', 'shi', 'shn', 'si', 'simple', 'sk', 'skr', 'sl', 'sm', 'smn', 'sn', 'so', 'sq', 'sr', 'srn', 'ss', 'st', 'stq', 'su', 'sv', 'sw', 'szl', 'szy', 'ta', 'tay', 'tcy', 'te', 'tet', 'tg', 'th', 'ti', 'tk', 'tl', 'tly', 'tn', 'to', 'tpi', 'tr', 'trv', 'ts', 'tt', 'tum', 'tw', 'ty', 'tyv', 'udm', 'ug', 'uk', 'ur', 'uz', 've', 'vec', 'vep', 'vi', 'vls', 'vo', 'wa', 'war', 'wo', 'wuu', 'xal', 'xh', 'xmf', 'yi', 'yo', 'za', 'zea', 'zgh', 'zh', 'zh_classical', 'zh_min_nan', 'zh_yue', 'zu']

@MunizaA thanks for taking care of this. :)

To download the features.db from Research's swift, I used this URL:

curl 'https://thanos-swift.discovery.wmnet/v1/AUTH_research/feature-sets/reference-risk/20241101/features.db' --output features.db

Then I uploaded the new features.db to our swift:

$ s3cmd -c /etc/s3cmd/cfg.d/ml-team.cfg ls s3://wmf-ml-models/reference-quality/20250127142109/reference-risk/
2025-01-27 14:27   2163601408  s3://wmf-ml-models/reference-quality/20250127142109/reference-risk/features.db
2025-01-27 14:30          142  s3://wmf-ml-models/reference-quality/20250127142109/reference-risk/features.sha512

Since we also serve the reference need model under the same model server, I copied the model file to the same bucket:

$ s3cmd -c /etc/s3cmd/cfg.d/ml-team.cfg ls s3://wmf-ml-models/reference-quality/20250127142109/reference-need/
2025-01-27 14:35    368340605  s3://wmf-ml-models/reference-quality/20250127142109/reference-need/model.pkl
2025-01-27 14:35          140  s3://wmf-ml-models/reference-quality/20250127142109/reference-need/model.pkl.sha512

Change #1114401 had a related patch set uploaded (by AikoChou; author: AikoChou):

[operations/deployment-charts@master] ml-services: update reference-quality storage uri

https://gerrit.wikimedia.org/r/1114401

Change #1114401 merged by jenkins-bot:

[operations/deployment-charts@master] ml-services: update reference-quality storage uri

https://gerrit.wikimedia.org/r/1114401

> Expected Issue: The model is running slow.

The prediction time for the Reference-need model is proportional to page size, or more precisely, to the number of uncited sentences. The table below shows this relationship between page_size, reference_count, and prediction_time for the random test samples:

| revision_id | title                                  | page_size (bytes) | predict_ms    | reference_count |
|-------------|----------------------------------------|-------------------|---------------|-----------------|
| 1271268369  | 119th United States Congress           | 221728            | 7703.53341102 | 643             |
| 1271268365  | Etiquette in Japan                     | 82850             | 2723.50287437 | 431             |
| 1271268393  | Ze'ev Revach                           | 62362             | 1303.84945869 | 423             |
| 1271268364  | On the Wings of Love (TV series)       | 64776             | 779.50334549  | 28              |
| 1271268388  | 2018 Winter Olympics opening ceremony  | 76212             | 734.58862304  | 757             |
| 1271268359  | Environmental epidemiology             | 34556             | 474.01976585  | 47              |
| 1271268372  | Yunus Khan (politician)                | 6576              | 198.66728782  | 72              |
| 1271268382  | Double Decker City Bus Terminus, Salem | 7772              | 64.45145607   | 2               |

I also ran load tests using different test sets from recent changes in English Wikipedia. The results showed an average latency of ~900ms. (P72579)
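
For reference, a minimal latency probe in this spirit could look like the following (not the actual load-test harness; the model name and payload fields are assumptions):

import statistics
import time
import requests

URL = "https://api.wikimedia.org/service/lw/inference/v1/models/reference-need:predict"

def mean_latency_ms(rev_ids, lang="en"):
    # Time one predict call per revision and average the wall-clock latency.
    timings = []
    for rev_id in rev_ids:
        start = time.monotonic()
        requests.post(URL, json={"rev_id": rev_id, "lang": lang}, timeout=60)
        timings.append((time.monotonic() - start) * 1000)  # ms
    return statistics.mean(timings)

print(f"avg latency: {mean_latency_ms([1271268369, 1271268365]):.0f} ms")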

Thanks @AikoChou. @Aitolkyn's testing showed that "75% of revisions in each language are completed within a 500ms time limit."

You can see her full report here - https://docs.google.com/document/d/1EJbSJ7fekZvor8F-FiPVl7EGlTkckKtSIGOMXb1K2FM/edit?tab=t.0 - which states that there is a relationship between latency and pageviews as well. Most failures come from articles in the top quartile of pageviews, so given the length of the long tail, the average leans toward the head of the distribution.

Without doing your own analysis on pageview correlation (zero expectation you would), am I reading these two results correctly?

Is it correct that any discrepancy between your analysis and Aitolkyn's could be due to a difference in running the tests in a staging v prod environment?

Given that the latency comes from the model itself, is there anything you can do to shave down the time?

> Without doing your own analysis on pageview correlation (zero expectation you would), am I reading these two results correctly?

The model processing time is mainly related to article length and the number of uncited sentences, due to the model's design. While there may be some correlation with pageviews, it's not a direct factor. Fig. 4 in Aitolkyn's report also shows that the correlation between time and uncited sentences (n_sent_uncited) is 0.9 and between time and page size (which roughly tracks total sentence count, n_sent_tot) is 0.88, while the correlation with pageviews is only 0.28.
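
As an illustration, the same check over an export of the test samples would be (the file name and exact column names are assumptions that mirror the report's variables):

import pandas as pd

# Correlation of per-revision latency with the candidate explanatory
# variables; latency_samples.csv is a hypothetical export of the test data.
df = pd.read_csv("latency_samples.csv")
print(df[["time", "n_sent_uncited", "n_sent_tot", "pageviews"]].corr()["time"])
# per the report: n_sent_uncited ~0.9, n_sent_tot ~0.88, pageviews ~0.28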

> Is it correct that any discrepancy between your analysis and Aitolkyn's could be due to a difference in running the tests in a staging v prod environment?

The difference stems from the environments: statbox vs. production. The statbox hosts where Aitolkyn's testing ran have 32-72 CPU cores depending on the host used (stat1008-11), while in Lift Wing's k8s microservices environment we currently assign 6 CPUs to the model service.

> Given that the latency comes from the model itself, is there anything you can do to shave down the time?

We can either increase CPU resources for the model service or use batch processing for model inference (though we first need to investigate whether the latter is feasible).

Thank you for the update. Let me know how I can be helpful for the CPU resources investigation.

Change #1117585 had a related patch set uploaded (by AikoChou; author: AikoChou):

[operations/deployment-charts@master] ml-services: increase cpu and memory for reference-quality

https://gerrit.wikimedia.org/r/1117585

After testing different resource configurations for the model service in the experimental namespace, I found the optimal setup was increasing CPU from 6 to 12 and memory from 4G to 6G. This reduced the average latency from 876ms to 680ms (P73260) on the recent changes dataset. Further increasing CPU to 16 did not yield additional latency improvements.

@FNavas-foundation I'll deploy these changes to production once the patch is merged. Could you then check if you see any latency improvements on your end and let me know whether the response times are acceptable?

Change #1117585 merged by jenkins-bot:

[operations/deployment-charts@master] ml-services: increase cpu and memory for reference-quality

https://gerrit.wikimedia.org/r/1117585

Change #1117873 had a related patch set uploaded (by AikoChou; author: AikoChou):

[operations/deployment-charts@master] admin_ng: bump limitranges for ml-serve's revision-models namespace

https://gerrit.wikimedia.org/r/1117873

Change #1117873 merged by jenkins-bot:

[operations/deployment-charts@master] admin_ng: bump limitranges for ml-serve's revision-models namespace

https://gerrit.wikimedia.org/r/1117873