Page MenuHomePhabricator

[SPIKE] Get changeprop integration tests passing locally on node 18/20
Closed, ResolvedPublic5 Estimated Story Points

Description

Description

We need to learn more about how to upgrade node in a service that we own but seldom do work in. This is a research task to determine what needs to happen.

    1. Conditions of acceptance
  • Collect examples of similar implementations (Andrew Otto did this recently). Mainly https://wikitech.wikimedia.org/wiki/Changeprop/Memorandum_2023-11 and T348950.
  • Set up environment to run locally.
  • Get unit tests passing locally (node 18).
  • Get integration tests passing locally (node 18).
  • Get unit tests passing locally (node 20).
  • Get integration tests passing locally (node 20).

Quick docker testing reference:

### Contents of Dockerfile.node20 file placed under change-propagation repo ###
FROM node:20-buster
WORKDIR /app
 
COPY . .
 
ENV KAFKA_VERSION="1.1.1"
ENV KAFKA_HOME="/kafka"
 
RUN apt-get update && apt-get install -y wget netcat-traditional default-jre-headless redis
RUN npm install
RUN npm run install-kafka
 
EXPOSE 2181
EXPOSE 9092
EXPOSE 6379
 
CMD [ "bash" ]
### End docker file contents ###
 
 
### In change-propagation directory ###
docker build -t changeprop -f "$PWD/Dockerfile.node20" .
 
docker run --rm -it changeprop bash
 
### In running container
npm run test
npm run start-kafka && redis-server --daemonize yes
npm run integration

Event Timeline

HCoplin-WMF set the point value for this task to 5.Jan 9 2025, 4:35 PM

After fixing some createConsumer() calls in static_rules.js, I'm down to 2 failures. There is also some "Local: queue full" logging output, which perhaps can be ignored since it's an ephemeral state that is expected with kafka from time to time and is solved by retries. These look like:

{"name":"changeprop","hostname":"32d1baa99ae0","pid":40251,"level":60,"message":"Internal error in changeprop","description":"Error: Local: Queue full","stack":"Error: Local: Queue full\n    at Function.createLibrdkafkaError [as create] (/app/node_modules/node-rdkafka/lib/error.js:456:10)\n    at Client._errorWrap (/app/node_modules/node-rdkafka/lib/client.js:481:29)\n
...
"event_str":"{\"$schema\":\"mediawiki/revision/create/1.0.0\",\"meta\":{\"stream\":\"mediawiki.revision-create\",\"uri\":\"https://en.wikipedia.org/api/rest_v1/page/html/Main_Page\",\"request_id\":\"ce6ec9a0-e2c0-11ef-95f8-01a354933c44\",\"id\":\"3f4b7510-e2c1-11ef-95f8-01a354933c44\",\"dt\":\"2025-02-04T06:28:29.025Z\",\"domain\":\"en.wikipedia.org\"},\"database\":\"enwiki\",\"page_title\":\"Main_Page\",\"page_id\":12345,\"page_namespace\":0,\"performer\":{\"user_text\":\"I am a user\",\"user_groups\":[\"I am a group\"],\"user_is_bot\":false},\"page_is_redirect\":false,\"rev_id\":1234,\"rev_timestamp\":\"2025-02-04T06:28:29.025Z\",\"rev_parent_id\":1233,\"rev_content_changed\":true}

They don't seem to be for the events of the two failing tests.

Anyway, "npm run integration" for changeprop gives me:

JobQueue rules
    ✓ Should propagate updateBetaFeaturesUserCounts job (1509ms)
    1) Should propagate cdnPurge job
    ✓ Should support partitioned refreshLinks (2003ms)
    ✓ Should deduplicate based on ID (2003ms)
    ✓ Should deduplicate based on SHA1 (4003ms)
    ✓ Should deduplicate based on SHA1 and root job combination (4004ms)
    ✓ Should deduplicate base on root job (4004ms)
    2) Should support delayed jobs with re-enqueue

  Rule
    ✓ topic required
    ✓ no-op rule
    ✓ simple rule - one request
    ✓ simple rule - multiple requests
    Matching
      ✓ all
      ✓ simple value match
      ✓ simple value mismatch
      ✓ regex match
      ✓ regex match with undefined
      ✓ regex mismatch
      ✓ array match
      ✓ malformed match
      ✓ match_not
      ✓ match_not array
      ✓ matches match and match_not
      ✓ matches match but not match_not
      ✓ matches match_not but not match
      ✓ matches match but is canary event and should_discard_canary_events is true
      ✓ matches match and is canary event and should_discard_canary_events is false
      ✓ expansion
      ✓ expansion with named groups
      ✓ checks for named and unnamed groups mixing

  Sampler
    ✓ Should accept the correct number of values (65ms)

  Basic rule management
    ✓ Should call simple executor (1507ms)
    ✓ Should retry simple executor (504ms)
    ✓ Should retry simple executor no more than limit (2003ms)
debug/consumer [Function (anonymous)]
    ✓ Should emit valid retry message (2856ms)
    ✓ Should not retry if retry_on not matched (2002ms)
    ✓ Should not follow redirects (2002ms)
    ✓ Should not crash with unparsable JSON (503ms)
    ✓ Should support producing to topics on exec (2002ms)
debug/consumer [Function (anonymous)]
    ✓ Should emit valid messages to error topic (2423ms)
    ✓ Sampling should only propagate a stable subset (2003ms)
    ✓ Should support array topics (502ms)
    ✓ Should support exclude_topics stanza (2002ms)

  update rules
    ✓ Should update summary endpoint (1502ms)
    ✓ Should update summary endpoint, transcludes topic (1003ms)
    ✓ Should update summary endpoint on page images change (1001ms)
    ✓ Should not update summary for a blacklisted title (2001ms)
    ✓ Should update definition endpoint (501ms)
    ✓ Should not react to revision change event from restbase for definition endpoint (2001ms)
    ✓ Should update mobile apps endpoint (501ms)
    ✓ Should not update definition endpoint for non-main namespace (2001ms)
    ✓ Should update RESTBase on resource_change from MW (501ms)
    ✓ Should update RESTBase on revision create (1002ms)
    ✓ Should not update RESTBase on revision create for a blacklisted title (2001ms)
    ✓ Should not update RESTBase on revision create for wikidata (2001ms)
    ✓ Should update RESTBase on page delete (1504ms)
    ✓ Should update RESTBase on page undelete (500ms)
    ✓ Should update RESTBase on page move (1000ms)
    ✓ Should update RESTBase on revision visibility change (1001ms)
    ✓ Should update ORES on revision-create (4107ms)
    ✓ Should update RESTBase summary and mobile-sections on wikidata description change (3003ms)
    ✓ Should update RESTBase summary and mobile-sections on wikidata description revert (3001ms)
    ✓ Should update RESTBase summary and mobile-sections on wikidata undelete (3001ms)
    ✓ Should not ask Wikidata for info for non-main namespace titles (5002ms)
    ✓ Should not crash if wikidata description can not be found (3001ms)
    ✓ Should rerender image usages on file update (505ms)
    ✓ Should rerender transclusions on page update (504ms)
    ✓ Should process backlinks, on create (1004ms)
    ✓ Should process backlinks, on delete (505ms)
    ✓ Should process backlinks, on undelete (503ms)
    ✓ Should purge caches on resource_change coming from RESTBase
    ✓ Should purge caches on resource_change coming from Tilerator


  71 passing (4m)
  2 failing

  1) JobQueue rules
       Should propagate cdnPurge job:

      AssertionError [ERR_ASSERTION]: Mocks not yet satisfied:
POST http://jobrunner.wikipedia.org:80/wiki/Special%3ARunSingleJob
      + expected - actual

      -false
      +true
      
      at Scope.done (node_modules/nock/lib/scope.js:163:12)
      at check (test/utils/common.js:98:24)
      at tryCatcher (node_modules/bluebird/js/release/util.js:16:23)
      at Promise._settlePromiseFromHandler (node_modules/bluebird/js/release/promise.js:547:31)
      at Promise._settlePromise (node_modules/bluebird/js/release/promise.js:604:18)
      at Promise._settlePromise0 (node_modules/bluebird/js/release/promise.js:649:10)
      at Promise._settlePromises (node_modules/bluebird/js/release/promise.js:729:18)
      at Promise._fulfill (node_modules/bluebird/js/release/promise.js:673:18)
      at Timeout._onTimeout (node_modules/bluebird/js/release/timers.js:26:46)
      at listOnTimeout (node:internal/timers:569:17)
      at process.processTimers (node:internal/timers:512:7)

  2) JobQueue rules
       Should support delayed jobs with re-enqueue:

      AssertionError [ERR_ASSERTION]: Mocks not yet satisfied:
POST http://jobrunner.wikipedia.org:80/wiki/Special%3ARunSingleJob
      + expected - actual

      -false
      +true
      
      at Scope.done (node_modules/nock/lib/scope.js:163:12)
      at check (test/utils/common.js:98:24)
      at tryCatcher (node_modules/bluebird/js/release/util.js:16:23)
      at Promise._settlePromiseFromHandler (node_modules/bluebird/js/release/promise.js:547:31)
      at Promise._settlePromise (node_modules/bluebird/js/release/promise.js:604:18)
      at Promise._settlePromise0 (node_modules/bluebird/js/release/promise.js:649:10)
      at Promise._settlePromises (node_modules/bluebird/js/release/promise.js:729:18)
      at Promise._fulfill (node_modules/bluebird/js/release/promise.js:673:18)
      at Timeout._onTimeout (node_modules/bluebird/js/release/timers.js:26:46)
      at listOnTimeout (node:internal/timers:569:17)
      at process.processTimers (node:internal/timers:512:7)

The server logs under /kafka/logs/ showed some zookeeper-based warnings about cdnPurge topics being auto-created. I noticed that they where missing from utils/test_topics. I also noticed some warnings about the meta.properties file missing and some connection losses. Anyway, updating the test_topic list fixes the 2 failures I had above (both about cdnPurge jobs, one being a non-delayed job the other being a delayed job). That leaves me with:

72 passing (3m)
  1 failing

  1) update rules
       Should update RESTBase summary and mobile-sections on wikidata description change:

      AssertionError [ERR_ASSERTION]: Mocks not yet satisfied:
GET https://ru.wikipedia.org:443/api/rest_v1/page/summary/%D0%9F%D1%91%D1%82%D1%80
GET https://ru.wikipedia.org:443/api/rest_v1/page/mobile-sections/%D0%9F%D1%91%D1%82%D1%80
      + expected - actual

      -false
      +true
      
      at Scope.done (node_modules/nock/lib/scope.js:163:12)
      at check (test/utils/common.js:98:24)
      at tryCatcher (node_modules/bluebird/js/release/util.js:16:23)
      at Promise._settlePromiseFromHandler (node_modules/bluebird/js/release/promise.js:547:31)
      at Promise._settlePromise (node_modules/bluebird/js/release/promise.js:604:18)
      at Promise._settlePromise0 (node_modules/bluebird/js/release/promise.js:649:10)
      at Promise._settlePromises (node_modules/bluebird/js/release/promise.js:729:18)
      at Promise._fulfill (node_modules/bluebird/js/release/promise.js:673:18)
      at Timeout._onTimeout (node_modules/bluebird/js/release/timers.js:26:46)
      at listOnTimeout (node:internal/timers:569:17)
      at process.processTimers (node:internal/timers:512:7)

Probably related to the "queue full" console log entry that mentions 'wbeditentity-update:0':

changeprop","description":"Error: Local: Queue full","stack":"Error: Local: Queue full\n    at Function.createLibrdkafkaError [as create] (/app/node_modules/node-rdkafka/lib/error.js:456:10)\n    at Client._errorWrap (/app/node_modules/node-rdkafka/lib/client.js:481:29)\n    at Producer.produce (/app/node_modules/node-rdkafka/lib/producer.js:138:15)\n    at /app/lib/kafka_factory.js:61:38\n    at Promise._execute (/app/node_modules/bluebird/js/release/debuggability.js:384:9)\n    at Promise._resolveFromExecutor (/app/node_modules/bluebird/js/release/promise.js:518:18)\n    at new Promise (/app/node_modules/bluebird/js/release/promise.js:103:10)\n    at GuaranteedProducer.produce (/app/lib/kafka_factory.js:55:16)\n    at /app/sys/kafka.js:91:34\n    at Array.map (<anonymous>)\n    at Kafka.produce (/app/sys/kafka.js:78:31)\n    at tryCatcher (/app/node_modules/bluebird/js/release/util.js:16:23)\n    at /app/node_modules/bluebird/js/release/method.js:15:34\n    at handlerWrapper (/app/node_modules/hyperswitch/lib/hyperswitch.js:422:37)\n    at next (/app/node_modules/hyperswitch/lib/hyperswitch.js:408:42)\n    at module.exports [as filter] (/app/node_modules/hyperswitch/lib/filters/validator.js:272:12)\n    at handlerWrapper (/app/node_modules/hyperswitch/lib/hyperswitch.js:420:27)\n    at next (/app/node_modules/hyperswitch/lib/hyperswitch.js:408:42)\n    at module.exports [as filter] (/app/node_modules/hyperswitch/lib/filters/metrics.js:16:12)\n    at handlerWrapper (/app/node_modules/hyperswitch/lib/hyperswitch.js:420:27)\n    at /app/node_modules/hyperswitch/lib/hyperswitch.js:426:28\n    at /app/node_modules/hyperswitch/lib/hyperswitch.js:229:24\n    at tryCatcher (/app/node_modules/bluebird/js/release/util.js:16:23)\n    at Promise._settlePromiseFromHandler (/app/node_modules/bluebird/js/release/promise.js:547:31)\n    at Promise._settlePromise (/app/node_modules/bluebird/js/release/promise.js:604:18)\n    at Promise._settlePromiseCtx (/app/node_modules/bluebird/js/release/promise.js:641:10)\n    at _drainQueueStep (/app/node_modules/bluebird/js/release/async.js:97:12)\n    at _drainQueue (/app/node_modules/bluebird/js/release/async.js:86:9)\n    at Async._drainQueues (/app/node_modules/bluebird/js/release/async.js:102:5)\n    at Async.drainQueues (/app/node_modules/bluebird/js/release/async.js:15:14)\n    at process.processImmediate (node:internal/timers:476:21)","event_str":"{\"$schema\":\"mediawiki/revision/create/1.0.0\",\"meta\":{\"stream\":\"mediawiki.revision-create\",\"uri\":\"https://www.wikidata.org/wiki/Q1\",\"request_id\":\"86189090-e2cb-11ef-88fe-afe48b6c80b9\",\"id\":\"ea465110-e2cb-11ef-88fe-afe48b6c80b9\",\"dt\":\"2025-02-04T07:44:50.849Z\",\"domain\":\"www.wikidata.org\"},\"database\":\"enwiki\",\"page_title\":\"Q1\",\"page_id\":12345,\"page_namespace\":0,\"performer\":{\"user_text\":\"I am a user\",\"user_groups\":[\"I am a group\"],\"user_is_bot\":false},\"page_is_redirect\":false,\"rev_id\":1234,\"rev_timestamp\":\"2025-02-04T07:44:50.849Z\",\"rev_parent_id\":1233,\"rev_content_changed\":true,\"comment\":\"/* wbeditentity-update:0| */ add [it] label\"}","stream":"mediawiki.revision-create","rule_name":"wikidata_description_on_edit","executor":"RuleExecutor","levelPath":"fatal/internal_error","msg":"Internal error in changeprop","time":"2025-02-04T07:44:50.879Z","v":0}

If I run that test by itself, things pass. E.g. "npm run integration -- --g "wikidata description change" gives:

update rules
    ✓ Should update RESTBase summary and mobile-sections on wikidata description change (3006ms)


  1 passing (35s)

So there seems to be some issue with the local kafka getting very easily overwhelmed.

Change #1117267 had a related patch set uploaded (by Aaron Schulz; author: Aaron Schulz):

[mediawiki/services/change-propagation@master] Avoid failures due to passing undefined as the logger to createConsumer()

https://gerrit.wikimedia.org/r/1117267

Change #1117268 had a related patch set uploaded (by Aaron Schulz; author: Aaron Schulz):

[mediawiki/services/change-propagation@master] Add missing kafka topics to test_topics

https://gerrit.wikimedia.org/r/1117268

Change #1117270 had a related patch set uploaded (by Aaron Schulz; author: Aaron Schulz):

[mediawiki/services/change-propagation@master] Increase common.REQUEST_CHECK_DELAY to avoid wikibase description test failures

https://gerrit.wikimedia.org/r/1117270

With the above changes, and running my laptop in performance mode, I can get everything passing.

73 passing (3m)

Change #1117267 merged by jenkins-bot:

[mediawiki/services/change-propagation@master] Avoid failures due to passing undefined as the logger to createConsumer()

https://gerrit.wikimedia.org/r/1117267

Change #1117268 merged by jenkins-bot:

[mediawiki/services/change-propagation@master] Add missing kafka topics to test_topics

https://gerrit.wikimedia.org/r/1117268

Change #1117270 merged by BPirkle:

[mediawiki/services/change-propagation@master] Increase common.REQUEST_CHECK_DELAY to avoid wikibase description test failures

https://gerrit.wikimedia.org/r/1117270

With changes merged, and using Docker on a Windows computer (for whatever reasons, integration tests fail for me with Docker+Mac), I now have unit and integration tests passing locally on both node18 and node20.

aaron renamed this task from [SPIKE] Upgrade changeprop to node20 to [SPIKE] Get changeprop integration tests passing locally on node 18/20.Feb 11 2025, 3:42 PM
aaron closed this task as Resolved.