puppetdb queue size went up since July 30
Closed, Resolved (Public)

Description

PuppetDB queue size has gone up quite a bit since July 30 (and even more so since August 1):
https://grafana.wikimedia.org/d/000000477/puppetdb?panelId=19&fullscreen&orgId=1&from=1564236470272&to=1565167416489

This might be related to the new Puppet 5 canaries, as more canary hosts were pointed to puppetmaster1003 around that time.

Event Timeline

Restricted Application added a subscriber: Aklapper. Aug 7 2019, 9:31 AM
jbond added a comment. Edited Aug 7 2019, 10:11 AM

It seems that the servers using the new puppet master cause the following stack trace when they try to reach the 'store report' phase:

upstream bug

2019-08-07 10:04:55,104 ERROR [p.p.threadpool] Error processing command on thread cmd-proc-thread-19692
clojure.lang.ExceptionInfo: Value does not match schema: {:job_id disallowed-key}
        at schema.core$validator$fn__9031.invoke(core.clj:155)
        at schema.core$validate.invokeStatic(core.clj:164)
        at schema.core$validate.invoke(core.clj:159)
        at puppetlabs.puppetdb.command$store_report$fn__39288.invoke(command.clj:355)
        at puppetlabs.puppetdb.command$store_report.invokeStatic(command.clj:354)
        at puppetlabs.puppetdb.command$store_report.invoke(command.clj:353)
        at puppetlabs.puppetdb.command$process_command_BANG_.invokeStatic(command.clj:389)
        at puppetlabs.puppetdb.command$process_command_BANG_.invoke(command.clj:380)
        at puppetlabs.puppetdb.command$process_command_and_respond_BANG_$fn__39396.invoke(command.clj:442)
        at puppetlabs.puppetdb.command$call_with_quick_retry$fn__39389.invoke(command.clj:424)
        at puppetlabs.puppetdb.command$call_with_quick_retry.invokeStatic(command.clj:423)
        at puppetlabs.puppetdb.command$call_with_quick_retry.invoke(command.clj:421)
        at puppetlabs.puppetdb.command$process_command_and_respond_BANG_.invokeStatic(command.clj:440)
        at puppetlabs.puppetdb.command$process_command_and_respond_BANG_.invoke(command.clj:438)
        at puppetlabs.puppetdb.command$process_cmdref$fn__39406.invoke(command.clj:505)
        at puppetlabs.puppetdb.utils.metrics$multitime_BANG__STAR_$fn__37184$fn__37185$fn__37186.invoke(metrics.clj:14)
        at puppetlabs.puppetdb.utils.metrics.proxy$java.lang.Object$Callable$7da976d4.call(Unknown Source)
        at com.codahale.metrics.Timer.time(Timer.java:101)
        at puppetlabs.puppetdb.utils.metrics$multitime_BANG__STAR_$fn__37184$fn__37185.invoke(metrics.clj:14)
        at puppetlabs.puppetdb.utils.metrics$multitime_BANG__STAR_$fn__37184$fn__37185$fn__37186.invoke(metrics.clj:14)
        at puppetlabs.puppetdb.utils.metrics.proxy$java.lang.Object$Callable$7da976d4.call(Unknown Source)
        at com.codahale.metrics.Timer.time(Timer.java:101)
        at puppetlabs.puppetdb.utils.metrics$multitime_BANG__STAR_$fn__37184$fn__37185.invoke(metrics.clj:14)
        at puppetlabs.puppetdb.utils.metrics$multitime_BANG__STAR_.invokeStatic(metrics.clj:17)
        at puppetlabs.puppetdb.utils.metrics$multitime_BANG__STAR_.invoke(metrics.clj:6)
        at puppetlabs.puppetdb.command$process_cmdref.invokeStatic(command.clj:501)
        at puppetlabs.puppetdb.command$process_cmdref.invoke(command.clj:480)
        at puppetlabs.puppetdb.command$message_handler$fn__39414.invoke(command.clj:551)
        at puppetlabs.puppetdb.threadpool$dochan$fn__39167$fn__39168.invoke(threadpool.clj:117)
        at puppetlabs.puppetdb.threadpool$call_on_threadpool$fn__39162.invoke(threadpool.clj:95)
        at clojure.lang.AFn.run(AFn.java:22)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
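
The `{:job_id disallowed-key}` message in the trace above suggests that the report schema is closed: any top-level key it does not list is rejected. A minimal Python sketch of that kind of check follows. It is not PuppetDB's Clojure code, and the allowed key names are illustrative assumptions; only `job_id` comes from the log.

```python
# Hypothetical subset of allowed keys; the real PuppetDB report schema
# lists many more fields. Only "job_id" is taken from the log above.
ALLOWED_REPORT_KEYS = {"certname", "puppet_version", "status"}


def disallowed_keys(report: dict) -> dict:
    """Map every unexpected top-level key to the marker 'disallowed-key'."""
    return {key: "disallowed-key" for key in report if key not in ALLOWED_REPORT_KEYS}


def validate_report(report: dict) -> dict:
    """Return the report unchanged, or raise if it carries unknown keys."""
    errors = disallowed_keys(report)
    if errors:
        raise ValueError(f"Value does not match schema: {errors}")
    return report
```

Under this model, a report sent by a newer puppetmaster that includes `job_id` fails validation every time, which would be consistent with commands piling up in the queue.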

Change 528744 had a related patch set uploaded (by Jbond; owner: John Bond):
[operations/puppet@production] puppetmaster1003: offline this puppetmaster as its scheme is incompatible

https://gerrit.wikimedia.org/r/528744

Change 528744 merged by Jbond:
[operations/puppet@production] puppetmaster1003: offline this puppetmaster as its scheme is incompatible

https://gerrit.wikimedia.org/r/528744

jbond added a comment. Aug 7 2019, 10:35 AM

I have disabled puppetmaster1003 for now. Unfortunately, from reading PUP-8901, it seems the advice from puppetlabs is to always upgrade all puppetmasters and the puppetdb in tandem.

Change 528906 had a related patch set uploaded (by Jbond; owner: John Bond):
[operations/puppet@production] puppetmaster upgrade: add a lua filter to remove the job_id

https://gerrit.wikimedia.org/r/528906

colewhite triaged this task as Normal priority. Aug 7 2019, 11:26 PM

Change 528906 merged by Jbond:
[operations/puppet@production] puppetmaster upgrade: add a lua filter to remove the job_id

https://gerrit.wikimedia.org/r/528906

Change 529064 had a related patch set uploaded (by Jbond; owner: John Bond):
[operations/puppet@production] puppetdb: enable lua filter to munge puppet reports

https://gerrit.wikimedia.org/r/529064

Change 529064 merged by Jbond:
[operations/puppet@production] puppetdb: enable lua filter to munge puppet reports

https://gerrit.wikimedia.org/r/529064

jbond added a comment. Aug 8 2019, 4:00 PM

The Lua hack seems to have worked. I have again updated the config to send some canary servers to puppetmaster1003. So far I have seen no errors in the puppetdb log. Further, via puppetdb I can see that the puppet reports and facts are all refreshing.
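
For readers unfamiliar with the approach, the idea behind the filter can be sketched as follows: drop the `job_id` key from the report body before it reaches PuppetDB. This is a Python illustration only, not the Lua filter merged in change 528906. The function name and the assumption that the report is a JSON object with `job_id` at the top level are hypothetical.

```python
import json


def strip_job_id(body: bytes) -> bytes:
    """Remove a top-level 'job_id' key from a JSON report payload, if present.

    Assumes the body is a JSON document; anything that is not a JSON
    object is re-serialised without modification.
    """
    report = json.loads(body)
    if isinstance(report, dict):
        # pop() with a default leaves reports without the key untouched.
        report.pop("job_id", None)
    return json.dumps(report).encode("utf-8")
```

A report from an older puppetmaster, which has no `job_id`, passes through with its content unchanged, so a filter like this can sit in front of PuppetDB for all masters.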

Volans closed this task as Resolved. Aug 12 2019, 10:28 AM
Volans claimed this task.

As the queue on Grafana has gone back to zero too, I'll resolve it for now. Thanks a lot for the fix, @jbond.