
Install / configure new WDQS servers
Closed, Resolved (Public)

Event Timeline

Gehel triaged this task as High priority. Feb 20 2018, 9:50 AM
Gehel created this task.
Restricted Application added a subscriber: Aklapper.

Change 415872 had a related patch set uploaded (by Gehel; owner: Gehel):
[operations/puppet@production] [WIP] wdqs: configure the new internal cluster

https://gerrit.wikimedia.org/r/415872

Change 416921 had a related patch set uploaded (by Gehel; owner: Gehel):
[operations/puppet@production] wdqs: enable LDF server is now configurable

https://gerrit.wikimedia.org/r/416921

Change 416921 merged by Gehel:
[operations/puppet@production] wdqs: enable LDF server is now configurable

https://gerrit.wikimedia.org/r/416921

Change 416961 had a related patch set uploaded (by Gehel; owner: Gehel):
[operations/puppet@production] wdqs: configure new servers wdqs200[4-6]

https://gerrit.wikimedia.org/r/416961

Change 417202 had a related patch set uploaded (by Gehel; owner: Gehel):
[operations/puppet@production] wdqs: use the raid10-gpt-srv-lvm-ext4 partman config for new wdqs nodes

https://gerrit.wikimedia.org/r/417202

Change 415872 merged by Gehel:
[operations/puppet@production] wdqs: configure the new internal cluster

https://gerrit.wikimedia.org/r/415872

Change 417202 merged by Gehel:
[operations/puppet@production] wdqs: use the raid10-gpt-srv-lvm-ext4 partman config for new wdqs nodes

https://gerrit.wikimedia.org/r/417202

Script wmf-auto-reimage was launched by gehel on neodymium.eqiad.wmnet for hosts:

['wdqs2004.codfw.wmnet', 'wdqs2005.codfw.wmnet', 'wdqs2006.codfw.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201803090941_gehel_15241.log.

Change 416961 merged by Gehel:
[operations/puppet@production] wdqs: configure new servers wdqs200[4-6]

https://gerrit.wikimedia.org/r/416961

Change 417782 had a related patch set uploaded (by Gehel; owner: Gehel):
[operations/puppet@production] wdqs: comment out wdqs_internal nodes from eqiad

https://gerrit.wikimedia.org/r/417782

Change 417782 merged by Gehel:
[operations/puppet@production] wdqs: comment out wdqs_internal nodes from eqiad

https://gerrit.wikimedia.org/r/417782

Completed auto-reimage of hosts:

['wdqs2006.codfw.wmnet']

Of which those FAILED:

['wdqs2006.codfw.wmnet']

Script wmf-auto-reimage was launched by gehel on neodymium.eqiad.wmnet for hosts:

['wdqs2006.codfw.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201803091054_gehel_31807.log.

Completed auto-reimage of hosts:

['wdqs2006.codfw.wmnet']

and were ALL successful.

Initial data import is in progress on wdqs200[456] (note that wdqs2006 has issues with its mgmt interface). The eqiad servers are not yet racked (T188432).

Data import for wdqs200[456] completed.

Change 419264 had a related patch set uploaded (by Gehel; owner: Gehel):
[operations/puppet@production] wdqs: collect prometheus metrics for both wdqs clusters

https://gerrit.wikimedia.org/r/419264

Change 419264 merged by Gehel:
[operations/puppet@production] wdqs: collect prometheus metrics for both wdqs clusters

https://gerrit.wikimedia.org/r/419264

Change 419707 had a related patch set uploaded (by Gehel; owner: Gehel):
[operations/puppet@production] wdqs: add pigz package

https://gerrit.wikimedia.org/r/419707

Change 419707 merged by Gehel:
[operations/puppet@production] wdqs: add pigz package

https://gerrit.wikimedia.org/r/419707

Change 424260 had a related patch set uploaded (by Gehel; owner: Gehel):
[operations/puppet@production] wdqs: configure new servers wdqs100[6-8]

https://gerrit.wikimedia.org/r/424260

Change 424260 merged by Gehel:
[operations/puppet@production] wdqs: configure new servers wdqs100[6-8]

https://gerrit.wikimedia.org/r/424260

Gehel closed subtask Unknown Object (Task) as Resolved.
Gehel mentioned this in Unknown Object (Task).
Gehel closed subtask Unknown Object (Task) as Resolved.

Change 424587 had a related patch set uploaded (by Gehel; owner: Gehel):
[operations/dns@master] wdqs: new wdqs-internal service

https://gerrit.wikimedia.org/r/424587

Change 424599 had a related patch set uploaded (by Gehel; owner: Gehel):
[operations/puppet@production] wdqs: LVS and conftool configuration for new wdqs-internal service

https://gerrit.wikimedia.org/r/424599

Change 425051 had a related patch set uploaded (by Gehel; owner: Gehel):
[operations/dns@master] wdqs-internal: new entry for service discovery

https://gerrit.wikimedia.org/r/425051

Change 425051 abandoned by Gehel:
wdqs-internal: new entry for service discovery

Reason:
replaced by https://gerrit.wikimedia.org/r/#/c/424587/

https://gerrit.wikimedia.org/r/425051

Change 425275 had a related patch set uploaded (by Gehel; owner: Gehel):
[wikidata/query/deploy@master] add new wdqs-internal cluster to scap targets

https://gerrit.wikimedia.org/r/425275

Change 425275 merged by Smalyshev:
[wikidata/query/deploy@master] add new wdqs-internal cluster to scap targets

https://gerrit.wikimedia.org/r/425275

Change 424587 merged by Gehel:
[operations/dns@master] wdqs: new wdqs-internal service

https://gerrit.wikimedia.org/r/424587

Change 424599 merged by Gehel:
[operations/puppet@production] wdqs: LVS and conftool configuration for new wdqs-internal service

https://gerrit.wikimedia.org/r/424599

Change 426926 had a related patch set uploaded (by Gehel; owner: Gehel):
[operations/puppet@production] wdqs-internal: new service discovery entry

https://gerrit.wikimedia.org/r/426926

Change 426926 merged by Gehel:
[operations/puppet@production] wdqs-internal: new service discovery entry

https://gerrit.wikimedia.org/r/426926

Mentioned in SAL (#wikimedia-operations) [2018-04-16T14:25:38Z] <vgutierrez> restarting pybal on lvs2006 - T187766

Mentioned in SAL (#wikimedia-operations) [2018-04-16T14:42:00Z] <vgutierrez> restart pybal on lvs1006 - T187766

Mentioned in SAL (#wikimedia-operations) [2018-04-16T14:49:41Z] <vgutierrez> restart pybal on lvs2003 - T187766

Mentioned in SAL (#wikimedia-operations) [2018-04-16T14:53:13Z] <vgutierrez> restart pybal on lvs1003 - T187766

The DC-specific endpoints and the service discovery endpoint seem to work correctly:

  • curl -s wdqs-internal.svc.eqiad.wmnet/readiness-probe
  • curl -s wdqs-internal.svc.codfw.wmnet/readiness-probe
  • curl -s wdqs-internal.discovery.wmnet/readiness-probe (<- this is the endpoint to use for any internal client)
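For anyone who wants to repeat the check above from a script rather than by hand, here is a minimal sketch. The three hostnames come from the curl commands above; everything else is an assumption made for illustration: the `http://` scheme (curl's default when none is given), treating HTTP 200 as "ready", and the helper names `probe_url`, `http_status` and `check_all`, which are hypothetical and not part of any WDQS tooling.

```python
import urllib.error
import urllib.request
from typing import Callable, Dict

# Hostnames copied from the curl commands in this comment.
ENDPOINTS = [
    "wdqs-internal.svc.eqiad.wmnet",
    "wdqs-internal.svc.codfw.wmnet",
    "wdqs-internal.discovery.wmnet",
]


def probe_url(host: str) -> str:
    """Build the readiness-probe URL for a host (http:// is an assumption)."""
    return f"http://{host}/readiness-probe"


def http_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status code, or 0 on a network-level failure."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status (4xx/5xx).
        return err.code
    except OSError:
        # DNS failure, connection refused, timeout, and similar.
        return 0


def check_all(fetch: Callable[[str], int] = http_status) -> Dict[str, bool]:
    """Map each probe URL to True when it answered 200 (assumed to mean ready)."""
    return {probe_url(host): fetch(probe_url(host)) == 200 for host in ENDPOINTS}
```

Calling `check_all()` only makes sense from a host that can resolve the `.wmnet` names. The `fetch` parameter is there so the logic can be exercised with a stub instead of real network access.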

I'd like @Smalyshev to have a look and validate that this all looks correct before sending real traffic...

Change 427160 had a related patch set uploaded (by Gehel; owner: Gehel):
[operations/puppet@production] wdqs: tune performance limits for the new wdqs-internal cluster

https://gerrit.wikimedia.org/r/427160

Change 427160 merged by Gehel:
[operations/puppet@production] wdqs: tune performance limits for the new wdqs-internal cluster

https://gerrit.wikimedia.org/r/427160