
setup kafka2001 & kafka2002
Closed, Resolved · Public · 8 Story Points

Description

This is the tracking task for the installation and implementation of kafka2001 and kafka2002. These were requested on hardware-requests task T114191 and allocated from spares ordered via T120246.

By default, systems such as these should go into two different racks/rows for redundancy.

  • - received in, asset tagged, added to racktables
  • - sub-task created for onsite to add label to systems and update racktables
  • - allocate mgmt dns
  • - system bios and drac setup
  • - allocate production dns
  • - allocate port and vlan (internal)
  • - install_module updates
  • - install OS (jessie)
  • - service implementation (typically handed off to requestor of hardware, in this case @aaron)

Event Timeline

RobH created this task.Dec 15 2015, 6:57 PM
RobH claimed this task.
RobH raised the priority of this task from to Normal.
RobH updated the task description.
RobH added subscribers: faidon, mobrovac, RobH and 13 others.
Papaul added a subscriber: Papaul.Dec 15 2015, 8:00 PM

For this task I will be using wmf6377 in row A and wmf6379 in row B.

Papaul updated the task description.Dec 15 2015, 8:37 PM
Papaul set Security to None.

@RobH for the install_module update section, what RAID level will I be using?

RobH added a comment.Dec 15 2015, 8:42 PM

These systems have 4 * 3TB disks, so we need to make use of that disk space and use GPT.

raid10-gpt.cfg should be used for these, as it sets up the following:

# * four disks, sda, sdb, sdc, sdd
# * primary partitions, no LVM
# * GPT layout (large disks, > 2TB)
# * layout:
#   - /	:   ext3, RAID10, 50GB
#   - /srv: xfs,  RAID10, rest of the space
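As a rough check on the layout above, here is a minimal sketch of the space it would yield. It assumes RAID10 keeps two copies of every block (so half the raw capacity is usable), that the 50GB figure is the size of the resulting / array, decimal units, and no partition-table or filesystem overhead. The function name and its parameters are illustrative and are not part of the partman recipe.

```python
def raid10_usable_gb(disk_count: int, disk_size_gb: int) -> int:
    """Usable capacity of a two-copy RAID10 array: half of the raw capacity.

    Assumption: an even number of equally sized disks, at least four,
    matching the sda..sdd layout described above.
    """
    if disk_count < 4 or disk_count % 2:
        raise ValueError("this sketch assumes an even disk count of at least 4")
    return disk_count * disk_size_gb // 2


ROOT_GB = 50  # size of / in the layout above (assumed to be the array size)
usable_gb = raid10_usable_gb(disk_count=4, disk_size_gb=3000)  # 4 * 3TB
srv_gb = usable_gb - ROOT_GB  # "rest of the space" goes to /srv

print(f"usable: {usable_gb} GB, /: {ROOT_GB} GB, /srv: about {srv_gb} GB")
```

Under these assumptions roughly 6TB is usable in total, almost all of it under /srv.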
RobH reassigned this task from RobH to Papaul.Dec 15 2015, 8:43 PM
Ottomata added a comment.EditedDec 15 2015, 8:44 PM

I think that will be fine, but be warned that I may need to delete the /srv partition and recreate JBOD for Kafka. TBD.

/ ext RAID10 across all 4 sounds good.

RobH added a comment.Dec 15 2015, 8:53 PM

kafka2001 & kafka2002 port descriptions and vlans updated.

RobH updated the task description.Dec 15 2015, 8:54 PM

We discussed this on IRC and agreed that the following RAID configuration needs to be used:
raid10-gpt-srv-ext4.cfg

Papaul updated the task description.Dec 15 2015, 9:46 PM

@RobH I am getting this error message during install.

RobH added a comment.Dec 16 2015, 2:13 AM

Just continue past it, we don't need swap space on it. Thanks for checking!

Papaul updated the task description.Dec 16 2015, 3:13 AM

signing puppet certs, salt key complete

Papaul reassigned this task from Papaul to aaron.Dec 16 2015, 4:00 AM

Hey Aaron, the installation is complete on kafka200[1-2]. If you have any questions please let me know.

Thanks

CooOoL ok thanks!

ok! looking good! We may have a problem for codfw -- we don't have a zookeeper cluster there. For some reason I assumed there was already a mirror of the conf100x hosts in codfw, but there is not. We will have to think about this.

I actually did have Kafka set up on these for a few minutes before I realized this. I think we can go ahead and close this ticket and we can track the final set up and zookeeper problem elsewhere.

ori reassigned this task from aaron to Ottomata.Dec 21 2015, 7:55 PM

@Ottomata can you indeed file a task for codfw zookeeper?

Ottomata reassigned this task from Ottomata to elukey.Apr 19 2016, 1:32 PM
Restricted Application added a subscriber: TerraCodes.Apr 19 2016, 1:32 PM

Change 285958 had a related patch set uploaded (by Elukey):
Enable kafka200[12] to host Kafka and Event Bus

https://gerrit.wikimedia.org/r/285958

Restricted Application added a subscriber: Southparkfan.Apr 28 2016, 2:35 PM
elukey moved this task from Next Up to In Progress on the Analytics-Kanban board.
JAllemandou set the point value for this task to 5.Apr 28 2016, 4:30 PM

Change 285958 merged by Elukey:
Enable kafka200[12] to host Kafka and EventBus.

https://gerrit.wikimedia.org/r/285958

Change 286134 had a related patch set uploaded (by Elukey):
Add eventbus_codfw to the monitoring hiera variables.

https://gerrit.wikimedia.org/r/286134

Change 286134 merged by Elukey:
Add eventbus_codfw to the monitoring hiera variables.

https://gerrit.wikimedia.org/r/286134

Change 286136 had a related patch set uploaded (by Elukey):
Add kafka200[12] codfw hosts to scap.

https://gerrit.wikimedia.org/r/286136

Change 286136 merged by Mobrovac:
Add kafka200[12] codfw hosts to scap.

https://gerrit.wikimedia.org/r/286136

mobrovac moved this task from Backlog to In Progress on the Event-Platform board.Apr 29 2016, 11:04 AM
elukey added a comment.EditedApr 29 2016, 11:51 AM
  • Icinga configuration updated
  • Added kafka200[12] to eventbus scap config (thanks to Marko) + deploy
  • Ran puppet on both nodes, no errors

Next steps:

  • Add LVS/PyBal configurations (need to pre-allocate an IP I guess?)

https://wikitech.wikimedia.org/wiki/LVS#LVS_installation

Change 286621 had a related patch set uploaded (by Elukey):
Add LVS configuration for EventBus in codfw (DNS reverse/wmnet config already in place).

https://gerrit.wikimedia.org/r/286621

elukey added a comment.EditedMay 3 2016, 11:00 AM

Double checked that health checks work fine:

elukey@kafka1001:~$ curl http://kafka2002.codfw.wmnet:8085/v1/topics
{"change-prop.retry.change-prop.backlinks.continue": {"schema_name": "retry"}, "change-prop.retry.mediawiki.page_delete": {"schema_name": "retry"}, "mediawiki.page_delete": {"schema_name": "page_delete"}, "mediawiki.page_move": {"schema_name": "page_move"}, "change-prop.backlinks.continue": {"schema_name": "continue"}, "change-prop.retry.mediawiki.page_restore": {"schema_name": "retry"}, "resource_change": {"schema_name": "resource_change"}, "mediawiki.revision_visibility_set": {"schema_name": "revision_visibility_set"}, "change-prop.retry.mediawiki.revision_visibility_set": {"schema_name": "retry"}, "mediawiki.user_block": {"schema_name": "user_block"}, "mediawiki.page_restore": {"schema_name": "page_restore"}, "change-prop.retry.mediawiki.user_block": {"schema_name": "retry"}, "change-prop.retry.mediawiki.page_move": {"schema_name": "retry"}, "mediawiki.revision_create": {"schema_name": "revision_create"}, "change-prop.retry.mediawiki.revision_create": {"schema_name": "retry"}, "test.event": {"schema_name": "test_event"}, "change-prop.retry.resource_change": {"schema_name": "retry"}}

elukey@kafka1001:~$ curl http://kafka2001.codfw.wmnet:8085/v1/topics
{"change-prop.retry.change-prop.backlinks.continue": {"schema_name": "retry"}, "change-prop.retry.mediawiki.page_delete": {"schema_name": "retry"}, "mediawiki.page_delete": {"schema_name": "page_delete"}, "mediawiki.page_move": {"schema_name": "page_move"}, "change-prop.backlinks.continue": {"schema_name": "continue"}, "change-prop.retry.mediawiki.page_restore": {"schema_name": "retry"}, "resource_change": {"schema_name": "resource_change"}, "mediawiki.revision_visibility_set": {"schema_name": "revision_visibility_set"}, "change-prop.retry.mediawiki.revision_visibility_set": {"schema_name": "retry"}, "mediawiki.user_block": {"schema_name": "user_block"}, "mediawiki.page_restore": {"schema_name": "page_restore"}, "change-prop.retry.mediawiki.user_block": {"schema_name": "retry"}, "change-prop.retry.mediawiki.page_move": {"schema_name": "retry"}, "mediawiki.revision_create": {"schema_name": "revision_create"}, "change-prop.retry.mediawiki.revision_create": {"schema_name": "retry"}, "test.event": {"schema_name": "test_event"}, "change-prop.retry.resource_change": {"schema_name": "retry"}}
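The two responses above can also be compared programmatically instead of by eye. The following is a minimal sketch, assuming each /v1/topics body has been saved as JSON text; the helper name is hypothetical and the sample bodies are shortened stand-ins for the real output.

```python
from __future__ import annotations

import json


def topic_diff(response_a: str, response_b: str) -> tuple[set[str], set[str]]:
    """Return (topics only in A, topics only in B) for two /v1/topics bodies."""
    topics_a = set(json.loads(response_a))  # the keys of the JSON object are topic names
    topics_b = set(json.loads(response_b))
    return topics_a - topics_b, topics_b - topics_a


# Shortened sample bodies; the real responses list many more topics.
kafka2001_body = (
    '{"mediawiki.page_move": {"schema_name": "page_move"}, '
    '"test.event": {"schema_name": "test_event"}}'
)
kafka2002_body = (
    '{"test.event": {"schema_name": "test_event"}, '
    '"mediawiki.page_move": {"schema_name": "page_move"}}'
)

only_2001, only_2002 = topic_diff(kafka2001_body, kafka2002_body)
if not only_2001 and not only_2002:
    print("both hosts report the same topics")
else:
    print("only on kafka2001:", sorted(only_2001))
    print("only on kafka2002:", sorted(only_2002))
```

Comparing sets rather than raw strings means key order in the JSON does not matter.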

Note that the topics' names ought to be prefixed with "codfw.". @elukey I guess you created them by running ./bin/ensure-kafka-topics-exist ?
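One way to spot topics that lack the prefix is sketched below. It assumes the convention is simply that every topic name served in codfw starts with "codfw."; the helper name and the shortened sample body are hypothetical and not part of EventBus.

```python
from __future__ import annotations

import json


def unprefixed_topics(response_body: str, prefix: str = "codfw.") -> list[str]:
    """List topic names in a /v1/topics body that do not start with the prefix."""
    topics = json.loads(response_body)  # maps topic name -> {"schema_name": ...}
    return sorted(name for name in topics if not name.startswith(prefix))


# Shortened, made-up sample body: one prefixed and one unprefixed topic.
sample_body = (
    '{"codfw.mediawiki.page_move": {"schema_name": "page_move"}, '
    '"test.event": {"schema_name": "test_event"}}'
)

print(unprefixed_topics(sample_body))
```

An empty list would mean every topic already follows the naming convention.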

elukey added a comment.May 3 2016, 1:59 PM

Created topics with mobrovac, all good. Last step is to enable LVS.

Change 286621 merged by Elukey:
Add LVS configuration for EventBus in codfw (DNS reverse/wmnet config already in place).

https://gerrit.wikimedia.org/r/286621

LVS configuration set up on lvs200[36]:

curl http://eventbus.svc.codfw.wmnet:8085/v1/topics

Pending verification, but the work should be done :)

elukey changed the point value for this task from 5 to 8.May 4 2016, 11:25 AM
elukey moved this task from In Progress to Done on the Analytics-Kanban board.May 4 2016, 1:01 PM
elukey closed this task as Resolved.May 5 2016, 6:23 AM