This task will track the hardware troubleshooting and repair of server <enter FQDN of server here>.
The first half of the steps should be completed by the person filing the hardware repair task. Some of these steps require access to [[ https://icinga.wikimedia.org/icinga/ | Icinga ]] to put a host into maintenance mode.
=== General Checklist ===
Steps while filling out template:
[] - Provide FQDN of system.
[] - Pull up [[ https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook | hardware troubleshooting runbook ]] for step by step directions.
[] - Update all fields with <> on this task with the required info.
[] - If other than a hard drive issue, please depool the machine (and confirm that it’s been depooled) for us to work on it. If it cannot be depooled, please provide a time frame in which we can take the machine down.
[] - Determine location and warranty information of device in netbox, and provide in following:
[] - System warranty expires on: <enter warranty expiry info from netbox>
[] - Put system into a failed state in Netbox.
[] - Provide urgency of request, along with justification (redundancy, dependencies, details of service cluster, number of hosts in cluster, etc.) if attention is needed immediately: <enter urgency level/justification>
[] - Attach detailed hardware failure log from above hw troubleshooting runbook. <enter logs here>
[] - Append correct project tag for server location & assign to proper user (see above) for site.
=== Failure Specific Checklists ===
See [[ https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook | hardware troubleshooting runbook ]] for the applicable type of hw failure. This template is for ALL hardware failures. Describe issue and/or attach hardware failure log. You can DELETE other sections unrelated to your hardware failure.
Example: If you have a memory failure, delete the sections for Power Supply & HDD/SSD failures, leaving only the section for 'All Other Failures'.
==== HDD/SSD Failure ====
(Refer to https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook if you need help)
[] - Follow directions on https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#HDD_&_SSD_Failures
[] - Determine whether the system uses software or hardware RAID.
[] - Attach full RAID failure logs and details via comment to task.
[] - Assign task to proper assignee & onsite project, and place in 'hardware troubleshooting' column on workboard.
==== Power Supply Failure ====
[] - Follow directions on https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Power_Supply_Failures
[] - Do not offline system.
[] - Assign task to proper assignee & onsite project, and place in 'hardware troubleshooting' column on workboard.
==== All Other Failures (memory, battery, controller, cpu, etc.) ====
[] - Follow directions on https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#All_Other_Failures
[] - Place system in fully offline state for hardware troubleshooting by the onsite. If it cannot be placed offline, coordinate with on-site engineer to schedule a maintenance window.
[] - Set system and mgmt interface to maint mode (no checks/alarms on services) for 5 business days (excluding weekends).
[] - Attach detailed hardware failure log to this task via comment; see [[ https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook | hardware troubleshooting runbook ]] for how to accomplish this.
[] - Assign task to proper assignee & onsite project, and place in 'hardware troubleshooting' column on workboard.