This task will track the hardware troubleshooting and repair of server <enter FQDN of server here>.
The first half of the steps should be completed by the person filing the hardware repair task. Some of these steps require access to [[ https://icinga.wikimedia.org/icinga/ | Icinga ]] to put a host into maintenance mode.
=== General Checklist ===
Steps while filling out template:
[] - Pull up the [[ https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook | hardware troubleshooting runbook ]] for step by step directions. The checkboxes below are summary level items.
[] - Update all fields marked with <> on this task with the required info.
[] - Look up device in netbox or use the naming conventions guide to determine where the server is located.
[] - Look up device in netbox to determine its warranty status, and provide that information in the next step:
[] - System warranty expires on: <enter warranty expiry info from netbox>
[] - Add the correct project tag for the server's location (see above) and assign the task to the proper user for that site.
[] - Detail whether this system is in a service cluster and what other systems are in the same cluster. Include the number of hosts and the utilization of the service cluster.
[] - Provide urgency of request, along with justification if high priority: <enter in priority/justification>
[] - Attach detailed hardware failure log, see [[ https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook | hardware troubleshooting runbook ]] on how to accomplish this: <enter logs here>
=== Failure Specific Checklists ===
See the [[ https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook | hardware troubleshooting runbook ]], as the steps you do next depend on the type of hardware failure. This template is for ALL hardware failures. Use the section below that applies to your type of failure, and DELETE the sections unrelated to your hardware failure.
Example: If you have a memory failure, delete the sections for Power Supply & HDD/SSD failures, leaving only the section for 'All Other Failures'.
==== Power Supply Failure ====
[] - Follow directions on https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Power_Supply_Failures
[] - Do not offline the system; it does not need to be taken offline.
[] - Assign task to the proper assignee & onsite project, and place it in the 'hardware troubleshooting' column on that workboard.
==== HDD/SSD Failure ====
[] - Follow directions on https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#HDD_&_SSD_Failures
[] - Determine whether the system uses software or hardware RAID.
[] - Attach full RAID failure logs and details to the task via comment.
[] - Assign task to the proper assignee & onsite project, and place it in the 'hardware troubleshooting' column on that workboard.
==== All Other Failures (memory, battery, controller, CPU, etc.) ====
[] - Follow directions on https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#All_Other_Failures
[] - Place system in a fully offline state for hardware troubleshooting by the on-site engineer. If it cannot be placed offline, the person filing the task will need to coordinate with the on-site engineer to schedule a maintenance window.
[] - Set system and mgmt interface to maint mode (no checks/alarms on services) for 5 business days (this will often result in 7 calendar days due to weekends).
[] - Attach detailed hardware failure log to this task via comment, see [[ https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook | hardware troubleshooting runbook ]] on how to accomplish this.
[] - Assign task to the proper assignee & onsite project, and place it in the 'hardware troubleshooting' column on that workboard.
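
The maintenance-mode step in the 'All Other Failures' checklist asks for a window of 5 business days. As a rough aid for picking the end date, the following is a minimal sketch that counts weekdays forward from a start date. The function name `add_business_days` and the sample date are hypothetical, and the sketch assumes a Monday to Friday work week with no holiday handling.

```python
from datetime import date, timedelta


def add_business_days(start: date, business_days: int) -> date:
    """Return the date reached after counting `business_days` weekdays past `start`.

    Assumption: business days are Monday through Friday; holidays are ignored.
    """
    current = start
    remaining = business_days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # Monday=0 .. Friday=4
            remaining -= 1
    return current


if __name__ == "__main__":
    start = date(2024, 1, 3)  # hypothetical start date (a Wednesday)
    end = add_business_days(start, 5)
    print(f"Maintenance window: {start} to {end} ({(end - start).days} calendar days)")
```

When the start date is a weekday, the window always crosses one weekend, which is where the 7 calendar days come from.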