Incident Response Manager, Data Infrastructure, Dublin Pike
-
Dublin Pike, Ireland
-
Posted: less than a month ago
Description
Responsibilities
The Data Systems Infrastructure (DSI) team sits within the ByteDance global technology structure and supports the company's fast growth by building and operating hyper-scale data centers, managing the life cycle of the server fleet, providing cloud solutions, and developing various infrastructure services, making sure they are scalable and reliable.
We are seeking a technically skilled and detail-oriented professional to serve as a front-line responder for incident detection, triage, and response across infrastructure, facilities, and security operations. The ideal candidate will have a strong foundation in facility operations, broad knowledge across IT, infrastructure, or engineering disciplines, experience in critical environments, and the ability to analyze incidents, manage them properly, identify trends, and drive sustained improvements. This role requires performance under pressure, data-driven thinking, and a proactive approach to continuous improvement and operational resilience.
Act as the primary first responder for the IRC Operations Center by continuously monitoring infrastructure, facilities, and external services using approved tools, and immediately responding to all alerts and anomalies.
Promptly address environmental, facility, IT infrastructure, and external service events (e.g., power, temperature, flooding, outages, partner notifications) to minimize operational and customer impact.
Conduct detailed root cause analysis for all incidents, assess scope and impact, determine corrective actions, and ensure issues are fully understood before closure.
Accurately assess incident severity and customer risk, communicate clearly and proactively with stakeholders, and coordinate timely escalations and collaboration with resolver teams to drive rapid resolution.
Manage incidents properly and efficiently, track response performance against SLAs, and ensure alerts, notifications, and resolutions occur within agreed timelines.
Produce comprehensive incident reports, post-mortems, and operational metrics; analyze trends and recurring issues to generate insights and drive continuous improvement.
Own Incident, Problem, and Change Management processes; maintain SOPs/runbooks; provide technical leadership; and champion continuous improvements to reliability, security, and operational effectiveness across teams.
Qualifications
Minimum Qualifications:
Hold a Bachelor's degree in Computer Science, Information Technology, Engineering, or a related technical discipline, with solid fundamentals in infrastructure and systems operations.
Demonstrate hands‑on experience in Data Center Facility Operations Centers, IT infrastructure, network operations, or systems monitoring environments.
Be proficient with monitoring and alerting platforms (e.g., Grafana, Nagios, or similar) to detect, analyze, and respond to operational events effectively.
Exhibit strong analytical and troubleshooting skills, with the proven ability to investigate incidents, determine root causes, and implement corrective actions.
Operate properly and decisively during critical situations while coordinating incident and problem management processes across cross‑functional teams.
Communicate clearly with both technical and non‑technical stakeholders through reports and reviews, while maintaining a proactive mindset focused on continuous improvement and operational excellence.
Preferred Qualifications:
5+ years of hands‑on experience in IT or data center environments, with strong exposure to incident and problem management in enterprise‑scale systems.
Demonstrate working knowledge of data center facility operations, including mechanical, electrical, and plumbing (MEP) systems, along with server and infrastructure technologies.
Have practical experience with ticketing systems, monitoring platforms (e.g., Grafana), and data center or server management tools to support reliable operations.
Consistently perform in fast‑changing, time‑sensitive situations, balancing multiple priorities while meeting deadlines and resolving critical issues efficiently.
Contribute to or lead initiatives that enhance operational efficiency, security, resilience, and overall infrastructure performance through continuous improvement efforts.
Maintain relevant certifications or technical knowledge (e.g., ITIL, Server+, DCCA, CCNA, PMP, analytics tools), adapt quickly to changing environments, and support operational needs including on‑call coverage.
This role requires on‑call support through a scheduled rotation.
Highlights
Company name: ByteDance
Job position: Incident Response Manager, Data Infrastructure