Key Responsibilities
- Lead and coordinate day-to-day IT operations and service delivery
- Ensure maximum system availability, reliability, and performance across all platforms
- Drive, evolve, and scale Site Reliability Engineering (SRE) practices, including monitoring, incident response, and automation
- Own and improve 24/7 operational readiness, including on-call models and escalation processes
- Collaborate closely with development teams in agile environments (e.g., SAFe) to enhance system resilience and scalability
- Continuously identify and implement improvements based on incident analysis, KPIs, and operational insights
- Align operations with ITIL processes (Critical Incident, Incident, Problem, and Change Management)
- Manage and optimize cloud infrastructure, primarily on AWS
- Act as a bridge between operations, engineering, and business stakeholders
