Site Reliability Engineering Manager

About the Role:

We are looking for a seasoned and strategic Site Reliability Engineering (SRE) Manager to
lead and grow our SRE team. You will be responsible for building and managing a team of
engineers who ensure the reliability, scalability, and performance of our mission-critical systems
and services. As the SRE Manager, you will play a key role in the design and execution of
operational best practices while promoting a culture of automation and continuous improvement
across the engineering teams.

Key Responsibilities:

Team Leadership and Mentorship — Lead, mentor, and grow a team of Site Reliability
Engineers by providing guidance on best practices, technical decisions, and career
development.
Operational Excellence — Own the overall reliability, uptime, and performance of the
systems and services, ensuring they meet business SLAs and customer expectations.
Incident Management — Oversee the incident response process, including monitoring,
alerting, incident resolution, and root cause analysis, with a focus on improving response
times and minimizing impact.
Automation and Tooling — Drive the adoption of automation and self-service tools to
reduce manual intervention, improve system reliability, and enhance engineering
productivity.
Collaboration with Engineering Teams — Work closely with software developers, QA
teams, and other stakeholders to embed reliability into the design and development of
applications and services.
Capacity Planning and Performance Optimization — Manage capacity planning,
performance monitoring, and optimization to ensure the infrastructure can scale to meet
business needs.
Infrastructure Management — Collaborate with the DevOps and cloud infrastructure
teams to manage, maintain, and optimize cloud infrastructure using modern IaC
(Infrastructure as Code) tools and methodologies.
Budget and Resource Management — Manage budgets, vendor relationships, and
resource allocations to ensure efficient use of infrastructure and technology investments.
Drive SRE Culture — Promote a culture of continuous improvement, emphasizing
learning from failure, monitoring, and proactive problem-solving.
Security and Compliance — Work closely with security teams to implement best
practices for secure infrastructure and ensure compliance with internal and external
regulations.

Required Qualifications:

Bachelor’s degree in Computer Science, Engineering, Information Technology, or equivalent work experience.
5+ years of experience in site reliability engineering, DevOps, or infrastructure roles, including at least 2 years in a leadership or management role.
Deep knowledge of cloud platforms (AWS, Google Cloud Platform, Azure) and the ability to manage highly available and scalable infrastructure.
Hands-on experience with monitoring, alerting, and observability tools (Datadog, Prometheus, Grafana, ELK, etc.).
Strong expertise in automation tools and practices (CI/CD pipelines, IaC tools such as Terraform).
Solid understanding of containers and orchestration tools (Docker, Kubernetes).
Proven experience with incident management, root cause analysis, and post-mortem processes.
Deep knowledge of Linux/Unix systems administration and networking concepts (DNS, TCP/IP, load balancing).
Strong communication and leadership skills, with the ability to collaborate across teams and functions.

Preferred Qualifications:

Experience with large-scale distributed systems and high-availability architectures.
Familiarity with security best practices for cloud environments.
Experience managing multi-region, multi-cloud deployments.
Prior experience working in Agile or Scrum environments.
Knowledge of cost optimization strategies for cloud infrastructure.

Soft Skills:

Strong organizational and time management skills.
Ability to influence and inspire teams to adopt best practices.
Excellent verbal and written communication skills for both technical and non-technical stakeholders.
Ability to think strategically while staying hands-on when necessary.
Demonstrated problem-solving skills and a proactive approach to identifying risks and finding solutions.

Site Reliability Engineering Manager | Toshiba (Poland)
