Site Reliability Engineering Manager | Toshiba (Poland)

About the Role:

We are looking for a seasoned and strategic Site Reliability Engineering (SRE) Manager to
lead and grow our SRE team. You will be responsible for building and managing a team of
engineers who ensure the reliability, scalability, and performance of our mission-critical systems
and services. As the SRE Manager, you will play a key role in the design and execution of
operational best practices while promoting a culture of automation and continuous improvement
across the engineering teams.

Key Responsibilities:

  1. Team Leadership and Mentorship — Lead, mentor, and grow a team of Site Reliability
    Engineers by providing guidance on best practices, technical decisions, and career
    development.
  2. Operational Excellence — Own the overall reliability, uptime, and performance of the
    systems and services, ensuring they meet business SLAs and customer expectations.
  3. Incident Management — Oversee the incident response process, including monitoring,
    alerting, incident resolution, and root cause analysis, with a focus on improving response
    times and minimizing impact.
  4. Automation and Tooling — Drive the adoption of automation and self-service tools to
    reduce manual intervention, improve system reliability, and enhance engineering
    productivity.
  5. Collaboration with Engineering Teams — Work closely with software developers, QA
    teams, and other stakeholders to embed reliability into the design and development of
    applications and services.
  6. Capacity Planning and Performance Optimization — Manage capacity planning,
    performance monitoring, and optimization to ensure the infrastructure can scale to meet
    business needs.
  7. Infrastructure Management — Collaborate with the DevOps and cloud infrastructure
    teams to manage, maintain, and optimize cloud infrastructure using modern IaC
    (Infrastructure as Code) tools and methodologies.
  8. Budget and Resource Management — Manage budgets, vendor relationships, and
    resource allocations to ensure efficient use of infrastructure and technology investments.
  9. Drive SRE Culture — Promote a culture of continuous improvement, emphasizing
    learning from failure, monitoring, and proactive problem-solving.
  10. Security and Compliance — Work closely with security teams to implement best
    practices for secure infrastructure and ensure compliance with internal and external
    regulations.

Required Qualifications:

  • Bachelor’s degree in Computer Science, Engineering, Information Technology, or equivalent work experience.
  • 5+ years of experience in site reliability engineering, DevOps, or infrastructure roles, including at least 2 years in a leadership or management role.
  • Deep knowledge of cloud platforms (AWS, Google Cloud Platform, Azure) and the ability to manage highly available and scalable infrastructure.
  • Hands-on experience with monitoring, alerting, and observability tools (Datadog, Prometheus, Grafana, ELK, etc.).
  • Strong expertise in automation tools and practices (CI/CD pipelines, IaC tools such as Terraform).
  • Solid understanding of containers and orchestration tools (Docker, Kubernetes).
  • Proven experience with incident management, root cause analysis, and post-mortem processes.
  • Deep knowledge of Linux/Unix systems administration and networking concepts (DNS, TCP/IP, load balancing).
  • Strong communication and leadership skills, with the ability to collaborate across teams and functions.

Preferred Qualifications:

  • Experience with large-scale distributed systems and high-availability architectures.
  • Familiarity with security best practices for cloud environments.
  • Experience managing multi-region, multi-cloud deployments.
  • Prior experience working in Agile or Scrum environments.
  • Knowledge of cost optimization strategies for cloud infrastructure.

Soft Skills:

  • Strong organizational and time management skills.
  • Ability to influence and inspire teams to adopt best practices.
  • Excellent verbal and written communication skills for both technical and non-technical stakeholders.
  • Ability to think strategically while staying hands-on when necessary.
  • Demonstrated problem-solving skills and a proactive approach to identifying risks and finding solutions.

    Looking for another position?

    See all our open positions and learn why you should consider joining the Xenoss team.

    Careers at Xenoss