Job Description
Summary
We're seeking an experienced Site Reliability Engineer to lead and mentor our SRE team. You're a seasoned professional with a proven track record in designing and implementing robust SRE processes at scale. You excel in cloud and hybrid environments, have a deep understanding of containerization, and are passionate about creating resilient, high-performance systems that can handle extreme traffic peaks. Beyond technical expertise, you're a skilled communicator and collaborator, able to bridge the gap between technical teams and stakeholders. You thrive in cross-functional environments and can effectively represent SRE concerns at the leadership level.
Responsibilities:
- Lead the implementation and refinement of SRE practices across the organization, including SLOs, error budgets, and blameless postmortems
- Design and implement automation to eliminate toil and improve system reliability and efficiency
- Lead the design and architecture of scalable hybrid cloud solutions for Web3 infrastructure
- Manage error budgets and make data-driven decisions about when to prioritize reliability vs. new features
- Drive SRE practices to ensure high availability, performance, and reliability under varying load conditions
- Collaborate closely with the Platform Engineering team to build reliability into services from the ground up
- Collaborate closely with Nethermind’s Infrastructure Leadership department to align SRE strategies with overall technical vision
- Drive the adoption of observability best practices and implement comprehensive monitoring systems
- Develop and maintain service level indicators (SLIs) and service level objectives (SLOs), working with product owners to define appropriate reliability targets
- Mentor team members in SRE practices and foster a culture of continuous learning
- Lead capacity planning efforts, using quantitative analysis to predict and address future scaling challenges
- Contribute to long-term technical roadmaps, balancing reliability concerns with product innovation
Skills:
- 5+ years of experience in Site Reliability Engineering or DevOps
- Expert knowledge of cloud platforms (AWS, GCP)
- Expert knowledge of Kubernetes
- Proven experience in designing and implementing scalable, efficient, resilient systems
- Deep understanding of Linux/Unix systems and networking protocols
- Strong programming skills in Python or Go
- Strong background in monitoring, observability, and logging systems (e.g., Grafana, Prometheus, Loki)
- Expertise in CI/CD tools (e.g., GitHub Actions, ArgoCD)
- Excellent communication skills, both written and verbal, with the ability to explain complex technical concepts to various audiences
- Experience in producing technical documentation, runbooks, presentations, and post-mortem reports
- Experience with and passion for mentoring and upskilling team members
Nice to have:
- Experience leading technical teams
- Contributions to open-source projects or thought leadership in SRE
- Familiarity with MLOps and big data technologies
- Knowledge of blockchain technology and infrastructure
- Experience with chaos engineering principles and tools
- Familiarity with traffic management and CDN technologies
- Systems or backend engineering background
Skills
- AWS
- Communication Skills
- Development
- Leadership
- Python
- Team Collaboration