Job Description
Summary
We're looking for a Senior Site Reliability Engineer to join our Infrastructure team. This engineer will help our developers work efficiently as they build a vibrant ecosystem for the Avalanche Blockchain. You'll enable teams across several business units and engineering groups to design, optimize, and implement greenfield technology for a variety of use cases. This role will be a key part of our release schedule and production monitoring.
WHAT YOU WILL DO
- Develop and optimize highly reliable, scalable infrastructure grounded in SRE principles.
- Implement and maintain monitoring, logging, and tracing tools to gain insights into service behavior and health.
- Uphold SLOs (Service Level Objectives), SLIs (Service Level Indicators), and error budgets for critical systems.
- Enhance the reliability and resiliency of critical systems by identifying single points of failure and implementing best practices.
- Collaborate with software developers to build reliability and performance into applications from inception.
- Automate and streamline incident management processes to minimize service disruption and improve response times.
- Participate in on-call rotations, ensuring quick restoration of services and fostering a blameless post-mortem culture.
- Foster a continuous improvement mindset by analyzing and learning from incidents and implementing preventive measures.
- Leverage cloud technologies and IaC tools to ensure scalability and repeatability.
- Advocate for best practices in reliability, security, and maintainability within the team.
WHAT YOU WILL BRING
- BS in Computer Science or related field.
- 5+ years of experience as an SRE, DevOps, or Cloud Engineer.
- Strong grasp of SRE principles, including error budgets, SLOs, and SLIs.
- Experience with cloud networking and orchestration on AWS (EKS, ECS, VPC, S3, ELB).
- Strong Kubernetes experience with Docker or rkt containerization.
- Proficiency in Infrastructure as Code (IaC) using tools such as Terraform, Terragrunt, and Ansible.
- Experience with monitoring and observability tools like Prometheus, Grafana, or ELK Stack.
- Experience building and maintaining CI/CD pipelines with GitHub Actions (preferred), Jenkins, Travis CI, or CircleCI.
- Experience with automation and configuration management using Ansible, Puppet, or Chef.
- Experience with Linux-based infrastructure (Ubuntu preferred).
- Experience writing scripts and automation in a scripting or general-purpose language (Python and Go preferred).
- Working knowledge of decentralized architecture design patterns and distributed systems.
Skills
- AWS
- Networking
- Python
- Team Collaboration