Job Description
Summary
Responsibilities
- Service Management: Design, build, deploy, and maintain services to ensure the high availability and reliability of DFINITY's products and the Internet Computer Protocol (ICP).
- Automation: Identify and implement opportunities to automate processes through coding, enhancing efficiency and reducing manual intervention.
- Reliability and Operability: Integrate reliability and operability into the product from the start by participating in design and code reviews, identifying risks, and proposing mitigations.
- Collaboration: Work with engineering and security teams to establish processes that align with the goals of the Internet Computer while remaining operationally feasible and automatable.
- Service Level Objectives (SLOs): Collaborate with product owners to define SLOs and implement them in code and observability infrastructure.
- On-Call Duty: Participate in on-call duties for production services on a 12/7 schedule, split across two sites. Each engineer is on call for approximately 1 week in every 6. Coordinate incident response and ensure resolution, involving engineers from other teams as necessary. On-call work is compensated with both pay and time off.
- On-Call Philosophy: Our team chooses to be on-call because it enhances our ability to identify and address system alerts, ultimately improving performance.
- Unix Systems: Operate, troubleshoot, and deploy software on Unix systems.
Requirements
- Observability: Proven experience in monitoring and maintaining large production systems using tools such as Prometheus, VictoriaMetrics, Elasticsearch, and Grafana.
- Kubernetes: Proficiency in managing multiple observability stacks across various availability zones, leveraging Kubernetes for deployment orchestration.
- Rust Coding: Extensive experience in designing and developing moderate-sized applications (up to ~10K lines of code) in Rust. Skilled in setting up automated testing and CI/CD environments. Ability to identify and implement opportunities for automation and process improvement. Experience in developing reliability engineering tools for large open-source projects is highly desirable.
- Systemic Thinking: Capable of approaching problems methodically and systemically, especially during troubleshooting.
- Pragmatism: Ability to balance immediate needs with long-term goals, understanding when a solution is "good enough for the next 12 months."
- Incident Response: Expertise in coordinating incident response across multiple teams, with excellent communication skills to ensure that everyone clearly understands the situation, next steps, and team responsibilities.
- Reliability Engineering: Experience in Site Reliability Engineering (SRE) within a crypto environment where decisions are governed by DAOs is preferred.
- Security Background: Experience in building security-sensitive tools and managing security risks in such environments. A background in DevSecOps is highly desirable.
- Community Interaction: Proven experience in engaging with community members of large open-source projects. Ideally, the candidate is already active within the ICP community.
Within 1 month, you will:
- Gain a thorough understanding of DFINITY's infrastructure and production environment.
- Start working on a suitable starter project.
- Submit improvements to our documentation and processes based on your onboarding experience.
Within 3 months, you will:
- Successfully deliver your starter project.
- Shadow team members on-call, preparing to join the on-call rotation from month 4 onwards.
- Proactively identify and propose improvements, initiating projects to implement them.
Skills
- Communication Skills
- Development
- Rust
- Team Collaboration