Job Description

Summary

Responsibilities

  1. Service Management: Design, build, deploy, and maintain services to ensure the high availability and reliability of DFINITY's products and the Internet Computer Protocol (ICP).
  2. Automation: Identify and implement opportunities to automate processes through coding, enhancing efficiency and reducing manual intervention.
  3. Reliability and Operability: Integrate reliability and operability into the product from the start by participating in design and code reviews, identifying risks, and proposing mitigations.
  4. Collaboration: Work with engineering and security teams to establish processes that align with the goals of the Internet Computer while remaining operationally feasible and automatable.
  5. Service Level Objectives (SLOs): Collaborate with product owners to define SLOs and implement them in code and observability infrastructure.
  6. On-Call Duty: Participate in on-call duties for production services on a 12/7 schedule, split across two sites. On-call duty is approximately 1 week every 6 weeks. Coordinate incident response and ensure resolution, involving engineers from other teams as necessary. On-call work is compensated with both pay and time off.
  7. On-Call Philosophy: Our team chooses to be on-call because it enhances our ability to identify and address system alerts, ultimately improving performance.
  8. Unix Systems: Operate, troubleshoot, and deploy software on Unix systems.

Requirements:

  1. Observability: Proven experience in monitoring and maintaining large production systems using tools such as Prometheus, VictoriaMetrics, Elasticsearch, and Grafana.
  2. Kubernetes: Proficiency in managing multiple observability stacks across various availability zones, leveraging Kubernetes for deployment orchestration.
  3. Rust Coding: Extensive experience in designing and developing moderate-sized applications (up to ~10K lines of code) in Rust. Skilled in setting up automated testing and CI/CD environments. Ability to identify and implement opportunities for automation and process improvement. Experience in developing reliability engineering tools for large open-source projects is highly desirable.
  4. Systemic Thinking: Capable of approaching problems methodically and systemically, especially during troubleshooting.
  5. Pragmatism: Ability to balance immediate needs with long-term goals, understanding when a solution is "good enough for the next 12 months."
  6. Incident Response: Expertise in coordinating incident response across multiple teams, with excellent communication skills to ensure that everyone involved clearly understands the situation, next steps, and team responsibilities.
  7. Reliability Engineering: Experience in Site Reliability Engineering (SRE) within a crypto environment where decisions are governed by DAOs is preferred.
  8. Security Background: Experience in building security-sensitive tools and managing security risks in such environments. A background in DevSecOps is highly desirable.
  9. Community Interaction: Proven experience in engaging with community members of large open-source projects. Ideally, the candidate is already active within the ICP community.

Within 1 month, you will:

  1. Gain a thorough understanding of DFINITY's infrastructure and production environment.
  2. Start working on a suitable starter project.
  3. Submit improvements to our documentation and processes based on your onboarding experience.

Within 3 months, you will:

  1. Successfully deliver your starter project.
  2. Shadow team members on-call, preparing to join the on-call rotation from month 4 onwards.
  3. Proactively identify and propose improvements, initiating projects to implement them.

Skills
  • Communication Skills
  • Development
  • Rust
  • Team Collaboration