Site Reliability Engineer at DFINITY | Switzerland | Full-Time | cryptojobs.com

Responsibilities:

Service Management: Design, build, deploy, and maintain services to ensure the high availability and reliability of DFINITY's products and the Internet Computer Protocol (ICP).
Automation: Identify and implement opportunities to automate processes through coding, enhancing efficiency and reducing manual intervention.
Reliability and Operability: Integrate reliability and operability into the product from the start by participating in design and code reviews, identifying risks, and proposing mitigations.
Collaboration: Work with engineering and security teams to establish processes that align with the goals of the Internet Computer while remaining operationally feasible and automatable.
Service Level Objectives (SLOs): Collaborate with product owners to define SLOs and implement them in code and observability infrastructure.
On-Call Duty: Participate in on-call duty for production services on a 12/7 schedule, split across two sites. Each engineer is on call for approximately 1 week in every 6. Coordinate incident response and ensure resolution, involving engineers from other teams as necessary. On-call work is compensated with both pay and time off.
On-Call Philosophy: Our team chooses to be on-call because it enhances our ability to identify and address system alerts, ultimately improving performance.
Unix Systems: Operate, troubleshoot, and deploy software on Unix systems.

Requirements:

Observability: Proven experience in monitoring and maintaining large production systems using tools such as Prometheus, VictoriaMetrics, Elasticsearch, and Grafana.
Kubernetes: Proficiency in managing multiple observability stacks across various availability zones, leveraging Kubernetes for deployment orchestration.
Rust Coding: Extensive experience in designing and developing moderate-sized applications (up to ~10K lines of code) in Rust. Skilled in setting up automated testing and CI/CD environments. Ability to identify and implement opportunities for automation and process improvement. Experience in developing reliability engineering tools for large open-source projects is highly desirable.
Systemic Thinking: Capable of approaching problems methodically and systemically, especially during troubleshooting.
Pragmatism: Ability to balance immediate needs with long-term goals, understanding when a solution is "good enough for the next 12 months."
Incident Response: Expertise in coordinating incident response across multiple teams, with the communication skills to make sure everyone involved clearly understands the situation, the next steps, and each team's responsibilities.
Reliability Engineering: Experience in Site Reliability Engineering (SRE) within a crypto environment where decisions are governed by DAOs is preferred.
Security Background: Experience in building security-sensitive tools and managing security risks in such environments. A background in DevSecOps is highly desirable.
Community Interaction: Proven experience in engaging with community members of large open-source projects. Ideally, the candidate is already active within the ICP community.

Within 1 month, you will:

Gain a thorough understanding of DFINITY's infrastructure and production environment.
Start working on a suitable starter project.
Submit improvements to our documentation and processes based on your onboarding experience.

Within 3 months, you will:

Successfully deliver your starter project.
Shadow team members who are on call, preparing to join the on-call rotation from month 4 onwards.
Proactively identify and propose improvements, initiating projects to implement them.

Skills

Communication Skills
Development
Rust
Team Collaboration
