Job Description
Summary
Responsibilities:
- Own the production infrastructure on AWS and Azure. Implement sustainable, scalable solutions that improve availability and performance
- Help identify the root cause of every incident and prevent recurrence
- Alert on symptoms rather than waiting for outages. Ensure every infrastructure and application alert is actionable or triggers self-healing automation
- Work closely with R&D and Support, offering education and guidance on integration, support, and monitoring across the toolset
- Everything-as-code approach: run our infrastructure with Ansible, Terraform, and Kubernetes
- Document every action, turn it into a repeatable procedure, and then automate it
- Focus on the system's observability, availability, reliability, performance/latency, and monitoring
- Take part in periodic on-call duties and emergency response
Minimum Requirements:
- At least 3 years of experience as a DevOps engineer or SRE in a SaaS environment
- Experience with coding languages: Python/JavaScript/Bash, or similar
- At least 3 years of experience with alerting and monitoring systems such as DataDog / Splunk / New Relic / Prometheus, or similar
- Experience working with Linux systems from kernel to shell and beyond
- Experience with cloud platforms such as AWS / Google Cloud / Azure
- Experience with configuration management tools such as Ansible/Chef/Puppet
- Experience with Docker, Kubernetes and Helm
- SCM: Git/Bitbucket/GitLab/Phabricator/Gerrit
- Strong analytical and troubleshooting skills, with the ability to solve complex problems
- Strong verbal and written communication skills and a collaborative mindset
- Ability to dive into detail while understanding the big picture
Nice-to-have:
- Extensive DataDog experience, with monitoring/dashboard expertise
- Participation in Kubernetes migration projects
- Previous experience as a C++ or Node.js developer
- BSc in Computer Science or related technical certifications
- Previous experience with cryptocurrencies / blockchains (a big advantage)
Skills
- Analytical Thinking
- AWS
- Communication Skills
- Development
- Python
- Software Engineering