Job Description
Summary
We are seeking a new Site Reliability Engineer (SRE) to join our team in Singapore.
WHAT YOU’LL DO:
- Keep your assigned site or service up and running, and recover rapidly from failures
- Actively troubleshoot any issues that arise during testing and production, catching and solving issues before launch
- Automate work, including infrastructure needs, testing, failover solutions, failure mitigation, and much more
- Monitor and troubleshoot highly scalable and distributed server clusters that perform various functions, from web servers to machine learning processing
- Be on a PagerDuty rotation to respond to availability incidents and provide support for service engineers with customer incidents
- Participate in and help establish best practices in Site Reliability Engineering
- Manage code deployments, fixes, updates, and related processes
- Work with a close-knit team and brainstorm on the best ways to tackle complex problems in infrastructure, security, and monitoring
- Provide technical guidance and educate team members and coworkers on monitoring and logging (Have an interesting idea or solution? Present it!)
- Automate any software maintenance processes that previously required a manual procedure
WHAT YOU’LL BRING:
- 5+ years’ experience with software engineering, software development, or system operations in highly available, high-traffic environments
- Strong experience with Linux-based infrastructures, Linux/Unix administration, Azure and AWS
- Experience with databases such as PostgreSQL
- Experience administering Linux servers as well as Docker-based infrastructure (such as Kubernetes, AKS, etc.) in a highly available environment
- Experience with programming and scripting languages such as TypeScript, Java, and Bash
- Experience with message broker/queue technologies and protocols such as RabbitMQ and AMQP 1.0
- Experience with modern monitoring, logging, and observability tools for complex distributed systems, such as Application Insights, Grafana, New Relic, Splunk, Elastic Stack, Datadog, Prometheus, etc.
- Practical experience with infrastructure-as-code (with tools like Terraform, Chef, Ansible, etc.)
- Good understanding of cybersecurity fundamentals and best practices
- Stellar problem-solving and troubleshooting skills with the ability to spot issues before they become problems
- Excellent communication skills
- Commitment to process, with excellent documentation skills and a strong ability to work well in a team
Skills
- AWS
- Communication Skills
- Cybersecurity Solutions
- Database Management
- Development
- Problem Solving
- Software Engineering
- SQL
- TypeScript