Job Description

Summary

We are seeking a Site Reliability Engineer (SRE) to join our team in Singapore.

WHAT YOU’LL DO:

  1. Keep your assigned site or service up and running, and recover rapidly from failures
  2. Actively troubleshoot any issues that arise during testing and production, catching and solving them before launch
  3. Automate work including infrastructure needs, testing, failover solutions, failure mitigation, and much more
  4. Monitor and troubleshoot highly scalable and distributed server clusters that perform various functions, from web servers to machine learning processing
  5. Be on a PagerDuty rotation to respond to availability incidents and support service engineers with customer incidents
  6. Participate in and establish best practices in Site Reliability Engineering
  7. Manage code deployments, fixes, updates, and related processes
  8. Work with a close-knit team and brainstorm on the best ways to tackle complex problems in infrastructure, security, and monitoring
  9. Provide technical guidance and educate team members and coworkers on monitoring and logging. (Have an interesting idea or solution? Present it!)
  10. Automate any software maintenance processes that previously required a manual procedure

WHAT YOU'LL BRING: 

  1. 5+ years’ experience with software engineering, software development, or system operations in highly available, high-traffic environments
  2. Strong experience with Linux-based infrastructures, Linux/Unix administration, Azure and AWS
  3. Experience with databases such as PostgreSQL
  4. Experience administering Linux servers as well as Docker-based infrastructure (like Kubernetes, AKS, etc.) in a highly available environment
  5. Experience with scripting languages such as TypeScript, Java, Bash
  6. Experience with message broker/queue technologies like RabbitMQ, AMQP 1.0
  7. Experience with modern monitoring, logging, and observability tools for complex distributed systems, such as Application Insights, Grafana, New Relic, Splunk, Elastic Stack, Datadog, Prometheus, etc.
  8. Practical experience with infrastructure-as-code (with tools like Terraform, Chef, Ansible, etc.)
  9. Good understanding of cybersecurity fundamentals and best practices
  10. Stellar problem-solving and troubleshooting skills with the ability to spot issues before they become problems
  11. Excellent communication skills
  12. Committed to processes, with excellent documentation skills and a strong ability to work well in a team!

Skills
  • AWS
  • Communication Skills
  • Cybersecurity Solutions
  • Database Management
  • Development
  • Problem Solving
  • Software Engineering
  • SQL
  • TypeScript