Role
Reporting to the Head of Engineering, this individual will work closely with the QA and Customer Success teams to ensure that the product is delivered to customers successfully and with high quality, and that it remains secure and always available.
- Design and maintain fault-tolerant, scalable, and reliable systems to meet uptime goals.
- Implement and manage monitoring, logging, and alerting systems.
- Optimize infrastructure and application performance on cloud platforms (AWS, Azure, GCP).
- Build and maintain CI/CD pipelines for automated deployments using tools like Jenkins or GitLab CI.
- Automate operational tasks using scripting (Python, Bash) and infrastructure-as-code tools (Terraform, Ansible).
- Collaborate with development and operations teams to improve reliability and resolve incidents.
- Participate in on-call rotations to support 24x7 system availability, responding promptly to production issues and resolving them.
Required Skills
- Platforms: Strong experience in Linux/Unix system administration and cloud platforms (AWS, Azure, or GCP).
- Tools: Proficiency in containerization and orchestration tools (Docker, Kubernetes).
- Network and security: Hands-on experience with networking concepts (DNS, HTTP, load balancing) and security best practices.
- Scripting: Advanced scripting skills (Python, Groovy) and familiarity with at least one programming language (e.g., Go, Java).
- Monitoring and log analysis: Expertise in monitoring and observability tools (e.g., Datadog, Nagios, or Splunk).
- Deployment: Solid understanding of CI/CD principles and tools (e.g., Jenkins, Git).
- Mindset: Must be driven to deliver a robust, reliable, and secure environment that is always available to customers.
Qualifications
- 2-4 years of experience
- Show us your work to date