Site Reliability Engineer
LockedIn AI is the #1 real-time AI interview and meeting copilot, trusted by over 1 million users worldwide. We are building the most advanced AI-powered career preparation platform that helps users succeed in interviews, coding assessments, and professional communication.
Our platform delivers real-time AI assistance during live conversations, where reliability, speed, and uptime are mission-critical.
Role Overview
We are looking for a proactive, systems-minded Site Reliability Engineer (SRE) to ensure that LockedIn AI’s production systems are highly reliable, scalable, and performant.
This is a high-impact engineering role where system stability directly defines user experience. When users are in live interviews, latency and uptime are the product.
You will own the reliability of real-time AI infrastructure serving over 1 million users globally.
Key Responsibilities
1. System Reliability & Performance
- Own uptime, reliability, and performance across production systems
- Define and manage SLIs, SLOs, and error budgets
- Build fault-tolerant and self-healing architectures
- Optimize latency, throughput, and system efficiency
2. Infrastructure as Code & Cloud Systems
- Design and manage cloud infrastructure using Terraform, Pulumi, or CloudFormation
- Operate AWS, GCP, or Azure-based production environments
- Manage Kubernetes clusters and microservices infrastructure
- Optimize cloud costs while maintaining performance and reliability
3. Observability & Monitoring
- Build monitoring systems using Prometheus, Grafana, Datadog, or similar tools
- Design alerting systems with low noise and high accuracy
- Implement distributed tracing and centralized logging
- Monitor AI-specific metrics (latency, GPU usage, inference throughput)
4. Incident Response & Reliability Engineering
- Lead incident response for outages and production issues
- Participate in on-call rotations
- Conduct postmortems and root cause analysis
- Build runbooks and improve system resilience over time
5. CI/CD & Release Engineering
- Build and maintain CI/CD pipelines for fast and safe deployments
- Implement canary, blue-green, and rollback strategies
- Ensure safe deployment of application and AI model updates
- Improve deployment velocity without compromising stability
6. Security & Infrastructure Best Practices
- Implement secure infrastructure design (IAM, encryption, secrets management)
- Maintain compliance with privacy and security standards
- Manage vulnerability scanning and system hardening
- Ensure secure handling of user data across systems
Required Qualifications
Experience
- 3+ years in SRE, DevOps, or infrastructure engineering
- Experience managing production systems at scale
- Strong background in incident response and system reliability
- Experience working in fast-paced startup environments
Education
- Bachelor’s degree in Computer Science, Engineering, or related field
- Equivalent hands-on experience strongly considered
Technical Skills
- Strong programming skills (Python, Go, or similar)
- Experience with AWS, GCP, or Azure
- Kubernetes and Docker expertise
- Infrastructure as Code (Terraform, Pulumi, CloudFormation)
- CI/CD systems (GitHub Actions, GitLab CI, Jenkins, ArgoCD)
- Observability tools (Prometheus, Grafana, Datadog, ELK, etc.)
Soft Skills
- Strong reliability-first engineering mindset
- Calm and effective during production incidents
- Excellent communication and documentation skills
- Strong ownership and proactive problem-solving
Preferred Qualifications
- Experience with real-time AI or ML infrastructure
- Knowledge of GPU-based or inference-heavy systems
- Experience with streaming, WebSockets, or low-latency systems
- Familiarity with chaos engineering practices
- Multi-cloud or hybrid-cloud experience
- Experience in SaaS, edtech, or AI startups
- Open-source infrastructure contributions
What We Offer
- Equity: Meaningful early-stage ownership in a fast-growing AI company
- Impact: Your work directly supports over 1 million active users
- Team: Join a lean, high-performance engineering team
- Flexibility: Remote-first with optional hybrid work in New York
- Growth: Fast-paced startup environment with high ownership
- Culture: User-focused, feedback-driven, and execution-oriented
Why Join LockedIn AI?
- Category-defining AI interview copilot platform
- Massive and fast-growing AI career tech market
- Reliability directly impacts real-time user experience
- Work on cutting-edge AI infrastructure at scale
- High ownership and real production responsibility
How to Apply
Please submit:
- Resume / CV
- Short note covering:
  - Why you want to join LockedIn AI
  - Whether you’ve used the product
  - Ideas for improving reliability or performance
- Optional: GitHub, projects, or technical writing
Equal Opportunity Statement
LockedIn AI is committed to building a diverse and inclusive team. We welcome applicants from all backgrounds. Hiring decisions are based on merit, skills, and business needs.