[Remote] Senior Site Reliability Engineer
Note: This is a remote position open to candidates in the USA. You.com is building the AI Search Infrastructure that powers modern AI systems. As a Site Reliability Engineer, you will own parts of the reliability, observability, and incident response posture for You.com’s production services, ensuring uptime and developing tools for incident management.
Responsibilities
- Instrument services end-to-end using OpenTelemetry metrics and structured logging to ensure every critical path is measurable
- Develop and maintain SRE standards and patterns (instrumentation guidelines, incident playbooks, service templates) that engineering teams adopt by default in new and existing services
- Build internal tooling and automation in Python, Bash and Terraform to improve deployment safety, reliability, and operational efficiency
- Design and maintain actionable dashboards that surface real user impact, not vanity metrics, for service owners and leadership
- Tune alerting rules continuously to maximize signal-to-noise ratio; tie alerts to SLO-based error-budget burn rates rather than arbitrary thresholds
- Own reliability incident response end-to-end: detection, triage, communication, escalation, resolution, and stakeholder updates
- Track and run blameless postmortems that focus on systemic contributing factors, not individual fault, producing actionable remediation items with owners and deadlines
- Track remediation follow-through as a first-class metric. Ensure postmortem action items are completed, not just documented
- Continuously improve MTTD and MTTR by feeding incident learnings back into monitoring, runbooks, and automation
- Collaborate with Customer Success to feed incident learnings back into monitoring, runbooks, and automation
- Define meaningful SLOs for all production services grounded in critical user journeys, historical performance data, and business requirements
- Eliminate alert fatigue by auditing, categorizing, and deprecating noisy or non-actionable alerts on a regular cadence
- Help manage incident management processes and playbooks
Skills
- 2+ years of full-time experience in an SRE or similar role
- 3+ years of experience working in AWS with EKS and GitHub Actions (GHA) for CI/CD
- Strong hands-on experience with Git, Python, and Bash. Comfortable building production-grade automation and tooling
- Experience establishing SRE practices across multiple teams (SLO definitions, alert hygiene, postmortem culture)
- Built or maintained Prometheus-based monitoring with Grafana dashboards
- Demonstrated experience scoping and delivering infrastructure projects from proposal through production deployment
- Demonstrated experience managing incidents and responding to service outages
- Hands-on experience integrating AI with SRE efforts to improve reliability, development and velocity
- Demonstrated track record of collaborating with teams to define SLOs, instrument services against measurable SLIs, and operationalize error-budget burn-rate alerting that teams use independently to balance risk and delivery speed
Benefits
- Hubs in San Francisco and New York City offering regular in-person gatherings and co-working sessions
- Flexible PTO with U.S. holidays observed and a week-long shutdown in December to rest and recharge*
- A competitive health insurance plan covering 100% for the policyholder and 75% for dependents*
- 12 weeks of paid parental leave in the US*
- 401k program, 3% match - vested immediately!*
- $500 work-from-home stipend to be used within a year of your start date*
- $600 technology stipend to support a portion of our hybrid/remote team's cell phone and internet expenses*
- $1,200 per year Health & Wellness Allowance to support your personal goals*
- *Certain perks and benefits are limited to full-time employees only
Company Overview