[Remote] Senior Site Reliability Engineer

Remote · USA · Full-time

Note: This is a remote position open to candidates in the USA. You.com is building the AI search infrastructure that powers modern AI systems. As a Site Reliability Engineer, you will own parts of the reliability, observability, and incident response posture for You.com’s production services, ensuring uptime and developing tools for incident management.

Responsibilities

  • Instrument services end-to-end using OpenTelemetry metrics and structured logging to ensure every critical path is measurable
  • Develop and maintain SRE standards and patterns (instrumentation guidelines, incident playbooks, service templates) that engineering teams adopt by default in new and existing services
  • Build internal tooling and automation in Python, Bash and Terraform to improve deployment safety, reliability, and operational efficiency
  • Design and maintain actionable dashboards that surface real user impact, not vanity metrics, for service owners and leadership
  • Tune alerting rules continuously to maximize signal-to-noise ratio; tie alerts to SLO-based error-budget burn rates rather than arbitrary thresholds
  • Own reliability incident response end-to-end: detection, triage, communication, escalation, resolution, and stakeholder updates
  • Run blameless postmortems that focus on systemic contributing factors, not individual fault, and produce actionable remediation items with owners and deadlines
  • Track remediation follow-through as a first-class metric. Ensure postmortem action items are completed, not just documented
  • Continuously improve MTTD and MTTR by feeding incident learnings back into monitoring, runbooks, and automation
  • Collaborate with Customer Success to feed incident learnings back into monitoring, runbooks, and automation
  • Define meaningful SLOs for all production services grounded in critical user journeys, historical performance data, and business requirements
  • Eliminate alert fatigue by auditing, categorizing, and deprecating noisy or non-actionable alerts on a regular cadence
  • Help maintain incident management processes and playbooks

Skills

  • 2+ years of full-time experience in an SRE or similar role
  • 3+ years of experience working in AWS with EKS, GitHub Actions (GHA), and CI/CD
  • Strong hands-on experience with Git, Python, and Bash. Comfortable building production-grade automation and tooling
  • Experience establishing SRE practices across multiple teams (SLO definitions, alert hygiene, postmortem culture)
  • Experience building or maintaining Prometheus-based monitoring with Grafana dashboards
  • Demonstrated experience scoping and delivering infrastructure projects from proposal through production deployment
  • Demonstrated experience managing incidents and responding to service outages
  • Hands-on experience integrating AI with SRE efforts to improve reliability and development velocity
  • Demonstrated track record of collaborating with teams to define SLOs, instrument services against measurable SLIs, and operationalize error-budget burn-rate alerting that teams use independently to balance risk and delivery speed

Benefits

  • Hubs in San Francisco and New York City offering regular in-person gatherings and co-working sessions
  • Flexible PTO with U.S. holidays observed and a one-week shutdown in December to rest and recharge*
  • A competitive health insurance plan that covers 100% for the policyholder and 75% for dependents*
  • 12 weeks of paid parental leave in the US*
  • 401k program, 3% match - vested immediately!*
  • $500 work-from-home stipend to be used within a year of your start date*
  • $600 technology stipend to support a portion of our hybrid/remote team's cell phone and internet expenses*
  • $1,200 per year Health & Wellness Allowance to support your personal goals*
  • *Certain perks and benefits are limited to full-time employees only

Company Overview

  • You.com is a personalized AI search engine that delivers customized recommendations and allows natural conversation with its AI chatbot. It was founded in 2020, and is headquartered in Palo Alto, California, USA, with a workforce of 51-200 employees. Its website is https://you.com.