Job Details

Senior Site Reliability Engineer

2025-06-25 | ZEFR | All cities, CA

Description:

About Zefr

Zefr is the leading global technology company enabling responsible marketing in walled-garden social environments. Our solutions empower brands to manage their content adjacency on platforms like YouTube, Meta, TikTok, and Snap, in accordance with industry standards. Using patented AI technology, we provide more accurate and transparent solutions for social walled gardens. Zefr is headquartered in Los Angeles, California, and has additional locations worldwide.

What You'll Do

As a Senior Site Reliability Engineer at Zefr, you will apply your expertise in cloud infrastructure, CI/CD, observability, and core SRE principles to deliver reliable, scalable solutions. You will collaborate closely with our Machine Learning team to ensure robust infrastructure for model training, deployment, and serving. We are looking for someone with strong technical skills, leadership ability, and a passion for continuous innovation who will help maintain the health and efficiency of the infrastructure that supports our ML workloads.

Support and develop systems and tools for rapid and safe deployment and management of features and models.
Deploy and support multi-cloud microservice architectures, including ML-specific infrastructure, using GitHub Actions, Argo CD, and Kubernetes.
Work with engineering teams to design secure, resilient, scalable, and cost-effective applications and ML pipelines in AWS and GCP.
Promote DevOps culture and continuous improvement across teams.
Maintain production environment health and monitor ML model performance and resource usage.
Participate in 24/7 on-call rotations and respond to outages and performance issues.
Debug application and infrastructure code.
Enhance CI/CD workflows and release processes.
Research and propose innovative solutions.
Review and propose changes to engineering architecture via RFCs.

Technology Stack at Zefr

Core Infrastructure & Cloud Platforms

GCP, AWS
Terraform
Docker, Kubernetes (GKE/EKS), Helm, Kustomize
Istio

CI/CD & Automation

GitHub Actions
Argo CD
Python

Observability & Monitoring

Prometheus, Datadog, PagerDuty
OpenTelemetry

Application & Data Ecosystem

Python, FastAPI, Flask, Node.js, React
Apache Kafka, pandas, dbt, Airflow, Ray
ML Stack: Triton, Weights & Biases, DVC, Transformers, Hugging Face, ONNX, TensorRT

Data Stores & Databases

PostgreSQL, DynamoDB, OpenSearch, Qdrant, Redis, Snowflake

Qualifications

6+ years managing cloud infrastructure in production, with AWS or GCP experience required.
Experience deploying container workloads with Kubernetes.
At least 1 year of experience in ML infrastructure development and operations.
Knowledge of GitOps, CI/CD pipelines, and IaC tools.
Strong problem-solving skills focused on automation.
Experience with monitoring and observability tools.
Understanding of cloud networking concepts.
Excellent communication and organizational skills.

Benefits & Compensation

For US-based employees, benefits include flexible PTO, health insurance, life insurance, parental leave, a 401(k), professional development, paid holidays, summer Fridays, hybrid work, and more. The salary range is $150,000 to $170,000, depending on experience and skills.

Additional Info

Senior-level, full-time position in the engineering and IT industry, based in Marina del Rey, CA, or remote, with a preference for candidates in California.
