Senior Site Reliability Engineer at Zefr
About Zefr
Zefr is the leading global technology company enabling responsible marketing in walled garden social environments. Our solutions empower brands to manage their content adjacency on platforms like YouTube, Meta, TikTok, and Snap, in accordance with industry standards. Using patented AI technology, we provide more accurate and transparent solutions for social walled gardens. Zefr is headquartered in Los Angeles, California, with additional locations worldwide.
What You'll Do
As a Site Reliability Engineer at Zefr, you will apply your expertise in cloud infrastructure, CI/CD, observability, and core SRE principles to deliver reliable, scalable solutions. You will collaborate closely with our Machine Learning team to ensure robust infrastructure for model training, deployment, and serving. We are looking for someone with technical depth, leadership skills, and a passion for continuous innovation who will help maintain the health and efficiency of the infrastructure supporting our ML workloads.
- Support and develop systems and tools for rapid and safe deployment and management of features and models.
- Deploy and support multi-cloud, microservice architectures, including ML-specific infrastructure, using GitHub Actions, Argo CD, and Kubernetes.
- Work with engineering teams to design secure, resilient, scalable, and cost-effective applications and ML pipelines in AWS and GCP.
- Promote DevOps culture and continuous improvement across teams.
- Maintain production environment health and monitor ML model performance and resource use.
- Participate in 24/7 on-call rotations and respond to outages and performance issues.
- Debug application and infrastructure code.
- Enhance CI/CD workflows and release processes.
- Research and propose innovative solutions.
- Review and propose changes to engineering architecture via RFCs.
Technology Stack at Zefr
Core Infrastructure & Cloud Platforms
- GCP, AWS
- Terraform
- Docker, Kubernetes (GKE/EKS), Helm, Kustomize
- Istio
CI/CD & Automation
- GitHub Actions
- Argo CD
- Python
Observability & Monitoring
- Prometheus, Datadog, PagerDuty
- OpenTelemetry
Application & Data Ecosystem
- Python, FastAPI, Flask, Node.js, React
- Apache Kafka, pandas, dbt, Airflow, Ray
- ML Stack: Triton, Weights & Biases, DVC, Transformers, Hugging Face, ONNX, TensorRT
Data Stores & Databases
- PostgreSQL, DynamoDB, OpenSearch, Qdrant, Redis, Snowflake
Qualifications
- 6+ years managing cloud infrastructure in production, with AWS or GCP experience required.
- Experience deploying container workloads with Kubernetes.
- At least 1 year in ML infrastructure development and operations.
- Knowledge of GitOps, CI/CD pipelines, and IaC tools.
- Strong problem-solving skills focused on automation.
- Experience with monitoring and observability tools.
- Understanding of cloud networking concepts.
- Excellent communication and organizational skills.
Benefits & Compensation
For US-based employees, benefits include flexible PTO, health insurance, life insurance, parental leave, 401(k), professional development, paid holidays, summer Fridays, hybrid work, and more. Salary range: $150,000 - $170,000, dependent on experience and skills.
Additional Info
This is a senior-level, full-time position in engineering and IT, based in Marina del Rey, CA, or remote, with preference for candidates in California.