Senior Site Reliability Engineer at Sana Commerce Latinoamérica
Cape Town, Western Cape, South Africa
Full Time


Start Date

Immediate

Expiry Date

26 Jan, 26

Salary

0.0

Posted On

28 Oct, 25

Experience

5 year(s) or above

Remote Job

Yes

Telecommute

Yes

Sponsor Visa

No

Skills

Site Reliability Engineering, DevOps, Cloud Infrastructure, Microsoft Azure, Kubernetes, Dynatrace, Honeycomb, ElasticSearch, Kibana, Grafana, Azure Monitor, PowerShell, Bash, Python, Terraform, Infrastructure-as-Code

Industry

Software Development

Description
Company Description

What started in 2007 with a pizza and a plan has grown into a fast-moving SaaS company that helps manufacturers, distributors, and wholesalers thrive in B2B commerce complexity. Our mission? To transform the way businesses buy and sell, so they can grow, build stronger relationships, and make the most of digital commerce. Join us and take ownership of your career in a dynamic, fast-moving environment.

At Sana Commerce, we're looking for a Senior Site Reliability Engineer (SRE) to strengthen our reliability, observability, and automation capabilities across our Azure and Kubernetes-based platforms. This role blends hands-on operational excellence with engineering practices, ensuring uptime today while building the systems that make tomorrow more resilient.

This SRE position focuses on engineering reliability in everything we do: automating repetitive tasks, improving monitoring signals, running deep root cause analysis, and shaping systems for scalability. You’ll be the engineer others look to during critical incidents, and the one raising the bar on how we prevent them in the first place.

What you'll get:
The opportunity to make an impact at a fast-growing SaaS scale-up.
A global and customized onboarding program (rated 9.1/10 by previous hires).
A hybrid working model: 3 days from the office, 2 days from home.

Job Description

What you'll be doing:
Lead incident response and root cause analysis by driving deep investigations, educating the team, and delivering actionable post-incident insights that prevent recurrence.
Manage Kubernetes and Azure environments by owning cluster configurations and platform usage, and by ensuring availability, cost efficiency, and security best practices.
Develop observability and monitoring strategies with Dynatrace, Honeycomb, ElasticSearch, Kibana/Grafana, and Azure Monitor to measure performance and user impact, and continuously refine alerts and dashboards.
Implement and maintain edge and CDN integrations (Fastly WAF, bot management, CDN) to enhance performance, security, and reliability of customer-facing services.
Write and debug automation scripts in PowerShell, Bash, Python, or C#, ensuring logging, rollback, and versioning practices make the platform more resilient and self-healing.
Drive Infrastructure-as-Code adoption with Terraform, Bicep, and ARM to standardize environments, automate deployments, and reduce manual interventions.
Optimize system and application performance through deep monitoring, dump analysis, and right-sizing of resources to eliminate bottlenecks and maximize efficiency.
Collaborate across teams to break down complex problems, contribute to CI/CD and SDLC improvements, and embed reliability into development and release pipelines.
Participate in the on-call rotation by taking ownership of incidents, coordinating responses, and ensuring sustainable fixes rather than temporary workarounds.

Qualifications

What you bring:
5+ years of experience in SRE, DevOps, or Cloud Infrastructure, with demonstrated ownership of large-scale systems.
Strong hands-on knowledge of Microsoft Azure services and practical experience operating Azure Kubernetes clusters in production.
Expertise in Dynatrace, Honeycomb, ElasticSearch, Kibana/Grafana, and Azure Monitor (KQL), and the ability to design actionable monitoring that leads to prevention, not just detection.
Proficiency in at least one programming/scripting language (PowerShell, Bash, Python, or C#), with strong debugging and logging practices.
Hands-on experience with Infrastructure-as-Code (Terraform, Bicep, or ARM) to automate and manage cloud infrastructure.
Solid understanding of TCP/IP protocols and troubleshooting network issues in distributed systems.
Ability to go beyond surface fixes, identify patterns, and engineer permanent improvements.
Strong communication skills, with the ability to work with cross-functional teams and explain complex issues simply.
Microsoft Certified: Azure Administrator Associate
CKA: Certified Kubernetes Administrator

Who we are:

So, what does it mean to be a part of the Sana Commerce team? At Sana Commerce, our values guide how we work, collaborate, and drive success.

Champions of Our League. "We deliver lasting success, balancing quick wins and long-term value." We take pride in our unique product and extensive B2B knowledge and continuously strive to improve. No matter our role, we bring value every day, helping our customers and partners succeed.

Supercharge Our Customers. "We’re revolutionizing B2B commerce together, helping our customers to lead and succeed." Our customers are at the heart of everything we do. We go beyond solutions, providing the tools and support they need to grow.

Determined to Grow. "We embrace challenges, growing and raising the bar for ourselves and our industry." We take on challenges, seek feedback, and keep learning. Every setback is a chance to improve and move forward.

Bold Together. "We dare to be bold because we have each other’s back." We collaborate across teams and time zones, challenge the status quo, and support each other to achieve the best outcomes.

At Sana Commerce, we’re committed to creating an inclusive environment because we know our diverse workforce is one of our greatest strengths. Apply now!

Additional Information

#LI-Hybrid

Responsibilities
Lead incident response and root cause analysis while managing Kubernetes and Azure environments. Develop observability strategies and implement automation to enhance system reliability and performance.