maycon@edge-infra:~$ whoami

Maycon Ritzmann

Staff Site Reliability Engineer

Systems fail. What matters is how fast you understand why and what you do about it. That's what I've spent 12 years learning.

~8B daily requests
12+ years of experience
5+ years at iFood
Staff SRE level
Brazil · UTC-3 · Working remotely

Who I am

I started as a computer technician fixing machines for small businesses in Joinville. No plan, just curiosity and a need to understand how things work. Twelve years later I'm operating edge infrastructure for Latin America's largest food delivery platform. And I still approach every system the same way: what breaks, why, and how do we make sure it doesn't happen again.

Staff Site Reliability / Production Engineer with 12+ years of experience operating and evolving mission-critical systems in high-scale cloud environments.

Strong background in incident leadership, reliability engineering, and cross-team collaboration to ensure availability, performance, and resilience under pressure. Experienced in designing and operating cloud-native platforms, leading modernization and migration initiatives, and improving system observability and operational maturity.

Hands-on expertise with AWS, Kubernetes, Terraform, CI/CD, and observability stacks – always with a focus on reducing MTTR, preventing incidents, and enabling engineering teams to move faster and more safely. Beyond the technical work, I enjoy mentoring engineers and helping teams raise the operational bar – reliability depends as much on communication and ownership as it does on tooling.

Currently working as a Staff SRE at iFood – Latin America's largest food delivery platform – designing and operating edge and API gateway infrastructure handling ~8 billion daily requests.

Focus Areas

  • Edge & Traffic Infrastructure
  • API Gateway Platforms
  • Kubernetes & Cloud (AWS)
  • Incident Command & SRE
  • Observability & Reliability
  • Platform Automation (Python)

Languages

Portuguese · Native
English · Full Professional
French · Elementary

Education

DevOps Engineering (CST)

Anhanguera · 2022–2024

Computer Technician

UNISOCIESC · 2012–2013

Career timeline

iFood

São Paulo, Brazil · Remote

4 years 7 months

October 2021 – Present

This is where scale stopped being abstract. Every decision carries weight when a bug means millions of people can't order their next meal.

Staff Site Reliability Engineer

current

June 2025 – Present

11 months

Member of the Layer 7 SRE team, focused on designing, operating, and evolving large-scale traffic and edge infrastructure supporting critical business flows.

  • Technical owner of iFood's API Gateway platform, leading scalability, reliability, and performance initiatives for systems handling 2.2+ billion requests per month.
  • Architect and owner of the Edge traffic platform, responsible for ~8 billion daily requests, ensuring low-latency, high-availability traffic management at massive scale.
  • Led performance, reliability, and cost optimization initiatives, reducing latency and error rates while implementing blast radius reduction strategies and disaster recovery testing.
  • Drove edge security and acceleration improvements through deep integration with Akamai and Cloudflare, enhancing HTTP protocol handling, resilience, and content delivery performance.
  • Designed and optimized advanced Kubernetes ingress architectures, including custom ingress classes for HTTP and gRPC traffic, tuned for long-lived connections and application-specific workloads.
  • Led large-scale network troubleshooting and optimization efforts, addressing traffic shaping, high-throughput scenarios, and performance bottlenecks through Linux kernel and ENI-level tuning.
  • Act as incident lead and escalation point during high-severity production events, ensuring fast recovery and system stability across distributed teams.
  • Participate in 24/7 on-call rotation as part of shared ownership of production reliability.

Staff Production Support Engineer

April 2022 – June 2025

3 years 2 months

Staff-level engineer responsible for reliability, scalability, and operational excellence of mission-critical production systems supporting millions of users.

  • Owned incident response and operational reliability for high-traffic, business-critical systems, acting as escalation point and technical lead during high-severity events.
  • Drove operational maturity through automation and platform improvements, reducing manual intervention and improving recovery times.
  • Led root cause analysis and post-incident reviews, translating operational failures into concrete engineering improvements.
  • Designed and implemented automation solutions in Python, including an internal chatbot leveraging LlamaIndex to streamline operational workflows and reduce support friction.
  • Architected and operated Kubernetes-based platforms and AWS infrastructure, ensuring scalability, resilience, and predictable performance.
  • Led infrastructure automation initiatives using Terraform, Packer, and Ansible to standardize environments and improve deployment safety.
  • Evolved CI/CD pipelines (GitLab) to enable faster and more reliable releases with reduced operational risk.
  • Owned observability strategy and improvements, integrating Datadog, Prometheus, Grafana, and logging systems to improve detection and diagnosis of production issues.
  • Optimized performance and reliability of critical components, including Redis/ElastiCache, Kong API Gateway, Akamai, and NGINX.
  • Mentored and coached junior engineers, raising the overall operational and technical bar of the team.

Senior Production Support Engineer

October 2021 – April 2022

7 months

  • Operated and evolved Kubernetes-based platforms and AWS infrastructure supporting millions of daily active users.
  • Contributed to reliability, observability, and incident response practices across business-critical services.
  • Optimized performance and reliability of critical components including Redis/ElastiCache, Kong API Gateway, Akamai, and NGINX.

Claranet

São Paulo, Brazil · Remote

9 months

February 2021 – October 2021

My entry point into iFood's world, operating at a scale I'd never seen before and learning fast that production is unforgiving.

Senior System Engineer

February 2021 – October 2021

9 months

Worked as a dedicated Production Support Engineer fully allocated to iFood, supporting high-traffic production environments and critical platform services.

  • Operated Kubernetes clusters (kOps and EC2-based) and AWS infrastructure supporting iFood's high-traffic production environment, ensuring availability and operational stability.
  • Contributed to reliability, observability, and incident response practices across critical platform services using Prometheus, Grafana, and Zabbix.
  • Built and maintained CI/CD pipelines with Jenkins and Terraform-based IaC, improving deployment consistency and reducing configuration drift.
  • Supported distributed workloads running on AWS (EC2, SQS, SNS, Route 53, VPC), troubleshooting production incidents and improving system reliability.

TOTVS

Joinville, Santa Catarina, Brazil · Remote

6 months

September 2020 – February 2021

Where I first learned to think in terms of reliability, not just uptime. There is a difference.

Site Reliability Engineer

September 2020 – February 2021

6 months

  • Operated and improved Kubernetes-based platforms on AWS, ensuring high availability, performance, and predictable deployments for critical applications.
  • Contributed to cloud infrastructure design and evolution, working with AWS services such as EKS, ECR, VPC, Route 53, CloudFront, and CloudWatch.
  • Automated CI/CD pipelines using GitLab, Jenkins, and Bamboo, reducing deployment friction and improving release reliability.
  • Implemented Infrastructure as Code practices with Terraform to standardize environments and reduce configuration drift.
  • Improved monitoring, alerting, and observability, integrating Prometheus, Alertmanager, Grafana, and Zabbix to enhance incident detection and response.
  • Supported and automated Linux-based infrastructure, leveraging Ansible and Chef to improve operational consistency.
  • Collaborated closely with engineering teams to translate reliability requirements into practical platform improvements.

SoftExpert

Joinville, Brazil · Hybrid

6 months

April 2020 – September 2020

A turning point: migrating legacy infrastructure to Kubernetes and seeing firsthand what modern platforms can do.

DevOps Engineer

April 2020 – September 2020

6 months

Worked on the evolution of infrastructure supporting enterprise SaaS applications, with focus on reliability, scalability, and cloud adoption.

  • Led migration of on-premises workloads to Kubernetes on AWS (EKS), transitioning from legacy infrastructure to a more scalable and resilient architecture.
  • Defined and implemented Terraform standards for infrastructure as code, establishing reusable and consistent cloud provisioning practices.
  • Automated Linux-based infrastructure and CI/CD pipelines, improving operational consistency and deployment reliability.
  • Built monitoring and observability using Prometheus, Grafana, Alertmanager, and Loki, improving detection and response to production issues.

Eiti Solutions

Joinville, Brazil

1 year 6 months

October 2018 – March 2020

Where the foundations were built. Kubernetes before it was mainstream, and a deep respect for how critical systems behave under pressure.

Infrastructure Analyst

October 2018 – March 2020

1 year 6 months

  • Operated and maintained business-critical infrastructure – networks, servers, and core services – ensuring continuous availability and operational stability.
  • Managed on-premises and hybrid Kubernetes clusters, building foundational container orchestration skills in high-criticality production environments.
  • Automated CI/CD workflows with Jenkins, Ansible, and Puppet, reducing manual operational effort and improving deployment consistency.
  • Built monitoring and observability with Zabbix, Grafana, and Prometheus, improving visibility into system health and accelerating issue detection.
  • Administered core network services including DNS, DHCP, and email platforms (Zimbra, Postfix), ensuring reliable communications infrastructure.
  • Operated Docker Swarm and Foreman environments, supporting application lifecycle management across containerized workloads.
  • Contributed to AWS and IBM Cloud adoption initiatives, supporting the team's transition to hybrid and cloud-native architectures.

Open Technology

Joinville, Brazil

4 years 9 months

January 2014 – September 2018

Twelve years ago, this is where it started. Fixing computers, managing networks, learning that every system tells you what it needs if you listen.

Network Administrator

April 2017 – September 2018

1 year 6 months

  • Operated AWS-based and on-premises infrastructure supporting Linux and Windows workloads across applications, databases, web, and email services.
  • Implemented Puppet-based IaC to standardize system configurations, reducing manual operations and configuration drift.
  • Designed and maintained production email platforms (Postfix, Dovecot, Roundcube, anti-spam), ensuring reliable communications for business workloads.
  • Automated repetitive operational tasks with shell scripting, improving team efficiency and reducing toil.
  • Deployed and maintained Zabbix monitoring and alerting, ensuring system availability and early detection of issues.
  • Administered Windows Server (2003–2016) and Linux (Debian, Ubuntu, CentOS) environments, managing OS lifecycle and system health.
  • Managed core network services – DNS, DHCP, domain controllers, file sharing, print, and remote access.
  • Designed firewall and network security configurations using iptables, proxies, and traffic control.
  • Defined and validated backup and disaster recovery routines, ensuring data protection and operational continuity.

Computer Technician

January 2014 – April 2017

3 years 4 months

  • Delivered hands-on technical support for servers, desktops, and notebooks in business environments, ensuring continuity of client operations.
  • Assembled, configured, and maintained hardware systems to operational and client specifications, developing strong foundations in system internals.
  • Troubleshot networking and OS issues across Linux and Windows environments, building early expertise in TCP/IP and system operations.
  • Advised clients on technical requirements and equipment selection, developing customer-facing communication and problem-solving skills.

Technical toolkit

Traffic & Edge

NGINX · Kong API Gateway · Akamai · Cloudflare · Envoy · gRPC · HTTP/2 · NGINX Ingress

Cloud – AWS

EKS · EC2 · VPC · Route 53 · CloudFront · ElastiCache · ECR · CloudWatch · RDS · EBS · SQS · S3

Infrastructure & IaC

Kubernetes · Terraform · Ansible · Packer · Puppet · Docker · Linux · Helm

Observability

Datadog · Prometheus · Grafana · Alertmanager · Zabbix · OpenTelemetry

CI/CD & Automation

GitLab CI · Jenkins · Bamboo · Python · Go · Node.js · Bash / Shell · LlamaIndex

Networking

TCP/IP · iptables · Service Mesh · Content Delivery Networks · DNS

SRE Practices

Incident Command · Production Engineering · Post-mortem Culture · Capacity Planning · Blast Radius Reduction · Disaster Recovery · Distributed Systems

Certifications

Grouped by domain – same stack as the toolkit above.

Kubernetes & cloud

  • CKA

    Certified Kubernetes Administrator

    The Linux Foundation

  • AWS

    Architecting on AWS

    Amazon Web Services

  • AWS

    Cloud Practitioner Essentials

    Amazon Web Services

Infrastructure & platforms

  • TF

    HashiCorp Certified: Terraform Associate

    HashiCorp

  • LPI

    LPIC-1

    Linux Professional Institute

Observability

  • DD

    Datadog Certified: Log Management Fundamentals

    Datadog

Things I'm building

Side projects and public initiatives – not day-job work, but where I invest in the broader SRE and platform community.

  • SRE Labs

    Live

    Hands-on learning for SRE, DevOps, cloud, and Kubernetes – challenges, community, and portfolio-ready work.

    I built SRE Labs because I couldn't find a place that taught reliability the way I learned it: by doing.

    About

    SRE Labs is my public education project: a place to grow a serious, practical learning ecosystem (not just courses) for people entering reliability and platform roles. The focus is real exercises, peer community, and eventually cohorts and self-service labs – so newcomers can go from fundamentals to job-ready, with evidence that they can ship in production.

    Community · Education · Kubernetes · DevOps

  • Practical SRE Mentorship

    Live

    A guided 30-day production-style challenge with four 1:1 live sessions – for developers and SREs who want real practice, not only courses and certs.

    One-to-one because that's how I grew: someone showing me the gap between what I thought I knew and what production actually requires.

    About

    One-to-one mentorship built around incidents, observability, and reliability trade-offs you can defend in interviews. You get a tailored challenge, a GitHub playbook, help shaping your own study lab, and live video calls so the work becomes a portfolio case aligned with what hiring teams actually ask in production roles.

    Mentorship · SRE · Education · Career

Let's connect

I'm always open to interesting conversations about infrastructure, reliability engineering, and challenging problems at scale.

Whether it's a technical discussion or just to say hello – feel free to reach out.

$ curl -X POST \
    --data '"opportunity"' \
    mayconritzmann@gmail.com

200 OK · will reply within 24h

or reach me at +55 47 99997-4698