maycon@edge-infra:~$ whoami

Maycon Ritzmann

Staff Site Reliability Engineer

Designing and operating edge infrastructure at Latin American scale – Kubernetes, API Gateways, AWS, and traffic engineering for systems that can't afford to fail.

~8B daily requests
12+ years of experience
5+ years at iFood
Staff SRE level
Brazil · UTC-3 · Working remotely

Who I am

Staff Site Reliability / Production Engineer with 12+ years of experience operating and evolving mission-critical systems in high-scale cloud environments.

Strong background in incident leadership, reliability engineering, and cross-team collaboration to ensure availability, performance, and resilience under pressure. Experienced in designing and operating cloud-native platforms, leading modernization and migration initiatives, and improving system observability and operational maturity.

Hands-on expertise with AWS, Kubernetes, Terraform, CI/CD, and observability stacks – always with a focus on reducing MTTR, preventing incidents, and enabling engineering teams to move faster and safer.

Currently working as a Staff SRE at iFood – Latin America's largest food delivery platform – designing and operating edge and API gateway infrastructure handling ~8 billion daily requests.

Focus Areas

  • Edge & Traffic Infrastructure
  • API Gateway Platforms
  • Kubernetes & Cloud (AWS)
  • Incident Command & SRE
  • Observability & Reliability
  • Platform Automation (Python)

Languages

Portuguese · Native
English · Full Professional
French · Elementary

Education

DevOps Engineering (CST)

Anhanguera · 2022–2024

Computer Technician

UNISOCIESC · 2012–2013

Career timeline

iFood

São Paulo, Brazil

5 years 3 months

February 2021 – Present

Staff Site Reliability Engineer

current

June 2025 – Present

11 months

  • Architected and own iFood's API Gateway platform end-to-end, driving scalability and reliability for 2.2+ billion requests per month.
  • Designed and operate the Edge traffic platform, sustaining ~8 billion daily requests with low-latency, high-availability delivery at global scale.
  • Led performance and cost optimization across edge systems, cutting latency and error rates through blast radius reduction strategies and structured disaster recovery testing.
  • Deeply integrated Akamai and Cloudflare to improve edge security and acceleration, strengthening HTTP protocol resilience and CDN performance.
  • Designed custom Kubernetes ingress classes for HTTP and gRPC workloads, enabling stable long-lived connections and optimized per-application traffic handling.
  • Resolved complex network bottlenecks through Linux kernel and ENI-level tuning, improving throughput and stability under high-traffic scenarios.
  • Command incident response as technical lead for high-severity events, driving fast recovery and restoring system stability across distributed teams.

Staff Production Support Engineer

June 2023 – June 2025

2 years 1 month

  • Led incident response and recovery for high-traffic, business-critical systems, acting as escalation point and technical lead to minimize downtime during high-severity events.
  • Automated operational workflows and platform improvements, reducing manual intervention and accelerating recovery times across production systems.
  • Conducted root cause analyses and post-incident reviews, converting operational failures into engineering improvements that prevented recurrence.
  • Built an AI-powered internal chatbot in Python using LlamaIndex, streamlining operational workflows and reducing support friction for the team.
  • Architected and operated Kubernetes platforms on AWS, ensuring scalable, resilient, and predictable performance for production workloads.
  • Automated infrastructure provisioning with Terraform, Packer, and Ansible, standardizing environments and improving deployment safety at scale.
  • Evolved GitLab CI/CD pipelines to enable faster, more reliable releases, reducing deployment risk across critical services.
  • Owned observability strategy across production, integrating Datadog, Prometheus, and Grafana to accelerate detection and diagnosis of incidents.
  • Optimized Redis/ElastiCache, Kong API Gateway, Akamai, and NGINX, improving performance and reliability of key production components.
  • Mentored junior engineers on SRE practices and operational standards, raising the technical bar and improving team readiness for high-severity incidents.

Senior Production Support Engineer

February 2021 – June 2023

2 years 5 months

  • Operated and evolved Kubernetes-based platforms and AWS infrastructure supporting millions of daily active users.
  • Contributed to reliability, observability, and incident response practices across business-critical services.
  • Optimized performance and reliability of critical components including Redis/ElastiCache, Kong API Gateway, Akamai, and NGINX.

TOTVS

Joinville, Santa Catarina, Brazil

6 months

September 2020 – February 2021

Site Reliability Engineer

September 2020 – February 2021

6 months

  • Operated and evolved Kubernetes platforms on AWS (EKS, ECR, VPC, Route 53, CloudFront, CloudWatch), ensuring high availability and predictable deployments for critical applications.
  • Automated CI/CD pipelines with GitLab, Jenkins, and Bamboo, reducing deployment friction and improving release reliability across teams.
  • Implemented Terraform-based IaC across the platform, standardizing environments and eliminating configuration drift.
  • Built monitoring and alerting with Prometheus, Alertmanager, Grafana, and Zabbix, improving incident detection and response time.
  • Automated Linux-based infrastructure management with Ansible and Chef, improving operational consistency across environments.
  • Partnered with engineering teams to translate reliability requirements into practical platform improvements.

Eiti Solutions

Joinville, Brazil

1 year 8 months

October 2018 – May 2020

Infrastructure Analyst

October 2018 – May 2020

1 year 8 months

  • Operated and maintained business-critical infrastructure – networks, servers, and core services – ensuring continuous availability and operational stability.
  • Managed on-premises and hybrid Kubernetes clusters, building foundational container orchestration skills in high-criticality production environments.
  • Automated CI/CD workflows with Jenkins, Ansible, and Puppet, reducing manual operational effort and improving deployment consistency.
  • Built monitoring and observability with Zabbix, Grafana, and Prometheus, improving visibility into system health and accelerating issue detection.
  • Administered core network services including DNS, DHCP, and email platforms (Zimbra, Postfix), ensuring reliable communications infrastructure.
  • Operated Docker Swarm and Foreman environments, supporting application lifecycle management across containerized workloads.
  • Contributed to AWS and IBM Cloud adoption initiatives, supporting the team's transition to hybrid and cloud-native architectures.

Open Technology

Joinville, Brazil

4 years 9 months

January 2014 – September 2018

Network Administrator

April 2017 – September 2018

1 year 6 months

  • Operated AWS-based and on-premises infrastructure supporting Linux and Windows workloads across applications, databases, web, and email services.
  • Implemented Puppet-based IaC to standardize system configurations, reducing manual operations and configuration drift.
  • Designed and maintained production email platforms (Postfix, Dovecot, Roundcube, anti-spam), ensuring reliable communications for business workloads.
  • Automated repetitive operational tasks with shell scripting, improving team efficiency and reducing toil.
  • Deployed and maintained Zabbix monitoring and alerting, ensuring system availability and early detection of issues.
  • Administered Windows Server (2003–2016) and Linux (Debian, Ubuntu, CentOS) environments, managing OS lifecycle and system health.
  • Managed core network services – DNS, DHCP, domain controllers, file sharing, print, and remote access.
  • Designed firewall and network security configurations using iptables, proxies, and traffic control.
  • Defined and validated backup and disaster recovery routines, ensuring data protection and operational continuity.

Computer Technician

January 2014 – April 2017

3 years 4 months

  • Delivered hands-on technical support for servers, desktops, and notebooks in business environments, ensuring continuity of client operations.
  • Assembled, configured, and maintained hardware systems to operational and client specifications, developing strong foundations in system internals.
  • Troubleshot networking and OS issues across Linux and Windows environments, building early expertise in TCP/IP and system operations.
  • Advised clients on technical requirements and equipment selection, developing customer-facing communication and problem-solving skills.

Technical toolkit

Traffic & Edge

NGINX · Kong API Gateway · Akamai · Cloudflare · Envoy · gRPC · HTTP/2 · NGINX Ingress

Cloud – AWS

EKS · EC2 · VPC · Route 53 · CloudFront · ElastiCache · ECR · CloudWatch · RDS · EBS · SQS · S3

Infrastructure & IaC

Kubernetes · Terraform · Ansible · Packer · Puppet · Docker · Linux · Helm

Observability

Datadog · Prometheus · Grafana · Alertmanager · Zabbix · OpenTelemetry

CI/CD & Automation

GitLab CI · Jenkins · Bamboo · Python · Node.js · Bash / Shell · LlamaIndex

Networking

TCP/IP · iptables · Service Mesh · Content Delivery Networks · DNS

SRE Practices

Incident Command · Production Engineering · Post-mortem Culture · Capacity Planning · Blast Radius Reduction · Disaster Recovery · Distributed Systems

Certifications

Grouped by domain – same stack as the toolkit above.

Kubernetes & cloud

  • Certified Kubernetes Administrator (CKA) · The Linux Foundation
  • Architecting on AWS · Amazon Web Services
  • Cloud Practitioner Essentials · Amazon Web Services

Infrastructure & platforms

  • HashiCorp Certified: Terraform Associate · HashiCorp
  • LPIC-1 · Linux Professional Institute

Observability

  • Datadog Certified: Log Management Fundamentals · Datadog

Things I'm building

Side projects and public initiatives – not day-job work, but where I invest in the broader SRE and platform community.

  • SRE Labs

    Live

    Hands-on learning for SRE, DevOps, cloud, and Kubernetes – challenges, community, and portfolio-ready work.

    About

    SRE Labs is my public education project: a place to grow a serious, practical learning ecosystem (not just courses) for people entering reliability and platform roles. The focus is real exercises, peer community, and eventually cohorts and self-service labs – so newcomers can go from fundamentals to job-ready with evidence they can ship in production.

    Community · Education · Kubernetes · DevOps
  • Practical SRE Mentorship

    Live

    A guided 30-day production-style challenge with four 1:1 live sessions – for developers and SREs who want real practice, not only courses and certs.

    About

    One-to-one mentorship built around incidents, observability, and reliability trade-offs you can defend in interviews. You get a tailored challenge, a GitHub playbook, help shaping your own study lab, and live video calls, so the work becomes a portfolio case aligned with what hiring teams actually ask for in production roles.

    Mentorship · SRE · Education · Career

Let's connect

I'm always open to interesting conversations about infrastructure, reliability engineering, and challenging problems at scale.

Whether it's a technical discussion or just to say hello – feel free to reach out.

$ curl -X POST --data '"opportunity"' mayconritzmann@gmail.com

200 OK · will reply within 24h

or reach me at +55 47 99997-4698