Monitoring & Performance:
- Design, implement, and optimize monitoring solutions to track system uptime, performance, and error rates using tools like Grafana, Prometheus, Loki, Tempo, and Mimir.
- Define and implement SLAs, SLOs, and SLIs for system performance, ensuring alignment with business requirements.
- Create and maintain reports on key system metrics, providing visibility to internal teams and stakeholders.

Third-Party Services Management:
- Manage and monitor third-party services critical to our platform, ensuring that service-level agreements are met and dependencies are properly tracked.
- Collaborate with the development team to identify and mitigate risks associated with external services.

Automation & Incident Management:
- Lead the automation of processes related to system deployments, scaling, and fault recovery.
- Act as first-line support for development teams, troubleshooting and resolving operational issues efficiently.
- Implement incident response protocols and help minimize downtime through proactive measures and automation.

Kubernetes & Infrastructure Management:
- Oversee and optimize Kubernetes clusters, ensuring they are secure, scalable, and reliable.
- Manage Cert-Manager and Ingress-Nginx configurations for secure handling of TLS certificates and ingress traffic.
- Work to enhance the reliability of our infrastructure built on Google Cloud, PostgreSQL, BigQuery, and Cloud Storage.

Collaboration & Reporting:
- Work closely with the Cloud Engineer in Denmark to manage cloud resources and improve system reliability.
- Build reports and dashboards to support business intelligence and give stakeholders visibility into operational health.

Cloud & Infrastructure:
- Strong experience with Google Cloud (or similar cloud platforms), including services like BigQuery, Pub/Sub, Cloud Storage, and Memorystore (Redis).
- Hands-on experience managing Kubernetes clusters, with knowledge of tools such as ArgoCD, Cert-Manager, and Ingress-Nginx.

Monitoring & Observability Tools:
- High-level experience with tools like Grafana, Prometheus, Loki, Tempo, Mimir, and Promtail for monitoring, logging, and tracing.
- Familiarity with Elasticsearch and Kibana for logging and searching is a plus.

Automation & CI/CD:
- Strong skills in automating system management, deployments, and scaling using Docker, ArgoCD, and GitHub.
- Experience with CI/CD pipelines to streamline the deployment of new code and infrastructure.

Development & Scripting:
- Familiarity with TypeScript, Node.js, React, and React Native for troubleshooting and collaborating with development teams.
- Ability to write scripts for automation and monitoring configuration.

- Infrastructure: Google Cloud, PostgreSQL, Memorystore (Redis), Pub/Sub, Kubernetes, BigQuery, ArgoCD, Docker, Cloudflare.
- Monitoring & Logging: Grafana, Prometheus, Loki, Tempo, Promtail.
- Tools: GitHub, Jira, Confluence.
- Languages & Frameworks: TypeScript, Node.js, React, React Native.
- Danish Team: 4 Developers, 1 Cloud Engineer (collaboration with the Cloud Engineer is key)
- Bolivia Team: 4 Developers
- Proven experience in Site Reliability Engineering or similar roles, preferably within cloud-based environments.
- Strong troubleshooting and problem-solving skills, with a focus on ensuring system availability and performance.
- Excellent communication and collaboration skills, with experience working in distributed teams.
- A proactive mindset and a desire to improve system reliability through automation and innovative solutions.
- Competitive salary and benefits package.
- Opportunity to work in a collaborative, fast-paced environment with teams in Bolivia and Denmark.
- Career growth opportunities and exposure to cutting-edge technologies in cloud infrastructure, automation, and observability.

We look forward to hearing from you.
If you have any questions about the position, please feel free to contact Guillermo.
WhatsApp: +591 76944464
Email: gumvi@jfmedier.dk