Site Reliability Engineer (SRE)

Posted November 19, 2024 by Admin in Job Bolivia

We are seeking a talented and motivated Site Reliability Engineer (SRE) to join our team in Cochabamba, Bolivia. As an SRE, you will be responsible for ensuring the reliability, performance, and scalability of our cloud infrastructure and services. You will work closely with our development teams and cloud engineering team (based in Denmark) to implement robust monitoring, automation, and incident response strategies. Your role will be key in improving the uptime, performance, and operational efficiency of our platform.

Key Responsibilities:
  • Monitoring & Performance:
    • Design, implement, and optimize monitoring solutions to track system uptime, performance, and error rates using tools like Grafana, Prometheus, Loki, Tempo, and Mimir.
    • Define and implement SLAs, SLOs, and SLIs for system performance, ensuring alignment with business requirements.
    • Create and maintain reports on key system metrics, providing visibility to internal teams and stakeholders.

  • Third-Party Services Management:
    • Manage and monitor third-party services critical to our platform, ensuring that service-level agreements are met and dependencies are properly tracked.
    • Collaborate with the development team to identify and mitigate risks associated with external services.

  • Automation & Incident Management:
    • Lead the automation of processes related to system deployments, scaling, and fault recovery.
    • Act as first-line support for development teams, troubleshooting and resolving operational issues efficiently.
    • Implement incident response protocols and help minimize downtime through proactive measures and automation.

  • Kubernetes & Infrastructure Management:
    • Oversee and optimize Kubernetes clusters, ensuring they are secure, scalable, and reliable.
    • Manage Cert-Manager and Ingress-Nginx configurations for secure handling of TLS certificates and ingress traffic.
    • Work to enhance the reliability of our infrastructure built on Google Cloud, PostgreSQL, BigQuery, and Cloud Storage.

  • Collaboration & Reporting:
    • Work closely with the Cloud Engineer in Denmark to manage cloud resources and improve system reliability.
    • Build reports and dashboards to support business intelligence and give stakeholders visibility into operational health.

Key Skills & Experience:
  • Cloud & Infrastructure:
    • Strong experience with Google Cloud (or similar cloud platforms), including services like BigQuery, Pub/Sub, Cloud Storage, and MemoryStore (Redis).
    • Hands-on experience managing Kubernetes clusters, with knowledge of tools such as ArgoCD, Cert-Manager, and Ingress-Nginx.

  • Monitoring & Observability Tools:
    • High-level experience with tools like Grafana, Prometheus, Loki, Tempo, Mimir, and Promtail for monitoring, logging, and tracing.
    • Familiarity with Elasticsearch and Kibana for log storage and search is a plus.

  • Automation & CI/CD:
    • Strong skills in automating system management, deployments, and scaling using Docker, ArgoCD, and GitHub.
    • Experience with CI/CD pipelines to streamline the deployment of new code and infrastructure.

  • Development & Scripting:
    • Familiarity with TypeScript, Node.js, React, and React Native for troubleshooting and collaborating with development teams.
    • Ability to write scripts for automation and monitoring configuration.

Our Stack:
  • Infrastructure: Google Cloud, PostgreSQL, MemoryStore (Redis), Pub/Sub, Kubernetes, BigQuery, ArgoCD, Docker, Cloudflare.
  • Monitoring & Logging: Grafana, Prometheus, Loki, Tempo, Promtail.
  • Tools: GitHub, Jira, Confluence.
  • Programming Languages: TypeScript, Node.js, React, React Native.

Your Team:
  • Danish Team: 4 Developers, 1 Cloud Engineer (collaboration with the Cloud Engineer is key)
  • Bolivia Team: 4 Developers

Qualifications:
  • Proven experience in Site Reliability Engineering or similar roles, preferably within cloud-based environments.
  • Strong troubleshooting and problem-solving skills, with a focus on ensuring system availability and performance.
  • Excellent communication and collaboration skills, with experience working in distributed teams.
  • A proactive mindset and a desire to improve system reliability through automation and innovative solutions.

What We Offer:
  • Competitive salary and benefits package.
  • Opportunity to work in a collaborative, fast-paced environment with teams in Bolivia and Denmark.
  • Career growth opportunities and exposure to cutting-edge technologies in cloud infrastructure, automation, and observability.

If you’re passionate about building reliable, scalable, and efficient systems and enjoy working with cloud technologies, we’d love to hear from you. Apply now and join our team in Cochabamba!

We look forward to hearing from you.

If you have any questions about the position, please feel free to contact Guillermo.
WhatsApp: +591 76944464
Email: gumvi@jfmedier.dk