Site Reliability Engineer - Resilience

  • Lisboa
  • Celfocus
Celfocus is a European high-tech system integrator, providing professional services focused on creating business value through Analytics and Cognitive solutions, addressing strategic opportunities in Telecommunications, Energy & Utilities, Financial Services and other markets. Serving Clients in 25+ countries, Celfocus delivers solutions such as accelerating digital network transformation in Autonomous Networks, elevating and monetising business services in B2B2x ecosystems, and providing highly relevant customer experiences through Hyper-personalisation solutions.

Make an impact by working for sectors where technology is the enabler, everything is ground-breaking and there's a constant need to be innovative. Be part of the team that combines business knowledge, technological edge and design experience. Our different backgrounds and know-how are key in developing solutions and experiences for digital clients. Face challenges and learn other ways of thinking and seeing the world: there's always room for your energy and creativity.

About the role

An SRE focused on Resilience: someone who can look at a complex system of services, products, applications and content that work together to deliver a full end-to-end (E2E) customer experience in a telco company, and identify areas for improvement to make it more solid, stable and reliable. The role is closely related to Google's definition of an SRE, but here it is almost exclusively focused on resilience itself. This work can happen before, during or after code has been written for a product.

As a part of your job, you will:

  • Define, create and implement standards, and drive the implementation of resilient design
  • Understand what happens if a downstream service fails: how is our upstream response handled? What is the customer experience (impact)?
  • Define, create and implement fallback mechanisms and circuit breakers, and understand whether it's appropriate to create one at all
  • Define and create the logic for the aforementioned circuit breakers (experience shows that today's implementations may have a negative impact)
  • Determine how we tackle E2E resilience on a customer journey
  • Define, create and implement E2E timeout settings (these have caused negative outcomes in the past)
  • Participate in complex E2E operational issues, identifying root causes and architectural solutions (or other improvements) to avoid recurrence
  • Work closely with the architecture team and Tech Leads in the early stages of the SDLC

What are we looking for?

An environment where services can be built in mobile, web, integration or backend technologies, based on Google Cloud and with Apigee exposure. Some of the technologies involved are:

  • Angular
  • Strapi CMS
  • Squid Proxy
  • PingFederate
  • Kotlin and Swift
  • Apigee
  • GCP

Availability to travel is important: the project requires trips to the UK (once every two months).

Personal traits:

  • Ability to adapt to different contexts, teams and Clients
  • Teamwork skills, but also a sense of autonomy
  • Motivation for international projects, including travel
  • Willingness to collaborate with other players
  • Strong communication skills

We want people who like to roll up their sleeves and open their minds. Believe this is you? Come join the Team!