SENIOR SITE RELIABILITY ENGINEER
Reports to: Product Care Manager
Location: Canada
Role Type: Full time (Permanent or Contract)
Level: Individual Contributor
Opportunity Details
PeerTech Consulting Inc. is actively seeking a Senior Site Reliability Engineer for a permanent position.
The Role
As we expand our Product Care offering, we are looking for a Sr. Site Reliability Engineer (SRE) to help us build that capability and deliver insights from massive-scale data in real time. The Sr. SRE is responsible for developing automated solutions for operational concerns such as on-call monitoring, performance and capacity planning, and disaster response. The role complements our development teams, with a focus on continuous delivery and infrastructure automation.
As the bridge between development and operations, you will be our primary escalation point across key customer accounts.
Key Responsibilities:
- Contribute to the design, implementation, and maintenance of our AWS infrastructure
- Anticipate production issues proactively: assess risks and mitigate them, planning contingencies and countermeasures in advance
- Restore systems to a steady state by quickly investigating and fixing performance, stability, and scalability issues, so that Kablamo can meet its SLA and SLO requirements
- Ensure that the underlying infrastructure runs smoothly and that systems and tools work as expected
- Develop or implement visual tools that let technical and business teams observe system health, and support the Technical Account Manager in reporting on reliability metrics
- Use automation tools to solve problems, writing code to automate processes such as analysing logs and testing production environments
- Work with the engineering and/or development teams to identify recurring problems that can be resolved through automation
- Enhance the performance, efficiency, and monitoring of software development processes
- Act on system incidents: as the SRE, you are a key contact in incident response and resolution, including active collaboration in post-incident reviews (PIRs) and post-mortems
- Collaborate closely with product developers to ensure that the designed solution meets non-functional requirements such as availability, performance, security, and maintainability, and work with the development team to define fields for logging and tracing
- Advocate for reliability against competing priorities
- Help prepare for production releases, including facilitating training and enablement of client technical teams and/or attending appropriate meetings (Technical Working Groups, Architecture Review Boards, Change Advisory Boards)
Required Skills and Experience:
- 5+ years’ experience in an SRE or DevOps role
- Deep understanding of system architecture and design principles
- Ability to think critically, solve problems, and perform well under pressure
- Troubleshooting experience, with the ability to communicate clearly with customers or the engineering team
- A proactive approach to spotting problems, areas for improvement, and performance bottlenecks
- Experience with AWS and its services (Serverless, Deployment Tools, Networking, Containerization, Security, Cost Management)
- Familiarity with tools such as AWS CloudWatch, Datadog, Grafana, Prometheus, Scalyr, PagerDuty, OpsGenie, Jira Service Management
- Ability to work cross-functionally with support engineering, development teams, and/or client vendors to deliver sound outcomes and suggest system improvements
- Understanding of security requirements and their implications, and the ability to conform to applicable security frameworks
- An in-depth knowledge of version control
- CI/CD implementation expertise
- Experience with production rollback
- Knowledge of fundamental network concepts and protocols
- The ability to program in one or more high-level languages, such as Python, Go, Java, C/C++, or JavaScript
- A good understanding of DevOps concepts and best practices, including Infrastructure-as-Code
Bonus Points for:
- Bachelor’s degree in computer science or other similar technical qualification
- AWS Associate and/or Professional Level Certifications
- Strong grasp of networking, security, and reliability fundamentals
- Solid understanding of Agile methodologies and practices
Career Progression:
- Lead SRE
- Principal/Staff SRE
About PeerTech Consulting Inc.
Thank you for taking the time to apply! PeerTech Consulting Inc. is a dynamic and innovative IT recruiting and consulting firm that specializes in connecting top-tier tech talent with leading organizations. With a deep-rooted commitment to excellence, we pride ourselves on delivering tailored solutions that match the unique needs of our clients. Our team of industry experts brings a wealth of experience to the table, ensuring that we source the best IT professionals and provide strategic guidance to businesses of all sizes. At PeerTech, we foster partnerships that drive success, forging the path to a brighter, tech-savvy future for both candidates and companies.
Accessibility accommodations are available upon request.