
Senior Site Reliability Engineer (SRE)
Vitech Systems Group
Full time
Software Development
Canada
Hiring from: Canada
Department: Solutions Consulting
Location: Canada
At Vitech, we believe in the power of technology to simplify complex business processes. Our mission is to bring better software solutions to market, addressing the intricacies of the insurance and retirement industries. We combine deep domain expertise with the latest technological advancements to deliver innovative, user-centric solutions that future-proof and empower our clients to thrive in an ever-changing landscape. With over 1,600 talented professionals on our team, our innovative solutions are recognized by industry leaders like Gartner, Celent, Aite-Novarica, and ISG.
We offer a competitive compensation package along with comprehensive benefits that support your health, well-being, and financial security.
Location: Canada or United States (Remote Role)
Senior Site Reliability Engineer (SRE) – Join Our Global Engineering Team
At Vitech, we believe that excellence in production systems starts with engineering-driven solutions to operational challenges. Our Site Reliability Engineering (SRE) team is at the heart of ensuring seamless performance for our clients, preventing potential outages, and proactively identifying and resolving issues before they escalate.
Our SRE team is a diverse group of talented engineers across India, the US, and Canada. We have T-shaped expertise spanning application development, database management, networking, and system administration across both on-premises environments and the AWS cloud. Together, we support mission-critical client environments and drive automation to reduce manual toil, freeing our team to focus on innovation.
About the Role: Senior SRE
As an SRE, you’ll be a key player in revolutionizing how we operate production systems for single- and multi-tenant environments. You’ll support SRE initiatives, support production, and drive infrastructure automation. Working in an Agile team environment, you’ll have the opportunity to explore and implement the latest technologies, take part in on-call duties, and keep learning in an ever-evolving tech landscape.
If you’re passionate about scalability, reliability, security, and automation of business-critical infrastructure, this role is for you.
What you will do:
- Own and manage our AWS cloud-based technology stack, using native AWS services and top-tier SRE tools to support multiple client environments with Java-based applications and microservices architecture.
- Define SRE strategy, vision, and goals aligned to Vitech’s overall objectives. Establish roadmaps and plans for improving system reliability, scalability, and efficiency.
- Collaborate with architecture review boards and Solution Architects, and engage in reviews and implementations of viable solutions.
- Design, refine, and implement SLIs and SLOs that cover the broad spectrum of SRE: availability, performance, and error budgeting.
- Design, deploy, and manage AWS Aurora PostgreSQL clusters for high availability and scalability. Optimize SQL queries, indexes, and database parameters for performance tuning.
- Automate database operations using Terraform, Ansible, AWS Lambda, and AWS CLI. Manage Aurora’s read replicas, auto-scaling, and failover mechanisms.
- Enhance infrastructure as code (IaC) patterns using technologies like Terraform, CloudFormation, Ansible, Python, and SDKs. Collaborate with DevOps teams to integrate Aurora with CI/CD pipelines.
- Provide full-stack support, as per the assigned schedule, for applications across technologies such as Oracle WebLogic, AWS Aurora PostgreSQL, Oracle Database, Apache Tomcat, AWS Elastic Beanstalk, Docker/ECS, EC2, S3, etc.
- Troubleshoot database incidents, perform root cause analysis, and implement preventive measures. Document database architecture, configurations, and operational procedures.
- Ensure high availability, scalability, and performance of PostgreSQL databases on AWS Aurora. Monitor database health, troubleshoot issues, and perform root cause analysis for incidents.
- Embrace SRE principles such as chaos engineering, reliability, and toil reduction.
What you will bring:
- Proven hands-on experience as an SRE for critical, client-facing applications, with the ability to dive deep into daily SRE tasks, manage incidents, and oversee operational tools.
- 4+ years of experience developing and/or administering software in the AWS public cloud, and in-depth experience hosting applications in AWS (EC2, EBS, ECS/EKS, Elastic Beanstalk, RDS, CloudWatch).
- 3+ years of experience managing relational databases (Oracle and/or PostgreSQL) in both cloud and on-prem environments, including SRE tasks like backup/restore, performance issues, and replication.
- Demonstrable cross-functional, full-stack knowledge of compute, storage, networking, security, and databases.
- Strong understanding of AWS networking concepts (VPC, VPN/DX/Endpoints, Route53, CloudFront, Load Balancers, WAF).
- Experience with containerized applications (Docker, Kubernetes, ECS). Leverage AWS Aurora features (e.g., read replicas, auto-scaling, multi-region deployments) to enhance database performance and reliability.
- Familiarity with data lake architecture, Elasticsearch, ZooKeeper, and DynamoDB is a plus.
- Familiarity with tools like pgAdmin, psql, or other database management utilities. Automate routine database maintenance tasks (e.g., vacuuming, reindexing, patching). Knowledge of backup and recovery strategies (e.g., pg_dump, PITR).
- Set up and maintain monitoring and alerting systems for database performance and availability (e.g., CloudWatch, Honeycomb, New Relic, Dynatrace).
- Work closely with development teams to optimize database schemas, queries, and application performance. Provide database support during application deployments and migrations.
- Hands-on experience with web/application layers (Oracle WebLogic, Apache Tomcat, AWS Elastic Beanstalk, SSL certificates, S3 buckets).
- Automation experience with Infrastructure as Code (Terraform, CloudFormation, Python, Jenkins, GitHub/Actions). Knowledge of multi-region Aurora Global Databases for disaster recovery.
- Scripting experience in Python, Bash, Java, JavaScript, and Node.js.
- Oversee and streamline change management procedures, efficiently handling daily production change requests to ensure seamless operations.
- Excellent written and verbal communication and critical-thinking skills.
At Vitech, we believe in empowering our teams to drive innovation through technology. If you thrive in a dynamic environment and are eager to drive innovation in SRE practices, we want to hear from you!
You’ll be part of a forward-thinking team that values collaboration, innovation, and continuous improvement. We provide a supportive and inclusive environment where you can grow as a leader while helping shape the future of our organization.