Test Engineer - AI and LLMs

Architech

Contract

45 - 60 CAD / hour

Posted on: 3 weeks ago

Software Development

Canada

Hiring from: Canada

Architech is a Toronto-based software company with 20 years of experience in creating technology solutions for clients across North America. We leverage the latest cloud technology and hire top talent to modernize applications so that businesses can succeed in today’s digital world.

Our Dream Team has a main hub in Toronto and extends across Canada and to Kraków, Poland. Our team consists of over 100 certified technical experts in our Product, Design, Engineering, and Delivery disciplines. Our values drive our culture of success: Think Big, Be Open & Collaborate, Never Fail a Client, Grow Our People, Do the Right Thing, and Embrace Change.

Be Open & Collaborate: Our Culture Says It All

You’ll work closely with a diverse, tight-knit group of creative and talented people who are passionate about technology, software, and solutions. Not only will you work in a collaborative and supportive environment, you’ll also grow your existing skills while keeping up with technology trends.

Who We Are

We’re passionate about creating an environment where every team member feels empowered to share their unique point of view. We celebrate diverse talents and encourage our teammates to share their whole selves – because our greatest source of inspiration is each other, and we believe diversity drives innovation.

In order to be inclusive, we must be intentional. We have taken a multi-pillar approach to D&I at Architech including: Listening & Learning, Being an Ally, and Accountability.

In 2020 we launched our first Diversity & Inclusion survey. While we are always striving for more equal representation, we are very proud of our results:

31% women, 57% BIPOC, 14% LGBTQIA+
49% of our people were born in countries other than where our offices are located. Our team members collectively speak 19 different languages. 59% of our people speak more than one language.
In the past year Architech has increased the number of women in our technology function by 200%. We strive to do even better as our multi-year strategic plan unfolds.
We analyzed salaries by gender of persons in the same role and are delighted to report a 0% gender pay gap in our delivery and technology roles!

What Our People Say

“Employees of different backgrounds interact well within our company” – 97% of employees agree

“Architech respects individuals and values their differences” – 96% of employees agree

Welcome to Architech.

Test Engineer - AI and LLMs

We are seeking a highly motivated Test Engineer - AI and LLM Evaluation with a strong software development background and a passion for ensuring the quality and reliability of cutting-edge AI applications. This is not a traditional QA role. We need an engineer experienced in automation who understands software development principles and the nuances of evaluating Generative AI systems, particularly those leveraging Large Language Models (LLMs). You will be integral to testing AI-driven solutions in a telecom-focused environment, with an emphasis on the quality, reliability, performance, safety, and fairness of applications built using LLMs, RAG pipelines, and other AI models.

If you are an analytical thinker, a meticulous problem solver, and a fast learner eager to work at the forefront of AI evaluation, this role is for you!

Key Responsibilities

Design, develop, and execute automated evaluation suites and test cases specifically targeting AI/LLM components, focusing on aspects like response quality, factual accuracy, safety, and task completion.
Implement and manage batch testing processes using curated datasets to assess model performance, identify regressions, and benchmark different model versions or prompts.
Develop, maintain, and enhance test and evaluation frameworks using libraries such as Promptflow, DeepEval, Ragas, and similar LLM evaluation tools.
Define and implement comprehensive test strategies to evaluate LLM outputs for accuracy, relevance, coherence, safety (toxicity, bias), hallucination, and consistency, using automated metrics and, where appropriate, qualitative review processes.
Collaborate closely with developers, data scientists, and prompt engineers to understand model behavior and to identify edge cases, potential biases, and failure modes in AI models and agents.
Test and validate components of Retrieval-Augmented Generation (RAG) pipelines, including retriever performance, chunking strategies, and generator quality.
Evaluate the end-to-end functionality and performance of AI-driven workflows within telecom applications against defined benchmarks.
Continuously research and improve testing methodologies and metrics for AI/LLM applications, incorporating industry best practices in automated evaluation and validation.
Document evaluation results and findings, providing actionable feedback to development teams to enhance AI model robustness, reliability, and overall quality.

Required Skills & Qualifications

3-5 years of experience in software development, SDET (Software Development Engineer in Test), or QA automation, with a demonstrable focus on backend systems, APIs, or complex data pipelines.
Strong hands-on programming experience in Python is essential.
Proven experience with test automation frameworks and libraries (e.g., Pytest).
Solid understanding of AI/ML concepts, particularly LLMs, Generative AI, prompt engineering, vector databases, RAG architectures, and principles of LLM safety and ethical AI testing.
Experience or strong familiarity with LLM evaluation metrics and methodologies (e.g., ROUGE, BLEU, BERTScore, F1, precision, recall, faithfulness, relevance).
Familiarity with API testing (e.g., testing RESTful APIs used by AI services) and tools (e.g., Postman, requests library).
Experience with version control systems (e.g., Git) and CI/CD pipelines (e.g., Jenkins, GitLab CI, GitHub Actions).
Strong analytical skills and a meticulous, problem-solving mindset.
Excellent communication skills and the ability to articulate complex technical issues clearly.
A quick learner who can rapidly adapt to evolving AI technologies and evaluation techniques.

Preferred Qualifications

Direct hands-on experience using LLM evaluation frameworks like Promptflow, DeepEval, Ragas, LangSmith, or similar.
Experience with or exposure to LLM red teaming tools and techniques (e.g., Garak, PyRIT, Giskard, manual adversarial prompt crafting) is a significant advantage.
Experience developing and managing datasets for testing and evaluation (e.g., 'golden datasets', adversarial examples).
Familiarity with data handling and manipulation libraries in Python (e.g., Pandas, NumPy).
Knowledge of AI ethics, fairness, and bias testing methodologies beyond basic safety checks.
Experience with cloud platforms (AWS, GCP, Azure), particularly services related to AI/ML.
Experience working in the telecom sector.
Experience with UI test automation (e.g., Selenium, Playwright) for testing applications integrating AI features is a plus, but not the primary focus of this role.

Architech is an equal opportunity employer committed to diversity. Should you require any accommodations prior to or during the interview process, please let us know. We strongly encourage applications from racialized people, people with disabilities, people from gender and sexually diverse communities and/or people with intersectional identities.
