Test Engineer - AI and LLMs

Architech
Contract
45 - 60 CAD / hour
Software Development
Canada
Hiring from: Canada

Architech is a Toronto-based software company with 20 years of experience in creating technology solutions for clients across North America. We leverage the latest cloud technology and hire top talent to modernize applications so that businesses can succeed in today’s digital world.

Our Dream Team has its main hub in Toronto and extends across Canada and to Kraków, Poland. Our team consists of over 100 certified technical experts in our Product, Design, Engineering, and Delivery disciplines. Our values drive our culture of success: Think Big, Be Open & Collaborate, Never Fail a Client, Grow Our People, Do the Right Thing, and Embrace Change.


Be Open & Collaborate: Our Culture Says It All

You’ll work closely with a diverse, tight-knit group of creative and talented people who are passionate about technology, software, and solutions. Not only will you work in a collaborative and supportive environment, you’ll also grow your existing skills while keeping up with technology trends.


Who We Are

We’re passionate about creating an environment where every team member feels empowered to share their unique point of view. We celebrate diverse talents and encourage our teammates to share their whole selves – because our greatest source of inspiration is each other, and we believe diversity drives innovation.

To be inclusive, we must be intentional. We have taken a multi-pillar approach to D&I at Architech that includes Listening & Learning, Being an Ally, and Accountability.

In 2020 we launched our first Diversity & Inclusion survey. While we are always striving for more equal representation, we are very proud of our results:

  • 31% women, 57% BIPOC, 14% LGBTQIA+
  • 49% of our people were born in countries other than where our offices are located. Our team members collectively speak 19 different languages, and 59% of our people speak more than one language.
  • In the past year Architech has increased the number of women in our technology function by 200%. We strive to do even better as our multi-year strategic plan unfolds.
  • We analyzed salaries by gender of persons in the same role and are delighted to report a 0% gender pay gap in our delivery and technology roles!


What Our People Say

“Employees of different backgrounds interact well within our company” – 97% of employees agree

“Architech respects individuals and values their differences” – 96% of employees agree

Welcome to Architech.


Test Engineer - AI and LLMs


We are seeking a highly motivated Test Engineer - AI and LLM Evaluation with a strong software development background and a passion for ensuring the quality and reliability of cutting-edge AI applications. This is not a traditional QA role. We need an engineer experienced in automation who understands software development principles and the nuances of evaluating Generative AI systems, particularly those leveraging Large Language Models (LLMs). You will be integral to testing AI-driven solutions in a telecom-focused environment, rigorously evaluating the quality, reliability, performance, safety, and fairness of applications built using LLMs, RAG pipelines, and other AI models.

If you are an analytical thinker, a meticulous problem solver, and a fast learner eager to work at the forefront of AI evaluation, this role is for you!


Key Responsibilities

  • Design, develop, and execute automated evaluation suites and test cases specifically targeting AI/LLM components, focusing on aspects like response quality, factual accuracy, safety, and task completion.
  • Implement and manage batch testing processes using curated datasets to assess model performance, identify regressions, and benchmark different model versions or prompts.
  • Develop, maintain, and enhance test and evaluation frameworks using libraries such as Promptflow, DeepEval, Ragas, and similar LLM evaluation tools.
  • Define and implement comprehensive test strategies to evaluate LLM outputs for accuracy, relevance, coherence, safety (toxicity, bias), hallucinations, and consistency, using automated metrics and, where appropriate, qualitative review processes.
  • Collaborate closely with developers, data scientists, and prompt engineers to understand model behavior and identify edge cases, potential biases, and failure modes in AI models and agents.
  • Test and validate components of Retrieval-Augmented Generation (RAG) pipelines, including retriever performance, chunking strategies, and generator quality.
  • Evaluate the end-to-end functionality and performance of AI-driven workflows within telecom applications against defined benchmarks.
  • Continuously research and improve testing methodologies and metrics for AI/LLM applications, incorporating industry best practices in automated evaluation and validation.
  • Document evaluation results and findings, providing actionable feedback to development teams to enhance AI model robustness, reliability, and overall quality.


Required Skills & Qualifications

  • 3-5 years of experience in software development, SDET (Software Development Engineer in Test), or QA automation, with a demonstrable focus on backend systems, APIs, or complex data pipelines.
  • Strong hands-on programming experience in Python is essential.
  • Proven experience with test automation frameworks and libraries (e.g., Pytest).
  • Solid understanding of AI/ML concepts, particularly LLMs, Generative AI, prompt engineering, vector databases, RAG architectures, and principles of LLM safety and ethical AI testing.
  • Experience or strong familiarity with LLM evaluation metrics and methodologies (e.g., ROUGE, BLEU, BERTScore, F1, precision, recall, faithfulness, relevance).
  • Familiarity with API testing (e.g., testing RESTful APIs used by AI services) and tools (e.g., Postman, the Python requests library).
  • Experience with version control systems (e.g., Git) and CI/CD pipelines (e.g., Jenkins, GitLab CI, GitHub Actions).
  • Strong analytical skills and a meticulous, problem-solving mindset.
  • Excellent communication skills and the ability to articulate complex technical issues clearly.
  • A quick learner who can rapidly adapt to evolving AI technologies and evaluation techniques.


Preferred Qualifications

  • Direct hands-on experience using LLM evaluation frameworks like Promptflow, DeepEval, Ragas, LangSmith, or similar.
  • Experience with or exposure to LLM red teaming tools and techniques (e.g., Garak, PyRIT, Giskard, manual adversarial prompt crafting) is a significant advantage.
  • Experience developing and managing datasets for testing and evaluation (e.g., 'golden datasets', adversarial examples).
  • Familiarity with data handling and manipulation libraries in Python (e.g., Pandas, NumPy).
  • Knowledge of AI ethics, fairness, and bias testing methodologies beyond basic safety checks.
  • Experience with cloud platforms (AWS, GCP, Azure), particularly services related to AI/ML.
  • Experience working in the telecom sector.
  • Experience with UI test automation (e.g., Selenium, Playwright) for testing applications integrating AI features is a plus, but not the primary focus of this role.


Architech is an equal opportunity employer committed to diversity. Should you require any accommodations prior to or during the interview process, please let us know. We strongly encourage applications from racialized people, people with disabilities, people from gender and sexually diverse communities and/or people with intersectional identities.
