How can Large Language Models (LLMs) revolutionize software testing by automating test case generation and documentation for full-stack applications? This blog proposes a novel multi-agent workflow that leverages different LLMs for specific testing tasks, significantly reducing manual effort and improving efficiency.
Building and Maintaining Robust Software Applications
Building, deploying, and maintaining robust software applications that serve customers 24x7x365, while meeting SLAs and shipping with the fewest possible defects (in the spirit of Total Quality Management, TQM), is challenging for even the most talented software delivery teams, regardless of size.
While automated test cases are the standard in most delivery processes, the struggle lies in proactively maintaining and adding automated test cases at every stage to maximize coverage of the codebase and ship a reliable, bug-free solution.
Given my previous experience and existing case studies in architecting, developing, and maintaining production-grade traditional and Generative AI solutions for startups and enterprises, I decided to evaluate the viability of using LLMs to automate and deploy AI-based application testing.
Finding the Best Approach
The following observations were crucial in identifying the most viable and accurate approach at this point.
- Transformers & GANs (Generative Adversarial Networks) are good at generating synthetic data given examples of real-world data.
- The Transformer architecture has made it possible to apply transfer learning to NLP. Transfer learning takes the weights of a model trained for one kind of task and reuses them on a new task to get better results. Previously, transfer learning was applied mainly in computer vision, where models were pre-trained on supervised data. With transformers, we can apply transfer learning to other domains without even needing supervised data. This means we can build a layered AI architecture in which each domain (or collection of domains) of expertise maps to a dataset, and the AI can be trained on one or more such datasets to gain problem-solving expertise in those domains.
- During execution, the agents use a technique known as active learning or interactive prompting to question and correct their own responses until the required correctness criteria are met. This is especially useful for code generation, where the agent can use a parser to iteratively correct and refine its own results (a minimal sketch of this loop follows this list). We have used this technique with great success while building GenAI chatbots in insurance and similar sectors involving knowledge-based workflow execution.
- Based on real-world deployments, it has become evident that it is easier to orchestrate a workflow of iterative agents, each backed by a smaller LLM that can be independently trained for a specific task, than to use one large language model that does it all. This comes down to accuracy, the cost of training, and the cost of reducing hallucinations (errors), all of which become more manageable when training on domain-specific datasets, given the deep-learning roots of large language models.
- LLMs struggle with ambiguity in tasks such as completing user workflows in B2B/B2C/B2B2C/enterprise application scenarios. Again, this is a result of hallucinations and a tendency to assume the best (that is, they are very optimistic about their predictions).
- LLMs still do not understand what they do to the degree proclaimed by most optimists and salespeople (Subbarao Kambhampati, esteemed professor at Arizona State University's School of Computing & AI, explains this at https://arxiv.org/html/2403.04121v1).
- LLMs are not capable of planning on their own; however, they can generate plans based on prompts, which can then be used by other LLMs.
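To make the active learning/interactive prompting point above concrete, here is a minimal sketch in Python. The function names and prompt wording are illustrative assumptions, the use of Python's `ast` module assumes the generated test is Python code, and the attempt budget of 50 mirrors the average observed in the results later in this post.

```python
import ast

MAX_ATTEMPTS = 50  # roughly the average number of attempts observed in the results below


def llm_complete(prompt: str) -> str:
    """Placeholder for a call to the underlying LLM (OpenAI, StarCoder, etc.)."""
    raise NotImplementedError


def generate_valid_test(task_prompt: str) -> str | None:
    """Interactive prompting: re-prompt the model with its own parser errors."""
    prompt = task_prompt
    for _ in range(MAX_ATTEMPTS):
        candidate = llm_complete(prompt)
        try:
            # Step 1: syntax check with a parser before anything is executed.
            ast.parse(candidate)
        except SyntaxError as err:
            # Feed the parser error back so the model can correct itself.
            prompt = (
                f"{task_prompt}\n\nYour previous attempt failed to parse:\n"
                f"{candidate}\n\nError: {err}\nReturn corrected code only."
            )
            continue
        # Step 2 (runtime validation in a sandbox) happens downstream.
        return candidate
    return None  # correctness criteria not met within the attempt budget
```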
Meta's TestGen-LLM and Checksum.ai are current attempts at using Generative AI to make the QA process less cumbersome and more reliable. The approach discussed in this blog uses a different process, one that relies on triggers from the project management repository and the source code repository for training, generating, and evaluating different tests.
Base Architecture for the Multi-LLM Agent-Based Workflow
The following figure depicts the base architecture designed for this purpose. It uses a multi-LLM, agent-based workflow.
Figure 1: Multi LLM Agent Based Workflow
An agent is a Python script, built with LangChain and an OpenAI or StarCoder LLM, which performs the following tasks (a minimal sketch of one such agent follows the list):
- It uses an existing data repository (which differs based on the type of test being generated) to generate similar tests, and iteratively validates its own responses, first for syntax errors and then for runtime issues in the next step, using the active learning/interactive prompting technique. Effectively, it works as a code generator in most cases.
- It runs the tests in a sandbox environment (in most cases a Docker/Kubernetes environment that can run on-premises or on any cloud) and uses the generated logs to refine the test generation process in terms of accuracy and diversity.
- It documents the generated test code and produces reports in Excel and HTML formats for the evaluated test cases.
- Using RLHF (Reinforcement Learning from Human Feedback), the agent can accept input from humans to improve or fine-tune its performance.
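As a rough illustration of such an agent, the sketch below assumes the `langchain-openai` integration package; the prompt wording and function names are hypothetical, and retrieval of similar existing tests, sandbox execution, and the RLHF feedback step are left to the surrounding workflow.

```python
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4", temperature=0)  # or a StarCoder endpoint

prompt = ChatPromptTemplate.from_messages([
    ("system", "You generate {test_type} tests in the style of the examples provided."),
    ("human", "Existing tests:\n{examples}\n\nGenerate a test for:\n{requirement}"),
])

chain = prompt | llm


def run_agent(test_type: str, requirement: str, examples: str) -> str:
    """Produce one candidate test; syntax/runtime validation and the
    RLHF-style human feedback loop are applied by the surrounding workflow."""
    response = chain.invoke({
        "test_type": test_type,
        "examples": examples,
        "requirement": requirement,
    })
    return response.content
```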
We use the following agents to ensure a robust software delivery process.
Agent A) - The User Story & Bug Reporting Agent:
This agent is responsible for generating and evaluating test cases and user journeys as "flows" based on user stories and bug reports from the project management repository. It was trained on data from existing project management repositories to generate workflows that can easily be translated into code.
Each "flow" is nothing but a set of steps, represented as nodes in a flowchart, which the user is expected to follow; the desired output is achieved when the user follows those steps (for a bug or a user story). An example is "a successful purchase of a specific mobile phone model on the specified ecommerce site".
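Purely as an illustration, a flow for that example could be represented as a simple data structure like the one below; the field names, identifiers, and step actions are hypothetical.

```python
# Hypothetical "flow" for the example above; every step is a node in the flowchart.
purchase_flow = {
    "flow_id": "user-story-1423",  # illustrative project-management reference
    "title": "Successful purchase of a specific mobile phone model",
    "steps": [
        {"node": 1, "action": "open_site", "target": "https://shop.example.com"},
        {"node": 2, "action": "search", "input": "Acme Phone X"},
        {"node": 3, "action": "select_product", "target": "first_search_result"},
        {"node": 4, "action": "add_to_cart"},
        {"node": 5, "action": "checkout", "expect": "order_confirmation_visible"},
    ],
}
```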
Agent B) - The E2E Agent:
This agent generates and validates end-to-end scripts using the flows produced by Agent A). It was trained to generate E2E tests in Cypress, Playwright, or Selenium. For the evaluation we primarily used Cypress, and the results showed that the same process applies to Playwright, Selenium, and others. This makes it possible to run and evaluate both user stories and bug reports.
The test scripts generated by Agent B) are, in turn, used as inputs by the following agents through a parallel workflow executed as a DAG (Directed Acyclic Graph), as sketched below.
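A minimal sketch of this fan-out, assuming each downstream agent is exposed as a callable that accepts the generated E2E scripts; the agent functions here are placeholders, not the actual implementations.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def run_infrastructure_agent(scripts: list[str]):  # Agent C) - placeholder
    ...


def run_performance_agent(scripts: list[str]):  # Agent D) - placeholder
    ...


def run_database_agent(scripts: list[str]):  # Agent E) - placeholder
    ...


def run_security_agent(scripts: list[str]):  # Agent F) - placeholder
    ...


DOWNSTREAM_AGENTS: dict[str, Callable] = {
    "infrastructure": run_infrastructure_agent,
    "performance": run_performance_agent,
    "database": run_database_agent,
    "security": run_security_agent,
}


def fan_out(e2e_scripts: list[str]) -> dict:
    """The downstream agents have no edges between each other, so the DAG
    reduces to a single parallel stage after Agent B)."""
    with ThreadPoolExecutor() as pool:
        futures = {name: pool.submit(agent, e2e_scripts)
                   for name, agent in DOWNSTREAM_AGENTS.items()}
        return {name: future.result() for name, future in futures.items()}
```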
Agent C) - The Infrastructure Test Agent:
This agent generates and verifies infrastructure tests using a combination of Terraform and Docker/Kubernetes scripts, since together they can provision resources as well as deploy and test applications in any environment (on-premises or cloud).
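One way such an agent could verify the generated artifacts before promoting them is to run `terraform validate` on the Terraform modules and a client-side dry run on the Kubernetes manifests. The sketch below is an assumption about such a check, not a prescribed setup.

```python
import subprocess


def verify_infra(terraform_dir: str, k8s_manifest: str) -> bool:
    """Validate generated Terraform and Kubernetes artifacts without deploying them."""
    checks = [
        ["terraform", f"-chdir={terraform_dir}", "init", "-backend=false"],
        ["terraform", f"-chdir={terraform_dir}", "validate"],
        ["kubectl", "apply", "--dry-run=client", "-f", k8s_manifest],
    ]
    for cmd in checks:
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode != 0:
            # The error output is fed back to the agent as a correction prompt.
            return False
    return True
```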
Agent D) - The Performance/Load Testing Agent:
Agent D) generates and verifies load and stress tests in JavaScript.
Agent E) - The Database Testing Agent:
Agent E) generates and verifies database tests in JavaScript that run on k6 clusters, given k6's extensive, proven support for running distributed load-testing scripts against applications and databases.
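As a hedged sketch of how Agents D) and E) could hand a generated JavaScript test to k6 and collect a machine-readable summary for the refinement loop (the flags, file names, and defaults here are illustrative):

```python
import subprocess


def run_k6_script(script_path: str, vus: int = 20, duration: str = "30s") -> bool:
    """Execute one generated k6 script and export a JSON summary for the agent."""
    result = subprocess.run(
        ["k6", "run", "--vus", str(vus), "--duration", duration,
         "--summary-export", "k6-summary.json", script_path],
        capture_output=True, text=True,
    )
    # stdout/stderr plus the exported summary feed the agent's refinement loop.
    return result.returncode == 0
```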
Agent F) - The Security and Penetration Testing Agent:
This agent creates and evaluates tests generated in Python using the OWASP ZAP API. OWASP standards (https://owasp.org/) are widely used and recommended for ensuring an application is developed and architected to be as secure and reliable as possible.
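The kind of scan such a generated test performs might look like the sketch below, which assumes the `zapv2` Python client and a locally running ZAP instance; the target URL and API key are placeholders.

```python
import time

from zapv2 import ZAPv2  # python-owasp-zap-v2.4 client

TARGET = "https://staging.example.com"  # placeholder application URL
zap = ZAPv2(
    apikey="changeme",  # placeholder API key
    proxies={"http": "http://127.0.0.1:8080", "https": "http://127.0.0.1:8080"},
)


def run_baseline_scan() -> list:
    """Crawl the target, run an active scan, and return the alerts for reporting."""
    spider_id = zap.spider.scan(TARGET)
    while int(zap.spider.status(spider_id)) < 100:
        time.sleep(2)
    scan_id = zap.ascan.scan(TARGET)
    while int(zap.ascan.status(scan_id)) < 100:
        time.sleep(5)
    # Alerts (grouped by risk in the report) are what the agent documents.
    return zap.core.alerts(baseurl=TARGET)
```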
The source code repository is used by the following agents.
Agent G) - The API Test Agent:
This agent creates, runs, and documents API tests in PyTest, Mocha, JUnit, or Chai, depending on the backend runtime, when the source is first pushed to the repository as well as on subsequent pushes that touch backend files.
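For a Python backend, a generated PyTest-style API test might look like the following; the endpoint, payload, and expected response shape are hypothetical.

```python
import requests

BASE_URL = "https://api.staging.example.com"  # placeholder backend URL


def test_create_order_returns_201():
    """Hypothetical happy-path test for an order-creation endpoint."""
    payload = {"product_id": "acme-phone-x", "quantity": 1}
    response = requests.post(f"{BASE_URL}/orders", json=payload, timeout=10)
    assert response.status_code == 201
    body = response.json()
    assert body["product_id"] == payload["product_id"]
    assert "order_id" in body
```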
Agent H) - The Frontend Component Test Agent:
In a similar way, on every source push trigger, Agent H) creates, runs, and documents component tests in Jest, Karma, or Selenium, depending on the frontend framework (currently React, Angular, or vanilla HTML/JS/CSS, which typically works with all frameworks since they end up producing the same output).
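A minimal sketch of how such a push trigger could route changed files to the right agent; the path prefixes are assumptions about a typical full-stack repository layout, not part of the original setup.

```python
FRONTEND_PREFIXES = ("frontend/", "src/components/")  # assumed repository layout
BACKEND_PREFIXES = ("backend/", "api/")


def route_push(changed_files: list[str]) -> set[str]:
    """Decide which test agents a push should trigger based on the changed paths."""
    agents_to_run: set[str] = set()
    for path in changed_files:
        if path.startswith(BACKEND_PREFIXES):
            agents_to_run.add("api_test_agent")        # Agent G)
        if path.startswith(FRONTEND_PREFIXES):
            agents_to_run.add("component_test_agent")  # Agent H)
    return agents_to_run
```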
Results Achieved
The following results were achieved by applying the above approach to the existing codebase of an ecommerce application:
- We noticed that this greatly reduced the effort required to ensure error-free test coverage and documentation with every source code push, as well as with every addition to the task log/defect log on the project management boards (the JIRA API was used). The improvement in effort and time was around 70% compared to the existing manual-intensive/semi-automated process.
- Using pre-trained models fine-tuned for each testing task (DB, security, API, frontend, etc.) for each agent increased accuracy and efficiency compared to developing one agent type that tries to generate all tests. This was the initial hypothesis, and it was successfully verified.
- The diversity of the generated test coverage and documentation improved with the addition of codebases, documentation, logs, SRS (software requirement specification) documents, and project management logs (tasks/features) unrelated to the codebase for which the tests were generated.
- Because of the active learning strategy employed during agent execution, the agents were able to correct themselves after roughly 50 attempts on average at generating the required tests, and then verified them by running them in the corresponding sandbox to further refine their approach.
- The human expert feedback used to refine the generated cases asynchronously contributed to a substantial increase in accuracy for the concerned codebase. This was expected, as fine-tuning on the codebase results in tests that are better suited to the application under consideration.
- GPT-4 was noticeably more accurate, though slower and significantly more expensive, than GPT-3.5 for code and documentation generation. Other LLMs evaluated, such as StarCoder and Llama 2/3, proved to be suitable alternatives as well; their accuracy was not as good as the GPT-based models, but it is expected to improve over time, and using a local LLM can definitely improve privacy and security and reduce the cost of the process.
- To replicate the same process for mobile testing, it is possible to use screenshot-based visual algorithms to navigate through the application and verify UI objects and elements visually in order to create UI tests. A device farm such as AWS Device Farm or BrowserStack can be used to run the E2E tests with this process. This appears to be a more generalized approach for most testing scenarios regardless of environment or language; however, the model needs to handle more ambiguity in the specifications, which is not necessarily a strength of LLMs.
Conclusion
These findings strongly suggest that the proposed LLM-based multi-agent workflow offers a powerful and efficient approach to automating test case generation and documentation for full-stack enterprise and consumer-grade applications. Continued research and development can further refine this process and expand its applicability across diverse testing scenarios.