AI Agents Functional and Non-Functional Testing

Software vocabulary can be extremely confusing, and the difference between functional and non-functional testing is a great example. The phrase "non-functional test" suggests a test for things that don't really matter, which is the opposite of what it actually means. In fact, the distinction between functional and non-functional testing is critical in software engineering. It becomes even more important with AI agents, where the failure modes are unfamiliar and traditional testing procedures are not enough on their own.

Whether, and how well

Functional testing asks whether the software does the thing it was designed to do. When you press the button, is the form submitted? When you enter the password, do you get logged in? If you move money from account A to account B, does the right amount arrive? Each test supplies an input, compares the actual output with the expected output, and reports a pass or a fail.
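As a minimal sketch of the money-transfer question, the Python below checks one input against one expected output. The transfer function and the account names are hypothetical stand-ins written for this illustration, not part of any real banking system.

```python
def transfer(accounts, src, dst, amount):
    """Hypothetical function under test: move `amount` from src to dst.

    Returns a new dict of balances and leaves the input untouched.
    """
    if amount <= 0 or accounts[src] < amount:
        raise ValueError("invalid transfer")
    updated = dict(accounts)
    updated[src] -= amount
    updated[dst] += amount
    return updated


def test_transfer_moves_right_amount():
    # One input, one expected output: the essence of a functional test.
    before = {"A": 100, "B": 20}
    after = transfer(before, "A", "B", 30)
    assert after == {"A": 70, "B": 50}


test_transfer_moves_right_amount()
```

The test says nothing about how long the transfer took or what happens under load. It only asks whether the right amount arrived.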

Non-functional testing asks how well the software does the thing it was designed to do. How fast, how reliably, how securely, under how much load, on how many browsers, for how many simultaneous users, and with how much memory? How well does the software perform when the network is patchy, when the server restarts, and when an attacker probes the system? These are properties of the system's behaviour rather than the behaviour itself.
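One of these questions, how fast under how much load, might be sketched as follows. The handler is a hypothetical stand-in that simply sleeps for a millisecond, and the request count, worker count and latency budget are arbitrary assumptions chosen for illustration.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def handle_request(payload):
    """Hypothetical handler standing in for the system under test."""
    time.sleep(0.001)  # simulate a small amount of work
    return payload.upper()


def timed_call(payload):
    """Return how long one call to the handler took, in seconds."""
    start = time.perf_counter()
    handle_request(payload)
    return time.perf_counter() - start


def p95_latency(n_requests=200, workers=20):
    """Fire requests concurrently and return an approximate 95th-percentile latency."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        latencies = sorted(pool.map(timed_call, ["ping"] * n_requests))
    return latencies[int(0.95 * (len(latencies) - 1))]


# The assertion is about a property of the behaviour, not the output itself.
assert p95_latency() < 0.5  # assumed budget of half a second
```

Note that the handler's return value is never inspected. A functional test would check it; this test only cares how long the answer took to arrive while other requests were in flight.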

Consider this analogy from the world of restaurants. A functional test investigates whether the kitchen sent out the steak for a customer who ordered steak. A non-functional test investigates how the steak arrived. Was it served while the customer was still hungry, on a clean plate, and at the right temperature? Did the kitchen also feed forty other diners that night without anyone waiting for an hour?

The history of an awkward name

The phrase "non-functional requirement" first appeared in Yeh and Zave's 1980 paper "Specifying software requirements" in the Proceedings of the IEEE. It became more widely used after Mylopoulos, Chung and Nixon's 1992 paper "Representing and using nonfunctional requirements" appeared in IEEE Transactions on Software Engineering. It finally reached the mainstream in 2000 with the textbook by Chung et al., "Non-Functional Requirements in Software Engineering".

There have long been engineers who disliked the term. Mike Cohn of Mountain Goat Software prefers the term "constraints", on the grounds that calling something non-functional suggests that we should not care about it. Others use "quality attributes", and some refer to "the -ilities", by which they mean things like reliability, scalability, usability, maintainability, portability, and observability. So the vocabulary can vary, but for nearly half a century, developers have acknowledged that the underlying distinction between what a system does and how well it does it is important.

Where the distinction gets fuzzy

The split between functional and non-functional testing is cleaner in theory than in practice. A login screen that takes ninety seconds to respond is functionally working, but in practical terms it is broken. A search that returns results in the wrong order has produced an output, so strictly speaking it passes a functional test, but again, in practical terms the system has failed the user. Performance and correctness blur into each other at the edges, and the question of which bucket a particular test belongs in can become somewhat academic.

Applying the distinction to AI agents

An AI agent presented with a task (to book a flight, to summarise a document, to file a support ticket, to run an SQL query) can fail in two very different ways.

Functional failure is the familiar kind. The agent books the wrong flight, mis-reads the document, or calls the wrong tool. It returns the right answer to the wrong question. These are the agentic equivalents of a button that fails by submitting the wrong form, and they can in principle be tested for by comparing input-output pairs.
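A minimal sketch of such an input-output comparison for an agent is shown below. The stub agent, the tool names and the two test cases are all hypothetical; a real harness would call an actual agent and would probably need a looser comparison than strict equality.

```python
def stub_agent(task):
    """Hypothetical agent: maps a task description to a single tool call."""
    if "flight" in task:
        destination = task.split(" to ")[-1]
        return {"tool": "book_flight", "args": {"destination": destination}}
    return {"tool": "file_ticket", "args": {"summary": task}}


# Each case pairs an input with the tool call we expect the agent to make.
CASES = [
    ("book a flight to Lisbon",
     {"tool": "book_flight", "args": {"destination": "Lisbon"}}),
    ("printer is jammed",
     {"tool": "file_ticket", "args": {"summary": "printer is jammed"}}),
]


def functional_failures(agent, cases):
    """Return (task, expected, actual) for every case the agent gets wrong."""
    failures = []
    for task, expected in cases:
        actual = agent(task)
        if actual != expected:
            failures.append((task, expected, actual))
    return failures


assert functional_failures(stub_agent, CASES) == []
```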

Non-functional failure is where agents differ from traditional software. The agent might succeed with the task on Tuesday and fail on Wednesday, given exactly the same input, because the underlying model is stochastic. It might perform a task at ten times the budgeted cost. It might complete the task while leaking customer data into a third-party API. It might perform the task while drifting far away from the persona the marketing team want it to present. It might complete the task satisfactorily today, but silently degrade in six months when the underlying model is updated. It might be brittle when faced with adversarial inputs that no functional test would think to try.
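The Tuesday-versus-Wednesday problem suggests running the same input many times and measuring a pass rate, rather than asserting on a single run. The sketch below assumes a hypothetical stub agent whose randomness comes from a seeded generator; the run count and the threshold are arbitrary choices for illustration.

```python
import random


def make_stub_agent(success_rate, seed):
    """Build a hypothetical stochastic agent that answers correctly
    with probability `success_rate`."""
    rng = random.Random(seed)

    def agent(task):
        return "correct" if rng.random() < success_rate else "wrong"

    return agent


def pass_rate(agent, task, expected, runs=100):
    """Run the same task repeatedly and return the fraction of passes."""
    passes = sum(agent(task) == expected for _ in range(runs))
    return passes / runs


flaky = make_stub_agent(success_rate=0.9, seed=0)
rate = pass_rate(flaky, "book a flight to Lisbon", "correct")
print(f"pass rate over 100 runs: {rate:.2f}")

# A consistency check asserts on the rate, not on any single run.
THRESHOLD = 0.99  # assumed requirement
print("meets threshold" if rate >= THRESHOLD else "too inconsistent")
```

A single functional test of this agent would pass most of the time, which is exactly why it would be misleading.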

The list of -ilities is longer for agents than for conventional software, and several items on it are genuinely new: consistency across runs, robustness to prompt injection, calibration of confidence, faithfulness to source documents, refusal behaviour, fallback when tools fail, behaviour under distributional shift, cost per successful task. Some of these have no direct analogue in conventional software, because conventional software is deterministic and does not have opinions. Agents are stochastic and they have something close to opinions, which gives many more dimensions to the question "how well does it behave?".
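To take one item from that list, cost per successful task can be computed from a log of runs. The sketch below assumes a hypothetical Run record with a success flag and a cost in dollars; the figures are made up for the example.

```python
from dataclasses import dataclass


@dataclass
class Run:
    """Hypothetical record of one agent run."""
    succeeded: bool
    cost_usd: float


def cost_per_successful_task(runs):
    """Total spend divided by the number of successes.

    Failed runs still cost money, so they raise the figure.
    """
    successes = sum(r.succeeded for r in runs)
    if successes == 0:
        return float("inf")
    return sum(r.cost_usd for r in runs) / successes


runs = [Run(True, 0.25), Run(False, 0.25), Run(True, 0.5)]
assert cost_per_successful_task(runs) == 0.5
```

Here the average cost per run is about 33 cents, but each success effectively cost 50 cents, because the failed attempt was paid for as well.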

The upshot is that an organisation putting agents into production cannot rely on functional testing alone, even very thorough functional testing. A test suite which confirms that the agent got the right answer on a thousand curated cases tells you nothing about what happens on another ten thousand cases that the tests did not anticipate, or about whether the agent will pass the same tests next month, or about what the agent does when the API it depends on returns garbage. For agents, non-functional questions are not optional extras. They are the questions that determine whether an agent can and should be put into production.
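The last of those cases, an API that returns garbage, can be tested by deliberately feeding the agent bad tool output. The sketch below assumes a hypothetical agent step that looks up an order status through an injected call_api function; the function names and the fallback shape are invented for this illustration.

```python
import json

FALLBACK = {"status": "unknown", "fallback": True}


def lookup_order_status(call_api, order_id):
    """Hypothetical agent step: call a tool and validate its output
    before acting on it."""
    raw = call_api(order_id)
    try:
        data = json.loads(raw)
    except (TypeError, ValueError):
        # Not JSON at all (HTML error page, empty string, None).
        return dict(FALLBACK)
    if not isinstance(data, dict) or "status" not in data:
        # Valid JSON, but not the shape we need.
        return dict(FALLBACK)
    return {"status": data["status"], "fallback": False}


def test_garbage_from_api_triggers_fallback():
    garbage_responses = ["<html>502 Bad Gateway</html>", "", None, "[]", "{}"]
    for garbage in garbage_responses:
        result = lookup_order_status(lambda _id, g=garbage: g, "order-1")
        assert result == FALLBACK


test_garbage_from_api_triggers_fallback()
```

None of these inputs would appear in a curated set of happy-path cases, which is the point: the test asks how the agent behaves when its dependencies misbehave.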

Conclusion

Functional testing asks whether software does its job. Non-functional testing asks whether anyone would want to use it when it does. The terminology is awkward, and engineers have been complaining about it since at least the early 1980s, but the underlying distinction has proved its worth. With the arrival of AI agents, it provides a useful way of thinking about what is missing from most current approaches to testing.