Machine learning (ML) models are increasingly integrated into
safety-critical systems, from self-driving cars to aviation, making
their dependability assessment crucial. This thesis introduces novel
approaches to specify and test the functional correctness of ML
artifacts by adapting established software testing concepts.
We first address the challenge of testing action policies in sequential
decision-making problems by developing π-fuzz, a framework that uses
metamorphic relations between states to identify undesirable yet
avoidable outcomes. We then formalize these relations as k-safety
hyperproperties and introduce NOMOS, a domain-agnostic specification
language for expressing functional correctness properties of ML models.
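To illustrate the flavor of such a property (the rotation transformation, the 5-degree tolerance, and the notation below are illustrative assumptions rather than the thesis's own definitions), a metamorphic relation for an image classifier $f$ can be phrased as a 2-safety hyperproperty relating two executions of the model:
\[
\forall x_1, x_2.\;\; \big(x_2 = \mathrm{rotate}(x_1,\theta) \wedge |\theta| \le 5^{\circ}\big) \;\Longrightarrow\; f(x_1) = f(x_2).
\]
Because the requirement compares the outputs of two runs rather than constraining a single run, it cannot be stated as an ordinary trace property and is naturally expressed as a hyperproperty.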
NOMOS comes with an automated testing framework that effectively
identifies bugs across diverse domains, including image classification,
sentiment analysis, and speech recognition. We further extend NOMOS to
evaluate code translation models.
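The sketch below illustrates the kind of metamorphic test such a framework automates; it is not NOMOS syntax, and the toy classifier, the rotation transformation, and the tolerance are hypothetical placeholders chosen only for illustration.

```python
# Minimal sketch of a metamorphic rotation test for an image classifier.
# Not NOMOS syntax; the classifier, rotation, and 5-degree tolerance are
# hypothetical placeholders, not definitions taken from the thesis.
import random

import numpy as np
from scipy.ndimage import rotate


def toy_classifier(image: np.ndarray) -> int:
    """Stand-in for a real model: 'classifies' an image by mean brightness."""
    return int(image.mean() > 0.5)


def metamorphic_rotation_test(classify, images, max_angle=5.0):
    """Report inputs whose predicted label changes under a small rotation."""
    failures = []
    for image in images:
        theta = random.uniform(-max_angle, max_angle)
        label_original = classify(image)                               # first execution
        label_rotated = classify(rotate(image, theta, reshape=False))  # second execution
        if label_original != label_rotated:                            # relation violated
            failures.append((theta, label_original, label_rotated))
    return failures


if __name__ == "__main__":
    test_images = [np.random.rand(28, 28) for _ in range(100)]
    violations = metamorphic_rotation_test(toy_classifier, test_images)
    print(f"{len(violations)} metamorphic violations found")
```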
By providing these specification languages and testing frameworks, this
thesis contributes essential tools for validating the reliability and
safety of ML models in our increasingly machine-learning-dependent
world.