Annex 22 on Artificial Intelligence, released for public consultation in 2025 as part of EudraLex Volume 4 – Good Manufacturing Practice (GMP) Guidelines, brings an important regulatory perspective on how AI should be applied within GMP-regulated processes.
From its first paragraphs, it is clear that this initial draft of Annex 22 adopts a highly restrictive and cautious approach, and significant changes may still occur before the guideline is officially adopted.
What models of AI are covered by Annex 22?
Annex 22 clearly limits the types of artificial intelligence considered acceptable for GMP-critical applications.
The following are explicitly excluded or discouraged:
- Generative AI systems, including Large Language Models (LLMs);
- Dynamic or continuously learning models, where the algorithm adapts during operational use.
These technologies are currently considered unsuitable for critical GMP processes, due to challenges related to traceability, reproducibility, explainability, and validation.
On the other hand, Annex 22 focuses on:
- Static AI models, which do not learn or change during use;
- Deterministic models, where the same input always produces the same output without variation.
This approach reinforces the expectation of predictability, control, and reproducibility, which are fundamental GMP principles.
Acceptance criteria and model testing in GMP environments
The use of computational models and artificial intelligence in regulated environments requires more than advanced technology. It is essential to ensure that models are reliable, traceable, and fit for their intended use, in line with GMP requirements.
1. Definition of Metrics and Acceptance Criteria
Before testing begins, it is mandatory to define how model performance will be measured.
Test metrics must be:
- Appropriate to the intended use;
- Clear and objective;
- Capable of demonstrating model reliability.
For classification models (e.g. accept/reject decisions), suitable metrics may include the following, as illustrated in the sketch after this list:
- Confusion matrix;
- Sensitivity;
- Specificity;
- Accuracy;
- Precision;
- F1 score.
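To make these metrics concrete, here is a minimal sketch, assuming Python with scikit-learn and purely illustrative labels, of how they could be computed for a binary accept/reject classifier; it is an illustration, not a method prescribed by Annex 22.

```python
# Minimal sketch: computing the metrics above for a hypothetical
# accept/reject classifier (1 = reject, 0 = accept). Labels are illustrative.
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score)

y_true = [1, 0, 1, 1, 0, 0, 1, 0, 0, 1]   # reference labels from the test set
y_pred = [1, 0, 1, 0, 0, 0, 1, 0, 1, 1]   # model predictions

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

sensitivity = tp / (tp + fn)               # true positive rate
specificity = tn / (tn + fp)               # true negative rate
accuracy = accuracy_score(y_true, y_pred)
precision = precision_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)

print(f"Sensitivity={sensitivity:.2f}  Specificity={specificity:.2f}  "
      f"Accuracy={accuracy:.2f}  Precision={precision:.2f}  F1={f1:.2f}")
```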
Acceptance criteria must:
- Be defined by a process Subject Matter Expert (SME);
- Be documented and approved prior to testing;
- Consider relevant process subgroups, where applicable.
A critical requirement:
The model’s performance must be at least equivalent to that of the process it replaces or supports.
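As a simple illustration of this requirement, the sketch below compares hypothetical model metrics against a documented baseline for the manual process it supports; both sets of numbers are assumptions used for illustration only.

```python
# Hypothetical acceptance check: the model must match or exceed the documented
# performance of the process it replaces or supports. All values are illustrative.
baseline = {"sensitivity": 0.95, "specificity": 0.90}   # documented manual inspection
measured = {"sensitivity": 0.97, "specificity": 0.92}   # model results on the test set

meets_criteria = all(measured[m] >= baseline[m] for m in baseline)
print("Acceptance criteria met" if meets_criteria else "Acceptance criteria NOT met")
```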
2. Quality and Representativeness of Test Data
Test data must accurately reflect real operational conditions.
This means the data should:
- Cover the full scope of the intended use;
- Include common and rare variations;
- Represent process limitations and complexity.
In addition:
- The dataset must be large enough to ensure statistical confidence;
- Data labeling must be highly reliable, preferably verified by independent experts;
- Any data pre-processing (normalization, transformation, standardization) must be predefined and justified, as in the pipeline sketch below.
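A minimal sketch of this idea, assuming Python with scikit-learn and hypothetical feature data: defining the pre-processing as part of a single, versioned pipeline helps ensure the same pre-approved steps are applied during training, testing, and routine use.

```python
# Minimal sketch: predefined pre-processing bundled with a static, deterministic
# classifier so the approved steps cannot silently change between phases.
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

model = Pipeline(steps=[
    ("normalize", StandardScaler()),        # pre-approved normalization step
    ("classify", LogisticRegression()),     # deterministic at inference time
])
# model.fit(X_train, y_train) would be run on training data only, so no
# test-set statistics leak into the normalization parameters.
```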
The use of synthetically generated test data or labels, including those produced with generative AI, is not recommended unless fully justified.
3. Independence of Data and Personnel
Test data independence is a fundamental GMP requirement; a minimal test-set locking sketch follows the list below.
It must be ensured that:
- Test data are not used for training or validation;
- Access to test data is strictly controlled;
- Audit trails record access and changes;
- No unauthorized copies exist outside the official repository.
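One way to support these points in practice, sketched below under the assumption of a pandas/scikit-learn workflow with hypothetical file and column names, is to set the test set aside once with a fixed random seed and record its checksum, so that later runs can verify it has not been altered or reused for training.

```python
# Minimal sketch: split once with a fixed seed, store the test set in the
# controlled repository, and record its SHA-256 so tampering or reuse can be
# detected. File names and the "label" column are hypothetical.
import hashlib
import pandas as pd
from sklearn.model_selection import train_test_split

data = pd.read_csv("labelled_inspection_data.csv")
train_val, test = train_test_split(
    data, test_size=0.2, random_state=42, stratify=data["label"]
)

test.to_csv("test_set_locked.csv", index=False)
with open("test_set_locked.csv", "rb") as fh:
    digest = hashlib.sha256(fh.read()).hexdigest()
print("Record this SHA-256 in the test plan:", digest)
```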
Regarding personnel:
- Individuals with access to test data should not be involved in model training;
- Where full separation is not possible, the four-eyes principle must be applied.
4. Test Execution and Deviation Control
Testing must demonstrate that the model:
- Is fit for its intended use;
- Generalizes well to new data;
- Does not suffer from overfitting or underfitting (a simple check is sketched after this list).
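A minimal sketch of such a generalization check, using F1 scores and illustrative thresholds that would need to be defined and justified in the test plan:

```python
# Minimal sketch: flag possible over- or underfitting from training and test
# F1 scores. The gap and minimum values are illustrative assumptions.
def generalization_check(f1_train: float, f1_test: float,
                         max_gap: float = 0.10, min_test: float = 0.80) -> str:
    if f1_train - f1_test > max_gap:
        return "possible overfitting: investigate before approval"
    if f1_test < min_test:
        return "possible underfitting: model not fit for intended use"
    return "generalization acceptable against the predefined criteria"

print(generalization_check(f1_train=0.96, f1_test=0.93))
```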
A test plan must be prepared and approved in advance, including:
- Description of the intended use;
- Defined metrics and acceptance criteria;
- Identification of test data;
- Test execution steps;
- Methods for metric calculation.
Any deviation, failure, or omission must be documented, investigated, and justified.
5. Explainability and Confidence
For critical GMP applications, models must be explainable.
Good practices include:
- Recording which features contributed to each decision;
- Using techniques such as SHAP, LIME, or heat maps, as in the sketch after this list;
- Reviewing these outputs as part of test approval.
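As an illustration of the first two points, the sketch below assumes the shap package, a tree-based classifier, and synthetic inspection features; the per-decision contributions it produces could be retained as part of the test records.

```python
# Minimal sketch: per-feature contributions (SHAP values) for each decision of
# a hypothetical tree-based accept/reject model. Data are synthetic.
import numpy as np
import shap
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.random((200, 4))                          # hypothetical inspection features
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)         # hypothetical reject label

clf = RandomForestClassifier(random_state=0).fit(X, y)
explainer = shap.TreeExplainer(clf)
shap_values = explainer.shap_values(X[:5])        # contributions per decision
print(shap_values)                                # retained with the test records
```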
Additionally:
- The system should log a confidence score for each prediction;
- Appropriate confidence thresholds must be defined;
- Low-confidence outcomes should be flagged as “undecided”, as sketched below.
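A minimal sketch of such a rule, in which the 0.90 threshold is an illustrative assumption that would have to be defined, justified, and documented for the specific application:

```python
# Minimal sketch: flag predictions below a predefined confidence threshold as
# "undecided" so they are routed to human review. Threshold is illustrative.
import numpy as np

def classify_with_confidence(proba: np.ndarray, threshold: float = 0.90) -> list:
    labels = []
    for p_accept, p_reject in proba:
        if p_reject >= threshold:
            labels.append("reject")
        elif p_accept >= threshold:
            labels.append("accept")
        else:
            labels.append("undecided")    # logged and escalated to a human
    return labels

print(classify_with_confidence(np.array([[0.97, 0.03], [0.55, 0.45]])))
# -> ['accept', 'undecided']
```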
6. Operation, Monitoring, and Human Review
Once deployed, the model must be under:
- Change control;
- Configuration control;
- Ongoing performance monitoring.
Organizations should monitor:
- Performance degradation;
- Environmental changes (e.g. lighting, equipment);
- Input data drift (a simple statistical check is sketched below).
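One simple way to watch for input data drift, sketched below under the assumption that SciPy is available and with simulated data standing in for a real process feature, is a two-sample Kolmogorov–Smirnov test comparing live inputs against the data seen at validation; the significance level is an illustrative choice.

```python
# Minimal sketch: two-sample Kolmogorov-Smirnov test comparing a feature's
# distribution at validation time with its distribution in operation.
# Data are simulated; the 0.05 significance level is an illustrative choice.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
reference = rng.normal(loc=0.0, scale=1.0, size=1000)   # validation-time data
live = rng.normal(loc=0.3, scale=1.0, size=1000)        # current operational data

result = ks_2samp(reference, live)
if result.pvalue < 0.05:
    print(f"Drift suspected (KS={result.statistic:.3f}, p={result.pvalue:.4f})")
else:
    print("No significant drift detected")
```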
When a model supports human decision-making (human-in-the-loop), particularly where reduced testing has been applied:
- Decisions must be recorded;
- Systematic review of model outputs may be required;
- Human operators must be trained, and their review activity monitored, like any other critical manual process; a minimal decision-logging sketch follows this list.
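As a simple illustration of recording such decisions, the sketch below uses only the Python standard library and an append-only JSON Lines file; the field names are illustrative assumptions, not a schema defined by Annex 22.

```python
# Minimal sketch: append each human-in-the-loop decision, together with the
# model output it relates to, to an append-only log for later review.
import json
from datetime import datetime, timezone

def record_decision(batch_id: str, model_output: str, confidence: float,
                    human_decision: str, reviewer: str,
                    log_path: str = "decision_log.jsonl") -> None:
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "batch_id": batch_id,
        "model_output": model_output,
        "confidence": confidence,
        "human_decision": human_decision,
        "reviewer": reviewer,
    }
    with open(log_path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(entry) + "\n")

record_decision("B-2025-001", "reject", 0.62, "accept", "qa.reviewer")
```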
Model validation in GMP environments goes far beyond technology. It requires governance, reliable data, human oversight, explainability, and continuous monitoring to ensure regulatory compliance, patient safety, and process integrity.
To support organizations in the safe and compliant implementation of AI models, computerized systems, and digital solutions in GMP environments, Kivalita Consulting provides specialized consultancy in validation, risk management, and regulatory compliance.
For professionals seeking hands-on, practical knowledge, the VSC 5.0 Training offers an up-to-date, applied approach to software and system validation, aligned with RDC 658/22, IN 134/22, IN 138/22, and ANVISA Guide 33. It also incorporates Annex 22 and the challenges of validating AI-based software, preparing teams to face audits and regulated operations with confidence.
Learn more at:
https://conteudo.kivalita.com.br/treinamento-vsc-5-0-46