AWOSS-VAL: Validation, Testing, And Review
AWOSS-VAL turns control promises into tests. The goal is to show which agent paths were checked, what passed, what failed, who owns the fixes or accepted risks, and when the review has to run again.
Paper controls are not enough for agentic work. Approval screens, sandboxes, logging settings, source reviews, DLP rules, and policy gates can all look reasonable until a real workflow reads context, calls tools, uses connectors, writes files, runs commands, stores memory, requests approvals, handles sensitive data, or triggers a downstream business action.
The output should be a reviewable validation packet: coverage, fixtures, findings, retests, owner decisions, and independent challenge for higher-impact systems where appropriate.
What This Family Covers
In scope:
- Validation coverage matrices that say which candidate controls were checked, how they were checked, and which controls were not checked.
- Review artifacts with scope, method, reviewer or owner, date, evidence references, finding status, assumptions, and claim limits.
- Gap, exception, residual-risk, and untested-control records discovered during validation.
- Pre-production and pre-expansion tests for approval gates, denied-action paths, source-trust controls, sensitive-data controls, logging controls, human oversight paths, incident handling, and rollback procedures.
- Finding lifecycle records that connect validation failures to owner, remediation, accepted risk, target date, retest trigger, and closure state.
- Repeatable validation fixtures, review checklists, policy tests, adversarial prompts, context-boundary tests, safe evidence queries, and production-log samples.
- Recurring validation for high-impact workflows after model, prompt, source, connector, policy, boundary, data, evidence-store, monitoring, or provider changes.
- Separated, independent, or qualified review for high-assurance validation where feasible.
- Adversarial testing, red-team exercises, tabletop exercises, and abuse-case scenarios for material agentic workspace risks.
Out of scope:
- Creating a general awoss certification, assessor, auditor, or public conformance program.
- Proving that a model evaluation, red-team scan, benchmark, or single guardrail test validates the complete scoped workspace.
- Guaranteeing absence of prompt injection, data leakage, unsafe tool use, source drift, logging gaps, or governance failure.
- Replacing legal, regulatory, privacy, safety, employment, procurement, or sector-specific review.
- Storing raw exploit payloads, prompts, screenshots, media, credentials, personal data, customer records, or confidential documents where synthetic fixtures, masked samples, hashes, summaries, or protected references are sufficient.
Level Summary
Levels are cumulative. Level 2 builds on Level 1, and Level 3 builds on both.
| Level | Plain-language meaning | Why this level exists | Typical evidence |
|---|---|---|---|
| Level 1 | The organization knows what was checked, has at least one review artifact, and records known gaps instead of overclaiming. | A scoped system cannot support assurance discussions until reviewed controls, methods, gaps, and assumptions are visible. | Coverage matrix, review artifact, gap register, assumptions record, untested-control list. |
| Level 2 | Production or expanded use is preceded by meaningful tests of important gates, denied paths, data handling, logging, oversight, and rollback, with findings tracked to decision or retest. | Managed production use needs repeatable validation and a finding lifecycle, not one-off screenshots or informal signoff. | Pre-production test plan, fixture results, denial receipts, approval test records, finding tracker, retest records. |
| Level 3 | High-impact workflows are revalidated over time, challenged by separated or qualified reviewers where feasible, and tested against adversarial or incident scenarios. | High-assurance environments need recurring review, drift checks, independent challenge, and abuse-case coverage for material risks. | Scheduled validation runs, drift review, production-log sample, independent review summary, red-team or tabletop report. |
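
The cumulative rule can be sketched in a few lines of Python. The control IDs are the ones listed in this guide. The function name `highest_supported_level`, the notion of a "supported" set, and the assumption that a level needs every one of its controls are illustrative only and are not taken from the standard draft.

```python
# Hypothetical sketch: control IDs come from this guide; the rest is an assumption.
CONTROLS_BY_LEVEL = {
    1: {"AWOSS-VAL-L1-001", "AWOSS-VAL-L1-002", "AWOSS-VAL-L1-003"},
    2: {"AWOSS-VAL-L2-001", "AWOSS-VAL-L2-002", "AWOSS-VAL-L2-003"},
    3: {"AWOSS-VAL-L3-001", "AWOSS-VAL-L3-002", "AWOSS-VAL-L3-003"},
}


def highest_supported_level(supported):
    """Return the highest level whose controls, and all lower-level controls, have evidence on file."""
    level = 0
    for candidate in sorted(CONTROLS_BY_LEVEL):
        if not CONTROLS_BY_LEVEL[candidate] <= supported:
            break  # a gap at this level blocks every higher level
        level = candidate
    return level


# Level 3 evidence without Level 2 evidence still only supports Level 1.
partial = CONTROLS_BY_LEVEL[1] | CONTROLS_BY_LEVEL[3]
print(highest_supported_level(partial))  # 1
```

The result describes which level the recorded evidence could support in an internal discussion. It is not a conformance or certification statement.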
Candidate Controls
AWOSS-VAL-L1-001: Validation Coverage Matrix (Level 1)
Requirement summary
For each candidate control, identify whether it was reviewed by documentation review, configuration inspection, sampled evidence, manual test, automated test, or monitoring review, or was not reviewed in the current draft assessment.
Why it exists
Without a coverage matrix, a team may mistake a few screenshots, eval runs, or policy notes for complete validation. The matrix makes the review method explicit and shows where no review happened.
Why this level
This belongs at Level 1 because it is the foundation for honest assurance. It does not require advanced tooling, but it requires naming the controls, methods, evidence references, and gaps.
Evidence examples
| Evidence | Likely owner/provider | When collected | What it should show | Claim limit |
|---|---|---|---|---|
| Control coverage matrix | Evidence or audit owner with family control owners | Before assurance discussion and after material scope or control changes | Candidate controls, method used for each, evidence reference, reviewer or owner, and not-reviewed status | Shows review coverage; does not prove controls were effective. |
| Untested-control register | Evidence or audit owner | During each validation pass | Controls, workflows, data classes, tools, or scenarios not reviewed and why | Prevents overclaiming; does not prove untested paths are low risk. |
| Review-method taxonomy | Evidence or audit owner | Before validation planning and after method changes | Definitions for documentation review, configuration inspection, sampled evidence, manual test, automated test, monitoring review, and no review | Standardizes method labels; does not prove method quality. |
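
A minimal sketch of a coverage matrix, assuming it is kept as structured rows rather than free text. The method labels mirror the taxonomy named above. The class names, field names, and the `EV-0001` reference are hypothetical placeholders.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class ReviewMethod(Enum):
    DOCUMENTATION = "documentation review"
    CONFIGURATION = "configuration inspection"
    SAMPLED_EVIDENCE = "sampled evidence"
    MANUAL_TEST = "manual test"
    AUTOMATED_TEST = "automated test"
    MONITORING = "monitoring review"
    NOT_REVIEWED = "not reviewed"


@dataclass(frozen=True)
class CoverageRow:
    control_id: str
    method: ReviewMethod
    evidence_ref: Optional[str]  # protected reference, not the raw evidence itself
    reviewer: Optional[str]


def untested_controls(matrix: List[CoverageRow], in_scope: List[str]) -> List[str]:
    """Controls marked not reviewed, plus in-scope controls missing from the matrix."""
    reviewed = {
        row.control_id for row in matrix if row.method is not ReviewMethod.NOT_REVIEWED
    }
    return [control for control in in_scope if control not in reviewed]


# Hypothetical example rows.
matrix = [
    CoverageRow("AWOSS-VAL-L1-001", ReviewMethod.DOCUMENTATION, "EV-0001", "evidence owner"),
    CoverageRow("AWOSS-VAL-L1-002", ReviewMethod.NOT_REVIEWED, None, None),
]
scope = ["AWOSS-VAL-L1-001", "AWOSS-VAL-L1-002", "AWOSS-VAL-L1-003"]
print(untested_controls(matrix, scope))  # ['AWOSS-VAL-L1-002', 'AWOSS-VAL-L1-003']
```

Treating a control that is absent from the matrix the same as one marked not reviewed keeps the untested-control register from silently shrinking when a row is forgotten.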
AWOSS-VAL-L1-002: Minimum Review Artifact (Level 1)
Requirement summary
Maintain at least one validation or review artifact for the scoped boundary before using awoss candidate controls in internal assurance discussions. Include scope, method, reviewer or owner, date, and finding status.
Why it exists
Internal assurance claims need a durable record. A conversation, meeting memory, or undocumented walkthrough cannot show later what was reviewed, by whom, against which boundary, or with what result.
Why this level
This belongs at Level 1 because every scoped system should have at least one review packet before anyone discusses awoss readiness, mapping, or control support.
Evidence examples
| Evidence | Likely owner/provider | When collected | What it should show | Claim limit |
|---|---|---|---|---|
| Validation review packet | Evidence or audit owner | Before internal assurance discussion and after material review updates | Scoped boundary, reviewed controls, method, reviewer or owner, date, findings, gaps, and evidence references | Supports review of selected controls; does not prove conformance or complete coverage. |
| Reviewer signoff note | Reviewer, control owner, or evidence owner | At review completion | Reviewer identity or role, relationship to system, scope reviewed, result, and open findings | Records review participation; does not prove reviewer independence or assessor qualification. |
| Sample evidence bundle | Evidence owner with runtime, source, log, and governance owners | During validation packet preparation | Representative receipts, logs, configuration exports, test results, and redacted references tied to the scoped boundary | Supports sampled review; does not prove all workflows were tested. |
AWOSS-VAL-L1-003: Known Gaps And Assumptions (Level 1)
Requirement summary
Record known gaps, assumptions, exceptions, residual risks, or untested controls discovered during review.
Why it exists
A useful validation pass should make uncertainty visible. Hidden assumptions and unstated exceptions are a common source of overclaiming, especially when hosted products, local desktop agents, connectors, logs, and governance records expose different evidence.
Why this level
This belongs at Level 1 because transparent gap recording is required before stronger testing, retesting, or independent review can be meaningful.
Evidence examples
| Evidence | Likely owner/provider | When collected | What it should show | Claim limit |
|---|---|---|---|---|
| Gap and assumption register | Evidence or governance owner | During review and after findings, incidents, provider changes, or scope changes | Known gaps, assumptions, untested controls, exception references, residual risks, owners, and review dates | Shows acknowledged limitations; does not make the risk acceptable by itself. |
| Residual-risk note | Governance owner with control owner input | When a gap cannot be remediated before use | Risk description, affected controls, evidence basis, mitigation, owner, and expiry or review date | Supports governance review; does not prove legal or business acceptability. |
| Claim-limit update | Governance or evidence owner | When a gap affects internal or external wording | Claims that must be blocked, narrowed, delayed, or reviewed because of validation results | Controls wording; does not prove the underlying risk is fixed. |
AWOSS-VAL-L2-001: Pre-Production And Expansion Tests (Level 2)
Requirement summary
Test or review approval gates, denied-action paths, source-trust controls, sensitive-data controls, and logging controls before production deployment or material boundary expansion. Include human oversight paths and incident or rollback procedures for high-impact workflows.
Why it exists
Production use and boundary expansion are where paper controls often fail. A new connector, memory source, source package, workflow, approval policy, file path, SaaS action, or data class can introduce paths that were never exercised.
Why this level
This belongs at Level 2 because managed production use needs practical testing of important gates and bad paths, not only a control inventory.
Evidence examples
| Evidence | Likely owner/provider | When collected | What it should show | Claim limit |
|---|---|---|---|---|
| Pre-production validation plan | Evidence owner with runtime, workspace, source, data, log, and governance owners | Before production deployment or material boundary expansion | Approval, denial, source-trust, sensitive-data, logging, oversight, incident, and rollback tests to run | Defines tests; does not prove they passed. |
| Denied-path and approval test result | Runtime or evidence owner | Before production use and after policy or workflow changes | Safe fixture, expected deny or approval path, actual result, receipt ID, reviewer, and finding if bypassed | Validates named paths only; does not prove all bypasses are closed. |
| Rollback or emergency procedure drill | Runtime, workspace, or incident owner | Before high-impact production use and after rollback-path changes | Test workflow, stop or rollback action, restored state, owner signoff, and gaps | Tests selected rollback path; does not prove every downstream side effect is reversible. |
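
A denied-path or approval test can be recorded with a small harness like the sketch below. It assumes the gate under test can be driven by a callable that returns an outcome string and a receipt ID. `run_path_test`, the outcome labels, and the fixture and receipt IDs are all hypothetical, and the lambdas stand in for a real, harmless fixture.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass(frozen=True)
class PathTestResult:
    fixture_id: str
    expected: str               # for example "deny" or "approval_required"
    actual: str
    receipt_id: Optional[str]
    finding: Optional[str]      # set when the gate was bypassed or left no receipt


def run_path_test(
    fixture_id: str,
    expected: str,
    action: Callable[[], Tuple[str, Optional[str]]],
) -> PathTestResult:
    """Run one safe fixture and compare the observed outcome with the expected one."""
    actual, receipt_id = action()
    finding = None
    if actual != expected:
        finding = f"{fixture_id}: expected {expected}, observed {actual}"
    elif receipt_id is None:
        finding = f"{fixture_id}: outcome matched but no receipt was recorded"
    return PathTestResult(fixture_id, expected, actual, receipt_id, finding)


# Stand-in gates: one behaves as expected, one lets the action through.
ok = run_path_test("FX-DENY-1", "deny", lambda: ("deny", "RCPT-TEST-1"))
bypassed = run_path_test("FX-DENY-1", "deny", lambda: ("allowed", None))
print(ok.finding)        # None
print(bypassed.finding)  # FX-DENY-1: expected deny, observed allowed
```

A matching outcome without a receipt is treated as a finding here because the evidence table asks for a receipt ID. As the claim limit says, a passing result covers the named path only.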
AWOSS-VAL-L2-002: Finding Lifecycle And Retest Triggers (Level 2)
Requirement summary
Track validation findings, remediation status, risk acceptance, owners, target dates, and retest or review triggers for material gaps.
Why it exists
A failed test should not disappear into a chat thread, spreadsheet, or informal TODO. Material validation findings need a lifecycle that records who owns the decision, what changed, whether risk was accepted, and when the issue must be retested.
Why this level
This belongs at Level 2 because production validation needs closed-loop management. Level 1 can record gaps; Level 2 must track material findings to remediation, acceptance, or retest.
Evidence examples
| Evidence | Likely owner/provider | When collected | What it should show | Claim limit |
|---|---|---|---|---|
| Validation finding record | Evidence or security owner with affected control owner | When a validation gap is found | Finding ID, affected controls, scenario, severity or impact, owner, evidence reference, and status | Tracks finding state; does not prove remediation is sufficient. |
| Retest trigger record | Evidence owner or release owner | When remediation, risk acceptance, scope change, or provider change occurs | Trigger, required retest, owner, target date, fixture or scenario, and closure requirement | Schedules retest; does not prove the retest passed. |
| Risk acceptance record | Governance owner with evidence owner input | When a finding remains open by decision | Residual risk, rationale, owner, expiry or review date, claim limits, and compensating controls | Supports decision review; does not prove the risk is acceptable outside the named scope. |
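
The closed-loop idea can be sketched as a small state model. The status names, field names, and the two rules (closure needs a passing retest, and accepted risk needs an unexpired review date) are assumptions for illustration; the standard draft may define the lifecycle differently.

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum
from typing import Optional


class Status(Enum):
    OPEN = "open"
    REMEDIATED = "remediated, awaiting retest"
    RISK_ACCEPTED = "risk accepted"
    CLOSED = "closed"


@dataclass
class Finding:
    finding_id: str
    owner: str
    status: Status = Status.OPEN
    retest_passed: bool = False
    acceptance_expiry: Optional[date] = None


def can_close(finding: Finding) -> bool:
    """A remediated finding closes only after its retest has passed."""
    return finding.status is Status.REMEDIATED and finding.retest_passed


def needs_attention(finding: Finding, today: date) -> bool:
    """Open findings, unretested remediations, and expired or undated acceptances."""
    if finding.status is Status.OPEN:
        return True
    if finding.status is Status.REMEDIATED:
        return not finding.retest_passed
    if finding.status is Status.RISK_ACCEPTED:
        return finding.acceptance_expiry is None or finding.acceptance_expiry <= today
    return False


# Hypothetical finding: risk accepted until mid-year, reviewed after expiry.
accepted = Finding("F-001", "governance owner", Status.RISK_ACCEPTED,
                   acceptance_expiry=date(2025, 6, 30))
print(needs_attention(accepted, date(2025, 7, 1)))  # True
```

Keeping the expiry date on the record is what turns a risk acceptance into a retest or review trigger rather than a permanent exemption.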
AWOSS-VAL-L2-003: Repeatable Fixtures And Review Queries (Level 2)
Requirement summary
Use repeatable validation fixtures, review checklists, policy tests, adversarial prompts, context-boundary tests, or evidence queries for recurring production reviews where practical.
Why it exists
One-off validation is hard to compare over time. Repeatable fixtures and queries let a team check whether an approval gate, denied-action path, source-trust assumption, context boundary, sensitive-data rule, log reconstruction path, or rollback path still behaves as expected.
Why this level
This belongs at Level 2 because repeatability is needed once the scoped system is used in production or expanded. The fixtures may still be manual or semi-automated, but they should be stable enough to rerun.
Evidence examples
| Evidence | Likely owner/provider | When collected | What it should show | Claim limit |
|---|---|---|---|---|
| Validation fixture catalog | Evidence owner with control owners | Before recurring review and after fixture changes | Fixture ID, covered controls, safe setup, expected behavior, evidence fields, owner, and next review trigger | Supports repeatability; does not prove coverage of unlisted scenarios. |
| Review checklist or evidence query set | Evidence or audit owner | During recurring review and after evidence-source changes | Questions or queries for approvals, denials, source drift, context state, sensitive handling, logs, and findings | Helps consistent review; does not prove evidence sources are complete. |
| Fixture run record | Evidence owner or test owner | Each validation run | Fixture version, scoped system state, expected result, actual result, receipt IDs, findings, and retest status | Shows a named fixture result; does not prove complete workspace safety. |
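
Repeatability pays off when two runs of the same fixture can be compared. The sketch below assumes run records are stored with a fixture ID and fixture version; the `FixtureRun` fields and the `regressions` helper are hypothetical names, not part of any defined schema.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class FixtureRun:
    fixture_id: str
    fixture_version: str
    expected: str
    actual: str

    @property
    def passed(self) -> bool:
        return self.expected == self.actual


def regressions(previous: List[FixtureRun], current: List[FixtureRun]) -> List[str]:
    """Fixture IDs that passed before and fail now, compared at the same fixture version."""
    before = {(run.fixture_id, run.fixture_version): run.passed for run in previous}
    return [
        run.fixture_id
        for run in current
        if not run.passed and before.get((run.fixture_id, run.fixture_version)) is True
    ]


# Hypothetical run records from two review cycles.
last_cycle = [
    FixtureRun("FX-APPROVAL-1", "v1", "approval_required", "approval_required"),
    FixtureRun("FX-DENY-1", "v1", "deny", "deny"),
]
this_cycle = [
    FixtureRun("FX-APPROVAL-1", "v1", "approval_required", "approval_required"),
    FixtureRun("FX-DENY-1", "v1", "deny", "allowed"),
]
print(regressions(last_cycle, this_cycle))  # ['FX-DENY-1']
```

Comparing only at the same fixture version is a deliberate choice in this sketch: a failure after the fixture itself changed may reflect the new fixture rather than a change in the scoped system, so it needs a separate look.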
AWOSS-VAL-L3-001: Recurring High-Impact Validation And Drift Review (Level 3)
Requirement summary
Perform recurring validation for high-impact workflows, including boundary enforcement, runtime action control, context-poisoning resistance, sensitive-data handling, logging integrity, and incident or rollback procedures, with review of drift, monitoring signals, and human-intervention records where applicable.
Why it exists
Agentic systems drift. Models, prompts, instructions, memory, retrieval corpora, tools, connectors, source versions, permissions, policies, logs, monitoring rules, providers, and business workflows can change after the first review.
Why this level
This belongs at Level 3 because high-impact workflows need stronger ongoing assurance. The focus is not continuous perfection; it is a recurring and trigger-driven review that can detect material drift and preserve evidence.
Evidence examples
| Evidence | Likely owner/provider | When collected | What it should show | Claim limit |
|---|---|---|---|---|
| Recurring validation schedule | Governance or evidence owner | Before high-impact use and after review-cadence changes | Covered workflows, cadence, triggers, owners, fixtures, evidence sources, and escalation path | Defines cadence; does not prove reviews are effective. |
| Drift review packet | Evidence owner with runtime, source, context, log, and governance owners | On schedule and after material changes | Model, prompt, source, tool, connector, policy, context, data, log, finding, monitoring, and provider changes reviewed | Supports drift review; does not prove all drift was detected. |
| Production-log sample review | Evidence or audit owner | Periodically and after incidents or monitoring signals | Sampled workflow, receipt IDs, reconstruction result, missing fields, findings, and retest triggers | Reviews selected records only; does not prove all production activity is safe. |
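
One simple input to a drift review packet is a comparison between the component versions recorded at the last validation and those recorded now. The sketch below assumes both snapshots are available as name-to-version mappings; the component names and version strings are made up for illustration.

```python
from typing import Dict, Optional, Tuple


def drift(
    baseline: Dict[str, str], current: Dict[str, str]
) -> Dict[str, Tuple[Optional[str], Optional[str]]]:
    """Return components whose recorded version changed, appeared, or disappeared."""
    changed = {}
    for name in sorted(baseline.keys() | current.keys()):
        old, new = baseline.get(name), current.get(name)
        if old != new:
            changed[name] = (old, new)
    return changed


# Hypothetical snapshots taken at the last validation and today.
at_last_validation = {"model": "m-1", "prompt": "p-3", "connector:crm": "1.2"}
today = {"model": "m-2", "prompt": "p-3", "policy": "7"}

for component, (old, new) in drift(at_last_validation, today).items():
    print(component, old, "->", new)
# connector:crm 1.2 -> None
# model m-1 -> m-2
# policy None -> 7
```

A comparison like this only sees what the snapshots record. Changes that never reach the inventory, such as a provider-side model update behind a stable name, would not appear, which matches the claim limit on the drift review packet.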
AWOSS-VAL-L3-002: Separated Or Qualified Review (Level 3)
Requirement summary
Use separated, independent, or qualified review for high-assurance validation where feasible, and record the reviewer relationship or qualification basis.
Why it exists
Builders are often too close to their own controls. A separated reviewer, qualified internal reviewer, model risk reviewer, red team, or external assessor can challenge assumptions, evidence quality, finding closure, and claim posture.
Why this level
This belongs at Level 3 because it adds stronger assurance and governance discipline for high-impact workflows. It also requires careful claim language because awoss does not yet define an assessor qualification or independence model.
Evidence examples
| Evidence | Likely owner/provider | When collected | What it should show | Claim limit |
|---|---|---|---|---|
| Reviewer relationship record | Governance or evidence owner | Before high-assurance review and at review completion | Reviewer identity or role, relationship to build team, independence or separation basis, conflicts, and scope | Shows relationship; does not prove auditor independence or certification. |
| Qualification basis note | Governance owner or review lead | Before relying on review conclusions | Experience, role, training, domain knowledge, red-team responsibility, or external engagement scope relevant to the review | Supports reviewer selection; does not create an awoss assessor credential. |
| Challenge review summary | Separated reviewer, red team, or qualified reviewer | At review completion | Evidence challenged, findings opened, assumptions questioned, accepted limitations, and management response | Supports high-assurance review; does not prove complete security. |
AWOSS-VAL-L3-003: Adversarial And Abuse-Case Exercises (Level 3)
Requirement summary
Include adversarial testing, red-team exercises, tabletop exercises, or abuse-case testing for material agentic workspace risks, including source-trust abuse, context manipulation, tool misuse, sensitive-data exposure, and incident-response paths.
Why it exists
Happy-path testing does not show how the system behaves when a document contains hostile instructions, a connector exposes too much data, a tool tries an unsafe action, a source changes unexpectedly, a secret appears in a prompt, or responders need to stop and reconstruct a harmful workflow.
Why this level
This belongs at Level 3 because adversarial and incident-style testing is stronger, riskier, and more specialized than basic production validation. It should be scoped, harmless by default, and tied to findings and retests.
Evidence examples
| Evidence | Likely owner/provider | When collected | What it should show | Claim limit |
|---|---|---|---|---|
| Abuse-case scenario list | Security, evidence, or red-team owner with control owners | Before adversarial review and after risk changes | Source-trust, context poisoning, tool misuse, sensitive-data exposure, logging, rollback, and incident scenarios | Defines scenarios; does not prove all abuse paths are covered. |
| Red-team or adversarial test summary | Security, red-team, or evidence owner | After approved adversarial exercise | Safe payload or fixture references, expected behavior, actual behavior, findings, remediation, and retest plan | Supports scenario review; does not prove prompt-injection resistance or complete safety. |
| Tabletop exercise packet | Governance or incident owner with evidence owner | During scheduled exercises and after major incidents | Roles, decisions, evidence retrieved, stop or rollback path, escalation route, claim-limit decision, and improvement backlog | Tests decision-making and evidence retrieval; does not prove technical controls operated in production. |
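
A scenario list can be checked against the risk areas named in the requirement summary before an exercise is approved. In the sketch below the category labels paraphrase that summary, while the scenario IDs, the dictionary layout, and the `uncovered_categories` helper are hypothetical.

```python
from typing import Dict, List, Set

# Risk areas paraphrased from the requirement summary for this control.
REQUIRED_CATEGORIES = {
    "source-trust abuse",
    "context manipulation",
    "tool misuse",
    "sensitive-data exposure",
    "incident response",
}


def uncovered_categories(scenarios: List[Dict[str, str]]) -> Set[str]:
    """Risk areas that have no scenario in the list at all."""
    covered = {scenario["category"] for scenario in scenarios}
    return REQUIRED_CATEGORIES - covered


# Hypothetical scenario list that references safe fixtures, not raw payloads.
scenarios = [
    {"id": "AB-01", "category": "context manipulation", "fixture": "FX-INJECT-1"},
    {"id": "AB-02", "category": "tool misuse", "fixture": "FX-TOOL-1"},
    {"id": "AB-03", "category": "incident response", "fixture": "TT-STOP-1"},
]
print(sorted(uncovered_categories(scenarios)))
# ['sensitive-data exposure', 'source-trust abuse']
```

Having at least one scenario per area shows the list is not missing a whole risk area. It says nothing about whether the scenarios within an area are sufficient.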
External Mapping Notes
The completed crosswalk treats AWOSS-VAL as the most broadly covered awoss family. It is shaped by verification, testing, monitoring, human oversight, recurring review, vulnerability scoring, red-team, threat-modeling, risk management, and improvement themes across many sources.
Relevant external-source signals include:
- EU AI Act official sources inform oversight, monitoring, input review, and validation evidence angles, but AWOSS-VAL does not prove legal compliance, conformity assessment, high-risk classification, or other legal judgments.
- OWASP AISVS informs output controls, adversarial tests, drift review, and kill-switch or emergency exercises, but current public AISVS material does not create an awoss certification or complete-safety claim.
- AIUC-1 is useful as a commercial comparator for annual review, quarterly testing, human review, and intervention records, but there is no AIUC-1 certificate equivalence.
- OWASP Agentic Skills Top 10, OWASP AIVSS, CSA AICM, CSA MAESTRO, NIST AI RMF, NIST AI 600-1, ISO/IEC 42001, ISO/IEC 23894, Five Eyes guidance, and MITRE ATLAS inform selected testing, assessment, monitoring, red-team, risk-review, and remediation practices, but none of those sources by itself validates the complete agentic workspace boundary.
- The tooling research notes show that practical validation support exists across eval frameworks, guardrail tests, red-team scanners, traces, logs, issue trackers, and governance records, but current tooling remains fragmented across products and layers.
Formal Standard Link
Use this guide with the formal AWOSS-VAL candidate requirements. If the guide and the standard draft disagree, the standard draft controls.