AWOSS-VAL: Validation, Testing, And Review

AWOSS-VAL turns control promises into tests. The goal is to show which agent paths were checked, what passed, what failed, who owns the fixes or accepted risks, and when the review has to run again.

Paper controls are not enough for agentic work. Approval screens, sandboxes, logging settings, source reviews, DLP rules, and policy gates can all look reasonable until a real workflow reads context, calls tools, uses connectors, writes files, runs commands, stores memory, requests approvals, handles sensitive data, or triggers a downstream business action.

The output should be a reviewable validation packet: coverage, fixtures, findings, retests, owner decisions, and independent challenge for higher-impact systems where appropriate.

What This Family Covers

In scope:

  • Validation coverage matrices that say which candidate controls were checked, how they were checked, and which controls were not checked.
  • Review artifacts with scope, method, reviewer or owner, date, evidence references, finding status, assumptions, and claim limits.
  • Gap, exception, residual-risk, and untested-control records discovered during validation.
  • Pre-production and pre-expansion tests for approval gates, denied-action paths, source-trust controls, sensitive-data controls, logging controls, human oversight paths, incident handling, and rollback procedures.
  • Finding lifecycle records that connect validation failures to owner, remediation, accepted risk, target date, retest trigger, and closure state.
  • Repeatable validation fixtures, review checklists, policy tests, adversarial prompts, context-boundary tests, safe evidence queries, and production-log samples.
  • Recurring validation for high-impact workflows after model, prompt, source, connector, policy, boundary, data, evidence-store, monitoring, or provider changes.
  • Separated, independent, or qualified review for high-assurance validation where feasible.
  • Adversarial testing, red-team exercises, tabletop exercises, and abuse-case scenarios for material agentic workspace risks.

Out of scope:

  • Creating a general awoss certification, assessor, auditor, or public conformance program.
  • Proving that a model evaluation, red-team scan, benchmark, or single guardrail test validates the complete scoped workspace.
  • Guaranteeing absence of prompt injection, data leakage, unsafe tool use, source drift, logging gaps, or governance failure.
  • Replacing legal, regulatory, privacy, safety, employment, procurement, or sector-specific review.
  • Storing raw exploit payloads, prompts, screenshots, media, credentials, personal data, customer records, or confidential documents where synthetic fixtures, masked samples, hashes, summaries, or protected references are sufficient.

Level Summary

Levels are cumulative. Level 2 builds on Level 1, and Level 3 builds on both.

Level | Plain-language meaning | Why this level exists | Typical evidence
Level 1 | The organization knows what was checked, has at least one review artifact, and records known gaps instead of overclaiming. | A scoped system cannot support assurance discussions until reviewed controls, methods, gaps, and assumptions are visible. | Coverage matrix, review artifact, gap register, assumptions record, untested-control list.
Level 2 | Production or expanded use is preceded by meaningful tests of important gates, denied paths, data handling, logging, oversight, and rollback, with findings tracked to decision or retest. | Managed production use needs repeatable validation and a finding lifecycle, not one-off screenshots or informal signoff. | Pre-production test plan, fixture results, denial receipts, approval test records, finding tracker, retest records.
Level 3 | High-impact workflows are revalidated over time, challenged by separated or qualified reviewers where feasible, and tested against adversarial or incident scenarios. | High-assurance environments need recurring review, drift checks, independent challenge, and abuse-case coverage for material risks. | Scheduled validation runs, drift review, production-log sample, independent review summary, red-team or tabletop report.
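
The cumulative rule can be sketched in a few lines of Python. This is only an internal planning aid under stated assumptions: the function name `achieved_level` and the idea of a simple met/not-met flag per control are hypothetical and are not defined by AWOSS-VAL, and the result is not a conformance or certification outcome.

```python
from typing import Mapping


def achieved_level(controls_met: Mapping[str, bool]) -> int:
    """Return the highest level N such that every recorded control at
    levels 1..N is marked as met. Returns 0 if Level 1 is not satisfied.

    Control IDs are assumed to follow the AWOSS-VAL-L<level>-<number> pattern.
    """
    level = 0
    for candidate in (1, 2, 3):
        prefix = f"AWOSS-VAL-L{candidate}-"
        in_level = [met for cid, met in controls_met.items() if cid.startswith(prefix)]
        # A level with no recorded controls, or with any unmet control,
        # stops the climb: higher levels cannot count without it.
        if not in_level or not all(in_level):
            break
        level = candidate
    return level


status = {
    "AWOSS-VAL-L1-001": True,
    "AWOSS-VAL-L1-002": True,
    "AWOSS-VAL-L2-001": False,  # an unmet Level 2 control
    "AWOSS-VAL-L3-001": True,   # does not count while Level 2 is incomplete
}
print(achieved_level(status))  # prints 1
```

In this sketch a met Level 3 control does not raise the result while a Level 2 control is unmet, which is what "cumulative" implies.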

Candidate Controls

AWOSS-VAL-L1-001: Validation Coverage Matrix (Level 1)

Requirement summary

Identify, for each candidate control, the method used to review it (documentation review, configuration inspection, sampled evidence, manual test, automated test, or monitoring review) or record that it was not reviewed in the current draft assessment.

Why it exists

Without a coverage matrix, a team may mistake a few screenshots, eval runs, or policy notes for complete validation. The matrix makes the review method explicit and shows where no review happened.

Why this level

This belongs at Level 1 because it is the foundation for honest assurance. It does not require advanced tooling, but it requires naming the controls, methods, evidence references, and gaps.

Evidence examples

Evidence | Likely owner/provider | When collected | What it should show | Claim limit
Control coverage matrix | Evidence or audit owner with family control owners | Before assurance discussion and after material scope or control changes | Candidate controls, method used for each, evidence reference, reviewer or owner, and not-reviewed status | Shows review coverage; does not prove controls were effective.
Untested-control register | Evidence or audit owner | During each validation pass | Controls, workflows, data classes, tools, or scenarios not reviewed and why | Prevents overclaiming; does not prove untested paths are low risk.
Review-method taxonomy | Evidence or audit owner | Before validation planning and after method changes | Definitions for documentation review, configuration inspection, sampled evidence, manual test, automated test, monitoring review, and no review | Standardizes method labels; does not prove method quality.
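
A coverage matrix can be kept as simple structured data. The sketch below is one possible shape, not a format defined by AWOSS-VAL: `ReviewMethod`, `CoverageRow`, and `untested_controls` are hypothetical names, and the enum values mirror the method labels listed in the requirement summary.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Iterable, List


class ReviewMethod(Enum):
    DOCUMENTATION = "documentation review"
    CONFIGURATION = "configuration inspection"
    SAMPLED_EVIDENCE = "sampled evidence"
    MANUAL_TEST = "manual test"
    AUTOMATED_TEST = "automated test"
    MONITORING = "monitoring review"
    NOT_REVIEWED = "not reviewed"


@dataclass(frozen=True)
class CoverageRow:
    control_id: str
    method: ReviewMethod
    evidence_ref: str = ""  # pointer to evidence, not the evidence itself
    reviewer: str = ""      # reviewer or owner role


def untested_controls(rows: Iterable[CoverageRow], candidate_ids: Iterable[str]) -> List[str]:
    """Controls explicitly marked not reviewed, plus candidate controls
    that are missing from the matrix altogether."""
    reviewed = {r.control_id for r in rows if r.method is not ReviewMethod.NOT_REVIEWED}
    return sorted(set(candidate_ids) - reviewed)


matrix = [
    CoverageRow("AWOSS-VAL-L1-001", ReviewMethod.DOCUMENTATION, "ref-001", "evidence owner"),
    CoverageRow("AWOSS-VAL-L1-002", ReviewMethod.NOT_REVIEWED),
]
candidates = ["AWOSS-VAL-L1-001", "AWOSS-VAL-L1-002", "AWOSS-VAL-L1-003"]
print(untested_controls(matrix, candidates))
# prints ['AWOSS-VAL-L1-002', 'AWOSS-VAL-L1-003']
```

Treating a control that is absent from the matrix the same as one marked "not reviewed" is a deliberate choice here: it keeps silent omissions from looking like coverage.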

AWOSS-VAL-L1-002: Minimum Review Artifact (Level 1)

Requirement summary

Maintain at least one validation or review artifact for the scoped boundary before using awoss candidate controls in internal assurance discussions. Include scope, method, reviewer or owner, date, and finding status.

Why it exists

Internal assurance claims need a durable record. A conversation, meeting memory, or undocumented walkthrough cannot show later what was reviewed, by whom, against which boundary, or with what result.

Why this level

This belongs at Level 1 because every scoped system should have at least one review packet before anyone discusses awoss readiness, mapping, or control support.

Evidence examples

Evidence | Likely owner/provider | When collected | What it should show | Claim limit
Validation review packet | Evidence or audit owner | Before internal assurance discussion and after material review updates | Scoped boundary, reviewed controls, method, reviewer or owner, date, findings, gaps, and evidence references | Supports review of selected controls; does not prove conformance or complete coverage.
Reviewer signoff note | Reviewer, control owner, or evidence owner | At review completion | Reviewer identity or role, relationship to system, scope reviewed, result, and open findings | Records review participation; does not prove reviewer independence or assessor qualification.
Sample evidence bundle | Evidence owner with runtime, source, log, and governance owners | During validation packet preparation | Representative receipts, logs, configuration exports, test results, and redacted references tied to the scoped boundary | Supports sampled review; does not prove all workflows were tested.

AWOSS-VAL-L1-003: Known Gaps And Assumptions (Level 1)

Requirement summary

Record known gaps, assumptions, exceptions, residual risks, or untested controls discovered during review.

Why it exists

A useful validation pass should make uncertainty visible. Hidden assumptions and unstated exceptions are a common source of overclaiming, especially when hosted products, local desktop agents, connectors, logs, and governance records expose different evidence.

Why this level

This belongs at Level 1 because transparent gap recording is required before stronger testing, retesting, or independent review can be meaningful.

Evidence examples

Evidence | Likely owner/provider | When collected | What it should show | Claim limit
Gap and assumption register | Evidence or governance owner | During review and after findings, incidents, provider changes, or scope changes | Known gaps, assumptions, untested controls, exception references, residual risks, owners, and review dates | Shows acknowledged limitations; does not make the risk acceptable by itself.
Residual-risk note | Governance owner with control owner input | When a gap cannot be remediated before use | Risk description, affected controls, evidence basis, mitigation, owner, and expiry or review date | Supports governance review; does not prove legal or business acceptability.
Claim-limit update | Governance or evidence owner | When a gap affects internal or external wording | Claims that must be blocked, narrowed, delayed, or reviewed because of validation results | Controls wording; does not prove the underlying risk is fixed.
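
Because register entries and residual-risk notes carry owners and review or expiry dates, a register can be checked mechanically for entries that have gone stale. The following is a minimal sketch under assumed names (`GapEntry`, `overdue`); the field set is a subset of what the table above lists and is not a prescribed schema.

```python
from dataclasses import dataclass
from datetime import date
from typing import Iterable, List


@dataclass(frozen=True)
class GapEntry:
    gap_id: str
    description: str
    owner: str
    review_by: date  # expiry or review date recorded in the register


def overdue(entries: Iterable[GapEntry], today: date) -> List[str]:
    """Return IDs of entries whose review date has already passed."""
    return [e.gap_id for e in entries if e.review_by < today]


register = [
    GapEntry("GAP-1", "Connector logs not sampled", "evidence owner", date(2025, 1, 10)),
    GapEntry("GAP-2", "Rollback path untested", "runtime owner", date(2025, 6, 30)),
]
print(overdue(register, today=date(2025, 2, 1)))  # prints ['GAP-1']
```

An overdue entry is a prompt for governance review, not evidence that the underlying risk has changed.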

AWOSS-VAL-L2-001: Pre-Production And Expansion Tests (Level 2)

Requirement summary

Test or review approval gates, denied-action paths, source-trust controls, sensitive-data controls, and logging controls before production deployment or material boundary expansion. Include human oversight paths and incident or rollback procedures for high-impact workflows.

Why it exists

Production use and boundary expansion are where paper controls often fail. A new connector, memory source, source package, workflow, approval policy, file path, SaaS action, or data class can introduce paths that were never exercised.

Why this level

This belongs at Level 2 because managed production use needs practical testing of important gates and bad paths, not only a control inventory.

Evidence examples

Evidence | Likely owner/provider | When collected | What it should show | Claim limit
Pre-production validation plan | Evidence owner with runtime, workspace, source, data, log, and governance owners | Before production deployment or material boundary expansion | Approval, denial, source-trust, sensitive-data, logging, oversight, incident, and rollback tests to run | Defines tests; does not prove they passed.
Denied-path and approval test result | Runtime or evidence owner | Before production use and after policy or workflow changes | Safe fixture, expected deny or approval path, actual result, receipt ID, reviewer, and finding if bypassed | Validates named paths only; does not prove all bypasses are closed.
Rollback or emergency procedure drill | Runtime, workspace, or incident owner | Before high-impact production use and after rollback-path changes | Test workflow, stop or rollback action, restored state, owner signoff, and gaps | Tests selected rollback path; does not prove every downstream side effect is reversible.
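
A denied-path or approval test result reduces to comparing an expected path with the observed one and opening a finding when they differ. The sketch below assumes three outcome labels ("deny", "require_approval", "allow") and hypothetical names `PathTestResult` and `finding_for`; real runtimes will expose outcomes differently.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PathTestResult:
    fixture_id: str
    expected: str    # "deny" or "require_approval"
    actual: str      # observed outcome: "deny", "require_approval", or "allow"
    receipt_id: str  # reference to the runtime receipt or log entry
    reviewer: str


def finding_for(result: PathTestResult) -> Optional[str]:
    """Return a short finding summary when the observed path differs from
    the expected one, or None when the named path behaved as expected."""
    if result.actual == result.expected:
        return None
    # An action that went through without denial or approval is a bypass;
    # any other difference is recorded as a mismatch for review.
    kind = "bypass" if result.actual == "allow" else "mismatch"
    return (
        f"{kind}: fixture {result.fixture_id} expected {result.expected}, "
        f"got {result.actual} (receipt {result.receipt_id})"
    )


ok = PathTestResult("FX-DENY-01", "deny", "deny", "rcpt-100", "evidence owner")
bad = PathTestResult("FX-APPR-02", "require_approval", "allow", "rcpt-101", "evidence owner")
print(finding_for(ok))   # prints None
print(finding_for(bad))
```

A `None` result covers only the named fixture; it says nothing about paths that were not exercised.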

AWOSS-VAL-L2-002: Finding Lifecycle And Retest Triggers (Level 2)

Requirement summary

Track validation findings, remediation status, risk acceptance, owners, target dates, and retest or review triggers for material gaps.

Why it exists

A failed test should not disappear into a chat thread, spreadsheet, or informal TODO. Material validation findings need a lifecycle that records who owns the decision, what changed, whether risk was accepted, and when the issue must be retested.

Why this level

This belongs at Level 2 because production validation needs closed-loop management. Level 1 can record gaps; Level 2 must track material findings to remediation, acceptance, or retest.

Evidence examples

Evidence | Likely owner/provider | When collected | What it should show | Claim limit
Validation finding record | Evidence or security owner with affected control owner | When a validation gap is found | Finding ID, affected controls, scenario, severity or impact, owner, evidence reference, and status | Tracks finding state; does not prove remediation is sufficient.
Retest trigger record | Evidence owner or release owner | When remediation, risk acceptance, scope change, or provider change occurs | Trigger, required retest, owner, target date, fixture or scenario, and closure requirement | Schedules retest; does not prove the retest passed.
Risk acceptance record | Governance owner with evidence owner input | When a finding remains open by decision | Residual risk, rationale, owner, expiry or review date, claim limits, and compensating controls | Supports decision review; does not prove the risk is acceptable outside the named scope.
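
One way to keep findings from disappearing is to model the lifecycle as a small state machine in which closure is reachable only through a retest. The states and transitions below are illustrative assumptions, not a lifecycle defined by AWOSS-VAL; `FindingState`, `ALLOWED`, and `transition` are hypothetical names.

```python
from enum import Enum


class FindingState(Enum):
    OPEN = "open"
    REMEDIATING = "remediating"
    RISK_ACCEPTED = "risk_accepted"
    RETEST_PENDING = "retest_pending"
    CLOSED = "closed"


# Assumed transition table: a finding cannot jump from OPEN to CLOSED.
ALLOWED = {
    FindingState.OPEN: {FindingState.REMEDIATING, FindingState.RISK_ACCEPTED},
    FindingState.REMEDIATING: {FindingState.RETEST_PENDING, FindingState.RISK_ACCEPTED},
    # An expired acceptance reopens the finding or forces a retest.
    FindingState.RISK_ACCEPTED: {FindingState.OPEN, FindingState.RETEST_PENDING},
    # A failed retest reopens the finding; a passed retest closes it.
    FindingState.RETEST_PENDING: {FindingState.CLOSED, FindingState.OPEN},
    FindingState.CLOSED: set(),
}


def transition(current: FindingState, target: FindingState, owner: str) -> FindingState:
    """Return the new state, or raise ValueError if the move is not allowed
    or no owner is named for the decision."""
    if not owner:
        raise ValueError("every lifecycle change needs a named owner")
    if target not in ALLOWED[current]:
        raise ValueError(f"cannot move from {current.value} to {target.value}")
    return target


state = transition(FindingState.OPEN, FindingState.REMEDIATING, owner="control owner")
print(state.value)  # prints remediating
```

Requiring an owner on every transition mirrors the point above: each change in a finding's status should be attributable to someone who owns the decision.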

AWOSS-VAL-L2-003: Repeatable Fixtures And Review Queries (Level 2)

Requirement summary

Use repeatable validation fixtures, review checklists, policy tests, adversarial prompts, context-boundary tests, or evidence queries for recurring production reviews where practical.

Why it exists

One-off validation is hard to compare over time. Repeatable fixtures and queries let a team check whether an approval gate, denied-action path, source-trust assumption, context boundary, sensitive-data rule, log reconstruction path, or rollback path still behaves as expected.

Why this level

This belongs at Level 2 because repeatability is needed once the scoped system is used in production or expanded. The fixtures may still be manual or semi-automated, but they should be stable enough to rerun.

Evidence examples

Evidence | Likely owner/provider | When collected | What it should show | Claim limit
Validation fixture catalog | Evidence owner with control owners | Before recurring review and after fixture changes | Fixture ID, covered controls, safe setup, expected behavior, evidence fields, owner, and next review trigger | Supports repeatability; does not prove coverage of unlisted scenarios.
Review checklist or evidence query set | Evidence or audit owner | During recurring review and after evidence-source changes | Questions or queries for approvals, denials, source drift, context state, sensitive handling, logs, and findings | Helps consistent review; does not prove evidence sources are complete.
Fixture run record | Evidence owner or test owner | Each validation run | Fixture version, scoped system state, expected result, actual result, receipt IDs, findings, and retest status | Shows a named fixture result; does not prove complete workspace safety.
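
Repeatable fixtures make it possible to compare runs over time. As a rough sketch, the code below flags fixtures that passed in an earlier run but fail in the current one; `FixtureRun` and `regressions` are hypothetical names, and the record keeps only a few of the fields listed in the table above.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class FixtureRun:
    fixture_id: str
    fixture_version: str
    expected: str
    actual: str

    @property
    def passed(self) -> bool:
        return self.expected == self.actual


def regressions(previous: Iterable[FixtureRun], current: Iterable[FixtureRun]) -> List[str]:
    """Fixture IDs that passed in the previous run but fail in the current one."""
    previously_passing = {r.fixture_id for r in previous if r.passed}
    return sorted(
        r.fixture_id for r in current
        if not r.passed and r.fixture_id in previously_passing
    )


last_run = [
    FixtureRun("FX-DENY-01", "v1", "deny", "deny"),
    FixtureRun("FX-DLP-03", "v1", "block", "block"),
]
this_run = [
    FixtureRun("FX-DENY-01", "v1", "deny", "allow"),  # behavior changed
    FixtureRun("FX-DLP-03", "v1", "block", "block"),
]
print(regressions(last_run, this_run))  # prints ['FX-DENY-01']
```

The comparison is only meaningful when the fixture version is unchanged between runs; a changed fixture should be treated as a new baseline rather than a regression.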

AWOSS-VAL-L3-001: Recurring High-Impact Validation And Drift Review (Level 3)

Requirement summary

Perform recurring validation for high-impact workflows, including boundary enforcement, runtime action control, context-poisoning resistance, sensitive-data handling, logging integrity, and incident or rollback procedures, with review of drift, monitoring signals, and human-intervention records where applicable.

Why it exists

Agentic systems drift. Models, prompts, instructions, memory, retrieval corpora, tools, connectors, source versions, permissions, policies, logs, monitoring rules, providers, and business workflows can change after the first review.

Why this level

This belongs at Level 3 because high-impact workflows need stronger ongoing assurance. The focus is not continuous perfection; it is a recurring and trigger-driven review that can detect material drift and preserve evidence.

Evidence examples

Evidence | Likely owner/provider | When collected | What it should show | Claim limit
Recurring validation schedule | Governance or evidence owner | Before high-impact use and after review-cadence changes | Covered workflows, cadence, triggers, owners, fixtures, evidence sources, and escalation path | Defines cadence; does not prove reviews are effective.
Drift review packet | Evidence owner with runtime, source, context, log, and governance owners | On schedule and after material changes | Model, prompt, source, tool, connector, policy, context, data, log, finding, monitoring, and provider changes reviewed | Supports drift review; does not prove all drift was detected.
Production-log sample review | Evidence or audit owner | Periodically and after incidents or monitoring signals | Sampled workflow, receipt IDs, reconstruction result, missing fields, findings, and retest triggers | Reviews selected records only; does not prove all production activity is safe.
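
Trigger-driven review needs some way to notice that the scoped system changed since the last validation. A minimal sketch, assuming the team can export a snapshot of versions or identifiers per component: `fingerprint` hashes a canonical JSON form of the snapshot with SHA-256, and `changed_components` lists which keys differ. The snapshot keys and values shown are invented for illustration.

```python
import hashlib
import json
from typing import Any, Dict, List

_MISSING = object()  # sentinel so an absent key differs from a None value


def fingerprint(snapshot: Dict[str, Any]) -> str:
    """Stable SHA-256 digest of a snapshot, independent of key order."""
    canonical = json.dumps(snapshot, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def changed_components(baseline: Dict[str, Any], current: Dict[str, Any]) -> List[str]:
    """Names of components added, removed, or changed since the baseline."""
    keys = set(baseline) | set(current)
    return sorted(
        k for k in keys
        if baseline.get(k, _MISSING) != current.get(k, _MISSING)
    )


baseline = {"model": "model-a", "prompt": "v3", "policy": "p-12"}
current = {"model": "model-b", "prompt": "v3", "policy": "p-12", "connector": "crm"}

if fingerprint(baseline) != fingerprint(current):
    print(changed_components(baseline, current))  # prints ['connector', 'model']
```

A matching fingerprint shows only that the exported snapshot is unchanged; drift in anything the snapshot does not capture would go unnoticed, which is why sampled log review and scheduled reruns still matter.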

AWOSS-VAL-L3-002: Separated Or Qualified Review (Level 3)

Requirement summary

Use separated, independent, or qualified review for high-assurance validation where feasible, and record the reviewer relationship or qualification basis.

Why it exists

Builders are often too close to their own controls. A separated reviewer, qualified internal reviewer, model risk reviewer, red team, or external assessor can challenge assumptions, evidence quality, finding closure, and claim posture.

Why this level

This belongs at Level 3 because it adds stronger assurance and governance discipline for high-impact workflows. It also requires careful claim language because awoss does not yet define an assessor qualification or independence model.

Evidence examples

Evidence | Likely owner/provider | When collected | What it should show | Claim limit
Reviewer relationship record | Governance or evidence owner | Before high-assurance review and at review completion | Reviewer identity or role, relationship to build team, independence or separation basis, conflicts, and scope | Shows relationship; does not prove auditor independence or certification.
Qualification basis note | Governance owner or review lead | Before relying on review conclusions | Experience, role, training, domain knowledge, red-team responsibility, or external engagement scope relevant to the review | Supports reviewer selection; does not create an awoss assessor credential.
Challenge review summary | Separated reviewer, red team, or qualified reviewer | At review completion | Evidence challenged, findings opened, assumptions questioned, accepted limitations, and management response | Supports high-assurance review; does not prove complete security.

AWOSS-VAL-L3-003: Adversarial And Abuse-Case Exercises (Level 3)

Requirement summary

Include adversarial testing, red-team exercises, tabletop exercises, or abuse-case testing for material agentic workspace risks, including source-trust abuse, context manipulation, tool misuse, sensitive-data exposure, and incident-response paths.

Why it exists

Happy-path testing does not show how the system behaves when a document contains hostile instructions, a connector exposes too much data, a tool tries an unsafe action, a source changes unexpectedly, a secret appears in a prompt, or responders need to stop and reconstruct a harmful workflow.

Why this level

This belongs at Level 3 because adversarial and incident-style testing is stronger, riskier, and more specialized than basic production validation. It should be scoped, harmless by default, and tied to findings and retests.

Evidence examples

Evidence | Likely owner/provider | When collected | What it should show | Claim limit
Abuse-case scenario list | Security, evidence, or red-team owner with control owners | Before adversarial review and after risk changes | Source-trust, context poisoning, tool misuse, sensitive-data exposure, logging, rollback, and incident scenarios | Defines scenarios; does not prove all abuse paths are covered.
Red-team or adversarial test summary | Security, red-team, or evidence owner | After approved adversarial exercise | Safe payload or fixture references, expected behavior, actual behavior, findings, remediation, and retest plan | Supports scenario review; does not prove prompt-injection resistance or complete safety.
Tabletop exercise packet | Governance or incident owner with evidence owner | During scheduled exercises and after major incidents | Roles, decisions, evidence retrieved, stop or rollback path, escalation route, claim-limit decision, and improvement backlog | Tests decision-making and evidence retrieval; does not prove technical controls operated in production.

External Mapping Notes

The completed crosswalk treats AWOSS-VAL as the awoss family with the broadest external coverage. It is shaped by verification, testing, monitoring, human oversight, recurring review, vulnerability scoring, red-team, threat-modeling, risk management, and improvement themes across many sources.

Relevant external-source signals include:

  • EU AI Act official sources inform oversight, monitoring, input review, and validation evidence angles, but AWOSS-VAL does not prove legal compliance, conformity assessment, high-risk classification, or other legal judgments.
  • OWASP AISVS informs output controls, adversarial tests, drift review, and kill-switch or emergency exercises, but current public AISVS material does not create an awoss certification or complete-safety claim.
  • AIUC-1 is useful as a commercial comparator for annual review, quarterly testing, human review, and intervention records, but there is no AIUC-1 certificate equivalence.
  • OWASP Agentic Skills Top 10, OWASP AIVSS, CSA AICM, CSA MAESTRO, NIST AI RMF, NIST AI 600-1, ISO/IEC 42001, ISO/IEC 23894, Five Eyes guidance, and MITRE ATLAS inform selected testing, assessment, monitoring, red-team, risk-review, and remediation practices, but none of those sources by itself validates the complete agentic workspace boundary.
  • The tooling research notes show that practical validation support exists across eval frameworks, guardrail tests, red-team scanners, traces, logs, issue trackers, and governance records, but current tooling remains fragmented across products and layers.

Use this guide with the formal AWOSS-VAL candidate requirements. If the guide and the standard draft disagree, the standard draft controls.