12.3 AI Bills of Materials (AI-BOM)¶
As organizations increasingly integrate machine learning models into their products and operations, traditional SBOMs prove insufficient. A software bill of materials tells you what code libraries are in your application, but when that application includes an ML model, critical questions remain unanswered: What data was used to train this model? Which base model was it fine-tuned from? What configuration parameters affect its behavior? These questions matter for security, compliance, and operational reasons that parallel—but extend beyond—traditional software supply chain concerns.
This section introduces the AI Bill of Materials (AI-BOM) as an emerging extension of SBOM concepts to AI/ML systems, examining standards, implementation approaches, and the regulatory pressures driving adoption.
Why Traditional SBOMs Are Insufficient¶
SBOMs designed for traditional software capture libraries, packages, and dependencies—components that execute code. AI/ML systems include additional artifact types that SBOMs weren't designed to represent.
The Gap:
| Traditional Software | AI/ML Systems |
|---|---|
| Libraries, packages | Models (weights, architectures) |
| Source code | Training data |
| Configuration files | Hyperparameters |
| Build scripts | Training pipelines |
| Binary executables | Inference configurations |
A traditional SBOM for a Python ML application would capture PyTorch, TensorFlow, or scikit-learn as dependencies, but would miss:
- The pre-trained model downloaded from Hugging Face
- The dataset used for fine-tuning
- The training configuration that shaped model behavior
- The relationship between base model and fine-tuned version
Security Implications of the Gap:
Without AI-specific inventory, organizations cannot answer critical questions:
- Does this model contain a backdoor? (Model poisoning assessment)
- Was training data appropriately sourced? (Data provenance)
- Is the model vulnerable to known attacks? (Vulnerability correlation)
- What licenses apply to this model? (License compliance)
- What are the model's known limitations? (Operational risk)
Book 1, Chapter 10 detailed ML-specific supply chain risks—pickle deserialization attacks, model backdoors, dataset poisoning. AI-BOM provides the visibility necessary to assess and manage these risks.
What an AI-BOM Includes¶
An AI-BOM extends traditional SBOM concepts to capture AI/ML-specific components and their relationships.
Core Component Types:
Models:
- Model architecture (transformer, CNN, etc.)
- Weights and parameters
- Version and lineage (base model, fine-tuned variants)
- Format (SafeTensors, ONNX, pickle, etc.)
- Hash for integrity verification
- Performance metrics and benchmarks

Datasets:
- Dataset identifier and version
- Source and provenance
- Composition (data types, distributions)
- Preprocessing applied
- Known limitations and biases
- Licensing and usage restrictions

Training Pipelines:
- Training code and scripts
- Hyperparameters used
- Random seeds (for reproducibility)
- Hardware and environment
- Training duration and cost

Configurations:
- Inference parameters (temperature, top-p, etc.)
- Quantization settings
- Deployment constraints
- Safety configurations

Dependencies:
- ML frameworks (PyTorch, TensorFlow)
- Libraries (transformers, datasets)
- Traditional software dependencies
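These component types can be modeled as plain records before any formal standard is adopted. A minimal sketch in Python (field names are illustrative, not drawn from either standard):

```python
from dataclasses import asdict, dataclass, field
from typing import List, Optional

@dataclass
class ModelComponent:
    # Fields mirror the model inventory above; names are illustrative.
    name: str
    version: str
    architecture: str                 # e.g. "transformer"
    file_format: str                  # e.g. "safetensors", "onnx", "pickle"
    sha256: str                       # hash for integrity verification
    base_model: Optional[str] = None  # lineage: what it was fine-tuned from

@dataclass
class DatasetComponent:
    name: str
    version: str
    source: str                       # provenance: where the data came from
    license: str
    known_limitations: List[str] = field(default_factory=list)

model = ModelComponent(
    name="qa-assistant", version="1.2.0", architecture="transformer",
    file_format="safetensors", sha256="abc123...", base_model="Llama-2-7b-hf",
)
print(asdict(model))
```

Records like these serialize directly to JSON, which is the bridge to a formal CycloneDX or SPDX document.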
Component Relationships:
AI-BOMs must capture relationships between components:
Fine-tuned Model
├── DERIVED_FROM: Base Model (Llama-2-7B)
├── TRAINED_ON: Dataset (internal-qa-pairs-v3)
├── USING: Training Pipeline (qa-finetune-pipeline)
└── DEPENDS_ON: Libraries (transformers, peft, torch)
These relationships enable lineage tracking and impact assessment.
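As a sketch, the edges above can be held in a plain adjacency list and walked to answer lineage questions (component names follow the diagram; the upstream pretraining-corpus entry is hypothetical):

```python
# Relationship edges for the fine-tuned model shown above:
# component -> list of (relationship, target) pairs.
RELATIONSHIPS = {
    "fine-tuned-model": [
        ("DERIVED_FROM", "Llama-2-7B"),
        ("TRAINED_ON", "internal-qa-pairs-v3"),
        ("USING", "qa-finetune-pipeline"),
        ("DEPENDS_ON", "transformers"),
        ("DEPENDS_ON", "peft"),
        ("DEPENDS_ON", "torch"),
    ],
    # Hypothetical upstream entry for the base model.
    "Llama-2-7B": [("TRAINED_ON", "llama-pretraining-corpus")],
}

def lineage(component: str) -> list:
    """Collect everything upstream via DERIVED_FROM / TRAINED_ON edges."""
    upstream, stack = [], [component]
    while stack:
        for relation, target in RELATIONSHIPS.get(stack.pop(), []):
            if relation in ("DERIVED_FROM", "TRAINED_ON") and target not in upstream:
                upstream.append(target)
                stack.append(target)
    return upstream

# Impact assessment: if a dataset is found poisoned, any model whose
# lineage includes it needs review.
print(lineage("fine-tuned-model"))
```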
CycloneDX ML-BOM Profile¶
CycloneDX version 1.5 (June 2023) introduced the Machine Learning BOM (ML-BOM) profile, extending the existing specification to support AI/ML components. The current version 1.7 (October 2025) includes further enhancements.
ML-BOM Component Types:
CycloneDX defines specific component types for ML:
{
  "bomFormat": "CycloneDX",
  "specVersion": "1.5",
  "components": [
    {
      "type": "machine-learning-model",
      "name": "sentiment-classifier-v2",
      "version": "2.1.0",
      "modelCard": {
        "modelParameters": {
          "task": "text-classification",
          "architectureFamily": "transformer",
          "modelArchitecture": "BERT",
          "inputs": [
            {
              "format": "text"
            }
          ],
          "outputs": [
            {
              "format": "classification"
            }
          ]
        }
      },
      "properties": [
        {
          "name": "quantization-bits",
          "value": "16"
        }
      ]
    },
    {
      "type": "data",
      "name": "sentiment-training-data",
      "version": "1.0.0",
      "data": [
        {
          "type": "dataset",
          "name": "sentiment-training-data",
          "contents": {
            "properties": [
              {
                "name": "sample-count",
                "value": "50000"
              }
            ]
          },
          "governance": {
            "custodians": [
              {
                "organization": {
                  "name": "Data Science Team"
                }
              }
            ],
            "owners": [
              {
                "organization": {
                  "name": "Acme Corp"
                }
              }
            ]
          }
        }
      ]
    }
  ]
}
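Because the ML-BOM is plain JSON, basic inventory queries need nothing beyond the standard library. A sketch using a trimmed version of the document above:

```python
import json

ML_BOM = json.loads("""
{
  "components": [
    {"type": "machine-learning-model", "name": "sentiment-classifier-v2",
     "version": "2.1.0"},
    {"type": "data", "name": "sentiment-training-data", "version": "1.0.0"}
  ]
}
""")

def components_of_type(bom: dict, component_type: str) -> list:
    """Filter BOM components by their CycloneDX component type."""
    return [c for c in bom.get("components", []) if c.get("type") == component_type]

models = components_of_type(ML_BOM, "machine-learning-model")
print([m["name"] for m in models])  # → ['sentiment-classifier-v2']
```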
Key ML-BOM Fields:
| Field | Purpose |
|---|---|
| `modelCard` | Structured model documentation |
| `modelParameters` | Architecture, quantization, hyperparameters |
| `inputs` / `outputs` | Data format specifications |
| `considerations` | Ethical, performance, and safety notes |
| `data.type` | `dataset`, `configuration`, `source-code`, etc. |
| `data.governance` | Ownership and custodianship |
| `data.contents` | What the dataset contains |
Advantages of CycloneDX ML-BOM:
- Integrates with existing CycloneDX tooling
- Supports VEX for vulnerability exploitation status
- Designed for security use cases
- Active community development
SPDX 3.0 AI Profile¶
SPDX version 3.0 (2024) introduces AI and dataset profiles as part of a major specification update.
SPDX 3.0 Structure:
SPDX 3.0 uses a modular profile system:
- Core profile: Basic SBOM elements
- Software profile: Software packages and files
- AI profile: Models and AI systems
- Dataset profile: Training and evaluation data
- Build profile: Build process information
AI Profile Elements:
{
"@type": "ai_AIPackage",
"name": "sentiment-classifier",
"ai_typeOfModel": ["supervised learning"],
"ai_domain": ["nlp", "sentiment-analysis"],
"ai_energyConsumption": {
"ai_finetuningEnergyConsumption": "50 kWh",
"ai_inferenceEnergyConsumption": "0.001 kWh/request"
},
"ai_informationAboutTraining": "Fine-tuned on proprietary dataset...",
"ai_limitation": "May exhibit bias toward English idioms",
"ai_modelDataPreprocessing": "Tokenized using BERT tokenizer",
"ai_safetyRiskAssessment": "low"
}
Dataset Profile Elements:
{
"@type": "dataset_DatasetPackage",
"name": "sentiment-training-data",
"dataset_dataCollectionProcess": "Web scraping with manual validation",
"dataset_dataPreprocessing": "Removed PII, normalized text",
"dataset_knownBias": "Overrepresents social media language",
"dataset_sensitivePersonalInformation": "no",
"dataset_intendedUse": "Training sentiment classifiers"
}
SPDX vs. CycloneDX for AI-BOM:
| Aspect | SPDX 3.0 | CycloneDX 1.6+ |
|---|---|---|
| AI support maturity | New (2024) | Established (2023) |
| Dataset support | Dedicated profile | Integrated |
| Energy tracking | Native fields | Extension |
| Legal/license focus | Strong | Moderate |
| Security focus | Moderate | Strong |
| Tooling | Emerging | Growing |
Both standards are viable; choice depends on organizational context and existing tooling.
Model Cards and AI-BOM¶
Model cards are structured documentation about ML models, originally proposed by Mitchell et al. in 2019. AI-BOMs and model cards serve complementary purposes.
Relationship:
- Model cards: Human-readable documentation about model characteristics, intended use, and limitations
- AI-BOM: Machine-readable inventory of model components and dependencies
Both CycloneDX and SPDX integrate model card concepts:
- CycloneDX includes `modelCard` as a component field
- SPDX AI profile fields mirror model card categories
Model Card Elements in AI-BOM:
| Model Card Section | AI-BOM Representation |
|---|---|
| Model Details | Component metadata, version |
| Intended Use | Considerations, domain fields |
| Factors | Input/output specifications |
| Metrics | Performance benchmarks |
| Ethical Considerations | Safety assessment, bias information |
| Caveats and Recommendations | Limitations, known issues |
Practical Integration:
Generate model cards from AI-BOM data for human consumption; maintain AI-BOM as source of truth for automation:
AI-BOM (machine-readable)
│
├──► Vulnerability scanning
├──► License compliance
├──► Dependency tracking
│
└──► Model Card (human-readable)
│
└──► Developer documentation
End-user transparency
Regulatory compliance
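The single-source-of-truth pattern can be sketched as a small renderer that derives the human-readable card from machine-readable fields (the field names here are illustrative, not taken from either standard):

```python
def render_model_card(component: dict) -> str:
    """Render a minimal Markdown model card from AI-BOM-style fields."""
    lines = [
        f"# Model Card: {component['name']} v{component['version']}",
        "",
        "## Intended Use",
        component.get("intended_use", "Not documented."),
        "",
        "## Limitations",
    ]
    for limitation in component.get("limitations", ["None recorded."]):
        lines.append(f"- {limitation}")
    return "\n".join(lines)

card = render_model_card({
    "name": "sentiment-classifier",
    "version": "2.1.0",
    "intended_use": "Sentiment classification of English product reviews.",
    "limitations": ["May exhibit bias toward English idioms"],
})
print(card)
```

Because the card is generated, it can never drift from the inventory it documents; regenerating it is part of the release pipeline rather than a manual chore.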
Model Provenance and Lineage¶
Model provenance documents where a model came from and how it was created. Model lineage tracks relationships between models across versions and fine-tuning.
Why Provenance Matters:
- Security: Verify model wasn't tampered with
- Compliance: Demonstrate model creation process
- Reproducibility: Enable recreation of model
- Attribution: Track intellectual property and licensing
Provenance Elements:
provenance:
origin:
source: "huggingface.co/meta-llama/Llama-2-7b-hf"
download_date: "2024-01-15"
checksum: "sha256:abc123..."
training:
start_date: "2024-01-16"
end_date: "2024-01-18"
environment:
hardware: "8x A100 80GB"
framework: "transformers 4.36.0"
code_repository: "github.com/acme/model-training"
code_commit: "def456..."
lineage:
base_model: "meta-llama/Llama-2-7b-hf"
relationship: "fine-tuned"
modifications:
- "LoRA adaptation for QA"
- "Quantized to 4-bit"
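A recorded checksum is only useful if it is re-verified whenever the artifact is fetched or loaded. A standard-library sketch of that check (the file and record are stand-ins mirroring the YAML above):

```python
import hashlib
import tempfile
from pathlib import Path

def sha256_of(path: Path) -> str:
    """Stream the file so large weight files need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_provenance(model_path: Path, recorded_checksum: str) -> bool:
    """Compare the on-disk artifact against the provenance record."""
    return f"sha256:{sha256_of(model_path)}" == recorded_checksum

# Illustrative usage with a stand-in weights file:
weights = Path(tempfile.gettempdir()) / "model.safetensors"
weights.write_bytes(b"stand-in weight bytes")
record = f"sha256:{sha256_of(weights)}"  # what the provenance doc would store
print(verify_provenance(weights, record))             # → True
print(verify_provenance(weights, "sha256:deadbeef"))  # → False
```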
Signing and Verification:
Model provenance should be cryptographically signed:
# Sign the model artifact
cosign sign-blob --key model-signing.key model.safetensors
# Attach a signed provenance attestation to the artifact
cosign attest-blob --key model-signing.key --predicate provenance.json model.safetensors
Dataset Integrity and Provenance¶
Datasets shape model behavior as much as architecture does. Dataset documentation is essential for AI-BOM completeness.
Dataset Provenance Elements:
- Source: Where data originated
- Collection method: How data was gathered
- Preprocessing: Transformations applied
- Composition: What the data contains
- Known biases: Recognized limitations
- Licensing: Usage rights and restrictions
Documentation Challenges:
Dataset provenance is often poorly documented:
- Web-scraped data has unclear provenance
- Aggregated datasets obscure original sources
- Preprocessing steps may not be recorded
- Consent and licensing may be ambiguous
Best Practices:
- Document data sources at collection time
- Record all preprocessing transformations
- Capture dataset statistics and distributions
- Note known limitations and biases
- Maintain clear licensing information
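Most of these practices reduce to writing a manifest at collection time, when the facts are still cheap to record. A minimal sketch (the manifest fields are our own, not from a standard):

```python
import hashlib
import json
import tempfile
from pathlib import Path

def dataset_manifest(path: Path, source: str, license_id: str,
                     known_biases: list) -> dict:
    """Record provenance facts that are cheap now and hard to reconstruct later."""
    raw = path.read_bytes()
    return {
        "name": path.name,
        "sha256": hashlib.sha256(raw).hexdigest(),
        "size_bytes": len(raw),
        "record_count": raw.count(b"\n"),  # crude; assumes one record per line
        "source": source,
        "license": license_id,
        "known_biases": known_biases,
    }

# Illustrative usage with a stand-in JSONL file:
data = Path(tempfile.gettempdir()) / "reviews.jsonl"
data.write_bytes(b'{"text": "great", "label": 1}\n{"text": "bad", "label": 0}\n')
manifest = dataset_manifest(
    data, source="internal support tickets", license_id="proprietary",
    known_biases=["Overrepresents English-language tickets"],
)
print(json.dumps(manifest, indent=2))
```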
Detecting Malicious Model Artifacts¶
AI-BOMs enable security analysis of model artifacts, complementing the malicious model detection discussed in Book 1, Chapter 10.
Detection Approaches:
Format Analysis:
- Identify file format (pickle, SafeTensors, ONNX)
- Flag high-risk formats (pickle) for additional scrutiny
- Verify format matches claimed type
Content Scanning:
# Scan pickle files for code execution
picklescan --path model.pkl
# Scan model for secrets
detect-secrets scan model-config.json
# Verify SafeTensors integrity
python -c "from safetensors import safe_open; safe_open('model.safetensors', framework='pt')"
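Scanners like picklescan work by walking the pickle opcode stream rather than loading it. The core idea can be sketched with the standard library's `pickletools` (a simplified illustration, not a replacement for a real scanner):

```python
import pickle
import pickletools

# Opcodes that import or invoke arbitrary callables during unpickling.
SUSPICIOUS_OPCODES = {"GLOBAL", "STACK_GLOBAL", "REDUCE", "INST", "OBJ", "NEWOBJ"}

def scan_pickle(data: bytes) -> list:
    """Return suspicious (opcode, argument) pairs without ever unpickling."""
    found = []
    for opcode, arg, _pos in pickletools.genops(data):
        if opcode.name in SUSPICIOUS_OPCODES:
            found.append((opcode.name, arg))
    return found

# Plain containers need no imports or calls:
benign = pickle.dumps({"weights": [0.1, 0.2]})
print(scan_pickle(benign))  # → []

# Pickling a callable by reference forces an import on load:
risky = pickle.dumps(print)
print(scan_pickle(risky))
```

Real scanners additionally compare imported globals against allow/deny lists; the opcode walk is what lets them do so safely.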
Provenance Verification:
- Verify cryptographic signatures
- Check provenance attestations
- Compare against known-good hashes

Behavioral Analysis:
- Test model in sandboxed environment
- Monitor for unexpected network activity
- Check for anomalous resource usage
AI-BOM Security Fields:
{
"components": [
{
"type": "machine-learning-model",
"name": "classifier",
"hashes": [
{
"alg": "SHA-256",
"content": "abc123..."
}
],
"properties": [
{
"name": "format",
"value": "safetensors"
},
{
"name": "security-scan-date",
"value": "2024-01-15"
},
{
"name": "pickle-scan-result",
"value": "not-applicable"
}
]
}
]
}
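Fields like these enable automated triage of a model inventory. A sketch that flags risky entries (the property names follow the example above; the checks and wording are illustrative):

```python
def triage_model_component(component: dict) -> list:
    """Return human-readable risk flags for one AI-BOM model entry."""
    flags = []
    props = {p["name"]: p["value"] for p in component.get("properties", [])}

    if not component.get("hashes"):
        flags.append("no integrity hash recorded")
    if props.get("format") == "pickle":
        flags.append("pickle format: code execution on load")
    if "security-scan-date" not in props:
        flags.append("never scanned")
    return flags

safe = {
    "type": "machine-learning-model", "name": "classifier",
    "hashes": [{"alg": "SHA-256", "content": "abc123..."}],
    "properties": [
        {"name": "format", "value": "safetensors"},
        {"name": "security-scan-date", "value": "2024-01-15"},
    ],
}
risky = {
    "type": "machine-learning-model", "name": "legacy-model",
    "properties": [{"name": "format", "value": "pickle"}],
}

print(triage_model_component(safe))   # no flags
print(triage_model_component(risky))
```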
Regulatory Drivers¶
Regulatory requirements increasingly mandate AI transparency, driving AI-BOM adoption.
EU AI Act (2024):
The EU AI Act requires documentation for AI systems:
- Technical documentation throughout lifecycle
- Record-keeping of training, validation, and testing data
- Transparency obligations for certain AI systems
- Risk assessment and mitigation documentation
AI-BOM provides the foundation for meeting these requirements.
Specific Requirements:
| Requirement | AI-BOM Contribution |
|---|---|
| Training data documentation | Dataset provenance |
| Model capabilities and limitations | Model card elements |
| Risk assessment | Safety considerations |
| Version control | Lineage tracking |
| Modification tracking | Change history |
Other Regulatory Drivers:
- FDA AI/ML guidance: Documentation recommendations for AI/ML-enabled medical devices
- NIST AI RMF: Risk management framework
- China AI regulations: Algorithm registration requirements
Compliance Integration:
AI-BOM supports compliance workflows:
AI-BOM Generation
│
├──► EU AI Act compliance reports
├──► FDA submission documentation
├──► Risk assessment evidence
└──► Audit trail maintenance
Tooling Landscape¶
AI-BOM tooling is less mature than traditional SBOM tooling but developing rapidly.
Generation Tools:
| Tool | Type | AI-BOM Support |
|---|---|---|
| Syft | Scanner | Limited (model file detection) |
| CycloneDX CLI | Generator | ML-BOM profile support |
| SPDX Tools | Generator | SPDX 3.0 AI profile |
| MLflow | ML Platform | Model tracking, exportable |
| Hugging Face Hub | Model Registry | Model card export |
| DVC | Data Version Control | Dataset tracking |
Current Gaps:
- Automated model scanning: Limited tools for comprehensive model analysis
- Dataset lineage: Manual tracking required in most cases
- Training pipeline capture: Integration with ML platforms needed
- Cross-platform correlation: Linking models across registries
Emerging Solutions:
- ONNX Model Hub: Standardized model format with metadata
- Hugging Face Transformers: Model card generation utilities
- MLflow Model Registry: Model lifecycle management
- Weights & Biases: Experiment and artifact tracking
Recommendations¶
For ML Engineers:
- Start documenting now. Even without formal AI-BOM tools, begin capturing model provenance, dataset sources, and training configurations.
- Use SafeTensors. Avoid pickle-based formats where possible. SafeTensors eliminates code execution risks during model loading.
- Track model lineage. Document base models, fine-tuning relationships, and version history.
- Integrate with ML platforms. Use MLflow, Weights & Biases, or similar tools to capture training metadata systematically.
For Security Practitioners:
- Extend SBOM programs to AI. Include AI-BOM in software inventory requirements. Don't treat models as black boxes.
- Scan model artifacts. Implement pickle scanning and format verification for downloaded models.
- Verify model provenance. Check signatures and attestations for models from external sources.
- Assess dataset risks. Understand what data trained the models you depend on.
For Compliance Teams:
- Map regulatory requirements. Identify which AI transparency requirements apply to your organization and products.
- Establish AI-BOM standards. Define organizational requirements for AI documentation before regulations force rushed adoption.
- Build audit trails. Capture AI-BOM information contemporaneously; reconstructing history is difficult.
- Prepare for EU AI Act. If selling AI systems in Europe, AI-BOM capability will be essential for compliance.
AI-BOM extends supply chain transparency to a new category of artifacts with distinct security and compliance considerations. While tooling and standards are still maturing, the regulatory trajectory is clear: organizations building and deploying AI systems will need to document what's in them. Starting now—even with imperfect tools—builds the institutional knowledge and practices that will be required when regulations fully take effect.