12.3 AI Bills of Materials (AI-BOM)

As organizations increasingly integrate machine learning models into their products and operations, traditional SBOMs prove insufficient. A software bill of materials tells you what code libraries are in your application, but when that application includes an ML model, critical questions remain unanswered: What data was used to train this model? Which base model was it fine-tuned from? What configuration parameters affect its behavior? These questions matter for security, compliance, and operational reasons that parallel—but extend beyond—traditional software supply chain concerns.

This section introduces the AI Bill of Materials (AI-BOM) as an emerging extension of SBOM concepts to AI/ML systems, examining standards, implementation approaches, and the regulatory pressures driving adoption.

Why Traditional SBOMs Are Insufficient

SBOMs designed for traditional software capture libraries, packages, and dependencies—components that execute code. AI/ML systems include additional artifact types that SBOMs weren't designed to represent.

The Gap:

Traditional Software     AI/ML Systems
Libraries, packages      Models (weights, architectures)
Source code              Training data
Configuration files      Hyperparameters
Build scripts            Training pipelines
Binary executables       Inference configurations

A traditional SBOM for a Python ML application would capture PyTorch, TensorFlow, or scikit-learn as dependencies, but would miss:

  • The pre-trained model downloaded from Hugging Face
  • The dataset used for fine-tuning
  • The training configuration that shaped model behavior
  • The relationship between base model and fine-tuned version

Security Implications of the Gap:

Without AI-specific inventory, organizations cannot answer critical questions:

  • Does this model contain a backdoor? (Model poisoning assessment)
  • Was training data appropriately sourced? (Data provenance)
  • Is the model vulnerable to known attacks? (Vulnerability correlation)
  • What licenses apply to this model? (License compliance)
  • What are the model's known limitations? (Operational risk)

Book 1, Chapter 10 detailed ML-specific supply chain risks—pickle deserialization attacks, model backdoors, dataset poisoning. AI-BOM provides the visibility necessary to assess and manage these risks.

What an AI-BOM Includes

An AI-BOM extends traditional SBOM concepts to capture AI/ML-specific components and their relationships.

Core Component Types:

Models:

  • Model architecture (transformer, CNN, etc.)
  • Weights and parameters
  • Version and lineage (base model, fine-tuned variants)
  • Format (SafeTensors, ONNX, pickle, etc.)
  • Hash for integrity verification
  • Performance metrics and benchmarks

Datasets:

  • Dataset identifier and version
  • Source and provenance
  • Composition (what data types, distributions)
  • Preprocessing applied
  • Known limitations and biases
  • Licensing and usage restrictions

Training Pipelines:

  • Training code and scripts
  • Hyperparameters used
  • Random seeds (for reproducibility)
  • Hardware and environment
  • Training duration and cost

Configurations:

  • Inference parameters (temperature, top-p, etc.)
  • Quantization settings
  • Deployment constraints
  • Safety configurations

Dependencies:

  • ML frameworks (PyTorch, TensorFlow)
  • Libraries (transformers, datasets)
  • Traditional software dependencies

Component Relationships:

AI-BOMs must capture relationships between components:

Fine-tuned Model
    ├── DERIVED_FROM: Base Model (Llama-2-7B)
    ├── TRAINED_ON: Dataset (internal-qa-pairs-v3)
    ├── USING: Training Pipeline (qa-finetune-pipeline)
    └── DEPENDS_ON: Libraries (transformers, peft, torch)

These relationships enable lineage tracking and impact assessment.
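The relationship records above can be queried programmatically for both lineage and impact analysis. A minimal sketch — the component name qa-assistant-v1 and the flat tuple store are illustrative, not part of any formal AI-BOM schema:

```python
# Illustrative relationship store mirroring the diagram above.
# (qa-assistant-v1 is a hypothetical name for the fine-tuned model.)
RELATIONSHIPS = [
    ("qa-assistant-v1", "DERIVED_FROM", "Llama-2-7B"),
    ("qa-assistant-v1", "TRAINED_ON", "internal-qa-pairs-v3"),
    ("qa-assistant-v1", "USING", "qa-finetune-pipeline"),
    ("qa-assistant-v1", "DEPENDS_ON", "transformers"),
]

def lineage(component, rel_type="DERIVED_FROM"):
    """Follow one relationship type transitively, e.g. base-model ancestry."""
    chain = []
    current = component
    while True:
        targets = [t for s, r, t in RELATIONSHIPS if s == current and r == rel_type]
        if not targets:
            return chain
        current = targets[0]
        chain.append(current)

def impacted_by(artifact):
    """Impact assessment: which components reference a given artifact?"""
    return [s for s, r, t in RELATIONSHIPS if t == artifact]
```

With this store, `lineage("qa-assistant-v1")` yields the base-model chain, and `impacted_by("internal-qa-pairs-v3")` answers "which models must be reassessed if this dataset turns out to be poisoned?"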

CycloneDX ML-BOM Profile

CycloneDX version 1.5 (June 2023) introduced the Machine Learning BOM (ML-BOM) profile, extending the existing specification to support AI/ML components. The current version 1.7 (October 2025) includes further enhancements.

ML-BOM Component Types:

CycloneDX defines specific component types for ML:

{
  "components": [
    {
      "type": "machine-learning-model",
      "name": "sentiment-classifier-v2",
      "version": "2.1.0",
      "modelCard": {
        "modelParameters": {
          "architecture": {
            "family": "transformer",
            "name": "BERT"
          },
          "quantization": {
            "bitsOfPrecision": 16
          }
        },
        "inputs": [
          {
            "format": "text"
          }
        ],
        "outputs": [
          {
            "format": "classification"
          }
        ]
      }
    },
    {
      "type": "data",
      "name": "sentiment-training-data",
      "version": "1.0.0",
      "data": {
        "type": "dataset",
        "contents": {
          "count": 50000,
          "type": "samples"
        },
        "governance": {
          "custodian": "Data Science Team",
          "owner": "Acme Corp"
        }
      }
    }
  ]
}

Key ML-BOM Fields:

Field             Purpose
modelCard         Structured model documentation
modelParameters   Architecture, quantization, hyperparameters
inputs / outputs  Data format specifications
considerations    Ethical, performance, and safety notes
data.type         dataset, configuration, source-code, etc.
data.governance   Ownership and custodianship
data.contents     What the dataset contains
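A minimal ML-BOM like the earlier example can be emitted with only the standard library. This sketch hand-builds the JSON, with field names taken from that example; a production pipeline would more likely use the official cyclonedx-python-lib instead:

```python
import json

def ml_bom_component(name, version, family, arch_name, bits):
    # Field names follow the CycloneDX ML-BOM example shown earlier.
    return {
        "type": "machine-learning-model",
        "name": name,
        "version": version,
        "modelCard": {
            "modelParameters": {
                "architecture": {"family": family, "name": arch_name},
                "quantization": {"bitsOfPrecision": bits},
            }
        },
    }

bom = {
    "bomFormat": "CycloneDX",
    "specVersion": "1.5",
    "components": [
        ml_bom_component("sentiment-classifier-v2", "2.1.0",
                         "transformer", "BERT", 16)
    ],
}
print(json.dumps(bom, indent=2))
```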

Advantages of CycloneDX ML-BOM:

  • Integrates with existing CycloneDX tooling
  • Supports VEX for vulnerability exploitation status
  • Designed for security use cases
  • Active community development

SPDX 3.0 AI Profile

SPDX version 3.0 (2024) introduces AI and dataset profiles as part of a major specification update.

SPDX 3.0 Structure:

SPDX 3.0 uses a modular profile system:

  • Core profile: Basic SBOM elements
  • Software profile: Software packages and files
  • AI profile: Models and AI systems
  • Dataset profile: Training and evaluation data
  • Build profile: Build process information

AI Profile Elements:

{
  "@type": "ai_AIPackage",
  "name": "sentiment-classifier",
  "ai_autonomyType": "supervised",
  "ai_domain": ["nlp", "sentiment-analysis"],
  "ai_energyConsumption": {
    "ai_finetuningEnergyConsumption": "50 kWh",
    "ai_inferenceEnergyConsumption": "0.001 kWh/request"
  },
  "ai_informationAboutTraining": "Fine-tuned on proprietary dataset...",
  "ai_limitation": "May exhibit bias toward English idioms",
  "ai_modelDataPreprocessing": "Tokenized using BERT tokenizer",
  "ai_safetyRiskAssessment": "Low risk - sentiment classification only"
}

Dataset Profile Elements:

{
  "@type": "dataset_DatasetPackage",
  "name": "sentiment-training-data",
  "dataset_dataCollectionProcess": "Web scraping with manual validation",
  "dataset_dataPreprocessing": "Removed PII, normalized text",
  "dataset_knownBias": "Overrepresents social media language",
  "dataset_sensitivePersonalInformation": "none",
  "dataset_intendedUse": "Training sentiment classifiers"
}

SPDX vs. CycloneDX for AI-BOM:

Aspect               SPDX 3.0           CycloneDX 1.6+
AI support maturity  New (2024)         Established (2023)
Dataset support      Dedicated profile  Integrated
Energy tracking      Native fields      Extension
Legal/license focus  Strong             Moderate
Security focus       Moderate           Strong
Tooling              Emerging           Growing

Both standards are viable; choice depends on organizational context and existing tooling.

Model Cards and AI-BOM

Model cards are structured documentation about ML models, originally proposed by Mitchell et al. in 2019. AI-BOMs and model cards serve complementary purposes.

Relationship:

  • Model cards: Human-readable documentation about model characteristics, intended use, and limitations
  • AI-BOM: Machine-readable inventory of model components and dependencies

Both CycloneDX and SPDX integrate model card concepts:

  • CycloneDX includes modelCard as a component field
  • SPDX AI profile fields mirror model card categories

Model Card Elements in AI-BOM:

Model Card Section           AI-BOM Representation
Model Details                Component metadata, version
Intended Use                 Considerations, domain fields
Factors                      Input/output specifications
Metrics                      Performance benchmarks
Ethical Considerations       Safety assessment, bias information
Caveats and Recommendations  Limitations, known issues

Practical Integration:

Generate model cards from AI-BOM data for human consumption; maintain AI-BOM as source of truth for automation:

AI-BOM (machine-readable)
    ├──► Vulnerability scanning
    ├──► License compliance
    ├──► Dependency tracking
    └──► Model Card (human-readable)
              └──► Developer documentation
                   End-user transparency
                   Regulatory compliance
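The "AI-BOM as source of truth" pattern can be sketched as a simple renderer: the machine-readable component record feeds automation directly, and the human-readable card is derived from it on demand. The card layout below is illustrative:

```python
def render_model_card(component):
    """Derive a human-readable model card from an AI-BOM component dict."""
    params = component["modelCard"]["modelParameters"]
    arch = params["architecture"]
    lines = [
        f"# Model Card: {component['name']} v{component['version']}",
        "",
        "## Model Details",
        f"- Architecture: {arch['family']} ({arch['name']})",
        f"- Quantization: {params['quantization']['bitsOfPrecision']}-bit",
    ]
    return "\n".join(lines)

# Component record reusing fields from the CycloneDX example earlier.
component = {
    "name": "sentiment-classifier-v2",
    "version": "2.1.0",
    "modelCard": {
        "modelParameters": {
            "architecture": {"family": "transformer", "name": "BERT"},
            "quantization": {"bitsOfPrecision": 16},
        }
    },
}
print(render_model_card(component))
```

Because the card is regenerated from the AI-BOM, documentation cannot silently drift from the inventory.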

Model Provenance and Lineage

Model provenance documents where a model came from and how it was created. Model lineage tracks relationships between models across versions and fine-tuning.

Why Provenance Matters:

  • Security: Verify model wasn't tampered with
  • Compliance: Demonstrate model creation process
  • Reproducibility: Enable recreation of model
  • Attribution: Track intellectual property and licensing

Provenance Elements:

provenance:
  origin:
    source: "huggingface.co/meta-llama/Llama-2-7b-hf"
    download_date: "2024-01-15"
    checksum: "sha256:abc123..."

  training:
    start_date: "2024-01-16"
    end_date: "2024-01-18"
    environment:
      hardware: "8x A100 80GB"
      framework: "transformers 4.36.0"
    code_repository: "github.com/acme/model-training"
    code_commit: "def456..."

  lineage:
    base_model: "meta-llama/Llama-2-7b-hf"
    relationship: "fine-tuned"
    modifications:
      - "LoRA adaptation for QA"
      - "Quantized to 4-bit"
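The checksum field in a provenance record like the one above can be computed with the standard library. A small sketch — reading in chunks keeps memory flat even for multi-gigabyte weight files:

```python
import hashlib

def model_checksum(path, chunk_size=1 << 20):
    """Compute a sha256 digest of a model file, streamed in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return "sha256:" + digest.hexdigest()

# Usage (path is illustrative):
# model_checksum("model.safetensors")  ->  "sha256:..."
```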

Signing and Verification:

Model provenance should be cryptographically signed:

# Sign model and provenance attestation
cosign sign-blob --key model-signing.key model.safetensors
cosign attest --key model-signing.key --predicate provenance.json model.safetensors

Dataset Integrity and Provenance

Datasets shape model behavior as much as architecture does. Dataset documentation is essential for AI-BOM completeness.

Dataset Provenance Elements:

  • Source: Where data originated
  • Collection method: How data was gathered
  • Preprocessing: Transformations applied
  • Composition: What the data contains
  • Known biases: Recognized limitations
  • Licensing: Usage rights and restrictions

Documentation Challenges:

Dataset provenance is often poorly documented:

  • Web-scraped data has unclear provenance
  • Aggregated datasets obscure original sources
  • Preprocessing steps may not be recorded
  • Consent and licensing may be ambiguous

Best Practices:

  1. Document data sources at collection time
  2. Record all preprocessing transformations
  3. Capture dataset statistics and distributions
  4. Note known limitations and biases
  5. Maintain clear licensing information
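Several of these practices — capturing statistics, distributions, source, and licensing at collection time — can be folded into a single record-keeping step. A sketch; the record structure is illustrative, not a formal schema:

```python
from collections import Counter
from datetime import date

def dataset_record(name, version, samples, source, license_id):
    """Capture basic dataset provenance and statistics at collection time."""
    labels = Counter(label for _, label in samples)
    return {
        "name": name,
        "version": version,
        "source": source,
        "license": license_id,
        "recorded": date.today().isoformat(),
        "count": len(samples),
        "label_distribution": dict(labels),
    }

# Hypothetical labeled samples for a sentiment dataset.
samples = [("great product", "positive"),
           ("terrible", "negative"),
           ("love it", "positive")]
record = dataset_record("sentiment-training-data", "1.0.0", samples,
                        "internal-reviews", "proprietary")
```

Recording this at collection time is cheap; reconstructing it after the dataset has been aggregated and preprocessed is often impossible.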

Detecting Malicious Model Artifacts

AI-BOMs enable security analysis of model artifacts, complementing the malicious model detection discussed in Book 1, Chapter 10.

Detection Approaches:

Format Analysis:

  • Identify file format (pickle, SafeTensors, ONNX)
  • Flag high-risk formats (pickle) for additional scrutiny
  • Verify format matches claimed type
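A first triage pass can be automated with file signatures. This sketch classifies a model artifact by its leading bytes — the pickle protocol-2+ opcode, the zip magic used by torch.save archives, the GGUF tag, and SafeTensors' 8-byte header-length prefix followed by a JSON header. It is a routing heuristic, not a substitute for a dedicated scanner such as picklescan:

```python
def classify_model_file(path):
    """Classify a model artifact by file signature (triage heuristic only)."""
    with open(path, "rb") as f:
        head = f.read(16)
    if head[:1] == b"\x80":
        return "pickle"        # bare pickle stream (protocol 2+)
    if head[:4] == b"PK\x03\x04":
        return "zip"           # torch.save archives embed pickles too
    if head[:4] == b"GGUF":
        return "gguf"
    if len(head) > 8 and head[8:9] == b"{":
        return "safetensors"   # 8-byte LE header length, then JSON header
    return "unknown"

HIGH_RISK = {"pickle", "zip"}  # formats that can execute code when loaded
```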

Content Scanning:

# Scan pickle files for code execution
picklescan --path model.pkl

# Scan model for secrets
detect-secrets scan model-config.json

# Verify SafeTensors integrity
python -c "from safetensors import safe_open; safe_open('model.safetensors', framework='pt')"

Provenance Verification:

  • Verify cryptographic signatures
  • Check provenance attestations
  • Compare against known-good hashes

Behavioral Analysis:

  • Test model in sandboxed environment
  • Monitor for unexpected network activity
  • Check for anomalous resource usage

AI-BOM Security Fields:

{
  "components": [
    {
      "type": "machine-learning-model",
      "name": "classifier",
      "hashes": [
        {
          "alg": "SHA-256",
          "content": "abc123..."
        }
      ],
      "properties": [
        {
          "name": "format",
          "value": "safetensors"
        },
        {
          "name": "security-scan-date",
          "value": "2024-01-15"
        },
        {
          "name": "pickle-scan-result",
          "value": "not-applicable"
        }
      ]
    }
  ]
}
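Security fields like those above become useful once policy checks run over them. A sketch of such a check — flag model components that lack an integrity hash or declare a pickle-based format; the property names mirror the example, and real BOMs may organize them differently:

```python
def audit_components(bom):
    """Flag ML model components with missing hashes or risky formats."""
    findings = []
    for comp in bom.get("components", []):
        if comp.get("type") != "machine-learning-model":
            continue
        props = {p["name"]: p["value"] for p in comp.get("properties", [])}
        if not comp.get("hashes"):
            findings.append((comp["name"], "missing integrity hash"))
        if props.get("format") == "pickle":
            findings.append((comp["name"], "pickle format: code execution risk"))
    return findings

# Illustrative BOM: one compliant component, one that should be flagged.
bom = {"components": [
    {"type": "machine-learning-model", "name": "classifier",
     "hashes": [{"alg": "SHA-256", "content": "abc123"}],
     "properties": [{"name": "format", "value": "safetensors"}]},
    {"type": "machine-learning-model", "name": "legacy-model",
     "properties": [{"name": "format", "value": "pickle"}]},
]}
```

Checks like this can gate CI/CD pipelines so that unverified or pickle-based models never reach deployment.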

Regulatory Drivers

Regulatory requirements increasingly mandate AI transparency, driving AI-BOM adoption.

EU AI Act (2024):

The EU AI Act requires documentation for AI systems:

  • Technical documentation throughout lifecycle
  • Record-keeping of training, validation, and testing data
  • Transparency obligations for certain AI systems
  • Risk assessment and mitigation documentation

AI-BOM provides the foundation for meeting these requirements.

Specific Requirements:

Requirement                         AI-BOM Contribution
Training data documentation         Dataset provenance
Model capabilities and limitations  Model card elements
Risk assessment                     Safety considerations
Version control                     Lineage tracking
Modification tracking               Change history

Other Regulatory Drivers:

  • FDA: Medical device cybersecurity submission requirements, including SBOMs (enforceable since October 2023), and AI/ML-enabled device guidance
  • NIST AI RMF: Risk management framework
  • China AI regulations: Algorithm registration requirements

Compliance Integration:

AI-BOM supports compliance workflows:

AI-BOM Generation
    ├──► EU AI Act compliance reports
    ├──► FDA submission documentation
    ├──► Risk assessment evidence
    └──► Audit trail maintenance

Tooling Landscape

AI-BOM tooling is less mature than traditional SBOM tooling but developing rapidly.

Generation Tools:

Tool              Type                  AI-BOM Support
Syft              Scanner               Limited (model file detection)
CycloneDX CLI     Generator             ML-BOM profile support
SPDX Tools        Generator             SPDX 3.0 AI profile
MLflow            ML Platform           Model tracking, exportable
Hugging Face Hub  Model Registry        Model card export
DVC               Data Version Control  Dataset tracking

Current Gaps:

  • Automated model scanning: Limited tools for comprehensive model analysis
  • Dataset lineage: Manual tracking required in most cases
  • Training pipeline capture: Integration with ML platforms needed
  • Cross-platform correlation: Linking models across registries

Emerging Solutions:

  • ONNX Model Hub: Standardized model format with metadata
  • Hugging Face Transformers: Model card generation utilities
  • MLflow Model Registry: Model lifecycle management
  • Weights & Biases: Experiment and artifact tracking

Recommendations

For ML Engineers:

  1. Start documenting now. Even without formal AI-BOM tools, begin capturing model provenance, dataset sources, and training configurations.

  2. Use SafeTensors. Avoid pickle-based formats where possible. SafeTensors eliminates code execution risks during model loading.

  3. Track model lineage. Document base models, fine-tuning relationships, and version history.

  4. Integrate with ML platforms. Use MLflow, Weights & Biases, or similar tools to capture training metadata systematically.

For Security Practitioners:

  1. Extend SBOM programs to AI. Include AI-BOM in software inventory requirements. Don't treat models as black boxes.

  2. Scan model artifacts. Implement pickle scanning and format verification for downloaded models.

  3. Verify model provenance. Check signatures and attestations for models from external sources.

  4. Assess dataset risks. Understand what data trained models you depend on.

For Compliance Teams:

  1. Map regulatory requirements. Identify which AI transparency requirements apply to your organization and products.

  2. Establish AI-BOM standards. Define organizational requirements for AI documentation before regulations force rushed adoption.

  3. Build audit trails. Capture AI-BOM information contemporaneously; reconstructing history is difficult.

  4. Prepare for EU AI Act. If selling AI systems in Europe, AI-BOM capability will be essential for compliance.

AI-BOM extends supply chain transparency to a new category of artifacts with distinct security and compliance considerations. While tooling and standards are still maturing, the regulatory trajectory is clear: organizations building and deploying AI systems will need to document what's in them. Starting now—even with imperfect tools—builds the institutional knowledge and practices that will be required when regulations fully take effect.