12.3 AI Bills of Materials (AI-BOM)

As organizations increasingly integrate machine learning models into their products and operations, traditional SBOMs prove insufficient. A software bill of materials tells you what code libraries are in your application, but when that application includes an ML model, critical questions remain unanswered: What data was used to train this model? Which base model was it fine-tuned from? What configuration parameters affect its behavior? These questions matter for security, compliance, and operational reasons that parallel—but extend beyond—traditional software supply chain concerns.

This section introduces the AI Bill of Materials (AI-BOM) as an emerging extension of SBOM concepts to AI/ML systems, examining standards, implementation approaches, and the regulatory pressures driving adoption.

Why Traditional SBOMs Are Insufficient

SBOMs designed for traditional software capture libraries, packages, and dependencies—components that execute code. AI/ML systems include additional artifact types that SBOMs weren't designed to represent.

The Gap:

Traditional Software     AI/ML Systems
Libraries, packages      Models (weights, architectures)
Source code              Training data
Configuration files      Hyperparameters
Build scripts            Training pipelines
Binary executables       Inference configurations

A traditional SBOM for a Python ML application would capture PyTorch, TensorFlow, or scikit-learn as dependencies, but would miss:

  • The pre-trained model downloaded from Hugging Face
  • The dataset used for fine-tuning
  • The training configuration that shaped model behavior
  • The relationship between base model and fine-tuned version

Security Implications of the Gap:

Without AI-specific inventory, organizations cannot answer critical questions:

  • Does this model contain a backdoor? (Model poisoning assessment)
  • Was training data appropriately sourced? (Data provenance)
  • Is the model vulnerable to known attacks? (Vulnerability correlation)
  • What licenses apply to this model? (License compliance)
  • What are the model's known limitations? (Operational risk)

Book 1, Chapter 10 detailed ML-specific supply chain risks—pickle deserialization attacks, model backdoors, dataset poisoning. AI-BOM provides the visibility necessary to assess and manage these risks.

What an AI-BOM Includes

An AI-BOM extends traditional SBOM concepts to capture AI/ML-specific components and their relationships.

Core Component Types:

Models:

  • Model architecture (transformer, CNN, etc.)
  • Weights and parameters
  • Version and lineage (base model, fine-tuned variants)
  • Format (SafeTensors, ONNX, pickle, etc.)
  • Hash for integrity verification
  • Performance metrics and benchmarks

Datasets:

  • Dataset identifier and version
  • Source and provenance
  • Composition (what data types, distributions)
  • Preprocessing applied
  • Known limitations and biases
  • Licensing and usage restrictions

Training Pipelines:

  • Training code and scripts
  • Hyperparameters used
  • Random seeds (for reproducibility)
  • Hardware and environment
  • Training duration and cost

Configurations:

  • Inference parameters (temperature, top-p, etc.)
  • Quantization settings
  • Deployment constraints
  • Safety configurations

Dependencies:

  • ML frameworks (PyTorch, TensorFlow)
  • Libraries (transformers, datasets)
  • Traditional software dependencies

Component Relationships:

AI-BOMs must capture relationships between components:

Fine-tuned Model
    ├── DERIVED_FROM: Base Model (Llama-2-7B)
    ├── TRAINED_ON: Dataset (internal-qa-pairs-v3)
    ├── USING: Training Pipeline (qa-finetune-pipeline)
    └── DEPENDS_ON: Libraries (transformers, peft, torch)

These relationships enable lineage tracking and impact assessment.
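The relationship records above can be queried programmatically for both lineage and impact analysis. A minimal sketch — the component name qa-assistant-v1 and the flat tuple store are illustrative, not part of any formal AI-BOM schema:

```python
# Illustrative relationship store mirroring the diagram above.
# (qa-assistant-v1 is a hypothetical name for the fine-tuned model.)
RELATIONSHIPS = [
    ("qa-assistant-v1", "DERIVED_FROM", "Llama-2-7B"),
    ("qa-assistant-v1", "TRAINED_ON", "internal-qa-pairs-v3"),
    ("qa-assistant-v1", "USING", "qa-finetune-pipeline"),
    ("qa-assistant-v1", "DEPENDS_ON", "transformers"),
]

def lineage(component, rel_type="DERIVED_FROM"):
    """Follow one relationship type transitively, e.g. base-model ancestry."""
    chain = []
    current = component
    while True:
        targets = [t for s, r, t in RELATIONSHIPS if s == current and r == rel_type]
        if not targets:
            return chain
        current = targets[0]
        chain.append(current)

def impacted_by(artifact):
    """Impact assessment: which components reference a given artifact?"""
    return [s for s, r, t in RELATIONSHIPS if t == artifact]
```

With this store, `lineage("qa-assistant-v1")` yields the base-model chain, and `impacted_by("internal-qa-pairs-v3")` answers "which models must be reassessed if this dataset turns out to be poisoned?"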

CycloneDX ML-BOM Profile

CycloneDX version 1.5 (June 2023) introduced the Machine Learning BOM (ML-BOM) profile, extending the existing specification to support AI/ML components. The current version 1.7 (October 2025) includes further enhancements.

ML-BOM Component Types:

CycloneDX defines specific component types for ML:

{
  "components": [
    {
      "type": "machine-learning-model",
      "name": "sentiment-classifier-v2",
      "version": "2.1.0",
      "modelCard": {
        "modelParameters": {
          "architecture": {
            "family": "transformer",
            "name": "BERT"
          },
          "quantization": {
            "bitsOfPrecision": 16
          }
        },
        "inputs": [
          {
            "format": "text"
          }
        ],
        "outputs": [
          {
            "format": "classification"
          }
        ]
      }
    },
    {
      "type": "data",
      "name": "sentiment-training-data",
      "version": "1.0.0",
      "data": {
        "type": "dataset",
        "contents": {
          "count": 50000,
          "type": "samples"
        },
        "governance": {
          "custodian": "Data Science Team",
          "owner": "Acme Corp"
        }
      }
    }
  ]
}

Key ML-BOM Fields:

Field             Purpose
modelCard         Structured model documentation
modelParameters   Architecture, quantization, hyperparameters
inputs / outputs  Data format specifications
considerations    Ethical, performance, and safety notes
data.type         dataset, configuration, source-code, etc.
data.governance   Ownership and custodianship
data.contents     What the dataset contains
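A minimal ML-BOM like the earlier example can be emitted with only the standard library. This sketch hand-builds the JSON, with field names taken from that example; a production pipeline would more likely use the official cyclonedx-python-lib instead:

```python
import json

def ml_bom_component(name, version, family, arch_name, bits):
    # Field names follow the CycloneDX ML-BOM example shown earlier.
    return {
        "type": "machine-learning-model",
        "name": name,
        "version": version,
        "modelCard": {
            "modelParameters": {
                "architecture": {"family": family, "name": arch_name},
                "quantization": {"bitsOfPrecision": bits},
            }
        },
    }

bom = {
    "bomFormat": "CycloneDX",
    "specVersion": "1.5",
    "components": [
        ml_bom_component("sentiment-classifier-v2", "2.1.0",
                         "transformer", "BERT", 16)
    ],
}
print(json.dumps(bom, indent=2))
```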

Advantages of CycloneDX ML-BOM:

  • Integrates with existing CycloneDX tooling
  • Supports VEX for vulnerability exploitation status
  • Designed for security use cases
  • Active community development

SPDX 3.0 AI Profile

SPDX version 3.0 (2024) introduces AI and dataset profiles as part of a major specification update.

SPDX 3.0 Structure:

SPDX 3.0 uses a modular profile system:

  • Core profile: Basic SBOM elements
  • Software profile: Software packages and files
  • AI profile: Models and AI systems
  • Dataset profile: Training and evaluation data
  • Build profile: Build process information

AI Profile Elements:

{
  "@type": "ai_AIPackage",
  "name": "sentiment-classifier",
  "ai_autonomyType": "supervised",
  "ai_domain": ["nlp", "sentiment-analysis"],
  "ai_energyConsumption": {
    "ai_finetuningEnergyConsumption": "50 kWh",
    "ai_inferenceEnergyConsumption": "0.001 kWh/request"
  },
  "ai_informationAboutTraining": "Fine-tuned on proprietary dataset...",
  "ai_limitation": "May exhibit bias toward English idioms",
  "ai_modelDataPreprocessing": "Tokenized using BERT tokenizer",
  "ai_safetyRiskAssessment": "Low risk - sentiment classification only"
}

Dataset Profile Elements:

{
  "@type": "dataset_DatasetPackage",
  "name": "sentiment-training-data",
  "dataset_dataCollectionProcess": "Web scraping with manual validation",
  "dataset_dataPreprocessing": "Removed PII, normalized text",
  "dataset_knownBias": "Overrepresents social media language",
  "dataset_sensitivePersonalInformation": "none",
  "dataset_intendedUse": "Training sentiment classifiers"
}

SPDX vs. CycloneDX for AI-BOM:

Aspect               SPDX 3.0           CycloneDX 1.6+
AI support maturity  New (2024)         Established (2023)
Dataset support      Dedicated profile  Integrated
Energy tracking      Native fields      Extension
Legal/license focus  Strong             Moderate
Security focus       Moderate           Strong
Tooling              Emerging           Growing

Both standards are viable; choice depends on organizational context and existing tooling.

Model Cards and AI-BOM

Model cards are structured documentation about ML models, originally proposed by Mitchell et al. in 2019. AI-BOMs and model cards serve complementary purposes.

Relationship:

  • Model cards: Human-readable documentation about model characteristics, intended use, and limitations
  • AI-BOM: Machine-readable inventory of model components and dependencies

Both CycloneDX and SPDX integrate model card concepts:

  • CycloneDX includes modelCard as a component field
  • SPDX AI profile fields mirror model card categories

Model Card Elements in AI-BOM:

Model Card Section           AI-BOM Representation
Model Details                Component metadata, version
Intended Use                 Considerations, domain fields
Factors                      Input/output specifications
Metrics                      Performance benchmarks
Ethical Considerations       Safety assessment, bias information
Caveats and Recommendations  Limitations, known issues

Practical Integration:

Generate model cards from AI-BOM data for human consumption; maintain AI-BOM as source of truth for automation:

AI-BOM (machine-readable)
    ├──► Vulnerability scanning
    ├──► License compliance
    ├──► Dependency tracking
    └──► Model Card (human-readable)
              └──► Developer documentation
                   End-user transparency
                   Regulatory compliance
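The "AI-BOM as source of truth" pattern can be sketched as a simple renderer: the machine-readable component record feeds automation directly, and the human-readable card is derived from it on demand. The card layout below is illustrative:

```python
def render_model_card(component):
    """Derive a human-readable model card from an AI-BOM component dict."""
    params = component["modelCard"]["modelParameters"]
    arch = params["architecture"]
    lines = [
        f"# Model Card: {component['name']} v{component['version']}",
        "",
        "## Model Details",
        f"- Architecture: {arch['family']} ({arch['name']})",
        f"- Quantization: {params['quantization']['bitsOfPrecision']}-bit",
    ]
    return "\n".join(lines)

# Component record reusing fields from the CycloneDX example earlier.
component = {
    "name": "sentiment-classifier-v2",
    "version": "2.1.0",
    "modelCard": {
        "modelParameters": {
            "architecture": {"family": "transformer", "name": "BERT"},
            "quantization": {"bitsOfPrecision": 16},
        }
    },
}
print(render_model_card(component))
```

Because the card is regenerated from the AI-BOM, documentation cannot silently drift from the inventory.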

Model Provenance and Lineage

Model provenance documents where a model came from and how it was created. Model lineage tracks relationships between models across versions and fine-tuning.

Why Provenance Matters:

  • Security: Verify model wasn't tampered with
  • Compliance: Demonstrate model creation process
  • Reproducibility: Enable recreation of model
  • Attribution: Track intellectual property and licensing

Provenance Elements:

provenance:
  origin:
    source: "huggingface.co/meta-llama/Llama-2-7b-hf"
    download_date: "2024-01-15"
    checksum: "sha256:abc123..."

  training:
    start_date: "2024-01-16"
    end_date: "2024-01-18"
    environment:
      hardware: "8x A100 80GB"
      framework: "transformers 4.36.0"
    code_repository: "github.com/acme/model-training"
    code_commit: "def456..."

  lineage:
    base_model: "meta-llama/Llama-2-7b-hf"
    relationship: "fine-tuned"
    modifications:
      - "LoRA adaptation for QA"
      - "Quantized to 4-bit"
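The checksum field in a provenance record like the one above can be computed with the standard library. A small sketch — reading in chunks keeps memory flat even for multi-gigabyte weight files:

```python
import hashlib

def model_checksum(path, chunk_size=1 << 20):
    """Compute a sha256 digest of a model file, streamed in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return "sha256:" + digest.hexdigest()

# Usage (path is illustrative):
# model_checksum("model.safetensors")  ->  "sha256:..."
```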

Signing and Verification:

Model provenance should be cryptographically signed:

# Sign model and provenance attestation
cosign sign-blob --key model-signing.key model.safetensors
cosign attest --key model-signing.key --predicate provenance.json model.safetensors

Dataset Integrity and Provenance

Datasets shape model behavior as much as architecture does. Dataset documentation is essential for AI-BOM completeness.

Dataset Provenance Elements:

  • Source: Where data originated
  • Collection method: How data was gathered
  • Preprocessing: Transformations applied
  • Composition: What the data contains
  • Known biases: Recognized limitations
  • Licensing: Usage rights and restrictions

Documentation Challenges:

Dataset provenance is often poorly documented:

  • Web-scraped data has unclear provenance
  • Aggregated datasets obscure original sources
  • Preprocessing steps may not be recorded
  • Consent and licensing may be ambiguous

Best Practices:

  1. Document data sources at collection time
  2. Record all preprocessing transformations
  3. Capture dataset statistics and distributions
  4. Note known limitations and biases
  5. Maintain clear licensing information
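Several of these practices — capturing statistics, distributions, source, and licensing at collection time — can be folded into a single record-keeping step. A sketch; the record structure is illustrative, not a formal schema:

```python
from collections import Counter
from datetime import date

def dataset_record(name, version, samples, source, license_id):
    """Capture basic dataset provenance and statistics at collection time."""
    labels = Counter(label for _, label in samples)
    return {
        "name": name,
        "version": version,
        "source": source,
        "license": license_id,
        "recorded": date.today().isoformat(),
        "count": len(samples),
        "label_distribution": dict(labels),
    }

# Hypothetical labeled samples for a sentiment dataset.
samples = [("great product", "positive"),
           ("terrible", "negative"),
           ("love it", "positive")]
record = dataset_record("sentiment-training-data", "1.0.0", samples,
                        "internal-reviews", "proprietary")
```

Recording this at collection time is cheap; reconstructing it after the dataset has been aggregated and preprocessed is often impossible.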

Detecting Malicious Model Artifacts

AI-BOMs enable security analysis of model artifacts, complementing the malicious model detection discussed in Book 1, Chapter 10.

Detection Approaches:

Format Analysis:

  • Identify file format (pickle, SafeTensors, ONNX)
  • Flag high-risk formats (pickle) for additional scrutiny
  • Verify format matches claimed type
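A first triage pass can be automated with file signatures. This sketch classifies a model artifact by its leading bytes — the pickle protocol-2+ opcode, the zip magic used by torch.save archives, the GGUF tag, and SafeTensors' 8-byte header-length prefix followed by a JSON header. It is a routing heuristic, not a substitute for a dedicated scanner such as picklescan:

```python
def classify_model_file(path):
    """Classify a model artifact by file signature (triage heuristic only)."""
    with open(path, "rb") as f:
        head = f.read(16)
    if head[:1] == b"\x80":
        return "pickle"        # bare pickle stream (protocol 2+)
    if head[:4] == b"PK\x03\x04":
        return "zip"           # torch.save archives embed pickles too
    if head[:4] == b"GGUF":
        return "gguf"
    if len(head) > 8 and head[8:9] == b"{":
        return "safetensors"   # 8-byte LE header length, then JSON header
    return "unknown"

HIGH_RISK = {"pickle", "zip"}  # formats that can execute code when loaded
```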

Content Scanning:

# Scan pickle files for code execution
picklescan --path model.pkl

# Scan model for secrets
detect-secrets scan model-config.json

# Verify SafeTensors integrity
python -c "from safetensors import safe_open; safe_open('model.safetensors', framework='pt')"

Provenance Verification:

  • Verify cryptographic signatures
  • Check provenance attestations
  • Compare against known-good hashes

Behavioral Analysis:

  • Test model in sandboxed environment
  • Monitor for unexpected network activity
  • Check for anomalous resource usage

AI-BOM Security Fields:

{
  "components": [
    {
      "type": "machine-learning-model",
      "name": "classifier",
      "hashes": [
        {
          "alg": "SHA-256",
          "content": "abc123..."
        }
      ],
      "properties": [
        {
          "name": "format",
          "value": "safetensors"
        },
        {
          "name": "security-scan-date",
          "value": "2024-01-15"
        },
        {
          "name": "pickle-scan-result",
          "value": "not-applicable"
        }
      ]
    }
  ]
}
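Security fields like those above become useful once policy checks run over them. A sketch of such a check — flag model components that lack an integrity hash or declare a pickle-based format; the property names mirror the example, and real BOMs may organize them differently:

```python
def audit_components(bom):
    """Flag ML model components with missing hashes or risky formats."""
    findings = []
    for comp in bom.get("components", []):
        if comp.get("type") != "machine-learning-model":
            continue
        props = {p["name"]: p["value"] for p in comp.get("properties", [])}
        if not comp.get("hashes"):
            findings.append((comp["name"], "missing integrity hash"))
        if props.get("format") == "pickle":
            findings.append((comp["name"], "pickle format: code execution risk"))
    return findings

# Illustrative BOM: one compliant component, one that should be flagged.
bom = {"components": [
    {"type": "machine-learning-model", "name": "classifier",
     "hashes": [{"alg": "SHA-256", "content": "abc123"}],
     "properties": [{"name": "format", "value": "safetensors"}]},
    {"type": "machine-learning-model", "name": "legacy-model",
     "properties": [{"name": "format", "value": "pickle"}]},
]}
```

Checks like this can gate CI/CD pipelines so that unverified or pickle-based models never reach deployment.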

Regulatory Drivers

Regulatory requirements increasingly mandate AI transparency, driving AI-BOM adoption.

EU AI Act (2024):

The EU AI Act requires documentation for AI systems:

  • Technical documentation throughout lifecycle
  • Record-keeping of training, validation, and testing data
  • Transparency obligations for certain AI systems
  • Risk assessment and mitigation documentation

AI-BOM provides the foundation for meeting these requirements.

Specific Requirements:

Requirement                         AI-BOM Contribution
Training data documentation         Dataset provenance
Model capabilities and limitations  Model card elements
Risk assessment                     Safety considerations
Version control                     Lineage tracking
Modification tracking               Change history

Other Regulatory Drivers:

  • FDA: Medical device cybersecurity submission requirements, including SBOMs (enforceable since October 2023), and AI/ML-enabled device guidance
  • NIST AI RMF: Risk management framework
  • China AI regulations: Algorithm registration requirements

Compliance Integration:

AI-BOM supports compliance workflows:

AI-BOM Generation
    ├──► EU AI Act compliance reports
    ├──► FDA submission documentation
    ├──► Risk assessment evidence
    └──► Audit trail maintenance

Tooling Landscape

AI-BOM tooling is less mature than traditional SBOM tooling but developing rapidly.

Generation Tools:

Tool              Type                  AI-BOM Support
Syft              Scanner               Limited (model file detection)
CycloneDX CLI     Generator             ML-BOM profile support
SPDX Tools        Generator             SPDX 3.0 AI profile
MLflow            ML Platform           Model tracking, exportable
Hugging Face Hub  Model Registry        Model card export
DVC               Data Version Control  Dataset tracking

Current Gaps:

  • Automated model scanning: Limited tools for comprehensive model analysis
  • Dataset lineage: Manual tracking required in most cases
  • Training pipeline capture: Integration with ML platforms needed
  • Cross-platform correlation: Linking models across registries

Emerging Solutions:

  • ONNX Model Hub: Standardized model format with metadata
  • Hugging Face Transformers: Model card generation utilities
  • MLflow Model Registry: Model lifecycle management
  • Weights & Biases: Experiment and artifact tracking

Recommendations

For ML Engineers:

  1. Start documenting now. Even without formal AI-BOM tools, begin capturing model provenance, dataset sources, and training configurations.

  2. Use SafeTensors. Avoid pickle-based formats where possible. SafeTensors eliminates code execution risks during model loading.

  3. Track model lineage. Document base models, fine-tuning relationships, and version history.

  4. Integrate with ML platforms. Use MLflow, Weights & Biases, or similar tools to capture training metadata systematically.

For Security Practitioners:

  1. Extend SBOM programs to AI. Include AI-BOM in software inventory requirements. Don't treat models as black boxes.

  2. Scan model artifacts. Implement pickle scanning and format verification for downloaded models.

  3. Verify model provenance. Check signatures and attestations for models from external sources.

  4. Assess dataset risks. Understand what data trained models you depend on.

For Compliance Teams:

  1. Map regulatory requirements. Identify which AI transparency requirements apply to your organization and products.

  2. Establish AI-BOM standards. Define organizational requirements for AI documentation before regulations force rushed adoption.

  3. Build audit trails. Capture AI-BOM information contemporaneously; reconstructing history is difficult.

  4. Prepare for EU AI Act. If selling AI systems in Europe, AI-BOM capability will be essential for compliance.

AI-BOM extends supply chain transparency to a new category of artifacts with distinct security and compliance considerations. While tooling and standards are still maturing, the regulatory trajectory is clear: organizations building and deploying AI systems will need to document what's in them. Starting now—even with imperfect tools—builds the institutional knowledge and practices that will be required when regulations fully take effect.