DEV Community

Jeremiah Say
Building a SHA-256 audit trail for emission factor provenance in WordPress

Carbon accounting tools have a trust problem that most developers don't think about until it's too late.

A sustainability officer files a CSRD disclosure. An auditor asks: "Where does this emission factor come from?" The tool says 2.57 kg CO₂e/litre for diesel. The auditor wants the source workbook, the specific tab, the row number, and the date someone verified the cell against the published file. Without that, the number is a claim, not evidence.

Most carbon platforms can't answer that question. At GreenCalculus.com, I designed the data layer so every single emission factor carries a _provenance block that answers it — and the automated verification pipeline hashes every source workbook against the issuing body's published file to prove the data hasn't drifted.

Here's how the whole thing works.


The problem with emission factor databases

Emission factors go stale in predictable ways:

  1. Silent copy errors — a value gets transcribed from a source document with a rounding error or wrong unit and propagates for years
  2. Version drift — DEFRA, IEA, EPA publish annual updates; any hardcoded factor is wrong by the following June
  3. GWP basis ambiguity — the same gas has different CO₂e values depending on whether you use AR5 or AR6 GWP-100; most databases don't disclose which one
  4. No chain of custody — you know the number but not who extracted it, from which cell, on which date

All four of these are audit failures, not just data quality issues. Under CSRD and ISO 14064, an auditor can request the full methodology chain for any material emission category. If your platform can't produce it, you have a liability.


The _provenance block

Every fuel, grid, and refrigerant factor in the GreenCalculus Master Brain carries a _provenance subkey. Here's a real example — diesel:

```php
'diesel' => [
    'factor'   => 2.57082,
    'unit'     => 'kg CO2e per litre',
    'scope'    => 1,
    'source'   => 'DEFRA_2025',
    'gwp'      => 'AR5_100',
    'name'     => 'Diesel (average biofuel blend)',

    '_provenance' => [
        'source'           => 'UK Government GHG Conversion Factors for Company Reporting, Year 2025 Version 1, DESNZ, June 2025',
        'tab'              => 'Fuels',
        'row_label'        => 'Diesel (average biofuel blend)',
        'row_number'       => 87,
        'column'           => 'D — kg CO2e (per litre)',
        'gwp_basis'        => 'IPCC AR5 GWP-100',
        'sourced_by'       => 'Jeremiah Say',
        'sourced_on'       => '2026-05-06',
        'verified'         => true,
        'verified_against' => 'ghg-conversion-factors-2025-full-set.xlsx',
        'verified_date'    => '2026-05-08',
    ],
],
```

Every field is mandatory on Tier-1 sources (DEFRA, IEA, EPA). The row_number and column fields mean anyone with the source workbook can find the exact cell in under 30 seconds. sourced_by and sourced_on create a named, timestamped human record. verified_against points to the specific file that was hashed.
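That field-completeness rule can be enforced mechanically at intake. Here's a minimal sketch of such a check — the function name and the returned shape are illustrative, not the production GreenCalculus code:

```php
/**
 * Illustrative sketch: list any mandatory _provenance fields missing
 * from a Tier-1 factor entry. Not the production implementation.
 *
 * @param  array $entry  A single factor entry from the Master Brain.
 * @return array         Names of missing fields; empty if complete.
 */
function gc_check_tier1_provenance( array $entry ): array {
    $required = [
        'source', 'tab', 'row_label', 'row_number', 'column',
        'gwp_basis', 'sourced_by', 'sourced_on',
        'verified', 'verified_against', 'verified_date',
    ];

    $prov    = $entry['_provenance'] ?? [];
    $missing = [];

    foreach ( $required as $field ) {
        if ( ! array_key_exists( $field, $prov ) ) {
            $missing[] = $field;
        }
    }

    return $missing; // empty array means the block is complete
}
```

A check like this slots naturally in front of the audit command shown later: an entry with any missing field never reaches the Master Brain in the first place.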


SHA-256 hashing the source workbooks

The verified_against field names a specific file. That file needs to be verifiably the same as what the issuing body published. This is where SHA-256 comes in.

When a new DEFRA or IEA release drops, the workflow is:

  1. Download the official workbook
  2. Hash it immediately before touching it
  3. Store the hash in the Master Brain metadata
  4. On every deploy, re-hash the stored file and compare

Here's the PHP that handles steps 2–4:

```php
/**
 * Hash a source workbook and return the hex digest.
 * Store this at intake. Re-run on deploy to verify no drift.
 *
 * @param  string $filepath  Absolute path to the workbook file.
 * @return string|false      SHA-256 hex digest, or false on failure.
 */
function gc_hash_source_workbook( string $filepath ) {
    if ( ! file_exists( $filepath ) || ! is_readable( $filepath ) ) {
        error_log( sprintf( '[GC Provenance] File not found or unreadable: %s', $filepath ) );
        return false;
    }

    return hash_file( 'sha256', $filepath );
}

/**
 * Verify a stored workbook hash against the live file on disk.
 * Run this in a deploy hook or WP-CLI command before publishing
 * any Master Brain update.
 *
 * @param  string $filepath       Absolute path to the workbook file.
 * @param  string $expected_hash  SHA-256 hex digest stored at intake.
 * @return bool                   True if file matches stored hash.
 */
function gc_verify_source_workbook( string $filepath, string $expected_hash ): bool {
    $live_hash = gc_hash_source_workbook( $filepath );

    if ( $live_hash === false ) {
        return false;
    }

    $match = hash_equals( $expected_hash, $live_hash );

    if ( ! $match ) {
        error_log( sprintf(
            '[GC Provenance] HASH MISMATCH: %s — expected %s, got %s',
            basename( $filepath ),
            $expected_hash,
            $live_hash
        ) );
    }

    return $match;
}
```

hash_equals() is used instead of === to prevent timing attacks — a minor point for offline files, but a correct habit.
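A usage sketch of the two functions together — the path here is illustrative, not the production layout:

```php
// Intake: hash the freshly downloaded workbook and persist the digest
// in the Master Brain metadata (path shown is an assumption).
$path = WP_CONTENT_DIR . '/gc-source-workbooks/ghg-conversion-factors-2025-full-set.xlsx';
$sha  = gc_hash_source_workbook( $path );

// Deploy: re-hash the same file and compare against the stored digest.
if ( false === $sha || ! gc_verify_source_workbook( $path, $sha ) ) {
    // The workbook has drifted since intake; abort the deploy.
}
```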

The stored hashes live in the Master Brain metadata block:

```php
'meta' => [
    'version'    => '2025.6',
    'updated'    => '2026-05-09',
    'gwp_basis'  => 'AR5_100',

    'source_workbook_hashes' => [
        'DEFRA_2025' => [
            'filename' => 'ghg-conversion-factors-2025-full-set.xlsx',
            'sha256'   => 'a3f8c2...', // full 64-char hex digest
            'verified' => '2026-05-08',
        ],
        'EPA_2024' => [
            'filename' => 'egrid2022_summary_tables.pdf',
            'sha256'   => 'b91d47...',
            'verified' => '2026-05-09',
        ],
    ],
],
```

Running the verification check on deploy

The verification runs as a WP-CLI command that blocks the deploy if any hash fails:

```php
/**
 * WP-CLI command: verify all source workbook hashes.
 * Run before publishing any Master Brain version bump.
 *
 * Usage: wp gc verify-provenance
 */
if ( defined( 'WP_CLI' ) && WP_CLI ) {

    WP_CLI::add_command( 'gc verify-provenance', function () {

        $brain  = gc_get_master_brain_data();
        $hashes = $brain['meta']['source_workbook_hashes'] ?? [];
        $dir    = WP_CONTENT_DIR . '/gc-source-workbooks/';
        $passed = 0;
        $failed = 0;

        foreach ( $hashes as $source_id => $entry ) {
            $filepath = $dir . $entry['filename'];
            $ok       = gc_verify_source_workbook( $filepath, $entry['sha256'] );

            if ( $ok ) {
                WP_CLI::success( "{$source_id}: {$entry['filename']} — hash verified ✓" );
                $passed++;
            } else {
                WP_CLI::error( "{$source_id}: {$entry['filename']} — HASH MISMATCH ✗", false );
                $failed++;
            }
        }

        WP_CLI::line( "---" );
        WP_CLI::line( "Passed: {$passed} | Failed: {$failed}" );

        if ( $failed > 0 ) {
            WP_CLI::error( "Provenance check failed. Do not publish this Master Brain version." );
            exit( 1 );
        }

        WP_CLI::success( "All source workbooks verified. Safe to publish." );
    } );
}
```

exit(1) means a CI pipeline that runs wp gc verify-provenance will block the deploy automatically if any workbook has drifted.
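In a deploy script, that gate can lean entirely on the shell's exit-status handling. This fragment is illustrative — the script layout and remaining steps are assumptions, not the GreenCalculus pipeline:

```shell
#!/bin/sh
set -e  # any non-zero exit aborts the script, and with it the deploy

# Gate: exits 1 on any hash mismatch, stopping everything below.
wp gc verify-provenance

# ...remaining deploy steps run only if every workbook verified...
```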


The _provenance enforcement check

Hash verification proves the workbook is intact. But it doesn't prove every factor in the Master Brain was actually extracted from it. For that, there's a second check that scans every entry and flags any that are missing _provenance:

```php
/**
 * WP-CLI command: audit all Master Brain factor entries for _provenance.
 *
 * Usage: wp gc audit-provenance
 */
if ( defined( 'WP_CLI' ) && WP_CLI ) {

    WP_CLI::add_command( 'gc audit-provenance', function () {

        $brain    = gc_get_master_brain_data();
        $sections = [ 'fuels', 'fuels_wtt', 'grid', 'refrigerants', 'agriculture' ];
        $missing  = [];

        foreach ( $sections as $section ) {
            $entries = $brain[ $section ] ?? [];

            foreach ( $entries as $key => $entry ) {
                // Skip non-array entries and nested subgroups
                if ( ! is_array( $entry ) || ! isset( $entry['factor'] ) ) {
                    continue;
                }

                if ( empty( $entry['_provenance'] ) ) {
                    $missing[] = "{$section}.{$key}";
                } elseif ( empty( $entry['_provenance']['verified'] ) || $entry['_provenance']['verified'] !== true ) {
                    $missing[] = "{$section}.{$key} (unverified)";
                }
            }
        }

        if ( empty( $missing ) ) {
            WP_CLI::success( "All factors carry verified _provenance blocks." );
            return;
        }

        WP_CLI::warning( count( $missing ) . " factor(s) missing or unverified provenance:" );
        foreach ( $missing as $key ) {
            WP_CLI::line( "  — {$key}" );
        }

        exit( 1 );
    } );
}
```

How this surfaces in the schema graph

The verification pipeline isn't just internal tooling. It's represented in the site's Schema.org entity graph as a SoftwareApplication node:

```php
function gc_get_engineering_node(): array {
    return [
        '@type'           => 'SoftwareApplication',
        '@id'             => home_url( '/#engineering' ),
        'name'            => 'GreenCalculus Engineering',
        'description'     => 'Automated verification pipeline. Performs SHA-256 hash verification of source workbooks against the issuing body\'s published files, enforces cell-by-cell provenance attribution on every emission factor, and cross-checks methodology prose against the data layer before publication.',
        'featureList'     => 'SHA-256 hash verification of source workbooks; cell-by-cell _provenance enforcement; dual GWP basis disclosure (AR5/AR6); 30-day Tier-1 update SLA; prose-vs-data cross-validation; public changelog audit graph',
        'softwareVersion' => '2026.05',
    ];
}
```

Every calculator and methodology page then references this entity as reviewedBy:

```php
$schema['reviewedBy'] = [ '@id' => home_url( '/#engineering' ) ];
```

This is Schema.org's explicit signal for "who fact-checked this content." It's backed by real machinery — not a fabricated review board. An auditor who follows the @id to /governance/ will find the full description of what the pipeline actually does. The schema points to real code, not a badge.


The discipline this enforces

The whole system only works if it's enforced at authorship time, not patched in later. The rule at GreenCalculus is:

No factor enters the Master Brain without a complete _provenance block. verified: true means a human has opened the source workbook, found the cell, confirmed the value, and signed it with their name and date.

When the DEFRA 2025 release dropped the UK grid factor from 0.207 to 0.177 — a 15% change — the update produced a new _provenance block with a new sourced_on date, a new verified_date, and a public changelog entry stating the old and new values explicitly. The SHA-256 hash of the new workbook replaced the old one in the metadata. The CI pipeline re-ran the hash check on the next deploy and passed.
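A changelog entry for that update might be shaped like this — the array layout and the per-kWh units are assumptions for illustration; the actual GreenCalculus changelog format isn't shown in this post:

```php
// Hypothetical changelog entry shape, not the production format.
'2026-05-08' => [
    'key'       => 'grid.uk',
    'change'    => 'DEFRA 2025 annual update',
    'old_value' => 0.207, // kg CO2e per kWh (units assumed), AR5 GWP-100
    'new_value' => 0.177, // kg CO2e per kWh (units assumed), AR5 GWP-100
    'source'    => 'ghg-conversion-factors-2025-full-set.xlsx',
],
```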

That's the full chain: source document → hash → cell reference → human verification → named attribution → public changelog. Every link is traceable. Every link is automated where possible and named where it requires human judgment.

For CSRD filers who need to point an auditor at a methodology chain, that's not optional infrastructure. It's the product.


The full GreenCalculus Master Brain data layer, verification pipeline, and public changelog are at greencalculus.com.
