validation-suite

Usage

To use the MRMU Validation Suite in a project, import the functions you need:

from validation_suite import (
    compare_dataframes,
    vif_check,
    psi_check,
    csi_check,
    backtesting_report,
    ValidationResult,
)

1. compare_dataframes()

Compares two DataFrames (e.g. model owner output vs validator dry run).

Returns a ValidationResult with a status, a list of warnings, and a per-column summary_df reporting max_abs_diff, mean_abs_diff and rows_exceeding_tol.

Example

from validation_suite import compare_dataframes

result_strict = compare_dataframes(
    df_reference=df_model_owner,
    df_challenger=df_dry_run,
    key_cols=["obligor_id"],
    numeric_tol=1e-6,
    label_reference="Model Owner v2.3",
    label_challenger="Validator Dry Run",
)

print(f"Status   : {result_strict.status}")
print(f"Warnings : {result_strict.warnings if result_strict.warnings else 'None'}")
print()
result_strict.summary_df.sort_values("max_abs_diff", ascending=False)

results in:

column    max_abs_diff  mean_abs_diff  rows_exceeding_tol
pd_score  0.007647      0.000059       8.000000
index     0.000000      0.000000       0.000000
lgd       0.000000      0.000000       0.000000
ead       0.000000      0.000000       0.000000

Columns with rows_exceeding_tol > 0 require additional analysis. A small numerical difference (e.g., max_abs_diff < 0.005) may still be acceptable if it can be explained by rounding differences between platforms.
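One way to do that additional analysis is to reproduce the comparison by hand and inspect the offending rows. A hand-rolled sketch of the same idea with toy stand-ins for the model-owner and dry-run frames (compare_dataframes's internals may differ):

```python
import pandas as pd

# Toy stand-ins for df_model_owner / df_dry_run
df_ref = pd.DataFrame({"obligor_id": [101, 102, 103], "pd_score": [0.010, 0.020, 0.030]})
df_chl = pd.DataFrame({"obligor_id": [101, 102, 103], "pd_score": [0.010, 0.028, 0.030]})

# Join on the key and compute per-row absolute differences
merged = df_ref.merge(df_chl, on="obligor_id", suffixes=("_ref", "_chl"))
merged["abs_diff"] = (merged["pd_score_ref"] - merged["pd_score_chl"]).abs()

# Rows driving rows_exceeding_tol at numeric_tol=1e-6
offenders = merged[merged["abs_diff"] > 1e-6]
print(offenders[["obligor_id", "pd_score_ref", "pd_score_chl", "abs_diff"]])
```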

2. vif_check()

Computes the variance inflation factor (VIF) for each feature. Standard SR 11-7 review threshold = 10; values > 10 indicate problematic multicollinearity.
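For reference, VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing feature j on the remaining features. A self-contained sketch of that computation (not vif_check's actual implementation):

```python
import numpy as np
import pandas as pd

def vif_manual(df: pd.DataFrame) -> pd.Series:
    """VIF_j = 1 / (1 - R^2_j), regressing column j on the others (with intercept)."""
    out = {}
    for col in df.columns:
        y = df[col].to_numpy()
        X = df.drop(columns=col).to_numpy()
        X = np.column_stack([np.ones(len(X)), X])  # add intercept column
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        resid = y - X @ beta
        r2 = 1.0 - resid.var() / y.var()
        out[col] = 1.0 / (1.0 - r2)
    return pd.Series(out, name="VIF")

# "c" is a near-copy of "a", so both get huge VIFs; independent "b" stays near 1
rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
x2 = rng.normal(size=500)
df = pd.DataFrame({"a": x1, "b": x2, "c": x1 + 0.01 * rng.normal(size=500)})
print(vif_manual(df))
```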

Example

import numpy as np
import pandas as pd

from validation_suite import vif_check

rng2 = np.random.default_rng(7)
n_obs = 1000

df_features_clean = pd.DataFrame(
    {
        "ltv": rng2.uniform(0.30, 0.95, n_obs),
        "dti": rng2.uniform(0.10, 0.60, n_obs),
        "credit_age_yrs": rng2.uniform(1, 30, n_obs),
        "utilization_rate": rng2.uniform(0, 1, n_obs),
        "num_delinquencies": rng2.poisson(lam=0.4, size=n_obs).astype(float),
    }
)

FEATURE_COLS = list(df_features_clean.columns)

result_vif_clean = vif_check(
    df_features_clean, feature_cols=FEATURE_COLS, threshold=10.0
)

Adding a derived feature that is almost perfectly correlated with an existing one (e.g. dti_annualized = dti * 12) produces output like:

#  Feature            VIF              Flag
1  dti                20543658.327543  HIGH
5  dti_annualized     20543522.726991  HIGH
0  ltv                6.611564         MODERATE
2  credit_age_yrs     3.856893         OK
3  utilization_rate   3.500687         OK
4  num_delinquencies  1.398969         OK

3. psi_check()

Population Stability Index (PSI) — detects distributional shift between a reference (development/prior) sample and a monitoring (current) sample.

SR 11-7 context: Monitors whether the input population the model is applied to has drifted from the population it was developed on. Elevated PSI (> 0.25) is a trigger for model recalibration review.
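The PSI formula itself is simple: for each bin, (p_actual - p_expected) * ln(p_actual / p_expected), summed over the bins. A self-contained sketch using quantile bins from the reference sample (psi_check's internals may differ):

```python
import numpy as np

def psi_manual(expected, actual, bins=10):
    """PSI = sum((p_act - p_exp) * ln(p_act / p_exp)) over quantile bins of `expected`."""
    expected = np.asarray(expected, dtype=float)
    actual = np.asarray(actual, dtype=float)
    # Bin edges from the reference sample's quantiles (deciles by default)
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf
    p_exp = np.histogram(expected, bins=edges)[0] / len(expected)
    p_act = np.histogram(actual, bins=edges)[0] / len(actual)
    eps = 1e-6  # guard against empty bins
    p_exp = np.clip(p_exp, eps, None)
    p_act = np.clip(p_act, eps, None)
    return float(np.sum((p_act - p_exp) * np.log(p_act / p_exp)))

rng = np.random.default_rng(1)
same = psi_manual(rng.normal(size=5000), rng.normal(size=5000))
shifted = psi_manual(rng.normal(size=5000), rng.normal(0.5, 1, size=5000))
print(f"identical populations : PSI = {same:.4f}")
print(f"mean shift of 0.5 sd  : PSI = {shifted:.4f}")
```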

Parameters

expected : reference (development/prior) DataFrame
actual : monitoring (current) DataFrame
columns : list of columns to evaluate
bins : number of quantile bins (10 = deciles)

Returns

ValidationResult

Example

from validation_suite import psi_check

PSI_COLUMNS = ["pd_score", "ltv", "dti"]

result_psi = psi_check(
    expected=df_dev,
    actual=df_monitor,
    columns=PSI_COLUMNS,
    bins=10,  # deciles
)

print(f"Overall status: {result_psi.status}")
print()
if result_psi.warnings:
    print("Warnings:")
    for w in result_psi.warnings:
        print(f"  ⚠  {w}")
    print()

result_psi.summary_df.sort_values("psi_value", ascending=False)

results in:

Overall status: FAIL

Warnings

PSI Summary Table

#  Variable  PSI Value  Flag      # Bins Shifted  N Expected  N Actual
0  pd_score  7.399166   UNSTABLE  10              5000        5000
2  dti       0.032043   STABLE    1               5000        5000
1  ltv       0.002046   STABLE    0               5000        5000

4. csi_check()

Characteristic Stability Index (CSI) — measures distributional shift of the model score within each segment of a categorical variable.

SR 11-7 context: Where PSI measures overall population shift, CSI pinpoints which segment is driving instability. Commonly applied to risk grades, product lines, or geographic segments in PD/LGD/EAD models.
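Conceptually, CSI is just the score's PSI computed segment by segment. A self-contained sketch under that assumption (the column names score and grade are illustrative; csi_check's internals may differ):

```python
import numpy as np
import pandas as pd

def psi(expected, actual, bins=10):
    """PSI over quantile bins taken from the reference sample."""
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf
    p_e = np.histogram(expected, bins=edges)[0] / len(expected)
    p_a = np.histogram(actual, bins=edges)[0] / len(actual)
    p_e, p_a = np.clip(p_e, 1e-6, None), np.clip(p_a, 1e-6, None)
    return float(np.sum((p_a - p_e) * np.log(p_a / p_e)))

def csi_manual(df_exp, df_act, score_col, segment_col, bins=10):
    """One PSI value per segment of the score distribution."""
    return {
        seg: psi(df_exp.loc[df_exp[segment_col] == seg, score_col],
                 df_act.loc[df_act[segment_col] == seg, score_col], bins)
        for seg in df_exp[segment_col].unique()
    }

rng = np.random.default_rng(3)
df_exp = pd.DataFrame({"score": rng.normal(size=1000),
                       "grade": ["AAA"] * 500 + ["CCC"] * 500})
df_act = df_exp.copy()
df_act.loc[df_act["grade"] == "CCC", "score"] += 1.0  # drift only in CCC

out = csi_manual(df_exp, df_act, "score", "grade")
for seg, val in out.items():
    print(f"{seg}: CSI = {val:.4f}")
```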

Parameters

expected : reference (development) DataFrame
actual : monitoring (current) DataFrame
score_col : name of the model score column
segment_col : name of the categorical segment column
bins : number of quantile bins for the score

Returns

ValidationResult

Example

from validation_suite import csi_check

result_csi = csi_check(
    expected=df_dev_seg,
    actual=df_monitor_seg,
    score_col=SCORE_COL,
    segment_col=SEGMENT_COL,
    bins=10,
)

print(f"Overall status: {result_csi.status}")
print()
if result_csi.warnings:
    print("Warnings:")
    for w in result_csi.warnings:
        print(f"  ⚠  {w}")
    print()

result_csi.summary_df.sort_values("csi_value", ascending=False)

results in:

Overall status: FAIL

Warnings

CSI Summary Table

#  Segment  CSI Value  Flag      N Expected  N Actual  Segment % Expected  Segment % Actual
2  CCC      5.139687   UNSTABLE  800         800       0.333333            0.333333
1  BBB      0.841326   UNSTABLE  800         800       0.333333            0.333333
0  AAA      0.015184   STABLE    800         800       0.333333            0.333333

5. backtesting_report()

Comprehensive backtesting report for binary classification models (PD, fraud, prepayment) aligned with SR 11-7 performance requirements.

Metrics computed

Discrimination: AUC-ROC, Gini (2 * AUC - 1), KS statistic.

Calibration: Brier score, Hosmer-Lemeshow test (hl_p_value) and a binomial test on the portfolio default rate (binomial_p_value).
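As a cross-check on the discrimination numbers, AUC, Gini (2 * AUC - 1) and KS can be computed directly from the ranked scores. A self-contained sketch (O(n^2) pairwise AUC, fine for small samples; the library may use a faster estimator):

```python
import numpy as np

def discrimination_metrics(y_true, y_score):
    """AUC via pairwise comparison (Mann-Whitney), Gini = 2*AUC - 1,
    KS = max vertical gap between the score CDFs of defaulters and non-defaulters."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score, dtype=float)
    pos = y_score[y_true == 1]   # defaulters
    neg = y_score[y_true == 0]   # non-defaulters
    # P(score_pos > score_neg), ties counted as half
    gt = (pos[:, None] > neg[None, :]).mean()
    eq = (pos[:, None] == neg[None, :]).mean()
    auc = gt + 0.5 * eq
    gini = 2.0 * auc - 1.0
    # KS over all observed score thresholds
    thresholds = np.unique(y_score)
    cdf_pos = (pos[:, None] <= thresholds[None, :]).mean(axis=0)
    cdf_neg = (neg[:, None] <= thresholds[None, :]).mean(axis=0)
    ks = float(np.max(np.abs(cdf_pos - cdf_neg)))
    return float(auc), float(gini), ks

# Perfectly separating scores give AUC = Gini = KS = 1
auc, gini, ks = discrimination_metrics([0, 0, 1, 1], [0.1, 0.2, 0.8, 0.9])
print(auc, gini, ks)
```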

Parameters

y_true : observed binary outcomes
y_score : model scores/probabilities
model_name : label used in the report
n_hl_groups : number of Hosmer-Lemeshow groups
n_cal_buckets : number of calibration buckets
alpha : significance level for the statistical tests

Returns

ValidationResult

Example

from validation_suite import backtesting_report

result_bt_good = backtesting_report(
    y_true=y_true,
    y_score=y_score_good,
    model_name="PD-RETAIL-V2.3",
    n_hl_groups=10,
    n_cal_buckets=10,
    alpha=0.05,
)

print(f"Status        : {result_bt_good.status}")
print(f"Traffic light : {result_bt_good.details['traffic_light']}")
print()
if result_bt_good.warnings:
    for w in result_bt_good.warnings:
        print(f"  ⚠  {w}")
else:
    print("  No warnings.")
print()

cols = [
    "model_name",
    "n_observations",
    "n_defaults",
    "auc_roc",
    "gini",
    "ks_statistic",
    "brier_score",
    "hl_p_value",
    "binomial_p_value",
    "traffic_light",
    "status",
]
result_bt_good.summary_df[cols]

results in:

Status : PASS

Traffic light: GREEN

model_name      n_observations  n_defaults  auc_roc   gini      ks_statistic  brier_score  hl_p_value  binomial_p_value  traffic_light  status
PD-RETAIL-V2.3  3000            280         0.693480  0.386959  0.306828      0.081629     0.587496    0.317816         GREEN          PASS