Replication — Eval

Goal

Report the test-set performance of the four trained models on the held-out final 12 months of district-month data. The numbers below are loaded from the metrics CSV produced by evaluator.py and match those reported in the Results chapter.

Performance table

Table 1: Test-set performance across the four models on the held-out final 12 months.

	Model	RMSE (cases)	MAE (cases)	R²	MAPE (%)
0	Random Forest	131.43	72.10	0.8563	41.48
1	XGBoost	129.07	69.50	0.8614	44.64
2	MLP	128.07	71.45	0.8635	43.50
3	Weighted Ensemble	126.20	68.07	0.8675	41.01
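
For readers who want to check how the four columns relate to raw predictions, the following is a minimal sketch of the standard definitions. It is not the code in evaluator.py, which is not shown here. The function name `regression_metrics` is hypothetical, and skipping zero-count rows in MAPE is an assumption about how division by zero is avoided.

```python
import numpy as np


def regression_metrics(y_true, y_pred):
    """Return RMSE, MAE, R² and MAPE (%) for two equal-length arrays of case counts."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_pred - y_true

    rmse = float(np.sqrt(np.mean(err ** 2)))
    mae = float(np.mean(np.abs(err)))

    # R² = 1 - (residual sum of squares / total sum of squares)
    ss_res = float(np.sum(err ** 2))
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))
    r2 = 1.0 - ss_res / ss_tot

    # Assumption: rows with zero observed cases are excluded from MAPE,
    # because the percentage error is undefined there.
    nonzero = y_true != 0
    mape = float(np.mean(np.abs(err[nonzero] / y_true[nonzero])) * 100.0)

    return {"RMSE (cases)": rmse, "MAE (cases)": mae, "R²": r2, "MAPE (%)": mape}


# Toy data, not taken from the study.
observed = [100, 200, 300, 400]
predicted = [110, 190, 310, 390]
print(regression_metrics(observed, predicted))
```

Because MAPE divides each error by the observed count, months with few cases contribute disproportionately, which is why a model can rank differently on MAPE than on RMSE or MAE.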

Feature–target correlations

Table 2: Pearson correlation of each engineered feature with log(1 + cases), district-month panel.

	Feature	Pearson_r_with_log_cases
0	cases_lag1	0.731
1	temp_mean_lag1	0.599
2	temp_roll3	0.559
3	precip_lag1	0.313
4	monsoon	0.245
5	precip_roll3	0.208
6	humidity_lag1	0.180
7	flood_lag1	0.122
8	month_sin	-0.133
9	month_cos	-0.226
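
A table of this shape can be produced in outline as follows. This is a sketch on synthetic data, not the pipeline's own code: the function name `feature_target_correlations` and the toy panel are illustrative, and only the feature names and the output column name are taken from Table 2.

```python
import numpy as np
import pandas as pd


def feature_target_correlations(panel, feature_cols, target_col="cases"):
    """Pearson r of each feature with log(1 + target), sorted from high to low."""
    log_target = np.log1p(panel[target_col])
    # DataFrame.corrwith uses Pearson correlation by default.
    r = panel[feature_cols].corrwith(log_target)
    return (
        r.sort_values(ascending=False)
        .rename("Pearson_r_with_log_cases")
        .rename_axis("Feature")
        .reset_index()
    )


# Synthetic district-month panel (values are made up for illustration).
rng = np.random.default_rng(0)
n = 200
temp = rng.normal(25.0, 3.0, n)
panel = pd.DataFrame(
    {
        "temp_mean_lag1": temp,
        "precip_lag1": rng.gamma(2.0, 50.0, n),
        # Cases depend on temperature plus noise; precipitation is unrelated here.
        "cases": np.round(np.exp(0.2 * temp + rng.normal(0.0, 0.5, n))),
    }
)

table = feature_target_correlations(panel, ["temp_mean_lag1", "precip_lag1"])
print(table.round(3))
```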

Interpretation

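The Weighted Ensemble is the headline winner: R² = 0.8675, RMSE = 126.20, MAE = 68.07, MAPE = 41.01 %. It beats every individual model on every metric, although the margins are small (1.87 cases of RMSE and 0.0040 R² over the best individual model). This is consistent with the rationale for combining structurally distinct learners whose errors are partially uncorrelated.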
All four architectures fall within 0.012 R² of each other (Random Forest 0.8563, XGBoost 0.8614, MLP 0.8635, Ensemble 0.8675; a spread of 0.0112). So narrow a spread across three different model families suggests that the predictive ceiling is set by the climate × autoregressive signal in the data, not by the choice of model family.
XGBoost is the operational pick when only a single model can be deployed. It has the lowest MAE among the individual models (69.50), interpretable gain-based feature importance, graceful handling of missing values, and reproducibility under a fixed random seed. Its MAPE (44.64 %) is the highest of the four, so proportional accuracy is its weak point.
Random Forest has the lowest MAPE among the individual models (41.48 %), which suggests marginally better proportional accuracy on low-incidence districts. It remains the conservative baseline.
MLP edges out the tree-based individual models on overall R² (0.8635) and RMSE (128.07), but at the cost of higher absolute-error variance and stricter input-scaling requirements.
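
The point about partially uncorrelated errors can be illustrated with a small simulation. This is a sketch under stated assumptions, not the pipeline's ensemble: the weights actually used are not given on this page, so inverse-RMSE weights are assumed here, and the three "models" are synthetic predictions that share one error component and each add an independent one.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root-mean-square error between two equal-length arrays."""
    return float(np.sqrt(np.mean((np.asarray(y_pred) - np.asarray(y_true)) ** 2)))


def weighted_ensemble(predictions, weights):
    """Weighted average of model predictions; weights are normalised to sum to 1."""
    preds = np.asarray(predictions, dtype=float)  # shape (n_models, n_samples)
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    return w @ preds


rng = np.random.default_rng(42)
y = rng.gamma(shape=2.0, scale=100.0, size=5000)  # synthetic case counts

# Each synthetic model = truth + error shared by all models + its own error.
shared = rng.normal(0.0, 30.0, y.size)
preds = [y + shared + rng.normal(0.0, 40.0, y.size) for _ in range(3)]

individual = [rmse(y, p) for p in preds]
# Assumption: weights proportional to 1 / RMSE. In practice these would be
# fitted on a validation split, not on the test data being scored.
weights = [1.0 / r for r in individual]
ensemble_rmse = rmse(y, weighted_ensemble(preds, weights))

print("individual RMSE:", [round(r, 2) for r in individual])
print("ensemble RMSE:  ", round(ensemble_rmse, 2))
```

Averaging shrinks only the independent part of the error. The shared component passes through unchanged, which is one way to read the small gap between the ensemble and the best individual model in Table 1.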

Where the trained pickles live

The pipeline persists trained models to code/output/models/ next to the source code (see Train). The file layout is:

code/output/models/
├── rf_model.pkl
├── xgb_model.pkl
├── mlp_model.pkl     (or lstm_model.h5 if TensorFlow is installed)
├── scaler_X.pkl
└── scaler_y.pkl

The directory is gitignored by default — model artifacts are regenerable from source and shouldn’t be committed.

To regenerate from scratch, see Train.
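
Once the directory exists, the artifacts can be read back for inspection. The sketch below assumes the .pkl files were written with Python's standard pickle module, which this page does not confirm; if the pipeline used joblib, load them with joblib instead. The helper name `load_artifacts` is hypothetical.

```python
import pickle
from pathlib import Path

MODEL_DIR = Path("code/output/models")  # directory from the layout above


def load_artifacts(model_dir=MODEL_DIR):
    """Load every .pkl file in model_dir into a dict keyed by file stem.

    Assumption: files were written with pickle.dump. Only unpickle files
    you produced yourself, since unpickling can execute arbitrary code.
    """
    artifacts = {}
    for path in sorted(Path(model_dir).glob("*.pkl")):
        with path.open("rb") as fh:
            artifacts[path.stem] = pickle.load(fh)
    return artifacts


if MODEL_DIR.is_dir():
    # Prints e.g. the stems of rf_model.pkl, scaler_X.pkl, ... if present.
    print(sorted(load_artifacts()))
```

Loading the scalers together with the models matters because predictions made on scaled inputs are only meaningful when the same fitted scalers are applied at inference time.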

Tip

The figures supporting this evaluation — model comparison, correlation heatmap, ecological-zone distribution, national trend — are on the Figures page.
