Replication — Train

Goal

Reproduce the three predictive models (Random Forest, XGBoost, and a Multi-Layer Perceptron that stands in for the LSTM when TensorFlow is unavailable) on the engineered feature panel built on the Data page. XGBoost is selected as the operational model (see Eval for the rationale).

Reference pipeline

The canonical training pipeline lives in this repo at code/:

code/
├── README.md          quick-start + reproducibility checklist
├── requirements.txt   pinned dependencies (pandas 2.2.3, xgboost 2.1.3, …)
├── main.py            full pipeline (RF + XGB + LSTM/MLP + ensemble)
├── paper_main.py      paper pipeline (3 models, 4 publication figures)
└── src/
    ├── loader.py      load CSVs from <repo>/data/ (env-var overridable)
    ├── processor.py   district normalisation + BS→AD calendar conversion
    ├── engineer.py    lag features, rolling means, log-transform target
    ├── models.py      RF / XGBoost / MLP / LSTM training + persistence
    ├── evaluator.py   RMSE / MAE / R² / MAPE
    ├── visualizer.py  figure rendering
    └── reporter.py    text-format analysis report

To re-run end-to-end from the repo root:

pip install -r code/requirements.txt
python code/paper_main.py     # 3 models + paper figures, ~2 min on CPU
# or, for the full ensemble pipeline:
python code/main.py

Both scripts are deterministic given the random seed (42, hard-coded in code/src/models.py) and land within rounding of the headline numbers rendered on this site.

Live training — quick smoke test

The cell below trains Random Forest and XGBoost on the engineered panel inline, then reports RMSE, MAE, R² and MAPE for both models on the held-out final 12 months. This is the same code path the production scripts use; the production scripts additionally persist artifacts.

import sys, os, warnings
warnings.filterwarnings("ignore")
sys.path.insert(0, os.path.abspath("../code"))

import numpy as np
import pandas as pd

from src.loader    import DataLoader
from src.processor import DataProcessor
from src.engineer  import FeatureEngineer, FEATURE_COLS, TARGET_COL, TARGET_RAW
from src.models    import TyphoidModels
from src.evaluator import Evaluator

# 1. Reproduce the engineered panel (same pipeline as the Data page).
raw    = DataLoader().load_all()
proc   = DataProcessor()
merged = proc.merge_datasets(
    proc.process_typhoid(raw["typhoid"]),
    proc.process_climate(raw["climate"]),
    proc.process_flood(raw["flood"]),
)
df = FeatureEngineer().create_features(merged).sort_values("date_dt").reset_index(drop=True)

# 2. Strictly chronological train/test split — last 12 months held out.
unique_dates = df["date_dt"].sort_values().unique()
split_date   = unique_dates[len(unique_dates) - 12]
train, test  = df[df["date_dt"] < split_date], df[df["date_dt"] >= split_date]

X_train, y_train = train[FEATURE_COLS], train[TARGET_COL]
X_test,  y_test  = test[FEATURE_COLS],  test[TARGET_COL]
y_test_raw       = test[TARGET_RAW].values

print(f"Train: {len(train):>5} rows   ({train['date_ad'].min()} → {train['date_ad'].max()})")
print(f"Test : {len(test):>5} rows   ({test['date_ad'].min()} → {test['date_ad'].max()})")

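# 3. Train RF + XGB. (MLP omitted in the smoke test to keep the build fast;
#    main.py runs all three plus the ensemble.)
models = TyphoidModels()
models.train_rf(X_train, y_train)
# Rows are date-sorted, so the last 10% of the training rows (the most recent
# ones) are passed to train_xgb as a separate validation slice.
val_idx = int(len(X_train) * 0.9)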
models.train_xgb(
    X_train.iloc[:val_idx], y_train.iloc[:val_idx],
    X_train.iloc[val_idx:], y_train.iloc[val_idx:],
)

# 4. Evaluate on the held-out 12 months.
ev = Evaluator()
results = [
    ev.evaluate(y_test_raw, models.predict_rf(X_test),  "Random Forest"),
    ev.evaluate(y_test_raw, models.predict_xgb(X_test), "XGBoost"),
]
pd.DataFrame(results)
Train:  6411 rows   (2015-05 → 2022-11)
Test :   916 rows   (2022-12 → 2023-12)
           Model    RMSE    MAE      R2  MAPE (%)
0  Random Forest  131.38  72.09  0.8564     41.52
1        XGBoost  128.31  68.97  0.8630     44.45
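
Step 2 of the cell cuts the panel on a date rather than on a row index, so every district's row for a given month lands on the same side of the split. The idea can be sketched on a toy panel in plain Python. The function name chronological_split, the district labels and the dates are hypothetical and only illustrate the technique; the repo itself does this with pandas on the date_dt column.

```python
from datetime import date


def chronological_split(rows, n_holdout_periods, key=lambda r: r["date"]):
    """Hold out the last n distinct dates; everything earlier is training data."""
    unique_dates = sorted({key(r) for r in rows})
    if not 0 < n_holdout_periods < len(unique_dates):
        raise ValueError("n_holdout_periods must be in [1, number of distinct dates - 1]")

    # Same indexing idea as unique_dates[len(unique_dates) - 12] in the cell.
    split_date = unique_dates[-n_holdout_periods]
    train = [r for r in rows if key(r) < split_date]
    test = [r for r in rows if key(r) >= split_date]
    return train, test, split_date


# Toy panel: 2 hypothetical districts x 6 months (illustrative only).
panel = [
    {"district": d, "date": date(2023, m, 1)}
    for m in range(1, 7)
    for d in ("A", "B")
]

train, test, split_date = chronological_split(panel, n_holdout_periods=2)
print(split_date)              # 2023-05-01
print(len(train), len(test))   # 8 4
```

Because the cut is made on distinct dates, no month is shared between train and test, which is what keeps future observations out of the training set.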

The Random Forest and XGBoost rows in the smoke-test output should land within ±0.005 R² of the headline numbers in the performance metrics table.
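
The cell passes y_test_raw to the Evaluator, so the four metrics are reported on the raw case-count scale. The Evaluator's implementation is not shown on this page; the following is a dependency-free sketch of the textbook definitions. The function name regression_metrics and the decision to skip zero-valued targets in MAPE are assumptions, not the repo's confirmed behaviour.

```python
import math


def regression_metrics(y_true, y_pred):
    """Textbook RMSE, MAE, R2 and MAPE for two equal-length sequences."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and the same length")

    n = len(y_true)
    errors = [p - t for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)                 # residual sum of squares
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares

    # Assumption: rows with a zero target are skipped to avoid division by zero.
    pct_errors = [abs(e / t) for e, t in zip(errors, y_true) if t != 0]

    return {
        "RMSE": math.sqrt(ss_res / n),
        "MAE": sum(abs(e) for e in errors) / n,
        "R2": 1.0 - ss_res / ss_tot if ss_tot > 0 else float("nan"),
        "MAPE (%)": 100.0 * sum(pct_errors) / len(pct_errors) if pct_errors else float("nan"),
    }


# Illustrative numbers only, not data from the study.
metrics = regression_metrics([10, 20, 30, 40], [12, 18, 33, 37])
print({name: round(value, 4) for name, value in metrics.items()})
# {'RMSE': 2.5495, 'MAE': 2.5, 'R2': 0.948, 'MAPE (%)': 11.875}
```

MAPE weights every row by its relative error, so low-count rows can dominate it. That is one way a model can have the better RMSE and R² yet the worse MAPE, as XGBoost does in the output above.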

Hyperparameter choices

All hyperparameter choices are pinned in code/src/models.py:

Model                    Key settings
Random Forest            n_estimators=300, max_depth=10, min_samples_leaf=5, random_state=42, n_jobs=-1
XGBoost                  n_estimators=300, learning_rate=0.05, max_depth=4, subsample=0.8, colsample_bytree=0.8, reg_alpha=0.1, reg_lambda=1.0, random_state=42
MLP fallback             hidden_layer_sizes=(128, 64, 32), activation="relu", solver="adam", learning_rate_init=0.001, max_iter=500, early_stopping=True, random_state=42
LSTM (TF, if available)  Two stacked layers (64 → 32), dropout 0.2, Adam @ 1e-3, MSE loss, 50 epochs, batch 32, n_steps=3
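
For readers who want to rebuild the three tabular models outside the repo, the settings in the table can be written as keyword dictionaries. This is a sketch under assumptions: the page does not name the estimator classes, so RandomForestRegressor, XGBRegressor and MLPRegressor are assumed here because the task is regression, and the names RF_PARAMS, XGB_PARAMS, MLP_PARAMS and build_models are hypothetical. How code/src/models.py actually wires these up may differ.

```python
# Values copied from the hyperparameter table above.
RF_PARAMS = dict(
    n_estimators=300, max_depth=10, min_samples_leaf=5,
    random_state=42, n_jobs=-1,
)
XGB_PARAMS = dict(
    n_estimators=300, learning_rate=0.05, max_depth=4,
    subsample=0.8, colsample_bytree=0.8,
    reg_alpha=0.1, reg_lambda=1.0, random_state=42,
)
MLP_PARAMS = dict(
    hidden_layer_sizes=(128, 64, 32), activation="relu", solver="adam",
    learning_rate_init=0.001, max_iter=500, early_stopping=True,
    random_state=42,
)


def build_models():
    """Return untrained estimators configured from the dictionaries above.

    Imports are deferred so the dictionaries stay usable even when
    scikit-learn or xgboost is not installed. The estimator classes are an
    assumption (see the lead-in), not confirmed by the page.
    """
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.neural_network import MLPRegressor
    from xgboost import XGBRegressor

    return {
        "rf": RandomForestRegressor(**RF_PARAMS),
        "xgb": XGBRegressor(**XGB_PARAMS),
        "mlp": MLPRegressor(**MLP_PARAMS),
    }


# All three configurations share the same seed.
print({p["random_state"] for p in (RF_PARAMS, XGB_PARAMS, MLP_PARAMS)})  # {42}
```

With the pinned versions installed, build_models() returns a dictionary of unfitted estimators that can be trained with their usual fit(X, y) methods.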

Trained model artifacts

Running the full pipeline persists artifacts to code/output/models/:

File           Size     Description
rf_model.pkl   ~16 MB   Random Forest (300 trees, depth 10)
xgb_model.pkl  ~500 KB  XGBoost (300 estimators, depth 4, L1=0.1, L2=1.0)
mlp_model.pkl  ~290 KB  MLP fallback (128 → 64 → 32, Adam, early stopping)
scaler_X.pkl   ~1 KB    feature standardisation
scaler_y.pkl   ~0.5 KB  target standardisation

These are not committed — they are large, regenerable from source, and *.pkl / *.h5 are blocked in .gitignore. Re-train in place using either the smoke-test cell above or the production scripts.

Warning

Pickled models are not portable across scikit-learn / XGBoost versions. The numbers on this site were generated with the versions pinned in code/requirements.txt. After upgrading either library, retrain and re-save the models rather than loading the old pickles.
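
One way to catch a version mismatch before it causes a confusing failure is to store the library versions next to the model and check them on load. The sketch below is not part of the repo: dump_with_versions, load_checked and the two-record file layout are hypothetical, and the artifacts listed above are plain pickles without such a header.

```python
import os
import pickle
import tempfile
from importlib import metadata

TRACKED_PACKAGES = ("scikit-learn", "xgboost")


def installed_versions(packages=TRACKED_PACKAGES):
    """Map each package name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def dump_with_versions(model, path):
    """Write a version header first, then the model, as two pickle records."""
    with open(path, "wb") as fh:
        pickle.dump(installed_versions(), fh)
        pickle.dump(model, fh)


def load_checked(path):
    """Read the header, compare versions, and only then unpickle the model."""
    with open(path, "rb") as fh:
        saved = pickle.load(fh)
        current = installed_versions(tuple(saved))
        if current != saved:
            raise RuntimeError(
                f"library versions changed (saved {saved}, installed {current}); "
                "retrain instead of loading this file"
            )
        return pickle.load(fh)


# Round trip with a stand-in object in place of a fitted model.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo_model.pkl")
    dump_with_versions({"weights": [0.1, 0.2]}, demo_path)
    print(load_checked(demo_path))  # {'weights': [0.1, 0.2]}
```

Writing the header as its own record means the version check runs before any model bytes are unpickled. As with any pickle, only load files from a source you trust.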

Note

Continue to Eval to view the held-out test metrics.