Reproduce the three predictive models — Random Forest, XGBoost, and the Multi-Layer Perceptron (fallback for LSTM) — on the engineered feature panel built on the Data page. XGBoost is selected as the operational model (see Eval for the rationale).
Reference pipeline
The canonical training pipeline lives in this repo at code/:
```text
code/
├── README.md          quick-start + reproducibility checklist
├── requirements.txt   pinned dependencies (pandas 2.2.3, xgboost 2.1.3, …)
├── main.py            full pipeline (RF + XGB + LSTM/MLP + ensemble)
├── paper_main.py      paper pipeline (3 models, 4 publication figures)
└── src/
    ├── loader.py      load CSVs from <repo>/data/ (env-var overridable)
    ├── processor.py   district normalisation + BS→AD calendar conversion
    ├── engineer.py    lag features, rolling means, log-transform target
    ├── models.py      RF / XGBoost / MLP / LSTM training + persistence
    ├── evaluator.py   RMSE / MAE / R² / MAPE
    ├── visualizer.py  figure rendering
    └── reporter.py    text-format analysis report
```
To re-run end-to-end from the repo root:
```bash
pip install -r code/requirements.txt
python code/paper_main.py   # 3 models + paper figures, ~2 min on CPU
# or, for the full ensemble pipeline:
python code/main.py
```
Both scripts are deterministic given the random seed (42, hard-coded in code/src/models.py) and land within rounding of the headline numbers rendered on this site.
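As a toy illustration of what "deterministic given the random seed" means, the sketch below draws the kind of bootstrap sample a Random Forest uses for each tree. It is not the code in code/src/models.py; `bootstrap_indices` is a hypothetical helper and only the standard library is used. The same seed always yields the same sample, which is why a fixed seed lets repeated runs reproduce the same fitted models, provided the data and library versions are unchanged.

```python
import random

SEED = 42  # the value this page says is hard-coded in code/src/models.py


def bootstrap_indices(n_rows: int, seed: int) -> list[int]:
    """Draw n_rows row indices with replacement from a seeded generator."""
    rng = random.Random(seed)  # private generator, global state untouched
    return [rng.randrange(n_rows) for _ in range(n_rows)]


first = bootstrap_indices(1000, SEED)
second = bootstrap_indices(1000, SEED)
other = bootstrap_indices(1000, SEED + 1)

print(first == second)  # same seed, same sample
print(first == other)   # a different seed gives a different sample
```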
Live training — quick smoke test
The cell below trains Random Forest and XGBoost on the engineered panel inline, then reports both models' R² on the held-out final 12 months. It follows the same code path as the production scripts, which additionally train the MLP (and, in main.py, the ensemble) and persist artifacts.
```python
import os
import sys
import warnings

warnings.filterwarnings("ignore")
sys.path.insert(0, os.path.abspath("../code"))

import pandas as pd

from src.loader import DataLoader
from src.processor import DataProcessor
from src.engineer import FeatureEngineer, FEATURE_COLS, TARGET_COL, TARGET_RAW
from src.models import TyphoidModels
from src.evaluator import Evaluator

# 1. Reproduce the engineered panel (same pipeline as the Data page).
raw = DataLoader().load_all()
proc = DataProcessor()
merged = proc.merge_datasets(
    proc.process_typhoid(raw["typhoid"]),
    proc.process_climate(raw["climate"]),
    proc.process_flood(raw["flood"]),
)
df = (
    FeatureEngineer()
    .create_features(merged)
    .sort_values("date_dt")
    .reset_index(drop=True)
)

# 2. Strictly chronological train/test split: last 12 months held out.
unique_dates = df["date_dt"].sort_values().unique()
split_date = unique_dates[-12]  # first month of the held-out window
train, test = df[df["date_dt"] < split_date], df[df["date_dt"] >= split_date]
X_train, y_train = train[FEATURE_COLS], train[TARGET_COL]
X_test, y_test = test[FEATURE_COLS], test[TARGET_COL]
y_test_raw = test[TARGET_RAW].values
print(f"Train: {len(train):>5} rows ({train['date_ad'].min()} → {train['date_ad'].max()})")
print(f"Test : {len(test):>5} rows ({test['date_ad'].min()} → {test['date_ad'].max()})")

# 3. Train RF + XGB. (MLP omitted in the smoke test to keep the build fast;
#    main.py runs all three plus the ensemble.)
models = TyphoidModels()
models.train_rf(X_train, y_train)
# The last 10% of the chronologically sorted training rows act as the
# XGBoost validation set.
val_idx = int(len(X_train) * 0.9)
models.train_xgb(
    X_train.iloc[:val_idx], y_train.iloc[:val_idx],
    X_train.iloc[val_idx:], y_train.iloc[val_idx:],
)

# 4. Evaluate on the held-out 12 months. Scoring uses the raw target, so
#    predict_* are assumed to return predictions on the raw scale.
ev = Evaluator()
results = [
    ev.evaluate(y_test_raw, models.predict_rf(X_test), "Random Forest"),
    ev.evaluate(y_test_raw, models.predict_xgb(X_test), "XGBoost"),
]
pd.DataFrame(results)
```
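The `Evaluator` class itself is not shown on this page. As a rough guide to what a held-out score such as R² measures, here is a minimal pure-Python sketch of four common regression metrics (RMSE, MAE, R², MAPE). The function name `regression_metrics`, the rule of skipping zero actuals in MAPE, and the sample numbers are assumptions for illustration only. They are not the repository's implementation and not real results.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return RMSE, MAE, R2 and MAPE (%) for two equal-length lists."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and the same length")
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mean_true = sum(y_true) / n

    ss_res = sum(e * e for e in errors)                  # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)   # total sum of squares

    # Assumption: zero actuals are skipped in MAPE to avoid dividing by zero.
    pct = [abs(e / t) for e, t in zip(errors, y_true) if t != 0]

    return {
        "rmse": math.sqrt(ss_res / n),
        "mae": sum(abs(e) for e in errors) / n,
        "r2": 1.0 - ss_res / ss_tot if ss_tot > 0 else float("nan"),
        "mape": 100.0 * sum(pct) / len(pct) if pct else float("nan"),
    }


# Made-up monthly case counts, for illustration only.
actual = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 18.0, 33.0, 37.0]
metrics = regression_metrics(actual, predicted)
print(metrics)
```

Because R² compares the residual error with the variance of the held-out target, small differences in the fitted model shift it only slightly, which is why a tolerance check on R² is a reasonable smoke test.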
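The two rows returned by the smoke-test cell should land within ±0.005 R² of the headline numbers in the performance metrics table (../tables/performance_metrics.csv).
Hyperparameter choices
All hyperparameter choices are pinned in code/src/models.py:

| Model | Key settings |
|---|---|
| Random Forest | `n_estimators=300, max_depth=10, min_samples_leaf=5, random_state=42, n_jobs=-1` |
| XGBoost | `n_estimators=300, learning_rate=0.05, max_depth=4, subsample=0.8, colsample_bytree=0.8, reg_alpha=0.1, reg_lambda=1.0, random_state=42` |
| MLP fallback | `hidden_layer_sizes=(128, 64, 32), activation="relu", solver="adam", learning_rate_init=0.001, max_iter=500, early_stopping=True, random_state=42` |
| LSTM (TF, if available) | Two stacked layers (64 → 32), dropout 0.2, Adam @ 1e-3, MSE loss, 50 epochs, batch 32, `n_steps=3` |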
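The settings pinned in code/src/models.py map directly onto estimator constructors. The sketch below shows one way they might be wired up. It assumes the regressor variants (`RandomForestRegressor`, `XGBRegressor`, `MLPRegressor`), which seems likely for a case-count target but is not confirmed on this page, and `build_models` is a hypothetical helper rather than a function from the repository.

```python
# Settings as documented for code/src/models.py.
RF_PARAMS = dict(
    n_estimators=300, max_depth=10, min_samples_leaf=5,
    random_state=42, n_jobs=-1,
)
XGB_PARAMS = dict(
    n_estimators=300, learning_rate=0.05, max_depth=4,
    subsample=0.8, colsample_bytree=0.8,
    reg_alpha=0.1, reg_lambda=1.0, random_state=42,
)
MLP_PARAMS = dict(
    hidden_layer_sizes=(128, 64, 32), activation="relu", solver="adam",
    learning_rate_init=0.001, max_iter=500, early_stopping=True,
    random_state=42,
)


def build_models():
    """Hypothetical helper: instantiate (but do not fit) the three estimators."""
    # Imported lazily so the settings above can be inspected without
    # scikit-learn or xgboost installed.
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.neural_network import MLPRegressor
    from xgboost import XGBRegressor

    return {
        "Random Forest": RandomForestRegressor(**RF_PARAMS),
        "XGBoost": XGBRegressor(**XGB_PARAMS),
        "MLP fallback": MLPRegressor(**MLP_PARAMS),
    }
```

Keeping the settings in plain dictionaries makes it easy to diff them against the pinned values when a result drifts.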
Trained model artifacts
Running the full pipeline persists artifacts to code/output/models/:
| File | Size | Description |
|---|---:|---|
| `rf_model.pkl` | ~16 MB | Random Forest (300 trees, depth 10) |
| `xgb_model.pkl` | ~500 KB | XGBoost (300 estimators, depth 4, L1=0.1, L2=1.0) |
| `mlp_model.pkl` | ~290 KB | MLP fallback (128 → 64 → 32, Adam, early stopping) |
| `scaler_X.pkl` | ~1 KB | feature standardisation |
| `scaler_y.pkl` | ~0.5 KB | target standardisation |
These are not committed — they are large, regenerable from source, and *.pkl / *.h5 are blocked in .gitignore. Re-train in place using either the smoke-test cell above or the production scripts.
Warning
Pickled models are not portable across scikit-learn / XGBoost versions. The numbers on this site were generated with the versions pinned in code/requirements.txt. Re-pickle when upgrading.
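One way to guard against loading a stale pickle is to record the library versions next to the artifacts and compare them before unpickling. The sketch below is an assumption about how that could look, not something the repository is known to do: `installed_versions`, `write_manifest`, `check_manifest` and the `versions.json` file name are all hypothetical. It relies only on the standard library's `importlib.metadata`.

```python
import json
from importlib import metadata
from pathlib import Path

PACKAGES = ("scikit-learn", "xgboost")  # libraries whose versions matter here


def installed_versions(packages=PACKAGES):
    """Map each package name to its installed version, or None if missing."""
    found = {}
    for name in packages:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found


def write_manifest(model_dir, packages=PACKAGES):
    """Save the current versions beside the pickles (hypothetical file name)."""
    path = Path(model_dir) / "versions.json"
    path.write_text(json.dumps(installed_versions(packages), indent=2))
    return path


def check_manifest(model_dir, packages=PACKAGES):
    """Return {package: (saved, installed)} for every version that differs."""
    saved = json.loads((Path(model_dir) / "versions.json").read_text())
    current = installed_versions(packages)
    return {
        name: (saved.get(name), current[name])
        for name in packages
        if saved.get(name) != current[name]
    }
```

A training script could call `write_manifest` after saving the models, and a loader could retrain instead of unpickling whenever `check_manifest` returns a non-empty dictionary.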
Note
Continue to Eval to view the held-out test metrics.