Replication — Data

Goal

Load the three raw source datasets exactly as the production pipeline does, harmonise them into a single monthly district-level panel, and engineer the lag + rolling features used by every downstream model.

By the end of this page you should have a df DataFrame matching the feature schema in Methodology and a deterministic train/test split keyed on date.

Source datasets

Four CSVs (two typhoid exports, one climate file, and one flood file) live in data/ at the repo root. They are committed because they are small (< 400 KB each) and aggregated (no PII):

File                      Description
data/typhoid_data_1.csv   District-month case counts, primary export (BS calendar)
data/typhoid_data_2.csv   Aggregated secondary source for trailing months
data/climate_data.csv     ERA5-Land + CHIRPS climate variables, district-month
data/flood_data.csv       DRR-portal flood event records (daily, aggregated to monthly)

The full preprocessing pipeline lives at code/ in the same repo (see code/README.md).

Load + peek

Code
import pandas as pd

typhoid = pd.read_csv("../data/typhoid_data_1.csv")
print(f"rows: {len(typhoid):,}   cols: {typhoid.shape[1]}")
typhoid.head()
rows: 9,791   cols: 3
   Unnamed: 0  Unnamed: 1            Unnamed: 2
0  periodname  organisationunitname  Outpatient Morbidity-Communicable-Water/Food B...
1  Asar 2072   503 PYUTHAN           701
2  Asar 2072   109 PANCHTHAR         147
3  Asar 2072   112 MORANG            1,356
4  Asar 2072   708 KAILALI           1,554
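
The Unnamed: column labels and the header-like row 0 in the output above suggest that the real header sits on the second line of the file, and values such as 1,356 carry thousands separators. The following is a minimal sketch of a cleaner read under that assumption. The inline sample and the short column name cases_raw are hypothetical stand-ins (the real third column name is truncated in the output above), and this is not necessarily how DataLoader handles the file.

```python
import io

import pandas as pd

# Hypothetical inline sample that mimics the layout seen above:
# an empty first line of separators, then the real header, then data.
sample = io.StringIO(
    ",,\n"
    "periodname,organisationunitname,cases_raw\n"
    "Asar 2072,503 PYUTHAN,701\n"
    'Asar 2072,112 MORANG,"1,356"\n'
)

# header=1 takes the second line as column names and discards the first;
# thousands="," lets pandas parse "1,356" as the integer 1356.
typhoid_clean = pd.read_csv(sample, header=1, thousands=",")

print(typhoid_clean.dtypes)
print(typhoid_clean)
```

With the real file, the same two keyword arguments would be passed to pd.read_csv("../data/typhoid_data_1.csv", ...), provided the layout assumption holds.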
Code
climate = pd.read_csv("../data/climate_data.csv")
print(f"rows: {len(climate):,}   cols: {climate.shape[1]}")
climate.head()
rows: 8,316   cols: 7
   period  district       Min temperature (ERA5-Land)  Max air temperature (ERA5-Land)  Air temperature (ERA5-Land)  Precipitation (CHIRPS)  Relative humidity (ERA5-Land)
0  201501  101 Taplejung  -31.8                        16.0                             -6.2                         11.18                   47.9
1  201502  101 Taplejung  -29.3                        18.8                             -5.1                         21.34                   57.6
2  201503  101 Taplejung  -22.3                        20.4                             -1.9                         84.47                   63.1
3  201504  101 Taplejung  -15.0                        22.3                             0.5                          46.29                   72.6
4  201505  101 Taplejung  -11.8                        23.0                             4.7                          146.13                  80.2
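
The raw files label districts as a numeric code plus a name ("503 PYUTHAN" in the typhoid export, "101 Taplejung" in the climate file), while the final panel uses upper-case names and YYYY-MM dates. A minimal sketch of one way to bring the keys into that shape follows. The helper names split_district and yyyymm_to_month are hypothetical and the logic is an assumption, not a copy of DataProcessor.

```python
import pandas as pd


def split_district(label: pd.Series) -> pd.DataFrame:
    """Split labels like '101 Taplejung' into a code and an upper-case name."""
    parts = label.str.strip().str.extract(r"^(\d+)\s+(.+)$")
    parts.columns = ["district_code", "district"]
    parts["district"] = parts["district"].str.upper()
    return parts


def yyyymm_to_month(period: pd.Series) -> pd.Series:
    """Convert integer periods like 201501 into 'YYYY-MM' strings."""
    return pd.to_datetime(period.astype(str), format="%Y%m").dt.strftime("%Y-%m")


# Two rows copied from the climate preview above.
climate_keys = pd.DataFrame(
    {"period": [201501, 201502], "district": ["101 Taplejung", "101 Taplejung"]}
)

harmonised = split_district(climate_keys["district"])
harmonised["date_ad"] = yyyymm_to_month(climate_keys["period"])
print(harmonised)
```

Upper-casing the name makes "Taplejung" and "TAPLEJUNG" compare equal at merge time; joining on the numeric code instead would be an equally reasonable choice.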
Code
flood = pd.read_csv("../data/flood_data.csv")
print(f"rows: {len(flood):,}   cols: {flood.shape[1]}")
flood.head()
rows: 874   cols: 3
   period    district       flood incidence
0  20230615  101 Taplejung  1
1  20230617  101 Taplejung  1
2  20230620  101 Taplejung  1
3  20190617  101 Taplejung  1
4  20230813  101 Taplejung  1
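
The flood records are daily, whereas the panel is monthly. The sketch below shows one plausible daily-to-monthly aggregation using the five rows previewed above. Summing the flood incidence column per district-month and naming the result flood_events (the column that appears in the merged panel) are assumptions about what process_flood does, not a confirmed description.

```python
import pandas as pd

# The five daily records shown in the preview above.
flood_daily = pd.DataFrame(
    {
        "period": [20230615, 20230617, 20230620, 20190617, 20230813],
        "district": ["101 Taplejung"] * 5,
        "flood incidence": [1, 1, 1, 1, 1],
    }
)

# Parse YYYYMMDD, then keep only the year-month part as the grouping key.
day = pd.to_datetime(flood_daily["period"].astype(str), format="%Y%m%d")

flood_monthly = (
    flood_daily.assign(date_ad=day.dt.strftime("%Y-%m"))
    .groupby(["district", "date_ad"], as_index=False)["flood incidence"]
    .sum()
    .rename(columns={"flood incidence": "flood_events"})
)
print(flood_monthly)
```

District-months with no record are absent from this result, so after a left merge onto the case panel they would presumably need filling with 0.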

Full pipeline — load → preprocess → merge → feature-engineer

The snippet below reproduces the full pipeline in four steps (load, preprocess, merge, feature-engineer) using the in-repo code/src/ modules. These are the same DataLoader, DataProcessor, and FeatureEngineer classes that code/main.py and code/paper_main.py call.

Code
import sys, os
sys.path.insert(0, os.path.abspath("../code"))

from src.loader    import DataLoader
from src.processor import DataProcessor
from src.engineer  import FeatureEngineer, FEATURE_COLS, TARGET_COL, TARGET_RAW

raw    = DataLoader().load_all()
proc   = DataProcessor()
merged = proc.merge_datasets(
    proc.process_typhoid(raw["typhoid"]),
    proc.process_climate(raw["climate"]),
    proc.process_flood(raw["flood"]),
)

df     = FeatureEngineer().create_features(merged)
print(f"Merged panel  : {merged.shape[0]:,} rows × {merged.shape[1]} columns")
print(f"After lag-drop: {df.shape[0]:,} rows × {df.shape[1]} columns")
print(f"Date range    : {df['date_ad'].min()} → {df['date_ad'].max()}")
print(f"Districts     : {df['district'].nunique()}")
print(f"Features      : {FEATURE_COLS}")
df.head()
Merged panel  : 7,558 rows × 9 columns
After lag-drop: 7,327 rows × 22 columns
Date range    : 2015-05 → 2023-12
Districts     : 77
Features      : ['precip_lag1', 'temp_mean_lag1', 'humidity_lag1', 'flood_lag1', 'precip_roll3', 'temp_roll3', 'monsoon', 'month_sin', 'month_cos', 'cases_lag1']
district date_ad temp_min temp_max temp_mean precip humidity cases flood_events date_dt ... humidity_lag1 flood_lag1 cases_lag1 precip_roll3 temp_roll3 month monsoon month_sin month_cos log_cases
0 ACHHAM 2015-05 1.3 34.7 21.1 46.17 47.5 447.0 0.0 2015-05-01 ... 63.6 0.0 235.0 81.830000 9.666667 5 0 5.000000e-01 -8.660254e-01 6.104793
1 ACHHAM 2015-06 8.0 35.6 21.9 184.49 62.4 497.0 0.0 2015-06-01 ... 47.5 0.0 447.0 82.713333 14.400000 6 1 1.224647e-16 -1.000000e+00 6.210600
2 ACHHAM 2015-07 10.8 28.4 20.5 315.79 87.4 452.0 0.0 2015-07-01 ... 62.4 0.0 497.0 127.120000 18.400000 7 1 -5.000000e-01 -8.660254e-01 6.115892
3 ACHHAM 2015-08 10.9 28.4 20.1 258.30 88.6 865.0 0.0 2015-08-01 ... 87.4 0.0 452.0 182.150000 21.166667 8 1 -8.660254e-01 -5.000000e-01 6.763885
4 ACHHAM 2015-09 5.7 28.3 19.4 108.11 82.8 459.0 0.0 2015-09-01 ... 88.6 0.0 865.0 252.860000 20.833333 9 1 -1.000000e+00 -1.836970e-16 6.131226

5 rows × 22 columns
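
The goal of this page mentions a deterministic train/test split keyed on date. A minimal sketch of such a split is given here, assuming a single cutoff month. The helper split_by_date and the cutoff value are hypothetical; the cutoff the project actually uses is not stated on this page.

```python
import pandas as pd


def split_by_date(panel: pd.DataFrame, cutoff: str):
    """Rows strictly before `cutoff` go to train; rows on or after it go to test."""
    is_train = panel["date_dt"] < pd.Timestamp(cutoff)
    return panel[is_train].copy(), panel[~is_train].copy()


# Small stand-in for df, using the first ACHHAM rows shown above.
demo = pd.DataFrame(
    {
        "district": ["ACHHAM"] * 4,
        "date_dt": pd.to_datetime(
            ["2015-05-01", "2015-06-01", "2015-07-01", "2015-08-01"]
        ),
        "cases": [447.0, 497.0, 452.0, 865.0],
    }
)

# Hypothetical cutoff, chosen only so that both halves are non-empty here.
train, test = split_by_date(demo, "2015-07-01")
print(len(train), len(test))
```

Because the rule depends only on the date column, the same rows land in the same half on every run, and no test month precedes a training month.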

Pre-computed descriptive statistics

For quick reference, the descriptive statistics that appear in Methodology — computed once after the merge + feature engineering pipeline above — live in tables/descriptive_statistics.csv:

Table 1: Descriptive statistics of the engineered feature panel.
precip_lag1 temp_mean_lag1 humidity_lag1 flood_lag1 precip_roll3 temp_roll3 monsoon month_sin month_cos cases_lag1 cases log_cases
count 7327.00 7327.00 7327.00 7327.00 7327.00 7327.00 7327.00 7327.00 7327.00 7327.00 7327.00 7327.00
mean 138.30 14.61 74.49 0.12 138.31 14.54 0.38 -0.11 0.03 465.88 462.53 5.63
std 155.96 9.26 12.70 0.50 128.77 8.95 0.48 0.68 0.73 449.10 447.90 1.17
min 0.00 -18.00 25.70 0.00 1.47 -17.17 0.00 -1.00 -1.00 1.00 1.00 0.69
25% 12.54 9.30 65.80 0.00 25.19 9.33 0.00 -0.87 -0.87 148.00 146.00 4.99
50% 69.86 15.80 75.20 0.00 91.00 16.03 0.00 -0.00 0.00 328.00 325.00 5.79
75% 234.42 21.40 85.90 0.00 230.84 21.13 1.00 0.50 0.87 639.00 633.00 6.45
max 1100.22 31.50 95.70 8.00 618.66 29.77 1.00 1.00 1.00 2771.75 2771.75 7.93
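
The row labels of Table 1 (count, mean, std, min, quartiles, max) match the layout produced by pandas DataFrame.describe(). The sketch below assumes the table was generated that way and rounded to two decimals, which is not confirmed by the source. It uses five ACHHAM case counts from the preview above as a stand-in for the full panel.

```python
import numpy as np
import pandas as pd

# Stand-in for the engineered panel: five ACHHAM case counts from above.
demo = pd.DataFrame({"cases": [447.0, 497.0, 452.0, 865.0, 459.0]})
demo["log_cases"] = np.log1p(demo["cases"])

# describe() yields count, mean, std, min, 25%, 50%, 75%, max per column.
stats = demo.describe().round(2)
print(stats)
```

On the real panel the equivalent call would be made on df restricted to FEATURE_COLS plus cases and log_cases.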

Feature engineering reference

The code/src/engineer.py module constructs:

  • One-month lags: precip_lag1, temp_mean_lag1, humidity_lag1, flood_lag1, cases_lag1
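  • Three-month rolling means: precip_roll3, temp_roll3. Each is computed on the series shifted by one month, so the window covers the three preceding months and never the current one (this prevents leakage)
  • Cyclical month encodings: month_sin, month_cos, the sine and cosine of 2π · month / 12, which place December and January next to each other on the unit circle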
  • Monsoon indicator: monsoon (1 if month ∈ {6, 7, 8, 9})
  • Log-transformed target: log_cases = log(1 + cases) to stabilise variance over the heavy right tail typical of communicable-disease surveillance
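
The list above can be sketched in a few lines of pandas. This is an illustration under stated assumptions, not the contents of code/src/engineer.py: the function name sketch_features is hypothetical, and the input column names are taken from the merged panel preview earlier on this page. The demo rows are the five ACHHAM months shown there.

```python
import numpy as np
import pandas as pd

LAGS = {
    "precip": "precip_lag1",
    "temp_mean": "temp_mean_lag1",
    "humidity": "humidity_lag1",
    "flood_events": "flood_lag1",
    "cases": "cases_lag1",
}
ROLLS = {"precip": "precip_roll3", "temp_mean": "temp_roll3"}


def sketch_features(panel: pd.DataFrame) -> pd.DataFrame:
    out = panel.sort_values(["district", "date_dt"]).reset_index(drop=True)
    # One-month lags, computed within each district so values never
    # spill over from one district into the next.
    for src, dst in LAGS.items():
        out[dst] = out.groupby("district")[src].shift(1)
    # Rolling mean of the three months *before* the current one:
    # shift first, then roll, so the current month is excluded.
    for src, dst in ROLLS.items():
        out[dst] = out.groupby("district")[src].transform(
            lambda s: s.shift(1).rolling(3).mean()
        )
    month = out["date_dt"].dt.month
    out["monsoon"] = month.isin([6, 7, 8, 9]).astype(int)
    out["month_sin"] = np.sin(2 * np.pi * month / 12)
    out["month_cos"] = np.cos(2 * np.pi * month / 12)
    out["log_cases"] = np.log1p(out["cases"])
    # Rows without a full lag/rolling history are dropped.
    return out.dropna().reset_index(drop=True)


demo = pd.DataFrame(
    {
        "district": ["ACHHAM"] * 5,
        "date_dt": pd.to_datetime(
            ["2015-05-01", "2015-06-01", "2015-07-01", "2015-08-01", "2015-09-01"]
        ),
        "precip": [46.17, 184.49, 315.79, 258.30, 108.11],
        "temp_mean": [21.1, 21.9, 20.5, 20.1, 19.4],
        "humidity": [47.5, 62.4, 87.4, 88.6, 82.8],
        "flood_events": [0.0, 0.0, 0.0, 0.0, 0.0],
        "cases": [447.0, 497.0, 452.0, 865.0, 459.0],
    }
)

features = sketch_features(demo)
print(features[["date_dt", "precip_roll3", "temp_roll3", "month_sin", "log_cases"]])
```

With only five months of input, the first three rows lack a complete three-month history and are dropped, leaving 2015-08 and 2015-09. Their precip_roll3 values (182.15 and 252.86) agree with the corresponding ACHHAM rows in the pipeline output above, which is consistent with the shift-then-roll reading of the rolling features.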
Note

Continue to Train to fit the three predictive models on the engineered panel built here.