Replication — Data

Goal

Load the three raw sources (typhoid case counts, climate, and flood records) exactly as the production pipeline does, harmonise them into a single monthly district-level panel, and engineer the lag and rolling features used by every downstream model.

By the end of this page you should have a df DataFrame matching the feature schema in Methodology and a deterministic train/test split keyed on date.

Source datasets

Four CSVs (two for typhoid, one each for climate and flood) live in data/ at the repo root. They are committed because they are small (< 400 KB each) and aggregated (no PII):

File	Description
`data/typhoid_data_1.csv`	District-month case counts, primary export; periods use the Bikram Sambat (BS) calendar
`data/typhoid_data_2.csv`	Aggregated secondary source for trailing months
`data/climate_data.csv`	ERA5-Land + CHIRPS climate variables, district-month
`data/flood_data.csv`	DRR-portal flood event records (daily, aggregated to monthly)

The full preprocessing pipeline lives in code/ in the same repo (see code/README.md).

Load + peek

Code

import pandas as pd

typhoid = pd.read_csv("../data/typhoid_data_1.csv")
print(f"rows: {len(typhoid):,}   cols: {typhoid.shape[1]}")
typhoid.head()

rows: 9,791   cols: 3

	Unnamed: 0	Unnamed: 1	Unnamed: 2
0	periodname	organisationunitname	Outpatient Morbidity-Communicable-Water/Food B...
1	Asar 2072	503 PYUTHAN	701
2	Asar 2072	109 PANCHTHAR	147
3	Asar 2072	112 MORANG	1,356
4	Asar 2072	708 KAILALI	1,554
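
The preview shows two quirks of the raw export: the real column names sit in the first data row (pandas labels the columns Unnamed: 0 to Unnamed: 2), and counts carry thousands separators such as "1,356". The sketch below shows one way such a frame could be tidied, using a few inline rows copied from the preview. The function name tidy_typhoid and the output column names (period_bs, district, cases, bs_month, bs_year) are illustrative assumptions; DataProcessor.process_typhoid may do this differently, and converting BS periods to AD dates is not attempted here.

```python
import io

import pandas as pd

# Inline sample mimicking the raw layout: an empty header line, then the real header row.
RAW_SAMPLE = (
    ",,\n"
    "periodname,organisationunitname,Outpatient Morbidity-Communicable-Water/Food B...\n"
    "Asar 2072,503 PYUTHAN,701\n"
    'Asar 2072,112 MORANG,"1,356"\n'
)


def tidy_typhoid(raw: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: drop the embedded header row and make counts numeric."""
    body = raw.iloc[1:].reset_index(drop=True)          # row 0 holds the real header text
    body.columns = ["period_bs", "district", "cases"]   # assumed names, not the pipeline's
    body["cases"] = pd.to_numeric(
        body["cases"].astype(str).str.replace(",", "", regex=False),
        errors="coerce",
    )
    # Split "Asar 2072" into a BS month name and a BS year.
    parts = body["period_bs"].str.split(" ", n=1, expand=True)
    body["bs_month"] = parts[0]
    body["bs_year"] = parts[1].astype(int)
    return body


raw_sample = pd.read_csv(io.StringIO(RAW_SAMPLE))
tidy = tidy_typhoid(raw_sample)
print(tidy)
```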

Code

climate = pd.read_csv("../data/climate_data.csv")
print(f"rows: {len(climate):,}   cols: {climate.shape[1]}")
climate.head()

rows: 8,316   cols: 7

	period	district	Min temperature (ERA5-Land)	Max air temperature (ERA5-Land)	Air temperature (ERA5-Land)	Precipitation (CHIRPS)	Relative humidity (ERA5-Land)
0	201501	101 Taplejung	-31.8	16.0	-6.2	11.18	47.9
1	201502	101 Taplejung	-29.3	18.8	-5.1	21.34	57.6
2	201503	101 Taplejung	-22.3	20.4	-1.9	84.47	63.1
3	201504	101 Taplejung	-15.0	22.3	0.5	46.29	72.6
4	201505	101 Taplejung	-11.8	23.0	4.7	146.13	80.2
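
The climate file keys each row on an integer YYYYMM period and a district label that combines a numeric code with a mixed-case name. As a rough sketch of how such keys might be harmonised before a merge, the snippet below parses the period into a month-start timestamp and a "YYYY-MM" string, and splits the district label into a code and an upper-cased name. The derived column names (date_dt, date_ad, district_code, district_name) are assumptions chosen to resemble the merged panel shown later on this page; the actual logic in DataProcessor.process_climate is not reproduced here.

```python
import pandas as pd

# Two rows copied from the preview (only one climate variable kept for brevity).
climate_sample = pd.DataFrame(
    {
        "period": [201501, 201502],
        "district": ["101 Taplejung", "101 Taplejung"],
        "Precipitation (CHIRPS)": [11.18, 21.34],
    }
)

# YYYYMM integer -> first-of-month timestamp, plus a "YYYY-MM" label.
climate_sample["date_dt"] = pd.to_datetime(
    climate_sample["period"].astype(str), format="%Y%m"
)
climate_sample["date_ad"] = climate_sample["date_dt"].dt.strftime("%Y-%m")

# "101 Taplejung" -> code "101" and name "TAPLEJUNG".
parts = climate_sample["district"].str.extract(r"^(\d+)\s+(.+)$")
climate_sample["district_code"] = parts[0]
climate_sample["district_name"] = parts[1].str.upper()

print(climate_sample[["date_ad", "district_code", "district_name"]])
```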

Code

flood = pd.read_csv("../data/flood_data.csv")
print(f"rows: {len(flood):,}   cols: {flood.shape[1]}")
flood.head()

rows: 874   cols: 3

	period	district	flood incidence
0	20230615	101 Taplejung	1
1	20230617	101 Taplejung	1
2	20230620	101 Taplejung	1
3	20190617	101 Taplejung	1
4	20230813	101 Taplejung	1
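
Flood records are daily events (YYYYMMDD), while the panel is monthly. A minimal sketch of the daily-to-monthly aggregation follows, using the five rows from the preview. Summing events per district-month is an assumption (it is consistent with flood_lag1 reaching 8 in Table 1), and the output column name flood is illustrative; DataProcessor.process_flood may aggregate differently.

```python
import pandas as pd

# The five preview rows.
flood_sample = pd.DataFrame(
    {
        "period": [20230615, 20230617, 20230620, 20190617, 20230813],
        "district": ["101 Taplejung"] * 5,
        "flood incidence": [1, 1, 1, 1, 1],
    }
)

# Daily YYYYMMDD -> "YYYY-MM" month label.
flood_sample["date_ad"] = pd.to_datetime(
    flood_sample["period"].astype(str), format="%Y%m%d"
).dt.strftime("%Y-%m")

# Count events per district-month (assumed aggregation: sum).
flood_monthly = (
    flood_sample.groupby(["district", "date_ad"], as_index=False)["flood incidence"]
    .sum()
    .rename(columns={"flood incidence": "flood"})
)
print(flood_monthly)
```

District-months with no recorded event do not appear in this output, so after a left merge onto the case panel the missing values would need to be filled with 0.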

Full pipeline — load → preprocess → merge → feature-engineer

The cell below reproduces the full pipeline in four steps (load, preprocess, merge, feature-engineer) using the in-repo code/src/ modules. It uses the same DataLoader, DataProcessor, and FeatureEngineer classes that code/main.py and code/paper_main.py call.

Code

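import os
import sys

# Make the in-repo package importable from this page's working directory.
sys.path.insert(0, os.path.abspath("../code"))

from src.loader    import DataLoader
from src.processor import DataProcessor
from src.engineer  import FeatureEngineer, FEATURE_COLS, TARGET_COL, TARGET_RAW

# Step 1: load the raw sources.
raw    = DataLoader().load_all()

# Steps 2-3: preprocess each source, then merge into one district-month panel.
proc   = DataProcessor()
merged = proc.merge_datasets(
    proc.process_typhoid(raw["typhoid"]),
    proc.process_climate(raw["climate"]),
    proc.process_flood(raw["flood"]),
)

# Step 4: add lag, rolling, and seasonal features.
df     = FeatureEngineer().create_features(merged)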
print(f"Merged panel  : {merged.shape[0]:,} rows × {merged.shape[1]} columns")
print(f"After lag-drop: {df.shape[0]:,} rows × {df.shape[1]} columns")
print(f"Date range    : {df['date_ad'].min()} → {df['date_ad'].max()}")
print(f"Districts     : {df['district'].nunique()}")
print(f"Features      : {FEATURE_COLS}")
df.head()

Merged panel  : 7,558 rows × 9 columns
After lag-drop: 7,327 rows × 22 columns
Date range    : 2015-05 → 2023-12
Districts     : 77
Features      : ['precip_lag1', 'temp_mean_lag1', 'humidity_lag1', 'flood_lag1', 'precip_roll3', 'temp_roll3', 'monsoon', 'month_sin', 'month_cos', 'cases_lag1']

	district	date_ad	temp_min	temp_max	temp_mean	precip	humidity	cases	date_dt	...	humidity_lag1	cases_lag1	precip_roll3	temp_roll3	month	monsoon	month_sin	month_cos	log_cases
0	ACHHAM	2015-05	1.3	34.7	21.1	46.17	47.5	447.0	2015-05-01	...	63.6	235.0	81.830000	9.666667	5	0	5.000000e-01	-8.660254e-01	6.104793
1	ACHHAM	2015-06	8.0	35.6	21.9	184.49	62.4	497.0	2015-06-01	...	47.5	447.0	82.713333	14.400000	6	1	1.224647e-16	-1.000000e+00	6.210600
2	ACHHAM	2015-07	10.8	28.4	20.5	315.79	87.4	452.0	2015-07-01	...	62.4	497.0	127.120000	18.400000	7	1	-5.000000e-01	-8.660254e-01	6.115892
3	ACHHAM	2015-08	10.9	28.4	20.1	258.30	88.6	865.0	2015-08-01	...	87.4	452.0	182.150000	21.166667	8	1	-8.660254e-01	-5.000000e-01	6.763885
4	ACHHAM	2015-09	5.7	28.3	19.4	108.11	82.8	459.0	2015-09-01	...	88.6	865.0	252.860000	20.833333	9	1	-1.000000e+00	-1.836970e-16	6.131226

5 rows × 22 columns
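
The goal of this page mentions a deterministic train/test split keyed on date. One simple way to get such a split is to send every row before a fixed cutoff month to the training set and every row from the cutoff onward to the test set. The sketch below does this on toy values; the function name split_by_date and the cutoff 2022-01-01 are hypothetical and are not taken from the repository.

```python
import pandas as pd


def split_by_date(df: pd.DataFrame, cutoff: str, date_col: str = "date_ad"):
    """Hypothetical helper: rows strictly before `cutoff` train, the rest test."""
    dates = pd.to_datetime(df[date_col].astype(str), format="%Y-%m")
    is_train = dates < pd.Timestamp(cutoff)
    return df[is_train].copy(), df[~is_train].copy()


# Toy panel with made-up case counts, only to show the mechanics.
demo = pd.DataFrame(
    {
        "district": ["ACHHAM"] * 4,
        "date_ad": ["2021-11", "2021-12", "2022-01", "2022-02"],
        "cases": [10, 12, 9, 11],
    }
)

train, test = split_by_date(demo, cutoff="2022-01-01")  # cutoff is an assumption
print(len(train), len(test))
```

Because the split depends only on the date column and a fixed cutoff, it involves no random state and returns the same partition on every run.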

Pre-computed descriptive statistics

For quick reference, the descriptive statistics that appear in Methodology live in tables/descriptive_statistics.csv. They were computed once, after the merge and feature-engineering pipeline above:

Table 1: Descriptive statistics of the engineered feature panel.

	precip_lag1	temp_mean_lag1	humidity_lag1	flood_lag1	precip_roll3	temp_roll3	monsoon	month_sin	month_cos	cases_lag1	cases	log_cases
count	7327.00	7327.00	7327.00	7327.00	7327.00	7327.00	7327.00	7327.00	7327.00	7327.00	7327.00	7327.00
mean	138.30	14.61	74.49	0.12	138.31	14.54	0.38	-0.11	0.03	465.88	462.53	5.63
std	155.96	9.26	12.70	0.50	128.77	8.95	0.48	0.68	0.73	449.10	447.90	1.17
min	0.00	-18.00	25.70	0.00	1.47	-17.17	0.00	-1.00	-1.00	1.00	1.00	0.69
25%	12.54	9.30	65.80	0.00	25.19	9.33	0.00	-0.87	-0.87	148.00	146.00	4.99
50%	69.86	15.80	75.20	0.00	91.00	16.03	0.00	-0.00	0.00	328.00	325.00	5.79
75%	234.42	21.40	85.90	0.00	230.84	21.13	1.00	0.50	0.87	639.00	633.00	6.45
max	1100.22	31.50	95.70	8.00	618.66	29.77	1.00	1.00	1.00	2771.75	2771.75	7.93
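
Table 1 has the layout of a pandas describe() summary (count, mean, std, min, quartiles, max). As a hedged sketch, a table of this shape could be produced as below; the toy values are for illustration only, and whether tables/descriptive_statistics.csv was generated this way is an assumption.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the engineered panel (illustrative values only).
demo = pd.DataFrame({"cases": [1.0, 146.0, 325.0, 633.0]})
demo["log_cases"] = np.log1p(demo["cases"])

# Same row layout as Table 1, rounded to two decimals.
stats = demo[["cases", "log_cases"]].describe().round(2)
print(stats)

# Writing the file is left commented out so the sketch has no side effects:
# stats.to_csv("tables/descriptive_statistics.csv")
```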

Feature engineering reference

The code/src/engineer.py module constructs:

One-month lags: precip_lag1, temp_mean_lag1, humidity_lag1, flood_lag1, cases_lag1
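Three-month rolling means: precip_roll3, temp_roll3, computed on the series shifted by one month, so the value at month t averages months t-1 to t-3 and never includes the current month (this prevents leakage)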
Cyclical month encodings: month_sin, month_cos, which place December and January adjacent on the unit circle
Monsoon indicator: monsoon (1 if month ∈ {6, 7, 8, 9})
Log-transformed target: log_cases = log(1 + cases) to stabilise variance over the heavy right tail typical of communicable-disease surveillance
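
The feature definitions above can be sketched in a few lines of pandas. The snippet below is an illustration, not the contents of code/src/engineer.py: the function name engineer_sketch is hypothetical, only three of the five lags are built (humidity and flood would follow the same pattern), and the cyclical encoding is assumed to be sin and cos of 2π·month/12, which agrees with the month_sin and month_cos values in the pipeline preview. The input rows are the ACHHAM values for 2015-05 to 2015-09 from that preview.

```python
import numpy as np
import pandas as pd


def engineer_sketch(panel: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical re-implementation of the listed features on a district-month panel."""
    out = panel.sort_values(["district", "date_dt"]).copy()
    by_district = out.groupby("district")

    # One-month lags, computed within each district.
    for col in ["precip", "temp_mean", "cases"]:
        out[f"{col}_lag1"] = by_district[col].shift(1)

    # Three-month rolling means on the shifted series: months t-1..t-3 only.
    out["precip_roll3"] = by_district["precip"].transform(
        lambda s: s.shift(1).rolling(3).mean()
    )
    out["temp_roll3"] = by_district["temp_mean"].transform(
        lambda s: s.shift(1).rolling(3).mean()
    )

    # Seasonal features.
    out["month"] = out["date_dt"].dt.month
    out["monsoon"] = out["month"].isin([6, 7, 8, 9]).astype(int)
    out["month_sin"] = np.sin(2 * np.pi * out["month"] / 12)
    out["month_cos"] = np.cos(2 * np.pi * out["month"] / 12)

    # Log-transformed target.
    out["log_cases"] = np.log1p(out["cases"])

    # Drop rows whose lag or rolling windows are incomplete.
    return out.dropna().reset_index(drop=True)


panel = pd.DataFrame(
    {
        "district": ["ACHHAM"] * 5,
        "date_dt": pd.date_range("2015-05-01", periods=5, freq="MS"),
        "precip": [46.17, 184.49, 315.79, 258.30, 108.11],
        "temp_mean": [21.1, 21.9, 20.5, 20.1, 19.4],
        "cases": [447.0, 497.0, 452.0, 865.0, 459.0],
    }
)

features = engineer_sketch(panel)
print(features[["date_dt", "precip_roll3", "cases_lag1", "monsoon", "log_cases"]])
```

With only five input months, the first three rows are dropped for incomplete windows, leaving 2015-08 and 2015-09. The 2015-08 row gives precip_roll3 ≈ 182.15 and cases_lag1 = 452, matching the corresponding row of the pipeline preview.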

Note

Continue to Train to fit the three predictive models on the engineered panel built here.