Load the three raw source datasets exactly as the production pipeline does, harmonise them into a single monthly district-level panel, and engineer the lag + rolling features used by every downstream model.
By the end of this page you should have a df DataFrame matching the feature schema in Methodology and a deterministic train/test split keyed on date.
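A split "keyed on date" can be sketched as below. This is a minimal illustration, not the repo's own split code: the helper name `split_by_date`, the cutoff value, and the toy frame are assumptions, and only the `date_dt` column name comes from the panel printed on this page.

```python
import pandas as pd


def split_by_date(df: pd.DataFrame, cutoff: str, date_col: str = "date_dt"):
    """Rows strictly before `cutoff` go to train; all later rows go to test."""
    dates = pd.to_datetime(df[date_col])
    is_train = dates < pd.Timestamp(cutoff)
    return df[is_train].copy(), df[~is_train].copy()


# Toy panel: two districts, six months each (values are made up).
toy = pd.DataFrame({
    "district": ["A"] * 6 + ["B"] * 6,
    "date_dt": list(pd.date_range("2023-07-01", periods=6, freq="MS")) * 2,
    "cases": range(12),
})

# The cutoff is hypothetical; the real pipeline may use a different date.
train, test = split_by_date(toy, cutoff="2023-11-01")
print(len(train), len(test))
```

Because membership depends only on the date, the split is reproducible without a random seed, and every test month falls after every training month.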
Source datasets
Four CSVs live in data/ at the repo root and are committed because they are small (< 400 KB each) and aggregated (no PII):
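| File | Description |
|---|---|
| data/typhoid_data_1.csv | District-month case counts, primary export (BS calendar) |
| data/typhoid_data_2.csv | Aggregated secondary source for trailing months |
| data/climate_data.csv | ERA5-Land + CHIRPS climate variables, district-month |
| data/flood_data.csv | DRR-portal flood event records (daily, aggregated to monthly) |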
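The page source describes the flood file as daily records aggregated to monthly. A rough sketch of that kind of aggregation follows. The input column names (`district`, `date`) and the choice of a plain event count are assumptions; the actual logic lives in `DataProcessor.process_flood`, which is not shown on this page. The output columns `date_ad` and `flood_events` match names in the merged panel.

```python
import pandas as pd

# Hypothetical daily event records; the real file's columns may differ.
daily = pd.DataFrame({
    "district": ["ACHHAM", "ACHHAM", "ACHHAM", "BAJURA"],
    "date": pd.to_datetime(
        ["2015-07-03", "2015-07-19", "2015-08-02", "2015-07-11"]
    ),
})

# Collapse each (district, month) to a single row holding the event count.
monthly = (
    daily.assign(date_ad=daily["date"].dt.strftime("%Y-%m"))
    .groupby(["district", "date_ad"])
    .size()
    .rename("flood_events")
    .reset_index()
)
print(monthly)
```

District-months with no recorded event are absent from this result, so a left merge onto the case panel would need its missing counts filled with 0.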
Full pipeline — load → preprocess → merge → feature-engineer
The four lines below reproduce the full pipeline using the in-repo code/src/ modules. They are the same DataLoader, DataProcessor, and FeatureEngineer that code/main.py and code/paper_main.py call.
Merged panel : 7,558 rows × 9 columns
After lag-drop: 7,327 rows × 22 columns
Date range : 2015-05 → 2023-12
Districts : 77
Features : ['precip_lag1', 'temp_mean_lag1', 'humidity_lag1', 'flood_lag1', 'precip_roll3', 'temp_roll3', 'monsoon', 'month_sin', 'month_cos', 'cases_lag1']
| | district | date_ad | temp_min | temp_max | temp_mean | precip | humidity | cases | flood_events | date_dt | ... | humidity_lag1 | flood_lag1 | cases_lag1 | precip_roll3 | temp_roll3 | month | monsoon | month_sin | month_cos | log_cases |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ACHHAM | 2015-05 | 1.3 | 34.7 | 21.1 | 46.17 | 47.5 | 447.0 | 0.0 | 2015-05-01 | ... | 63.6 | 0.0 | 235.0 | 81.830000 | 9.666667 | 5 | 0 | 5.000000e-01 | -8.660254e-01 | 6.104793 |
| 1 | ACHHAM | 2015-06 | 8.0 | 35.6 | 21.9 | 184.49 | 62.4 | 497.0 | 0.0 | 2015-06-01 | ... | 47.5 | 0.0 | 447.0 | 82.713333 | 14.400000 | 6 | 1 | 1.224647e-16 | -1.000000e+00 | 6.210600 |
| 2 | ACHHAM | 2015-07 | 10.8 | 28.4 | 20.5 | 315.79 | 87.4 | 452.0 | 0.0 | 2015-07-01 | ... | 62.4 | 0.0 | 497.0 | 127.120000 | 18.400000 | 7 | 1 | -5.000000e-01 | -8.660254e-01 | 6.115892 |
| 3 | ACHHAM | 2015-08 | 10.9 | 28.4 | 20.1 | 258.30 | 88.6 | 865.0 | 0.0 | 2015-08-01 | ... | 87.4 | 0.0 | 452.0 | 182.150000 | 21.166667 | 8 | 1 | -8.660254e-01 | -5.000000e-01 | 6.763885 |
| 4 | ACHHAM | 2015-09 | 5.7 | 28.3 | 19.4 | 108.11 | 82.8 | 459.0 | 0.0 | 2015-09-01 | ... | 88.6 | 0.0 | 865.0 | 252.860000 | 20.833333 | 9 | 1 | -1.000000e+00 | -1.836970e-16 | 6.131226 |

5 rows × 22 columns
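Before training, it can help to confirm that a panel like the one printed above is keyed the way you expect. The sketch below is one possible check, not part of the repo: the function name `check_panel` and the toy frame are assumptions, while `district` and `date_ad` are the column names shown in the output. It reports findings instead of raising, since the printed row count (7,327) is not a multiple of the 77 districts and the real panel is therefore presumably unbalanced.

```python
import pandas as pd


def check_panel(df: pd.DataFrame) -> dict:
    """Summarise basic integrity facts for a district-month panel."""
    dates = pd.to_datetime(df["date_ad"], format="%Y-%m")
    # Consecutive calendar months differ by exactly 1 on this index.
    month_index = dates.dt.year * 12 + dates.dt.month
    gaps = month_index.groupby(df["district"]).apply(
        lambda s: int((s.sort_values().diff().dropna() != 1).sum())
    )
    return {
        "duplicate_keys": int(df.duplicated(subset=["district", "date_ad"]).sum()),
        "missing_values": int(df.isna().sum().sum()),
        "districts_with_gaps": int((gaps > 0).sum()),
    }


# Toy panel: district B skips 2015-06 (values are made up).
toy = pd.DataFrame({
    "district": ["A", "A", "A", "B", "B"],
    "date_ad": ["2015-05", "2015-06", "2015-07", "2015-05", "2015-07"],
    "cases": [3, 4, 5, 1, 2],
})
report = check_panel(toy)
print(report)
```

Gaps matter for lag features in particular: a row-based shift within a district takes the previous available row, which is only the previous month when no month is missing.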
Pre-computed descriptive statistics
For quick reference, the descriptive statistics that appear in Methodology — computed once after the merge + feature engineering pipeline above — live in tables/descriptive_statistics.csv:
Table 1: Descriptive statistics of the engineered feature panel.
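Feature engineering reference
The code/src/engineer.py module constructs:
- One-month lags: precip_lag1, temp_mean_lag1, humidity_lag1, flood_lag1, cases_lag1
- Three-month rolling means (computed on shift-1 to prevent leakage): precip_roll3, temp_roll3
- Cyclical month encodings: month_sin, month_cos, which place December and January adjacent on a unit circle
- Monsoon indicator: monsoon (1 if month ∈ {6, 7, 8, 9})
- Log-transformed target: log_cases = log(1 + cases), to stabilise variance over the heavy right tail typical of communicable-disease surveillance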
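The engineered features (one-month lags, shift-1 rolling means, cyclical month encodings, the monsoon flag, and the log target) can be approximated with plain pandas as sketched below. This is an illustration under assumptions, not the code in `code/src/engineer.py`: the function name `add_features` is hypothetical, only a subset of the features is built, and the period-12 sine/cosine form is inferred from the values printed earlier on this page. The toy input reuses the five ACHHAM rows from that output.

```python
import numpy as np
import pandas as pd


def add_features(panel: pd.DataFrame) -> pd.DataFrame:
    """Illustrative subset of the lag, rolling, calendar, and target features."""
    out = panel.sort_values(["district", "date_dt"]).copy()

    # One-month lags, computed within each district.
    out["precip_lag1"] = out.groupby("district")["precip"].shift(1)
    out["cases_lag1"] = out.groupby("district")["cases"].shift(1)

    # Rolling mean of the already-shifted series: the window for month t
    # covers t-3..t-1 and never sees month t itself.
    out["precip_roll3"] = out.groupby("district")["precip"].transform(
        lambda s: s.shift(1).rolling(3).mean()
    )

    month = out["date_dt"].dt.month
    out["month_sin"] = np.sin(2 * np.pi * month / 12)
    out["month_cos"] = np.cos(2 * np.pi * month / 12)
    out["monsoon"] = month.isin([6, 7, 8, 9]).astype(int)
    out["log_cases"] = np.log1p(out["cases"])

    # Drop the leading rows of each district that lack a full lag window.
    return out.dropna().reset_index(drop=True)


# Precipitation and case values copied from the ACHHAM rows printed above.
toy = pd.DataFrame({
    "district": ["ACHHAM"] * 5,
    "date_dt": pd.date_range("2015-05-01", periods=5, freq="MS"),
    "precip": [46.17, 184.49, 315.79, 258.30, 108.11],
    "cases": [447.0, 497.0, 452.0, 865.0, 459.0],
})
feats = add_features(toy)
print(feats[["date_dt", "precip_roll3", "monsoon", "log_cases"]])
```

On this toy input the two surviving rows (2015-08 and 2015-09) give `precip_roll3` values of 182.15 and 252.86, the same as the corresponding rows in the printed panel, which is consistent with the shift-then-roll reading of "computed on shift-1".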
Note
Continue to Train to fit the three predictive models on the engineered panel built here.
Source Code
---
title: "Replication — Data"
---

## Goal

Load the three raw source datasets exactly as the production pipeline
does, harmonise them into a single monthly district-level panel, and
engineer the lag + rolling features used by every downstream model.

By the end of this page you should have a `df` DataFrame matching the
feature schema in [Methodology](../paper/03-methodology.qmd) and a
deterministic train/test split keyed on date.

## Source datasets

Four CSVs live in `data/` at the repo root and are committed because
they are small (< 400 KB each) and aggregated (no PII):

| File | Description |
|---|---|
| `data/typhoid_data_1.csv` | District-month case counts, primary export (BS calendar) |
| `data/typhoid_data_2.csv` | Aggregated secondary source for trailing months |
| `data/climate_data.csv` | ERA5-Land + CHIRPS climate variables, district-month |
| `data/flood_data.csv` | DRR-portal flood event records (daily, aggregated to monthly) |

The full preprocessing pipeline lives at [`code/`](https://github.com/baralsamrat/baralsamrat.github.io/tree/main/code) in the same repo (see [`code/README.md`](https://github.com/baralsamrat/baralsamrat.github.io/blob/main/code/README.md)).

## Load + peek

```{python}
#| label: load-typhoid
#| code-fold: show
import pandas as pd
typhoid = pd.read_csv("../data/typhoid_data_1.csv")
print(f"rows: {len(typhoid):,} cols: {typhoid.shape[1]}")
typhoid.head()
```

```{python}
#| label: load-climate
#| code-fold: show
climate = pd.read_csv("../data/climate_data.csv")
print(f"rows: {len(climate):,} cols: {climate.shape[1]}")
climate.head()
```

```{python}
#| label: load-flood
#| code-fold: show
flood = pd.read_csv("../data/flood_data.csv")
print(f"rows: {len(flood):,} cols: {flood.shape[1]}")
flood.head()
```

## Full pipeline — load → preprocess → merge → feature-engineer

The four lines below reproduce the full pipeline using the in-repo
[`code/src/`](https://github.com/baralsamrat/baralsamrat.github.io/tree/main/code/src)
modules. They are the *same* `DataLoader`, `DataProcessor`, and
`FeatureEngineer` that `code/main.py` and `code/paper_main.py` call.

```{python}
#| label: pipeline
#| code-fold: show
import sys, os
sys.path.insert(0, os.path.abspath("../code"))
from src.loader import DataLoader
from src.processor import DataProcessor
from src.engineer import FeatureEngineer, FEATURE_COLS, TARGET_COL, TARGET_RAW

raw = DataLoader().load_all()
proc = DataProcessor()
merged = proc.merge_datasets(
    proc.process_typhoid(raw["typhoid"]),
    proc.process_climate(raw["climate"]),
    proc.process_flood(raw["flood"]),
)
df = FeatureEngineer().create_features(merged)

print(f"Merged panel : {merged.shape[0]:,} rows × {merged.shape[1]} columns")
print(f"After lag-drop: {df.shape[0]:,} rows × {df.shape[1]} columns")
print(f"Date range : {df['date_ad'].min()} → {df['date_ad'].max()}")
print(f"Districts : {df['district'].nunique()}")
print(f"Features : {FEATURE_COLS}")
df.head()
```

## Pre-computed descriptive statistics

For quick reference, the descriptive statistics that appear in
[Methodology](../paper/03-methodology.qmd) — computed once after the
merge + feature engineering pipeline above — live in
[`tables/descriptive_statistics.csv`](../tables/descriptive_statistics.csv):

```{python}
#| label: tbl-desc
#| tbl-cap: "Descriptive statistics of the engineered feature panel."
#| echo: false
import pandas as pd
desc = pd.read_csv("../tables/descriptive_statistics.csv", index_col=0)
desc.round(2)
```

## Feature engineering reference

The [`code/src/engineer.py`](https://github.com/baralsamrat/baralsamrat.github.io/blob/main/code/src/engineer.py) module constructs:

- **One-month lags**: `precip_lag1`, `temp_mean_lag1`, `humidity_lag1`, `flood_lag1`, `cases_lag1`
- **Three-month rolling means** (computed on shift-1 to prevent leakage): `precip_roll3`, `temp_roll3`
- **Cyclical month encodings**: `month_sin`, `month_cos` — places December and January adjacent on a unit circle
- **Monsoon indicator**: `monsoon` (1 if month ∈ {6, 7, 8, 9})
- **Log-transformed target**: `log_cases = log(1 + cases)` to stabilise variance over the heavy right tail typical of communicable-disease surveillance

::: {.callout-note}
Continue to **[Train](train.qmd)** to fit the three predictive models on
the engineered panel built here.
:::