3 — Methodology
3.1 — Research design and analytical framework
This study employed a retrospective ecological research design to examine the spatiotemporal relationship between climate variability, extreme flood events, and typhoid fever incidence across Nepal. An ecological approach was appropriate given the study’s reliance on aggregated secondary data and its objective of identifying population-level climate–health patterns rather than individual-level risk factors. The analytical framework integrated descriptive epidemiology, inferential generalized linear mixed models (GLMMs), and ML-based prediction to capture both explanatory and predictive dimensions of climate-sensitive disease dynamics. The GLMM component estimated fixed-effect climate–disease associations while accounting for unobserved district-level heterogeneity through random effects. ML models were applied as a complementary predictive layer designed to capture the nonlinear threshold effects and interaction terms that parametric models cannot easily prespecify.
3.2 — Study area and temporal scope
The study covered all 77 administrative districts of Nepal, spanning three major physiographic zones:
- Terai plains — elevation < 300 m, flood-prone, subtropical monsoon climate, high population density (~450 persons/km²), reliance on tube wells and open water sources particularly susceptible to flood-driven contamination.
- Mid-hill region — intermediate elevation and rainfall, moderate exposure.
- Himalayan mountain belt — high altitude, lower disease incidence but vulnerable to landslides and glacial lake outburst floods.
The temporal scope extended from January 2015 to December 2023, capturing nine complete years of typhoid surveillance and climate observations, including multiple extreme monsoon seasons, the 2016 flood peak, and the post-TCV introduction period. Monthly aggregation was used to reflect disease transmission cycles while maintaining sufficient temporal resolution for lag analysis.
3.3 — Data sources
Typhoid incidence data
Typhoid fever incidence data were obtained from Nepal’s HMIS, which records outpatient morbidity reported by public health facilities nationwide. The dataset comprised monthly counts of clinically diagnosed enteric fever cases aggregated at the district level. Although HMIS data are syndromic and may underreport laboratory-confirmed cases relative to true burden, they provide the most comprehensive and continuous national surveillance coverage available for Nepal (Ministry of Health and Population, Government of Nepal, 2023). Two HMIS datasets were merged to cover the full study period; national and provincial aggregate records were excluded to maintain district-level consistency.
Climate variables
Monthly precipitation estimates were obtained from the Climate Hazards Group InfraRed Precipitation with Station data (CHIRPS) (Funk et al., 2015), which has been validated for complex Himalayan terrain. Air temperature and relative humidity were sourced from the ERA5-Land reanalysis dataset (Muñoz-Sabater et al., 2021), selected for its high spatial resolution and established use in climate–health research. District-level climate values were generated by spatially averaging grid cells intersecting each district boundary.
Extreme climate event indicator
Flood occurrence data were obtained from Nepal’s Disaster Risk Reduction (DRR) portal, recorded by district and date, and aggregated to monthly counts. Floods were treated as count-based indicators reflecting hydro-meteorological shocks capable of disrupting WASH infrastructure and contaminating water supplies. Although event severity and spatial extent were not consistently recorded, flood frequency has been validated as a robust proxy for extreme exposure at the district population level (Bhandari et al., 2022).
3.4 — Data pre-processing and feature engineering
All datasets were harmonized into a district–month panel using consistent temporal formats. Nepali calendar dates in HMIS records were converted to Gregorian equivalents using standardized month mappings, and district names were standardized against Nepal’s official administrative boundary list to ensure accurate merges across all three datasets.
Missing values were limited to less than 10% of observations overall and were handled using conservative imputation methods. Climate variables with short consecutive gaps (≤ 2 months) were linearly interpolated. Missing typhoid counts were imputed using district-specific seasonal averages calculated from non-missing years within the same calendar month. Extreme outliers — specifically, national or provincial aggregate entries erroneously recorded at the district level — were identified by their anomalously high case counts and excluded prior to analysis.
Feature engineering focused on capturing biologically and environmentally plausible transmission dynamics. Lagged predictor variables of one and two months were generated for flood occurrence and all climate variables. The lag range of 0–2 months was selected based on two complementary lines of evidence:
- The reported incubation period of Salmonella Typhi, which ranges from 6 to 30 days, implying that environmental exposures in one calendar month should primarily manifest clinically in the following month.
- Exploratory correlation analysis at the district-month level (n = 6,162), which confirmed that one-month-lagged climate features carry the strongest signal: lagged precipitation r = 0.313, lagged humidity r = 0.180, lagged flood frequency r = 0.122, and the autoregressive
cases_lag1term r = 0.731. The autoregressive dominance reflects persistent endemic foci in high-burden districts, while the lagged climate signal is biologically aligned with the S. Typhi incubation period plus a clinical-reporting delay.
No lags beyond two months were retained, as correlations fell below conventional significance at three months and beyond. Additional features include a binary monsoon season indicator (June–September = 1) and an ecological zone classification (Terai, Hill, Mountain). Typhoid case counts were log-transformed to reduce right skewness and stabilize variance prior to modeling.
3.5 — Machine learning models and model selection rationale
Three ML algorithms were employed, selected for their complementary strengths and suitability for the structure of the dataset — a medium-sized district-month panel with temporal dependence, right-skewed outcomes, and categorical features. These algorithms were chosen in preference to purely time-series approaches (e.g. ARIMA, SARIMA) because the research objective was to incorporate exogenous climate and flood predictors alongside temporal patterns, which ARIMA extensions handle poorly with many predictors and nonlinear effects (Choi et al., 2023; Dixon et al., 2023).
Random Forest (RF)
Used as the baseline ensemble method. RF is robust to overfitting with moderate sample sizes, handles mixed predictor types without scaling, and provides permutation-based variable importance rankings that are useful for epidemiological interpretation (Breiman, 2001).
XGBoost (Extreme Gradient Boosting)
Selected for its demonstrated superiority with structured tabular data, its built-in L1/L2 regularization, and its capacity to learn nonlinear threshold effects and interaction terms that are intrinsic to climate–health systems (Chen & Guestrin, 2016). The XGBoost objective is
\[ \mathcal{L}^{(t)} = \sum_{i=1}^{N}\ell\!\left(y_i,\hat{y}_i^{(t-1)} + f_t(\mathbf{x}_i)\right) + \Omega(f_t), \tag{1}\]
where \(\Omega(f) = \gamma T + \tfrac{1}{2}\lambda\lVert w\rVert^2\) is the regularisation penalty over leaf weights \(w\) and tree size \(T\).
Long Short-Term Memory (LSTM)
Included to capture sequential temporal dependencies in the monthly time series — particularly the lagged effect of flood events, which tree-based methods treat as independent observations (Hochreiter & Schmidhuber, 1997). LSTM was treated as an exploratory model given the relatively short available sequence length (9 years × 12 months = 108 time steps at national level).
3.6 — Model training, validation, tuning, and ensemble construction
The dataset was first divided chronologically into a training subset (the earliest 80% of observations) and a held-out test subset (the most recent 20%), strictly preserving temporal order to prevent any form of data leakage from future time points into model training. This chronological split is essential for time-series evaluation: random splits would allow the model to observe data from 2022 when predicting 2018, which artificially inflates apparent performance.
Within the training subset only, TimeSeriesSplit cross-validation with k = 5 folds was applied for hyperparameter tuning. In each fold, training observations were drawn from earlier time points and validation observations from later time points within the training window, maintaining temporal ordering throughout. Grid search was used to identify optimal hyperparameters by minimizing validation-set RMSE. Final model performance was then assessed exclusively on the held-out test set, which the model never encountered during training or tuning.
For machine learning models, all continuous predictor features were standardized (zero mean, unit standard deviation) prior to training. Early stopping was applied during LSTM training to prevent overfitting on the limited sequence length. XGBoost regularization parameters (reg_lambda, reg_alpha) were tuned via grid search. Model stability was assessed through sensitivity checks across monsoon (June–September) and non-monsoon subperiods.
Ensemble construction
An ensemble model was constructed by combining the predictions of all three algorithms using weighted averaging. Weights were assigned a priori (not tuned on the test set) to up-weight XGBoost relative to the other learners, on the basis of its consistently lower RMSE on similar epidemiological count-data panels:
- RF weight: 0.25
- XGBoost weight: 0.50
- LSTM / MLP weight: 0.25
The ensemble prediction for each observation is then
\[ \hat{Y}_{\text{ens}} = 0.25 \,\hat{Y}_{\text{RF}} + 0.50 \,\hat{Y}_{\text{XGB}} + 0.25 \,\hat{Y}_{\text{LSTM/MLP}}. \]
This a priori weighting prevents test-set leakage that would otherwise arise from R²-tuned weights, while still preserving the complementary strengths of each algorithm. Ensemble performance was evaluated on the same held-out test set as the individual models. The exact weights are implemented in code/src/models.py:weighted_ensemble.
3.7 — Performance evaluation metrics
Model performance was evaluated using four regression metrics appropriate for continuous case-count outcomes:
- RMSE (Root Mean Squared Error) and MAE (Mean Absolute Error) for absolute error magnitude.
- R² (coefficient of determination) for the proportion of variance explained.
- MAPE (Mean Absolute Percentage Error) for proportional accuracy interpretable by public health practitioners.
Bootstrapped 95% confidence intervals (n = 1,000 resamples) were generated to assess metric stability across the test set.
3.8 — Climate change scenario analysis and uncertainty
To assess future typhoid risk, scenario-based projections were generated under the SSP2-4.5 (moderate emissions) and SSP5-8.5 (high emissions) climate pathways. Historical climate inputs (precipitation, temperature, flood frequency) were perturbed in accordance with CMIP6 regional projections for Nepal by 2050 relative to the 1995–2014 baseline (Intergovernmental Panel on Climate Change, 2022; Shrestha et al., 2025). Specifically:
| Scenario | Δ Precipitation | Δ Temperature | Δ Flood frequency |
|---|---|---|---|
| SSP2-4.5 (2050) | +15% | +1.0 °C | +20% |
| SSP5-8.5 (2050) | +30% | +1.5 °C | +40% |
Monte Carlo simulation (n = 1,000 random draws) was used to propagate uncertainty in climate inputs through the trained XGBoost model, generating a distribution of projected typhoid incidence from which the median and 95% uncertainty interval were extracted.
Projection caveats. These projections carry several important caveats:
- They are scenario-based and assume that the historical statistical relationship between climate variables and typhoid incidence remains stable through 2050, without accounting for potential changes in WASH infrastructure quality, TCV coverage, population density, or pathogen characteristics.
- The projection model is trained on nine years of observations; longer training periods would improve projection robustness.
- The uncertainty intervals reflect variability in climate inputs only, not model parameter uncertainty.
The projections should therefore be interpreted as indicative estimates of the direction and relative magnitude of climate-driven risk changes, rather than as precise forecasts.