Subnational Spatial Epidemiology of HIV Prevalence and
Socioeconomic Determinants in Ghana: A 260-District Analysis
Bivariate LISA, Spatial Lag Regression, LASSO, and Random Forest with SHAP Interpretation
Valentine Golden Ghanem, MSc Public Health (Distinction), MSc Data Science (Distinction)
Ghana COCOBOD Cocoa Clinic, Accra, Ghana
ORCID: 0009-0002-8332-0220
Background

HIV spatial heterogeneity in sub-Saharan Africa reflects complex interactions between behavioural factors, socioeconomic determinants, and health system access. Ghana's national prevalence of 1.7% masks substantial subnational variation that standard aggregate summaries fail to capture.

Objective: Characterise the spatial distribution and socioeconomic determinants of HIV prevalence across Ghana's 260 health districts using spatial statistics and machine learning.

Study Design: Ecological cross-sectional analysis integrating DHS 2014/2022, Ghana Census 2021, and Ghana Statistical Service data across 260 districts in 16 administrative regions.

Methods

Spatial Analysis:

  • Global Moran's I (KNN k=4)
  • LISA with Rook contiguity (999 permutations, p<0.05)
  • Getis-Ord Gi* hotspot detection
  • Spatial lag regression (ML estimation, Queen contiguity)
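
A minimal sketch of the global Moran's I step with k-nearest-neighbour weights and a permutation test. It is written in plain Python rather than PySAL so that it runs without extra packages. The toy grid, the helper names (`knn_weights`, `morans_i`, `permutation_p`) and the row-standardised weights are illustrative assumptions, not the analysis pipeline itself.

```python
import math
import random

def knn_weights(coords, k=4):
    """Row-standardised k-nearest-neighbour weights as {i: {j: w_ij}}."""
    weights = {}
    for i, (xi, yi) in enumerate(coords):
        dists = sorted(
            (math.hypot(xi - xj, yi - yj), j)
            for j, (xj, yj) in enumerate(coords) if j != i
        )
        # Ties in distance are broken by the lower index (an arbitrary choice).
        weights[i] = {j: 1.0 / k for _, j in dists[:k]}
    return weights

def morans_i(values, weights):
    """Global Moran's I = (n / S0) * sum_ij w_ij z_i z_j / sum_i z_i^2."""
    n = len(values)
    mean = sum(values) / n
    z = [v - mean for v in values]
    s0 = sum(w for row in weights.values() for w in row.values())
    num = sum(w * z[i] * z[j] for i, row in weights.items() for j, w in row.items())
    den = sum(d * d for d in z)
    return (n / s0) * (num / den)

def permutation_p(values, weights, n_perm=999, seed=0):
    """One-sided pseudo p-value from random relabelling of the values."""
    rng = random.Random(seed)
    observed = morans_i(values, weights)
    shuffled = list(values)
    extreme = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if morans_i(shuffled, weights) >= observed:
            extreme += 1
    return observed, (extreme + 1) / (n_perm + 1)

# Hypothetical 6x6 grid of "districts" with a smooth north-south gradient.
coords = [(x, y) for y in range(6) for x in range(6)]
values = [y + 0.1 * x for x, y in coords]
w = knn_weights(coords, k=4)
i_obs, p_value = permutation_p(values, w)
print(round(i_obs, 3), p_value)
```

With 999 permutations the smallest attainable pseudo p-value is 1/1000, which is why permutation results are usually reported against thresholds such as p<0.05 or p<0.001.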

Machine Learning:

  • LASSO with 10-fold CV (feature selection)
  • Random Forest with leave-one-region-out spatial CV (16 folds)
  • SHAP TreeExplainer (feature importance)
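
The leave-one-region-out scheme can be sketched as follows: each fold holds out every district in one region and trains on the rest, so the score measures transfer to an unseen region. The data, region labels, and the straight-line stand-in model (used here in place of a Random Forest to keep the example dependency-free) are hypothetical. scikit-learn provides the same splitting logic as `LeaveOneGroupOut`.

```python
from collections import defaultdict

def leave_one_group_out(groups):
    """Yield (held_out_group, train_idx, test_idx) once per distinct group."""
    by_group = defaultdict(list)
    for idx, g in enumerate(groups):
        by_group[g].append(idx)
    for g, test_idx in by_group.items():
        train_idx = [i for i in range(len(groups)) if groups[i] != g]
        yield g, train_idx, test_idx

def fit_line(xs, ys):
    """Ordinary least squares for one predictor; returns (slope, intercept)."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    slope = sum((a - mx) * (b - my) for a, b in zip(xs, ys)) / sum((a - mx) ** 2 for a in xs)
    return slope, my - slope * mx

def r_squared(y_true, y_pred):
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Hypothetical toy data: 9 districts in 3 regions.
regions = ["A", "A", "A", "B", "B", "B", "C", "C", "C"]
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
y = [1.1, 1.9, 3.2, 3.9, 5.1, 6.0, 7.2, 7.8, 9.1]

oof = [None] * len(y)  # out-of-fold predictions, one per district
n_folds = 0
for region, train_idx, test_idx in leave_one_group_out(regions):
    slope, intercept = fit_line([x[i] for i in train_idx], [y[i] for i in train_idx])
    for i in test_idx:
        oof[i] = intercept + slope * x[i]
    n_folds += 1

spatial_r2 = r_squared(y, oof)
print(n_folds, round(spatial_r2, 3))
```

Pooling the out-of-fold predictions before computing R² is one common convention; averaging per-fold R² is another and can give a different number.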

Software: Python (PySAL, scikit-learn, SHAP, GeoPandas). CRS: EPSG:32630.

Fig 1. Study Area & HIV Prevalence
Paired choropleths: HIV prevalence (left) and poverty incidence (right) across 260 districts. Northern districts show high poverty but low HIV prevalence; southern and central districts show the inverse pattern.
Fig 2. LISA Cluster Map
Local Indicators of Spatial Association. 113 significant clusters identified: 59 Low-High (blue), 51 High-Low (red), 3 non-significant. Boundary-transition effects dominate. p<0.05, 999 permutations.
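
To make the Low-High and High-Low labels concrete, the sketch below assigns each district a quadrant from the sign of its own standardised value and the sign of its neighbours' average (the spatial lag). It assumes the bivariate form named in the title, with HIV prevalence as the district's own variable and poverty as the lagged variable. The four-district chain, the numbers, and the function names are hypothetical, and the permutation step that decides significance is omitted.

```python
import statistics

def standardise(values):
    """Convert raw values to z-scores (population standard deviation)."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return [(v - mean) / sd for v in values]

def spatial_lag(z, neighbours):
    """Average of each district's neighbours' z-scores."""
    return [sum(z[j] for j in nbrs) / len(nbrs) for nbrs in neighbours]

def quadrants(own_values, lagged_values, neighbours):
    """Label each district '<own>-<lag>' with High/Low relative to the mean."""
    z_own = standardise(own_values)
    z_lag = spatial_lag(standardise(lagged_values), neighbours)
    labels = []
    for a, b in zip(z_own, z_lag):
        labels.append(("High" if a > 0 else "Low") + "-" + ("High" if b > 0 else "Low"))
    return labels

# Hypothetical chain of four districts: 0 - 1 - 2 - 3 (north to south).
neighbours = [[1], [0, 2], [1, 3], [2]]
hiv = [0.5, 0.8, 2.5, 3.0]          # illustrative prevalence, %
poverty = [60.0, 55.0, 20.0, 15.0]  # illustrative poverty incidence, %

labels = quadrants(hiv, poverty, neighbours)
print(labels)
```

In this toy layout the low-HIV districts sit among high-poverty neighbours (Low-High) and the high-HIV districts among low-poverty neighbours (High-Low), the same kind of opposing pattern the caption describes.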
Fig 3. Pearson Correlation Matrix
Full 19-variable symmetric matrix. Strongest correlates of HIV prevalence by absolute value: VCT uptake (r=0.84), wife-beating acceptance (r=−0.82), female secondary education (r=0.81), illiteracy (r=−0.68).
Fig 4. SHAP Feature Importance
Top 10 Random Forest predictors via SHAP TreeExplainer. VCT Women (0.639) and Wife Beating Acceptance (0.241) dominate. Remaining 8 features contribute <0.03 each.
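
For readers unfamiliar with SHAP, the quantity being summarised is the Shapley value: a feature's average marginal contribution to a prediction over all orders in which features can be switched on. The brute-force sketch below enumerates every ordering for a three-feature toy model. TreeExplainer uses a much faster tree-specific algorithm, and the model, feature names, and baseline here are hypothetical.

```python
from itertools import permutations

def shapley_values(model, x, baseline):
    """Exact Shapley values by enumerating all feature orderings.

    Features not yet 'switched on' are held at their baseline value.
    Cost grows factorially with the number of features, so this is
    only practical for a handful of features.
    """
    n = len(x)
    phi = [0.0] * n
    orderings = list(permutations(range(n)))
    for order in orderings:
        current = list(baseline)
        prev = model(current)
        for j in order:
            current[j] = x[j]
            new = model(current)
            phi[j] += new - prev  # marginal contribution of feature j
            prev = new
    return [p / len(orderings) for p in phi]

def toy_model(features):
    """Hypothetical model with one interaction term."""
    vct, norms, literacy = features
    return 0.6 * vct - 0.3 * norms + 0.1 * vct * literacy

x = [1.0, 1.0, 1.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, x, baseline)
print([round(p, 3) for p in phi])
```

The values always sum to the difference between the prediction at `x` and the prediction at the baseline, and the interaction term is split equally between the two features involved.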
Key Findings

Global Moran's I: 0.907 (z=22.3, p<0.001)

Strong positive spatial autocorrelation confirmed.

Spatial Lag Regression:

  • ρ = 0.596 (p<0.001)
  • Pseudo-R² = 0.927
  • VCT uptake: β=0.039 (p<0.001)
  • Modern contraception: β=0.037 (p=0.002)
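
In a spatial lag model, y = ρWy + Xβ + ε, a coefficient β is not the full effect of a covariate, because changes propagate through neighbouring districts via ρ. If the weights matrix W is row-standardised (an assumption; the poster does not state the standardisation), the average total impact is β / (1 − ρ). The sketch below applies that formula to the reported estimates; the function name is illustrative.

```python
def total_impact(beta, rho):
    """Average total impact beta / (1 - rho).

    Valid for |rho| < 1 and a row-standardised W (assumed here).
    """
    if not -1.0 < rho < 1.0:
        raise ValueError("rho must lie strictly between -1 and 1")
    return beta / (1.0 - rho)

rho = 0.596
coefficients = {"VCT uptake": 0.039, "Modern contraception": 0.037}

multiplier = 1.0 / (1.0 - rho)
impacts = {name: total_impact(beta, rho) for name, beta in coefficients.items()}

# The multiplier is the limit of the feedback series 1 + rho + rho^2 + ...
series = sum(rho ** k for k in range(200))

for name, value in impacts.items():
    print(f"{name}: coefficient={coefficients[name]:.3f}, total impact={value:.4f}")
print(f"spatial multiplier = {multiplier:.3f}")
```

Under this assumption the spatial multiplier is about 2.5, so the total impacts are roughly two and a half times the reported coefficients.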

ML Model Comparison:

  • LASSO: R² = 0.933 (13/19 features)
  • RF Spatial CV: R² = 0.611

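LASSO captures global linear structure; RF captures local spatial heterogeneity. The performance gap reflects the 16 discrete regional strata in the data: the RF score comes from leave-one-region-out CV, in which each region is predicted from districts in the other 15.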

Conclusions
  • Spatial clustering is substantial — district-level HIV prevalence exhibits strong neighbourhood dependence (I=0.907)
  • Boundary-transition effects dominate LISA patterns (HL/LH), suggesting regional-scale epidemic geography rather than focal hotspots
  • VCT uptake and gender norms are the two most powerful predictors across all models
  • Ecological interpretation required — district-level associations do not imply individual-level causation
Policy Implications
  • Targeted VCT scale-up in low-testing districts (Northern, Savannah, North East regions)
  • Gender-based violence programmes in districts with high wife-beating acceptance (>50%)
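  • Education investment: female secondary education is strongly associated with district HIV prevalence (r=0.81); the sign is positive at the ecological level, so a protective effect cannot be inferred from these data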
  • Cross-boundary surveillance — spillover effects (HL/LH clusters) demand coordinated inter-district response
STROBE: ecological cross-sectional study; reporting follows the STROBE guidelines for observational studies.