
Figure 2: Linkage summary from input to output (using the original HIV Cohort of 4.2 million clients on ART in 2018 as an example)
NHLS Source Data and the Linkage Challenge
The NHLS database contains hundreds of millions of laboratory records collected since 2004 from more than 4,000 public health facilities nationwide. It includes laboratory tests commonly used in care for HIV (CD4 counts, viral loads), TB (diagnostic and drug resistance tests), diabetes (HbA1c, blood glucose), and other lab-monitored conditions. These records were not created for longitudinal analysis: patient identifiers are inconsistently recorded, names may be misspelled or change over time, dates of birth may be missing or inaccurate, and individuals frequently move between facilities. As a result, deterministic linkage, which requires exact agreement on identifiers, is not workable at national scale: it would substantially under-link records that belong to the same person.
To address this, researchers developed a probabilistic, graph-based record linkage algorithm specifically tailored to the structure, scale, and data quality of NHLS laboratory records.
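To make the idea concrete, the following is a minimal sketch of probabilistic, graph-based linkage in general, not the NHLS algorithm itself, whose steps are not detailed in this section. It assumes invented toy records, illustrative field weights, and an arbitrary match threshold: each pair of records receives a similarity score, pairs scoring above the threshold become edges in a graph, and each connected component is treated as one person.

```python
from difflib import SequenceMatcher
from itertools import combinations

# Invented toy records (not NHLS data): (record_id, first_name, surname, date_of_birth)
RECORDS = [
    ("r1", "Thandiwe", "Nkosi", "1985-03-12"),
    ("r2", "Thandiwe", "Nkosii", "1985-03-12"),  # misspelled surname
    ("r3", "Thandi", "Nkosi", None),             # shortened name, missing date of birth
    ("r4", "Sipho", "Dlamini", "1990-07-01"),
]


def name_similarity(a, b):
    """Fuzzy string similarity in [0, 1], tolerant of spelling variation."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def dob_agreement(a, b):
    """1.0 if equal, 0.0 if different, 0.5 (neutral) if either value is missing."""
    if a is None or b is None:
        return 0.5
    return 1.0 if a == b else 0.0


def match_score(x, y):
    # Weights are illustrative assumptions, not values from the NHLS algorithm.
    return (0.3 * name_similarity(x[1], y[1])
            + 0.4 * name_similarity(x[2], y[2])
            + 0.3 * dob_agreement(x[3], y[3]))


def link_records(records, threshold=0.75):
    """Group record IDs into clusters: connected components of the match graph."""
    parent = {r[0]: r[0] for r in records}

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Every pair scoring at or above the threshold is an edge; merge its endpoints.
    for x, y in combinations(records, 2):
        if match_score(x, y) >= threshold:
            parent[find(x[0])] = find(y[0])

    clusters = {}
    for r in records:
        clusters.setdefault(find(r[0]), set()).add(r[0])
    return sorted(sorted(c) for c in clusters.values())


print(link_records(RECORDS))
```

On these toy records, the three spelling variants fall into one cluster and the fourth record stays separate, whereas exact matching on all three fields would treat the four records as four different people. Comparing every pair is only practical for small examples; at the scale of hundreds of millions of records, candidate pairs would first need to be narrowed, for instance by blocking on a shared field.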
Four-Step Record Linkage Algorithm
Cohort Construction and Strengths
A Learning Health System
By transforming routine laboratory data into a longitudinal, de-identified cohort, the NHLS National Lab Cohorts advance FAIR data principles:
- the data are findable,
- accessible under appropriate governance by NHLS’s Academic Affairs, Research, and Quality Assurance unit,
- interoperable across projects, and
- reusable for multiple research and policy purposes.
The NHLS cohorts support a learning health system, enabling continuous monitoring of programme performance, rapid evaluation of policy changes, and scientific evidence generation at national scale.