South Africa’s National Lab Cohorts were constructed from routine laboratory data generated by the National Health Laboratory Service (NHLS), which provides nearly all public-sector lab testing in the country.

The cohorts’ central methodological contribution is a national-scale probabilistic record linkage system, designed to construct longitudinal, patient-level data in the absence of a unique patient identifier.

Figure 2: Linkage summary from input to output (using the original HIV Cohort of 4.2 million clients on ART in 2018 as an example)

NHLS Source Data and the Linkage Challenge

The NHLS database contains hundreds of millions of lab records since 2004 from over 4,000 public health facilities nationwide. The database includes laboratory tests commonly used in care for HIV (CD4 counts, viral loads), TB (diagnostic and drug resistance tests), diabetes (HbA1c, blood glucose), and other lab-monitored conditions. These lab records were not created for longitudinal analysis: patient identifiers are inconsistently recorded, names may be misspelled or change over time, dates of birth may be missing or inaccurate, and individuals frequently move between facilities. As a result, deterministic (exact-match) linkage is unreliable at national scale: it would fail to connect many records that belong to the same person, leading to substantial under-linkage.

To address this, researchers developed a probabilistic, graph-based record linkage algorithm specifically tailored to the structure, scale, and data quality of NHLS laboratory records.
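The general idea behind probabilistic, graph-based linkage can be illustrated with a minimal sketch. This is not the NHLS algorithm, whose details are not given here: the records, field names, agreement weights, and the 0.75 threshold are all assumptions chosen for illustration. Record pairs are scored on fuzzy name agreement and date of birth, pairs above the threshold become edges in a graph, and each connected component is treated as one patient.

```python
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical toy records; field names are illustrative, not the NHLS schema.
records = [
    {"id": 0, "first": "Thandi", "last": "Nkosi", "dob": "1985-03-12"},
    {"id": 1, "first": "Thandy", "last": "Nkosi", "dob": "1985-03-12"},
    {"id": 2, "first": "Sipho", "last": "Dlamini", "dob": "1990-07-01"},
    {"id": 3, "first": "Sipho", "last": "Dhlamini", "dob": None},
    {"id": 4, "first": "Maria", "last": "Botha", "dob": "1978-11-30"},
]

def similarity(a, b):
    """Case-insensitive string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def match_score(r1, r2):
    """Weighted agreement score in [0, 1]. The weights are assumptions."""
    name = 0.5 * similarity(r1["first"], r2["first"]) \
         + 0.5 * similarity(r1["last"], r2["last"])
    if r1["dob"] is None or r2["dob"] is None:
        dob = 0.5  # a missing date of birth is treated as neutral evidence
    else:
        dob = 1.0 if r1["dob"] == r2["dob"] else 0.0
    return 0.6 * name + 0.4 * dob

def cluster(records, threshold=0.75):
    """Link pairs scoring >= threshold, then return connected components."""
    parent = {r["id"]: r["id"] for r in records}

    def find(x):
        # Union-find root lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for r1, r2 in combinations(records, 2):
        if match_score(r1, r2) >= threshold:
            parent[find(r1["id"])] = find(r2["id"])

    groups = {}
    for r in records:
        groups.setdefault(find(r["id"]), []).append(r["id"])
    return sorted(groups.values())

def exact_match_groups(records):
    """Deterministic baseline: group only on identical name and date of birth."""
    groups = {}
    for r in records:
        key = (r["first"], r["last"], r["dob"])
        groups.setdefault(key, []).append(r["id"])
    return sorted(groups.values())

patients = cluster(records)
```

In this toy example the probabilistic approach groups the five records into three patients, while the exact-match baseline keeps all five apart because of a misspelled name and a missing date of birth. Comparing every pair of records is only workable for small examples; a national-scale system would need to restrict comparisons to plausible candidate pairs.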

Four-Step Record Linkage Algorithm

Cohort Construction and Strengths

The linkage process produces a PSEUDONYMIZED UNIQUE PATIENT IDENTIFIER that is consistent across facilities and over time, while ensuring that no directly identifying information is released to researchers.
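One common way to derive a stable pseudonym from an internal patient cluster is a keyed hash. The sketch below assumes this approach purely for illustration; how NHLS actually generates its pseudonymized identifier is not described here, and the function name, key, and 16-character length are hypothetical.

```python
import hashlib
import hmac

def pseudonymize(cluster_id: int, secret_key: bytes) -> str:
    """Map an internal cluster ID to a stable pseudonym.

    Assumption: the secret key is held only by the data custodian, so
    researchers cannot reverse or regenerate the mapping.
    """
    message = str(cluster_id).encode("utf-8")
    digest = hmac.new(secret_key, message, hashlib.sha256).hexdigest()
    return digest[:16]  # truncated for readability; length is illustrative

key = b"example-key-held-by-custodian"  # hypothetical key
pid_a = pseudonymize(1001, key)
pid_b = pseudonymize(1001, key)
pid_c = pseudonymize(1002, key)
```

The same cluster always maps to the same pseudonym, so records stay linked across facilities and over time, while the released identifier carries no name, date of birth, or facility information.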

Once linked, laboratory records were assembled into LONGITUDINAL PATIENT HISTORIES, capturing patient trajectories over time. Because laboratory tests are ordered routinely for diagnosis and treatment monitoring for HIV, TB, and other conditions, these lab tests serve as strong proxies for ENGAGEMENT IN CARE, RETENTION, AND TREATMENT RESPONSE.

A major strength of the linkage approach is its ability to capture SILENT TRANSFERS between facilities: patients who move clinics or provinces remain observable as long as testing continues anywhere within the NHLS system.
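A minimal sketch of how silent transfers become visible once records share a linked identifier is shown below. The patient IDs, facility names, dates, and the function name are hypothetical; the logic simply orders each patient's tests by date and flags consecutive tests done at different facilities.

```python
from collections import defaultdict
from datetime import date

# Hypothetical linked lab records: (pseudonymized ID, test date, facility).
tests = [
    ("p1", date(2017, 1, 10), "Clinic A"),
    ("p1", date(2018, 2, 1), "Clinic B"),
    ("p1", date(2017, 7, 15), "Clinic A"),
    ("p2", date(2017, 3, 5), "Clinic A"),
    ("p2", date(2018, 3, 9), "Clinic A"),
]

def find_transfers(tests):
    """Return (patient, from_facility, to_facility, date_first_seen_at_new)."""
    by_patient = defaultdict(list)
    for pid, test_date, facility in tests:
        by_patient[pid].append((test_date, facility))

    transfers = []
    for pid in sorted(by_patient):
        visits = sorted(by_patient[pid])  # chronological order
        for (_, prev_fac), (curr_date, curr_fac) in zip(visits, visits[1:]):
            if prev_fac != curr_fac:
                transfers.append((pid, prev_fac, curr_fac, curr_date))
    return transfers

transfers = find_transfers(tests)
```

From the perspective of Clinic A alone, patient p1 would simply stop appearing after July 2017 and might be counted as lost to follow-up. With linked records, the same patient is seen continuing care at Clinic B.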

Overall, South Africa’s National Lab Cohorts demonstrate how ADVANCED PROBABILISTIC AND GRAPH-BASED RECORD LINKAGE METHODS can transform routine health system data into a powerful national research and surveillance resource, even in settings without unique patient identifiers.

A Learning Health System

By transforming routine laboratory data into a longitudinal, de-identified cohort, the NHLS National Lab Cohorts advance FAIR data principles:

  • the data are findable,
  • accessible under appropriate governance by NHLS’s Academic Affairs, Research, and Quality Assurance unit,
  • interoperable across projects, and
  • reusable for multiple research and policy purposes.

The NHLS cohorts support a learning health system, enabling continuous monitoring of programme performance, rapid evaluation of policy changes, and scientific evidence generation at national scale.