K-nearest neighbor algorithm for imputing missing longitudinal prenatal alcohol data

Affiliation
Department of Psychiatry, Columbia University Irving Medical Center, New York, NY, United States
Sania, Ayesha;
Affiliation
Department of Psychiatry, Columbia University Irving Medical Center, New York, NY, United States
Pini, Nicolò;
Affiliation
Research Triangle Institute, Research Triangle Park, Durham, NC, United States
Nelson, Morgan E.;
Affiliation
Department of Psychiatry, Columbia University Irving Medical Center, New York, NY, United States
Myers, Michael M.;
Affiliation
Department of Child and Adolescent Psychiatry, NYU Grossman School of Medicine, New York, NY, United States
Shuffrey, Lauren C.;
Affiliation
Department of Psychiatry, Columbia University Irving Medical Center, New York, NY, United States
Lucchini, Maristella;
Affiliation
Center for Pediatric and Community Research, Avera Health, Sioux Falls, SD, United States
Elliott, Amy J.;
Affiliation
Department of Obstetrics and Gynecology, Faculty of Medicine and Health Science, Stellenbosch University, Cape Town, Western Cape, South Africa
Odendaal, Hein J.;
Affiliation
Department of Psychiatry, Columbia University Irving Medical Center, New York, NY, United States
Fifer, William P.

Aims: The objective of this study is to illustrate the application of a machine learning algorithm, K-Nearest Neighbor (k-NN), to impute missing alcohol data in a prospective study among pregnant women.

Methods: We used data from the Safe Passage study (n = 11,083). Daily alcohol consumption for the last reported drinking day and the 30 days prior was recorded using the Timeline Followback method, which generated a variable amount of missing data per participant. Of the 3.2 million person-days of observation, data were missing for 0.36 million (11.4%). With k-NN, imputed values were weighted by distance and matched for the day of the week. Since participants with no missing days were not comparable to those with missing data, segments of non-missing data from all participants were included as a reference. Validation was done after randomly deleting data for 5–15 consecutive days from the first trimester.

Results: We found that data from 5 nearest neighbors (i.e., K = 5) and segments of 55 days provided imputed values with the lowest imputation error. After deleting data segments from the first-trimester data set with no missing days, there was no difference between actual and predicted values for 64% of deleted segments. For 31% of the segments, imputed data were within ±1 drink/day of the actual values. Imputation accuracy varied by study site because of differences in the magnitude of drinking and the proportion of missing data.

Conclusion: k-NN can be used to impute missing data from longitudinal studies of alcohol use during pregnancy with high accuracy.
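The approach described in the Methods can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not state the distance metric or the weighting formula, so Euclidean distance over observed days and inverse-distance weights are assumptions here, and the function names `reference_segments` and `knn_impute` are hypothetical. Day-of-week matching is approximated by only using reference windows that start on the same weekday as the target segment, so that each position in every segment falls on the same weekday.

```python
import math


def reference_segments(series, first_weekday, start_weekday, length):
    """Return fully observed windows of `length` days starting on `start_weekday`.

    series: daily drink counts, with None marking a missing day.
    first_weekday: weekday (0-6) of series[0].
    """
    segments = []
    for s in range(len(series) - length + 1):
        if (first_weekday + s) % 7 != start_weekday:
            continue  # keep only windows aligned to the target's weekday
        window = series[s:s + length]
        if all(v is not None for v in window):
            segments.append(window)
    return segments


def knn_impute(target, references, k=5):
    """Fill None entries in `target` from the k nearest reference segments.

    Assumptions of this sketch: every reference has the same length as
    `target`, is fully observed, and is weekday-aligned with it.
    """
    observed = [i for i, v in enumerate(target) if v is not None]
    missing = [i for i, v in enumerate(target) if v is None]
    if not observed or not references:
        raise ValueError("need observed days and at least one reference")

    # Euclidean distance computed over the target's observed days only.
    scored = []
    for ref in references:
        d = math.sqrt(sum((target[i] - ref[i]) ** 2 for i in observed))
        scored.append((d, ref))
    scored.sort(key=lambda pair: pair[0])
    neighbors = scored[:k]

    # Inverse-distance weights; eps avoids division by zero on exact matches.
    eps = 1e-9
    weights = [1.0 / (d + eps) for d, _ in neighbors]
    total = sum(weights)

    filled = list(target)
    for i in missing:
        filled[i] = sum(w * ref[i] for w, (_, ref) in zip(weights, neighbors)) / total
    return filled
```

With the settings reported in the Results, a caller would build reference windows with `length=55` and call `knn_impute(..., k=5)`. Rounding the weighted average to whole drinks is left to the caller, since the abstract does not say how fractional values were handled.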

Rights

License Holder: Copyright © 2025 Sania, Pini, Nelson, Myers, Shuffrey, Lucchini, Elliott, Odendaal and Fifer.

Use and reproduction: