
Student Research at USC

Making Something Out of Nothing: Data Imputation Techniques

by Shaina Sta. Cruz

This study on neurodevelopment is being conducted in Dr. Kristi Clark’s lab at USC, with mentorship from Dr. Farshid Sepehrband.

Understanding the nuances of how the brain develops has far-reaching implications for research and clinical practice. Neurodevelopment has been examined using methods such as MRI and fMRI scans, and research in the field furthers our understanding of the growth of the brain and central nervous system. Advances in computer science and data storage allow researchers to collect, store, and ultimately analyze large neuroimaging datasets. The Philadelphia Neurodevelopmental Cohort (PNC) study, with over 1,000 participants imaged, is one example of these efforts.

With the abundance of data in the biomedical field comes the issue of missing data. Missing data is a common problem in clinical, biomedical, and neuroimaging research because it is difficult to ensure that complete data are collected from every individual. The gaps can result from anything from measurement error to human error, and it would be too costly and inefficient to throw away cases with missing data or to retrieve the missing items manually. Instead, researchers turn to statistical techniques, specifically data imputation, to address the issue. Data imputation comprises techniques that replace missing data with substituted values computed from other available information. One example is simply replacing missing data with the mean or median of the variable. Though simple, this technique may not accurately represent the true distribution of the variable. Since humans are similar neuroanatomically, researchers can use the existing information in big datasets to estimate the missing information; this is accomplished through more advanced, multivariate techniques.
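Mean and median imputation can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not the study's code; the function name `impute_column` and the thickness values are hypothetical.

```python
import math
import statistics

def impute_column(values, strategy="mean"):
    """Replace NaN entries with the mean or median of the observed entries."""
    observed = [v for v in values if not math.isnan(v)]
    if not observed:
        raise ValueError("cannot impute a column with no observed values")
    if strategy == "mean":
        fill = statistics.fmean(observed)
    elif strategy == "median":
        fill = statistics.median(observed)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [fill if math.isnan(v) else v for v in values]

# Hypothetical cortical thickness values (mm) for one brain region;
# NaN marks a participant whose measurement is missing.
thickness = [2.5, math.nan, 3.0, 2.0, math.nan, 4.5]

mean_filled = impute_column(thickness, "mean")
median_filled = impute_column(thickness, "median")
print(mean_filled)
print(median_filled)
```

Every missing entry in a column receives the same fill value, so the imputed column is less spread out than the real measurements would be. That is one way this approach can misrepresent the variable's true distribution.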

The purpose of the study is to compare more advanced multivariate techniques with simpler approaches when applied to neuroimaging data, specifically participants’ cortical thickness from the PNC study. The more advanced techniques comprise machine learning and matrix completion, and the simpler techniques include mean and median imputation. To create artificial datasets with missing values, percentages of a complete dataset were removed at random. The techniques were then used to impute the missing values, and comparing the imputed values with the original values yielded an error rate for each technique. Initial findings revealed that the machine learning technique recovered missing values with a mean error of around 0.1, and the recovered values correlated significantly with the ground truth (p < .0001). Future steps include identifying additional imputation methods and comparing the techniques’ errors when predicting other key features of the brain, such as volume or surface area.
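The evaluation procedure described above (remove values at random, impute them, compare against the originals) can be sketched as follows. This is an illustration under stated assumptions, not the study's pipeline: the data are synthetic rather than PNC measurements, all function names are hypothetical, and a simple k-nearest-neighbor imputer stands in for the multivariate techniques, since the study's machine learning and matrix completion methods are not specified here.

```python
import math
import random
import statistics

def mask_at_random(data, fraction, rng):
    """Hide `fraction` of the cells (set to NaN); return the copy and hidden cells."""
    cells = [(i, j) for i in range(len(data)) for j in range(len(data[0]))]
    hidden = rng.sample(cells, round(fraction * len(cells)))
    masked = [row[:] for row in data]
    for i, j in hidden:
        masked[i][j] = math.nan
    return masked, hidden

def impute_mean(masked):
    """Simple baseline: fill each column with the mean of its observed values.
    Assumes every column keeps at least one observed value."""
    means = [
        statistics.fmean(row[j] for row in masked if not math.isnan(row[j]))
        for j in range(len(masked[0]))
    ]
    return [[means[j] if math.isnan(v) else v for j, v in enumerate(row)]
            for row in masked]

def impute_knn(masked, k=3):
    """Multivariate stand-in: average the k most similar rows that have the value."""
    filled = impute_mean(masked)  # fallback when no comparable row exists
    for i, row in enumerate(masked):
        for j, v in enumerate(row):
            if not math.isnan(v):
                continue
            candidates = []
            for other in masked:
                if other is row or math.isnan(other[j]):
                    continue
                # Mean squared difference over columns observed in both rows
                shared = [(a - b) ** 2 for a, b in zip(row, other)
                          if not (math.isnan(a) or math.isnan(b))]
                if shared:
                    candidates.append((statistics.fmean(shared), other[j]))
            if candidates:
                candidates.sort(key=lambda c: c[0])
                filled[i][j] = statistics.fmean(val for _, val in candidates[:k])
    return filled

def mean_abs_error(truth, imputed, hidden):
    """Average absolute difference between true and imputed values at hidden cells."""
    return statistics.fmean(abs(truth[i][j] - imputed[i][j]) for i, j in hidden)

# Synthetic "complete" dataset: 60 participants x 8 regions. Each participant
# has a shared offset, so regions are correlated within a participant.
rng = random.Random(0)
truth = []
for _ in range(60):
    offset = rng.gauss(0.0, 0.3)
    truth.append([2.5 + 0.1 * j + offset + rng.gauss(0.0, 0.05) for j in range(8)])

results = {}
for fraction in (0.05, 0.10, 0.20):
    masked, hidden = mask_at_random(truth, fraction, rng)
    results[fraction] = (
        mean_abs_error(truth, impute_mean(masked), hidden),
        mean_abs_error(truth, impute_knn(masked), hidden),
    )
    print(fraction, results[fraction])
```

Because the complete dataset is known, the error of each technique can be measured directly at the hidden cells. Repeating the comparison at several missingness percentages shows how each technique degrades as more data are removed.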

 

Big Data Discovery & Diversity

Program Director
Dr. Archana McEligot 
amceligot@fullerton.edu

 

Program Administrative Analyst
Mary Aboud
maboud@fullerton.edu