Abstract
OBJECTIVE: Passive sensing applications are limited by their inability to determine who is using a device. This is a critical concern in research on children's mobile device use, where devices are often shared between siblings or between a child and a parent. Our previous work leveraged behavioral biometrics to identify a target child user; however, it is unknown what type of training data is needed for optimal model performance. This study evaluated model performance across different characteristics of training data.
METHODS: Thirty-six children (age 11.3 ± 0.9 years, 56% female) used an iPad to watch a self-selected video or play a self-selected game for 10 min while lying down and for another 5 min while sitting. The SensorLog application captured iPad accelerometer and gyroscope data while the child interacted with the device. Machine learning algorithms, including Neural Network (NN), Random Forest (RF), k-Nearest Neighbors (k-NN), and SwipeFormer, were applied to determine which aspects of the training data matter most for model performance. The aspects evaluated were (1) the length of the training data (i.e., seconds of training data), (2) the user position (i.e., sitting, lying down), and (3) the time proximity between training and testing data. F1 score was used to evaluate model performance.
RESULTS: The SwipeFormer F1 scores were lowest when the training data was farther in time from the test data (0 when the training data was 11 min away from the test data) and highest when the training data was close to the test data (0.91 when the training data was the minute preceding the test data). The SwipeFormer F1 scores were highest when the training and test positions matched (0.97 when predicting the user lying down from lying-down data, 0.94 when predicting the user sitting from sitting data) and lowest when they differed (0 when predicting the user sitting from lying-down data, and 0 when predicting the user lying down from sitting data). The length of training data had little impact on performance: the SwipeFormer F1 score was 0.91 when training on one minute of data and 0.94 when training on twelve minutes of data.
DISCUSSION: Because researchers will likely need to identify users at time points different from those of their training data, future research should focus on improving model performance for identifying users independent of the time proximity between training and test data.
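
The time-proximity comparison and the F1 metric described above can be illustrated with a minimal sketch. This is not the study's pipeline: the function names `f1_score` and `split_by_time_gap`, the per-sample minute labels, and the binary encoding (1 = target child, 0 = any other user) are all assumptions made for illustration only.

```python
from typing import List, Sequence, Tuple


def f1_score(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Binary F1 for the positive class (assumed encoding: 1 = target child)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        # No true positives: F1 is defined here as 0.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def split_by_time_gap(
    minutes: Sequence[int], test_minute: int, gap: int
) -> Tuple[List[int], List[int]]:
    """Return (train_indices, test_indices) for one time-proximity condition.

    `minutes[i]` is the (hypothetical) minute of the session in which sample i
    was recorded. Training uses the single minute that lies `gap` minutes
    before `test_minute`; gap=1 is the minute immediately preceding the test
    data.
    """
    train_minute = test_minute - gap
    train_idx = [i for i, m in enumerate(minutes) if m == train_minute]
    test_idx = [i for i, m in enumerate(minutes) if m == test_minute]
    return train_idx, test_idx


# Toy usage: six samples recorded across minutes 0, 1, and 2.
sample_minutes = [0, 0, 1, 1, 2, 2]
near_train, near_test = split_by_time_gap(sample_minutes, test_minute=2, gap=1)
far_train, far_test = split_by_time_gap(sample_minutes, test_minute=2, gap=2)
```

Under these assumptions, a model would be fit on the samples selected by the training indices and scored with `f1_score` on the test indices, once per value of `gap`, to trace how performance changes as the training data moves farther in time from the test data.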