Times: 2026 Mar 28 from 10:45AM to 12:00PM (Central Time (US & Canada))
Abstract:
Bioinformatics is an important tool for genomics research, lowering costs and shortening the wait for results. This research uses machine learning to predict the pathogenicity of variants of uncertain significance (VUS) for the alpha-1 antitrypsin (AAT) protein encoded by the SERPINA1 gene. The data were collected from the Ensembl database, and Python was used to clean the data and to build and train the predictive models. The dataset contains 196 more benign observations than pathogenic ones, so we removed 196 benign observations to balance the training set. However, we used the full data set of 484 observations for testing, to determine accuracy over all known missense swaps associated with SERPINA1. Four different models were then used for prediction, with all hyperparameters (aside from those of the neural network) found using the hold-out method, with a generalization gap of 0.015. The first model is a neural network built with PyTorch. It is a multilayer perceptron (MLP) that uses known missense swaps, their pathogenicity, and their MutationAssessor scores to predict the pathogenicity of all VUS. The single hidden layer applies an affine transformation followed by a rectified linear unit (ReLU) activation function. The output layer applies another affine transformation followed by a sigmoid activation function, which maps into the open interval $(0,1)$; that value is treated as the parameter of a Bernoulli distribution to predict benign $(0)$ or pathogenic $(1)$. The model was optimized using stochastic gradient descent with a learning rate of 0.01, a momentum of 0.9, and a batch size of 32. It was trained by backpropagation for 10,000 epochs with a log-loss (binary cross-entropy) loss function. The second model is $K$-nearest neighbors (KNN), implemented with scikit-learn, with $K = 10$. The third model is a maximum-depth decision tree, which also uses the hold-out method. 
In this model, we used Gini impurity as the splitting criterion. Our last model is a random forest with a maximum depth of 4 and $n = 4$ estimators. We built the random forest using the bootstrap method, in which each decision tree is trained on a random sample of the training data. These four models classified a given variant as benign or pathogenic with accuracies of $86\%$, $85\%$, $89\%$, and $90\%$, respectively. These findings contribute to the understanding of how pathogenicity scores influence the clinical significance of missense swaps related to SERPINA1 and AAT, which is associated with non-cystic fibrosis bronchiectasis, and they underscore the importance of further investigation of these variants for improved clinical diagnostics.
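
The forward pass of the MLP described in the abstract (affine transformation, ReLU, a second affine transformation, then a sigmoid) can be sketched in a few lines. This is a minimal illustration only, written in plain Python rather than PyTorch so that it has no dependencies. The hidden-layer width of 2, the weight values, and the 0.5 decision threshold are all hypothetical and are not taken from the study; the single input stands in for a variant's pathogenicity score.

```python
import math


def relu(x):
    """Rectified linear unit: passes positive values, zeroes out the rest."""
    return max(0.0, x)


def sigmoid(z):
    """Maps any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def mlp_forward(score, w_hidden, b_hidden, w_out, b_out):
    """One-hidden-layer MLP on a single scalar input feature.

    Hidden layer: affine transformation followed by ReLU.
    Output layer: affine transformation followed by a sigmoid.
    Returns the estimated probability of the pathogenic class (1).
    """
    hidden = [relu(w * score + b) for w, b in zip(w_hidden, b_hidden)]
    z = sum(w * h for w, h in zip(w_out, hidden)) + b_out
    return sigmoid(z)


def predict_label(probability, threshold=0.5):
    """Turn a probability into benign (0) or pathogenic (1).

    The 0.5 threshold is an assumption made for this sketch.
    """
    return 1 if probability >= threshold else 0


# Hypothetical, hand-picked parameters (NOT trained values from the study).
W_HIDDEN = [1.0, -1.0]
B_HIDDEN = [0.0, 0.0]
W_OUT = [2.0, -2.0]
B_OUT = -1.0

for score in (2.0, -2.0):
    p = mlp_forward(score, W_HIDDEN, B_HIDDEN, W_OUT, B_OUT)
    print(f"score={score:+.1f}  p(pathogenic)={p:.4f}  label={predict_label(p)}")
```

In a trained model the weights and biases would be learned by stochastic gradient descent on the log-loss, as the abstract describes, rather than set by hand as they are here.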