Escalating the Quantity of Medical Data Using CTGAN: Diabetes Dataset
by Jihyung Kim
Abstract – The number of diabetes diagnoses is increasing sharply in the United States. Diabetes is a lifelong disease that can cause serious symptoms such as blurred vision. Collecting medical data requires patient consent and involves complicated procedures, which makes large datasets hard to obtain. The Conditional Tabular Generative Adversarial Network (CTGAN) can help to solve this problem. A GAN is a deep learning model that generates synthetic data. CTGAN follows essentially the same procedure as a GAN but is designed for tabular data. We evaluated how closely the synthetic data matched the real data by measuring the accuracy of several machine learning models and a deep neural network. Logistic Regression (LR), Decision Tree (DT), KNN, Gradient Boosting (GB), Light Gradient Boosting Machine (LGBM), Support Vector Classifier (SVC), Gaussian, and Deep Neural Network (DNN) got 40.55%, 38.1%, 44.5%, 39%, 35.35%, 44.05%, 53.65%, 39.25%, and 34.2%, respectively. We then applied GridSearch to two models, Random Forest (RF) and LGBM. RF achieved slightly higher accuracy, 77.85%, compared with 76.65% for LGBM. Next, we created a new dataset by combining the synthetic data with a small amount of real data. When we compared this new dataset with the purely real data, the accuracy scores of all models roughly doubled, to 100%, 77.3%, 88.5%, 93.9%, 93.05%, 88%, 93.9%, 75.05%, and 65.8%, respectively. Although we had to modify the model in order to reach a satisfactory result, CTGAN can become a very significant model for researchers who need a large amount of data.
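The evaluation idea above can be illustrated with a minimal sketch. It assumes one common protocol, which the abstract does not spell out: fit a classifier on synthetic rows, optionally mixed with a fraction of real rows, and score it on real rows. The helper names (`knn_predict`, `accuracy`, `mix`), the plain-Python KNN, and the toy data are all hypothetical stand-ins. The "synthetic" rows here are random noise around the class centres, not CTGAN output, and the printed numbers have no relation to the results reported above.

```python
import random


def knn_predict(train_X, train_y, x, k=3):
    """Predict the label of x by majority vote among its k nearest training rows."""
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(row, x)), label)
        for row, label in zip(train_X, train_y)
    )
    votes = [label for _, label in dists[:k]]
    return max(set(votes), key=votes.count)


def accuracy(train_X, train_y, test_X, test_y, k=3):
    """Fraction of test rows whose predicted label matches the true label."""
    hits = sum(
        knn_predict(train_X, train_y, x, k) == y for x, y in zip(test_X, test_y)
    )
    return hits / len(test_y)


def mix(synth_X, synth_y, real_X, real_y, real_fraction, rng):
    """Return all synthetic rows plus a random fraction of the real rows."""
    n = int(len(real_X) * real_fraction)
    idx = rng.sample(range(len(real_X)), n)
    return (
        synth_X + [real_X[i] for i in idx],
        synth_y + [real_y[i] for i in idx],
    )


def toy_rows(n, spread, rng):
    """Toy two-class data: class 0 centred at 0.0, class 1 centred at 1.0."""
    X, y = [], []
    for i in range(n):
        label = i % 2
        X.append([rng.gauss(float(label), spread)])
        y.append(label)
    return X, y


if __name__ == "__main__":
    rng = random.Random(0)
    real_X, real_y = toy_rows(200, 0.3, rng)    # stand-in for the real dataset
    synth_X, synth_y = toy_rows(200, 0.9, rng)  # stand-in for generated rows

    # Hold out half of the real rows purely for testing.
    test_X, test_y = real_X[:100], real_y[:100]
    pool_X, pool_y = real_X[100:], real_y[100:]

    print("synthetic only:", accuracy(synth_X, synth_y, test_X, test_y))
    mixed_X, mixed_y = mix(synth_X, synth_y, pool_X, pool_y, 0.2, rng)
    print("synthetic + 20% real:", accuracy(mixed_X, mixed_y, test_X, test_y))
```

In a real experiment the toy rows would be replaced by the diabetes table and by samples drawn from a trained CTGAN, and the KNN by the classifiers listed above. Keeping the test rows disjoint from any real rows mixed into the training set avoids inflating the score.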