The IYRC
  • Home
  • IYRC Summer Program 2022
    • Summer Program
  • IYRC Fall Conference 2022
    • Submission Guidelines
    • Presentation Guidelines
    • Registration
  • Past Conferences
    • IYRC Fall 2021 >
      • Authors
      • IYRC Fall 2021 Proceedings
      • Schedule
      • Keynote Speakers
    • IYRC Spring 2021 >
      • Authors
      • Scholarship
      • Schedule
      • Keynote Speakers
    • IYRC 2020 >
      • IYRC 2020 Proceedings
      • Guest Speakers
      • Events >
        • HCJI Panel
        • Guest Speaker - Molly Edwards
        • Guest Speaker - Paul Lewis
    • IYRC 2019 Proceedings
    • IYRC 2018 Proceedings
  • About
    • Programs
    • FAQ
    • Partnerships
    • Who we are
    • Gallery
    • Testimonials
  • Join our team
  • Contact Us
  • Product

Escalating the Quantity of Medical Data Using CTGAN: Diabetes Dataset

by Jihyung Kim
Category: STEM
Abstract – The number of diabetes diagnoses is increasing sharply in the United States. Diabetes is a lifelong disease that can cause serious symptoms such as blurred vision. Collecting medical data requires consent forms and complicated procedures, which makes large datasets hard to obtain. The Conditional Tabular Generative Adversarial Network (CTGAN) can help solve this problem. A GAN is a deep learning model that generates synthetic data; CTGAN follows a very similar procedure but is designed for tabular data. We checked how closely the fake data matched the real data using various machine learning and deep learning models. Logistic Regression (LR), Decision Tree (DT), KNN, Gradient Boosting (GB), Light Gradient Boosting Machine (LGBM), Support Vector Classifier (SVC), Gaussian, and Deep Neural Network (DNN) scored 40.55%, 38.1%, 44.5%, 39%, 35.35%, 44.05%, 53.65%, 39.25%, and 34.2%, respectively. We then applied GridSearch to two models, Random Forest (RF) and LGBM: RF was slightly more accurate at 77.85%, while LGBM reached 76.65%. Next, we created a new dataset combining the fake data with a small amount of real data. When we compared the new dataset with the pure real data, the accuracy scores of all models almost doubled, to 100%, 77.3%, 88.5%, 93.9%, 93.05%, 88%, 93.9%, 75.05%, and 65.8%, respectively. Although we had to modify the model to reach a satisfactory result, CTGAN can become a very significant tool for researchers who need a large amount of data.
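The evaluation idea in the abstract (train a classifier on generated rows, score it on real rows, then repeat after adding a small amount of real data) can be illustrated with a minimal sketch. This is not the paper's code: it uses only the Python standard library, a hand-written k-nearest-neighbour classifier, and hypothetical toy rows in place of the diabetes dataset and of actual CTGAN output. The held-out split of the real rows, the helper names, and every parameter value are assumptions made for illustration.

```python
import math
import random
from collections import Counter


def knn_predict(train_X, train_y, x, k=3):
    """Majority vote among the k training rows closest to x."""
    order = sorted(range(len(train_X)), key=lambda i: math.dist(train_X[i], x))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def evaluate_on_real(train_X, train_y, test_X, test_y, k=3):
    """Fit on the given training rows and score on held-out real rows."""
    preds = [knn_predict(train_X, train_y, x, k) for x in test_X]
    return accuracy(test_y, preds)


def add_real_rows(synth_X, synth_y, real_X, real_y, n_real, rng):
    """Return the synthetic rows plus n_real randomly chosen real rows."""
    idx = rng.sample(range(len(real_X)), n_real)
    return (synth_X + [real_X[i] for i in idx],
            synth_y + [real_y[i] for i in idx])


def toy_rows(n, shift, rng):
    """Hypothetical two-feature rows; label 1 is centred higher than label 0."""
    X, y = [], []
    for _ in range(n):
        label = rng.randint(0, 1)
        centre = 2.0 * label + shift
        X.append([rng.gauss(centre, 1.0), rng.gauss(centre, 1.0)])
        y.append(label)
    return X, y


rng = random.Random(0)

# Toy "real" data, split into a pool for augmentation and a held-out test set
# (assumption: the abstract does not say how the real data was split).
real_pool_X, real_pool_y = toy_rows(100, shift=0.0, rng=rng)
real_test_X, real_test_y = toy_rows(200, shift=0.0, rng=rng)

# Stand-in for generator output: same shape, but deliberately shifted to mimic
# an imperfect synthetic sample. A real pipeline would draw these from CTGAN.
synth_X, synth_y = toy_rows(200, shift=1.0, rng=rng)

synthetic_only = evaluate_on_real(synth_X, synth_y, real_test_X, real_test_y)

mixed_X, mixed_y = add_real_rows(synth_X, synth_y,
                                 real_pool_X, real_pool_y, n_real=50, rng=rng)
mixed = evaluate_on_real(mixed_X, mixed_y, real_test_X, real_test_y)

print(f"synthetic only: {synthetic_only:.3f}, synthetic + real: {mixed:.3f}")
```

Scoring on real rows that never enter the training set keeps the two accuracies comparable; if the added real rows were also used for testing, the second score would be inflated by overlap rather than by better data.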
Contact us: info@the-iyrc.org