Supervised learning survey

05 Feb 2017

Two datasets were chosen from the UCI Machine Learning Repository: Adult Income (n=48,842) and Default of Credit Card Clients (n=30,000).

The survey covered data cleaning, exploratory data analysis, feature selection, and model training. Five supervised learning algorithms were applied, tuned, and compared on the two classification problems:

  • Decision Trees
  • Artificial Neural Networks
  • Gradient Boosting
  • Support Vector Machines (SVM)
  • k-Nearest Neighbors
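
As a rough illustration of how these five algorithm families can be compared side by side, here is a minimal sketch using scikit-learn. It is not the original study code: the data is a synthetic stand-in generated with make_classification, the hyperparameters are untuned placeholder values, and the feature scaling step for the neural network, SVM, and kNN models is an assumption on my part.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption); the survey used the UCI datasets.
X, y = make_classification(n_samples=400, n_features=10, random_state=0)

# Placeholder settings (assumption), not the tuned values from the survey.
# Scale-sensitive models are wrapped in a pipeline with StandardScaler.
models = {
    "Decision Trees": DecisionTreeClassifier(max_depth=5, random_state=0),
    "Neural Networks": make_pipeline(
        StandardScaler(), MLPClassifier(max_iter=1000, random_state=0)
    ),
    "Gradient Boost": GradientBoostingClassifier(random_state=0),
    "SVM": make_pipeline(StandardScaler(), SVC()),
    "kNN": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
}

# Mean accuracy over 3 cross-validation folds for each model.
scores = {
    name: cross_val_score(model, X, y, cv=3, scoring="accuracy").mean()
    for name, model in models.items()
}

for name, score in scores.items():
    print(f"{name:16s}{score:.3f}")
```

The numbers printed by this sketch describe the synthetic data only and are not comparable to the accuracies reported in this post.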

Training was conducted with at least 3-fold cross-validation, with 20% of the data withheld for testing. Each algorithm was tuned with a grid search across 2-5 hyperparameters, depending on the algorithm. The experimental results were on par with accepted benchmarks; the accuracies are shown below:

                    Adult Income        Credit Card Default
    Decision Trees  0.852               0.837
    Neural Networks 0.843               0.844 *best*
    Gradient Boost  0.868 *best*        0.817
    SVM             0.846               0.819
    kNN             0.844               0.793

The study was conducted in Python, using scikit-learn (sklearn).
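
The tuning procedure described above (a 20% hold-out test set, 3-fold cross-validation, and a grid search over a few hyperparameters) could look roughly like the following in scikit-learn. This is a hedged sketch rather than the original code: the synthetic data, the choice of gradient boosting as the example model, the stratified split, and the specific grid values are all assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (assumption); the survey used the UCI datasets.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Withhold 20% of the data for final testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Hypothetical grid over three hyperparameters (the survey searched 2-5).
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [2, 3],
    "learning_rate": [0.05, 0.1],
}

# Exhaustive grid search, scoring each combination with 3-fold CV accuracy.
search = GridSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_grid,
    cv=3,
    scoring="accuracy",
)
search.fit(X_train, y_train)

# Accuracy of the best configuration (refit on all training data)
# on the withheld test set.
test_accuracy = search.score(X_test, y_test)
print(search.best_params_, round(test_accuracy, 3))
```

Keeping the test split out of the grid search means the final accuracy is measured on data that played no part in choosing the hyperparameters.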

For proprietary, privacy, or academic reasons, the source code is not publicly available, but I am happy to provide it upon request.


I'm a software engineer, proud veteran, and even prouder husband and father. I live and work in Silicon Valley, and love to learn about learning (EdTech), ML/AI/RL, and cybersecurity.