Supervised learning survey

05 Feb 2017

Two datasets were chosen from the UCI Machine Learning Repository: Adult Income Data (n=48,842) and Default of Credit Card Clients (n=30,000).

The survey involved data munging/cleaning, exploratory data analysis, feature selection, and application of machine learning. Five supervised learning algorithms were applied, optimized, and compared on the two classification problems:

  • Decision Trees
  • Artificial Neural Networks
  • Gradient Boosting
  • Support Vector Machines (SVM)
  • k-Nearest Neighbors
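
For illustration, the five algorithms above map naturally onto sklearn estimators. The snippet below is only a sketch: the specific classes, the `make_models` helper, and the default settings are assumptions, not the configuration used in the study.

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier


def make_models(seed=0):
    """Return one untuned estimator per algorithm (hypothetical helper)."""
    return {
        "Decision Trees": DecisionTreeClassifier(random_state=seed),
        "Neural Networks": MLPClassifier(random_state=seed),
        "Gradient Boost": GradientBoostingClassifier(random_state=seed),
        "SVM": SVC(random_state=seed),
        "kNN": KNeighborsClassifier(),  # kNN has no random component
    }


models = make_models()
```

Keeping the estimators in one dictionary makes it easy to run the same training and scoring loop over every algorithm.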

Training was conducted with at least 3-fold cross-validation, with 20% of each dataset withheld for testing. Each algorithm was optimized with a grid search across 2-5 hyperparameters, depending on the algorithm. The experimental results were on par with accepted benchmarks, and the accuracies are shown below:

                    Adult Income        Credit Card Default
    Decision Trees  0.852               0.837
    Neural Networks 0.843               0.844 *best*
    Gradient Boost  0.868 *best*        0.817
    SVM             0.846               0.819
    kNN             0.844               0.793

The study was conducted in Python, using scikit-learn (sklearn).
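
Since the source code is not published here, the following is a minimal sketch of what the procedure described above (20% holdout, 3-fold cross-validation, grid search) might look like in sklearn. The synthetic data from `make_classification`, the choice of a decision tree, and the parameter grid are all assumptions made for illustration; they are not the study's actual data or settings.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption); the study used the UCI datasets.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Withhold 20% of the data for final testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Hypothetical two-parameter grid; the study searched 2-5 parameters per algorithm.
param_grid = {"max_depth": [3, 5, 10], "min_samples_leaf": [1, 5, 20]}

# 3-fold cross-validated grid search on the training portion only.
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid,
    cv=3,
    scoring="accuracy",
)
search.fit(X_train, y_train)

# Accuracy of the best refit model on the withheld 20%.
test_accuracy = search.score(X_test, y_test)
print(search.best_params_, round(test_accuracy, 3))
```

The grid search only ever sees the training split, so the withheld 20% gives an estimate of accuracy that is not biased by the tuning.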

The source code is not publicly available, but can be provided upon request.

Me

I'm a software engineer, proud veteran, and even prouder husband and father. I live and work in Silicon Valley, and love to learn about learning, machine learning, and cybersecurity.