In my last Mahout post I introduced the Logistic Regression SGD classifier using continuous data. Roy, one of the commenters on that post, asked how to classify other types of data. So I decided to write a quick post on using Mahout’s vector encoders on the bank marketing dataset, pointing to Mahout’s official documentation on this example and on vector encoders in general.
Recap – Logistic Regression SGD classifier
In my previous Mahout post I showed how to use the SGD classifier on the Iris dataset. That data consisted of numerical values: the length and width of the sepals and petals. Based on this information we could classify the type of flower. But what if you want to make predictions from other kinds of data, such as categories, words or free text? This is where vector encoders come in.
Feature Vector Encoders
Feature vector encoders are classes in Mahout that add categorical, word-like or text-like data to a vector by hashing each value to positions in a fixed-length vector. Because the vector length is fixed up front, you do not need to build a dictionary of every possible value first, and you can combine many types of data in one feature vector. Check out the Logistic Regression page of the Mahout documentation under the heading ‘Feature vector encoding’, which describes how to use the different vector encoder classes.
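To make the hashing idea concrete, here is a minimal plain-Java sketch. It is not Mahout's implementation and does not use Mahout's classes. The class name HashedEncoderSketch, the hash function, the number of probes and the field names "job" and "marital" are all assumptions I picked for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Illustrative hashing-trick encoder (hypothetical class, not Mahout's code). */
class HashedEncoderSketch {
    private final String name;  // field name, mixed into the hash key
    private final int probes;   // number of vector slots each value updates

    HashedEncoderSketch(String name, int probes) {
        this.name = name;
        this.probes = probes;
    }

    /** Adds weight to `probes` hashed positions of a fixed-length vector. */
    void addToVector(String value, double weight, double[] vector) {
        for (int probe = 0; probe < probes; probe++) {
            int index = hash(name + ":" + value, probe, vector.length);
            vector[index] += weight;
        }
    }

    /** FNV-1a style hash seeded with the probe number (arbitrary choice for this sketch). */
    static int hash(String key, int seed, int size) {
        int h = 0x811C9DC5 ^ seed;
        for (byte b : key.getBytes(StandardCharsets.UTF_8)) {
            h ^= (b & 0xFF);
            h *= 0x01000193;
        }
        return Math.floorMod(h, size); // always in [0, size)
    }

    public static void main(String[] args) {
        double[] vector = new double[20]; // fixed length, chosen up front
        new HashedEncoderSketch("job", 2).addToVector("technician", 1.0, vector);
        new HashedEncoderSketch("marital", 2).addToVector("married", 1.0, vector);
        System.out.println(Arrays.toString(vector));
    }
}
```

Two different values can land in the same slot. Using more than one probe per value softens the effect of such collisions, at the cost of a denser vector.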
Bank Marketing Example
Mahout 0.9 contains a new example which uses the SGD classifier on UCI’s bank marketing dataset. In this example we predict whether a client will subscribe to a term deposit based on information such as age, job, account balance, date of last contact and so on. See the Bank Marketing Example page of the Mahout documentation, which shows you how to run the example.
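As a rough picture of what such an example does, the sketch below encodes a record with one continuous and one categorical field into a fixed-length vector and trains a tiny logistic regression with SGD on it. This is plain Java, not the Mahout example's code. The class BankSgdSketch, the two records, their labels, the vector size, the age scaling and the learning rate are all made up for illustration.

```java
/** Toy hashed encoding plus logistic-regression SGD (hypothetical, not Mahout's example). */
class BankSgdSketch {
    static final int FEATURES = 16; // fixed vector length (assumption)

    static int slot(String key) {
        return Math.floorMod(key.hashCode(), FEATURES);
    }

    /** Encodes one record: intercept, continuous age, categorical job. */
    static double[] encode(double age, String job) {
        double[] x = new double[FEATURES];
        x[slot("intercept")] += 1.0;   // bias term
        x[slot("age")] += age / 100.0; // continuous: the (scaled) value is the weight
        x[slot("job:" + job)] += 1.0;  // categorical: name plus value picks the slot
        return x;
    }

    /** Logistic function applied to the dot product of weights and features. */
    static double predict(double[] w, double[] x) {
        double z = 0.0;
        for (int i = 0; i < w.length; i++) {
            z += w[i] * x[i];
        }
        return 1.0 / (1.0 + Math.exp(-z));
    }

    /** One SGD step on the log-likelihood for a single labelled record. */
    static void train(double[] w, double[] x, int label, double rate) {
        double error = label - predict(w, x);
        for (int i = 0; i < w.length; i++) {
            w[i] += rate * error * x[i];
        }
    }

    public static void main(String[] args) {
        // Two made-up records; the labels are invented, not taken from the dataset.
        double[] yes = encode(25, "student");
        double[] no = encode(60, "retired");
        double[] w = new double[FEATURES];
        for (int epoch = 0; epoch < 200; epoch++) {
            train(w, yes, 1, 0.5);
            train(w, no, 0, 0.5);
        }
        System.out.println("p(yes record) = " + predict(w, yes));
        System.out.println("p(no record)  = " + predict(w, no));
    }
}
```

The real example works on the full dataset with many more fields, so treat this only as a picture of how hashed features feed into an SGD update.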
Let me know if you have any questions. Feel free to suggest a new topic, or to share ideas for improving the Mahout documentation.