Project completed in week 2 (05.10.-09.10.20) of the Data Science Bootcamp at Spiced Academy in Berlin.
In the second week of the bootcamp, we started with Machine Learning (ML). If you think about it, ML surrounds us in everyday life: Netflix recommending movies you might like, your smartphone camera detecting faces, self-driving cars recognizing pedestrians on the street, banks detecting credit card fraud – these are all applications of ML. The underlying methods can be split into three main categories:
- Classification: Logistic Regression, Decision Trees, Random Forest
- Regression: Linear Regression, Regression Trees, SVR, Forecasting
- Unsupervised: PCA, Clustering, t-SNE, Matrix factorization
This week we focused only on classification and applied logistic regression, decision tree, and random forest models on the Titanic dataset to predict passenger survival.
- Logistic Regression separates the data points into two classes by mapping a weighted combination of the features through a sigmoid function to a probability. It’s a fast and performant model, but the data might require extensive feature engineering beforehand (e.g. encoding categorical values numerically). Logistic Regression is rooted in statistics and dates back to 1958.
- Decision Trees classify the data by “asking questions” about different features in the dataset, just like in a guessing game. You can specify the depth of the tree (how many levels of questions it may ask), depending on the desired complexity.
- Random Forest is basically an ensemble of decision trees. This model requires less feature engineering, but its hyperparameters need tweaking. The downside of Decision Trees, and therefore of Random Forest, is that they are prone to overfitting. Decision Trees and Random Forests, in contrast, are rooted in computer science and date back to 1986 and 1995, respectively.
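For illustration, here’s a minimal sketch of how these three classifiers can be set up with scikit-learn (the hyperparameter values below are only examples, not the ones I actually used):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# Logistic Regression: fast, but expects numerically encoded features
log_reg = LogisticRegression(max_iter=1000)

# Decision Tree: max_depth limits how many levels of "questions" the tree may ask
tree = DecisionTreeClassifier(max_depth=4, random_state=42)

# Random Forest: an ensemble of many decision trees (n_estimators of them)
forest = RandomForestClassifier(n_estimators=100, max_depth=6, random_state=42)
```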
‘Logistic regression’ is a terrible term. It should be called ‘binary classification with a sigmoid function’.
ONE OF OUR TEACHERS
Now, a few words about why Feature Engineering is so important (and annoying). An ML model like Logistic Regression doesn’t understand words (e.g. female, good, orange), so we need to transform them into numbers. But depending on the data type, we use different Encoders:
- for ordinal data (if the values have a natural order, e.g.: good-great-excellent, child-teenager-adult-senior): KBinsDiscretizer, LabelEncoder
- for nominal data (if the values are all equal or don’t have an inherent order, e.g.: male-female, Berlin-Vienna-Bucharest): OneHotEncoder
In the Titanic dataset, there were several columns that needed to be transformed in different ways, while leaving the others untouched. The best way to do this is with a ColumnTransformer, which applies all individual transformations in one step.
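As a minimal sketch (assuming the standard Kaggle column names Sex, Embarked and Age, and with purely illustrative bin counts), such a ColumnTransformer could look like this:

```python
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, KBinsDiscretizer

# one ColumnTransformer bundles all individual encodings into a single step
preprocessor = ColumnTransformer(
    transformers=[
        # nominal columns: one binary column per category
        ("nominal", OneHotEncoder(handle_unknown="ignore"), ["Sex", "Embarked"]),
        # continuous age binned into ordered groups (child/teenager/adult/senior-like)
        ("age_bins", KBinsDiscretizer(n_bins=4, encode="ordinal"), ["Age"]),
    ],
    remainder="passthrough",  # leave all other columns untouched
)

# X is the raw feature DataFrame; missing Age values are assumed to be imputed already
X_encoded = preprocessor.fit_transform(X)
```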
Evaluating Classifiers
After hours of wrangling the data and encoding it properly (e.g. extracting the honorifics from the passenger names), I applied the three models to the Titanic dataset and achieved the following accuracy scores:
| | Logistic Regression | Decision Tree | Random Forest |
| --- | --- | --- | --- |
| train set | 79.1 % | 80.3 % | 94.7 % |
| test set | 81.6 % | 76.5 % | 81.0 % |
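For reference, this is roughly how such train/test accuracy scores can be computed, reusing the (illustrative) model and preprocessing objects from the sketches above and assuming y is the Survived column:

```python
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# hold out part of the data so we can check how well each model generalizes
X_train, X_test, y_train, y_test = train_test_split(
    X_encoded, y, test_size=0.2, random_state=42
)

for name, model in [("Logistic Regression", log_reg),
                    ("Decision Tree", tree),
                    ("Random Forest", forest)]:
    model.fit(X_train, y_train)
    train_acc = accuracy_score(y_train, model.predict(X_train))
    test_acc = accuracy_score(y_test, model.predict(X_test))
    print(f"{name}: train {train_acc:.1%} | test {test_acc:.1%}")
```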
It’s interesting to note that Logistic Regression and Random Forest achieved similar scores on the test set (81 %), but Logistic Regression performed worse and Random Forest better on the train set. This means that Random Forest overfit (it learned too much detail from the train set and couldn’t generalize to the test set), whereas Logistic Regression learned less, but could apply that knowledge better to the test set. Another important point is that for Logistic Regression feature engineering is decisive, whereas Random Forest is more of a black box.
I’d say my scores are not that bad, but of course there’s room for improvement, for example by adding, removing or re-encoding features, or by increasing the depth of the tree.
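One way to chase such improvements is a cross-validated grid search over the tree hyperparameters; here’s a rough sketch (the grid values are arbitrary examples, not a recommendation):

```python
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

# cross-validation picks the hyperparameters that generalize best,
# without ever touching the held-out test set
param_grid = {"max_depth": [3, 5, 7, None], "n_estimators": [50, 100, 200]}
search = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=5)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)
```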
Friday Lightning Talk
Like last Friday, we had to give a talk on the weekly project or a topic related to it, and I presented my Decision Tree and Random Forest classifiers.