This visualization uses TensorFlow.js to train a neural network on
the Titanic dataset and shows how the network's predictions evolve after every training epoch.
The color of each row indicates the predicted survival probability for that passenger: red indicates a
predicted death, green indicates predicted survival, and the intensity of
the color indicates the magnitude of the predicted probability. For example, a bright green row
represents a strongly predicted probability of survival. We also plot the training loss
to the left of the table with D3.js.
The code for this visualization is hosted on GitHub.
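For concreteness, here is a minimal sketch of the coloring scheme described above; the function name and exact color values are illustrative, not the repo's actual code:

```javascript
// Map a sigmoid output p in [0, 1] to a row color: red below 0.5 (predicted
// died), green above 0.5 (predicted survived), with opacity growing as the
// prediction moves away from 0.5.
function rowColor(p) {
  const intensity = Math.abs(p - 0.5) * 2; // 0 at p = 0.5, 1 at p = 0 or 1
  return p >= 0.5
    ? `rgba(0, 128, 0, ${intensity})`      // green: predicted survival
    : `rgba(220, 0, 0, ${intensity})`;     // red: predicted death
}

// rowColor(0.95) is a bright green; rowColor(0.55) is a faint green.
```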
The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships. One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper class. In this challenge, we ask you to complete the analysis of what sorts of people were likely to survive. In particular, we ask you to apply the tools of machine learning to predict which passengers survived the tragedy. (Description source)
Questions to Ask Yourself:
1. What features are used in training the model?
I use every feature except ticket and cabin to predict the survival column. The omitted features could
improve the predictions, but I left them out to reduce the feature engineering needed to get this up
and running. You can see how I preprocessed the data
here.
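As a rough sketch of the idea only (the encodings below are assumptions; the linked preprocessing code has the real details), the remaining columns are encoded numerically and the survival column is split out as the label:

```javascript
// Hypothetical preprocessing sketch: drop Ticket and Cabin, encode the
// remaining columns numerically, and use Survived as the label.
function preprocess(rows) {
  return rows.map(r => ({
    features: [
      r.Pclass,                                             // passenger class: 1, 2, or 3
      r.Sex === 'male' ? 1 : 0,                             // binary-encode sex
      r.Age,                                                // assumes missing ages are imputed upstream
      r.SibSp,                                              // siblings/spouses aboard
      r.Parch,                                              // parents/children aboard
      r.Fare,
      r.Embarked === 'S' ? 0 : r.Embarked === 'C' ? 1 : 2,  // ordinal-encode port
    ],
    label: r.Survived,                                      // 0 = died, 1 = survived
  }));
}
```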
2. Where is the test/validation set?
I don't use one here; I train on the entire dataset. This visualization is about understanding how
predictions change while training a neural network, not about how the network generalizes. That said,
from my experiments, if you observe an accuracy at or below 0.78 you are most likely not overfitting;
any accuracy higher than that is most likely the result of overfitting.
3. What neural architecture do you use?
I use a single-hidden-layer neural network with a sigmoid output layer. I also use the
LeCun normal kernel initializer for more consistent results. You can see more information on the model
and training code
here.
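In TensorFlow.js terms, the architecture looks roughly like this sketch; the hidden width and hidden activation are assumed here, so check the linked training code for the actual values:

```javascript
import * as tf from '@tensorflow/tfjs';

// Single hidden layer with a sigmoid output, both initialized with
// LeCun normal. The input width matches the preprocessed feature vector.
const model = tf.sequential();
model.add(tf.layers.dense({
  inputShape: [7],                       // one unit per input feature
  units: 16,                             // assumed hidden width
  activation: 'relu',                    // assumed hidden activation
  kernelInitializer: 'leCunNormal',
}));
model.add(tf.layers.dense({
  units: 1,
  activation: 'sigmoid',                 // outputs a survival probability
  kernelInitializer: 'leCunNormal',
}));
```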
4. What optimizer do you use?
I use Adam with the default Keras parameters.
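A sketch of the corresponding compile step, using TensorFlow.js's Adam defaults (learning rate 0.001, beta1 0.9, beta2 0.999, which match Keras):

```javascript
// Compile the model above with Adam at its default parameters and a
// binary cross-entropy loss for the 0/1 survival target.
model.compile({
  optimizer: tf.train.adam(),            // defaults match Keras: lr = 0.001
  loss: 'binaryCrossentropy',
  metrics: ['accuracy'],
});
```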
5. Why do the predictions become horrible after I sort the data?
The data is batched for training in the same order it is sorted and displayed.
If your batch size is smaller than the entire dataset, you will be using mini-batch gradients.
Because this dataset is small and its classes are imbalanced, the network is prone to being led
to a point in gradient space from which it cannot escape if the batches are too small and
consecutive batches are drawn from the same class. For example, try sorting by sex and
changing your batch size to 20. The network will train on batches of 20 males, who mostly died,
until it gets to the females. By that point, the learned representation is so biased towards dead males
that it takes a lot of gradient energy to reach a region that also captures female survival.
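This failure mode could be reproduced with a sketch like the one below, assuming xs and ys are tensors built from the sorted rows: fit with a small batch size and shuffling disabled so mini-batches follow the on-screen order.

```javascript
// Train without shuffling so mini-batches follow the displayed sort order;
// with the rows sorted by sex, each batch of 20 is nearly single-class.
await model.fit(xs, ys, {
  batchSize: 20,
  epochs: 50,
  shuffle: false,   // preserve the sorted row order when batching
});
```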
6. Where can I find a more thorough analysis of using neural networks on this dataset?
Check out my Python notebook on Kaggle.
7. What's with predicting something that has already happened?
This is often a confusing topic for people new to machine learning.
The idea is that the survival of a passenger is a function of the other observed variables: "Given the
observed variables, which features of a passenger lead to survival or death?" With our
neural network, we aim to learn this function. In real predictive analytics problems, we aim to learn a function
$$f(\text{observed variables}) = \text{target variables}$$ that generalizes to new data not seen in the training
set. This requires many considerations, largely related to the bias-variance tradeoff.
Any more questions?
If you have any questions, submit an issue on
this repo and I'll get back to you.