Build linear model with estimator (Titanic dataset)

Storyboard

One application is the study of the data of the passengers of the Titanic and the probability of survival according to their characteristics. In this case, a model of a linear estimator type is used.

Code and data

LinearTitanic.ipynb

titanic_lin_train.csv

titanic_lin_eval.csv

>Model

ID:(1790, 0)



Setup for linear model

Description

>Top


Setup for linear model:

import os
import sys

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from IPython.display import clear_output
from six.moves import urllib

ID:(13828, 0)



Load dataset

Description

>Top


Load the passenger data for the maiden voyage of the Titanic:

import tensorflow.compat.v2.feature_column as fc
import tensorflow as tf

# Load dataset.
dftrain = pd.read_csv('titanic_lin_train.csv')
dfeval = pd.read_csv('titanic_lin_eval.csv')
dftest = pd.read_csv('titanic_lin_test.csv')
y_train = dftrain.pop('survived')
y_eval = dfeval.pop('survived')

ID:(13829, 0)



Show structures and data

Description

>Top


Show structures and data from the Titanic passenger dataset:

# show structure and data
dftrain.head()


ID:(13830, 0)



Show Data Statistics

Description

>Top


Show Titanic Passenger Data Statistics

# show statistics of data
dftrain.describe()


ID:(13831, 0)



Number of records to train and evaluate

Description

>Top


Number of records to train and evaluate:

# number of train and evaluation records
dftrain.shape[0], dfeval.shape[0]

(627, 264)

ID:(13832, 0)



Show the histogram (vertical) of the age

Description

>Top


To show the histogram (vertical) hist of the ages dftrain.age with interval of 20 years (bins = 20) is obtained from

# age histogram
dftrain.age.hist(bins=20)


ID:(13833, 0)



Show the histogram (horizontal) of the gender

Description

>Top


To show the histogram (horizontal) plot (kind = 'barh') of the gender dftrain.sex counting the number of passengers according to gender value_counts () you get

# gender histogram
dftrain.sex.value_counts().plot(kind='barh')


ID:(13834, 0)



Show the histogram (horizontal) of the classes

Description

>Top


To show the histogram (horizontal) plot (kind='barh') of the class dftrain['class'] counting the number of passengers according to class value_counts () you get

# class histogram
dftrain['class'].value_counts().plot(kind='barh')


ID:(13835, 0)



Show the histogram (horizontal) of the classes

Description

>Top


To show the histogram (horizontal) plot(kind='barh') of the survivors survived averaged mean() grouped by gender groupby('sex') with the label on the x-axis set_xlabel('% survive') of the string assembled by concatenating pd.concat([dftrain, y_train], axis = 1) of all the data dftrain and the abscissa y_train;

# survived histogram
pd.concat([dftrain, y_train], axis=1).groupby('sex').survived.mean().plot(kind='barh').set_xlabel('% survive')



ID:(13836, 0)



Form columns of data

Description

>Top


To form the tensor, the feature_columns data must be added according to whether they are categories or numeric.
The former are rescued via the vocabulary array and added with the append command, forming the column with the tf.feature_column.categorical_column_with_vocabulary_list command.
The latter are assigned directly with the append command with the column via its tf.feature_column.numeric_column designation.

# define tensor
feature_columns = []

# define categorical columns names
CATEGORICAL_COLUMNS = ['sex', 'class', 'deck', 'embark_town', 'alone']
# build columns of categorical variables
for feature_name in CATEGORICAL_COLUMNS:
  vocabulary = dftrain[feature_name].unique()
  feature_columns.append(tf.feature_column.categorical_column_with_vocabulary_list(feature_name, vocabulary))

# define numeric columns names
FLOAT_COLUMNS = ['age', 'fare']
# build columns of numeric variables
for feature_name in FLOAT_COLUMNS:
  feature_columns.append(tf.feature_column.numeric_column(feature_name, dtype=tf.float32))

# define numeric columns names
NUMERIC_COLUMNS = ['n_siblings_spouses', 'parch']
# build columns of numeric variables
for feature_name in NUMERIC_COLUMNS:
  feature_columns.append(tf.feature_column.numeric_column(feature_name, dtype=tf.int32))

ID:(13837, 0)



Form set to train and evaluate

Description

>Top


A function is defined to generate arrangements for the training and evaluation processes:

# define function for making set
def make_input_fn(data_df, label_df, num_epochs=10, shuffle=True, batch_size=32):
  def input_function():
    ds = tf.data.Dataset.from_tensor_slices((dict(data_df), label_df))
    if shuffle:
      ds = ds.shuffle(1000)
    ds = ds.batch(batch_size).repeat(num_epochs)
    return ds
  return input_function


Generation of sets for training and evaluation:

# generate training input
train_input_fn = make_input_fn(dftrain, y_train)
# generate evaluating input
eval_input_fn = make_input_fn(dfeval, y_eval, num_epochs=1, shuffle=False)

ID:(13838, 0)



Training and Evaluation

Description

>Top


To perform the classification, the model is defined using the estimator.LinearClassifier function applied to the columns of the feature_columns data.

# define model
linear_model = tf.estimator.LinearClassifier(feature_columns=feature_columns)


Then we proceed to train the model by applying train on the model linear_model with the data to train train_input_fn

# define train model
linear_model.train(train_input_fn)


Finally it is evaluated with evaluate applied on the model linear_model and the data to evaluate eval_input_fn

# evaluate model
result = linear_model.evaluate(eval_input_fn)


El resultado final se imprime en mediante el comando print:

# print result
clear_output()
print(result)

Is obtained:

{'accuracy': 0.75757575, 'accuracy_baseline': 0.625, 'auc': 0.83067036, 'auc_precision_recall': 0.78773487, 'average_loss': 0.49594572, 'label/mean': 0.375, 'loss': 0.49112648, 'precision': 0.65217394, 'prediction/mean': 0.44832677, 'recall': 0.75757575, 'global_step': 200}

ID:(13841, 0)



Histogram of the probabilities of survival

Description

>Top


If the predicted probability of survival is evaluated based on its frequency:

# histogram of the probability
pred_dicts = list(linear_est.predict(eval_input_fn))
probs = pd.Series([pred['probabilities'][1] for pred in pred_dicts])

probs.plot(kind='hist', bins=20, title='predicted probabilities')


a distribution of the form is obtained:

ID:(13839, 0)



ROC Curve

Description

>Top


To use this forecast, a limit value of the property must be defined to define on which value the survival will be predicted and under which the non-survival will be forecast. To do this, the probability that survival is predicted and observed (true positive) must be evaluated and compared with the probability that survival is predicted when it is not survived (false positive).\\nThe true-positive TPR or sensitivity factor is defined as\\n\\n

$TPR=\displaystyle\frac{TP}{TP+FN}$

\\n\\nwhere true-positive TP the cases correctly predicted as positive and the false-negative FN corresponding to the cases predicted as negative that are negative.\\nThe false-positive FPR factor is defined as\\n\\n

$FPR=\displaystyle\frac{FP}{FP+TN}$



where false-positive FP the cases predicted as positive when it is actually false and the false-negative FN corresponding to the cases predicted as negative that are negative.

from sklearn.metrics import roc_curve
from matplotlib import pyplot as plt

fpr, tpr, _ = roc_curve(y_eval, probs)
plt.plot(fpr, tpr)
plt.title('ROC curve')
plt.xlabel('false positive rate')
plt.ylabel('true positive rate')
plt.xlim(0,)
plt.ylim(0,)


The representation of both probabilities is called a ROC (Receiver Operating Characteristic) diagram:

ID:(13840, 0)