Druyd:Knowledge

User:

Work

Druyd Chat

Deep Neural Network (DNN)

Build linear model with estimator (Titanic dataset)

Build DNN model with estimator (Titanic dataset)

Model for elections in USA (DNN)

Build linear model with estimator (Titanic dataset)

Storyboard

One application is the study of the data of the passengers of the Titanic and the probability of survival according to their characteristics. In this case, a model of a linear estimator type is used.

Code and data

LinearTitanic.ipynb

titanic_lin_train.csv

titanic_lin_eval.csv

ID:(1790, 0)

Setup for linear model

Description

Setup for linear model:

import os
import sys

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from IPython.display import clear_output
from six.moves import urllib

ID:(13828, 0)

Load dataset

Description

Load the passenger data for the maiden voyage of the Titanic:

import tensorflow.compat.v2.feature_column as fc
import tensorflow as tf

# Load dataset.
dftrain = pd.read_csv('titanic_lin_train.csv')
dfeval = pd.read_csv('titanic_lin_eval.csv')
dftest = pd.read_csv('titanic_lin_test.csv')
y_train = dftrain.pop('survived')
y_eval = dfeval.pop('survived')

ID:(13829, 0)

Show structures and data

Description

Show structures and data from the Titanic passenger dataset:

# show structure and data
dftrain.head()

ID:(13830, 0)

Show Data Statistics

Description

Show Titanic Passenger Data Statistics

# show statistics of data
dftrain.describe()

ID:(13831, 0)

Number of records to train and evaluate

Description

Number of records to train and evaluate:

# number of train and evaluation records
dftrain.shape[0], dfeval.shape[0]

(627, 264)

ID:(13832, 0)

Show the histogram (vertical) of the age

Description

To show the histogram (vertical) hist of the ages dftrain.age with interval of 20 years (bins = 20) is obtained from

# age histogram
dftrain.age.hist(bins=20)

ID:(13833, 0)

Show the histogram (horizontal) of the gender

Description

To show the histogram (horizontal) plot (kind = 'barh') of the gender dftrain.sex counting the number of passengers according to gender value_counts () you get

# gender histogram
dftrain.sex.value_counts().plot(kind='barh')

ID:(13834, 0)

Show the histogram (horizontal) of the classes

Description

>Top

To show the histogram (horizontal) plot (kind='barh') of the class dftrain['class'] counting the number of passengers according to class value_counts () you get

# class histogram dftrain['class'].value_counts().plot(kind='barh')

ID:(13835, 0)

Show the histogram (horizontal) of the classes

Description

>Top

To show the histogram (horizontal) plot(kind='barh') of the survivors survived averaged mean() grouped by gender groupby('sex') with the label on the x-axis set_xlabel('% survive') of the string assembled by concatenating pd.concat([dftrain, y_train], axis = 1) of all the data dftrain and the abscissa y_train;

# survived histogram pd.concat([dftrain, y_train], axis=1).groupby('sex').survived.mean().plot(kind='barh').set_xlabel('% survive')

ID:(13836, 0)

Form columns of data

Description

>Top

To form the tensor, the feature_columns data must be added according to whether they are categories or numeric.
The former are rescued via the vocabulary array and added with the append command, forming the column with the tf.feature_column.categorical_column_with_vocabulary_list command.
The latter are assigned directly with the append command with the column via its tf.feature_column.numeric_column designation.

# define tensor feature_columns = [] # define categorical columns names CATEGORICAL_COLUMNS = ['sex', 'class', 'deck', 'embark_town', 'alone'] # build columns of categorical variables for feature_name in CATEGORICAL_COLUMNS: vocabulary = dftrain[feature_name].unique() feature_columns.append(tf.feature_column.categorical_column_with_vocabulary_list(feature_name, vocabulary)) # define numeric columns names FLOAT_COLUMNS = ['age', 'fare'] # build columns of numeric variables for feature_name in FLOAT_COLUMNS: feature_columns.append(tf.feature_column.numeric_column(feature_name, dtype=tf.float32)) # define numeric columns names NUMERIC_COLUMNS = ['n_siblings_spouses', 'parch'] # build columns of numeric variables for feature_name in NUMERIC_COLUMNS: feature_columns.append(tf.feature_column.numeric_column(feature_name, dtype=tf.int32))

ID:(13837, 0)

Form set to train and evaluate

Description

>Top

A function is defined to generate arrangements for the training and evaluation processes:

# define function for making set def make_input_fn(data_df, label_df, num_epochs=10, shuffle=True, batch_size=32): def input_function(): ds = tf.data.Dataset.from_tensor_slices((dict(data_df), label_df)) if shuffle: ds = ds.shuffle(1000) ds = ds.batch(batch_size).repeat(num_epochs) return ds return input_function

Generation of sets for training and evaluation:

# generate training input train_input_fn = make_input_fn(dftrain, y_train) # generate evaluating input eval_input_fn = make_input_fn(dfeval, y_eval, num_epochs=1, shuffle=False)

ID:(13838, 0)

Training and Evaluation

Description

>Top

To perform the classification, the model is defined using the estimator.LinearClassifier function applied to the columns of the feature_columns data.

# define model linear_model = tf.estimator.LinearClassifier(feature_columns=feature_columns)

Then we proceed to train the model by applying train on the model linear_model with the data to train train_input_fn

# define train model linear_model.train(train_input_fn)

Finally it is evaluated with evaluate applied on the model linear_model and the data to evaluate eval_input_fn

# evaluate model result = linear_model.evaluate(eval_input_fn)

El resultado final se imprime en mediante el comando print:

# print result clear_output() print(result)
Is obtained:
{'accuracy': 0.75757575, 'accuracy_baseline': 0.625, 'auc': 0.83067036, 'auc_precision_recall': 0.78773487, 'average_loss': 0.49594572, 'label/mean': 0.375, 'loss': 0.49112648, 'precision': 0.65217394, 'prediction/mean': 0.44832677, 'recall': 0.75757575, 'global_step': 200}

ID:(13841, 0)

Histogram of the probabilities of survival

Description

>Top

If the predicted probability of survival is evaluated based on its frequency:

# histogram of the probability pred_dicts = list(linear_est.predict(eval_input_fn)) probs = pd.Series([pred['probabilities'][1] for pred in pred_dicts]) probs.plot(kind='hist', bins=20, title='predicted probabilities')

a distribution of the form is obtained:

ID:(13839, 0)

ROC Curve

Description

>Top

To use this forecast, a limit value of the property must be defined to define on which value the survival will be predicted and under which the non-survival will be forecast. To do this, the probability that survival is predicted and observed (true positive) must be evaluated and compared with the probability that survival is predicted when it is not survived (false positive).\\nThe true-positive TPR or sensitivity factor is defined as\\n\\n
$TPR=\displaystyle\frac{TP}{TP+FN}$
\\n\\nwhere true-positive TP the cases correctly predicted as positive and the false-negative FN corresponding to the cases predicted as negative that are negative.\\nThe false-positive FPR factor is defined as\\n\\n
$FPR=\displaystyle\frac{FP}{FP+TN}$

where false-positive FP the cases predicted as positive when it is actually false and the false-negative FN corresponding to the cases predicted as negative that are negative.

from sklearn.metrics import roc_curve from matplotlib import pyplot as plt fpr, tpr, _ = roc_curve(y_eval, probs) plt.plot(fpr, tpr) plt.title('ROC curve') plt.xlabel('false positive rate') plt.ylabel('true positive rate') plt.xlim(0,) plt.ylim(0,)

The representation of both probabilities is called a ROC (Receiver Operating Characteristic) diagram:

ID:(13840, 0)