Comparative Analysis of Classification Algorithms
June 2020
2840 Words, 16 Minutes
Supervised Machine Learning Algorithms
Supervised learning is the type of machine learning in which machines are trained on well "labelled" training data and, on the basis of that data, predict the output. Labelled data means that the input data is already tagged with the correct output. In supervised learning, the training data acts as a supervisor that teaches the machine to predict the output correctly, much as a student learns under the supervision of a teacher. Supervised learning is therefore a process of providing both input data and the correct output data to the machine learning model. The aim of a supervised learning algorithm is to find a mapping function from the input variable (x) to the output variable (y). In the real world, supervised learning is used for risk assessment, image classification, fraud detection, spam filtering, and so on.
The algorithms I have implemented are:
- K-Nearest Neighbour
- Random Forest
- Decision Tree
- Support Vector Machine
Importing Libraries
We import the libraries needed for this project.
- LabelEncoder is a popular encoding technique for handling categorical variables
- StandardScaler is used to normalise the values
- SimpleImputer is used to fill in missing values
- Seaborn is used for data visualisation
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn import svm
from sklearn.metrics import confusion_matrix, accuracy_score
Reading the CSV
We use a liver patient dataset. Based on the chemical compounds present in the human body (bilirubin, albumin, proteins, alkaline phosphatase) and on tests such as SGOT and SGPT, the outcome column indicates whether a person is a liver patient, i.e. needs to be diagnosed, or not.
df = pd.read_csv("data1/liver_p.csv")
df.head()
Output
Age Gender Total_Bilirubin Direct_Bilirubin Alkaline_Phosphotase Alamine_Aminotransferase Aspartate_Aminotransferase Total_Protiens Albumin Albumin_and_Globulin_Ratio Dataset
0 65 Female 0.7 0.1 187 16 18 6.8 3.3 0.90 1
1 62 Male 10.9 5.5 699 64 100 7.5 3.2 0.74 1
2 62 Male 7.3 4.1 490 60 68 7.0 3.3 0.89 1
3 58 Male 1.0 0.4 182 14 20 6.8 3.4 1.00 1
4 72 Male 3.9 2.0 195 27 59 7.3 2.4 0.40 1
Checking for missing values
We first separate the predictors from the target column (Dataset). Because the mean imputer and scaler used later only work on numeric data, the categorical Gender column is also encoded as 0/1. Finally, we count the missing values per column.
features = df.columns.drop("Dataset")  # every column except the target
X = df[features].copy()
# Gender is categorical; encode it numerically (Female -> 0, Male -> 1) so mean imputation and scaling can handle it
X["Gender"] = LabelEncoder().fit_transform(X["Gender"])
Y = df.Dataset
df.isnull().sum()
Output
Age 0
Gender 0
Total_Bilirubin 0
Direct_Bilirubin 0
Alkaline_Phosphotase 0
Alamine_Aminotransferase 0
Aspartate_Aminotransferase 0
Total_Protiens 0
Albumin 0
Albumin_and_Globulin_Ratio 4
Dataset 0
dtype: int64
Since there are missing values, we handle them using SimpleImputer and fill in the mean of the column for each missing value.
imp = SimpleImputer(strategy ="mean")
X_new =pd.DataFrame(imp.fit_transform(X))
X_new.columns = df[features].columns
X_new.head()
Output
Age Gender Total_Bilirubin Direct_Bilirubin Alkaline_Phosphotase Alamine_Aminotransferase Aspartate_Aminotransferase Total_Protiens Albumin Albumin_and_Globulin_Ratio
0 65.0 0.7 0.1 187.0 16.0 18.0 6.8 3.3 0.90 0.0
1 62.0 10.9 5.5 699.0 64.0 100.0 7.5 3.2 0.74 1.0
2 62.0 7.3 4.1 490.0 60.0 68.0 7.0 3.3 0.89 1.0
3 58.0 1.0 0.4 182.0 14.0 20.0 6.8 3.4 1.00 1.0
4 72.0 3.9 2.0 195.0 27.0 59.0 7.3 2.4 0.40 1.0
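As a quick sanity check (an illustrative line, not part of the original notebook), the value filled into the four missing Albumin_and_Globulin_Ratio cells is simply that column's mean:
# SimpleImputer(strategy="mean") fills exactly this value into the empty cells
print(df["Albumin_and_Globulin_Ratio"].mean())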
Checking for missing values again
X_new.isnull().sum()
Output
Age 0
Gender 0
Total_Bilirubin 0
Direct_Bilirubin 0
Alkaline_Phosphotase 0
Alamine_Aminotransferase 0
Aspartate_Aminotransferase 0
Total_Protiens 0
Albumin 0
Albumin_and_Globulin_Ratio 0
dtype: int64
Dropping the Gender Column
X.drop("Gender" , axis = 1 , inplace = True )
X.head()
Output
Age Total_Bilirubin Direct_Bilirubin Alkaline_Phosphotase Alamine_Aminotransferase Aspartate_Aminotransferase Total_Protiens Albumin Albumin_and_Globulin_Ratio Sex
0 65 0.7 0.1 187 16 18 6.8 3.3 0.90 0
1 62 10.9 5.5 699 64 100 7.5 3.2 0.74 1
2 62 7.3 4.1 490 60 68 7.0 3.3 0.89 1
3 58 1.0 0.4 182 14 20 6.8 3.4 1.00 1
4 72 3.9 2.0 195 27 59 7.3 2.4 0.40 1
Data Visualisation
We will visualise the patients who have liver disease and those who do not.
sns.countplot(data=df, x = 'Dataset', label='Count')
LD, NLD = df['Dataset'].value_counts()
print('Number of patients diagnosed with liver disease: ',LD)
print('Number of patients not diagnosed with liver disease: ',NLD)
Output (count plot of the Dataset column, plus the printed patient counts)
Mapping one feature with respect to another using the seaborn library
g = sns.FacetGrid(df, col="Gender", row="Dataset", margin_titles=True)
g.map(plt.scatter,"Direct_Bilirubin", "Total_Bilirubin", edgecolor="w")
plt.subplots_adjust(top=0.9)
Output (FacetGrid of Direct_Bilirubin vs Total_Bilirubin scatter plots, split by Gender and Dataset)
Data Normalisation
Normalisation is generally required when we are dealing with attributes on different scales; otherwise an equally important attribute measured on a smaller scale can be drowned out by attributes with values on a larger scale. In simple words, when several attributes take values on different scales, this can lead to poor models, so the attributes are normalised to bring them all onto the same scale.
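StandardScaler standardises each column to zero mean and unit variance, i.e. z = (x - mean) / std. As a purely illustrative check (not part of the original pipeline), the same transformation can be written out by hand:
# column-wise z-score; ddof=0 matches StandardScaler's population standard deviation
X_manual = (X_new - X_new.mean()) / X_new.std(ddof=0)
Below we use StandardScaler itself.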
X_arr = X_new.iloc[: , : ]
Y_arr = df.iloc[: , -1]
sc = StandardScaler()
X_arr = sc.fit_transform(X_arr)
X_arr
Output
array([[ 1.25209764, -0.41887783, -0.49396398, ..., 0.19896867,
-0.14789798, -1.76228085],
[ 1.06663704, 1.22517135, 1.43042334, ..., 0.07315659,
-0.65069686, 0.56744644],
[ 1.06663704, 0.6449187 , 0.93150811, ..., 0.19896867,
-0.17932291, 0.56744644],
...,
[ 0.44843504, -0.4027597 , -0.45832717, ..., 0.07315659,
0.16635131, 0.56744644],
[-0.84978917, -0.32216906, -0.35141677, ..., 0.32478075,
0.16635131, 0.56744644],
[-0.41704777, -0.37052344, -0.42269037, ..., 1.58290153,
1.73759779, 0.56744644]])
Splitting the data into train and test set
Train-Valid-Test split is a technique to evaluate the performance of a machine learning model, classification or regression alike.
Train dataset
- The set of data used for learning, i.e. to fit the parameters of the machine learning model.
Valid dataset
- The set of data used to provide an unbiased evaluation of a model fitted on the training dataset while tuning the model's hyperparameters.
- It also plays a role in other forms of model preparation, such as feature selection and threshold (cut-off) selection.
Test dataset
- The set of data used to provide an unbiased evaluation of the final model fitted on the training dataset.
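The code in this post only makes a train/test split. If a separate validation set were wanted, one possible approach (an illustrative sketch with made-up variable names) is to call train_test_split twice:
# first carve out a 20% test set, then split the remainder 75/25 into train and validation
X_tmp, X_te, Y_tmp, Y_te = train_test_split(X_arr, Y_arr, test_size=0.2, shuffle=True)
X_tr, X_val, Y_tr, Y_val = train_test_split(X_tmp, Y_tmp, test_size=0.25, shuffle=True)
Here we keep the simple train/test split: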
X_train , X_test , Y_train , Y_test = train_test_split(X_arr , Y_arr , test_size = 0.2 , shuffle = True)
len(X_train) , len(X_test) , len(Y_train) , len(Y_test)
Output
(466, 117, 466, 117)
Building K-Nearest Neighbour Model
K-Nearest Neighbours is one of the most basic yet essential classification algorithms in machine learning. It belongs to the supervised learning domain and is heavily used in pattern recognition, data mining and intrusion detection. It is widely applicable in real-life scenarios because it is non-parametric, meaning it makes no underlying assumptions about the distribution of the data.
Algorithm
- Let m be the number of training data samples. Let p be an unknown point.
- Store the training samples in an array of data points arr[]. This means each element of this array represents a tuple (x, y).
- for i=0 to m: Calculate Euclidean distance d(arr[i], p).
- Make set S of K smallest distances obtained. Each of these distances corresponds to an already classified data point.
- Return the majority label among S.
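As an illustration of these steps (not the scikit-learn implementation used below), a minimal brute-force version could look like this:
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    # Euclidean distance from the query point x to every training sample
    dists = np.sqrt(((X_train - x) ** 2).sum(axis=1))
    # indices of the k nearest neighbours
    nearest = np.argsort(dists)[:k]
    # majority label among those neighbours
    return Counter(np.asarray(y_train)[nearest]).most_common(1)[0][0]
In practice we rely on scikit-learn's KNeighborsClassifier: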
model_knn = KNeighborsClassifier(n_neighbors= 3)
model_knn.fit(X_train , Y_train)
pred = model_knn.predict(X_test)
score = accuracy_score(pred , Y_test)
score
Output
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
metric_params=None, n_jobs=None, n_neighbors=3, p=2,
weights='uniform')
Output
0.6324786324786325
cm = confusion_matrix(Y_test , pred)
print(cm)
Output
[[64 18]
[25 10]]
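Optionally, the confusion matrix can also be drawn as an annotated heatmap with seaborn (a small illustrative addition):
# rows are the actual classes, columns the predicted classes
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues")
plt.xlabel("Predicted class")
plt.ylabel("Actual class")
plt.show()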
Building Random Forest Classifier
Random Forest (or Random Decision Forest) is a supervised machine learning algorithm used for classification, regression and other tasks. It builds a collection of decision trees, each from a randomly selected subset of the training set, and then collects the votes of the individual trees to decide the final prediction.
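To make the voting idea concrete, here is a rough sketch of bagging a few trees and taking a majority vote (illustrative only, with made-up names; the model below uses scikit-learn's implementation):
import numpy as np
from collections import Counter
from sklearn.tree import DecisionTreeClassifier

def tiny_forest_predict(X_train, y_train, X_test, n_trees=10, seed=1):
    rng = np.random.default_rng(seed)
    votes = []
    for _ in range(n_trees):
        # bootstrap sample: draw training rows with replacement
        idx = rng.integers(0, len(X_train), len(X_train))
        tree = DecisionTreeClassifier(max_features="sqrt", random_state=seed)
        tree.fit(X_train[idx], np.asarray(y_train)[idx])
        votes.append(tree.predict(X_test))
    # majority vote across the trees for every test row
    votes = np.vstack(votes)
    return np.array([Counter(col).most_common(1)[0][0] for col in votes.T])
scikit-learn's RandomForestClassifier does all of this for us: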
model_forest = RandomForestClassifier(n_estimators= 700 , random_state = 1)
model_forest
model_forest.fit(X_train , Y_train)
pred_forest = model_forest.predict(X_test)
score_forest = accuracy_score(pred_forest , Y_test)
score_forest
Output
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
max_depth=None, max_features='auto', max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=700,
n_jobs=None, oob_score=False, random_state=1, verbose=0,
warm_start=False)
Output
0.6837606837606838
cm_forest = confusion_matrix(Y_test , pred_forest)
print(cm_forest)
Output
[[70 12]
[25 10]]
Decision Tree
Decision trees are among the most popular tools for classification and prediction. A decision tree is a flowchart-like tree structure in which each internal node denotes a test on an attribute, each branch represents an outcome of that test, and each leaf (terminal) node holds a class label.
Construction
A tree is “learned” by splitting the source set into subsets based on an attribute-value test. This process is repeated on each derived subset in a recursive manner known as recursive partitioning. The recursion stops when all samples at a node share the same value of the target variable, or when further splitting no longer improves the predictions. Constructing a decision tree classifier requires no domain knowledge or parameter setting, which makes it well suited to exploratory knowledge discovery. Decision trees can handle high-dimensional data and generally achieve good accuracy; decision tree induction is a typical inductive approach to learning classification knowledge.
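A compact sketch of this recursive partitioning (illustrative only; the scikit-learn implementation used below is far more complete) might look like this:
import numpy as np
from collections import Counter

def gini(y):
    # Gini impurity of an array of integer class labels
    p = np.bincount(y) / len(y)
    return 1.0 - np.sum(p ** 2)

def best_split(X, y):
    # try every feature/threshold pair and keep the lowest weighted impurity
    best = None
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            left = X[:, j] <= t
            if left.all() or not left.any():
                continue
            score = left.mean() * gini(y[left]) + (~left).mean() * gini(y[~left])
            if best is None or score < best[0]:
                best = (score, j, t)
    return best

def grow(X, y, depth=0, max_depth=3):
    split = best_split(X, y)
    if depth == max_depth or split is None or len(np.unique(y)) == 1:
        return Counter(y).most_common(1)[0][0]  # leaf node: majority class
    _, j, t = split
    left = X[:, j] <= t
    return (j, t, grow(X[left], y[left], depth + 1, max_depth),
            grow(X[~left], y[~left], depth + 1, max_depth))
With scikit-learn this is a single call: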
model_decision = DecisionTreeClassifier(max_depth= 10 , random_state= 1)
model_decision.fit(X_train , Y_train)
pred_decision = model_decision.predict(X_test)
pred_decision
score = accuracy_score(pred_decision , Y_test)
score
Output
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=10,
max_features=None, max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, presort=False,
random_state=1, splitter='best')
Output
0.6752136752136753
Building SVM model
Support Vector Machine (SVM) is a relatively simple supervised machine learning algorithm used for classification and/or regression. It is mostly preferred for classification, but can be very useful for regression as well. SVM finds a hyperplane that creates a boundary between the classes of data; in two-dimensional space this hyperplane is simply a line. Each data item in the dataset is plotted as a point in an N-dimensional space, where N is the number of features/attributes, and the algorithm then finds the optimal hyperplane that separates the classes. Inherently, therefore, SVM can only perform binary classification (i.e. choose between two classes), but there are techniques for multi-class problems: we can create one binary classifier for each class of the data, and each classifier gives one of two results: the data point belongs to that class, or it does not.
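Our Dataset target has only two classes, so a single SVC is enough here. Purely for illustration, a one-vs-rest wrapper for a multi-class problem could be set up like this (hypothetical usage, not part of the original pipeline):
from sklearn.multiclass import OneVsRestClassifier
from sklearn import svm

# one binary SVM per class; the class whose classifier is most confident wins
ovr_model = OneVsRestClassifier(svm.SVC(C=10, gamma=5))
For our binary problem we simply fit a single SVC: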
model_svm = svm.SVC(C = 10, gamma = 5)
model_svm
model_svm.fit(X_train , Y_train)
pred_svm = model_svm.predict(X_test)
Output
SVC(C=10, cache_size=200, class_weight=None, coef0=0.0,
decision_function_shape='ovr', degree=3, gamma=5, kernel='rbf', max_iter=-1,
probability=False, random_state=None, shrinking=True, tol=0.001,
verbose=False)
score_svm = accuracy_score(Y_test , pred_svm)
score_svm
Output
0.7264957264957265
Comparative Analysis
After implementing the models, we see the following results:
- KNN has an accuracy of 63%
- Random Forest Classifier has an accuracy of 68%
- Decision Tree has an accuracy of 68%
- SVM has an accuracy of 73%
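To compare the models at a glance, the reported accuracies can be collected into a small summary table and bar chart (an optional snippet using the rounded scores listed above):
# summary of the accuracies reported in this post
results = pd.DataFrame({
    "Model": ["KNN", "Random Forest", "Decision Tree", "SVM"],
    "Accuracy": [0.63, 0.68, 0.68, 0.73],
})
results.plot.bar(x="Model", y="Accuracy", legend=False)
plt.ylabel("Accuracy")
plt.show()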
To conclude, SVM gives the highest accuracy (about 73%) on this dataset, followed by Random Forest and the Decision Tree at about 68%, while KNN performs the worst at 63%.