👩💻A Support Vector Machine (SVM) is a supervised👀 machine learning model used for classification. It is simple but very useful. It uses a tool🔨 called a hyperplane to separate data into different groups, and it offers methods for both linear and non-linear data. The two caret methods used here are svmLinear and svmPoly.
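To make the hyperplane idea concrete, here is a minimal base R sketch. The weights `w`, the offset `b`, the points, and the class names "A" and "B" are all made up for illustration; they are not fitted by an SVM. The sketch only shows how a linear boundary assigns a group depending on which side of the hyperplane a point falls.

```r
# Hypothetical hyperplane w . x + b = 0 in two dimensions (not a fitted model)
w <- c(1, -1)
b <- 0

# Three made-up points, one per row
points <- rbind(c(2.0, 1.0),
                c(1.0, 3.0),
                c(0.5, 0.2))

# Signed score of each point relative to the hyperplane
scores <- as.vector(points %*% w + b)

# Points with a non-negative score go to group "A", the rest to group "B"
labels <- ifelse(scores >= 0, "A", "B")
print(data.frame(x1 = points[, 1], x2 = points[, 2], score = scores, group = labels))
```

A trained SVM chooses `w` and `b` so that the margin between the groups is as wide as possible; the classification step itself is the same sign check.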
Since SVM is a supervised model, the first step is to split the dataset into training and testing sets; we assume the dataset is clean♻️. After the split, we train🏃 the SVM on the training set and use the trained model to predict the testing set. Finally, we use a 📐📏confusion matrix to show how well our predictions perform in terms of accuracy, recall, specificity, etc.
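As a rough illustration of how those metrics come out of a confusion matrix, the base R sketch below recomputes them by hand. The counts are copied from the 75:25 linear confusion matrix shown later in this post; the variable names are my own.

```r
# Counts from the 75:25 linear confusion matrix in this post
# (rows = predicted class, columns = reference class)
classes <- c("setosa", "versicolor", "virginica")
cm <- matrix(c(12,  0,  0,
                0, 10,  0,
                0,  2, 12),
             nrow = 3, byrow = TRUE,
             dimnames = list(Prediction = classes, Reference = classes))

# Overall accuracy: correct predictions over all predictions
accuracy <- sum(diag(cm)) / sum(cm)

# Per-class counts in a one-vs-rest view
tp <- diag(cm)                 # true positives
fn <- colSums(cm) - tp         # false negatives
fp <- rowSums(cm) - tp         # false positives
tn <- sum(cm) - tp - fn - fp   # true negatives

recall      <- tp / (tp + fn)  # reported by caret as Sensitivity
specificity <- tn / (tn + fp)

print(round(accuracy, 4))
print(round(rbind(recall, specificity), 4))
```

The values agree with the Accuracy, Sensitivity, and Specificity rows that `confusionMatrix` reports for that split.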
We are 🔭interested in the effect of the data split ratio on model performance, so we compare two new ratios, 75:25 and 50:50, with the original 60:40 ratio used by Dr. Hunt. We also apply both the linear and the poly method to the dataset to compare their performance.
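To show what the three ratios mean for the iris data, here is a small base R sketch of a stratified split. It is a stand-in for illustration, not caret's `createDataPartition`; the helper `stratified_split` and the seed value are my own assumptions. Setting a seed makes the random split repeatable.

```r
set.seed(42)  # arbitrary seed, chosen only so the split is repeatable

# Hypothetical helper: sample a fraction p of the rows within each class
stratified_split <- function(data, label, p) {
  idx_by_class <- split(seq_len(nrow(data)), data[[label]])
  train_idx <- unlist(
    lapply(idx_by_class, function(idx) sample(idx, ceiling(p * length(idx)))),
    use.names = FALSE
  )
  list(train = data[train_idx, ], test = data[-train_idx, ])
}

# Training fraction for each ratio compared in this post
ratios <- c("60:40" = 0.60, "75:25" = 0.75, "50:50" = 0.50)

# Number of test rows left over for each ratio
test_sizes <- sapply(ratios, function(p) {
  nrow(stratified_split(iris, "Species", p)$test)
})
print(test_sizes)
```

The test set sizes match the totals of the confusion matrices in this post (60, 36, and 75 rows). A smaller test set makes each metric jump in larger steps, which is worth keeping in mind when comparing ratios.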
The results for both new splits and for the poly method are better📈 than the original model's, which suggests that the data split has an impact on model fit. 🎊Because we know that two of the classes (versicolor and virginica) overlap somewhat, we expected poly to be a better fit for the dataset. However, we did not test the statistical significance of those changes🙊.
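One rough way to check significance is Fisher's exact test on the counts of correct and incorrect test predictions. The sketch below is only an illustration under stated assumptions: the poly count (58 of 60) comes from the confusion matrix in this post, while the linear 60:40 count (56 of 60) is inferred from the reported accuracy of 0.9333 on 60 test rows. The two models were also scored on different random splits, so this is not a rigorous comparison.

```r
# Correct / incorrect test-set predictions for two models.
# Poly 60:40: 58 of 60 correct (from the confusion matrix in this post).
# Linear 60:40: 56 of 60 correct is an ASSUMPTION inferred from accuracy 0.9333.
counts <- matrix(c(56, 4,
                   58, 2),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(Model = c("Linear 60:40", "Poly 60:40"),
                                 Outcome = c("correct", "incorrect")))

# Fisher's exact test suits small counts like these
result <- fisher.test(counts)
print(result$p.value)
```

The p-value is far above 0.05, so on these counts alone the improvement cannot be distinguished from chance. Repeating each split many times with different seeds and comparing the accuracy distributions would be a more reliable check.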
# knitr provides kable() for rendering the table
library(knitr)
# one row per model, one column per metric
comparison <- matrix(c(0.9333, 0.9000, 0.7938, 0.8500, 0.9444, 0.9167, 0.8333, 1.0000,
                       0.9467, 0.9200, 0.9600, 0.8800, 0.9667, 0.9500, 0.9000, 1.0000),
                    ncol = 4, byrow = TRUE)
colnames(comparison) <- c("Accuracy", "Kappa", "Recall-Versi", "Recall-Virgi")
rownames(comparison) <- c("Linear 60:40", "Linear 75:25", "Linear 50:50", "NonLin 60:40")
comparison <- as.data.frame(comparison)
kable(comparison)
| | Accuracy | Kappa | Recall-Versi | Recall-Virgi |
|---|---|---|---|---|
| Linear 60:40 | 0.9333 | 0.9000 | 0.7938 | 0.85 |
| Linear 75:25 | 0.9444 | 0.9167 | 0.8333 | 1.00 |
| Linear 50:50 | 0.9467 | 0.9200 | 0.9600 | 0.88 |
| NonLin 60:40 | 0.9667 | 0.9500 | 0.9000 | 1.00 |
The SVM model can be used for identifying abnormalities such as fraudulent transactions, material misstatements, bankruptcies, abnormal reserves, etc. Auditors can utilize this machine learning model with other algorithms as well as financial ratios to improve their accuracy and efficiency.
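With the linear method and a 75:25 split, accuracy and Kappa both increased a little✨ bit
# packages used below (assumed installed): caret for the split, training, and
# confusionMatrix; dplyr for %>%, mutate, and if_else; ggplot2 for the plot
library(caret)
library(dplyr)
library(ggplot2)
# data split
iris1 <- iris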
train_index <- createDataPartition(iris1$Species, p = .75, list = FALSE, times = 1)
iris_train <- iris1[train_index,]
iris_test <- iris1[-train_index,]
# train
iris_svm_train <- train(
  form = factor(Species) ~.,
  data = iris_train,
  trControl = trainControl(method = "cv",
                           number = 10,
                           classProbs = TRUE),
  method = "svmLinear",
  preProcess = c("center", "scale"),
  tuneLength = 10
)
# iris_svm_train
summary(iris_svm_train)
Length  Class   Mode 
     1   ksvm     S4 
# predict
iris_svm_predict <- predict(iris_svm_train, iris_test, type = "prob")
iris_svm_predict <- cbind(iris_svm_predict, iris_test)
iris_svm_predict <- iris_svm_predict %>% 
  mutate(prediction = if_else(setosa > versicolor & setosa > virginica, "setosa",
                              if_else(versicolor > setosa & versicolor > virginica, "versicolor",
                                      if_else(virginica > setosa & virginica > versicolor, "virginica", "PROBLEM"))))
# table(iris_svm_predict$prediction)
confusionMatrix(factor(iris_svm_predict$prediction), factor(iris_svm_predict$Species))
Confusion Matrix and Statistics
            Reference
Prediction   setosa versicolor virginica
  setosa         12          0         0
  versicolor      0         10         0
  virginica       0          2        12
Overall Statistics
                                          
               Accuracy : 0.9444          
                 95% CI : (0.8134, 0.9932)
    No Information Rate : 0.3333          
    P-Value [Acc > NIR] : 1.728e-14       
                                          
                  Kappa : 0.9167          
                                          
 Mcnemar's Test P-Value : NA              
Statistics by Class:
                     Class: setosa Class: versicolor Class: virginica
Sensitivity                 1.0000            0.8333           1.0000
Specificity                 1.0000            1.0000           0.9167
Pos Pred Value              1.0000            1.0000           0.8571
Neg Pred Value              1.0000            0.9231           1.0000
Prevalence                  0.3333            0.3333           0.3333
Detection Rate              0.3333            0.2778           0.3333
Detection Prevalence        0.3333            0.2778           0.3889
Balanced Accuracy           1.0000            0.9167           0.9583
# rows of the training set that the fitted model uses as support vectors
sv1 <- iris_train[iris_svm_train$finalModel@SVindex,]
ggplot(data = iris_test, mapping = aes(x = Sepal.Width, y= Petal.Width, color = Species)) +
  geom_point(alpha = .5) +
  geom_point(data = iris_svm_predict, mapping = aes(x = Sepal.Width, y = Petal.Width, color = prediction), 
             shape = 6, size = 3) +
  geom_point(data = sv1, mapping = aes(x = Sepal.Width, y = Petal.Width), shape = 4, size = 4) +
  theme(legend.title = element_blank()) +
  ggtitle("Support Vector Machine")
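The nested `if_else` calls used above turn class probabilities into a label by picking the largest one. A shorter way to express the same idea is `max.col`, sketched here in base R on a small made-up probability table (the numbers are hypothetical, in the shape that `predict(..., type = "prob")` returns). Note one difference: the nested version flags exact ties as "PROBLEM", while this sketch resolves ties by taking the first column.

```r
# Hypothetical class probabilities, one row per test observation
probs <- data.frame(
  setosa     = c(0.90, 0.05, 0.02),
  versicolor = c(0.07, 0.80, 0.38),
  virginica  = c(0.03, 0.15, 0.60)
)

# max.col() gives the column index of the largest value in each row
best <- max.col(as.matrix(probs), ties.method = "first")

# Map the index back to the class name and keep all three factor levels
prediction <- factor(names(probs)[best], levels = names(probs))
print(prediction)
```

With caret, calling `predict` on the trained model without `type = "prob"` also returns the class labels directly, which avoids the conversion step when the probabilities themselves are not needed.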

With the linear method and a 50:50 split, accuracy and Kappa both increased a little, and versicolor’s recall increased a lot👏
# data split
train_index <- createDataPartition(iris1$Species, p = .5, list = FALSE, times = 1)
iris_train <- iris1[train_index,]
iris_test <- iris1[-train_index,]
# train
iris_svm_train <- train(
  form = factor(Species) ~.,
  data = iris_train,
  trControl = trainControl(method = "cv",
                           number = 10,
                           classProbs = TRUE),
  method = "svmLinear",
  preProcess = c("center", "scale"),
  tuneLength = 10
)
# iris_svm_train
summary(iris_svm_train)
Length  Class   Mode 
     1   ksvm     S4 
# predict
iris_svm_predict <- predict(iris_svm_train, iris_test, type = "prob")
iris_svm_predict <- cbind(iris_svm_predict, iris_test)
iris_svm_predict <- iris_svm_predict %>% 
  mutate(prediction = if_else(setosa > versicolor & setosa > virginica, "setosa",
                              if_else(versicolor > setosa & versicolor > virginica, "versicolor",
                                      if_else(virginica > setosa & virginica > versicolor, "virginica", "PROBLEM"))))
# table(iris_svm_predict$prediction)
confusionMatrix(factor(iris_svm_predict$prediction), factor(iris_test$Species))
Confusion Matrix and Statistics
            Reference
Prediction   setosa versicolor virginica
  setosa         25          0         0
  versicolor      0         24         3
  virginica       0          1        22
Overall Statistics
                                         
               Accuracy : 0.9467         
                 95% CI : (0.869, 0.9853)
    No Information Rate : 0.3333         
    P-Value [Acc > NIR] : < 2.2e-16      
                                         
                  Kappa : 0.92           
                                         
 Mcnemar's Test P-Value : NA             
Statistics by Class:
                     Class: setosa Class: versicolor Class: virginica
Sensitivity                 1.0000            0.9600           0.8800
Specificity                 1.0000            0.9400           0.9800
Pos Pred Value              1.0000            0.8889           0.9565
Neg Pred Value              1.0000            0.9792           0.9423
Prevalence                  0.3333            0.3333           0.3333
Detection Rate              0.3333            0.3200           0.2933
Detection Prevalence        0.3333            0.3600           0.3067
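Balanced Accuracy           1.0000            0.9500           0.9300
The poly method with the 60:40 split has the 🏆best overall performance (accuracy and Kappa) and perfect virginica recall, although its versicolor recall (0.90) is lower than that of the 50:50 linear split (0.96)
# data split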
train_index <- createDataPartition(iris1$Species, p = .6, list = FALSE, times = 1)
iris_train <- iris1[train_index,]
iris_test <- iris1[-train_index,]
# train
iris_svm_train <- train(
  form = factor(Species) ~.,
  data = iris_train,
  trControl = trainControl(method = "cv",
                           number = 10,
                           classProbs = TRUE),
  method = "svmPoly",
  preProcess = c("center", "scale"),
  tuneLength = 10
)
summary(iris_svm_train)
Length  Class   Mode 
     1   ksvm     S4 
# predict
iris_svm_predict <- predict(iris_svm_train, iris_test, type = "prob")
iris_svm_predict <- cbind(iris_svm_predict, iris_test)
iris_svm_predict <- iris_svm_predict %>% 
  mutate(prediction = if_else(setosa > versicolor & setosa > virginica, "setosa",
                              if_else(versicolor >setosa & versicolor > virginica, "versicolor",
                                      if_else(virginica > setosa & virginica > versicolor, "virginica", "PROBLEM"))))
# table(iris_svm_predict$prediction)
confusionMatrix(factor(iris_svm_predict$prediction), factor(iris_test$Species))
Confusion Matrix and Statistics
            Reference
Prediction   setosa versicolor virginica
  setosa         20          0         0
  versicolor      0         18         0
  virginica       0          2        20
Overall Statistics
                                          
               Accuracy : 0.9667          
                 95% CI : (0.8847, 0.9959)
    No Information Rate : 0.3333          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.95            
                                          
 Mcnemar's Test P-Value : NA              
Statistics by Class:
                     Class: setosa Class: versicolor Class: virginica
Sensitivity                 1.0000            0.9000           1.0000
Specificity                 1.0000            1.0000           0.9500
Pos Pred Value              1.0000            1.0000           0.9091
Neg Pred Value              1.0000            0.9524           1.0000
Prevalence                  0.3333            0.3333           0.3333
Detection Rate              0.3333            0.3000           0.3333
Detection Prevalence        0.3333            0.3000           0.3667
Balanced Accuracy           1.0000            0.9500           0.9750
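Kappa appears in every output above without explanation, so here is a short base R sketch of how Cohen's Kappa can be computed by hand. The counts are copied from the poly 60:40 confusion matrix just above; the variable names are my own. Kappa compares the observed agreement with the agreement expected by chance from the row and column totals.

```r
# Counts from the poly 60:40 confusion matrix in this post
# (rows = predicted class, columns = reference class)
classes <- c("setosa", "versicolor", "virginica")
cm <- matrix(c(20,  0,  0,
                0, 18,  0,
                0,  2, 20),
             nrow = 3, byrow = TRUE,
             dimnames = list(Prediction = classes, Reference = classes))

n <- sum(cm)

# Observed agreement: share of predictions on the diagonal (the accuracy)
p_observed <- sum(diag(cm)) / n

# Chance agreement: product of matching row and column totals, summed over classes
p_expected <- sum(rowSums(cm) * colSums(cm)) / n^2

kappa <- (p_observed - p_expected) / (1 - p_expected)
print(round(c(accuracy = p_observed, kappa = kappa), 4))
```

The result agrees with the Accuracy (0.9667) and Kappa (0.95) lines in the output above. Because the three classes are balanced, chance agreement is one third, the same value reported as the No Information Rate.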