Nepal¹ is tomorrow~

Map

Dataset

I got two datasets, peak ascents and deaths reports for Everest from The Himalayan Database and combined them into one dataset. Most other online databases only provide one or two recent year data for free, and I didn’t find more available data, so this dataset has limited variables for analysis. Probably could add some weather data like daily wind speed and average temperature as predictors. 👉National Geographic

something is wrong here, NA is more than 215, and the totals are not equal?

Show code

NA_cnt <- table(is.na(total))
NA_pct <- prop.table(NA_cnt)
cbind(NA_cnt, NA_pct)

      NA_cnt      NA_pct
FALSE 135076 0.998410833
TRUE     215 0.001589167

Show code

total <- VIM::kNN(total, 
                  variable = c("Age", "Oxy"),
                  k = 10)
total <- total[, -c(15:16)]
table(is.na(total))


 FALSE 
145698

Show code

counts <- table(total$Dth)
proportions <- prop.table(counts)
cbind(counts, proportions)

  counts proportions
0  10101  0.97059671
1    306  0.02940329

Plots

Age Male climbers’ age range is wider than females due to some outliers. We can see the oldest climbers are over 80. 50% of death in females are between the age of 30-50 while males are between 30-45. Both genders’ interquartile ranges are similar among success ascents groups.

Show code

ggplot(data = total, 
       mapping = aes(x = Sex,
                     y = Age,
                     fill = Dth), na.rm = TRUE) +
  labs(x = "Sex",
       y = "Age",
       title = "Age Distribution by Sex and Death",
       tag = "Fig. 1") +
  geom_boxplot(alpha = .3)

Causes More people died the altitude below 7500m because of natural disasters compared to the altitude above 7500m. Death zone² is defined as an altitude above 8000m, which can cause mountaineers to lose physical strength or even vital functions due to a low level of concentration of oxygen. I guess I could add some ratio variables for the death rates in different altitude zones? Like death zone and Khumbu Icefall?

=(Had issues with some columns even though I had already removed the missing values and it did work when I ran the chunk individually. However, it did not work when I knit the whole post. 😩

Show code

total %>% 
  drop_na(Altitude) %>% 
  ggplot(mapping = aes(x = Cause,
                       y = Altitude,
                       fill = Cause)) +
  labs(title = "Causes of Death by Altitude",
       tag = "Fig. 2") +
  geom_boxplot(alpha = .3) +
  theme(legend.position = "none",
        axis.text.x = element_text(angle = 90))

Oxy

Show code

order <- c(1, 2, 3, 4) 
total %>% 
  drop_na(Altitude) %>% 
  ggplot(mapping = aes(x = factor(Season, level = order),
                       y = Altitude,
                       fill = Oxy)) +
  labs(title = "Oxygen Consumption by Season",
       x = "Season",
       tag = "Fig. 3") +
  geom_boxplot(alpha = .3) +
  scale_x_discrete(labels = c("1" = "Spring", "2" = "Summer", "3" = "Fall", "4" = "Winter")) +
  scale_fill_discrete(labels = c("Yes", "No", "NA"))

Oxy by Age Most climbers are between 25-45 years old

Show code

total %>% 
  drop_na(Age) %>% 
  ggplot(mapping = aes(x = Age,
                     fill = Oxy)) +
  geom_histogram(position = "stack",
                 alpha = .6,
                 stat="count") +
  labs(title = "Histogram: Credit Limit vs Attrition",
       tag = "Fig. 4")

Oxy by Altitude Higher consumption of oxygen above 7500m altitude.

Show code

total %>% 
  drop_na(Altitude) %>% 
  ggplot(mapping = aes(x = Altitude,
                       group = Oxy,
                       fill = Oxy)) +
  geom_density(adjust = 1.5, alpha = .4) +
  labs(title = "Altitude vs Oxygen Consumption",
       tag = "Fig. 5")

Season Most expeditions happened in spring and some in the fall

Show code

order <- c(1, 2, 3, 4)
total %>% 
  drop_na(Altitude) %>% 
  ggplot(mapping = aes(x = Altitude,
                       fill = factor(Season, level = order))) +
  geom_histogram(position = "stack",
                 alpha = .5) +
  labs(title = "Altitude vs Season",
       tag = "Fig. 6") +
  scale_fill_discrete(name = "Season", labels = c("Spring", "Summer", "Fall", "Winter"))

Error: StatBin requires a continuous x variable: the x variable is discrete.Perhaps you want stat="count"?

Result

Show code

comparison <- matrix(c(0.9199, 0.2818, 0.6264, 0.9287, 
                       0.9327, 0.3152, 0.6044, 0.9426, 
                       0.9503, 0.3963, 0.6154, 0.9604),
                     ncol = 4, byrow = TRUE)
colnames(comparison) <- c("Accuracy", "Kappa", "Recall", "Sepecificity")
rownames(comparison) <- c("i.SMOTE", "ii.UP", "iii.DOWN")
comparison <- as.data.frame.matrix(comparison)
kable(comparison) %>% 
  row_spec(3, color = "white", background = "#bdaeea")

	Accuracy	Kappa	Recall	Sepecificity
i.SMOTE	0.9199	0.2818	0.6264	0.9287
ii.UP	0.9327	0.3152	0.6044	0.9426
iii.DOWN	0.9503	0.3963	0.6154	0.9604

Code

i. My logistic regression does not work=(

Show code

index <- createDataPartition(total1$Dth, p = .7, list = FALSE, times = 1)

train <- total1[index,]
test <- total1[-index,]

glm <- train(
  form = Dth ~ Age + factor(Oxy),
  data = train,
  trContrl =trainControl(method = "cv", number = 10),
  method = "glm",
  family = "binomial",
  na.action = na.omit
)

Something is wrong; all the Accuracy metric values are missing:
    Accuracy       Kappa    
 Min.   : NA   Min.   : NA  
 1st Qu.: NA   1st Qu.: NA  
 Median : NA   Median : NA  
 Mean   :NaN   Mean   :NaN  
 3rd Qu.: NA   3rd Qu.: NA  
 Max.   : NA   Max.   : NA  
 NA's   :1     NA's   :1

Error: Stopping

ii. knn with SMOTE

Show code

knn1 <- train( 
  Dth ~ factor(Season) + factor(Sex) + Age + factor(Oxy),
  data = train,
  method = "knn",
  trControl = trainControl(method = "cv", number = 10,
                           sampling = "smote"),
  preProcess = c("center", "scale"),
  tuneLength = 10,
  na.action = na.omit
)

plot(knn1)

Show code

knn1$bestTune

    k
10 23

Show code

pred1 <- predict(knn1, test)
confusionMatrix(pred1, test$Dth,
                positive = "1")

Confusion Matrix and Statistics

          Reference
Prediction    0    1
         0 2814   34
         1  216   57
                                          
               Accuracy : 0.9199          
                 95% CI : (0.9098, 0.9292)
    No Information Rate : 0.9708          
    P-Value [Acc > NIR] : 1               
                                          
                  Kappa : 0.2818          
                                          
 Mcnemar's Test P-Value : <2e-16          
                                          
            Sensitivity : 0.62637         
            Specificity : 0.92871         
         Pos Pred Value : 0.20879         
         Neg Pred Value : 0.98806         
             Prevalence : 0.02916         
         Detection Rate : 0.01826         
   Detection Prevalence : 0.08747         
      Balanced Accuracy : 0.77754         
                                          
       'Positive' Class : 1

iii. knn with Up

Show code

knn3 <- train( 
  Dth ~ factor(Season) + factor(Sex) + Age + factor(Oxy),
  data = train,
  method = "knn",
  trControl = trainControl(method = "cv", number = 10,
                           sampling = "up"),
  preProcess = c("center", "scale"),
  tuneLength = 10
)

plot(knn3)

Show code

knn3$bestTune

  k
1 5

Show code

pred3 <- predict(knn3, test)
confusionMatrix(pred3, test$Dth,
                positive = "1")

Confusion Matrix and Statistics

          Reference
Prediction    0    1
         0 2856   36
         1  174   55
                                          
               Accuracy : 0.9327          
                 95% CI : (0.9234, 0.9413)
    No Information Rate : 0.9708          
    P-Value [Acc > NIR] : 1               
                                          
                  Kappa : 0.3152          
                                          
 Mcnemar's Test P-Value : <2e-16          
                                          
            Sensitivity : 0.60440         
            Specificity : 0.94257         
         Pos Pred Value : 0.24017         
         Neg Pred Value : 0.98755         
             Prevalence : 0.02916         
         Detection Rate : 0.01762         
   Detection Prevalence : 0.07337         
      Balanced Accuracy : 0.77348         
                                          
       'Positive' Class : 1

iv. knn with Down

Show code

knn4 <- train( 
  Dth ~ factor(Season) + factor(Sex) + Age + factor(Oxy),
  data = train,
  method = "knn",
  trControl = trainControl(method = "cv", number = 10,
                           sampling = "down"),
  preProcess = c("center", "scale"),
  tuneLength = 10,
  na.action = na.omit
)

plot(knn4)

Show code

knn4$bestTune

    k
10 23

Show code

pred4 <- predict(knn4, test)
confusionMatrix(pred4, test$Dth,
                positive = "1")

Confusion Matrix and Statistics

          Reference
Prediction    0    1
         0 2910   35
         1  120   56
                                          
               Accuracy : 0.9503          
                 95% CI : (0.9421, 0.9577)
    No Information Rate : 0.9708          
    P-Value [Acc > NIR] : 1               
                                          
                  Kappa : 0.3963          
                                          
 Mcnemar's Test P-Value : 1.509e-11       
                                          
            Sensitivity : 0.61538         
            Specificity : 0.96040         
         Pos Pred Value : 0.31818         
         Neg Pred Value : 0.98812         
             Prevalence : 0.02916         
         Detection Rate : 0.01794         
   Detection Prevalence : 0.05639         
      Balanced Accuracy : 0.78789         
                                          
       'Positive' Class : 1

R.I.P all the bodies and Hillary Step

Mt. Everest

Map

Dataset

Plots

Result

Code

i. My logistic regression does not work=(

ii. knn with SMOTE

iii. knn with Up

iv. knn with Down