Importing the necessary libraries.

library(mlbench)
library(ggplot2)
library(ggforce)
data <- read.csv("proj518.csv", header = TRUE, sep = ";")
head(data)
names(data)
## [1] "lifetime"       "broken"         "pressureInd"    "moistureInd"   
## [5] "temperatureInd" "team"           "provider"
print(ncol(data))
## [1] 7
print(nrow(data))
## [1] 1000
str(data)
## 'data.frame':    1000 obs. of  7 variables:
##  $ lifetime      : int  56 81 60 86 34 30 68 65 23 81 ...
##  $ broken        : int  0 1 0 1 0 0 0 1 0 1 ...
##  $ pressureInd   : num  92.2 72.1 96.3 94.4 97.8 ...
##  $ moistureInd   : num  104.2 103.1 77.8 108.5 99.4 ...
##  $ temperatureInd: num  96.5 87.3 112.2 72 103.8 ...
##  $ team          : chr  "TeamA" "TeamC" "TeamA" "TeamC" ...
##  $ provider      : chr  "Provider4" "Provider4" "Provider1" "Provider2" ...

We have five numerical columns in total, two of which have an integer datatype. broken is our target column; it is a categorical column with only two classes, “0” and “1”.

“1” stands for broken, “0” stands for not broken.
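A quick way to see how the two classes are balanced (an optional sanity check, not part of the original workflow):

table(data$broken)
prop.table(table(data$broken))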

Now, let’s get the statistical summary of the dataset.

summary(data)  
##     lifetime        broken       pressureInd      moistureInd    
##  Min.   : 1.0   Min.   :0.000   Min.   : 33.48   Min.   : 58.55  
##  1st Qu.:34.0   1st Qu.:0.000   1st Qu.: 85.56   1st Qu.: 92.77  
##  Median :60.0   Median :0.000   Median : 97.22   Median : 99.43  
##  Mean   :55.2   Mean   :0.397   Mean   : 98.60   Mean   : 99.38  
##  3rd Qu.:80.0   3rd Qu.:1.000   3rd Qu.:112.25   3rd Qu.:106.12  
##  Max.   :93.0   Max.   :1.000   Max.   :173.28   Max.   :128.60  
##  temperatureInd       team             provider        
##  Min.   : 42.28   Length:1000        Length:1000       
##  1st Qu.: 87.68   Class :character   Class :character  
##  Median :100.59   Mode  :character   Mode  :character  
##  Mean   :100.63                                        
##  3rd Qu.:113.66                                        
##  Max.   :172.54

We have two categorical columns among the features: “team” and “provider”. Let’s check the categories of each.

unique(data$team)
## [1] "TeamA" "TeamC" "TeamB"
unique(data$provider)
## [1] "Provider4" "Provider1" "Provider2" "Provider3"

Let’s encode these columns as numerically labelled factor columns.

data$team = factor(data$team,
                   levels = c('TeamA', 'TeamC', 'TeamB'),
                   labels = c(0, 1, 2))
data$provider = factor(data$provider,
                       levels = c('Provider4', 'Provider1', 'Provider2', 'Provider3'),
                       labels = c(0, 1, 2, 3))

Let’s check the data again to make sure the changes were applied.

head(data)

And yes, the encodings were applied to the dataset. Now we can test whether these columns have any direct relation with the target.
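If you want to confirm the conversion explicitly, str() and levels() show the new factor encoding (an optional check):

str(data[, c("team", "provider")])
levels(data$provider)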

But before that, let’s check the distribution of the data with respect to the target values (i.e. “0” or “1”) and the outliers present in our dataset.

plot( factor(data$broken), data$moistureInd )

Moisture is distributed almost identically across both categories, with only slight differences in the median, Q1, Q3, etc.
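Since ggplot2 is already loaded, the same comparison can also be drawn with it (an equivalent sketch, not required for the analysis):

ggplot(data, aes(x = factor(broken), y = moistureInd)) +
  geom_boxplot() +
  labs(x = "broken", y = "moistureInd")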

plot(factor(data$broken), data$pressureInd)

plot(factor(data$broken),data$temperatureInd)

plot(factor(data$broken),data$lifetime)

In the above graph we can clearly see the relationship between the lifetime of a machine and its expected state. For machines whose lifetime has exceeded 60, there is a good chance they will break; of course, other factors will play their part in deciding that.
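To back this up numerically, we can compare the lifetime summaries per class (a quick optional check):

tapply(data$lifetime, data$broken, summary)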

It is evident from the above graphs that there are outliers present in all the numerical columns except lifetime; we shall remove these outliers.

Let’s create functions to remove outliers.

outliers <- function(x) {
  # Flag values lying more than 1.5 * IQR beyond the quartiles
  Q1 <- quantile(x, probs = .25)
  Q3 <- quantile(x, probs = .75)
  iqr <- Q3 - Q1

  upper_limit <- Q3 + (iqr * 1.5)
  lower_limit <- Q1 - (iqr * 1.5)

  x > upper_limit | x < lower_limit
}

remove_outliers <- function(df, cols = names(df)) {
  # Drop rows that are flagged as outliers in any of the given columns
  for (col in cols) {
    df <- df[!outliers(df[[col]]), ]
  }
  df
}
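Before applying them, it can be useful to see how many rows each column would flag, using the outliers() helper defined above (a quick optional check):

sapply(data[, c("pressureInd", "moistureInd", "temperatureInd")],
       function(col) sum(outliers(col)))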

Now let’s apply these functions to remove the outliers from the columns we found them in.

data = remove_outliers(data, c("lifetime","pressureInd","moistureInd","temperatureInd"))
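You can check how many rows remain after the removal (the exact count depends on the data):

nrow(data)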

Now you can go back to the plotting section and check that the outliers are gone.

train <- data[, which(names(data) != "broken")]
head(train)
# Creating the train/test partition.
library(caTools)
## Warning: package 'caTools' was built under R version 4.2.1
sample <- sample.split(data$broken, SplitRatio = 0.7)
train_data <- subset(data, sample == TRUE)
test_data <- subset(data, sample == FALSE)
str(test_data)  
## 'data.frame':    294 obs. of  7 variables:
##  $ lifetime      : int  34 29 82 80 74 35 79 53 60 13 ...
##  $ broken        : int  0 0 1 1 1 0 1 0 1 0 ...
##  $ pressureInd   : num  97.8 67.8 103.1 88.4 100 ...
##  $ moistureInd   : num  99.4 96.1 103.7 97.4 103.5 ...
##  $ temperatureInd: num  103.8 122.4 79.5 89.3 96.9 ...
##  $ team          : Factor w/ 3 levels "0","1","2": 3 1 2 3 2 2 3 1 2 2 ...
##  $ provider      : Factor w/ 4 levels "0","1","2","3": 2 2 1 2 2 3 2 4 4 2 ...
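Note that sample.split() draws a random partition, so the exact rows in train_data and test_data will differ between runs. For a reproducible split, you could set a seed just before the call above, for example:

set.seed(42)   # any fixed value; 42 here is just an illustrative choice
sample <- sample.split(data$broken, SplitRatio = 0.7)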
plot(data$lifetime,data$provider, pch = 21,
     bg = "red",   # Fill color
     col = "blue", # Border color
     cex = 1,      # Symbol size
     lwd = 1)      # Border width

There is some relation between provider and lifetime. Specifically, machines from “Provider3” do not reach the 80 lifetime mark while the others do, and machines from “Provider2” go even further.
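Since ggplot2 is loaded, the same relationship can also be viewed as boxplots of lifetime per provider (an optional sketch; provider here holds the encoded labels 0–3):

ggplot(data, aes(x = provider, y = lifetime)) +
  geom_boxplot()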

plot(data$lifetime,data$broken, pch = 21,
     bg = "red",   # Fill color
     col = "blue", # Border color
     cex = 1,      # Symbol size
     lwd = 1)      # Border width

It is evident from the data that machines tend to die out after they have crossed the 60 mark.
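As a rough numerical check of the same point, we can compare the breakage rate above and below the 60 mark (broken is 0/1, so the mean is the proportion of broken machines):

mean(data$broken[data$lifetime > 60])
mean(data$broken[data$lifetime <= 60])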

Now, let’s try out a logistic regression model.

simple_logistic_model <- glm(data = train_data,
                            broken ~ lifetime + pressureInd +  moistureInd + temperatureInd,
                            family = binomial())
summary(simple_logistic_model)
## 
## Call:
## glm(formula = broken ~ lifetime + pressureInd + moistureInd + 
##     temperatureInd, family = binomial(), data = train_data)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.38865  -0.29489  -0.04082   0.48761   1.68142  
## 
## Coefficients:
##                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    -9.072333   1.792058  -5.063 4.14e-07 ***
## lifetime        0.138270   0.011131  12.422  < 2e-16 ***
## pressureInd    -0.004009   0.007115  -0.563    0.573    
## moistureInd    -0.002661   0.012970  -0.205    0.837    
## temperatureInd  0.005389   0.006909   0.780    0.435    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 919.37  on 683  degrees of freedom
## Residual deviance: 398.50  on 679  degrees of freedom
## AIC: 408.5
## 
## Number of Fisher Scoring iterations: 7
probs <- predict(simple_logistic_model, test_data, type="response")
pred_target <- ifelse(probs > 0.5, 1, 0)
pred_target
##   5  12  15  16  21  24  28  29  31  46  52  58  65  68  71  73  74  77  80  81 
##   0   0   1   1   1   0   1   0   0   0   0   1   0   0   0   1   0   1   1   0 
##  83  84  86  90  91  92  93  97  99 105 107 110 113 120 125 128 129 138 140 143 
##   0   0   1   0   0   1   1   0   0   0   1   0   0   0   0   0   0   1   0   1 
## 149 151 152 153 156 159 160 161 170 173 175 182 187 188 191 195 196 202 203 204 
##   1   1   0   0   1   0   1   0   1   0   0   1   0   0   0   1   0   0   0   0 
## 205 209 214 218 221 222 223 229 231 239 241 249 251 252 255 259 265 268 273 276 
##   0   0   0   0   0   0   1   0   0   0   1   0   0   1   0   0   0   0   0   1 
## 278 284 285 286 289 295 301 306 307 314 321 324 336 337 338 341 350 355 362 363 
##   0   1   1   0   1   0   0   0   1   0   0   1   1   0   0   0   0   1   0   0 
## 365 367 375 376 381 393 394 401 403 405 407 408 413 414 418 420 424 433 434 436 
##   0   1   0   0   1   0   0   0   0   1   1   0   1   1   0   0   1   0   0   1 
## 437 438 444 452 454 455 456 461 463 466 468 470 471 473 474 482 483 492 493 498 
##   0   0   0   1   0   1   0   0   1   1   1   0   0   1   1   0   1   0   0   1 
## 500 502 503 504 506 508 516 521 537 544 545 548 560 561 563 568 572 576 578 581 
##   1   1   0   1   0   1   1   0   0   0   0   0   0   0   1   0   1   0   0   0 
## 587 593 594 595 596 602 612 615 616 619 620 623 629 633 635 636 637 644 647 651 
##   0   1   1   1   1   1   1   0   0   0   1   1   0   1   0   1   0   1   0   0 
## 656 657 659 660 666 670 675 684 689 691 693 695 701 703 706 708 709 714 716 718 
##   0   0   0   0   1   0   1   1   0   0   1   0   1   1   0   0   0   0   0   1 
## 725 727 728 730 734 736 738 740 741 742 743 745 747 759 760 764 768 769 771 772 
##   1   0   1   0   0   0   0   0   0   0   0   1   1   0   0   0   1   1   1   1 
## 784 788 789 794 796 797 799 802 804 808 809 811 812 813 814 825 828 837 838 840 
##   1   1   0   0   0   0   0   1   1   0   0   1   1   0   1   0   1   0   0   1 
## 841 844 847 850 851 852 853 856 857 859 861 863 865 867 869 870 872 873 874 877 
##   1   0   0   0   0   0   1   1   0   1   0   0   0   0   1   1   0   0   1   0 
## 879 880 883 890 893 894 895 902 905 908 912 919 921 925 931 932 937 939 943 944 
##   0   0   0   0   1   0   0   1   1   0   0   1   1   0   0   0   0   0   0   0 
## 951 953 956 965 966 971 975 977 983 985 986 988 995 999 
##   0   0   0   1   0   1   0   1   0   0   0   1   1   1

Now, let’s check the accuracy of our model to see how well it is performing.

# Model accuracy
mean(pred_target == test_data$broken)*100
## [1] 77.55102

We got an accuracy of about 78% for our model applied to the cleaned dataset.
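Beyond overall accuracy, a confusion matrix shows how the errors split between false positives and false negatives (a quick optional check with base table()):

table(Predicted = pred_target, Actual = test_data$broken)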

Now, you can try adding or removing some features and see what effect it has on the model.
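For example, a minimal sketch of a variant that also includes the encoded team and provider factors (a hypothetical extension, not part of the model above):

full_model <- glm(broken ~ lifetime + pressureInd + moistureInd +
                    temperatureInd + team + provider,
                  data = train_data, family = binomial())
summary(full_model)
mean(ifelse(predict(full_model, test_data, type = "response") > 0.5, 1, 0) ==
       test_data$broken) * 100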