library(mlbench)
library(ggplot2)
library(ggforce)
data <- read.csv("proj518.csv", header = TRUE, sep = ";")
head(data)
names(data)
## [1] "lifetime" "broken" "pressureInd" "moistureInd"
## [5] "temperatureInd" "team" "provider"
print(ncol(data))
## [1] 7
print(nrow(data))
## [1] 1000
str(data)
## 'data.frame': 1000 obs. of 7 variables:
## $ lifetime : int 56 81 60 86 34 30 68 65 23 81 ...
## $ broken : int 0 1 0 1 0 0 0 1 0 1 ...
## $ pressureInd : num 92.2 72.1 96.3 94.4 97.8 ...
## $ moistureInd : num 104.2 103.1 77.8 108.5 99.4 ...
## $ temperatureInd: num 96.5 87.3 112.2 72 103.8 ...
## $ team : chr "TeamA" "TeamC" "TeamA" "TeamC" ...
## $ provider : chr "Provider4" "Provider4" "Provider1" "Provider2" ...
We have five numerical columns in total, two of which have integer datatype. broken is our target column: a categorical column with only two classes, "0" and "1".
"1" stands for broken, "0" stands for not broken.
summary(data)
## lifetime broken pressureInd moistureInd
## Min. : 1.0 Min. :0.000 Min. : 33.48 Min. : 58.55
## 1st Qu.:34.0 1st Qu.:0.000 1st Qu.: 85.56 1st Qu.: 92.77
## Median :60.0 Median :0.000 Median : 97.22 Median : 99.43
## Mean :55.2 Mean :0.397 Mean : 98.60 Mean : 99.38
## 3rd Qu.:80.0 3rd Qu.:1.000 3rd Qu.:112.25 3rd Qu.:106.12
## Max. :93.0 Max. :1.000 Max. :173.28 Max. :128.60
## temperatureInd team provider
## Min. : 42.28 Length:1000 Length:1000
## 1st Qu.: 87.68 Class :character Class :character
## Median :100.59 Mode :character Mode :character
## Mean :100.63
## 3rd Qu.:113.66
## Max. :172.54
unique(data$team)
## [1] "TeamA" "TeamC" "TeamB"
unique(data$provider)
## [1] "Provider4" "Provider1" "Provider2" "Provider3"
data$team = factor(data$team,
levels = c('TeamA', 'TeamC', 'TeamB'), labels = c(0, 1, 2))
data$provider = factor(data$provider,
levels = c('Provider4', 'Provider1', 'Provider2','Provider3'),
labels = c(0, 1, 2, 3))
head(data)
The encodings were applied to the dataset. Now we can test whether these columns have any direct relation with the target.
But before that, let's check the distribution of the data with respect to the target values (i.e. "0" or "1") and the outliers present in our dataset.
plot( factor(data$broken), data$moistureInd )
Moisture is distributed nearly equally across both categories, with only slight differences in the mean, Q1, Q3, etc.
plot(factor(data$broken), data$pressureInd)
plot(factor(data$broken),data$temperatureInd)
plot(factor(data$broken),data$lifetime)
In the above graph we can clearly see the relationship between the lifetime of a machine and its expected state. For machines whose lifetime has exceeded 60 there is a clear chance of breaking, although of course other factors play their part in deciding that.
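One way to quantify this visual impression is to compare the observed break rate on either side of the 60 mark. A minimal sketch on a toy data frame (the column names mirror the dataset, but the values here are made up):

```r
# Toy stand-in for the real data frame: lifetime and a 0/1 broken flag
toy <- data.frame(
  lifetime = c(10, 25, 40, 55, 62, 70, 78, 85, 90, 35),
  broken   = c( 0,  0,  0,  0,  1,  1,  0,  1,  1,  0)
)

# Break rate on each side of the 60 threshold
rate_below <- mean(toy$broken[toy$lifetime <= 60])
rate_above <- mean(toy$broken[toy$lifetime > 60])
c(below = rate_below, above = rate_above)
```

On the real data, replacing toy with data gives the actual rates.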
It is evident from the above graphs that there are outliers present in the numerical columns except lifetime; we shall remove these outliers.
outliers <- function(x) {
  Q1 <- quantile(x, probs = .25)
  Q3 <- quantile(x, probs = .75)
  iqr <- Q3 - Q1
  upper_limit <- Q3 + 1.5 * iqr
  lower_limit <- Q1 - 1.5 * iqr
  x > upper_limit | x < lower_limit
}
remove_outliers <- function(df, cols = names(df)) {
  for (col in cols) {
    df <- df[!outliers(df[[col]]), ]
  }
  df
}
data <- remove_outliers(data, c("lifetime", "pressureInd", "moistureInd", "temperatureInd"))
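As a quick sanity check, the IQR rule above can be exercised on a toy vector with one obvious outlier (the helper is repeated here so the snippet runs standalone):

```r
# Same 1.5 * IQR fence rule as the outliers() helper above
outliers <- function(x) {
  Q1 <- quantile(x, probs = .25)
  Q3 <- quantile(x, probs = .75)
  iqr <- Q3 - Q1
  x > Q3 + 1.5 * iqr | x < Q1 - 1.5 * iqr
}

v <- c(10, 11, 12, 13, 14, 100)  # 100 sits far outside the fences
outliers(v)  # only the last element is flagged
```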
train <- data[, which(names(data) != "broken")]
head(train)
# Creating the train/test partition.
library(caTools)
## Warning: package 'caTools' was built under R version 4.2.1
sample <- sample.split(data$broken, SplitRatio = 0.7)
train_data <- subset(data, sample == TRUE)
test_data <- subset(data, sample == FALSE)
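Note that sample.split stratifies on the label, so the train and test subsets keep roughly the same proportion of broken machines. The same idea can be sketched in base R by sampling 70% within each class separately (the vector y below is made up to mimic our roughly 40% break rate):

```r
set.seed(42)
y <- rep(c(0, 1), times = c(60, 40))  # toy labels: 40% broken

# Stratified 70/30 split: sample indices within each class separately
idx <- unlist(lapply(split(seq_along(y), y),
                     function(i) sample(i, size = round(0.7 * length(i)))))

train_y <- y[idx]
test_y  <- y[-idx]
round(c(train = mean(train_y), test = mean(test_y)), 2)  # both stay at 0.4
```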
str(test_data)
## 'data.frame': 294 obs. of 7 variables:
## $ lifetime : int 34 29 82 80 74 35 79 53 60 13 ...
## $ broken : int 0 0 1 1 1 0 1 0 1 0 ...
## $ pressureInd : num 97.8 67.8 103.1 88.4 100 ...
## $ moistureInd : num 99.4 96.1 103.7 97.4 103.5 ...
## $ temperatureInd: num 103.8 122.4 79.5 89.3 96.9 ...
## $ team : Factor w/ 3 levels "0","1","2": 3 1 2 3 2 2 3 1 2 2 ...
## $ provider : Factor w/ 4 levels "0","1","2","3": 2 2 1 2 2 3 2 4 4 2 ...
plot(data$lifetime,data$provider, pch = 21,
bg = "red", # Fill color
col = "blue", # Border color
cex = 1, # Symbol size
lwd = 1) # Border width
There is some relation between provider and lifetime. Specifically, machines from "Provider3" do not reach the 80-year lifetime mark while others do, and machines from "Provider2" go even further.
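This can be checked numerically with the per-provider maximum lifetime. A minimal sketch on made-up values; on the real data you would pass data$lifetime and data$provider instead:

```r
# Toy stand-in for the provider / lifetime columns
toy <- data.frame(
  provider = c("Provider1", "Provider1", "Provider2", "Provider2",
               "Provider3", "Provider3", "Provider4"),
  lifetime = c(81, 60, 92, 88, 70, 74, 85)
)

# Maximum observed lifetime per provider
tapply(toy$lifetime, toy$provider, max)
```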
plot(data$lifetime,data$broken, pch = 21,
bg = "red", # Fill color
col = "blue", # Border color
cex = 1, # Symbol size
lwd = 1) # Border width
It is evident from the data that machines tend to break down after they have crossed the 60 mark.
simple_logistic_model <- glm(data = train_data,
broken ~ lifetime + pressureInd + moistureInd + temperatureInd,
family = binomial())
summary(simple_logistic_model)
##
## Call:
## glm(formula = broken ~ lifetime + pressureInd + moistureInd +
## temperatureInd, family = binomial(), data = train_data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.38865 -0.29489 -0.04082 0.48761 1.68142
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -9.072333 1.792058 -5.063 4.14e-07 ***
## lifetime 0.138270 0.011131 12.422 < 2e-16 ***
## pressureInd -0.004009 0.007115 -0.563 0.573
## moistureInd -0.002661 0.012970 -0.205 0.837
## temperatureInd 0.005389 0.006909 0.780 0.435
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 919.37 on 683 degrees of freedom
## Residual deviance: 398.50 on 679 degrees of freedom
## AIC: 408.5
##
## Number of Fisher Scoring iterations: 7
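lifetime is the only strongly significant predictor. Because this is a logistic model, exponentiating its coefficient gives the multiplicative change in the odds of breaking per extra unit of lifetime (the value below is copied from the summary above):

```r
# Coefficient of lifetime from the model summary
lifetime_coef <- 0.138270

# Each extra unit of lifetime multiplies the odds of breaking by ~1.148
round(exp(lifetime_coef), 3)
```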
probs <- predict(simple_logistic_model, test_data, type="response")
pred_target <- ifelse(probs > 0.5, 1, 0)
head(pred_target, 20)
## 5 12 15 16 21 24 28 29 31 46 52 58 65 68 71 73 74 77 80 81
## 0 0 1 1 1 0 1 0 0 0 0 1 0 0 0 1 0 1 1 0
Now, let's check the accuracy of our model to see how well it performs.
# Model accuracy
mean(pred_target == test_data$broken)*100
## [1] 77.55102
We got an accuracy of roughly 78% for our model on the cleaned dataset.
Now you can try adding or removing some features and see what effect it has on the model.
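Accuracy alone can be misleading, so it is also worth looking at the confusion matrix and per-class metrics. A minimal sketch on made-up label vectors; on the real data you would pass pred_target and test_data$broken instead:

```r
# Toy predicted / actual labels standing in for pred_target and test_data$broken
pred   <- c(0, 1, 1, 0, 1, 0, 0, 1)
actual <- c(0, 1, 0, 0, 1, 1, 0, 1)

cm <- table(predicted = pred, actual = actual)
cm

accuracy  <- mean(pred == actual)           # 0.75
precision <- cm["1", "1"] / sum(cm["1", ])  # TP / (TP + FP) = 0.75
recall    <- cm["1", "1"] / sum(cm[, "1"])  # TP / (TP + FN) = 0.75
```

Precision tells us how many predicted breakdowns were real; recall tells us how many real breakdowns we caught, which usually matters more in a maintenance setting.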