Practical Machine Learning, Project July ‘15
Kate Sergeeva
Project summary
The goal of this project is to use machine learning techniques to find the best model for predicting how well athletes perform particular types of exercises and for catching common mistakes.
The data for this project was kindly provided by a group of enthusiasts who regularly take measurements about themselves using devices such as Jawbone Up, Nike FuelBand, and Fitbit in order to find patterns in their behavior and improve their health.
We will use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants. They were asked to perform barbell lifts correctly and incorrectly in 5 different ways.
More information is available from the website here: http://groupware.les.inf.puc-rio.br/har (see the section on the Weight Lifting Exercise Dataset).
The project report describes the following:
- how the model was built,
- how the cross-validation was used,
- how the out-of-sample error was estimated,
- how other choices were made.
It also provides predictions for 20 different test cases.
Contents
- Loading and cleansing data
- Building models
- Cross-validation to measure “out of sample error”
- Model selection
- Reporting the results of test case processing
Loading and cleansing data
# Load the raw training and test data
train <- read.csv(file = "/Users/macbookpro/Downloads/pml-training.csv", header = TRUE, sep = ",")
test <- read.csv(file = "/Users/macbookpro/Downloads/pml-testing.csv", header = TRUE, sep = ",")
Removing meaningless variables
The test cases require predictions for particular moments in time, with no connection to the style of a particular athlete. This allows us to simply exclude the athlete's name, all time-series variables, and the row numbers from further consideration.
modelTrain <- subset(train, select = -c(user_name, raw_timestamp_part_1, raw_timestamp_part_2,
cvtd_timestamp, new_window, num_window, X))
Handling missing values
A closer look at the data showed that it contains some spurious symbols as well as true missing values, which need to be handled with replacement and imputation techniques.
# Replace empty strings with NA
modelTrain[modelTrain == ""] <- NA
# The data appears to come from Excel or a similar tool and contains "#DIV/0!" artifacts
modelTrain[modelTrain == "#DIV/0!"] <- NA
# Count NAs per column and flag columns that are entirely NA
naVars <- data.frame(naCount = apply(modelTrain, 2, function(x) sum(is.na(x))))
naVars$toRemove <- naVars$naCount == nrow(modelTrain)
columnsToExclude <- rownames(naVars[naVars$toRemove, ])
columnsToExclude
## [1] "kurtosis_yaw_belt" "skewness_yaw_belt" "kurtosis_yaw_dumbbell"
## [4] "skewness_yaw_dumbbell" "kurtosis_yaw_forearm" "skewness_yaw_forearm"
# Remove empty predictors
modelTrain <- subset(modelTrain, select = -c(kurtosis_yaw_belt, skewness_yaw_belt,
kurtosis_yaw_dumbbell, skewness_yaw_dumbbell,
kurtosis_yaw_forearm, skewness_yaw_forearm ))
Fixing column classes
Before any modeling it is a good idea to check the variable classes and make sure they all have the correct data type.
# Convert character, integer, and factor columns to integer
library(taRifx)
modelTrain <- japply(modelTrain, which(sapply(modelTrain, class)=="character"),as.integer)
modelTrain <- japply(modelTrain, which(sapply(modelTrain, class)=="integer"),as.integer)
modelTrain <- japply(modelTrain, which(sapply(modelTrain, class)=="factor"),as.integer)
# Convert classe to factor
modelTrain$classe <- as.factor(modelTrain$classe)
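As a quick sanity check (a small illustrative addition, not part of the original analysis), we can confirm that after these conversions only integer predictors plus the factor outcome remain:
# Tabulate the resulting column classes: expect integer predictors and one factor (classe)
table(sapply(modelTrain, class))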
Tricky moment
The current number of variables will most likely make model training very slow, so it is better to reduce the dimensionality first. Instead of using the nearZeroVar function, I decided to use a simple cut-off on the number of missing values, dropping the columns where there is essentially nothing left to impute.
library(e1071)
dim(modelTrain)
## [1] 19622 147
# Replace the remaining NAs with 0, then drop columns that are almost entirely zeros
modelTrain[is.na(modelTrain)] <- 0
colAn <- data.frame(colSums(subset(modelTrain, select = -classe) == 0))
modelTrain <- modelTrain[, -which(colAn[, 1] > 19000)]
dim(modelTrain)
## [1] 19622 53
Now we have a clean, tidy data set with a manageable number of predictors, ready for modeling.
Building models
Before going deep into modeling, it is worth taking care of your computer's throughput and enabling parallel processing where possible.
# Do parallel
library(doParallel)
registerDoParallel(cores=4)
Subsetting data for cross-validation
The idea of cross-validation is simple and undoubtedly useful: split the data into two separate parts (around 70/30), train the model on the larger part, and then estimate the out-of-sample error on the smaller one. The scheme can be made more elaborate with multiple splits and repeated resampling, but given our fairly large data set with a manageable number of predictors, I will stick to a single random 70/30 split.
library(caret)
## Loading required package: ggplot2
inTrain <- createDataPartition(y = modelTrain$classe, p = 0.7, list = FALSE)
training <- modelTrain[inTrain, ]
testing <- modelTrain[-inTrain, ]
dim(training); dim(testing)
## [1] 13737 53
## [1] 5885 53
Modeling
For a classification task with more than two outcome classes, tree-based algorithms are convenient, so two different tree-based models will be trained: a regular classification tree (the "rpart" method) and a random forest.
library(caret)
library(randomForest)
## randomForest 4.6-10
## Type rfNews() to see new features/changes/bug fixes.
library(rattle)
modelRPART <- train(classe ~ ., method = "rpart", data = training, na.action = na.pass)
print(modelRPART$finalModel)
## n= 13737
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 13737 9831 1 (0.28 0.19 0.17 0.16 0.18)
## 2) roll_belt< 130.5 12600 8706 1 (0.31 0.21 0.19 0.18 0.11)
## 4) pitch_forearm< -33.35 1101 10 1 (0.99 0.0091 0 0 0) *
## 5) pitch_forearm>=-33.35 11499 8696 1 (0.24 0.23 0.21 0.2 0.12)
## 10) magnet_dumbbell_y< 439.5 9718 6974 1 (0.28 0.18 0.24 0.19 0.11)
## 20) roll_forearm< 123.5 6043 3582 1 (0.41 0.18 0.18 0.17 0.062) *
## 21) roll_forearm>=123.5 3675 2457 3 (0.077 0.18 0.33 0.23 0.19) *
## 11) magnet_dumbbell_y>=439.5 1781 880 2 (0.033 0.51 0.048 0.22 0.19) *
## 3) roll_belt>=130.5 1137 12 5 (0.011 0 0 0 0.99) *
predRPART <- predict(modelRPART, subset(testing, select = -classe))
confusionMatrix(predRPART, testing$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 1 2 3 4 5
## 1 1533 478 495 437 147
## 2 22 385 22 174 145
## 3 117 276 509 353 284
## 4 0 0 0 0 0
## 5 2 0 0 0 506
##
## Overall Statistics
##
## Accuracy : 0.4984
## 95% CI : (0.4855, 0.5112)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.3439
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: 1 Class: 2 Class: 3 Class: 4 Class: 5
## Sensitivity 0.9158 0.33802 0.49610 0.0000 0.46765
## Specificity 0.6303 0.92351 0.78802 1.0000 0.99958
## Pos Pred Value 0.4961 0.51471 0.33073 NaN 0.99606
## Neg Pred Value 0.9496 0.85322 0.88104 0.8362 0.89288
## Prevalence 0.2845 0.19354 0.17434 0.1638 0.18386
## Detection Rate 0.2605 0.06542 0.08649 0.0000 0.08598
## Detection Prevalence 0.5251 0.12710 0.26151 0.0000 0.08632
## Balanced Accuracy 0.7730 0.63077 0.64206 0.5000 0.73362
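Since the report promises an estimate of the out-of-sample error, one simple way to express it (a small illustrative addition, not part of the original code) is one minus the accuracy on the held-out 30% split:
# Estimated out-of-sample error for the rpart model on the held-out split
1 - confusionMatrix(predRPART, testing$classe)$overall["Accuracy"]
# roughly 0.50, given the accuracy of 0.4984 reported above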
fancyRpartPlot(modelRPART$finalModel)
Sadly, my computer failed to build the random forest inside the R Markdown report, so at first I could not provide all the metrics and graphics that I had planned (a lesson to me to save all versions, or at least screenshots, rather than rely on any code being reproducible at any given moment). The "rf" method nevertheless gives 99% accuracy in cross-validation on the separate test set and predicted all the test cases correctly. Update: I got it working at the last moment - the model was built separately and its output was added to this report manually. Sorry for some mess.
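For reference, here is a sketch of how the forest could be trained with explicit cross-validation control to keep the run time manageable; these settings are an assumption for illustration, not the exact call that produced the output below:
# Sketch only: 5-fold cross-validation instead of the default bootstrap resampling
# rfControl <- trainControl(method = "cv", number = 5, allowParallel = TRUE)
# modelRF <- train(classe ~ ., method = "rf", data = training, trControl = rfControl)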
#modelRF <- train(classe ~ ., method = "rf", data = training)
print(modelRF$finalModel)
## Call:
## randomForest(x = x, y = y, mtry = param$mtry)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 2
## OOB estimate of error rate: 0.63%
## Confusion matrix:
## 1 2 3 4 5 class.error
## 1 3904 1 1 0 0 0.0005120328
## 2 14 2637 7 0 0 0.0079006772
## 3 0 16 2374 6 0 0.0091819699
## 4 0 0 32 2219 1 0.0146536412
## 5 0 0 1 7 2517 0.0031683168
predRF <- predict(modelRF, subset(testing, select = -classe))
confusionMatrix(predRF, testing$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 1 2 3 4 5
## 1 1672 9 0 0 0
## 2 2 1129 9 0 0
## 3 0 1 1015 12 2
## 4 0 0 2 951 1
## 5 0 0 0 1 1079
##
## Overall Statistics
##
## Accuracy : 0.9934
## 95% CI : (0.991, 0.9953)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9916
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: 1 Class: 2 Class: 3 Class: 4 Class: 5
## Sensitivity 0.9988 0.9912 0.9893 0.9865 0.9972
## Specificity 0.9979 0.9977 0.9969 0.9994 0.9998
## Pos Pred Value 0.9946 0.9904 0.9854 0.9969 0.9991
## Neg Pred Value 0.9995 0.9979 0.9977 0.9974 0.9994
## Prevalence 0.2845 0.1935 0.1743 0.1638 0.1839
## Detection Rate 0.2841 0.1918 0.1725 0.1616 0.1833
## Detection Prevalence 0.2856 0.1937 0.1750 0.1621 0.1835
## Balanced Accuracy 0.9983 0.9945 0.9931 0.9930 0.9985
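The same out-of-sample error estimate for the random forest (again an illustrative addition, not part of the original code):
# Estimated out-of-sample error for the random forest on the held-out split
1 - confusionMatrix(predRF, testing$classe)$overall["Accuracy"]
# roughly 0.0066, given the accuracy of 0.9934 reported above; consistent with the 0.63% OOB estimate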
varImpPlot(modelRF$finalModel)
plot(modelRF$finalModel, log = "y")
See how the error rate drops (and the accuracy increases) as the number of trees grows.
The random forest gives an excellent result and will be used for further predictions.
Test case prediction
To use the trained model on the test cases, all the data manipulations applied to the training set must also be applied to the test set.
The corresponding code was not shown in the report; it repeats the same cleansing and dimension-reduction steps used for the training set, sketched below.
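Below is a sketch of how that preparation could look, assuming the raw test set loaded earlier is still in the test object; the name testTry matches the prediction call that follows, and the final filter keeps only the predictors that survived the training-set cleansing. This illustrates the approach rather than reproducing the exact original code.
# Apply the same cleansing steps used on the training set (sketch)
testTry <- subset(test, select = -c(user_name, raw_timestamp_part_1, raw_timestamp_part_2,
                                    cvtd_timestamp, new_window, num_window, X))
testTry[testTry == ""] <- NA
testTry[testTry == "#DIV/0!"] <- NA
testTry <- japply(testTry, which(sapply(testTry, class) == "character"), as.integer)
testTry <- japply(testTry, which(sapply(testTry, class) == "factor"), as.integer)
testTry[is.na(testTry)] <- 0
# Keep only the predictors that remained in the cleaned training set
testTry <- testTry[, names(testTry) %in% names(training)]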
predTestTry <- predict(modelRF, testTry)
The final prediction vector is: B A B A A E D B A A B C B A E E A B B B, which achieved 100% accuracy on the test case predictions.