# Part I: Tests for a cybersecurity data set
Let's revisit cybersecurity breach report data downloaded 2015-02-26 from the U.S. Department of Health and Human Services.
From the Office for Civil Rights of the U.S. Department of Health and Human Services, I obtained the following information:
"As required by section 13402(e)(4) of the HITECH Act, the Secretary must post a list of breaches of unsecured protected health information affecting 500 or more individuals.
"Since October 2009 organizations in the U.S. that store data on human health are required to report any incident that compromises the confidentiality of 500 or more patients / human subjects (45 C.F.R. 164.408). These reports are publicly available. Our data set was downloaded from the Office for Civil Rights of the U.S. Department of Health and Human Services, 2015-02-26."
Load this data set and store it as `cyberData`, using the following code:

```r
cyberData <- read.csv(url("https://vincentarelbundock.github.io/Rdatasets/csv/Ecdat/HHSCyberSecurityBreaches.csv"))
```
As you know, this data set contains all reports regarding health information data breaches from 2009 to 2015. Let's pretend this is just a sample from the population of all data breaches, related or not to health information.
# Question 1.
Compare the number of individuals affected by data breaches (column `Individuals.Affected`) in two states, Arkansas (`State=="AR"`) and California (`State=="CA"`). This can be done by performing a test of difference in means, for example. Repeat the same test for another pair of states, California ("CA") and Illinois ("IL").

Please note, in order to answer this question completely, you will need to run several lines of code, extract subsets of the data appropriately, run a statistical hypothesis test, and interpret the results. Draw a conclusion. Partial answers to the question are insufficient.
```r
AR <- cyberData[cyberData$State == "AR", ]
CA <- cyberData[cyberData$State == "CA", ]
IL <- cyberData[cyberData$State == "IL", ]
```
Since the population variances are unknown, we use a t-test to compare the means. Before comparing the means, we run an F test to check whether the two variances can be treated as equal.
```r
# F test
var.test(AR$Individuals.Affected, CA$Individuals.Affected)
```

```
	F test to compare two variances

data:  AR$Individuals.Affected and CA$Individuals.Affected
F = 0.00066857, num df = 6, denom df = 127, p-value = 2.814e-09
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
 0.0002664288 0.0032769357
sample estimates:
ratio of variances
      0.0006685688
```
The F test tells us to reject the null hypothesis of equal variances: the variances are different. We should therefore use the t-test that allows unequal variances (Welch's t-test).
```r
t.test(AR$Individuals.Affected, CA$Individuals.Affected)
```

```
	Welch Two Sample t-test

data:  AR$Individuals.Affected and CA$Individuals.Affected
t = -2.2841, df = 129.71, p-value = 0.02399
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -30145.686  -2161.579
sample estimates:
mean of x mean of y
  2769.00  18922.63
```
As the p-value = 0.02399 is smaller than 0.05, we reject the null hypothesis of equal means: the mean of `Individuals.Affected` differs between AR and CA.
```r
# F test
var.test(CA$Individuals.Affected, IL$Individuals.Affected)
```

```
	F test to compare two variances

data:  CA$Individuals.Affected and IL$Individuals.Affected
F = 0.02224, num df = 127, denom df = 56, p-value < 2.2e-16
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
 0.01392852 0.03413975
sample estimates:
ratio of variances
        0.02223981
```
Again the F test rejects the null hypothesis of equal variances, so we use Welch's t-test with unequal variances.
```r
t.test(CA$Individuals.Affected, IL$Individuals.Affected)
```

```
	Welch Two Sample t-test

data:  CA$Individuals.Affected and IL$Individuals.Affected
t = -0.87104, df = 57.112, p-value = 0.3874
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -203969.48   80308.12
sample estimates:
mean of x mean of y
 18922.63  80753.32
```
As the p-value = 0.3874 is larger than 0.05, we fail to reject the null hypothesis: there is no significant evidence that the mean of `Individuals.Affected` differs between CA and IL.
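Breach sizes are heavily right-skewed, so the normality assumption behind the F test and t-test is questionable. As a robustness check, one could also run a distribution-free Wilcoxon rank-sum test; here is a minimal sketch on synthetic log-normal samples (for the real comparison, substitute `AR$Individuals.Affected` and `CA$Individuals.Affected`):

```r
# Sketch: Wilcoxon rank-sum test on synthetic skewed data standing in for the
# two states' breach sizes; wilcox.test() makes no normality assumption.
set.seed(1)
a <- rlnorm(30, meanlog = 7)   # hypothetical stand-in for one state
b <- rlnorm(30, meanlog = 8)   # hypothetical stand-in for the other
wilcox.test(a, b)$p.value
```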
# Question 2.
Explore the variable `Type.Of.Breach` collected in this data set:

- What proportion of data entries in `cyberData` have `Type.of.Breach == "Hacking/IT Incident"`?
```r
hackIT <- sum(cyberData$Type.of.Breach == "Hacking/IT Incident")  # count exact matches
total <- nrow(cyberData)
prop <- hackIT / total
```
The proportion of data entries in `cyberData` that have `Type.of.Breach == "Hacking/IT Incident"` is given by `prop`, computed above.
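As an aside, the proportion of rows satisfying a condition can be computed in one step with `mean()`, since logical values coerce to 0/1; a small sketch on a made-up vector:

```r
# mean() of a logical vector is the proportion of TRUEs (TRUE/FALSE -> 1/0).
types <- c("Theft", "Hacking/IT Incident", "Loss", "Hacking/IT Incident")
mean(types == "Hacking/IT Incident")  # 0.5
```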
- What are all the different values of `Type.Of.Breach` reported in the data set? How many are hacking/IT incidents?
```r
table(cyberData$Type.of.Breach)
```

```
Hacking/IT Incident                                           77
Hacking/IT Incident, Other                                     2
Hacking/IT Incident, Other, Unauthorized Access/Disclosure     1
Hacking/IT Incident, Theft                                     1
Hacking/IT Incident, Theft, Unauthorized Access/Disclosure     3
Hacking/IT Incident, Unauthorized Access/Disclosure           10
Improper Disposal                                             42
Improper Disposal, Loss                                        3
Improper Disposal, Loss, Theft                                 3
Improper Disposal, Theft, Unauthorized Access/Disclosure       1
Improper Disposal, Unauthorized Access/Disclosure              2
Loss                                                          79
Loss, Other                                                    2
Loss, Other, Theft                                             1
Loss, Theft                                                   15
Loss, Unauthorized Access/Disclosure                           5
Loss, Unauthorized Access/Disclosure, Unknown                  1
Loss, Unknown                                                  2
Other                                                         89
Other, Theft                                                   5
Other, Theft, Unauthorized Access/Disclosure                   2
Other, Unauthorized Access/Disclosure                          7
Other, Unknown                                                 2
Theft                                                        577
Theft, Unauthorized Access/Disclosure                         24
Theft, Unauthorized Access/Disclosure, Unknown                 1
Unauthorized Access/Disclosure                               183
Unauthorized Access/Disclosure                                 1
Unknown                                                       10
```
There are 29 distinct values of `Type.Of.Breach`. Exactly 77 entries are "Hacking/IT Incident" on its own. Several other categories also contain "Hacking/IT Incident" combined with other types, such as "Hacking/IT Incident, Other" and "Hacking/IT Incident, Other, Unauthorized Access/Disclosure".
- What type of breach is reported in the 748th row of `cyberData`? How about the 349th row? Was row 349 counted in the proportion of Hacking/IT incident breaches you computed above? Why or why not?
```r
cyberData[748, 7]   # column 7 is Type.of.Breach
cyberData[349, 7]
table(unlist(strsplit(as.character(cyberData$Type.of.Breach), ",")))
```

```
[1] "Loss, Theft"
[1] "Hacking/IT Incident, Unauthorized Access/Disclosure"

 Loss                              6
 Other                             6
 Theft                            31
 Unauthorized Access/Disclosure   57
 Unknown                           6
Hacking/IT Incident               94
Improper Disposal                 51
Loss                             105
Other                            105
Theft                            602
Unauthorized Access/Disclosure   183
Unauthorized Access/Disclosure     1
Unknown                           10
```

Note that splitting on `","` (without a following space) leaves a leading space on any category that appears after a comma, which is why several categories appear twice in this table.
The type of breach reported in the 748th row of `cyberData` is "Loss, Theft". The type of breach in the 349th row is "Hacking/IT Incident, Unauthorized Access/Disclosure". Row 349 was not counted in the proportion of Hacking/IT incident breaches computed above, because "Hacking/IT Incident, Unauthorized Access/Disclosure" does not exactly match "Hacking/IT Incident".
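If one instead wanted to count every row that contains "Hacking/IT Incident" anywhere in its breach type (so that rows like 349 would be included), `grepl()` does a substring match; a sketch on a made-up vector:

```r
# grepl() flags entries that contain the pattern as a substring, so combined
# categories such as "Hacking/IT Incident, ..." are counted too.
types <- c("Theft",
           "Hacking/IT Incident",
           "Hacking/IT Incident, Unauthorized Access/Disclosure",
           "Loss")
sum(grepl("Hacking/IT Incident", types, fixed = TRUE))  # 2
```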
- Perform a hypothesis test on whether there is a difference in proportion of Hacking/IT incidents between the state of Illinois and the state of California. Write your conclusion interpreting the results of the statistical test.
```r
table(IL$Type.of.Breach)
x_IL <- sum(IL$Type.of.Breach == "Hacking/IT Incident")
n_IL <- length(IL$Type.of.Breach)
table(CA$Type.of.Breach)
x_CA <- sum(CA$Type.of.Breach == "Hacking/IT Incident")
n_CA <- length(CA$Type.of.Breach)
x <- c(x_IL, x_CA)
n <- c(n_IL, n_CA)
prop.test(x, n)
```
The p-value of the proportion test is 0.05505, which is larger than 0.05, so we fail to reject the null hypothesis: there is no significant evidence of a difference in the proportion of Hacking/IT incidents between the state of Illinois and the state of California.
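For reference, `prop.test()` is (up to a continuity correction) the classic two-proportion z-test; here is a hand-computed sketch with made-up counts, not the real IL/CA numbers:

```r
# Two-proportion z-test by hand (no continuity correction); x = successes,
# n = sample sizes. These counts are made up for illustration.
x <- c(5, 20); n <- c(50, 100)
p_pool  <- sum(x) / sum(n)                       # pooled proportion
z       <- (x[1]/n[1] - x[2]/n[2]) /
           sqrt(p_pool * (1 - p_pool) * sum(1/n))
p_value <- 2 * pnorm(-abs(z))                    # two-sided p-value
```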
# Part II: Review of basic concepts in statistical learning
You will spend some time thinking of some real-life applications for statistical learning.
# Question 3.
Describe three real-life applications in which classification might be useful. Describe the response, as well as the predictors. Is the goal of each application inference or prediction? Explain your answer.
- Weather forecasting. The response is the weather category (rain, sunny, cloudy, ...). The predictors can be humidity, visibility, wind speed, pressure, etc. The goal is prediction: we use past meteorological indicators to predict future weather.
- Cancer diagnosis. The response is whether the patient has cancer or not. The predictors can be values from a routine blood test, known cancer biomarkers, etc. The goal is inference: beyond separating cancer from non-cancer patients, we want to identify which markers drive the classification.
- Wine quality (good or bad). The response is whether the wine is good or bad; the predictors are chroma, acidity, alcohol content, etc. The goal is prediction: judging the unknown quality of a new wine.
# Question 4.
Describe three real-life applications in which regression might be useful. Describe the response, as well as the predictors. Is the goal of each application inference or prediction? Explain your answer.
- Calories burned by running. The response is the calories burned; the predictors are gender, age, weight, BMI, etc. The goal is prediction: estimating the calories a person will burn.
- Diamond prices. The response is the price of a diamond; the predictors are carat, cut, color, clarity, etc. The goal is prediction of the price.
- GDP. The response is GDP; the predictors are the country, population structure (e.g. number of women and children), education level, medical conditions, etc. The goal is prediction of GDP for any country.
# Question 5.
Describe three real-life applications in which cluster analysis might be useful.
- Sequencing analysis, e.g. single-cell sequencing: clustering reveals the heterogeneity among groups of single cells.
- Grouping students by their achievements, so that students with similar achievements are clustered together.
- Market analysis: we can divide the market into several clusters and then analyze each cluster with different methods.
# Question 6.
What are the advantages and disadvantages of a very flexible (versus a less flexible) approach for regression or classification? Under what circumstances might a more flexible approach be preferred to a less flexible approach? When might a less flexible approach be preferred?
The advantage of a very flexible approach is that it can fit complex, highly non-linear relationships, so it tends to have lower bias and better predictive accuracy when enough data are available; its disadvantages are higher variance (a greater risk of overfitting) and poorer interpretability, so it is less robust than a less flexible approach.
The advantages of a less flexible approach are robustness, interpretability, and resistance to overfitting; its disadvantage is that it may underfit, so its accuracy can be worse when the true relationship is complex. A more flexible approach is preferred when prediction accuracy is the goal, the sample size is large, and the relationship is complicated; a less flexible approach is preferred when inference and interpretability matter, or when the sample is small or noisy.
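A small simulation (my own sketch, not part of the assignment) makes the trade-off concrete: a degree-15 polynomial always fits the training half at least as well as a straight line, but on held-out data it can do worse because it chases the noise:

```r
# Flexible vs. inflexible fit on data whose true relationship is linear.
set.seed(1)
x <- runif(50, -2, 2)
y <- x + rnorm(50, sd = 0.5)
train <- 1:25; test <- 26:50
rigid    <- lm(y ~ x, subset = train)            # less flexible: a line
flexible <- lm(y ~ poly(x, 15), subset = train)  # very flexible: degree 15
mse <- function(fit, idx) mean((y[idx] - predict(fit, data.frame(x = x[idx])))^2)
mse(rigid, train); mse(flexible, train)  # flexible wins on training data
mse(rigid, test);  mse(flexible, test)   # but often loses on the test data
```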
# Part III: Simple and Multiple Linear Regression
Load the `Boston` data set:

```r
# import packages
library(MASS)
# load data
data(Boston)
```
# Question 7:
Construct a simple linear regression of `medv` with `crim`, `dis`, and `age` respectively. Based on the output, answer the following questions:

- Is there a relationship between the predictor and the response?
- How strong is the relationship between the predictor and the response?
- Is the relationship between the predictor and the response positive or negative?
- Based on the RSE and R², which model will you choose for the simple linear regression? Explain it.
```r
fit1 <- lm(medv ~ crim, data = Boston)
summary(fit1)
fit2 <- lm(medv ~ dis, data = Boston)
summary(fit2)
fit3 <- lm(medv ~ age, data = Boston)
summary(fit3)
summary(fit1)$r.square
summary(fit2)$r.square
summary(fit3)$r.square
# residual standard error, sqrt(RSS / (n - 2)), reported by summary() as sigma
RSE1 <- summary(fit1)$sigma
RSE2 <- summary(fit2)$sigma
RSE3 <- summary(fit3)$sigma
RSE1
RSE2
RSE3
```
There is a relationship between `medv` and each of `crim`, `dis`, and `age`, since all p-values are smaller than 0.05. Each relationship is highly significant (p < 0.001, three significance stars), though significance alone does not mean the fit is strong: the R² values show that each single predictor explains only a modest share of the variance in `medv`.
The relationship between `medv` and `crim` is negative.
The relationship between `medv` and `dis` is positive.
The relationship between `medv` and `age` is negative.
I will choose the model with predictor `crim`, which has the highest R² and the smallest RSE among the three.
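For the record, the residual standard error reported by `summary()` is the square root of RSS divided by the residual degrees of freedom (n − 2 for simple linear regression), not the mean squared error; this can be checked on synthetic data:

```r
# summary(fit)$sigma equals sqrt(RSS / (n - p - 1)); here p = 1 predictor.
set.seed(42)
x <- 1:20
y <- 2 * x + rnorm(20)
fit <- lm(y ~ x)
rss <- sum(residuals(fit)^2)
all.equal(summary(fit)$sigma, sqrt(rss / (20 - 2)))  # TRUE
```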
# Question 8:
Please use all the other features/attributes to construct a linear regression model.

- Interpret the coefficients of all the attributes. Which attributes are insignificant?
- Remove the insignificant attributes and construct a new linear regression model.
- Any improvement on the RSE and R²?
```r
fit_all <- lm(medv ~ ., data = Boston)
summary(fit_all)
fit_improve <- lm(medv ~ . - indus - age, data = Boston)
summary(fit_improve)
summary(fit_all)$r.square
summary(fit_improve)$r.square
RSE_all <- summary(fit_all)$sigma
RSE_improve <- summary(fit_improve)$sigma
RSE_all
RSE_improve
```
The predictors crim, zn, chas, nox, rm, dis, rad, tax, ptratio, black, and lstat are significant, with p-values smaller than 0.05; the predictors indus and age are insignificant, with p-values larger than 0.05.
A positive coefficient means the predictor has a positive effect on `medv`; a negative coefficient means a negative effect. For example, the coefficient of crim is -0.108, meaning that, holding the other predictors fixed, an increase of 1 in crim is associated with a decrease of about 0.108 in medv; the coefficient of zn is 0.0464, so an increase of 1 in zn is associated with an increase of about 0.0464 in medv.
The R² of the full model is much higher than that of the single-predictor models in Question 7. Removing indus and age changes R² and RSE only negligibly, so `fit_improve` is the best of the models built here: it achieves essentially the same fit with fewer predictors.
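A formal way to confirm that dropping indus and age loses nothing is a partial F-test comparing the two nested models with `anova()`; a sketch (assuming the MASS package is installed):

```r
library(MASS)  # provides the Boston data set
data(Boston)
fit_all     <- lm(medv ~ ., data = Boston)
fit_improve <- lm(medv ~ . - indus - age, data = Boston)
# Partial F-test of H0: the coefficients of indus and age are both zero.
anova(fit_improve, fit_all)  # a large p-value supports the smaller model
```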