# Outline
- Continuous and Continuous Variables
  - Pearson's correlation coefficient
- Categorical and Continuous Variables
  - ANOVA Test
- Categorical and Categorical Variables
  - Chi-squared Test
# Correlation between Numerical Variables
To investigate the correlation, we can use the `pairs` function.
```r
pairs(iris[, 1:4], col = "blue")

# To make it a little fancier
library(lattice)
super.sym <- trellis.par.get("superpose.symbol")
splom(~iris[1:4], groups = Species, data = iris,
      panel = panel.superpose,
      key = list(title = "Three Varieties of Iris",
                 columns = 3,
                 points = list(pch = super.sym$pch[1:3],
                               col = super.sym$col[1:3]),
                 text = list(c("Setosa", "Versicolor", "Virginica"))))
```
How to interpret the scatter plot?
To get the Pearson correlation coefficients r, use the `cor` function:
```r
cor(iris[, 1:4])
```

```
             Sepal.Length Sepal.Width Petal.Length Petal.Width
Sepal.Length    1.0000000  -0.1175698    0.8717538   0.8179411
Sepal.Width    -0.1175698   1.0000000   -0.4284401  -0.3661259
Petal.Length    0.8717538  -0.4284401    1.0000000   0.9628654
Petal.Width     0.8179411  -0.3661259    0.9628654   1.0000000
```
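The matrix reports the coefficients only. As a minimal sketch that is not part of the original walkthrough, base R's `cor.test` can check whether a single coefficient differs significantly from zero. The pair `Petal.Length` and `Petal.Width` is chosen purely for illustration, and the variable names `ct`, `r_hat`, `p_val`, and `ci` are assumptions of this sketch.

```r
# Test whether the Pearson correlation between two columns differs from zero.
# The pair chosen here is illustrative; any two numeric columns work.
ct <- cor.test(iris$Petal.Length, iris$Petal.Width, method = "pearson")

r_hat <- unname(ct$estimate)  # sample correlation, same as the matrix entry
p_val <- ct$p.value           # p-value for H0: the true correlation is 0
ci    <- ct$conf.int          # 95% confidence interval for the correlation

print(round(r_hat, 7))
print(p_val < 0.05)
print(ci)
```

A small p-value here only says the correlation is unlikely to be zero; the size of r still has to be judged separately.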
- |r| < 0.3: weak correlation
- 0.3 < |r| < 0.7: moderate correlation
- |r| > 0.7: high correlation
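The rule of thumb above can be applied to the whole matrix at once. The following is a sketch under two assumptions: the thresholds are applied to the absolute value of r so that negative correlations are graded by strength, and a value falling exactly on 0.3 or 0.7 goes to the lower class (the text leaves those boundary cases open). The names `r_mat`, `strength_factor`, and `strength` are illustrative.

```r
# Label each pairwise correlation using the rule-of-thumb thresholds,
# applied to |r| so that negative correlations are graded by strength.
r_mat <- cor(iris[, 1:4])

strength_factor <- cut(abs(r_mat),
                       breaks = c(0, 0.3, 0.7, 1),
                       labels = c("weak", "moderate", "high"),
                       include.lowest = TRUE)

# cut() drops the matrix shape, so restore it along with the dimnames.
strength <- matrix(as.character(strength_factor),
                   nrow = nrow(r_mat),
                   dimnames = dimnames(r_mat))
print(strength)
```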
# Association between One Numerical Variable and One Categorical Variable
We can draw an overlay boxplot first. Let's play with the `iris` data set.
```r
library(ggplot2)

ggplot(iris, aes(x = Species, y = Sepal.Length)) + geom_boxplot()
```
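As a numeric companion to the boxplot, the per-species mean and standard deviation of `Sepal.Length` can be tabulated with base R. This is only a sketch; the names `group_means`, `group_sds`, and `group_summary` are illustrative and do not appear in the original.

```r
# Per-species mean and standard deviation of Sepal.Length,
# a numeric view of the group differences the boxplot shows.
group_means <- tapply(iris$Sepal.Length, iris$Species, mean)
group_sds   <- tapply(iris$Sepal.Length, iris$Species, sd)

group_summary <- cbind(mean = group_means, sd = group_sds)
print(round(group_summary, 3))
```

If the group means differ by much more than the within-group spread, an association between the two variables is plausible, and the ANOVA test quantifies this.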
We can use an ANOVA test to check the association between one numerical variable and one categorical variable with the `aov` function. ANOVA (AOV) is short for ANalysis Of VAriance.
```r
aov1 <- aov(Sepal.Length ~ Species, data = iris)
summary(aov1)
```

```
             Df Sum Sq Mean Sq F value Pr(>F)
Species       2  63.21  31.606   119.3 <2e-16 ***
Residuals   147  38.96   0.265
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```
# Interpretation
- The `Df` column displays the degrees of freedom for the independent variable (the number of levels in the variable minus 1; here 3 - 1 = 2) and the degrees of freedom for the residuals (the total number of observations minus the number of levels in the independent variable; here 150 - 3 = 147).
- The `Sum Sq` column displays the sum of squares. For `Species`, this is the total variation between the group means and the overall mean; for `Residuals`, it is the variation of the observations around their own group means.
- The `Mean Sq` column is the mean of the sum of squares, calculated by dividing the sum of squares by the degrees of freedom in the same row.
- The `F value` column is the test statistic from the F test. This is the mean square of the independent variable divided by the mean square of the residuals. The larger the F value, the more likely it is that the variation caused by the independent variable is real and not due to chance.
- The `Pr(>F)` column is the p-value of the F statistic. This shows how likely it is that an F value at least as large as the one calculated would have occurred if the null hypothesis of no difference among group means were true.
Here, the p-value of the `Species` variable is very low (p < 0.001), so it appears that the species has a real effect on `Sepal.Length`: the mean sepal length is not the same for all three species.
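To make the columns of the ANOVA table concrete, the same quantities can be recomputed by hand from the definitions given in the interpretation. This is a sketch for illustration rather than a replacement for `aov`; all variable names (`ss_between`, `ss_within`, `f_value`, and so on) are assumptions of this example.

```r
# Recompute the one-way ANOVA table for Sepal.Length ~ Species by hand.
y <- iris$Sepal.Length
g <- iris$Species

n_obs    <- length(y)   # total number of observations
n_groups <- nlevels(g)  # number of levels of the factor

group_means <- tapply(y, g, mean)
group_sizes <- tapply(y, g, length)

# Between-group sum of squares: group means vs. the overall mean.
ss_between <- sum(group_sizes * (group_means - mean(y))^2)
# Residual sum of squares: observations vs. their own group mean.
ss_within <- sum((y - ave(y, g))^2)

df_between <- n_groups - 1      # levels minus 1
df_within  <- n_obs - n_groups  # observations minus levels

ms_between <- ss_between / df_between
ms_within  <- ss_within / df_within

f_value <- ms_between / ms_within
p_value <- pf(f_value, df_between, df_within, lower.tail = FALSE)

print(c(Df = df_between, SumSq = ss_between, MeanSq = ms_between, F = f_value))
print(c(Df = df_within, SumSq = ss_within, MeanSq = ms_within))
print(p_value)
```

The printed values should agree with the `summary(aov1)` output up to rounding.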
# Associations between Categorical Variables
# Chi-squared Test
We can use the Chi-squared test to determine whether a population has a specified theoretical distribution. We can also use it to check the independence of two categorical features/attributes.

We need the contingency table one more time. We want to investigate the independence of `rank` and `sex` in the `Salaries` data set.
Here the null hypothesis is that these two variables are independent (NOT associated); the alternative hypothesis is that these two variables are associated.
```r
# Load the data
data("Salaries", package = "carData")
summary(Salaries)
```

```
        rank     discipline yrs.since.phd    yrs.service        sex
 AsstProf : 67   A:181      Min.   : 1.00   Min.   : 0.00   Female: 39
 AssocProf: 64   B:216      1st Qu.:12.00   1st Qu.: 7.00   Male  :358
 Prof     :266              Median :21.00   Median :16.00
                            Mean   :22.31   Mean   :17.61
                            3rd Qu.:32.00   3rd Qu.:27.00
                            Max.   :56.00   Max.   :60.00
     salary
 Min.   : 57800
 1st Qu.: 91000
 Median :107300
 Mean   :113706
 3rd Qu.:134185
 Max.   :231545
```
```r
# Create contingency table
contTable <- table(Salaries$rank, Salaries$sex)
contTable
```

```
            Female Male
  AsstProf      11   56
  AssocProf     10   54
  Prof          18  248
```
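Before running the test, it can help to look at row proportions, that is, the share of each sex within each rank. The sketch below rebuilds the table from the counts printed above so that it runs without the `carData` package; the names `cont_table` and `row_props` are illustrative.

```r
# Rebuild the contingency table from the printed counts
# so this sketch runs without the carData package.
cont_table <- matrix(c(11, 10, 18, 56, 54, 248), nrow = 3,
                     dimnames = list(rank = c("AsstProf", "AssocProf", "Prof"),
                                     sex  = c("Female", "Male")))

# Row proportions: share of each sex within each rank (each row sums to 1).
row_props <- prop.table(cont_table, margin = 1)
print(round(row_props, 3))
```

If the shares differ noticeably across ranks, that hints at an association, which the Chi-squared test then assesses formally.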
```r
# Conduct Chi-squared Test
chisqtestResult <- chisq.test(contTable)
chisqtestResult
```

```
	Pearson's Chi-squared test

data:  contTable
X-squared = 8.5259, df = 2, p-value = 0.01408
```
Since we get a p-value (0.01408) less than the significance level of 0.05, we can reject the null hypothesis of independence and conclude that the two variables are, indeed, associated.
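The statistic reported by `chisq.test` can be reproduced from the observed counts. The sketch below computes the counts expected under independence, then the Pearson Chi-squared statistic and its p-value. It uses the counts from the contingency table shown earlier; the names `observed`, `expected`, `chi_sq`, `deg_free`, and `p_value` are assumptions of this example.

```r
# Observed counts of rank by sex, copied from the contingency table.
observed <- matrix(c(11, 10, 18, 56, 54, 248), nrow = 3,
                   dimnames = list(rank = c("AsstProf", "AssocProf", "Prof"),
                                   sex  = c("Female", "Male")))
n <- sum(observed)

# Expected counts under independence: (row total * column total) / n.
expected <- outer(rowSums(observed), colSums(observed)) / n

# Pearson's Chi-squared statistic and its degrees of freedom.
chi_sq   <- sum((observed - expected)^2 / expected)
deg_free <- (nrow(observed) - 1) * (ncol(observed) - 1)

# Upper-tail probability of the Chi-squared distribution.
p_value <- pchisq(chi_sq, df = deg_free, lower.tail = FALSE)

print(round(expected, 2))
print(c(X_squared = chi_sq, df = deg_free, p_value = p_value))
```

The result should agree with the test output above, since `chisq.test` applies its continuity correction only to 2 x 2 tables.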
A problem with Pearson's Chi-squared statistic is that its maximum value depends on the sample size and the size of the contingency table, so its values are not directly comparable across situations. To overcome this problem, the statistic can be standardized to lie between 0 and 1 so that it is independent of the sample size as well as the dimension of the contingency table.
# Cramer's V (phi) Coefficient
Suppose we have an $r \times c$ contingency table. Cramer's V is defined as follows:

$$V = \sqrt{\frac{\chi^2 / n}{\min(r - 1,\ c - 1)}}$$

where $\chi^2$ is the Chi-squared statistic, $n$ is the sample size, $r$ is the number of rows, and $c$ is the number of columns.

From the previous example we have
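```r
n <- nrow(Salaries)                    # sample size
chistats <- chisqtestResult$statistic  # Chi-squared statistic from the test above
r <- 3                                 # number of rows (ranks)
c <- 2                                 # number of columns (sexes)
cramerv <- sqrt(chistats / n / min(r - 1, c - 1))
cramerv
```

```
X-squared
0.1465466
```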
We can also use the `cramerV` function in the `rcompanion` package to calculate Cramer's V.
```r
# Load the rcompanion library
library(rcompanion)

# Calculate Cramer's V
cramerV(contTable)
```

```
Cramer V
  0.1465
```
Cramer's V ranges from 0 to 1. The value we got here (about 0.15) is small, so although the Chi-squared test is significant at the 0.05 level, the association between `rank` and `sex` is weak.
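To illustrate why the standardization matters, the sketch below multiplies every cell of the contingency table by 10, which keeps the proportions the same but makes the sample ten times larger. The Chi-squared statistic grows tenfold, while Cramer's V is unchanged. The helper `cramers_v` is a hypothetical function written for this example (it is not the `rcompanion` implementation), and the counts are copied from the table shown earlier.

```r
# Observed counts of rank by sex, copied from the contingency table.
observed <- matrix(c(11, 10, 18, 56, 54, 248), nrow = 3,
                   dimnames = list(rank = c("AsstProf", "AssocProf", "Prof"),
                                   sex  = c("Female", "Male")))

# Hypothetical helper: Cramer's V from a contingency table of counts.
cramers_v <- function(tab) {
  chi_sq <- unname(chisq.test(tab)$statistic)
  sqrt(chi_sq / sum(tab) / min(nrow(tab) - 1, ncol(tab) - 1))
}

# Same proportions, ten times the sample size.
chi_1  <- unname(chisq.test(observed)$statistic)
chi_10 <- unname(chisq.test(observed * 10)$statistic)

v_1  <- cramers_v(observed)
v_10 <- cramers_v(observed * 10)

print(c(chi_1 = chi_1, chi_10 = chi_10))  # statistic scales with n
print(c(v_1 = v_1, v_10 = v_10))          # Cramer's V does not
```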