# Part One
# 1. Types of random variables
# Instructions
Review the definition of discrete and continuous random variables.
A variable is a quantity whose value changes.
A discrete variable is a variable whose value is obtained by counting.
Examples:
- number of students present
- number of red marbles in a jar
- number of heads when flipping three coins
- students’ grade level
A continuous variable is a variable whose value is obtained by measuring.
Examples:
- height of students in class
- weight of students in class
- time it takes to get to school
- distance traveled between classes
A random variable is a variable whose value is a numerical outcome of a random phenomenon.
- A random variable is denoted with a capital letter, such as X
- The probability distribution of a random variable X tells what the possible values of X are and how probabilities are assigned to those values
- A random variable can be discrete or continuous
A discrete random variable X has a countable number of possible values.
- Example: Let X represent the sum of two dice.
- To graph the probability distribution of a discrete random variable, construct a probability histogram.
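The two-dice example can be worked out by enumeration. The sketch below assumes two fair six-sided dice, so that all 36 outcomes are equally likely; the names `outcomes`, `sums`, and `distX` are chosen here for illustration.

```r
# Enumerate all 36 equally likely outcomes of two fair six-sided dice.
outcomes <- expand.grid(die1 = 1:6, die2 = 1:6)
sums <- outcomes$die1 + outcomes$die2

# Probability distribution of X: relative frequency of each possible sum.
distX <- table(sums) / nrow(outcomes)
print(round(distX, 3))

# Probability histogram: one bar per possible value of X.
barplot(distX, xlab = "Sum of two dice (X)", ylab = "Probability")
```

The bars peak at X = 7, which can be produced by six of the 36 outcomes, and fall off symmetrically toward 2 and 12.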
A continuous random variable X takes all values in a given interval of numbers.
- The probability distribution of a continuous random variable is shown by a density curve.
- The probability that X falls in an interval of numbers is the area under the density curve between the interval endpoints.
- The probability that a continuous random variable X is exactly equal to a number is zero.
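To make the "area under the density curve" idea concrete, here is a minimal sketch that assumes, purely for illustration, that X follows a standard normal distribution. It computes an interval probability in two ways and then shows the probability of an ever narrower interval around a single point shrinking toward zero.

```r
# Assumption for illustration: X follows a standard normal distribution.
# P(-1 < X < 1) is the area under the density curve between -1 and 1.
areaNumeric <- integrate(dnorm, lower = -1, upper = 1)$value
areaCdf <- pnorm(1) - pnorm(-1)
print(c(numeric = areaNumeric, cdf = areaCdf))

# As the interval around the point 1 narrows, its probability goes to zero.
widths <- c(0.1, 0.01, 0.001)
pointProb <- pnorm(1 + widths / 2) - pnorm(1 - widths / 2)
print(pointProb)
```

Numerical integration of the density and the difference of two cumulative probabilities agree, which is the sense in which probability equals area for a continuous random variable.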
# Tasks
Classify the following random variables as discrete or continuous:
X: the number of automobile accidents per year in Virginia.
discrete
Y: the length of time to play 18 holes of golf.
continuous
M: the amount of milk produced yearly by a particular cow.
continuous
N: the number of eggs laid each month by a hen.
discrete
P: the number of building permits issued each month in a certain city.
discrete
Q: the weight of grain produced per acre.
continuous
# 2. Choosing a measure of location to summarize the data
# Instructions
We have learned two ways to measure the location or centrality of the data: the sample mean and the sample median. Review their definitions and how to compute them (in R or Python, of course!).
# Task
A certain polymer is used for evacuation systems for aircraft. It is important that the polymer be resistant to the aging process.
Twenty specimens of the polymer were used in an experiment. Ten were assigned randomly to be exposed to an accelerated batch aging process that involved exposure to high temperatures for 10 days.
Measurements of tensile strength of the specimens were made, and the following data were recorded on tensile strength in psi:
No aging: 227 222 218 216 218 217 225 229 228 221
Aging: 219 214 218 203 215 211 209 204 201 205
# You can use the following code to create a data frame
strength <- c(227, 222, 218, 216, 218, 217, 225, 229, 228, 221, 219, 214, 218, 203, 215, 211, 209, 204, 201, 205)
aging <- as.factor(c(rep(0, 10), rep(1, 10)))
polymerData <- data.frame(strength, aging)
(a) Make a dot plot of the data. Hint: you can use `qplot` from `ggplot2` to include the two attributes `strength` and `aging`.
library(ggplot2)
qplot(strength, aging, data = polymerData)
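If `ggplot2` is not available, a similar dot plot can be sketched with base R's `stripchart`. This is an alternative to the hinted `qplot` call, not the method the exercise asks for; the group labels passed to `group.names` are added here for readability.

```r
# Rebuild the data frame from the tensile strength values given above.
strength <- c(227, 222, 218, 216, 218, 217, 225, 229, 228, 221,
              219, 214, 218, 203, 215, 211, 209, 204, 201, 205)
aging <- as.factor(c(rep(0, 10), rep(1, 10)))
polymerData <- data.frame(strength, aging)

# One row of dots per group; repeated values are stacked so none are hidden.
stripchart(strength ~ aging, data = polymerData, method = "stack",
           group.names = c("No aging", "Aging"), pch = 16,
           xlab = "Tensile strength (psi)")
```

Stacking matters here because the value 218 occurs more than once, and overplotted points would otherwise look like a single specimen.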
(b) From your plot, does it appear as if the aging process has had an effect on the tensile strength of this polymer? Explain.
Yes. In the dot plot, the aged specimens sit at lower tensile strengths (201 to 219 psi) than the unaged specimens (216 to 229 psi), with only a small overlap, so the aging process appears to reduce the tensile strength of the polymer.
(c) Calculate the sample mean tensile strength of the two samples.
mean(polymerData[1:10,1])
mean(polymerData[11:20,1])
(d) Calculate the median for both. Discuss the similarity or lack of similarity between the mean and median of each group.
median(polymerData[1:10,1])
median(polymerData[11:20,1])
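The mean and median of each group are close: 222.1 vs. 221.5 psi for the unaged specimens and 209.9 vs. 210 psi for the aged specimens. This suggests that each distribution is roughly symmetric, with no extreme values pulling the mean away from the median.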
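The row-index calls above rely on the first ten rows being the unaged group. As a sketch of an alternative that groups by the `aging` factor instead, `tapply` computes both statistics per group in one call each; the names `groupMeans` and `groupMedians` are introduced here for illustration.

```r
# Rebuild the data frame from the tensile strength values given above.
strength <- c(227, 222, 218, 216, 218, 217, 225, 229, 228, 221,
              219, 214, 218, 203, 215, 211, 209, 204, 201, 205)
aging <- as.factor(c(rep(0, 10), rep(1, 10)))
polymerData <- data.frame(strength, aging)

# Mean and median tensile strength per level of aging (0 = no aging, 1 = aging).
groupMeans <- tapply(polymerData$strength, polymerData$aging, mean)
groupMedians <- tapply(polymerData$strength, polymerData$aging, median)
print(rbind(mean = groupMeans, median = groupMedians))
```

Grouping by the factor keeps the result correct even if the rows are later reordered.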
# 3. Choosing a measure of variability to summarize the data
# Instructions
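We have learned about two statistics that capture data variability: the variance and the standard deviation. Review their meaning and units: the standard deviation is the square root of the variance, so it is expressed in the same units as the data, while the variance is expressed in squared units.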
# Task
The previous problem showed tensile strength data for two samples, one in which specimens were exposed to an aging process and one in which there was no aging of the specimens.
(a) Calculate the sample variance as well as standard deviation in tensile strength for both samples.
var(polymerData[1:10,1])
sd(polymerData[1:10,1])
var(polymerData[11:20,1])
sd(polymerData[11:20,1])
(b) Does there appear to be any evidence that aging affects the variability in tensile strength?
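Yes, to some extent. The sample variance of the aged group (about 42.1 psi²) is larger than that of the unaged group (about 23.7 psi²), which suggests the tensile strength of aged specimens is more variable. With only ten specimens per group, this is an indication rather than firm evidence.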
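To connect `var` and `sd` to their definitions, the sketch below computes the sample variance by hand (squared deviations from the mean, divided by n - 1) and takes its square root. The helper `sampleVariance` is a hypothetical name written for this illustration; it is meant to match what R's built-in `var` returns.

```r
# Rebuild the data frame from the tensile strength values given above.
strength <- c(227, 222, 218, 216, 218, 217, 225, 229, 228, 221,
              219, 214, 218, 203, 215, 211, 209, 204, 201, 205)
aging <- as.factor(c(rep(0, 10), rep(1, 10)))
polymerData <- data.frame(strength, aging)

# Sample variance from its definition: squared deviations over (n - 1).
sampleVariance <- function(x) sum((x - mean(x))^2) / (length(x) - 1)

groupVar <- tapply(polymerData$strength, polymerData$aging, sampleVariance)
groupSd <- sqrt(groupVar)  # standard deviation, back in psi
print(rbind(variance = groupVar, sd = groupSd))

# Ratio of the larger to the smaller sample variance, as a rough comparison.
print(max(groupVar) / min(groupVar))
```

The variance row is in psi squared and the standard deviation row is in psi, which is why the standard deviation is usually the easier of the two to interpret next to the raw data.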
# Part Two: Working With Data
# Instructions
Read the following information if you need it before you begin:
- Obtaining the wine quality dataset
# Tasks
For the following exercises, work with the `winequality-red` data set. Use either Python or R to solve each problem.
Type a comment stating that you are working on a random data set we downloaded.
Locate the "Run" button and note whether there is a keyboard shortcut.
Execute the comment from the previous exercise. What is the output? Explain your answer.
Import the following packages:
a. For Python, import the `pandas` and `numpy` packages. Rename the `pandas` package "`pd`" and rename the `numpy` package "`np`".
b. For R, import the `ggplot2` package. Make sure you both install and open the package.
Import the `winequality-red` data set and name it `winequalRed`.
# Hint for the R version (change these commands as needed and delete these comments before submitting your work):
# if you downloaded the data set as a .csv file, you can read it in as follows:
# winequalRed <- read.csv("~/Documents/datasets/winequality-red.csv", sep=";")
# To view the data set:
# View(winequalRed)
# Import the winequality-red.csv
winequalRed <- read.csv("winequality-red.csv", sep=";")
# View(winequalRed)
Create a table of the `quality` and `alcohol` attributes from the `winequalRed` data set. Do not save the output from the code.
# hint: if you have two data columns named X and Y in your data frame, you can use code like this to create a table: table(my.data.set$X, my.data.set$Y)
table(winequalRed$quality, winequalRed$alcohol)
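Because `alcohol` is a continuous measurement, a raw cross-tabulation produces one column per distinct value. One possible way to get a more readable table is to bin `alcohol` with `cut` first. The sketch below uses a small made-up data frame, `toyWine`, as a stand-in for `winequalRed`; its values and the bin edges are assumptions for illustration, not taken from the real data set.

```r
# Made-up stand-in for winequalRed (NOT the real data), for illustration only.
toyWine <- data.frame(
  quality = c(5, 5, 6, 7, 5, 6, 4, 7),
  alcohol = c(9.4, 9.8, 10.5, 12.1, 9.4, 11.2, 9.0, 12.8)
)

# Bin the continuous alcohol values, then cross-tabulate against quality.
alcoholBand <- cut(toyWine$alcohol, breaks = c(8, 10, 12, 14))
crossTab <- table(quality = toyWine$quality, alcohol = alcoholBand)
print(crossTab)
```

The same two lines should apply to the real data after replacing `toyWine` with `winequalRed` and choosing bin edges that cover the observed alcohol range.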
Save the first nine records of the `winequalRed` data set as their own data frame.
firstNine <- head(winequalRed, 9)
firstNine
Save the `density` and `pH` records of the `winequalRed` data set as their own data frame.
# select both columns at once so the result is a data frame rather than two separate vectors
densityPH <- winequalRed[, c("density", "pH")]
Separate the wine data into a low quality class (quality ≤ 5) and a high quality class (quality > 5), then find the mean and standard deviation of the two attributes `total.sulfur.dioxide` and `alcohol` for the two classes. Based on this statistical information, describe whether these two attributes differ between the low quality and high quality red wines.
lowQuality <- winequalRed[which(winequalRed$quality <= 5),]
highQuality <- winequalRed[which(winequalRed$quality > 5),]
mean(lowQuality$total.sulfur.dioxide)
mean(lowQuality$alcohol)
sd(lowQuality$total.sulfur.dioxide)
sd(lowQuality$alcohol)
mean(highQuality$total.sulfur.dioxide)
mean(highQuality$alcohol)
sd(highQuality$total.sulfur.dioxide)
sd(highQuality$alcohol)
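The eight separate calls above can also be collapsed into two grouped summaries, which makes the class comparison easier to read side by side. The sketch below uses a made-up data frame, `toyWine`, in place of `winequalRed` (its values are invented for illustration), and the column `qualityClass` is a name introduced here.

```r
# Made-up stand-in for winequalRed (NOT the real data), for illustration only.
toyWine <- data.frame(
  quality = c(5, 5, 6, 7, 4, 6, 5, 7),
  total.sulfur.dioxide = c(34, 67, 60, 21, 40, 18, 102, 37),
  alcohol = c(9.4, 9.8, 10.5, 12.1, 9.0, 11.2, 9.5, 12.8)
)

# Label each wine using the same cut-off as the task (quality <= 5 is low).
toyWine$qualityClass <- ifelse(toyWine$quality <= 5, "low", "high")

# Mean and standard deviation of both attributes for each class.
classMeans <- aggregate(cbind(total.sulfur.dioxide, alcohol) ~ qualityClass,
                        data = toyWine, FUN = mean)
classSds <- aggregate(cbind(total.sulfur.dioxide, alcohol) ~ qualityClass,
                      data = toyWine, FUN = sd)
print(classMeans)
print(classSds)
```

Each result has one row per class, so a difference between low and high quality wines shows up as a difference between the two rows of the same column.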
To investigate the distribution of the `quality` attribute, which plot would you use, a box plot or a histogram? Show your result.
Both histograms and box plots let you visually assess the central tendency and the amount of variation in the data, as well as the presence of gaps, outliers, or unusual data points.
Histograms are preferred for judging the underlying probability distribution of a data set. Box plots, on the other hand, are more useful for comparing several data sets.
Although histograms are better at displaying the distribution of the data, box plots can still show whether the distribution is symmetric or skewed.
boxplot(winequalRed$quality)
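# quality is an integer score: centre each bin on an integer so that no two scores share a bar
hist(winequalRed$quality, breaks = seq(min(winequalRed$quality) - 0.5, max(winequalRed$quality) + 0.5, by = 1), labels = TRUE)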
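Since `quality` takes only a handful of integer values, a bar chart of counts is another reasonable way to show its distribution. The sketch below uses made-up scores in `toyQuality` (not the real data) purely to show the calls involved.

```r
# Made-up quality scores (NOT the real data), for illustration only.
toyQuality <- c(5, 5, 6, 7, 4, 6, 5, 7, 3, 6, 5, 8)

# Count how often each integer score occurs; the factor levels keep
# scores with zero wines visible as empty bars.
scoreLevels <- min(toyQuality):max(toyQuality)
qualityCounts <- table(factor(toyQuality, levels = scoreLevels))
print(qualityCounts)

# One bar per score.
barplot(qualityCounts, xlab = "quality", ylab = "Number of wines")
```

With the real data, `toyQuality` would be replaced by `winequalRed$quality`.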
# Extra Points
Without quitting R, load the `winequality-white.csv` file into the workspace. Create a data frame from the first 50 records of red wines and the first 50 records of white wines, and show a plot of `quality`.
Hint: You need to use the `merge` function.
# Import the winequality-white.csv
winequalWhite <- read.csv("winequality-white.csv", sep=";")
# View(winequalWhite)
# the first 50 records of red wines
redWine <- head(winequalRed, 50)
# the first 50 records of white wines
whiteWine <- head(winequalWhite, 50)
# merge data
mergedWine <- merge(redWine, whiteWine, all = TRUE)
boxplot(mergedWine$quality)
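# derive the breaks from the data so every score has its own bar and no score falls outside the range
hist(mergedWine$quality, breaks = seq(min(mergedWine$quality) - 0.5, max(mergedWine$quality) + 0.5, by = 1), labels = TRUE)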
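After `merge`, the combined data frame no longer records which rows came from which file. If the goal is to compare the two wine types, one possible alternative (not the approach the hint asks for) is to tag each subset and stack them with `rbind`. The sketch below uses tiny made-up data frames and a hypothetical `type` column for illustration.

```r
# Made-up stand-ins for the first rows of each file (NOT the real data).
redWine <- data.frame(alcohol = c(9.4, 9.8, 10.5), quality = c(5, 5, 6))
whiteWine <- data.frame(alcohol = c(8.8, 9.5, 10.1), quality = c(6, 6, 7))

# Tag each row with its source before stacking, so the wine type is kept.
redWine$type <- "red"
whiteWine$type <- "white"
combinedWine <- rbind(redWine, whiteWine)

# Side-by-side box plots of quality for the two wine types.
boxplot(quality ~ type, data = combinedWine, ylab = "quality")
```

This requires both data frames to have the same column names, which is assumed here.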