# Objective

  • What is a 'regression' function
  • Simple linear regression
  • Best approximation, least squares, residual sum of squares (RSS), RSE, $R^2$
  • Understand the output of a simple linear regression
  • Basic R and Python commands to run a simple linear regression model

# Advertising Example

Figure 2.1 from ISLR: Y = Sales plotted against TV, Radio, and Newspaper advertising budgets.

Our goal is to develop an accurate model ($f$) that can be used to predict sales on the basis of the three media budgets:

$$\text{Sales} \approx f(\text{TV}, \text{Radio}, \text{Newspaper}).$$

  • Sales is a response, target, or outcome.

    • The variable we want to predict.
    • Denoted by $Y$.
  • TV is one of the features, or inputs.

    • Denoted by $X_1$.
  • Similarly for Radio ($X_2$) and Newspaper ($X_3$).

  • We can put all the predictors into a single input vector

    $$X = (X_1, X_2, X_3)$$

  • Now we can write our model as

    $$Y = f(X) + \epsilon,$$

    where $\epsilon$ captures measurement errors and other discrepancies between the response $Y$ and the model $f$.

# Regression function

Formally, the regression function is given by $E(Y \mid X = x)$. This is the expected value of $Y$ at $X = x$.
The ideal or optimal predictor of $Y$ based on $X$ is thus

$$f(x) = E(Y \mid X = x)$$

For example, a good value for $f$ at $X = 4$ is $$ f(4) = E(Y \mid X = 4) $$
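As a minimal sketch of what $E(Y \mid X = 4)$ means in practice, the following simulation averages the responses observed at $X = 4$. The data-generating model $Y = 2 + 3X + \epsilon$, the integer values of $X$, and the noise level are all assumptions chosen for illustration; none of them come from the Advertising data.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible


def simulate_point():
    """Draw one (x, y) pair from the assumed model Y = 2 + 3X + eps."""
    x = random.randint(1, 6)        # X takes integer values 1..6 (assumption)
    eps = random.gauss(0.0, 1.0)    # noise with mean 0 and sd 1 (assumption)
    return x, 2.0 + 3.0 * x + eps


data = [simulate_point() for _ in range(20000)]

# Estimate E(Y | X = 4) by averaging Y over the points with X = 4.
y_at_4 = [y for x, y in data if x == 4]
f_hat_4 = sum(y_at_4) / len(y_at_4)

true_f_4 = 2.0 + 3.0 * 4  # value of the assumed regression function at x = 4
print(f"estimated f(4) = {f_hat_4:.3f}, true f(4) = {true_f_4:.1f}")
```

With many observations at $X = 4$, the sample average should land close to the true conditional mean; with real data we rarely have many observations at one exact value of $X$, which is one motivation for assuming a model such as a straight line.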

# Simple linear regression using a single predictor $X$

  • Predict a quantitative response $Y$ from a single predictor variable $X$:

$$Y \approx \beta_0 + \beta_1 X$$

  • Example: $\text{sales} \approx \beta_0 + \beta_1 \times \text{TV}$.

  • $\beta_0$ and $\beta_1$ are two unknown constants that represent the intercept and slope. They are called the parameters, or coefficients.

  • Given estimates $\hat\beta_0$ and $\hat\beta_1$, we predict the response at $X = x$ by

$$\hat y = \hat\beta_0 + \hat\beta_1 x.$$

# How to estimate the coefficients

Let $\hat{y}_i = \hat\beta_0 + \hat\beta_1 x_i$ be the prediction for $Y$ based on the $i$th value of $X$.

$e_i = y_i - \hat{y}_i$ represents the $i$th residual.

Residual sum of squares (RSS):

$$\text{RSS} = \sum_{i=1}^n e_i^2 = \sum_{i=1}^n (y_i - \hat{y}_i)^2.$$

The least squares approach chooses $\hat\beta_0$ and $\hat\beta_1$ to minimize the RSS.
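The minimization can be sketched in a few lines of plain Python. The closed-form expressions used here, $\hat\beta_1 = \sum_i (x_i - \bar x)(y_i - \bar y) / \sum_i (x_i - \bar x)^2$ and $\hat\beta_0 = \bar y - \hat\beta_1 \bar x$, are the standard least squares solution for a single predictor; they are not stated in these notes, and the five data points are made up for illustration.

```python
def least_squares(x, y):
    """Return (beta0_hat, beta1_hat) minimizing RSS for y ~ beta0 + beta1 * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    beta1_hat = s_xy / s_xx
    beta0_hat = y_bar - beta1_hat * x_bar
    return beta0_hat, beta1_hat


def rss(x, y, beta0, beta1):
    """Residual sum of squares of the line beta0 + beta1 * x."""
    return sum((yi - (beta0 + beta1 * xi)) ** 2 for xi, yi in zip(x, y))


# Toy data (hypothetical, not from the Advertising data set)
x = [1, 2, 3, 4, 5]
y = [2.2, 4.1, 6.2, 7.9, 10.1]

b0, b1 = least_squares(x, y)
rss_best = rss(x, y, b0, b1)
rss_other = rss(x, y, b0, b1 + 0.1)  # same intercept, slightly different slope

print(f"beta0_hat = {b0:.2f}, beta1_hat = {b1:.2f}, RSS = {rss_best:.3f}")
print("least squares line has the smaller RSS:", rss_best < rss_other)
```

Perturbing either coefficient away from the least squares values can only increase the RSS, which is what the final comparison checks for one such perturbation.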

Fig. 3.1, ISLR: For the Advertising data, the least squares fit for the regression of sales onto TV is shown.

The fit is found by minimizing the sum of squared errors.

Each grey line segment represents an error, and the fit makes a compromise by averaging their squares.

In this case a linear fit captures the essence of the relationship, although it is somewhat deficient in the left of the plot.

Fig. 3.2. ISLR: A simulated data set.

Left: The red line represents the true relationship, $f(X) = 2 + 3X$, which is known as the population regression line.

The blue line is the least squares line; it is the least squares estimate for $f(X)$ based on the observed data, shown in black.

Right: The population regression line is again shown in red, and the least squares line in dark blue.

In light blue, ten least squares lines are shown, each computed on the basis of a separate random set of observations.

Each least squares line is different, but on average, the least squares lines are quite close to the population regression line.
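The experiment in the right panel can be imitated with a short simulation. This is only a sketch: the population line $f(X) = 2 + 3X$ is taken from the figure caption, while the sample size, the range of $X$, the noise level, and the random seed are assumptions.

```python
import random

random.seed(1)  # fixed seed so the sketch is reproducible


def least_squares(x, y):
    """Closed-form least squares estimates (intercept, slope) for one predictor."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    slope = s_xy / s_xx
    return y_bar - slope * x_bar, slope


def simulate_fit(n=100, sigma=2.0):
    """Draw a fresh sample from Y = 2 + 3X + eps and fit a least squares line."""
    x = [random.uniform(-2.0, 2.0) for _ in range(n)]
    y = [2.0 + 3.0 * xi + random.gauss(0.0, sigma) for xi in x]
    return least_squares(x, y)


# Ten least squares lines, each from a separate random set of observations
fits = [simulate_fit() for _ in range(10)]
for b0, b1 in fits:
    print(f"intercept = {b0:.3f}, slope = {b1:.3f}")

mean_b0 = sum(b0 for b0, _ in fits) / len(fits)
mean_b1 = sum(b1 for _, b1 in fits) / len(fits)
print(f"average intercept = {mean_b0:.3f}, average slope = {mean_b1:.3f}")
```

Each fitted line differs from the others, but the averages of the ten intercepts and slopes should sit near 2 and 3, the coefficients of the population regression line.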

# Understand the output

For the Advertising data, coefficients of the least squares model for the regression of number of units sold on TV advertising budget.

An increase of \$1,000 in the TV advertising budget is associated with an increase in sales of around 50 units. (Recall that the sales variable is in thousands of units, and the TV variable is in thousands of dollars.)

Here $t = \frac{\hat\beta_i - 0}{SE(\hat\beta_i)}$ is a $t$-statistic.

Question: Is there a relationship between the response $Y$ and the predictor $X$?

We can carry out a hypothesis test.

  • Check whether $\beta_1 = 0$.

    • Hypothesis test: $H_0: \beta_1 = 0$ vs. $H_1: \beta_1 \neq 0$.
    • The $t$-statistic measures the number of standard deviations that $\hat\beta_1$ is away from 0. Specifically, $t = \frac{\hat\beta_1 - 0}{SE(\hat\beta_1)}$, which follows a $t$-distribution with $n-2$ degrees of freedom under $H_0$.
  • $p$-value

    • The probability, assuming $H_0$ is true, of observing a value at least as large as $|t|$ in absolute value. As usual, this is the probability of seeing data at least as extreme as what we observed if $H_0$ were true.

    • In practice, we just read the $p$-value for the $t$-test off the output of the linear model.
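To make the $t$-statistic concrete, here is a small sketch that recomputes it from an estimate and its standard error. The two numbers are the horsepower coefficient and standard error from the Auto regression output shown later in these notes. The "estimate plus or minus two standard errors" interval is a common rule of thumb for a 95% confidence interval; it is an approximation introduced here, not the exact $t$ quantile.

```python
# Values copied from the R summary output for mpg ~ horsepower
beta1_hat = -0.157845  # estimated slope for horsepower
se_beta1 = 0.006446    # standard error of the slope

# t-statistic for H0: beta1 = 0
t_stat = (beta1_hat - 0) / se_beta1

# Rough 95% interval: estimate +/- 2 standard errors (rule of thumb)
ci_low = beta1_hat - 2 * se_beta1
ci_high = beta1_hat + 2 * se_beta1

print(f"t = {t_stat:.2f}")
print(f"approximate 95% CI for beta1: [{ci_low:.3f}, {ci_high:.3f}]")
```

The computed $t$ should agree with the `t value` column of the output up to rounding, and the rough interval should be close to the `[0.025, 0.975]` columns reported by statsmodels. Because the interval is far from 0, we reject $H_0$.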

# Assessing model fit

Question: Suppose we have rejected the null hypothesis in favor of the alternative. Now what?

  • A natural next step: quantify the extent to which the model fits the data.
  • The quality of a linear regression fit is typically assessed using two related quantities:
    • the residual standard error (RSE) and
    • the $R^2$ statistic.

Advertising Data Results


The RSE is a measure of the lack of fit of the simple linear regression model to the data:

$$\text{RSE} = \sqrt{\frac{1}{n-2}\text{RSS}} = \sqrt{\frac{1}{n-2}\sum_{i=1}^n (y_i-\hat y_i)^2}$$

  • If the predictions obtained using the model are very close to the true outcome values ($\hat y_i \approx y_i$ for $i = 1, \dots, n$), then the RSE will be small,

    • and we can conclude that the model fits the data very well.
  • If $\hat y_i$ is very far from $y_i$ for one or more observations, then the RSE may be quite large,

    • indicating that the model doesn't fit the data well.

Interpretation
The RSE provides an absolute measure of lack of fit.

But since it is measured in the units of $Y$, it is not always clear what constitutes a good RSE.
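The dependence of the RSE on the units of $Y$ can be seen in a short sketch. The data and the fitted line $\hat y = 0.22 + 1.96x$ are hypothetical toy values, and `rse` is a helper name introduced here for illustration.

```python
import math


def rse(y, y_hat):
    """Residual standard error: sqrt(RSS / (n - 2))."""
    n = len(y)
    rss = sum((yi - yhi) ** 2 for yi, yhi in zip(y, y_hat))
    return math.sqrt(rss / (n - 2))


# Toy data and an assumed fitted line y_hat = 0.22 + 1.96 * x
x = [1, 2, 3, 4, 5]
y = [2.2, 4.1, 6.2, 7.9, 10.1]
y_hat = [0.22 + 1.96 * xi for xi in x]

rse_value = rse(y, y_hat)

# Same data with Y expressed in units 1000 times smaller
# (e.g. units instead of thousands of units)
rse_scaled = rse([1000 * yi for yi in y], [1000 * yhi for yhi in y_hat])

print(f"RSE in original units: {rse_value:.4f}")
print(f"RSE after rescaling Y by 1000: {rse_scaled:.1f}")
```

The fit is equally good in both cases, yet the RSE changes by a factor of 1000, so the number alone does not tell us whether the fit is good.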

# $R^2$

The $R^2$ statistic provides an alternative measure of fit, in the form of a proportion:

$$R^2 = \frac{\text{TSS} - \text{RSS}}{\text{TSS}} = 1 - \frac{\text{RSS}}{\text{TSS}}$$

  • TSS = total sum of squares, $\sum_{i=1}^n (y_i - \bar y)^2$,
    where $\bar y = \frac{1}{n}\sum_{i=1}^n y_i$
  • RSS = residual sum of squares, $\sum_{i=1}^n (y_i - \hat y_i)^2$

$R^2$ measures the proportion of variability in $Y$ that can be explained using $X$.

Interpretation
Always between 0 and 1 (independent of the scale of $Y$).

Question: What's a good value?
It can be challenging to determine. In general, it depends on the application.
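A minimal sketch of the $R^2$ computation, again on hypothetical toy data with an assumed least squares line $\hat y = 0.22 + 1.96x$. The helper name `r_squared` is introduced here for illustration.

```python
def r_squared(y, y_hat):
    """R^2 = 1 - RSS / TSS."""
    y_bar = sum(y) / len(y)
    tss = sum((yi - y_bar) ** 2 for yi in y)                  # total sum of squares
    rss = sum((yi - yhi) ** 2 for yi, yhi in zip(y, y_hat))   # residual sum of squares
    return 1 - rss / tss


# Toy data and the least squares line for these points
x = [1, 2, 3, 4, 5]
y = [2.2, 4.1, 6.2, 7.9, 10.1]
y_hat = [0.22 + 1.96 * xi for xi in x]

r2 = r_squared(y, y_hat)

# Rescaling Y changes RSS and TSS by the same factor, so R^2 is unchanged
r2_scaled = r_squared([1000 * yi for yi in y], [1000 * yhi for yhi in y_hat])

print(f"R^2 = {r2:.4f}, after rescaling Y: {r2_scaled:.4f}")
```

Unlike the RSE, the value does not depend on the units of $Y$, which is why $R^2$ is easier to compare across problems.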

# How to construct linear regression in R

Use simple linear regression on the Auto data set.

  • Use the lm() function to perform a simple linear regression with mpg as the response and horsepower as the predictor.

    library(ISLR)  # provides the Auto data set
    fitlm <- lm(mpg ~ horsepower, data=Auto)

Where is the output?

  • Let's take a look at the fitlm object.
    Use the summary() function to print the results.

    summary(fitlm)

    Call:
    lm(formula = mpg ~ horsepower, data = Auto)

    Residuals:
         Min       1Q   Median       3Q      Max
    -13.5710  -3.2592  -0.3435   2.7630  16.9240

    Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
    (Intercept) 39.935861   0.717499   55.66   <2e-16 ***
    horsepower  -0.157845   0.006446  -24.49   <2e-16 ***
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    Residual standard error: 4.906 on 390 degrees of freedom
    Multiple R-squared:  0.6059,    Adjusted R-squared:  0.6049
    F-statistic: 599.7 on 1 and 390 DF,  p-value: < 2.2e-16
  • Is there a relationship between the predictor and the response?

    • Yes: the $p$-value for the horsepower coefficient is below 2e-16, so we reject $H_0: \beta_1 = 0$.
  • How strong is the relationship between the predictor and the response?

    • The $p$-value is close to 0, so the relationship is highly significant; $R^2 = 0.6059$, so horsepower explains about 61% of the variability in mpg.
  • Is the relationship between the predictor and the response positive or negative?

    • The coefficient is negative, so the relationship is negative.
  • What is the predicted mpg associated with a horsepower of 98? What are the associated 95% confidence and prediction intervals?

First you have to make a new data frame object which will contain the new point:

    new <- data.frame(horsepower = 98)
    predict(fitlm, new) # predicted mpg
    predict(fitlm, new, interval="confidence") # conf interval
           fit      lwr      upr
    1 24.46708 23.97308 24.96108
    predict(fitlm, new, interval="prediction") # pred interval
           fit     lwr      upr
    1 24.46708 14.8094 34.12476

What is the difference between confidence and prediction intervals? $\rightarrow$ We will learn this in the next lecture.
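The point prediction reported by predict() can be checked by hand from the fitted coefficients, since $\hat y = \hat\beta_0 + \hat\beta_1 x$. The sketch below plugs in the rounded coefficients from the summary output, so it should agree with the `fit` column only up to rounding.

```python
# Coefficients copied from the summary output of mpg ~ horsepower
beta0_hat = 39.935861   # intercept
beta1_hat = -0.157845   # slope for horsepower

horsepower = 98
predicted_mpg = beta0_hat + beta1_hat * horsepower

print(f"predicted mpg at horsepower = {horsepower}: {predicted_mpg:.3f}")
```

Both intervals in the R output are centered on this same fitted value; they differ only in their width.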

# How to construct linear regression in Python

    # import the necessary packages
    import pandas as pd
    import numpy as np
    import statsmodels.api as sm  # install first if needed, e.g. py_install("statsmodels") from R's reticulate

We still want to find the relationship between mpg and horsepower. Let's read the data set first.

    # Read the data set from the website of our textbook
    auto = pd.read_csv('https://www.statlearning.com/s/Auto.csv')
    # Drop the five records whose horsepower is recorded as '?'
    auto = auto.drop(labels=[32,126,330,336,354], axis=0)
    # choose the predictor
    X = auto[['horsepower']]
    # choose the response
    y = auto[['mpg']]
    # add a constant (intercept) column to the predictor
    X = sm.add_constant(X)
    # OLS stands for "Ordinary Least Squares"
    # `horsepower` is read as an object (string) column, so astype converts it to numeric
    model01 = sm.OLS(y, X.astype(float)).fit()
    # To obtain the results of the regression model, run the `summary()` command on the model
    print(model01.summary())
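                                OLS Regression Results
    ==============================================================================
    Dep. Variable:                    mpg   R-squared:                       0.606
    Model:                            OLS   Adj. R-squared:                  0.605
    Method:                 Least Squares   F-statistic:                     599.7
    Date:                Wed, 03 Nov 2021   Prob (F-statistic):           7.03e-81
    Time:                        10:50:42   Log-Likelihood:                -1178.7
    No. Observations:                 392   AIC:                             2361.
    Df Residuals:                     390   BIC:                             2369.
    Df Model:                           1
    Covariance Type:            nonrobust
    ==============================================================================
                     coef    std err          t      P>|t|      [0.025      0.975]
    ------------------------------------------------------------------------------
    const         39.9359      0.717     55.660      0.000      38.525      41.347
    horsepower    -0.1578      0.006    -24.489      0.000      -0.171      -0.145
    ==============================================================================
    Omnibus:                       16.432   Durbin-Watson:                   0.920
    Prob(Omnibus):                  0.000   Jarque-Bera (JB):               17.305
    Skew:                           0.492   Prob(JB):                     0.000175
    Kurtosis:                       3.299   Cond. No.                         322.
    ==============================================================================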

We can ask ourselves the same questions as above, and we get the same answers.
How do we predict in Python?

    # The first entry is the constant; the second one is horsepower
    auto01 = np.array((1,98))
    # predicted mpg for horsepower = 98
    model01.predict(auto01)

# Simple plots in R

Plot the response and the predictor.

Use the abline() function to display the least squares regression line.

    plot(Auto$horsepower, Auto$mpg)
    abline(fitlm, col="red")

Use the plot() function to produce diagnostic plots of the least squares regression fit.

Comment on any problems you see with the fit.

  • The residuals vs. fitted plot shows that the relationship is non-linear.
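Why a residuals vs. fitted plot reveals non-linearity can be sketched without any plotting. In this hypothetical toy example (not the Auto data), a straight line is fitted to points that lie exactly on the curve $y = x^2$, and the residuals are listed against the fitted values.

```python
def least_squares(x, y):
    """Closed-form least squares estimates (intercept, slope) for one predictor."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    slope = s_xy / s_xx
    return y_bar - slope * x_bar, slope


# Toy data with a curved (quadratic) relationship
x = [0, 1, 2, 3, 4]
y = [xi ** 2 for xi in x]

b0, b1 = least_squares(x, y)
fitted = [b0 + b1 * xi for xi in x]
residuals = [yi - fi for yi, fi in zip(y, fitted)]

for fi, ri in zip(fitted, residuals):
    print(f"fitted = {fi:6.2f}   residual = {ri:6.2f}")
```

The residuals are positive at both ends and negative in the middle. A systematic U-shaped pattern like this, rather than a random scatter around 0, is the signature of a non-linear relationship that a straight line cannot capture.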

# In-Class Exercise

Construct a simple linear regression of mpg on each of cylinders, displacement, and acceleration, one predictor at a time.

Based on the output, answer the following questions:

  • Is there a relationship between the predictor and the response?

  • How strong is the relationship between the predictor and the response?

  • Is the relationship between the predictor and the response positive or negative?

  • Based on the RSE and $R^2$, which model would you choose for the simple linear regression? Explain your choice.

# Reference

  1. Chapter 3 of the textbook Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani,
    An Introduction to Statistical Learning: with Applications in R.

  2. Chapter 11 of the textbook Chantal D. Larose and Daniel T. Larose
    Data Science Using Python and R.

  3. Parts of these lecture notes are extracted from Prof. Sonja Petrovic's ITMD/ITMS 514 lecture notes.