# Neural Networks
# Artificial Neural Networks (ANN)
Researchers tried to learn from biological neuron systems and built the ANN.
- There are many neuron units in an ANN
- They are connected within a structure
- They work as threshold switching units
- There are weighted interconnections among the units
- We are able to learn and tune these weights automatically through a training process

A neuron in an ANN looks like this:

input signals → input function (linear) → activation function (nonlinear) → output signal
# Perceptron
- First neural network learning model, from the 1960s
- Simple and limited (single-layer model)
- Still used in current applications (modems, etc.)

- Input
  - different variables, with weights on the edges
- Input function
  - used to aggregate the inputs; usually a weighted sum of the inputs
- Activation function
  - a threshold function, used for binary classification
  - the output is only 1 or 0
  - the sign (step) function can be used as the activation function
  - the sigmoid function can also be used as the activation function

- Sigmoid function
  - Popular for classification, because it is easy to update and learn during the training process (see the sketch below).
- Neural networks can be used for both classification and regression.
  - This can be controlled by applying different activation functions.
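Below is a minimal Python sketch (assuming NumPy; the input values and weights are invented for illustration) of how a single neuron combines a linear input function with either of the two activation functions mentioned above:

```python
# A minimal sketch of a single neuron: weighted-sum input function followed by
# either a hard threshold (step/sign) or a sigmoid activation.
import numpy as np

def step(z, threshold=0.0):
    """Hard-threshold activation: outputs 1 or 0 only."""
    return np.where(z > threshold, 1, 0)

def sigmoid(z):
    """Smooth activation in (0, 1); easy to differentiate, so easy to train."""
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([1.0, 0.0, 1.0])      # input signals (with a bias input x0 = 1)
w = np.array([0.5, -0.2, 0.3])     # weights on the edges (made up for illustration)
z = np.dot(w, x)                   # input function: weighted sum of inputs
print(step(z), sigmoid(z))         # activation function: nonlinear output
```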
# Perceptron Training
It is the simplest ANN model.
- We need to train the model to learn the weights $w_i$, using the update rule $w_i \leftarrow w_i + \eta\,(y - \hat{y})\,x_i$, where
  - $y$ is the real value
  - $\hat{y}$ is the output value (the prediction by the model)
  - $\eta$ is a constant learning rate (typically in $(0, 1]$)

It is a process of iterative learning:
- At the beginning, give random values to the weights $w_i$
- Get the output $\hat{y}$ through the perceptron
- Update the weights $w_i$ by using the update rule
- Stop the learning process by a stopping criterion:
  - the classification error is smaller than a threshold, or
  - the maximal number of learning iterations has been reached
# Perceptron: Example
- Consider learning the logical OR function

| Sample | x0 | x1 | x2 | label |
|---|---|---|---|---|
| 1 | 1 | 0 | 0 | 0 |
| 2 | 1 | 0 | 1 | 1 |
| 3 | 1 | 1 | 0 | 1 |
| 4 | 1 | 1 | 1 | 1 |

Activation function: output 1 if $\sum_i w_i x_i > 0$, otherwise output 0.

We'll use a single perceptron with three inputs, and start with all weights set to 0:

W = <0,0,0>
Example 1
I = <1,0,0>, label = 0, W = <0,0,0>
- Perceptron output ($w \cdot I = 0$) = 0
- It classifies the sample as 0, which is correct, so do nothing.

Example 2
I = <1,0,1>, label = 1, W = <0,0,0>
- Perceptron output ($w \cdot I = 0$) = 0
- It classifies the sample as 0, while it should be 1, so we add the input to the weights: W = <0,0,0> + <1,0,1> = <1,0,1>

Example 3
I = <1,1,0>, label = 1, W = <1,0,1>
- Perceptron output ($w \cdot I = 1$) = 1
- It classifies the sample as 1, which is correct, so do nothing. W = <1,0,1>

Example 4
I = <1,1,1>, label = 1, W = <1,0,1>
- Perceptron output ($w \cdot I = 2$) = 1
- It classifies the sample as 1, which is correct, so do nothing. W = <1,0,1>

The 1st iteration is completed. Repeat until there are no errors.
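A minimal Python sketch (assuming NumPy; not part of the original slides) of the training loop just traced by hand, run on the same OR data set:

```python
# Perceptron training on the OR data from the table above.
# Weights start at 0 and the learning rate is 1, matching the worked example.
import numpy as np

X = np.array([[1, 0, 0],             # x0 is the constant bias input
              [1, 0, 1],
              [1, 1, 0],
              [1, 1, 1]])
y = np.array([0, 1, 1, 1])           # logical OR of x1 and x2

w = np.zeros(3)
eta = 1.0                            # learning rate

for iteration in range(10):          # stopping criterion: max iterations
    errors = 0
    for xi, target in zip(X, y):
        pred = 1 if np.dot(w, xi) > 0 else 0      # threshold activation
        if pred != target:
            w += eta * (target - pred) * xi       # perceptron update rule
            errors += 1
    if errors == 0:                  # stopping criterion: no classification error
        break

print("learned weights:", w)
```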
# Limitations of Perceptron
- It is too simple; it cannot learn complex and effective models
- It assumes the data is linearly separable in the binary classification, but actually it could be non-linear!
  - In SVM, we use a kernel function to map the data to a higher dimension
  - In ANN, we can add more layers!!
# Multi-layer Feed-forward Networks
- Multi-layer feed-forward networks are an extension of the perceptron model. They add hidden layers to the original perceptron.
- Input layer: accepts inputs only
- Hidden layers: neurons with functions
- Output layer: produces outputs
# Training Phase
- The training phase is a typical process of machine learning and optimization
- We need to:
  - Set up a learning objective as a loss function
  - Use an appropriate optimizer to learn the parameters
- It is usually a process of iterative learning
# Loss Function
- The loss function $L(x, y, y')$ is defined as the amount of utility lost by predicting $y'$ when the correct answer is $y$
- Often a simplified version $L(y, y')$ is used, which is independent of $x$
- Three commonly used loss functions:
  - Absolute value loss: $L_{1}(y, y') = |y - y'|$
  - Squared error loss: $L_{2}(y, y') = (y - y')^{2}$
  - 0/1 loss: $L_{0/1}(y, y') = 0$ if $y = y'$, else $1$
- Let $E$ be the set of examples. The total loss is $\sum_{(x, y) \in E} L(y, y')$, where $y'$ is the model's prediction for $x$
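A small Python sketch (assuming NumPy) of the three loss functions, summed over a toy set of examples:

```python
# Absolute value loss, squared error loss, and 0/1 loss over a set of examples.
import numpy as np

def absolute_loss(y, y_pred):
    return np.sum(np.abs(y - y_pred))

def squared_loss(y, y_pred):
    return np.sum((y - y_pred) ** 2)

def zero_one_loss(y, y_pred):
    return np.sum(y != y_pred)          # counts 1 for each wrong prediction

y      = np.array([1, 0, 1, 1])
y_pred = np.array([1, 1, 1, 0])
print(absolute_loss(y, y_pred), squared_loss(y, y_pred), zero_one_loss(y, y_pred))
```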
# Optimizer: Gradient Descent
- Gradient descent is widely used as one of the most popular optimizers in machine learning, especially in ANN learning

# Optimization in Linear Regression
- How to apply gradient descent to minimize the cost function for regression:
  - a closer look at the cost function
  - applying gradient descent to find the minimum of the cost function
# a closer look at the cost function
Hypothesis: $h_{\theta}(x) = \theta_{0} + \theta_{1} x$

Parameters: $\theta_{0}, \theta_{1}$

Cost function (sum of squared errors): $J(\theta_{0}, \theta_{1}) = \frac{1}{2m} \sum_{i=1}^{m} \left(h_{\theta}(x^{(i)}) - y^{(i)}\right)^{2}$

Goal: $\min_{\theta_{0}, \theta_{1}} J(\theta_{0}, \theta_{1})$

Optimization
- There are at least two optimization methods:
  - Least square optimization
  - Optimization based on gradient descent

Least Square Optimization
- Find the optimal point by setting the partial derivatives to zero: $\frac{\partial J}{\partial \theta_{i}} = 0$, where $J$ is the objective function with parameters $\theta_{1}, \theta_{2}, \theta_{3}, \ldots$
- If there are $n$ variables, you will have $n$ equations to be solved
- Drawback: it is complicated if you have many variables
# applying gradient descent to find the minimum of the cost function
Have some function $J(\theta_{0}, \theta_{1})$

Want $\min_{\theta_{0}, \theta_{1}} J(\theta_{0}, \theta_{1})$

Gradient descent algorithm outline:
- Start with some $\theta_{0}, \theta_{1}$
- Keep changing $\theta_{0}, \theta_{1}$ to reduce $J(\theta_{0}, \theta_{1})$, e.g. $\theta_{j} := \theta_{j} - \alpha \frac{\partial}{\partial \theta_{j}} J(\theta_{0}, \theta_{1})$, until we hopefully end up at a minimum
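A minimal Python sketch (assuming NumPy; the data and learning rate are invented for illustration) of batch gradient descent on the linear regression cost function $J(\theta_{0}, \theta_{1})$ above:

```python
# Batch gradient descent for simple linear regression.
import numpy as np

X = np.array([1.0, 2.0, 3.0, 4.0])   # one input variable
y = np.array([2.1, 3.9, 6.2, 8.1])   # targets (roughly y = 2x)

theta0, theta1 = 0.0, 0.0            # start with some theta0, theta1
alpha = 0.01                         # learning rate
m = len(X)

for step in range(2000):             # keep changing theta to reduce J
    predictions = theta0 + theta1 * X
    error = predictions - y
    # partial derivatives of J(theta0, theta1) = (1/2m) * sum(error^2)
    grad0 = np.sum(error) / m
    grad1 = np.sum(error * X) / m
    theta0 -= alpha * grad0
    theta1 -= alpha * grad1

print(theta0, theta1)                # approaches the least-squares fit
```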
# Backpropagation Training
- There are several network structures in neural networks, such as feed-forward neural networks and recurrent neural networks
- Multi-layer feed-forward networks use a forward procedure for predictions, but they are trained by using a backward propagation approach
- These ANNs are also called BP (Backpropagation) Neural Networks
# ANN needs a process of weight training
- A set of examples, each with input vector $x$ and output vector $y$
- Squared error loss: $E = \frac{1}{2} \sum_{k} (y_{k} - a_{k})^{2}$, where $a_{k}$ is the $k$-th output of the neural net
- The weights are adjusted as follows: $w_{ij} \leftarrow w_{ij} - \alpha \frac{\partial E}{\partial w_{ij}}$
- How can we compute the gradient efficiently given an arbitrary network structure?
- Answer: the backpropagation algorithm
# Forward vs Backward in ANN
- Forward phase:
  - Propagate inputs forward to compute the output of each unit
  - Output at unit $j$: $a_{j} = g(in_{j})$, where $in_{j} = \sum_{i} w_{ij} a_{i}$
- Backward phase:
  - Propagate errors backward
  - For an output unit $i$: $\Delta_{i} = g'(in_{i})\,(y_{i} - a_{i})$
  - For a hidden unit $j$: $\Delta_{j} = g'(in_{j}) \sum_{i} w_{ji} \Delta_{i}$
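A minimal Python sketch (assuming NumPy; the layer sizes, data, and learning rate are invented for illustration) of one forward pass and one backward pass for a tiny feed-forward network with sigmoid units, following the formulas above:

```python
# One forward and one backward pass for a tiny 2-3-1 feed-forward network.
import numpy as np

def g(z):  return 1.0 / (1.0 + np.exp(-z))      # sigmoid activation
def gp(z): return g(z) * (1.0 - g(z))           # its derivative g'(z)

rng = np.random.default_rng(0)
W1 = rng.normal(size=(3, 2))        # weights input -> hidden
W2 = rng.normal(size=(1, 3))        # weights hidden -> output
x = np.array([0.5, -1.0])           # one input vector
y = np.array([1.0])                 # its target output
alpha = 0.1                         # learning rate

# Forward phase: propagate inputs forward
in_hidden = W1 @ x                  # in_j = sum_i w_ij * a_i
a_hidden = g(in_hidden)             # a_j = g(in_j)
in_out = W2 @ a_hidden
a_out = g(in_out)

# Backward phase: propagate errors backward
delta_out = gp(in_out) * (y - a_out)                  # output unit: g'(in)*(y - a)
delta_hidden = gp(in_hidden) * (W2.T @ delta_out)     # hidden unit: g'(in)*sum(w*delta)

# Gradient-based weight updates
W2 += alpha * np.outer(delta_out, a_hidden)
W1 += alpha * np.outer(delta_hidden, x)
```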
# Neural Networks and Deep Learning
To make an ANN more powerful, there are two solutions:
- Add more neurons in the hidden layer
- Add more hidden layers

Deep Learning
- A traditional ANN only has 3 layers. Deep learning utilizes neural networks with multiple layers
- Deep learning has more structures for neural networks, such as ANN, Convolutional Neural Networks (CNN), Recurrent Neural Networks (RNN), and so forth
- Deep learning is not related to neural networks only. It also correlates with computing, such as GPUs

ANN vs Deep Learning
# Ensembles of Classifiers
- The basic idea is to learn a set of classifiers (experts) and to allow them to vote.
- Advantage: improvement in predictive accuracy.
- Disadvantage: it is difficult to understand an ensemble of classifiers.
# Ensemble Methods
# Bagging
Process in bagging
- Sample several training sets of size $n$ (instead of just having one training set of size $n$)
- Build a classifier for each training set
- Combine the classifiers' predictions by voting or averaging

Bagging classifiers
- Classifier generation
Let n be the size of the training set.
For each of t iterations:
  Sample n instances with replacement from the training set.
  Apply the learning algorithm to the sample.
  Store the resulting classifier.
- Classification
For each of the t classifiers:
  Predict the class of the instance using the classifier.
Return the class that was predicted most often.
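A minimal Python sketch (assuming NumPy and scikit-learn, neither of which is named in the slides) of this bagging procedure, using decision trees as the base classifiers:

```python
# Bagging: t bootstrap samples of size n, one tree per sample, majority vote.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, t=10, seed=0):
    rng = np.random.default_rng(seed)
    n = len(X)                                   # size of the training set
    classifiers = []
    for _ in range(t):
        idx = rng.integers(0, n, size=n)         # sample n instances with replacement
        clf = DecisionTreeClassifier().fit(X[idx], y[idx])
        classifiers.append(clf)                  # store the resulting classifier
    return classifiers

def bagging_predict(classifiers, X):
    votes = np.array([clf.predict(X) for clf in classifiers])
    # return the class predicted most often (labels assumed non-negative ints)
    return np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, votes)

X, y = make_classification(n_samples=200, random_state=0)
models = bagging_fit(X, y, t=10)
print(bagging_predict(models, X[:5]))
```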
Voting and Averaging
- Voting is used for classification, and averaging is used for regression
- Voting: hard and soft voting

Hard voting
Predictions:
- Classifier 1 predicts class A
- Classifier 2 predicts class B
- Classifier 3 predicts class B

2/3 of the classifiers predict class B, so class B is the ensemble decision.

Soft voting
Predictions (identical to the earlier example, but now in terms of probabilities; shown only for class A here because the problem is binary):
- Classifier 1 predicts class A with probability 99%
- Classifier 2 predicts class A with probability 49%
- Classifier 3 predicts class A with probability 49%

The average probability of belonging to class A across the classifiers is (99 + 49 + 49) / 3 = 65.67%. Therefore, class A is the ensemble decision.
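A small Python sketch (assuming NumPy) of the hard and soft voting decisions in the example above:

```python
# Hard vs. soft voting with three classifiers on a binary problem (classes A, B).
import numpy as np

# hard voting: each classifier casts one vote
votes = ["A", "B", "B"]
hard_decision = max(set(votes), key=votes.count)      # -> "B"

# soft voting: average the predicted probabilities for class A
probs_A = np.array([0.99, 0.49, 0.49])
soft_decision = "A" if probs_A.mean() > 0.5 else "B"  # mean = 0.6567 -> "A"

print(hard_decision, soft_decision)
```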
Why does bagging work?
- Bagging reduces variance by voting / averaging, thus reducing the overall expected error
- In the case of classification, there are pathological situations where the overall error might increase
- Usually, the more classifiers the better
# Boosting
- Boosting also uses voting/averaging, but models are weighted according to their performance
- It is an iterative procedure: new models are influenced by the performance of previously built ones
  - A new model is encouraged to become an expert for instances classified incorrectly by earlier models
  - More weight is assigned to the misclassified instances to improve the classification iteratively
- There are several variants of this algorithm

AdaBoost.M1
- Classifier generation
Assign equal weight to each training instance.
For each of t iterations:
  Learn a classifier from the weighted dataset.
  Compute the error e of the classifier on the weighted dataset.
  If e equals zero, or e is greater than or equal to 0.5:
    Terminate classifier generation.
  For each instance in the dataset:
    If the instance is classified correctly by the classifier:
      Multiply the weight of the instance by e / (1 - e).
  Normalize the weights of all instances.
- Classification
Assign a weight of zero to all classes.
For each of the t classifiers:
  Add -log(e / (1 - e)) to the weight of the class predicted by the classifier.
Return the class with the highest weight.
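A short usage sketch (assuming scikit-learn, which is not named in the slides) of boosting with AdaBoost; each new weak learner focuses on the instances the previous ones misclassified:

```python
# AdaBoost over decision stumps (scikit-learn's default weak learner).
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

boost = AdaBoostClassifier(n_estimators=50, random_state=0)  # t = 50 iterations
boost.fit(X, y)
print(boost.score(X, y))
```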
# Random Forest
- Random forest is a bagging method which uses decision trees as the classifiers
- The workflow in random forest is the same as the one in bagging
- In bagging, we can use any classifier; in random forest, we use decision trees

Classifier generation
Let n be the size of the training set.
For each of t iterations:
  (1) Sample n instances with replacement from the training set.
  (2) Learn a decision tree such that the variable for any new node is the best variable among m randomly selected variables.
  (3) Store the resulting decision tree.

Classification
For each of the t decision trees:
  Predict the class of the instance.
Return the class that was predicted most often.
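A short usage sketch (assuming scikit-learn) of a random forest; `max_features` plays the role of the m randomly selected variables per split:

```python
# Random forest: t trees on bootstrap samples, random feature subsets per split.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

forest = RandomForestClassifier(
    n_estimators=100,      # t: number of trees
    max_features="sqrt",   # m: variables considered at each split
    bootstrap=True,        # sample n instances with replacement per tree
    random_state=0,
)
forest.fit(X, y)
print(forest.predict(X[:5]))   # majority vote over the trees
```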
# Semi-Supervised Classification
- Classification requires labeled data
- Data labeling is a complicated and expensive process. It is not guaranteed that we have enough high-quality labels
- Labels may be hard to get:
  - Human labeling is slow and boring
  - It may require expert knowledge
  - It may require special or expensive devices

Goal: use both labeled and unlabeled data to build better classifiers (than using labeled data alone).

Notation:
- input $x$, label $y$
- classifier $f: \mathcal{X} \mapsto \mathcal{Y}$
- labeled data $(X_l, Y_l) = \{(x_i, y_i)\}_{i=1}^{l}$
- unlabeled data $X_u = \{x_j\}_{j=l+1}^{n}$
- usually $l \ll n$
# Solutions: Self-training
Self-training is the simplest semi-supervised learning method.
- Algorithm: Self-training
  - Pick your favorite classification method. Train a classifier $f$ from the labeled data $(X_l, Y_l)$.
  - Use $f$ to classify all unlabeled items $x \in X_u$.
  - Pick the item $x^*$ with the highest confidence and add $(x^*, f(x^*))$ to the labeled data.
  - Repeat.
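A minimal Python sketch (assuming NumPy and scikit-learn; logistic regression and the batch size are arbitrary choices) of this self-training loop:

```python
# Self-training: retrain on the labeled pool, move the most confident
# unlabeled items into the pool, and repeat.
import numpy as np
from sklearn.linear_model import LogisticRegression

def self_training(X_l, y_l, X_u, rounds=10, per_round=5):
    X_l, y_l, X_u = X_l.copy(), y_l.copy(), X_u.copy()
    for _ in range(rounds):
        if len(X_u) == 0:
            break
        f = LogisticRegression().fit(X_l, y_l)        # train on labeled data
        probs = f.predict_proba(X_u)                   # classify unlabeled items
        confidence = probs.max(axis=1)
        top = np.argsort(confidence)[-per_round:]      # most confident items
        X_l = np.vstack([X_l, X_u[top]])               # add them as labeled data
        y_l = np.concatenate([y_l, f.predict(X_u[top])])
        X_u = np.delete(X_u, top, axis=0)
    return LogisticRegression().fit(X_l, y_l)
```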
Pros
- Simple
- Applies to almost all existing classifiers

Cons
- Mistakes reinforce themselves. Heuristics against pitfalls:
  - 'Un-label' a training point if its classification confidence drops below a threshold
  - Randomly perturb learning parameters
# Solutions: Co-training
- Your data can be split into different views
- A view is defined by a different subset of the features
- Each item is represented by two kinds of features, e.g.:
  - $x^{(1)}$ = image features
  - $x^{(2)}$ = web page text
  - This is a natural feature split (or multiple views)

Co-training idea:
- Train an image classifier and a text classifier
- The two classifiers teach each other

Algorithm: Co-training
- Train two classifiers: $f^{(1)}$ from $(X_l^{(1)}, Y_l)$ and $f^{(2)}$ from $(X_l^{(2)}, Y_l)$.
- Classify $X_u$ with $f^{(1)}$ and $f^{(2)}$ separately.
- Add $f^{(1)}$'s $k$ most-confident $(x, f^{(1)}(x))$ to $f^{(2)}$'s labeled data.
- Add $f^{(2)}$'s $k$ most-confident $(x, f^{(2)}(x))$ to $f^{(1)}$'s labeled data.
- Repeat.
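A minimal Python sketch (assuming NumPy and scikit-learn) of a single co-training round, where the two feature views teach each other their most confident predictions:

```python
# One co-training round: each view's classifier labels items for the other view.
import numpy as np
from sklearn.linear_model import LogisticRegression

def co_training_round(X1_l, X2_l, y_l, X1_u, X2_u, k=3):
    f1 = LogisticRegression().fit(X1_l, y_l)        # classifier on view 1
    f2 = LogisticRegression().fit(X2_l, y_l)        # classifier on view 2

    # each classifier picks its k most confident unlabeled items
    top1 = np.argsort(f1.predict_proba(X1_u).max(axis=1))[-k:]
    top2 = np.argsort(f2.predict_proba(X2_u).max(axis=1))[-k:]

    # f1 teaches f2, and f2 teaches f1 (labels come from the other view's model)
    X2_l = np.vstack([X2_l, X2_u[top1]])
    X1_l = np.vstack([X1_l, X1_u[top2]])
    y_l_for_f2 = np.concatenate([y_l, f1.predict(X1_u[top1])])
    y_l_for_f1 = np.concatenate([y_l, f2.predict(X2_u[top2])])
    return (X1_l, y_l_for_f1), (X2_l, y_l_for_f2)
```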
Pros
- Simple. Applies to almost all existing classifiers
- Less sensitive to mistakes than self-training

Cons
- The feature split may not exist
- Models using BOTH features should do better
# Multi-Label Classifications
Binary classification: Is this a picture of the sea?
Multi-class classification: What is this a picture of?
Multi-label classification: Which labels are relevant to this picture?
i.e., multiple labels per instance instead of a single label!
# Applications
Images are labelled to indicate
- multiple concepts
- multiple objects
- multiple people
e.g., Scene data with concept labels
Labelling music/tracks with genres / voices, concepts, etc.
- e.g., Music dataset, audio tracks labelled with different moods, among:
- amazed-surprised,
- happy-pleased,
- relaxing-calm,
- quiet-still,
- sad-lonely,
- angry-aggressive
# Example
Difference in data sets
Table: Single-label
Table: Multi-label
We usually convert labels to binary labels
# Solutions
# Transformation Based Methods
Transform the task to binary/multi-class classifications
# Binary Relevance
- If there are $N$ labels, we have $N$ binary classification problems, one per label
- Drawback: it ignores the label dependence
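A minimal Python sketch (assuming NumPy and scikit-learn; logistic regression is an arbitrary base classifier) of binary relevance:

```python
# Binary relevance: one independent binary classifier per label column of Y.
import numpy as np
from sklearn.linear_model import LogisticRegression

def binary_relevance_fit(X, Y):
    # Y is an (n_samples, n_labels) binary matrix; train one classifier per label
    return [LogisticRegression().fit(X, Y[:, j]) for j in range(Y.shape[1])]

def binary_relevance_predict(classifiers, X):
    # stack the per-label predictions back into a binary label matrix
    return np.column_stack([clf.predict(X) for clf in classifiers])
```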
# Classifier Chains
- Classifier chains build the model as a chain, taking label correlations into consideration
- The features are used to perform binary classification on the 1st label; the prediction for the 1st label is then reused as an extra feature in the 2nd step to predict the 2nd label
- Repeat the process above until all of the labels are predicted
- In short: use previous prediction results as new features

Drawbacks of Classifier Chains
- It is difficult to define the sequence of the chain, though there are some methods (e.g., information gain)
- If the previous predictions are incorrect, the following predictions may not be right either.
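A minimal Python sketch (assuming NumPy and scikit-learn) of a classifier chain with a fixed label order:

```python
# Classifier chain: each label's classifier sees the original features plus the
# predictions already made for the earlier labels in the chain.
import numpy as np
from sklearn.linear_model import LogisticRegression

def chain_fit(X, Y):
    chain, X_aug = [], X.copy()
    for j in range(Y.shape[1]):                      # fixed label order here
        clf = LogisticRegression().fit(X_aug, Y[:, j])
        chain.append(clf)
        X_aug = np.column_stack([X_aug, Y[:, j]])    # true label as extra feature
    return chain

def chain_predict(chain, X):
    preds, X_aug = [], X.copy()
    for clf in chain:
        p = clf.predict(X_aug)                       # reuse previous predictions
        preds.append(p)
        X_aug = np.column_stack([X_aug, p])
    return np.column_stack(preds)
```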
# Label Powerset
- Each subset of the label set becomes a single class
- Assign a binary or multi-class classifier to them
- Find a way to aggregate the results
- Transform the dataset into a multi-class problem, where the class can take one value per possible label combination, and train any off-the-shelf multi-class classifier

Drawbacks of Label Powerset
- Too many subsets if there are several labels
- Highly likely to have an imbalance issue
- Overfitting: how to predict new values/labels (unseen label combinations)?
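A minimal Python sketch (assuming NumPy and scikit-learn) of the label powerset transformation, mapping each observed label combination to one multi-class label:

```python
# Label powerset: each distinct row of the label matrix becomes one class
# of an ordinary multi-class problem.
import numpy as np
from sklearn.linear_model import LogisticRegression

def label_powerset_fit(X, Y):
    combos, y_multi = np.unique(Y, axis=0, return_inverse=True)  # subset -> class id
    clf = LogisticRegression(max_iter=1000).fit(X, y_multi)
    return clf, combos

def label_powerset_predict(clf, combos, X):
    return combos[clf.predict(X)]        # map class ids back to label subsets
```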
# Adaptation Based Methods
Develop new algorithms to solve the problem directly

# Algorithm adaptation techniques
- MLkNN. For each test instance:
  - Retrieve the top-k nearest neighbors of the instance
  - Compute the frequency of occurrence of each label among the neighbors
  - Assign a probability to each label and select the labels by using a probability cut-off value
# Evaluation of multi-label learning
Notes
- Both transformation and adaptation methods are methods to solve the MLC problem
- They are not classification algorithms themselves
- For each method, you can use any traditional binary/multi-class classification algorithm to produce the predictions
- There are multiple labels in the MLC problem
- Traditional evaluation metrics for classification may not work for MLC
- We need to develop new evaluation metrics

# Hamming Loss
Consider the misclassification in each bit (each label separately):

$\text{HammingLoss} = \frac{1}{L \cdot N} \sum_{i=1}^{L} \sum_{j=1}^{N} \mathbb{1}\left[\hat{y}_{ij} \neq y_{ij}\right]$

where N = # of labels and L = # of data rows.
# 0/1 Loss
Consider the misclassification over the whole label set: an instance counts as wrong unless its entire predicted label set exactly matches the true label set, i.e. $\text{0/1 Loss} = \frac{1}{L} \sum_{i=1}^{L} \mathbb{1}\left[\hat{y}_{i} \neq y_{i}\right]$.
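A small Python sketch (assuming NumPy; the label matrices are invented for illustration) computing both Hamming loss and 0/1 loss:

```python
# Hamming loss and 0/1 loss on a toy multi-label example: L = 3 rows, N = 4 labels.
import numpy as np

Y_true = np.array([[1, 0, 1, 0],
                   [0, 1, 0, 0],
                   [1, 1, 0, 1]])
Y_pred = np.array([[1, 0, 0, 0],
                   [0, 1, 0, 0],
                   [1, 0, 0, 1]])

hamming = np.mean(Y_true != Y_pred)                    # per-bit misclassification
zero_one = np.mean(np.any(Y_true != Y_pred, axis=1))   # whole label set must match
print(hamming, zero_one)   # 2 wrong bits of 12 -> 0.1667 ; 2 rows wrong -> 0.6667
```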
# Other Metrics
- JACCARD INDEX
  - often called multi-label ACCURACY
- RANK LOSS
  - average fraction of label pairs not correctly ordered
- ONE ERROR
  - whether the top-ranked label is not in the set of true labels
- COVERAGE
  - average "depth" to cover all true labels
- LOG LOSS
  - i.e., cross entropy
- PRECISION
  - predicted positive labels that are relevant
- RECALL
  - relevant labels which were predicted
- PRECISION VS. RECALL curves
- F-MEASURE
  - micro-averaged ('global' view)
  - macro-averaged by label (ordinary averaging of a binary measure; changes in infrequent labels have a big impact)
  - macro-averaged by example (one example at a time, average across examples)
# Tools
- Mulan
  - Java based
  - Reuses the Weka library
  - No UI
  - http://mulan.sourceforge.net/
- Meka
  - Similar to Weka
  - Java based
  - With UI
  - http://meka.sourceforge.net/
# Classification: Summary
We learned different algorithms:
- No learning process: KNN and Naïve Bayes
- Learning based: Logistic regression, Decision tree, SVM, Neural Networks
- Ensemble methods: bagging, boosting, RandomForest

For each algorithm:
- Understand how it works
- Know the requirements on the data; know how to prepare a preprocessed data set
- Know which parameters need to be tuned
- Know the solutions for overfitting
- Which algorithm is the best?
  - It varies from data to data
  - We need to tune parameters to tune up the model
  - We need to compare different classification models
- General issue: imbalance in labels