# Neural Networks

# Artificial Neural Networks (ANN)

  • Researchers took inspiration from biological neuron systems to build the ANN

    • An ANN contains many neuron units
    • They are connected within a structure
    • They work as threshold switching units
    • There are weighted interconnections among units
    • These weights can be learned and tuned automatically by a training process
  • A neuron in an ANN looks like this:

    input signals → input function (linear) → activation function (nonlinear) → output signal
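
The neuron pipeline above can be sketched in a few lines of Python. This is only an illustration: the function name `neuron_output`, the optional `bias` term, and the choice of a sigmoid activation are assumptions, not something the notes prescribe.

```python
import math

def neuron_output(inputs, weights, bias=0.0):
    """Linear input function (weighted sum) followed by a nonlinear activation (sigmoid)."""
    s = sum(w * x for w, x in zip(weights, inputs)) + bias  # input function
    return 1.0 / (1.0 + math.exp(-s))                       # activation function

# Two input signals with hypothetical weights 0.5 and -0.3.
out = neuron_output([1.0, 0.0], [0.5, -0.3])
print(round(out, 4))
```

Swapping the last line of `neuron_output` for a threshold test would turn the same unit into a threshold switching unit.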

# Perceptron

  • First neural network learning model, dating from the 1960s
  • Simple and limited (single-layer models)
  • Still used in current applications (modems, etc.)
  • Input: the different $x$ variables, each with a weight on its edge
  • Input function: aggregates the inputs, usually as a weighted sum of its inputs
  • Activation function
    • It is a threshold function
    • For the purpose of binary classification
      • The output is only 1 or 0
      • A sign (step) function can be used as the activation function
      • A sigmoid can also be used as the activation function
  • Sigmoid function (S-shaped function): popular for classification because it is smooth and differentiable, which makes the weights easy to update during training.

  • Neural networks can be used for both classification and regression.
  • Which of the two is performed is controlled by the choice of activation function.

# Perceptron Training

  • It is the simplest ANN model

    • We need to train the model to learn the weights $w$, where

      $$w_{i} \leftarrow w_{i}+\Delta w_{i}$$

      $$\Delta w_{i}=\eta(t-o) x_{i}$$

    • $t$ is the real (target) value
    • $o$ is the output value (prediction by the model)
    • $\eta$ is a constant value in $[0, 1]$, the learning rate
  • It is a process of iterative learning

    • At the beginning, give random values to $w$
    • Get the output $o$ through the perceptron
    • Update $w$ by using the update rule
    • Stop the learning process by a stopping criterion
      • Classification error is smaller than a threshold
      • Or, the maximum number of learning iterations has been reached
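
The iterative procedure above could be sketched as follows. The names `predict` and `train_perceptron`, the parameters `max_iters`, `max_error` and `seed`, and the logical AND data set are all assumptions made for the illustration; the first component of every input is assumed to be a constant bias input of 1.

```python
import random

def predict(w, x):
    """Threshold unit: output 1 if the weighted sum is positive, else 0."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else 0

def train_perceptron(data, eta=0.1, max_iters=1000, max_error=0.0, seed=0):
    """data is a list of (x, t) pairs; x[0] is assumed to be a bias input of 1."""
    rng = random.Random(seed)
    w = [rng.uniform(-0.5, 0.5) for _ in data[0][0]]   # random initial weights
    for _ in range(max_iters):                          # stop: iteration limit
        errors = 0
        for x, t in data:
            o = predict(w, x)
            if o != t:
                errors += 1
            # w_i <- w_i + eta * (t - o) * x_i (no change when t == o)
            w = [wi + eta * (t - o) * xi for wi, xi in zip(w, x)]
        if errors / len(data) <= max_error:             # stop: error threshold
            break
    return w

# Logical AND, with a leading bias input of 1 in every example.
and_data = [([1, 0, 0], 0), ([1, 0, 1], 0), ([1, 1, 0], 0), ([1, 1, 1], 1)]
w = train_perceptron(and_data)
print([predict(w, x) for x, _ in and_data])
```

Both stopping criteria from the list appear in the loop: the pass limit `max_iters` and the error-rate threshold `max_error`.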

# Perceptron: Example

  • Consider learning the logical OR function
  • Activation function

    $$S=\sum_{k=0}^{n} w_{k} x_{k}, \quad \text{if } S>0 \text{ then } O=1 \text{ else } O=0$$

  • We’ll use a single perceptron with three inputs. In the examples below the first input is always 1, so it acts as a bias input.

  • We’ll start with all weights at 0: W = <0,0,0>

  • Example 1: I = <1,0,0>, label = 0, W = <0,0,0>

    • Perceptron ($1 \times 0 + 0 \times 0 + 0 \times 0 = 0$, so $S=0$) output = 0
    • It classifies it as 0, which is correct, so do nothing
  • Example 2: I = <1,0,1>, label = 1, W = <0,0,0>

    • Perceptron ($1 \times 0 + 0 \times 0 + 1 \times 0 = 0$) output = 0
    • It classifies it as 0 while it should be 1, so we add the input to the weights (the update rule with $\eta = 1$ and $t - o = 1$): W = <0,0,0> + <1,0,1> = <1,0,1>
  • Example 3: I = <1,1,0>, label = 1, W = <1,0,1>

    • Perceptron ($1 \times 1 + 1 \times 0 + 0 \times 1 = 1 > 0$) output = 1
    • It classifies it as 1, which is correct, so do nothing: W = <1,0,1>
  • Example 4: I = <1,1,1>, label = 1, W = <1,0,1>

    • Perceptron ($1 \times 1 + 1 \times 0 + 1 \times 1 = 2 > 0$) output = 1
    • It classifies it as 1, which is correct, so do nothing: W = <1,0,1>

The 1st iteration is completed.
Repeat until there are no errors.
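
The worked example can be replayed in code to see how many passes are needed. This sketch assumes a learning rate of 1 (so a mistake adds or subtracts the input vector) and treats the first input as a constant bias of 1; the variable names are illustrative.

```python
def predict(w, x):
    """Output 1 if S = sum(w_k * x_k) > 0, else 0."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else 0

# Logical OR; the first component of each input is the bias input 1.
or_data = [([1, 0, 0], 0), ([1, 0, 1], 1), ([1, 1, 0], 1), ([1, 1, 1], 1)]

w = [0, 0, 0]
history = []                      # weights at the end of each pass
while True:
    errors = 0
    for x, t in or_data:
        o = predict(w, x)
        if o != t:
            errors += 1
            # eta = 1: add the input when t=1, o=0; subtract it when t=0, o=1
            w = [wi + (t - o) * xi for wi, xi in zip(w, x)]
    history.append(list(w))
    if errors == 0:               # repeat until a full pass makes no errors
        break

print(history)  # [[1, 0, 1], [1, 1, 1], [0, 1, 1], [0, 1, 1]]
```

The first pass ends with W = <1,0,1>, as in the walkthrough, and the fourth pass is the first one without errors.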

# Limitations of Perceptron

  • It is too simple and cannot learn complex, effective models
  • It assumes the data are linearly separable in binary classification, but in practice they may not be!
    • With SVM, we use a kernel function to map the data to a higher dimension
    • With ANN, we can add more layers!

# Multi-layer Feed-forward Networks

  • A multi-layer feed-forward network is an extension of the perceptron model. It adds hidden layers to the original perceptron.
  • Input layer: accepts inputs only
  • Hidden layers: neurons with functions
  • Output layer: produces outputs

# Training Phase

  • The training phase is a typical process of machine learning and optimization
  • We need to
    • Set up a learning objective as a loss function
    • Use an appropriate optimizer to learn the parameters
    • It is usually a process of iterative learning

# Loss Function

  • The loss function $L\left(x, y, y^{\prime}\right)$ is defined as the amount of utility lost by predicting $h(x)=y^{\prime}$ when the correct answer is $f(x)=y$
  • Often a simplified version is used, $L\left(y, y^{\prime}\right)$, that is independent of $x$
  • Three commonly used loss functions:
    • Absolute value loss: $L_{1}\left(y, y^{\prime}\right)=\left|y-y^{\prime}\right|$
    • Squared error loss: $L_{2}\left(y, y^{\prime}\right)=\left(y-y^{\prime}\right)^{2}$
    • 0/1 loss: $L_{0 / 1}\left(y, y^{\prime}\right)=0$ if $y=y^{\prime}$, else $1$
  • Let $E$ be the set of examples. Total loss $L(E)=\sum_{e \in E} L(e)$
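
As a small illustration, the three losses and the total loss over a set of examples might be written like this. The function names and the three example pairs are made up for the sketch.

```python
def absolute_loss(y, y_pred):
    return abs(y - y_pred)

def squared_loss(y, y_pred):
    return (y - y_pred) ** 2

def zero_one_loss(y, y_pred):
    return 0 if y == y_pred else 1

def total_loss(loss, examples):
    """Sum a per-example loss over a list of (y, y_pred) pairs."""
    return sum(loss(y, y_pred) for y, y_pred in examples)

examples = [(1, 1), (0, 1), (3, 1)]          # hypothetical (y, y') pairs
print(total_loss(absolute_loss, examples))   # 3
print(total_loss(squared_loss, examples))    # 5
print(total_loss(zero_one_loss, examples))   # 2
```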

# Optimizer: Gradient Descent

  • Gradient descent is one of the most widely used optimizers in machine learning, especially in ANN learning

# Optimization In Linear Regression

  • How to apply gradient descent to minimize the cost function for regression
    1. A closer look at the cost function
    2. Applying gradient descent to find the minimum of the cost function
# A closer look at the cost function
  • Hypothesis:

    $$h_{\theta}(x)=\theta_{0}+\theta_{1} x$$

  • Parameters:

    $$\theta_{0}, \theta_{1}$$

  • Cost Function:
    Sum of squared errors

    $$J\left(\theta_{0}, \theta_{1}\right)=\frac{1}{2 m} \sum_{i=1}^{m}\left(h_{\theta}\left(x^{(i)}\right)-y^{(i)}\right)^{2}$$

  • Goal:

    $$\underset{\theta_{0}, \theta_{1}}{\operatorname{minimize}} \; J\left(\theta_{0}, \theta_{1}\right)$$

  • Optimization

    • There are at least two optimization methods
      • Least squares optimization
      • Optimization based on gradient descent
  • Least Squares Optimization

    • Find the optimal point:
      $\frac{\partial}{\partial \theta_{j}} J=0$, where $J$ is the objective function with parameters $\theta_{1}, \theta_{2}, \theta_{3}, \ldots$
    • $j=1,2,3, \ldots, N+1$, assuming you have $N$ $x$ variables
    • Therefore, you will have $N+1$ equations to solve
    • Drawback: it is complicated if you have many $x$ variables
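
For a single $x$ variable, setting both partial derivatives of $J$ to zero gives two equations with a well-known closed-form solution. The sketch below assumes the one-variable hypothesis $h_{\theta}(x)=\theta_{0}+\theta_{1} x$; the function name `least_squares_fit` and the data are illustrative.

```python
def least_squares_fit(xs, ys):
    """Solve dJ/dtheta0 = 0 and dJ/dtheta1 = 0 for h(x) = theta0 + theta1 * x."""
    m = len(xs)
    mean_x = sum(xs) / m
    mean_y = sum(ys) / m
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    theta1 = sxy / sxx                   # slope
    theta0 = mean_y - theta1 * mean_x    # intercept
    return theta0, theta1

# Points generated from y = 1 + 2x, so the fit should recover (1, 2).
theta0, theta1 = least_squares_fit([0, 1, 2, 3], [1, 3, 5, 7])
print(theta0, theta1)
```

With many $x$ variables the same idea leads to a system of $N+1$ linear equations, which is where the drawback noted above comes from.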
# Applying gradient descent to find the minimum of the cost function
  • Have some function $J\left(\theta_{0}, \theta_{1}\right)$

  • Want $\min _{\theta_{0}, \theta_{1}} J\left(\theta_{0}, \theta_{1}\right)$

  • Gradient descent algorithm outline:

    • Start with some $\theta_{0}, \theta_{1}$
    • Keep changing $\theta_{0}, \theta_{1}$ to reduce $J\left(\theta_{0}, \theta_{1}\right)$ until we hopefully end up at a minimum
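
A minimal batch gradient descent for the squared-error cost $J$ could look like the following. The step size `alpha`, the fixed iteration count `iters`, the starting point (0, 0), and the one-variable hypothesis $\theta_{0}+\theta_{1} x$ are assumptions chosen for the sketch.

```python
def gradient_descent(xs, ys, alpha=0.1, iters=5000):
    """Minimise J = 1/(2m) * sum((theta0 + theta1*x - y)^2) by batch gradient descent."""
    m = len(xs)
    theta0, theta1 = 0.0, 0.0                   # start with some theta0, theta1
    for _ in range(iters):
        errs = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        grad0 = sum(errs) / m                               # dJ/dtheta0
        grad1 = sum(e * x for e, x in zip(errs, xs)) / m    # dJ/dtheta1
        # update both parameters simultaneously, moving against the gradient
        theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1
    return theta0, theta1

# Points generated from y = 1 + 2x.
theta0, theta1 = gradient_descent([0, 1, 2, 3], [1, 3, 5, 7])
print(round(theta0, 4), round(theta1, 4))
```

If `alpha` is too large the iterates can diverge instead of settling at a minimum, so in practice the step size needs tuning.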

# Backpropagation Training

  • There are several network structures in neural networks, such as feed-forward neural networks and recurrent neural networks
  • Multi-layer feed-forward networks use a forward procedure for predictions,
    but they are trained by using a backward propagation approach
  • These ANNs are also called BP (Backpropagation) Neural Networks

# ANN needs a process of weight training

  • A set of examples, each with input vector $x$ and output vector $y$
  • Squared error loss: $Loss=\sum_{k} Loss_{k}$, $Loss_{k}=\left(y_{k}-a_{k}\right)^{2}$, where $a_{k}$ is the $k$-th output of the neural net
  • The weights are adjusted as follows:

$$w_{i j} \leftarrow w_{i j}-\alpha \, \partial Loss / \partial w_{i j}$$

  • How can we compute the gradient efficiently given an arbitrary network structure?
  • Answer: backpropagation algorithm

# Forward vs Backward in ANN

Forward phase:
  • Propagate inputs forward to compute the output of each unit
  • Output $a_{j}$ at unit $j$: $a_{j}=g\left(in_{j}\right)$ where $in_{j}=\sum_{i} w_{i j} a_{i}$.
Backward phase:
  • Propagate errors backward
  • For an output unit $j$: $\Delta_{j}=g^{\prime}\left(in_{j}\right)\left(y_{j}-a_{j}\right)$
  • For a hidden unit $i$: $\Delta_{i}=g^{\prime}\left(in_{i}\right) \sum_{j} w_{i j} \Delta_{j}$.
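
The two phases can be traced on a tiny network with two inputs, two hidden units and one output. Everything concrete here is an assumption for the sketch: the sigmoid as $g$, the starting weights, the step size, and the update $w_{ij} \leftarrow w_{ij} + \alpha\, a_{i} \Delta_{j}$, which is the gradient rule for the squared error with the constant factor 2 absorbed into $\alpha$.

```python
import math

def g(z):
    return 1.0 / (1.0 + math.exp(-z))

def g_prime(z):
    s = g(z)
    return s * (1.0 - s)

def forward(x, w_hidden, w_out):
    """w_hidden[i][j]: weight from input i to hidden unit j; w_out[j]: hidden unit j to the output."""
    n_hidden = len(w_out)
    in_h = [sum(w_hidden[i][j] * x[i] for i in range(len(x))) for j in range(n_hidden)]
    a_h = [g(z) for z in in_h]
    in_o = sum(w_out[j] * a_h[j] for j in range(n_hidden))
    return in_h, a_h, in_o, g(in_o)

def backprop_step(x, y, w_hidden, w_out, alpha=0.5):
    in_h, a_h, in_o, a_o = forward(x, w_hidden, w_out)          # forward phase
    delta_o = g_prime(in_o) * (y - a_o)                          # output unit
    delta_h = [g_prime(in_h[j]) * w_out[j] * delta_o             # hidden units
               for j in range(len(w_out))]
    new_w_out = [w_out[j] + alpha * a_h[j] * delta_o for j in range(len(w_out))]
    new_w_hidden = [[w_hidden[i][j] + alpha * x[i] * delta_h[j]
                     for j in range(len(w_out))] for i in range(len(x))]
    return new_w_hidden, new_w_out

x, y = [1.0, 0.0], 1.0                       # one hypothetical training example
w_hidden = [[0.1, -0.2], [0.4, 0.3]]         # hypothetical starting weights
w_out = [0.2, -0.1]
loss_before = (y - forward(x, w_hidden, w_out)[3]) ** 2
w_hidden, w_out = backprop_step(x, y, w_hidden, w_out)
loss_after = (y - forward(x, w_hidden, w_out)[3]) ** 2
print(loss_before > loss_after)
```

The hidden deltas are computed from the old output weights before any weight is changed, which is what lets one backward sweep serve every layer.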

# Forward in ANN:

# Backward in ANN

# Neural Networks and Deep Learning

  • To make ANN more powerful, there are two solutions

    • Add more neurons in the hidden layer
    • Add more hidden layers
  • Deep Learning

    • A traditional ANN has only 3 layers. Deep learning utilizes neural networks with many more layers
    • Deep learning has more structures for neural networks, such as ANN, Convolutional Neural Networks (CNN), Recurrent Neural Networks (RNN), and so forth
    • Deep learning is not related to neural networks only. It also depends on computing hardware, such as GPUs
  • ANN vs Deep Learning

# Ensembles of Classifiers

  • The basic idea is to learn a set of classifiers (experts) and to allow them to vote.
  • Advantage: improvement in predictive accuracy.
  • Disadvantage: it is difficult to understand an ensemble of classifiers.

# Ensemble Methods

# Bagging

  • Process in bagging

    • Sample several training sets of size $n$ (instead of just having one training set of size $n$)
    • Build a classifier for each training set
    • Combine the classifiers’ predictions by voting or averaging
  • Bagging classifiers

    • Classifier generation
    Let n be the size of the training set.
    For each of t iterations: 
        Sample n instances with replacement from the training set.
        Apply the learning algorithm to the sample.
        Store the resulting classifier.
    • classification
    For each of the t classifiers:
        Predict class of instance using classifier.
    Return class that was predicted most often.
  • Voting and Averaging

    • Voting is used for classification, and averaging is used for regression
    • Voting: hard and soft voting
  • Hard voting

    Classifier 1 predicts class A
    Classifier 2 predicts class B
    Classifier 3 predicts class B
    2/3 classifiers predict class B, so class B is the ensemble decision.

  • Soft voting
    Predictions (identical to the earlier example, but now in terms of probabilities; shown only for class A here because the problem is binary):

    Classifier 1 predicts class A with probability 99%
    Classifier 2 predicts class A with probability 49%
    Classifier 3 predicts class A with probability 49%
    The average probability of belonging to class A across the classifiers is (99+49+49)/3 = 65.67% .
    Therefore, class A is the ensemble decision.
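
The two voting schemes above can be compared directly in code. The helper names `hard_vote` and `soft_vote` and the 0.5 decision threshold are assumptions for this sketch.

```python
from collections import Counter

def hard_vote(predictions):
    """Return the class predicted most often."""
    return Counter(predictions).most_common(1)[0][0]

def soft_vote(prob_a, threshold=0.5):
    """Binary case: average P(class A) across classifiers and compare with the threshold."""
    avg = sum(prob_a) / len(prob_a)
    return ("A" if avg > threshold else "B"), avg

hard = hard_vote(["A", "B", "B"])              # class labels from the three classifiers
soft, avg = soft_vote([0.99, 0.49, 0.49])      # P(class A) from the same classifiers
print(hard, soft, round(avg * 100, 2))         # B A 65.67
```

The same three classifiers give different ensemble decisions: one very confident vote for A outweighs two marginal votes for B under soft voting.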

  • Why does bagging work?

    • Bagging reduces variance by voting / averaging, thus reducing the overall expected error
      • In the case of classification there are pathological situations where the overall error might increase
      • Usually, the more classifiers the better
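
A compact sketch of the bagging procedure is shown below. The base learner is a hypothetical one-dimensional threshold "stump" written just for this example; any classifier could take its place, and the data set and `seed` are made up.

```python
import random
from collections import Counter

def train_stump(sample):
    """Hypothetical base learner: best single threshold on a 1-D feature.
    sample is a list of (x, label) pairs with labels 0 or 1."""
    best = None
    for thr, _ in sample:
        for left in (0, 1):                 # label predicted when x <= thr
            errors = sum((left if x <= thr else 1 - left) != y for x, y in sample)
            if best is None or errors < best[0]:
                best = (errors, thr, left)
    _, thr, left = best
    return lambda x: left if x <= thr else 1 - left

def bagging_fit(train, t, seed=0):
    rng = random.Random(seed)
    n = len(train)
    models = []
    for _ in range(t):
        sample = [rng.choice(train) for _ in range(n)]   # n draws with replacement
        models.append(train_stump(sample))               # store the classifier
    return models

def bagging_predict(models, x):
    votes = Counter(m(x) for m in models)
    return votes.most_common(1)[0][0]                    # class predicted most often

train = [(0.5, 0), (1.0, 0), (1.5, 0), (2.0, 0), (3.0, 1), (3.5, 1), (4.0, 1), (4.5, 1)]
models = bagging_fit(train, t=25)
print(bagging_predict(models, 0.0), bagging_predict(models, 5.0))
```

Each stump sees a slightly different bootstrap sample, so their thresholds differ; voting averages out that variation.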

# Boosting

  • Also uses voting/averaging but models are weighted according to their performance

  • Iterative procedure: new models are influenced by the performance of previously built ones

    • New model is encouraged to become expert for instances classified incorrectly by earlier models
    • Assign more weights to the misclassified instances to improve the classification iteratively
  • There are several variants of this algorithm

  • AdaBoost.M1

    • classifier generation
    Assign equal weight to each training instance.
    For each of t iterations:
        Learn a classifier from weighted dataset.
        Compute error e of classifier on weighted dataset.
        If e equal to zero, or e greater or equal to 0.5: 
            Terminate classifier generation.
        For each instance in dataset:
            If instance classified correctly by classifier: 
                Multiply weight of instance by e / (1 - e)
        Normalize weight of all instances.
    • classification
    Assign weight of zero to all classes.
    For each of the t classifiers:
        Add -log(e / (1 - e)) to weight of class predicted by the classifier.
    Return class with highest weight.
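
The AdaBoost.M1 pseudocode above might translate to Python roughly as follows. The weak learner is a hypothetical weighted one-dimensional threshold stump, the data set is made up, and the sketch does not handle the case where the ensemble ends up empty (for instance when the very first classifier has zero error).

```python
import math

def train_weighted_stump(data, weights):
    """Hypothetical weak learner: 1-D threshold stump minimising the weighted error."""
    best = None
    for thr, _ in data:
        for left in (0, 1):                       # label predicted when x <= thr
            err = sum(w for (x, y), w in zip(data, weights)
                      if (left if x <= thr else 1 - left) != y)
            if best is None or err < best[0]:
                best = (err, thr, left)
    err, thr, left = best
    return (lambda x: left if x <= thr else 1 - left), err

def adaboost_m1(data, t):
    n = len(data)
    weights = [1.0 / n] * n                       # equal weight for each instance
    ensemble = []                                 # list of (classifier, error)
    for _ in range(t):
        clf, e = train_weighted_stump(data, weights)
        if e == 0 or e >= 0.5:
            break                                 # terminate classifier generation
        ensemble.append((clf, e))
        # shrink the weight of correctly classified instances
        weights = [w * e / (1 - e) if clf(x) == y else w
                   for (x, y), w in zip(data, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]    # normalise
    return ensemble

def adaboost_predict(ensemble, x):
    class_weight = {}
    for clf, e in ensemble:
        c = clf(x)
        class_weight[c] = class_weight.get(c, 0.0) - math.log(e / (1 - e))
    return max(class_weight, key=class_weight.get)

# Class 1 sits in the middle, so no single stump can separate the classes.
data = [(1, 0), (2, 0), (3, 1), (4, 1), (5, 0), (6, 0)]
ensemble = adaboost_m1(data, t=3)
print([adaboost_predict(ensemble, x) for x, _ in data])   # [0, 0, 1, 1, 0, 0]
```

On this data every individual stump misclassifies at least two points, yet the weighted vote of three stumps classifies all six correctly.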

# Random Forest

  • Random forest is a bagging method which uses decision trees as the classifiers

  • The workflow in random forest is the same as that in bagging, except that each new tree node considers only a random subset of the variables

  • In bagging, we can use any classifier. In random forest, we use decision trees

  • Classifier generation

    Let n be the size of the training set.
    For each of t iterations:
        (1) Sample n instances with replacement from the training set
        (2) Learn a decision tree s.t. the variable for any new node is the best variable among m randomly selected variables.
        (3) Store the resulting decision tree.
  • Classification

    For each of the t decision trees:
        Predict class of instance.
    Return class that was predicted most often.
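
Step (2) is what distinguishes random forest from plain bagging. The sketch below illustrates only that step: choosing the best split among `m` randomly selected variables. The use of Gini impurity as the split criterion, binary 0/1 labels, and all names are assumptions; the notes do not say how "best" is measured.

```python
import random

def gini(labels):
    """Gini impurity for binary 0/1 labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return 2 * p * (1 - p)

def best_split_among_random(rows, labels, m, rng):
    """Pick m variables at random and return the (variable, threshold) with the lowest weighted Gini."""
    candidates = rng.sample(range(len(rows[0])), m)
    best = None
    for f in candidates:
        for thr in sorted(set(r[f] for r in rows)):
            left = [y for r, y in zip(rows, labels) if r[f] <= thr]
            right = [y for r, y in zip(rows, labels) if r[f] > thr]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
            if best is None or score < best[0]:
                best = (score, f, thr)
    return best[1], best[2], candidates

rows = [[5, 0], [3, 0], [4, 1], [6, 1]]     # two variables per instance
labels = [0, 0, 1, 1]                        # variable 1 separates the classes
f, thr, candidates = best_split_among_random(rows, labels, m=2, rng=random.Random(0))
print(f, thr)
```

With `m` smaller than the number of variables, different nodes and trees see different candidates, which decorrelates the trees in the forest.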

# Semi-Supervised Classification

  • Classification requires labeled data

  • Data labeling is a complicated and expensive process. We are not guaranteed to have enough high-quality labels

  • Labels may be hard to get

    • Human labeling is slow and boring
    • It may require expert knowledge
    • It may require special or expensive devices
  • Goal:
    Using both labeled and unlabeled data to build better classifiers (than using labeled data alone).

  • Notation:

    • input $x$, label $y$
    • classifier $f: \mathcal{X} \mapsto \mathcal{Y}$
    • labeled data $\left(X_{l}, Y_{l}\right)=\left\{\left(x_{1}, y_{1}\right), \ldots,\left(x_{l}, y_{l}\right)\right\}$
    • unlabeled data $X_{u}=\left\{x_{l+1}, \ldots, x_{n}\right\}$
    • usually $n \gg l$

# Solutions: Self-training

  • Algorithm: Self-training
    1. Pick your favorite classification method. Train a classifier $f$ from $\left(X_{l}, Y_{l}\right)$.
    2. Use $f$ to classify all unlabeled items $x \in X_{u}$.
    3. Pick $x^{*}$ with the highest confidence, add $\left(x^{*}, f\left(x^{*}\right)\right)$ to labeled data.
    4. Repeat.

The simplest semi-supervised learning method.

  • Pros

    • Simple
    • Applies to almost all existing classifiers
  • Cons

    • Mistakes reinforce themselves. Heuristics against this pitfall:
      • 'Un-label' a training point if its classification confidence drops below a threshold
      • Randomly perturb learning parameters
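
The self-training loop can be sketched with a deliberately simple base classifier. Here the "favorite classification method" is a hypothetical nearest-centroid rule on a one-dimensional feature, confidence is the distance gap between the two closest centroids, and the loop repeats until no unlabeled items remain; all of these are assumptions, and at least two classes must be present in the labeled data.

```python
def fit_centroids(labeled):
    """Hypothetical base classifier: per-class mean of a 1-D feature."""
    sums, counts = {}, {}
    for x, y in labeled:
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}

def predict_with_confidence(centroids, x):
    """Return (label, confidence); confidence is the gap between the two closest centroids."""
    ranked = sorted(centroids, key=lambda y: abs(x - centroids[y]))
    best, second = ranked[0], ranked[1]
    return best, abs(x - centroids[second]) - abs(x - centroids[best])

def self_train(labeled, unlabeled):
    labeled, unlabeled = list(labeled), list(unlabeled)
    while unlabeled:                                            # step 4: repeat
        centroids = fit_centroids(labeled)                      # step 1: train f
        scored = [(predict_with_confidence(centroids, x), x)    # step 2: classify X_u
                  for x in unlabeled]
        (label, _), x_star = max(scored, key=lambda s: s[0][1]) # step 3: most confident
        labeled.append((x_star, label))
        unlabeled.remove(x_star)
    return fit_centroids(labeled), labeled

labeled = [(1.0, "a"), (9.0, "b")]
unlabeled = [2.0, 3.0, 4.0, 6.0, 7.0, 8.0]
centroids, all_labeled = self_train(labeled, unlabeled)
print(centroids)
```

Because each pseudo-label is fed back as if it were true, one early mistake would shift a centroid and bias later predictions, which is the self-reinforcement problem listed under Cons.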

# Solutions: Co-training

  • Your data can be split into different views

  • A view can be defined by a different set of features

  • Each item is represented by two kinds of features $x=\left[x^{(1)} ; x^{(2)}\right]$

    • $x^{(1)}$ = image features
    • $x^{(2)}$ = web page text
    • This is a natural feature split (or multiple views)
  • Co-training idea:

    • Train an image classifier and a text classifier
    • The two classifiers teach each other
  • Algorithm: Co-training

    1. Train two classifiers: $f^{(1)}$ from $\left(X_{l}^{(1)}, Y_{l}\right)$, $f^{(2)}$ from $\left(X_{l}^{(2)}, Y_{l}\right)$
    2. Classify $X_{u}$ with $f^{(1)}$ and $f^{(2)}$ separately.
    3. Add $f^{(1)}$'s $k$-most-confident $\left(x, f^{(1)}(x)\right)$ to $f^{(2)}$'s labeled data.
    4. Add $f^{(2)}$'s $k$-most-confident $\left(x, f^{(2)}(x)\right)$ to $f^{(1)}$'s labeled data.
    5. Repeat.
  • Pros

    • Simple. Applies to almost all existing classifiers
    • Less sensitive to mistakes
  • Cons

    • Feature split may not exist
    • Models using BOTH features should do better
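
A rough sketch of co-training with two one-dimensional views follows. The nearest-centroid classifier, the distance-gap confidence, the `rounds` limit, and the toy data are all assumptions. One simplification to note: within a round, the second classifier picks from the items left after the first classifier's picks have been removed from the pool.

```python
def fit(labeled, view):
    """Hypothetical per-view classifier: class centroids of one feature (view 0 or 1)."""
    sums, counts = {}, {}
    for item, y in labeled:
        sums[y] = sums.get(y, 0.0) + item[view]
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}

def predict(centroids, item, view):
    """Return (label, confidence) using only the given view of the item."""
    ranked = sorted(centroids, key=lambda y: abs(item[view] - centroids[y]))
    gap = abs(item[view] - centroids[ranked[1]]) - abs(item[view] - centroids[ranked[0]])
    return ranked[0], gap

def co_train(labeled, unlabeled, k=1, rounds=10):
    l1, l2, pool = list(labeled), list(labeled), list(unlabeled)
    for _ in range(rounds):
        if not pool:
            break
        f1, f2 = fit(l1, 0), fit(l2, 1)                      # step 1: one classifier per view
        # steps 3 and 4: each classifier teaches the other with its k most confident items
        for f, view, target in ((f1, 0, l2), (f2, 1, l1)):
            ranked = sorted(pool, key=lambda it: predict(f, it, view)[1], reverse=True)
            for item in ranked[:k]:
                target.append((item, predict(f, item, view)[0]))
                pool.remove(item)
    return fit(l1, 0), fit(l2, 1)

# Each item is (view-1 feature, view-2 feature); values are made up.
labeled = [((1.0, 10.0), "a"), ((9.0, 90.0), "b")]
unlabeled = [(2.0, 20.0), (8.0, 80.0), (3.0, 30.0), (7.0, 70.0)]
f1, f2 = co_train(labeled, unlabeled)
print(predict(f1, (2.5, 25.0), 0)[0], predict(f2, (7.5, 75.0), 1)[0])
```

Each classifier only ever receives labels produced from the other view, which is why a mistake in one view is less likely to reinforce itself than in self-training.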

# Multi-Label Classifications

  • Binary classification: Is this a picture of the sea?

    $\in\{\text{yes}, \text{no}\}$

  • Multi-class classification: What is this a picture of?

    $\in\{\text{sea}, \text{sunset}, \text{trees}, \text{people}, \text{mountain}, \text{urban}\}$

  • Multi-label classification: Which labels are relevant to this picture?

    $\subseteq\{\text{sea}, \text{sunset}, \text{trees}, \text{people}, \text{mountain}, \text{urban}\}$

    i.e., multiple labels per instance instead of a single label!

# Applications

  • Images are labelled to indicate

    • multiple concepts
    • multiple objects
    • multiple people
      e.g., Scene data with concept labels

    $\subseteq\{\text{beach}, \text{sunset}, \text{foliage}, \text{field}, \text{mountain}, \text{urban}\}$

  • Labelling music/tracks with genres / voices, concepts, etc.

    • e.g., Music dataset, audio tracks labelled with different moods, among:
      • amazed-surprised,
      • happy-pleased,
      • relaxing-calm,
      • quiet-still,
      • sad-lonely,
      • angry-aggressive

# Example

  • Difference in data sets

  • Table: Single-label $Y \in \{0,1\}$.

  • Table: Multi-label $Y \subseteq\left\{\lambda_{1}, \ldots, \lambda_{L}\right\}$

  • We usually convert labels to binary labels

# Solutions

# Transformation Based Methods

Transform the task to binary/multi-class classifications

# Binary Relevance

  • If there are $N$ labels, we have $N$ binary classification problems, one per label

  • Drawback: it ignores the dependence between labels
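
The binary relevance transformation itself is short. In this sketch the function names are illustrative, label sets are represented as Python sets, and the per-label classifiers passed to `br_predict` are stand-in rules rather than trained models.

```python
def binary_relevance_datasets(X, Y, labels):
    """Y[i] is the label set of X[i]; returns one binary data set per label."""
    return {lab: [(x, 1 if lab in y else 0) for x, y in zip(X, Y)] for lab in labels}

def br_predict(classifiers, x):
    """classifiers maps each label to a binary classifier; keep labels predicted as 1."""
    return {lab for lab, clf in classifiers.items() if clf(x) == 1}

X = [[0.1], [0.8], [0.5]]
Y = [{"sea"}, {"sea", "sunset"}, set()]
datasets = binary_relevance_datasets(X, Y, ["sea", "sunset"])
print(datasets["sunset"])   # [([0.1], 0), ([0.8], 1), ([0.5], 0)]

# Stand-in per-label classifiers, just to show how predictions are combined.
classifiers = {"sea": lambda x: 1, "sunset": lambda x: 1 if x[0] > 0.7 else 0}
print(br_predict(classifiers, [0.9]))
```

Each binary data set is built without looking at the other labels, which is exactly the independence assumption named as the drawback.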

# Classifier Chains

  • Classifier Chains build the model as a chain, taking label correlations into consideration
  • The original features are used to perform binary classification on the 1st label; the prediction for the 1st label is then added to the features in the 2nd step to predict the 2nd label
  • Repeat the process above until all of the labels are predicted

  • In short: previous prediction results are used as new features

  • Drawbacks in Classifier Chains

    • It is difficult to define the order of labels in the chain, though there are some methods (e.g., info gain)
    • If the previous predictions are incorrect, the following predictions may be wrong as well.
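
A classifier chain can be sketched around any binary learner. Below, the base learner is a hypothetical 1-nearest-neighbour rule, labels are given as 0/1 lists in chain order, and the chain is trained on the true values of earlier labels (a common choice, but an assumption here).

```python
def train_1nn(rows, targets):
    """Hypothetical base learner: 1-nearest-neighbour on squared Euclidean distance."""
    def clf(row):
        dists = [sum((a - b) ** 2 for a, b in zip(r, row)) for r in rows]
        return targets[dists.index(min(dists))]
    return clf

def train_chain(X, Y, train_binary=train_1nn):
    """Y[i] is a list of 0/1 values, one per label, in chain order."""
    chain = []
    for j in range(len(Y[0])):
        # features for label j = original features + true values of labels 0..j-1
        rows = [list(x) + list(y[:j]) for x, y in zip(X, Y)]
        chain.append(train_binary(rows, [y[j] for y in Y]))
    return chain

def predict_chain(chain, x):
    preds = []
    for clf in chain:
        preds.append(clf(list(x) + preds))   # earlier predictions become features
    return preds

X = [[0.0], [1.0], [2.0]]
Y = [[0, 0], [1, 0], [1, 1]]
chain = train_chain(X, Y)
print(predict_chain(chain, [1.9]), predict_chain(chain, [0.1]))   # [1, 1] [0, 0]
```

At prediction time the chain feeds its own earlier outputs forward, so an early error changes the features seen by every later classifier.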

# Label Powerset

  • Each subset of the label set will be a single label

  • Assign binary classification or multi-class classification to them

  • Find a way to aggregate the results

    1. Transform the dataset into a multi-class problem, taking up to $2^{L}$ possible values
    2. ...and train any off-the-shelf multi-class classifier
  • Drawbacks in Label Powerset

    • Too many subsets if there are several labels
    • Class imbalance is highly likely
    • Overfitting: how to predict new values/labels?
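
The transformation step of Label Powerset amounts to giving each distinct label subset its own class id. The function name and the toy label sets below are illustrative.

```python
def label_powerset_transform(Y):
    """Map each distinct label subset in Y to one multi-class id."""
    class_of = {}
    y_multiclass = []
    for labels in Y:
        key = frozenset(labels)
        if key not in class_of:
            class_of[key] = len(class_of)     # new subset -> new class id
        y_multiclass.append(class_of[key])
    subset_of = {c: set(k) for k, c in class_of.items()}   # to decode predictions
    return y_multiclass, subset_of

Y = [{"sea", "sunset"}, {"urban"}, {"sunset", "sea"}, set()]
y_multiclass, subset_of = label_powerset_transform(Y)
print(y_multiclass)      # [0, 1, 0, 2]
print(subset_of[0])
```

Only subsets seen in training get a class id, so a multi-class classifier trained on `y_multiclass` can never output an unseen label combination.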

# Adaptation Based Methods

Develop new algorithms to solve the problem

# Algorithm adaptation techniques

  • MLkNN. For each test instance:
    • Retrieve its top-k nearest neighbors
    • Compute the frequency of occurrence of each label
    • Assign a probability to each label and select the labels by using a probability cut-off value
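
The three steps listed above can be followed literally in a few lines. This is a simplified neighbour-frequency version for illustration, not the full MLkNN algorithm; the function name, the Euclidean distance, and the default `cutoff` of 0.5 are assumptions.

```python
def knn_multilabel_predict(X, Y, query, k=3, cutoff=0.5):
    """X: feature vectors, Y: label sets. Returns (predicted label set, per-label probability)."""
    order = sorted(range(len(X)),
                   key=lambda i: sum((a - b) ** 2 for a, b in zip(X[i], query)))
    neighbours = order[:k]                                    # top-k nearest neighbours
    all_labels = set().union(*Y)
    probs = {lab: sum(lab in Y[i] for i in neighbours) / k    # label frequency among neighbours
             for lab in all_labels}
    return {lab for lab, p in probs.items() if p >= cutoff}, probs

X = [[0, 0], [0, 1], [1, 0], [5, 5]]
Y = [{"sea"}, {"sea", "sunset"}, {"sea"}, {"urban"}]
predicted, probs = knn_multilabel_predict(X, Y, [0, 0])
print(predicted)   # {'sea'}
```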

# Evaluation of multilabel learning


  • Both transformation and adaptation methods are approaches to solving the multi-label classification (MLC) problem
  • They are not classification algorithms themselves
  • For each method, you can use any traditional binary/multi-class classification algorithm to produce the predictions
  • There are multiple labels in the MLC problem
  • Traditional evaluation metrics in the classification may not work for MLC
  • We need to develop new evaluation metrics

# Hamming Loss

Consider the misclassification in each bit

$$\text{Hamming loss}=\frac{1}{N L} \sum_{i=1}^{N} \sum_{j=1}^{L} \mathbb{I}\left[\hat{y}_{j}^{(i)} \neq y_{j}^{(i)}\right]=\frac{4}{4 \times 5}=0.20$$

N = # of data rows (5 in this example)
L = # of labels (4 in this example)

# 0/1 Loss

Consider the misclassification in the whole label set

$$0/1 \text{ loss}=\frac{1}{N} \sum_{i=1}^{N} \mathbb{I}\left(\hat{\mathbf{y}}^{(i)} \neq \mathbf{y}^{(i)}\right)=\frac{3}{5}=0.60$$
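
Both metrics are easy to compute from binary label matrices. The matrices below are hypothetical, chosen only so that 4 of the 20 label bits are wrong and 3 of the 5 rows contain at least one error, matching the figures above.

```python
def hamming_loss(Y_true, Y_pred):
    """Fraction of individual label bits that are wrong."""
    n, l = len(Y_true), len(Y_true[0])
    wrong = sum(t != p for yt, yp in zip(Y_true, Y_pred) for t, p in zip(yt, yp))
    return wrong / (n * l)

def zero_one_loss(Y_true, Y_pred):
    """Fraction of rows whose whole label vector is not predicted exactly."""
    return sum(yt != yp for yt, yp in zip(Y_true, Y_pred)) / len(Y_true)

# Hypothetical data: N = 5 rows, L = 4 labels.
Y_true = [[1, 0, 0, 1], [0, 0, 1, 1], [1, 0, 0, 0], [0, 1, 1, 0], [1, 0, 0, 1]]
Y_pred = [[1, 0, 0, 1], [0, 1, 1, 1], [1, 0, 0, 0], [0, 0, 1, 1], [1, 0, 1, 1]]
print(hamming_loss(Y_true, Y_pred))    # 0.2
print(zero_one_loss(Y_true, Y_pred))   # 0.6
```

The 0/1 loss is much stricter: a row with a single wrong bit counts as fully wrong, while Hamming loss gives partial credit for the bits that are right.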

# Other Metrics

  • Jaccard index: often called multi-label accuracy
  • Ranking loss: average fraction of label pairs not correctly ordered
  • One-error: whether the top-ranked label is not in the set of true labels
  • Coverage: average "depth" in the ranking needed to cover all true labels
  • Log loss: i.e., cross entropy
  • Precision: predicted positive labels that are relevant
  • Recall: relevant labels which were predicted
  • Precision, recall, and F-measure can be averaged in several ways:
    • micro-averaged ('global' view)
    • macro-averaged by label (ordinary averaging of a binary measure, changes in infrequent labels have a big impact)
    • macro-averaged by example (one example at a time, average across examples)

# Tools

  • Mulan
    • Java based
    • Reuses the Weka library
    • No UI
    • http://mulan.sourceforge.net/
  • Meka
    • Similar to Weka
    • Java based
    • With UI
    • http://meka.sourceforge.net/

# Classification: Summary

  • We learned different algorithms

    • No learning process: KNN and Naïve Bayes
    • Learning based: Logistic regression, Decision tree, SVM, Neural Networks
    • Ensemble methods: bagging, boosting, RandomForest
  • For each algorithm

    • Understand how it works
    • Know the requirements on the data; know how to prepare a preprocessed data set
    • Know which parameters need to be tuned
    • Know the remedies for overfitting
    • Which algorithm is the best?
      • It varies from data to data
      • We need to tune the parameters of each model
      • We need to compare different classification models
    • General issue: imbalance in labels
