# Data and Data Types
Attribute (or dimension, feature, variable): a data field representing a characteristic or feature of a data object.
- E.g., customer_ID, name, address
Types
- Categorical
  - Nominal
  - Binary
  - Ordinal
- Numerical
  - Interval-scaled or Ratio-scaled
  - Discrete or continuous
Once the data is available, data preprocessing may be needed.
# Data Preprocessing
- Deal with missing values
  - What are the possible solutions?
  - For a numerical variable, estimate the missing values.
  - For a nominal variable, fill in the missing values.
- Reduce variance for a numerical variable: binning
- Correlation analysis: for different variables
- Data Normalization
- Data Transformation: numerical <--> nominal, e.g., discretization
- Feature Selection and Reduction
- Outlier Detection
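
A few of the preprocessing steps listed above can be sketched in plain Python. This is only an illustration under stated assumptions: the notes do not say how missing values should be filled, so mean imputation (numerical) and mode imputation (nominal) are assumed here, binning is assumed to be equal-width, and normalization is assumed to be min-max scaling. All function names and the toy data are hypothetical.

```python
from statistics import mean, mode

def impute(values, numerical=True):
    """Replace None with the mean (numerical) or the mode (nominal) of the observed values.
    Mean/mode imputation is an assumption of this sketch, not the only option."""
    observed = [v for v in values if v is not None]
    fill = mean(observed) if numerical else mode(observed)
    return [fill if v is None else v for v in values]

def equal_width_bins(values, n_bins):
    """Map each value to a bin index in 0..n_bins-1 using equal-width intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0  # fall back to 1.0 when all values are equal
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

def smooth_by_bin_means(values, n_bins):
    """Replace each value by the mean of its bin, which reduces the variance."""
    bins = equal_width_bins(values, n_bins)
    means = {b: mean(v for v, vb in zip(values, bins) if vb == b) for b in set(bins)}
    return [means[b] for b in bins]

def min_max_normalize(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0  # avoid division by zero for constant columns
    return [(v - lo) / span for v in values]

# Toy data (hypothetical)
ages = impute([20, None, 40, 60])                                   # numerical
cities = impute(["Paris", None, "Paris", "Rome"], numerical=False)  # nominal
age_bins = equal_width_bins(ages, n_bins=2)   # numerical -> discrete bin index
ages_smoothed = smooth_by_bin_means(ages, n_bins=2)
ages_scaled = min_max_normalize(ages)
```

The bin index returned by `equal_width_bins` is also a simple form of discretization, that is, a numerical-to-nominal transformation.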
# Supervised vs. Unsupervised Learning
# Supervised Learning
- Infer a (predictive) function from data associated with predefined targets/classes/labels
- Example: group objects by predefined labels
- Goal: learn a model from labelled data (with multiple features) for future predictions
- Outcomes: the outcomes are known; they are the predefined labels
- Evaluation: error/accuracy and other metrics
- Data Mining Task: Classification
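
To make the supervised setting concrete, here is a minimal sketch: a model is learned from labelled examples and evaluated by accuracy on examples whose labels are known. A 1-nearest-neighbour rule is used purely as an illustrative choice; the notes do not prescribe a particular classifier, and the data and function names are hypothetical.

```python
import math

def predict_1nn(train, query):
    """Return the label of the training point closest to `query` (Euclidean distance)."""
    _, label = min(train, key=lambda row: math.dist(row[0], query))
    return label

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the known labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Labelled data: (features, predefined label). Toy values, hypothetical.
train = [((1.0, 1.0), "small"), ((1.5, 2.0), "small"),
         ((8.0, 8.0), "large"), ((9.0, 7.5), "large")]
test = [((1.2, 1.1), "small"), ((8.5, 8.2), "large"), ((2.0, 1.0), "small")]

preds = [predict_1nn(train, features) for features, _ in test]
acc = accuracy([label for _, label in test], preds)
```

Because the test labels are known, the predictions can be scored directly, which is exactly what is missing in the unsupervised case.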
# Unsupervised Learning
- Discover or describe underlying structure from unlabelled data
- Example: group objects by multiple features
- Goal: learn the structure from unlabelled data (with multiple features)
- Outcomes: the outcomes are not known
- Evaluation: no clear performance measures or evaluation methods
- Data Mining Task: Clustering
# Classification
- Data Splits for Evaluations
- Binary, Multi-Class and Multi-Label classifications
- Class imbalance issues and solutions
- Classification Evaluations & Metrics
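
As a rough illustration of the topics above, the following sketch shows a simple hold-out split and a few common binary classification metrics. The split ratio, the seed, the function names, and the toy labels are all assumptions made for this example; the imbalanced case at the end shows why accuracy alone can be misleading.

```python
import random

def train_test_split(rows, test_ratio=0.25, seed=0):
    """Shuffle the rows with a fixed seed and hold out a fraction for testing."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_test = int(len(rows) * test_ratio)
    return rows[n_test:], rows[:n_test]

def binary_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall and F1 with respect to the `positive` class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = len(pairs) - tp - fp - fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": (tp + tn) / len(pairs),
            "precision": precision, "recall": recall, "f1": f1}

train_rows, test_rows = train_test_split(range(20))

# Imbalanced toy labels: 2 positives, 8 negatives (hypothetical).
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_majority = [0] * 10  # a "classifier" that always predicts the majority class
metrics = binary_metrics(y_true, y_majority)
```

Here the always-negative predictor reaches 80% accuracy while its recall on the positive class is zero, which is why precision, recall, and F1 are usually reported for imbalanced data.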
# Clustering Tasks/Approaches
- Partitional Clustering
  - Group objects so as to minimize intra-cluster distances and maximize inter-cluster distances, e.g., K-Means
- Hierarchical Clustering
  - A clustering process that discovers hierarchical structure, like a hierarchical tree
  - Example: categories and subcategories; taxonomies
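
Partitional clustering with K-Means can be sketched as follows. This is a minimal version of the standard assign-then-update loop, not a production implementation: it initialises the centroids with the first `k` points (an assumption made here for determinism; random or k-means++ initialisation is more common), and the function name and toy points are hypothetical.

```python
import math

def k_means(points, k, n_iter=100):
    """Minimal K-Means: alternate nearest-centroid assignment and centroid update."""
    centroids = list(points[:k])  # simplistic initialisation, an assumption of this sketch
    labels = []
    for _ in range(n_iter):
        # Assignment step: each point joins the cluster of its nearest centroid.
        labels = [min(range(k), key=lambda j: math.dist(p, centroids[j]))
                  for p in points]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centroids.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centroids.append(centroids[j])  # keep an empty cluster's centroid
        if new_centroids == centroids:  # converged: centroids no longer move
            break
        centroids = new_centroids
    return labels, centroids

# Unlabelled toy data with two visible groups (hypothetical).
points = [(1.0, 1.0), (8.0, 8.0), (1.2, 0.8), (8.2, 7.9), (0.9, 1.1), (7.9, 8.1)]
labels, centroids = k_means(points, k=2)
```

No labels are supplied; the grouping comes only from the distances between points, and the result can depend on the initial centroids.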