## Preface ix

### PART I. DATA ANALYSIS FOUNDATIONS

#### 1 Data Mining and Analysis ....................... 3

1.1 Data Matrix 3
1.2 Attributes 4
1.3 Data: Algebraic and Geometric View 5
1.4 Data: Probabilistic View 16
1.6 Exercises 28

#### 2 Numeric Attributes ............................. 29

2.1 Univariate Analysis 29
2.2 Bivariate Analysis 40
2.3 Multivariate Analysis 46
2.4 Data Normalization 50
2.5 Normal Distribution 52
2.7 Exercises 58

#### 3 Categorical Attributes ......................... 61

3.1 Univariate Analysis 61
3.2 Bivariate Analysis 70
3.3 Multivariate Analysis 81
3.4 Distance and Angle 86
3.5 Discretization 87
3.7 Exercises 90

#### 4 Graph Data ..................................... 92

4.1 Graph Concepts 92
4.2 Topological Attributes 96
4.3 Centrality Analysis 101
4.4 Graph Models 111
4.6 Exercises 131

#### 5 Kernel Methods ................................. 134

5.1 Kernel Matrix 138
5.2 Vector Kernels 144
5.3 Basic Kernel Operations in Feature Space 149
5.4 Kernels for Complex Objects 155
5.6 Exercises 161

#### 6 High-dimensional Data .......................... 163

6.1 High-dimensional Objects 163
6.2 High-dimensional Volumes 167
6.3 Hypersphere Inscribed within Hypercube 170
6.4 Volume of Thin Hypersphere Shell 171
6.5 Diagonals in Hyperspace 172
6.6 Density of the Multivariate Normal 173
6.7 Appendix: Derivation of Hypersphere Volume 177
6.9 Exercises 181

#### 7 Dimensionality Reduction ....................... 184

7.1 Background 184
7.2 Principal Component Analysis 188
7.3 Kernel Principal Component Analysis 203
7.4 Singular Value Decomposition 210
7.6 Exercises 215

### PART II. FREQUENT PATTERN MINING

#### 8 Itemset Mining ................................. 219

8.1 Frequent Itemsets and Association Rules 219
8.2 Itemset Mining Algorithms 223
8.3 Generating Association Rules 237
8.5 Exercises 239

#### 9 Summarizing Itemsets ........................... 244

9.1 Maximal and Closed Frequent Itemsets 244
9.2 Mining Maximal Frequent Itemsets: GenMax Algorithm 247
9.3 Mining Closed Frequent Itemsets: Charm Algorithm 250
9.4 Nonderivable Itemsets 252
9.6 Exercises 258

#### 10 Sequence Mining ............................... 261

10.1 Frequent Sequences 261
10.2 Mining Frequent Sequences 262
10.3 Substring Mining via Suffix Trees 269
10.5 Exercises 279

#### 11 Graph Pattern Mining .......................... 282

11.1 Isomorphism and Support 282
11.2 Candidate Generation 286
11.3 The gSpan Algorithm 290
11.5 Exercises 299

#### 12 Pattern and Rule Assessment ................... 303

12.1 Rule and Pattern Assessment Measures 303
12.2 Significance Testing and Confidence Intervals 318
12.4 Exercises 330

### PART III. CLUSTERING

#### 13 Representative-based Clustering ............... 334

13.1 K-means Algorithm 334
13.2 Kernel K-means 339
13.3 Expectation-Maximization Clustering 343
13.5 Exercises 361

#### 14 Hierarchical Clustering ....................... 364

14.1 Preliminaries 364
14.2 Agglomerative Hierarchical Clustering 366
14.4 Exercises and Projects 373

#### 15 Density-based Clustering ...................... 375

15.1 The DBSCAN Algorithm 375
15.2 Kernel Density Estimation 379
15.3 Density-based Clustering: DENCLUE 385
15.5 Exercises 391

#### 16 Spectral and Graph Clustering ................. 394

16.1 Graphs and Matrices 394
16.2 Clustering as Graph Cuts 401
16.3 Markov Clustering 417
16.5 Exercises 424

#### 17 Clustering Validation ......................... 426

17.1 External Measures 426
17.2 Internal Measures 441
17.3 Relative Measures 450
17.5 Exercises 465

### PART IV. CLASSIFICATION

#### 18 Probabilistic Classification .................. 469

18.1 Bayes Classifier 469
18.2 Naive Bayes Classifier 475
18.3 K Nearest Neighbors Classifier 479
18.5 Exercises 482

#### 19 Decision Tree Classifier ...................... 483

19.1 Decision Trees 485
19.2 Decision Tree Algorithm 487
19.4 Exercises 499

#### 20 Linear Discriminant Analysis .................. 501

20.1 Optimal Linear Discriminant 501
20.2 Kernel Discriminant Analysis 508
20.4 Exercises 515

#### 21 Support Vector Machines ....................... 517

21.1 Support Vectors and Margins 517
21.2 SVM: Linear and Separable Case 523
21.3 Soft Margin SVM: Linear and Nonseparable Case 527
21.4 Kernel SVM: Nonlinear Case 533
21.5 SVM Training: Stochastic Gradient Ascent 537
21.7 Exercises 544

#### 22 Classification Assessment ..................... 546

22.1 Classification Performance Measures 546
22.2 Classifier Evaluation 560
22.3 Bias-Variance Decomposition 570
22.4 Ensemble Classifiers 574
22.6 Exercises 585

### PART V. REGRESSION

#### 23 Linear Regression ............................. 589

23.1 Linear Regression Model 589
23.2 Bivariate Regression 590
23.3 Multiple Regression 596
23.4 Ridge Regression 606
23.5 Kernel Regression 611
23.6 L1 Regression: Lasso 615

#### 24 Logistic Regression ........................... 623

24.1 Binary Logistic Regression 623
24.2 Multiclass Logistic Regression 630
24.4 Exercises 635

#### 25 Neural Networks ............................... 637

25.1 Artificial Neuron: Activation Functions 637
25.2 Neural Networks: Regression and Classification 642
25.3 Multilayer Perceptron: One Hidden Layer 648
25.4 Deep Multilayer Perceptrons 660
25.6 Exercises 670

#### 26 Deep Learning ................................. 672

26.1 Recurrent Neural Networks 672
26.2 Gated RNNs: Long Short-Term Memory Networks 682
26.3 Convolutional Neural Networks 694
26.4 Regularization 712