## Preface ix

### PART I. DATA ANALYSIS FOUNDATIONS

#### 1 Data Mining and Analysis ....................... 1

1.1 Data Matrix 1
1.2 Attributes 3
1.3 Data: Algebraic and Geometric View 4
1.4 Data: Probabilistic View 14
1.5 Data Mining 30
1.7 Exercises 30

#### 2 Numeric Attributes ............................. 33

2.1 Univariate Analysis 33
2.2 Bivariate Analysis 42
2.3 Multivariate Analysis 48
2.4 Data Normalization 52
2.5 Normal Distribution 54
2.7 Exercises 60

#### 3 Categorical Attributes ......................... 63

3.1 Univariate Analysis 63
3.2 Bivariate Analysis 72
3.3 Multivariate Analysis 82
3.4 Distance and Angle 87
3.5 Discretization 89
3.7 Exercises 91

#### 4 Graph Data ..................................... 93

4.1 Graph Concepts 93
4.2 Topological Attributes 97
4.3 Centrality Analysis 102
4.4 Graph Models 112
4.6 Exercises 132

#### 5 Kernel Methods ................................. 134

5.1 Kernel Matrix 138
5.2 Vector Kernels 144
5.3 Basic Kernel Operations in Feature Space 148
5.4 Kernels for Complex Objects 154
5.6 Exercises 161

#### 6 High-dimensional Data .......................... 163

6.1 High-dimensional Objects 163
6.2 High-dimensional Volumes 165
6.3 Hypersphere Inscribed within Hypercube 168
6.4 Volume of Thin Hypersphere Shell 169
6.5 Diagonals in Hyperspace 171
6.6 Density of the Multivariate Normal 172
6.7 Appendix: Derivation of Hypersphere Volume 175
6.9 Exercises 180

#### 7 Dimensionality Reduction ....................... 183

7.1 Background 183
7.2 Principal Component Analysis 187
7.3 Kernel Principal Component Analysis 202
7.4 Singular Value Decomposition 208
7.6 Exercises 214

### PART II. FREQUENT PATTERN MINING

#### 8 Itemset Mining ................................. 217

8.1 Frequent Itemsets and Association Rules 217
8.2 Itemset Mining Algorithms 221
8.3 Generating Association Rules 234
8.5 Exercises 237

#### 9 Summarizing Itemsets ........................... 242

9.1 Maximal and Closed Frequent Itemsets 242
9.2 Mining Maximal Frequent Itemsets: GenMax Algorithm 245
9.3 Mining Closed Frequent Itemsets: Charm Algorithm 248
9.4 Nonderivable Itemsets 250
9.6 Exercises 256

#### 10 Sequence Mining ............................... 259

10.1 Frequent Sequences 259
10.2 Mining Frequent Sequences 260
10.3 Substring Mining via Suffix Trees 267
10.5 Exercises 277

#### 11 Graph Pattern Mining .......................... 280

11.1 Isomorphism and Support 280
11.2 Candidate Generation 284
11.3 The gSpan Algorithm 288
11.5 Exercises 297

#### 12 Pattern and Rule Assessment ................... 301

12.1 Rule and Pattern Assessment Measures 301
12.2 Significance Testing and Confidence Intervals 316
12.4 Exercises 328

### PART III. CLUSTERING

#### 13 Representative-based Clustering ............... 333

13.1 K-means Algorithm 333
13.2 Kernel K-means 338
13.3 Expectation-Maximization Clustering 342
13.5 Exercises 361

#### 14 Hierarchical Clustering ....................... 364

14.1 Preliminaries 364
14.2 Agglomerative Hierarchical Clustering 366
14.4 Exercises and Projects 373

#### 15 Density-based Clustering ...................... 375

15.1 The DBSCAN Algorithm 375
15.2 Kernel Density Estimation 379
15.3 Density-based Clustering: DENCLUE 385
15.5 Exercises 391

#### 16 Spectral and Graph Clustering ................. 394

16.1 Graphs and Matrices 394
16.2 Clustering as Graph Cuts 401
16.3 Markov Clustering 416
16.5 Exercises 423

#### 17 Clustering Validation ......................... 425

17.1 External Measures 425
17.2 Internal Measures 440
17.3 Relative Measures 448
17.5 Exercises 462

### PART IV. CLASSIFICATION

#### 18 Probabilistic Classification .................. 467

18.1 Bayes Classifier 467
18.2 Naive Bayes Classifier 473
18.3 K Nearest Neighbors Classifier 477
18.5 Exercises 479

#### 19 Decision Tree Classifier ...................... 481

19.1 Decision Trees 483
19.2 Decision Tree Algorithm 485
19.4 Exercises 496

#### 20 Linear Discriminant Analysis .................. 498

20.1 Optimal Linear Discriminant 498
20.2 Kernel Discriminant Analysis 505
20.4 Exercises 512

#### 21 Support Vector Machines ....................... 514

21.1 Support Vectors and Margins 514
21.2 SVM: Linear and Separable Case 520
21.3 Soft Margin SVM: Linear and Nonseparable Case 524
21.4 Kernel SVM: Nonlinear Case 530
21.5 SVM Training: Stochastic Gradient Ascent 534