Table of Contents

Preface ix

PART I. DATA ANALYSIS FOUNDATIONS

1 Data Mining and Analysis ....................... 3

1.1 Data Matrix 3
1.2 Attributes 4
1.3 Data: Algebraic and Geometric View 5
1.4 Data: Probabilistic View 16
1.5 Further Reading 28
1.6 Exercises 28

2 Numeric Attributes ............................. 29

2.1 Univariate Analysis 29
2.2 Bivariate Analysis 40
2.3 Multivariate Analysis 46
2.4 Data Normalization 50
2.5 Normal Distribution 52
2.6 Further Reading 58
2.7 Exercises 58

3 Categorical Attributes ......................... 61

3.1 Univariate Analysis 61
3.2 Bivariate Analysis 70
3.3 Multivariate Analysis 81
3.4 Distance and Angle 86
3.5 Discretization 87
3.6 Further Reading 89
3.7 Exercises 90

4 Graph Data ..................................... 92

4.1 Graph Concepts 92
4.2 Topological Attributes 96
4.3 Centrality Analysis 101
4.4 Graph Models 111
4.5 Further Reading 131
4.6 Exercises 131

5 Kernel Methods ................................. 134

5.1 Kernel Matrix 138
5.2 Vector Kernels 144
5.3 Basic Kernel Operations in Feature Space 149
5.4 Kernels for Complex Objects 155
5.5 Further Reading 161
5.6 Exercises 161

6 High-dimensional Data .......................... 163

6.1 High-dimensional Objects 163
6.2 High-dimensional Volumes 167
6.3 Hypersphere Inscribed within Hypercube 170
6.4 Volume of Thin Hypersphere Shell 171
6.5 Diagonals in Hyperspace 172
6.6 Density of the Multivariate Normal 173
6.7 Appendix: Derivation of Hypersphere Volume 177
6.8 Further Reading 181
6.9 Exercises 181

7 Dimensionality Reduction ....................... 184

7.1 Background 184
7.2 Principal Component Analysis 188
7.3 Kernel Principal Component Analysis 203
7.4 Singular Value Decomposition 210
7.5 Further Reading 215
7.6 Exercises 215

PART II. FREQUENT PATTERN MINING

8 Itemset Mining ................................. 219

8.1 Frequent Itemsets and Association Rules 219
8.2 Itemset Mining Algorithms 223
8.3 Generating Association Rules 237
8.4 Further Reading 238
8.5 Exercises 239

9 Summarizing Itemsets ........................... 244

9.1 Maximal and Closed Frequent Itemsets 244
9.2 Mining Maximal Frequent Itemsets: GenMax Algorithm 247
9.3 Mining Closed Frequent Itemsets: Charm Algorithm 250
9.4 Nonderivable Itemsets 252
9.5 Further Reading 258
9.6 Exercises 258

10 Sequence Mining ............................... 261

10.1 Frequent Sequences 261
10.2 Mining Frequent Sequences 262
10.3 Substring Mining via Suffix Trees 269
10.4 Further Reading 279
10.5 Exercises 279

11 Graph Pattern Mining .......................... 282

11.1 Isomorphism and Support 282
11.2 Candidate Generation 286
11.3 The gSpan Algorithm 290
11.4 Further Reading 298
11.5 Exercises 299

12 Pattern and Rule Assessment ................... 303

12.1 Rule and Pattern Assessment Measures 303
12.2 Significance Testing and Confidence Intervals 318
12.3 Further Reading 330
12.4 Exercises 330

PART III. CLUSTERING

13 Representative-based Clustering ............... 334

13.1 K-means Algorithm 334
13.2 Kernel K-means 339
13.3 Expectation-Maximization Clustering 343
13.4 Further Reading 360
13.5 Exercises 361

14 Hierarchical Clustering ....................... 364

14.1 Preliminaries 364
14.2 Agglomerative Hierarchical Clustering 366
14.3 Further Reading 372
14.4 Exercises and Projects 373

15 Density-based Clustering ...................... 375

15.1 The DBSCAN Algorithm 375
15.2 Kernel Density Estimation 379
15.3 Density-based Clustering: DENCLUE 385
15.4 Further Reading 390
15.5 Exercises 391

16 Spectral and Graph Clustering ................. 394

16.1 Graphs and Matrices 394
16.2 Clustering as Graph Cuts 401
16.3 Markov Clustering 417
16.4 Further Reading 422
16.5 Exercises 424

17 Clustering Validation ......................... 426

17.1 External Measures 426
17.2 Internal Measures 441
17.3 Relative Measures 450
17.4 Further Reading 464
17.5 Exercises 465

PART IV. CLASSIFICATION

18 Probabilistic Classification .................. 469

18.1 Bayes Classifier 469
18.2 Naive Bayes Classifier 475
18.3 K Nearest Neighbors Classifier 479
18.4 Further Reading 480
18.5 Exercises 482

19 Decision Tree Classifier ...................... 483

19.1 Decision Trees 485
19.2 Decision Tree Algorithm 487
19.3 Further Reading 498
19.4 Exercises 499

20 Linear Discriminant Analysis .................. 501

20.1 Optimal Linear Discriminant 501
20.2 Kernel Discriminant Analysis 508
20.3 Further Reading 515
20.4 Exercises 515

21 Support Vector Machines ....................... 517

21.1 Support Vectors and Margins 517
21.2 SVM: Linear and Separable Case 523
21.3 Soft Margin SVM: Linear and Nonseparable Case 527
21.4 Kernel SVM: Nonlinear Case 533
21.5 SVM Training: Stochastic Gradient Ascent 537
21.6 Further Reading 543
21.7 Exercises 544

22 Classification Assessment ..................... 546

22.1 Classification Performance Measures 546
22.2 Classifier Evaluation 560
22.3 Bias-Variance Decomposition 570
22.4 Ensemble Classifiers 574
22.5 Further Reading 584
22.6 Exercises 585

PART V. REGRESSION

23 Linear Regression ............................. 589

23.1 Linear Regression Model 589
23.2 Bivariate Regression 590
23.3 Multiple Regression 596
23.4 Ridge Regression 606
23.5 Kernel Regression 611
23.6 L1 Regression: Lasso 615
23.7 Further Reading 621

24 Logistic Regression ........................... 623

24.1 Binary Logistic Regression 623
24.2 Multiclass Logistic Regression 630
24.3 Further Reading 635
24.4 Exercises 635

25 Neural Networks ............................... 637

25.1 Artificial Neuron: Activation Functions 637
25.2 Neural Networks: Regression and Classification 642
25.3 Multilayer Perceptron: One Hidden Layer 648
25.4 Deep Multilayer Perceptrons 660
25.5 Further Reading 670
25.6 Exercises 670

26 Deep Learning ................................. 672

26.1 Recurrent Neural Networks 672
26.2 Gated RNNS: Long Short-Term Memory Networks 682
26.3 Convolutional Neural Networks 694
26.4 Regularization 712
26.5 Further Reading 717
26.6 Exercises 718

27 Regression Evaluation ......................... 720

27.1 Univariate Regression 721
27.2 Multiple Regression 735
27.3 Further Reading 752
27.4 Exercises 752

Index 755