Book

Prologue to 2nd Edition xvii
Prologue to 1st Edition xix

CHAPTER 1 Introduction 1
1.1 IF DATA HAD MASS, THE EARTH WOULD BE A BLACK HOLE 1
1.2 LEARNING 4
1.2.1 Machine Learning 4
1.3 TYPES OF MACHINE LEARNING 5
1.4 SUPERVISED LEARNING 6
1.4.1 Regression 6
1.4.2 Classification 8
1.5 THE MACHINE LEARNING PROCESS 10
1.6 A NOTE ON PROGRAMMING 11
1.7 A ROADMAP TO THE BOOK 12

CHAPTER 2 Preliminaries 15
2.1 SOME TERMINOLOGY 15
2.1.1 Weight Space 16
2.1.2 The Curse of Dimensionality 17
2.2 KNOWING WHAT YOU KNOW: TESTING MACHINE LEARNING ALGORITHMS 19
2.2.1 Overfitting 19
2.2.2 Training, Testing, and Validation Sets 20
2.2.3 The Confusion Matrix 21
2.2.4 Accuracy Metrics 22
2.2.5 The Receiver Operator Characteristic (ROC) Curve 24
2.2.6 Unbalanced Datasets 25
2.2.7 Measurement Precision 25
2.3 TURNING DATA INTO PROBABILITIES 27
2.3.1 Minimising Risk 30
SOME BASIC STATISTICS 32
2.4.1 Averages 32
2.4.2 Variance and Covariance 32
2.4.3 The Gaussian 34
PRACTICE QUESTIONS 37

CHAPTER 3 Neurons, Neural Networks, and Linear Discriminants 39
3.1 THE BRAIN AND THE NEURON 39
3.1.1 Hebb’s Rule 40
3.1.2 McCulloch and Pitts Neurons 40
3.1.3 Limitations of the McCulloch and Pitts Neuronal Model 42
3.2 NEURAL NETWORKS 43
3.3 THE PERCEPTRON 43
3.3.1 The Learning Rate 46
3.3.2 The Bias Input 46
3.3.3 The Perceptron Learning Algorithm 47
3.3.4 An Example of Perceptron Learning: Logic Functions 48
3.3.5 Implementation 49
3.4 LINEAR SEPARABILITY 55
3.4.1 The Perceptron Convergence Theorem 57
3.4.2 The Exclusive Or (XOR) Function 58
3.4.3 A Useful Insight 59
3.4.4 Another Example: The Pima Indian Dataset 61
3.4.5 Preprocessing: Data Preparation 63
3.5 LINEAR REGRESSION 64
3.5.1 Linear Regression Examples 66
PRACTICE QUESTIONS 68

CHAPTER 4 The Multi-layer Perceptron 71
4.1 GOING FORWARDS 73
4.1.1 Biases 73
4.2 GOING BACKWARDS: BACK-PROPAGATION OF ERROR 74
4.2.1 The Multi-layer Perceptron Algorithm 77
4.2.2 Initialising the Weights 80
4.2.3 Different Output Activation Functions 81
4.2.4 Sequential and Batch Training 82
4.2.5 Local Minima 82
4.2.6 Picking Up Momentum 84
4.2.7 Minibatches and Stochastic Gradient Descent 85
4.2.8 Other Improvements 85
4.3 THE MULTI-LAYER PERCEPTRON IN PRACTICE 85
4.3.1 Amount of Training Data 86
4.3.2 Number of Hidden Layers 86
4.3.3 When to Stop Learning 88
4.4 EXAMPLES OF USING THE MLP 89
4.4.1 A Regression Problem 89
4.4.2 Classification with the MLP 92
4.4.3 A Classification Example: The Iris Dataset 93
4.4.4 Time-Series Prediction 95
4.4.5 Data Compression: The Auto-Associative Network 97
4.5 A RECIPE FOR USING THE MLP 100
4.6 DERIVING BACK-PROPAGATION 101
4.6.1 The Network Output and the Error 101
4.6.2 The Error of the Network 102
4.6.3 Requirements of an Activation Function 103
4.6.4 Back-Propagation of Error 104
4.6.5 The Output Activation Functions 107
4.6.6 An Alternative Error Function 108
PRACTICE QUESTIONS 109

CHAPTER 5 Radial Basis Functions and Splines 111
5.1 RECEPTIVE FIELDS 111
5.2 THE RADIAL BASIS FUNCTION (RBF) NETWORK 114
5.2.1 Training the RBF Network 117
5.3 INTERPOLATION AND BASIS FUNCTIONS 119
5.3.1 Bases and Basis Expansion 122
5.3.2 The Cubic Spline 123
5.3.3 Fitting the Spline to the Data 123
5.3.4 Smoothing Splines 124
5.3.5 Higher Dimensions 125
5.3.6 Beyond the Bounds 127
PRACTICE QUESTIONS 128

CHAPTER 6 Dimensionality Reduction 129
6.1 LINEAR DISCRIMINANT ANALYSIS (LDA) 130
6.2 PRINCIPAL COMPONENTS ANALYSIS (PCA) 133
6.2.1 Relation with the Multi-layer Perceptron 137
6.2.2 Kernel PCA 138
6.3 FACTOR ANALYSIS 141
6.4 INDEPENDENT COMPONENTS ANALYSIS (ICA) 142
6.5 LOCALLY LINEAR EMBEDDING 144
6.6 ISOMAP 147
6.6.1 Multi-Dimensional Scaling (MDS) 147
PRACTICE QUESTIONS 151

CHAPTER 7 Probabilistic Learning 153
7.1 GAUSSIAN MIXTURE MODELS 153
7.1.1 The Expectation-Maximisation (EM) Algorithm 154
7.1.2 Information Criteria 158
7.2 NEAREST NEIGHBOUR METHODS 158
7.2.1 Nearest Neighbour Smoothing 160
7.2.2 Efficient Distance Computations: the KD-Tree 160
7.2.3 Distance Measures 165
PRACTICE QUESTIONS 168

CHAPTER 8 Support Vector Machines 169
8.1 OPTIMAL SEPARATION 170
8.1.1 The Margin and Support Vectors 170
8.1.2 A Constrained Optimisation Problem 172
8.1.3 Slack Variables for Non-Linearly Separable Problems 175
8.2 KERNELS 176
8.2.1 Choosing Kernels 178
8.2.2 Example: XOR 179
8.3 THE SUPPORT VECTOR MACHINE ALGORITHM 179
8.3.1 Implementation 180
8.3.2 Examples 183
8.4 EXTENSIONS TO THE SVM 184
8.4.1 Multi-Class Classification 184
8.4.2 SVM Regression 186
PRACTICE QUESTIONS 188

CHAPTER 9 Optimisation and Search 189
9.1 GOING DOWNHILL 190
9.1.1 Taylor Expansion 193
9.2 LEAST-SQUARES OPTIMISATION 194
9.2.1 The Levenberg–Marquardt Algorithm 194
9.3.2 Conjugate Gradients and the MLP 201
9.4 SEARCH: THREE BASIC APPROACHES 204
9.4.1 Exhaustive Search 204
9.4.2 Greedy Search 205
9.4.3 Hill Climbing 205
9.5 EXPLOITATION AND EXPLORATION 206
9.6 SIMULATED ANNEALING 207
9.6.1 Comparison 208
PRACTICE QUESTIONS 209

CHAPTER 10 Evolutionary Learning 211
10.1 THE GENETIC ALGORITHM (GA) 212
10.1.1 String Representation 213
10.1.2 Evaluating Fitness 213
10.1.3 Population 214
10.1.4 Generating Offspring: Parent Selection 214
10.2 GENERATING OFFSPRING: GENETIC OPERATORS 216
10.2.1 Crossover 216
10.2.2 Mutation 217
10.2.3 Elitism, Tournaments, and Niching 218
10.3 USING GENETIC ALGORITHMS 220
10.3.1 Map Colouring 220
10.3.2 Punctuated Equilibrium 221
10.3.3 Example: The Knapsack Problem 222
10.3.4 Example: The Four Peaks Problem 222
10.3.5 Limitations of the GA 224
10.3.6 Training Neural Networks with Genetic Algorithms 225
10.4 GENETIC PROGRAMMING 225
10.5 COMBINING SAMPLING WITH EVOLUTIONARY LEARNING 227
PRACTICE QUESTIONS 229

CHAPTER 11 Reinforcement Learning 231
11.1 OVERVIEW 232
11.2 EXAMPLE: GETTING LOST 233
11.2.1 State and Action Spaces 235
11.2.2 Carrots and Sticks: The Reward Function 236
11.2.3 Discounting 237
11.2.4 Action Selection 237
11.2.5 Policy 238
11.3 MARKOV DECISION PROCESSES 238
11.3.1 The Markov Property 238
11.3.2 Probabilities in Markov Decision Processes 239
11.4 VALUES 240
11.5 BACK ON HOLIDAY: USING REINFORCEMENT LEARNING 244
11.6 THE DIFFERENCE BETWEEN SARSA AND Q-LEARNING 245
11.7 USES OF REINFORCEMENT LEARNING 246
PRACTICE QUESTIONS 247

CHAPTER 12 Learning with Trees 249
12.1 USING DECISION TREES 249
12.2 CONSTRUCTING DECISION TREES 250
12.2.1 Quick Aside: Entropy in Information Theory 251
12.2.2 ID3 251
12.2.3 Implementing Trees and Graphs in Python 255
12.2.4 Implementation of the Decision Tree 255
12.2.5 Dealing with Continuous Variables 257
12.2.6 Computational Complexity 258
12.3 CLASSIFICATION AND REGRESSION TREES (CART) 260
12.3.1 Gini Impurity 260
12.3.2 Regression in Trees 261
12.4 CLASSIFICATION EXAMPLE 261
PRACTICE QUESTIONS 264

CHAPTER 13 Decision by Committee: Ensemble Learning 267
13.1 BOOSTING 268
13.1.2 Stumping 273
13.2 BAGGING 273
13.2.1 Subagging 274
13.3 RANDOM FORESTS 275
13.3.1 Comparison with Boosting 277
13.4 DIFFERENT WAYS TO COMBINE CLASSIFIERS 277
PRACTICE QUESTIONS 280

CHAPTER 14 Unsupervised Learning 281
14.1 THE K-MEANS ALGORITHM 282
14.1.1 Dealing with Noise 285
14.1.2 The k-Means Neural Network 285
14.1.3 Normalisation 287
14.1.4 A Better Weight Update Rule 288
14.1.5 Example: The Iris Dataset Again 289
14.1.6 Using Competitive Learning for Clustering 290
14.2 VECTOR QUANTISATION 291
14.3 THE SELF-ORGANISING FEATURE MAP 291
14.3.1 The SOM Algorithm 294
14.3.2 Neighbourhood Connections 295
14.3.3 Self-Organisation 297
14.3.4 Network Dimensionality and Boundary Conditions 298
14.3.5 Examples of Using the SOM 300
PRACTICE QUESTIONS 303

CHAPTER 15 Markov Chain Monte Carlo (MCMC) Methods 305
15.1 SAMPLING 305
15.1.1 Random Numbers 305
15.1.2 Gaussian Random Numbers 306
15.2 MONTE CARLO OR BUST 308
15.3 THE PROPOSAL DISTRIBUTION 310
15.4 MARKOV CHAIN MONTE CARLO 313
15.4.1 Markov Chains 313
15.4.2 The Metropolis–Hastings Algorithm 315
15.4.3 Simulated Annealing (Again) 316
15.4.4 Gibbs Sampling 318
PRACTICE QUESTIONS 320

CHAPTER 16 Graphical Models 321
16.1 BAYESIAN NETWORKS 322
16.1.1 Example: Exam Fear 323
16.1.2 Approximate Inference 327
16.1.3 Making Bayesian Networks 329
16.2 MARKOV RANDOM FIELDS 330
16.3 HIDDEN MARKOV MODELS (HMMS) 333
16.3.1 The Forward Algorithm 335
16.3.2 The Viterbi Algorithm 337
16.3.3 The Baum–Welch or Forward–Backward Algorithm 339
16.4 TRACKING METHODS 343
16.4.1 The Kalman Filter 343
16.4.2 The Particle Filter 350
PRACTICE QUESTIONS 356

CHAPTER 17 Symmetric Weights and Deep Belief Networks 359
17.1 ENERGETIC LEARNING: THE HOPFIELD NETWORK 360
17.1.1 Associative Memory 360
17.1.2 Making an Associative Memory 361
17.1.3 An Energy Function 365
17.1.4 Capacity of the Hopfield Network 367
17.1.5 The Continuous Hopfield Network 368
17.2 STOCHASTIC NEURONS — THE BOLTZMANN MACHINE 369
17.2.1 The Restricted Boltzmann Machine 371
17.2.2 Deriving the CD Algorithm 375
17.2.3 Supervised Learning 380
17.2.4 The RBM as a Directed Belief Network 381
17.3 DEEP LEARNING 385
17.3.1 Deep Belief Networks (DBN) 388
PRACTICE QUESTIONS 393

CHAPTER 18 Gaussian Processes 395
18.1 GAUSSIAN PROCESS REGRESSION 397
18.1.2 Implementation 402
18.1.3 Learning the Parameters 403
18.1.4 Implementation 404
18.1.5 Choosing a (set of) Covariance Functions 406
18.2 GAUSSIAN PROCESS CLASSIFICATION 407
18.2.1 The Laplace Approximation 408
18.2.2 Computing the Posterior 408
18.2.3 Implementation 410
PRACTICE QUESTIONS 413

APPENDIX A Python 415
A.1 INSTALLING PYTHON AND OTHER PACKAGES 415
A.2 GETTING STARTED 415
A.2.1 Python for MATLAB® and R users 418
A.3 CODE BASICS 419
A.3.1 Writing and Importing Code 419
A.3.2 Control Flow 420
A.3.3 Functions 420
A.3.4 The doc String 421
A.3.5 map and lambda 421
A.3.6 Exceptions 422
A.3.7 Classes 422
A.4 USING NUMPY AND MATPLOTLIB 423
A.4.1 Arrays 423
A.4.2 Random Numbers 427
A.4.3 Linear Algebra 427
A.4.4 Plotting 427
A.4.5 One Thing to Be Aware of 429