**Context:** The data set has many features, not all of which are (equally) relevant.

**Problem:** Because of the large number of features, the model may overfit. Some features may just be noise, while others only make a small contribution to the outcome.

**Solution:** Only (the most) relevant features are selected as input for the learning algorithm. This can be done by hand or by using an algorithm. Two such algorithms are minimum-redundancy-maximum-relevance (mRMR) feature selection and correlation-based feature selection.
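As a minimal sketch of correlation-based selection, the snippet below ranks features by the absolute value of their Pearson correlation with the target and keeps the two strongest. The data set and feature names are invented for illustration:

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical data set: three features and one target value per individual.
features = {
    "f1": [1, 2, 3, 4, 5],   # strongly related to the target
    "f2": [2, 1, 4, 3, 6],   # somewhat related
    "f3": [3, 1, 4, 1, 5],   # mostly noise
}
target = [2, 4, 6, 8, 10]

# Keep the two features with the highest absolute correlation.
selected = sorted(features,
                  key=lambda f: abs(pearson(features[f], target)),
                  reverse=True)[:2]
```

Correlation-based ranking treats each feature in isolation; mRMR additionally penalizes features that are redundant with ones already selected.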

**Related patterns:** regularization

**Context:** Only unprepared data about individuals is available.

**Problem:** Data about individuals is available, but this data cannot be used directly by learning algorithms.

**Solution:** Feature extraction is the process of deriving features from the available data. Examples of feature extraction are edge detection and motion detection for image processing, and Principal Component Analysis (PCA), which removes redundant data by detecting correlations between features.

**Related patterns:** -

**Context:** Features in the data set have very different ranges.

**Problem:** Some learning algorithms are sensitive to differences in scale across features.

**Solution:** Two well-known methods of feature scaling are normalization and standardization. Normalization scales all features to a value between 0 and 1. Standardization scales a feature in such a way that the new mean value is 0 and the standard deviation is 1. Using normalization one knows for sure that the new value is in the [0,1] range, but the data may be scaled to a very small interval within this range. Using standardization the data has a better distribution, but there may be outliers that result in a large range of possible values.
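Both methods can be sketched in a few lines of plain Python; the `ages` feature below is invented for illustration:

```python
from math import sqrt

def normalize(values):
    """Min-max normalization: rescale the values to the [0, 1] range."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """Rescale the values to mean 0 and (population) standard deviation 1."""
    n = len(values)
    mean = sum(values) / n
    std = sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / std for v in values]

ages = [20, 30, 40, 50, 60]   # hypothetical feature
norm = normalize(ages)        # every value now lies in [0, 1]
zscores = standardize(ages)   # mean 0, standard deviation 1
```

Note that standardized values are not bounded: an outlier far from the mean still maps to a large z-score.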

**Related patterns:** -

**Context:** Features in the data set have continuous ranges.

**Problem:** Sometimes it makes more sense to have groups of numerical values (e.g. people with age between 20 and 30) instead of individual values for training a learning algorithm. Learning algorithms may provide different results for categorical features in comparison to numerical features.

**Solution:** Discretization is the process of mapping features from a continuous to a discrete domain. Important choices in discretization are the number of categories one needs and/or the number of individuals that need to be in each category. Thresholding is a specific type of discretization used to make Boolean features.
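A small sketch of both ideas, using invented age groups and an invented adulthood threshold:

```python
def discretize(value, boundaries):
    """Map a continuous value to the index of its category.

    Boundaries [b0, b1, ...] define the categories
    (-inf, b0), [b0, b1), ..., [b_last, inf).
    """
    for i, b in enumerate(boundaries):
        if value < b:
            return i
    return len(boundaries)

# Hypothetical age groups: <20, 20-29, 30-39, >=40.
bins = [20, 30, 40]
ages = [15, 22, 35, 61]
groups = [discretize(a, bins) for a in ages]

# Thresholding: discretization with a single boundary, yielding a Boolean.
is_adult = [a >= 18 for a in ages]
```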

**Related patterns:** -

**Context:** Data is missing.

**Problem:** The data set contains individuals with missing data while the preferred learning algorithm requires complete data. Deleting all individuals with missing data would result in a data set that is too small or would lead to incorrect patterns if the data is not missing at random.

**Solution:** Imputation is the process of filling missing data with a certain value. In single imputation the missing data is replaced by a value such as the mean or mode. Multiple imputation consists of three steps: imputation multiple times with random variation, analysis of all imputation results, and combination of the results.
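Single mean imputation is a one-liner in spirit; the sketch below uses an invented feature with `None` marking the gaps:

```python
def impute_mean(values, missing=None):
    """Single imputation: replace missing entries with the mean of the rest."""
    known = [v for v in values if v is not missing]
    mean = sum(known) / len(known)
    return [mean if v is missing else v for v in values]

# Hypothetical height feature with two missing entries.
heights = [170, None, 180, None, 190]
filled = impute_mean(heights)   # gaps replaced by the mean of 170, 180, 190
```

Multiple imputation would repeat this with random variation (e.g. drawing from the observed distribution instead of always inserting the mean) and combine the analyses of the resulting data sets.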

**Related patterns:** -

**Context:** Supervised learning needs to be applied.

**Problem:** An individual belongs to a certain class. For a training data set the individuals are labeled (i.e. the class to which the individual belongs is known). Based on the characteristics of the individual (features) one wants to predict to which class the individual belongs. This type of problem is called a classification problem.

**Solution:** A neural network tries to simulate a human brain. It consists of multiple artificial neurons having input connectors (comparable to dendrites in a human neuron) and one output connector (comparable to the axon in a human neuron). The neurons are ordered in different layers: one input layer, one output layer, and one or multiple hidden layers. The neurons in the input layer contain the features, those in the output layer the possible output classes. The neurons in the hidden layers are the glue between the neurons in the input and output layers. Each neuron can be activated by signals from the preceding neurons. Whether the neuron gives a positive signal to the next neuron(s) depends on a threshold function and weights. Usually the threshold function is the sigmoid function. The weights are trained using the training data. Though neural networks are mainly used for classification, they can also be used for regression.
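The behavior of a single artificial neuron can be sketched directly: a weighted sum of the inputs passed through the sigmoid threshold function. The weights and inputs below are invented; in a real network they would be learned from the training data:

```python
from math import exp

def sigmoid(z):
    """The sigmoid threshold function, mapping any real value to (0, 1)."""
    return 1 / (1 + exp(-z))

def neuron(inputs, weights, bias):
    """One artificial neuron: weighted sum of inputs through the sigmoid."""
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return sigmoid(z)

# Hypothetical trained weights and a two-feature input.
out = neuron([1.0, 0.5], weights=[2.0, -1.0], bias=-0.5)
```

A full network chains such neurons: the outputs of one layer become the inputs of the next.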

**Related patterns:** Other patterns that can be used for classification are: decision trees, random forests, SVMs, logistic regression and Naive Bayes classifiers.

**Context:** Supervised learning needs to be applied.

**Problem:** An individual belongs to a certain class. For a training data set the individuals are labeled (i.e. the class to which the individual belongs is known). Based on the characteristics of the individual (features) one wants to predict to which class the individual belongs. This type of problem is called a classification problem.

**Solution:** A decision tree aims at minimizing entropy. Entropy is a measure of chaos; something that is very ordered has a very low entropy, while something that is very messy has a very high entropy. A decision tree is a structure that resembles a flow chart. Every node in the tree represents a decision that needs to be taken for determining the class. Based on the training data, the most relevant features and their values are selected for lowering the entropy of the data set. In contrast to many other approaches to classification, decision trees are easy for humans to interpret. Though decision trees are mainly used for classification, they can also be used for regression.
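The entropy reduction that drives the choice of a split can be computed directly. The toy labels below are invented; a perfect split drops the entropy from 1 bit to 0:

```python
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    probs = [labels.count(c) / n for c in set(labels)]
    return -sum(p * log2(p) for p in probs)

# Hypothetical labeled set; a candidate split separates it perfectly.
parent = ["yes", "yes", "no", "no"]           # maximally mixed: 1.0 bit
left, right = ["yes", "yes"], ["no", "no"]    # each side is pure: 0.0 bits

# Information gain: parent entropy minus the weighted child entropies.
gain = entropy(parent) - (len(left) / 4 * entropy(left)
                          + len(right) / 4 * entropy(right))
```

The tree-building algorithm would pick, at each node, the feature and value whose split yields the largest gain.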

**Related patterns:** Other patterns that can be used for classification are: neural networks, random forests, SVMs, logistic regression and Naive Bayes classifiers.

**Context:** Supervised learning needs to be applied.

**Problem:** An individual belongs to a certain class. For a training data set the individuals are labeled (i.e. the class to which the individual belongs is known). Based on the characteristics of the individual (features) one wants to predict to which class the individual belongs. This type of problem is called a classification problem.

**Solution:** A random forest is a model ensemble. An ensemble combines multiple models to achieve better results than a single model would. A random forest consists of multiple decision trees. Each tree in the forest has a different random subset of the features (subspace sampling) and the trees are fed with different subsets of the training data (bagging). Though a random forest gives more accurate results than a single decision tree, it is harder to read and takes more computational time to generate.

**Related patterns:** Other patterns that can be used for classification are: neural networks, decision trees, SVMs, logistic regression and Naive Bayes classifiers. An optimization pattern that is applied in creating random forests is bagging.

**Context:** Supervised learning needs to be applied.

**Problem:** An individual belongs to a certain class. For a training data set the individuals are labeled (i.e. the class to which the individual belongs is known). Based on the characteristics of the individual (features) one wants to predict to which class the individual belongs. This type of problem is called a classification problem.

**Solution:** A Support Vector Machine (SVM) maps each individual of the training data to a point in space. This space has two, three or even more dimensions, depending on the number of features of the training data. It constructs a separator for dividing the points in space into different subsets. This separator, called a hyperplane, is chosen so that the margin between it and the nearest points of each class is maximal. Using a so-called kernel trick, SVMs can also be used for constructing non-linear models. Though SVMs are mainly used for classification, they can also be used for regression.
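Once trained, a linear SVM is fully described by a weight vector `w` and a bias `b`: the hyperplane is the set of points where `w . x + b = 0`, and classification only checks which side a point falls on. The values of `w` and `b` below are invented for illustration:

```python
from math import sqrt

# Hypothetical trained linear SVM in two dimensions.
w, b = [1.0, -1.0], 0.0

def classify(point):
    """Predict a class from the side of the hyperplane the point falls on."""
    score = sum(wi * xi for wi, xi in zip(w, point)) + b
    return 1 if score >= 0 else -1

def distance_to_hyperplane(point):
    """Geometric distance from a point to the separating hyperplane."""
    score = sum(wi * xi for wi, xi in zip(w, point)) + b
    return abs(score) / sqrt(sum(wi * wi for wi in w))
```

Training amounts to choosing `w` and `b` so that the smallest such distance over the training points (the margin) is as large as possible.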

**Related patterns:** Other patterns that can be used for classification are: neural networks, decision trees, random forests, logistic regression and Naive Bayes classifiers.

**Context:** Supervised learning needs to be applied.

**Problem:** An individual belongs to a certain class. For a training data set the individuals are labeled (i.e. the class to which the individual belongs is known). Based on the characteristics of the individual (features) one wants to predict to which class the individual belongs. This type of problem is called a classification problem.

**Solution:** The name logistic regression might be a bit confusing. It seems to imply a solution to a regression problem while in fact it is used for classification problems. Logistic regression applies the sigmoid function to a weighted sum of the different features. The weights are determined using the training data.
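A minimal sketch, with an invented one-feature training set: the model outputs the sigmoid of a weighted feature sum as a class probability, and the weights are fitted here by plain gradient descent (one of several ways to determine them):

```python
from math import exp

def predict(x, weights, bias):
    """Probability of class 1: sigmoid of the weighted feature sum."""
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 / (1 + exp(-z))

# Toy training set: class 1 iff the single feature exceeds roughly 2.5.
data = [([1.0], 0), ([2.0], 0), ([3.0], 1), ([4.0], 1)]

w, b = [0.0], 0.0
for _ in range(2000):                 # simple stochastic gradient descent
    for x, y in data:
        p = predict(x, w, b)
        w = [wi + 0.1 * (y - p) * xi for wi, xi in zip(w, x)]
        b += 0.1 * (y - p)
```

After training, individuals on either side of the learned boundary get probabilities near 0 and 1, and a threshold of 0.5 turns the probability into a class label.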

**Related patterns:** Other patterns that can be used for classification are: neural networks, decision trees, random forests, SVMs, and Naive Bayes classifiers.

**Context:** Supervised learning needs to be applied.

**Problem:** An individual belongs to a certain class. For a training data set the individuals are labeled (i.e. the class to which the individual belongs is known). Based on the characteristics of the individual (features) one wants to predict to which class the individual belongs. This type of problem is called a classification problem.

**Solution:** A Naive Bayes classifier is a statistical approach to classification. It is based on Bayes’ theorem and assumes that the features are independent. Although features in real-world data sets often do have some correlation, Naive Bayes nevertheless performs quite well in practice. While other approaches may perform better in general, Bayesian classifiers have as a main advantage that they only require a small training set.
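The independence assumption makes the classifier a product of simple counts: the (unnormalized) posterior of a class is its prior times one per-feature likelihood per feature. The toy weather data below is invented for illustration:

```python
# Hypothetical training rows: (outlook, windy, class).
train = [
    ("sunny", "no", "yes"), ("sunny", "yes", "no"),
    ("rainy", "no", "yes"), ("rainy", "yes", "no"),
    ("sunny", "no", "yes"),
]

def likelihood(cond, cls):
    """P(feature value | class), estimated by counting training rows."""
    rows = [r for r in train if r[2] == cls]
    return sum(1 for r in rows if cond(r)) / len(rows)

def posterior(outlook, windy, cls):
    """Unnormalized P(class | features) under the independence assumption."""
    prior = sum(1 for r in train if r[2] == cls) / len(train)
    return (prior
            * likelihood(lambda r: r[0] == outlook, cls)
            * likelihood(lambda r: r[1] == windy, cls))

scores = {c: posterior("sunny", "no", c) for c in ("yes", "no")}
best = max(scores, key=scores.get)   # the predicted class
```

In practice one adds smoothing (e.g. Laplace smoothing) so that an unseen feature value does not force a whole posterior to zero, as happens for the "no" class here.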

**Related patterns:** Other patterns that can be used for classification are: neural networks, decision trees, random forests, SVMs, and logistic regression.

**Context:** Unsupervised learning needs to be applied.

**Problem:** One wants to find sets of related individuals. This type of problem is called a clustering problem.

**Solution:** Hierarchical clustering produces a tree structure of clusters. Each level in the tree represents a division of the individuals into clusters. The lower one moves down the tree, the more clusters are used to divide the individuals. The main advantage of this approach is that one does not need to determine the number of clusters in advance. The tree structure produced by hierarchical clustering is called a dendrogram. This dendrogram is created by determining the similarity between individuals based on their features.
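The agglomerative (bottom-up) variant can be sketched as: start with every individual in its own cluster, then repeatedly merge the two closest clusters. The snippet below uses single linkage (cluster distance = closest pair of members) on an invented one-dimensional feature:

```python
def single_link_distance(a, b):
    """Distance between two clusters: the closest pair of members."""
    return min(abs(x - y) for x in a for y in b)

def agglomerate(points, n_clusters):
    """Merge the two closest clusters until n_clusters remain."""
    clusters = [[p] for p in points]   # start: every point is its own cluster
    while len(clusters) > n_clusters:
        i, j = min(
            ((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
            key=lambda ij: single_link_distance(clusters[ij[0]],
                                                clusters[ij[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Hypothetical 1-D feature values with two obvious groups.
result = agglomerate([1.0, 1.2, 1.1, 8.0, 8.3], n_clusters=2)
```

Recording the order of the merges (instead of stopping at a fixed count) yields the dendrogram, from which any number of clusters can be read off afterwards.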

**Related patterns:** Another pattern that can be used for clustering is K-means clustering.

**Context:** Unsupervised learning needs to be applied.

**Problem:** One wants to find sets of related individuals. This type of problem is called a clustering problem.

**Solution:** K-means clustering is an iterative approach for dividing a data set into K distinct clusters. One needs to determine in advance how many clusters are desired. K-means is a heuristic algorithm and the resulting cluster division depends on the initial cluster centers. Therefore one can try multiple initializations and choose the best result. K-means clustering performs better than hierarchical clustering in terms of computational time.
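The iteration alternates two steps: assign each point to its nearest center, then move each center to the mean of its assigned points. A sketch on invented one-dimensional data (with hand-picked initial centers, since the result depends on them):

```python
def kmeans(points, centers, iterations=10):
    """Lloyd's algorithm: assign points to nearest center, then recenter."""
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Move each center to the mean of its cluster (assumes no cluster
        # ends up empty, which holds for this toy initialization).
        centers = [sum(c) / len(c) for c in clusters]
    return centers, clusters

# Hypothetical 1-D data with K = 2 and invented initial centers.
centers, clusters = kmeans([1.0, 2.0, 3.0, 10.0, 11.0, 12.0],
                           centers=[0.0, 5.0])
```

In higher dimensions the absolute difference is replaced by Euclidean distance, but the two alternating steps are the same.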

**Related patterns:** Another pattern that can be used for clustering is hierarchical clustering.

**Context:** Supervised learning needs to be applied.

**Problem:** An individual has a certain numerical feature that one wants to predict. For a training data set the individuals are labeled (i.e. the value of interest for the individual is known). Based on the characteristics of the individual (features) one wants to predict this value. This type of problem is called a regression problem.

**Solution:** Linear regression assumes a linear relation between the features and the output value. The training data is used to determine the weights of the different features. Linear regression is very simple, but real-world problems often don’t fit this linearity assumption.
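For a single feature the weights have a closed form (ordinary least squares). The data below is invented and lies exactly on a line, so the fit recovers it perfectly:

```python
def fit_line(xs, ys):
    """Ordinary least squares for one feature: y is approximated by w*x + b."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    w = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return w, my - w * mx

# Hypothetical training data lying exactly on the line y = 2x + 1.
w, b = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
```

With multiple features the same idea generalizes to solving the normal equations for a weight vector.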

**Related patterns:** Another pattern that can be used for regression is non-linear regression.

**Context:** Supervised learning needs to be applied.

**Problem:** An individual has a certain numerical feature that one wants to predict. For a training data set the individuals are labeled (i.e. the value of interest for the individual is known). Based on the characteristics of the individual (features) one wants to predict this value. This type of problem is called a regression problem.

**Solution:** Non-linear regression is an approach to regression that does not assume a linear connection between the features and the output value.

**Related patterns:** Another pattern that can be used for regression is linear regression.

**Context:** Optimization in supervised learning is desired.

**Problem:** Applying a single learning algorithm results in underfitting or overfitting.

**Solution:** In boosting, weak learners are applied iteratively to the data set. While in bagging each individual model is trained separately, in boosting the results of the previously trained model are passed to the next model. Individuals that are misclassified by a model get additional weight in training the next model.
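One concrete form of this reweighting is the AdaBoost update, sketched below on an invented outcome (five equally weighted individuals, of which two are misclassified by the current weak learner):

```python
from math import exp, log

# Hypothetical outcome of one weak learner on five individuals.
weights = [0.2] * 5
misclassified = [False, False, True, True, False]

# Weighted error of the learner and its resulting vote weight (alpha).
error = sum(w for w, m in zip(weights, misclassified) if m)
alpha = 0.5 * log((1 - error) / error)

# Misclassified individuals get heavier, correct ones lighter; then
# renormalize so the weights sum to 1 for the next learner.
weights = [w * exp(alpha if m else -alpha)
           for w, m in zip(weights, misclassified)]
total = sum(weights)
weights = [w / total for w in weights]
```

After the update, the misclassified individuals together carry half of the total weight, forcing the next weak learner to focus on them.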

**Related patterns:** Bagging is another ensemble approach.

**Context:** Optimization in supervised learning is desired.

**Problem:** A classifier or regression estimator is unstable (very sensitive to noise in the data) and therefore overfits.

**Solution:** Bagging (bootstrap aggregating) is a method for creating multiple models (a model ensemble) by drawing different random samples of the original data set. Each of the models is trained with one of those samples. A bootstrap sample is a random sample drawn with replacement. The final result of the different models is obtained by voting in classification problems or by averaging in regression problems. Bagging works well for unstable classifiers, e.g. decision trees and neural networks. A classifier is unstable if small changes in the training data lead to significantly different models.
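The two mechanical ingredients, bootstrap sampling and vote combination, are easy to sketch; the data, seed, and votes below are invented for illustration:

```python
import random

def bootstrap_sample(data, rng):
    """A random sample of the same size as data, drawn with replacement."""
    return [rng.choice(data) for _ in data]

def majority_vote(predictions):
    """Combine classifier outputs by taking the most common prediction."""
    return max(set(predictions), key=predictions.count)

rng = random.Random(0)        # fixed seed so the sketch is reproducible
data = list(range(10))        # stand-in for the training individuals

# One bootstrap sample per model in the ensemble.
samples = [bootstrap_sample(data, rng) for _ in range(3)]

# Each (hypothetical) model votes on a test individual; votes are combined.
vote = majority_vote(["cat", "dog", "cat"])
```

Because each sample is drawn with replacement, some individuals appear several times in a sample and others not at all, which is what decorrelates the models.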

**Related patterns:** Random forests are model ensembles based on bagging. Boosting is another ensemble approach.

**Context:** The data set has many features, not all of them are (equally) relevant.

**Problem:** A lot of features are present that all contribute to some extent to the prediction. Overfitting may occur, especially if there is little data available.

**Solution:** Regularization means that the weights of the features are reduced in magnitude. This results in a simpler prediction function, which is less sensitive to overfitting.
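The shrinkage effect is easiest to see in the one-feature ridge (L2) case, where the penalty strength `lam` simply enlarges the denominator of the least-squares weight. The data below is invented and zero-centered so the closed form applies:

```python
def ridge_weight(xs, ys, lam):
    """Ridge regression weight for a single zero-centered feature.

    lam = 0 gives the ordinary least-squares weight; larger lam
    shrinks the weight toward zero.
    """
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)

xs, ys = [-2, -1, 1, 2], [-4, -2, 2, 4]    # hypothetical data on y = 2x
w_plain = ridge_weight(xs, ys, lam=0.0)    # unregularized weight
w_ridge = ridge_weight(xs, ys, lam=10.0)   # same data, shrunken weight
```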

**Related patterns:** feature selection

**Context:** Supervised learning needs to be applied.

**Problem:** The holdout method may not work for determining how good a model is for two reasons. First, not enough data may be available for setting aside a portion of the data set for testing purposes (approximately 30% of the data cannot be used to train the model). Second, the split of the data may be unfortunate since the split is only made once. We do not know the variance of the performance for different data sets.

**Solution:** Instead of only one split, multiple splits of the data are made. Well-known types of splits are random subsampling, leave-one-out, and K-fold. In random subsampling, several combinations of random selections of training data and test data are made. Leave-one-out means that the data is split into combinations of n-1 training individuals and 1 test individual. The total number of experiments is n. K-fold means that the data is split into k partitions, so the training data size is (k-1)*n/k and the test data size is n/k. All individuals are used exactly once as test data. The total number of experiments is k. So in cross validation we get multiple combinations of a training data set and a test data set. For each of these combinations a model is trained and tested. The average accuracy and variance can then be calculated.
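The K-fold splits can be generated mechanically from the index range; this sketch assumes n is divisible by k for simplicity:

```python
def k_fold_splits(n, k):
    """Yield (train_indices, test_indices) for each of the k folds.

    Assumes n is divisible by k; each individual appears in exactly
    one test set.
    """
    fold = n // k
    indices = list(range(n))
    for i in range(k):
        test = indices[i * fold:(i + 1) * fold]
        train = indices[:i * fold] + indices[(i + 1) * fold:]
        yield train, test

splits = list(k_fold_splits(n=6, k=3))   # 3 experiments, test size n/k = 2
```

Leave-one-out is the special case k = n, and random subsampling would instead shuffle and resplit the indices each round.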

**Related patterns:** Holdout method

**Context:** Supervised learning needs to be applied.

**Problem:** In a supervised learning problem one wants to determine how good the trained model is.

**Solution:** Before the model is trained, the data is split into a training data set and a test data set. The training data set is used to train the model. The test data set is kept apart and used to determine how good the model is.

**Related patterns:** Cross validation

**Context:** The performance of a model needs to be known.

**Problem:** In binary classification problems one wants to make a trade-off between sensitivity (avoiding false negatives) and specificity (avoiding false positives).

**Solution:** A Receiver Operating Characteristic (ROC) curve is a graph that shows the performance of a binary classifier as its discrimination threshold is varied. The curve has the true positive rate on one axis and the false positive rate on the other axis. Each point on the ROC curve corresponds to a specific confusion matrix. The best possible classifier would result in a point in the upper left corner (0 false positive rate, 1.0 true positive rate). Because of this, the closer the ROC curve is to the upper left corner, the higher the overall accuracy of the model. The larger the Area Under the Curve (AUC), the better the classifier performs.
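Each ROC point is just a (false positive rate, true positive rate) pair computed at one threshold; sweeping the threshold traces the curve. The classifier scores and labels below are invented, and since they separate the classes perfectly the curve passes through the upper left corner:

```python
def roc_point(scores, labels, threshold):
    """(false positive rate, true positive rate) at one decision threshold."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    pos, neg = labels.count(1), labels.count(0)
    return fp / neg, tp / pos

# Hypothetical classifier scores for four individuals (1 = positive class).
scores = [0.9, 0.8, 0.4, 0.2]
labels = [1, 1, 0, 0]

# Sweep the threshold from strict to lenient to trace the curve.
curve = [roc_point(scores, labels, t) for t in (1.0, 0.8, 0.4, 0.0)]
```

The AUC is the area under these points; for this perfectly separating classifier it is 1.0.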

**Related patterns:** confusion matrix

**Context:** The performance of a model needs to be known.

**Problem:** In classification problems one wants to see the results of the classification of the test set.

**Solution:** A confusion matrix is a matrix with the predicted classes in the columns and the actual classes in the rows. The confusion matrix of a perfect classifier would only have numbers larger than zero on its diagonal. Sometimes a table of confusion is also called a confusion matrix. A table of confusion has two rows and two columns and shows the number of false positives, false negatives, true positives, and true negatives.
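For the binary (table of confusion) case, the four cells are simple counts over the test set; the actual and predicted labels below are invented:

```python
def table_of_confusion(actual, predicted, positive="yes"):
    """Counts of true/false positives/negatives for a binary classifier."""
    pairs = list(zip(actual, predicted))
    return {
        "tp": sum(1 for a, p in pairs if a == positive and p == positive),
        "tn": sum(1 for a, p in pairs if a != positive and p != positive),
        "fp": sum(1 for a, p in pairs if a != positive and p == positive),
        "fn": sum(1 for a, p in pairs if a == positive and p != positive),
    }

# Hypothetical test-set results.
actual    = ["yes", "yes", "no", "no", "yes"]
predicted = ["yes", "no",  "no", "yes", "yes"]
counts = table_of_confusion(actual, predicted)
```

For a perfect classifier only `tp` and `tn` would be non-zero, matching the all-on-the-diagonal description of the multi-class confusion matrix.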

**Related patterns:** ROC curve