Machine Learning Data
The concepts of datasets, samples, and labels in Machine Learning.
Sources:
- Mu Li et al. "1. Introduction." Dive into Deep Learning.
- scikit-learn User Guide. "Cross-validation: evaluating estimator performance."
Requirements
Python: 3.12
OS: Ubuntu 22.04, x86_64
Requirements:
```
scikit-learn==1.3.2
```
Data
In order to work with data usefully, we typically need to come up with a suitable numerical representation.
Let \(D = \{ x_1, x_2, \dots, x_m \}\) denote a dataset of \(m\) data points. Each data point (also called an example, data instance, or sample) \(x_i\) is described by a set of attributes (or features). If each data point has \(d\) features, it can be represented as a \(d\)-dimensional vector \[ x_i = (x_{i1}, x_{i2}, \dots, x_{id}), \]
where \(x_{ij}\) is the value of the \(j\)-th feature of data point \(x_i\), and \(d\) is called the dimensionality of \(x_i\).
In supervised learning, each sample has a special attribute called the label, and the goal is to predict the value of the label.
When the context is clear, we usually use "sample" to mean a data point together with its label.
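As a concrete illustration, the dataset can be viewed as an \(m \times d\) matrix of feature values together with a vector of \(m\) labels. Below is a minimal sketch using the Iris dataset from scikit-learn (the same dataset used in the examples later in this post); the printed shapes are only there to make the notation tangible.

```python
from sklearn import datasets

# Load the Iris dataset: m = 150 samples, each with d = 4 features and one label.
X, y = datasets.load_iris(return_X_y=True)

print(X.shape)  # (150, 4): row i is the feature vector x_i = (x_i1, ..., x_i4).
print(X[0])     # Feature values of the first data point.
print(y.shape)  # (150,): the label of each sample.
```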
The splitting of the dataset: Doing well on the training data does not guarantee that we will do well on unseen data, so we typically want to evaluate the model on data held out from training. Learning the parameters of a prediction function and testing it on the same data is a methodological mistake: a model that just repeats the labels of the samples it has already seen would have a perfect score but would fail to predict anything useful on yet-unseen data. This situation is called overfitting. To avoid it, it is common practice to split the dataset into two partitions:
- the training dataset (or training set), for learning model parameters.
- the test dataset (or test set), which is held out for evaluation.
The quality of data: We need the right data, which includes completeness of the features, completeness of the samples, ethical considerations, and so on.
Imagine applying a skin cancer recognition system in the wild that had never seen black skin before: because such patients are not represented in the data (i.e., the samples are incomplete), the system may misdiagnose them.
As another example, an ML model may inadvertently capture historical injustices and automate them. Suppose a hiring system learns from a company's historical data, and the company has traditionally discriminated against female employees; the system is likely to learn a preference for hiring men.
Partition of a dataset
```python
from sklearn import datasets, svm
from sklearn.model_selection import train_test_split, cross_val_score
```
Load the Iris dataset:
```python
# Load iris data.
X, y = datasets.load_iris(return_X_y=True)
```
We can partition the dataset, using 60% of it as the training set and 40% as the test set:
```python
ratio_of_testing_set = 0.4
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=ratio_of_testing_set, random_state=0)
```
Then we train a model, say an SVM, on the training set and evaluate it on the test set:
```python
clf = svm.SVC(kernel='linear', C=1).fit(X_train, y_train)
print(clf.score(X_test, y_test))
```
Drawbacks
However, this method has two drawbacks:
- By partitioning the available data into training and test sets, we drastically reduce the number of samples that can be used for learning the model.
- The results can depend on a particular random choice for the pair of (train, test) sets. For instance, how do we know which part of the dataset should be chosen as the test set: [0, 0.4], [0.3, 0.7], or [0.6, 1.0]? The decision depends on the random seed `random_state`, as sketched below.
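As a minimal sketch of this second drawback (reusing the imports, `X`, and `y` from the blocks above), re-running the split with different seeds generally yields different test scores:

```python
# The test score depends on which samples happen to land in the test set,
# i.e. on the seed passed to train_test_split as random_state.
for seed in [0, 1, 2]:
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.4, random_state=seed)
    score = svm.SVC(kernel='linear', C=1).fit(X_tr, y_tr).score(X_te, y_te)
    print(f"random_state={seed}: test score = {score:.3f}")
```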
Cross Validation
A solution to this problem is a procedure called cross-validation (CV for short).
The basic approach, called \(k\)-fold CV, is as follows:
1. Split the dataset into \(k\) smaller sets (folds).
2. Choose one fold as the test set.
3. Merge the remaining \(k-1\) folds as the training set.
4. Use this partition to train and evaluate the model, producing one score.
5. Repeat steps 2-4 \(k\) times, choosing a different fold as the test set in each iteration.
6. The final score is the mean of the scores over all iterations.
A separate held-out test set can still be used for the final evaluation of the chosen model.
```python
# Example of 5-fold cross validation.
scores = cross_val_score(clf, X, y, cv=5)
```
This approach can be computationally expensive, but it does not waste too much data.
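To make the per-fold steps above explicit, here is a minimal sketch that performs the loop manually with `KFold` (assuming the Iris data and imports from the earlier blocks). `cross_val_score` runs a similar loop internally; for classifiers it uses stratified folds by default.

```python
import numpy as np
from sklearn.model_selection import KFold

scores = []
kf = KFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, test_idx in kf.split(X):
    # Train on the k-1 merged folds, evaluate on the held-out fold.
    fold_clf = svm.SVC(kernel='linear', C=1).fit(X[train_idx], y[train_idx])
    scores.append(fold_clf.score(X[test_idx], y[test_idx]))

# The final score is the mean over the k folds.
print(np.mean(scores))
```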
Confusion Matrix
A confusion matrix is a table that is often used to evaluate the performance of a classification algorithm. It provides a summary of the number of correct and incorrect predictions made by a classifier on a dataset. For a binary classifier, the matrix has four entries, typically called True Positive (TP), True Negative (TN), False Positive (FP), and False Negative (FN). These entries are organized in a table like this:
| Predicted \ Actual | Positive | Negative |
|---|---|---|
| Positive | TP | FP |
| Negative | FN | TN |
Here's what each term in the confusion matrix represents:
- True Positive (TP): The classifier correctly predicted instances of the positive class. For example, the classifier correctly identified spam emails as spam.
- True Negative (TN): The classifier correctly predicted instances of the negative class. For example, the classifier correctly identified non-spam emails as non-spam.
- False Positive (FP): Also known as a Type I error, this occurs when the classifier predicts the positive class but the actual class is negative. For example, the classifier incorrectly identifies a non-spam email as spam.
- False Negative (FN): Also known as a Type II error, this occurs when the classifier predicts the negative class but the actual class is positive. For example, the classifier incorrectly identifies a spam email as non-spam.
From the confusion matrix, various performance metrics can be calculated to assess the performance of a classifier, including:
- Accuracy: (TP + TN) / (TP + TN + FP + FN)
- Precision: TP / (TP + FP)
- Recall (Sensitivity or True Positive Rate): TP / (TP + FN)
- Specificity (True Negative Rate): TN / (TN + FP)
- F1 Score: 2 * (Precision * Recall) / (Precision + Recall)
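As a minimal sketch of these definitions, the snippet below uses a small made-up binary label vector (1 = positive, e.g. spam; the numbers are purely illustrative) and scikit-learn's metric functions:

```python
from sklearn.metrics import (confusion_matrix, accuracy_score,
                             precision_score, recall_score, f1_score)

# Hypothetical ground truth and predictions for 10 samples.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]

# With labels sorted as [0, 1], confusion_matrix returns [[TN, FP], [FN, TP]].
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tp, tn, fp, fn)                   # 3 4 2 1
print(accuracy_score(y_true, y_pred))   # (TP + TN) / total = 0.7
print(precision_score(y_true, y_pred))  # TP / (TP + FP) = 0.6
print(recall_score(y_true, y_pred))     # TP / (TP + FN) = 0.75
print(f1_score(y_true, y_pred))         # 2PR / (P + R) = 0.667
```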
ROC and AUC
```python
from matplotlib import pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score
```
ROC (Receiver Operating Characteristic): ROC is a graphical representation of a classification model's performance across various discrimination thresholds.
The ROC curve is created by plotting the true positive rate against the false positive rate for different threshold values.
```python
# Load breast cancer data.
X, y = datasets.load_breast_cancer(return_X_y=True)
```
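The following is a minimal sketch (reusing the imports from the blocks above) that trains a linear SVM on the breast cancer data, obtains continuous decision scores, and plots the ROC curve. The variables `y_test` and `y_score` defined here are reused in the AUC snippet below.

```python
# Split the data, train a classifier, and get a continuous score per test sample.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=0)
clf = svm.SVC(kernel='linear', C=1).fit(X_train, y_train)
y_score = clf.decision_function(X_test)

# Compute the TPR and FPR over a range of thresholds and plot the ROC curve.
fpr, tpr, thresholds = roc_curve(y_test, y_score)
plt.plot(fpr, tpr)
plt.plot([0, 1], [0, 1], linestyle='--')  # Diagonal: random guessing (AUC = 0.5).
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC curve')
plt.show()
```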
AUC (Area Under the ROC Curve): represents the area under the ROC curve. It provides a single scalar value that quantifies the classifier's ability to discriminate between the positive and negative classes.
- A higher AUC indicates better discrimination.
- The AUC value ranges from 0 to 1, where 0.5 suggests no discrimination (similar to random guessing), and 1 indicates perfect discrimination.
```python
# Compute AUC score.
print(roc_auc_score(y_test, y_score))
```