Data Preparation
Repository · Notebook
Intuition
We'll start by preparing our data: ingesting it from its source and then splitting it into training, validation, and test data splits.
Ingestion
Our data could reside in many different places (databases, files, etc.) and exist in different formats (CSV, JSON, Parquet, etc.). For our application, we'll load the data from a CSV file into a Pandas DataFrame using the read_csv function.
Here is a quick refresher on the Pandas library.
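For example, a few common operations on a toy DataFrame (the data below is made up purely for illustration):

```python
import pandas as pd

# Toy DataFrame for illustration
df = pd.DataFrame({"id": [1, 2, 3], "tag": ["mlops", "other", "mlops"]})

df.head(2)             # preview the first two rows
df["tag"].unique()     # distinct values in a column
df[df.tag == "mlops"]  # filter rows by a condition
len(df)                # number of rows
```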
In our data engineering lesson, we'll look at how to continually ingest data from more complex sources (ex. data warehouses).
Splitting
Next, we need to split our training dataset into train and val data splits.
- Use the train split to train the model. Here the model will have access to both inputs (features) and outputs (labels) to optimize its internal weights.
- After each iteration (epoch) through the training split, we will use the val split to determine the model's performance. Here the model will not use the labels to optimize its weights; instead, we will use the validation performance to optimize training hyperparameters such as the learning rate.
- Finally, we will use a separate holdout test dataset to determine the model's performance after training. This is our best measure of how the model may behave on new, unseen data from a similar distribution to our training dataset.
Tip
For our application, we will have a training dataset to split into train and val splits, and a separate testing dataset for the test set. While we could have one large dataset and split it into all three splits, it's a good idea to have a separate test dataset. Over time, our training data may grow, and each time we resplit it our test split would look different. That would make it difficult to compare our models against previous versions and against each other.
We can view the class counts in our dataset by using the pandas.DataFrame.value_counts function:
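A sketch of that call; the toy DataFrame below stands in for the real df loaded from the CSV (which has a tag column), where the same call produces the counts shown in the output that follows:

```python
import pandas as pd

# Toy stand-in for the real projects DataFrame loaded from the CSV
df = pd.DataFrame({"tag": ["mlops", "computer-vision", "mlops", "other"]})

# Count how many rows belong to each tag, most frequent first
df.tag.value_counts()
```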
tag
natural-language-processing    310
computer-vision                285
other                          106
mlops                           63
Name: count, dtype: int64
For our multi-class task (where each project has exactly one tag), we want to ensure that the data splits have similar class distributions. We can achieve this by specifying how to stratify the split using the stratify keyword argument with sklearn's train_test_split() function.
Creating proper data splits
What are the criteria we should focus on to ensure proper data splits?
- the dataset (and each data split) should be representative of data we will encounter
- equal distributions of output values across all splits
- shuffle your data if it's organized in a way that prevents input variance
- avoid random shuffles if your task can suffer from data leaks (ex. time-series)
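A sketch of the stratified split; the test_size and random_state values are illustrative choices rather than prescribed ones, and the toy DataFrame stands in for the real data:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in data with two balanced classes
df = pd.DataFrame({
    "text": [f"project {i}" for i in range(20)],
    "tag": ["mlops", "computer-vision"] * 10,
})

train_df, val_df = train_test_split(
    df,
    stratify=df.tag,    # preserve class proportions in both splits
    test_size=0.2,      # illustrative fraction held out for validation
    random_state=1234,  # illustrative seed, for reproducibility
)
```

Because of stratify, both resulting splits keep roughly the same class proportions as the full dataset.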
How can we validate that our data splits have similar class distributions? We can view the frequency of each class in each split:
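A sketch of that check, assuming the splits from train_test_split are named train_df and val_df (the toy data below stands in for the real dataset):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in data with two balanced classes
df = pd.DataFrame({"tag": ["nlp"] * 10 + ["cv"] * 10})
train_df, val_df = train_test_split(df, stratify=df.tag, test_size=0.2, random_state=1234)

# Class frequencies within each split
print(train_df.tag.value_counts())
print(val_df.tag.value_counts())
```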
tag
natural-language-processing    248
computer-vision                228
other                           85
mlops                           50
Name: count, dtype: int64
Before we view our validation split's class counts, recall that the validation split is only test_size of the entire dataset. So we need to adjust its value counts so that we can compare them to the training split's class counts.
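One way to do this adjustment (a sketch: since the validation split holds test_size of the data and the train split holds the remaining 1 - test_size, scaling the validation counts by their ratio makes them directly comparable to the train counts):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

test_size = 0.2  # illustrative; must match the fraction used for the split

# Toy stand-in data: 100 rows with a 75/25 class balance
df = pd.DataFrame({"tag": ["nlp"] * 75 + ["cv"] * 25})
train_df, val_df = train_test_split(df, stratify=df.tag, test_size=test_size, random_state=1234)

# Scale val counts by (1 - test_size) / test_size so they are directly
# comparable to the train split's counts
adjusted = val_df.tag.value_counts() * round((1 - test_size) / test_size)
print(adjusted)
```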
tag
natural-language-processing    248
computer-vision                228
other                           84
mlops                           52
Name: count, dtype: int64
These adjusted counts look very similar to our train split's counts. Now we're ready to explore our dataset!