A Beginner's Guide to Using Custom Metrics in Amazon SageMaker Training Jobs
Amazon SageMaker Series - Article 5
We are authoring a series of articles covering the end-to-end process of data preparation, model training, deployment, pipeline building, and monitoring using Amazon SageMaker Pipelines. This fifth article of the Amazon SageMaker series focuses on how to define custom metrics in Amazon SageMaker training jobs.
Amazon SageMaker allows you to track custom metrics during training jobs, so you can monitor the performance of your machine learning models and make informed decisions when evaluating them. Whether you are using a custom loss function, tracking a metric SageMaker does not emit by default, or building a custom training script, being able to define your own metrics is a valuable tool. In this blog post, we will demonstrate how to use custom metrics in SageMaker training jobs.
Custom metrics defined in a SageMaker training job are automatically logged to Amazon CloudWatch and can be viewed in both the CloudWatch and SageMaker consoles under the selected training job. You can view a graph of each custom metric over time and use it to track the performance of the training job. Additionally, if you are using SageMaker Experiments, your custom metrics will be available in SageMaker Studio. SageMaker Experiments allows you to compare different training runs and resulting models, along with their associated training metrics, to help you make informed decisions about your model.
How it Works
SageMaker custom metrics work by searching all the logs written to standard output (stdout) or standard error (stderr) for a specific regular expression. To define a new metric, you provide a name for the metric and the regular expression used to extract its value from the training logs. These two parameters are supplied per metric as part of the training job definition.
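To make the mechanism concrete, here is a minimal, self-contained sketch (plain Python, independent of SageMaker) of how a regular expression extracts a metric value from a log line; the log text and pattern are illustrative:

import re

# An illustrative log line, as it might appear in a training job's stdout.
log_line = "training loss 0.4375"

# The same pattern you would pass as the 'Regex' in a metric definition;
# the capturing group ([0-9\.]+) extracts the numeric value.
pattern = re.compile(r"training loss ([0-9\.]+)")

match = pattern.search(log_line)
if match:
    # SageMaker records the first capturing group as the metric value.
    print(float(match.group(1)))  # 0.4375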
You may be creating your training job in one of a few different ways. If you are using an end-to-end notebook with the SageMaker SDK, you will likely be using a framework estimator, such as the PyTorch or MXNet estimator, and training with the class’s fit() method. The Code Sample section below shows how to define custom metrics using a PyTorch estimator; any other estimator that extends the base SageMaker Estimator class can define custom metrics in the same way.
If you are instead interacting with SageMaker at a lower level, using the API directly or through boto3, you can still define custom metrics for your training job with CreateTrainingJob. Its AlgorithmSpecification parameter includes a MetricDefinitions parameter that expects the same Name/Regex pairs as the example below.
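As a rough sketch of where MetricDefinitions sits in that call, here is an illustrative boto3 example; the job name, image URI, role ARN, and bucket paths are placeholders, and your input data configuration and other settings will differ:

import boto3

sm_client = boto3.client('sagemaker')

# All names, ARNs, and paths below are placeholders.
sm_client.create_training_job(
    TrainingJobName='my-training-job',
    AlgorithmSpecification={
        'TrainingImage': '<your-training-image-uri>',
        'TrainingInputMode': 'File',
        # The same Name/Regex pairs used in the estimator example below.
        'MetricDefinitions': [
            {'Name': 'training_loss', 'Regex': 'training loss ([0-9\\.]+)'}
        ]
    },
    RoleArn='<your-sagemaker-execution-role-arn>',
    OutputDataConfig={'S3OutputPath': 's3://my-data-bucket/output'},
    ResourceConfig={
        'InstanceType': 'ml.p3.2xlarge',
        'InstanceCount': 1,
        'VolumeSizeInGB': 50
    },
    StoppingCondition={'MaxRuntimeInSeconds': 86400}
)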
Code Sample
Before you can use custom metrics in a SageMaker training job, you will need to set up your training job as you normally would. This includes selecting a training algorithm, authoring your training script, setting up input data, and choosing any hyperparameters. Here, we are using a PyTorch estimator with a training script 'pytorch-train.py'.
As part of the training script 'pytorch-train.py', we must include some code to print the custom metric to stdout. Here, we use the print function, but you can also use a logging function that writes to stdout.
print("training loss %f" %training_loss)
Now, in our notebook we can define the estimator to use for training our model. Here we define the PyTorch estimator, including the training entry point script 'pytorch-train.py'.
The metric_definitions parameter is used to define our custom metric. Name is the name of the metric as it will appear in the console, and Regex is the regular expression used to search the training job logs for the metric's value.
from sagemaker.pytorch import PyTorch

pytorch_estimator = PyTorch(
    entry_point='pytorch-train.py',
    role='<your-sagemaker-execution-role-arn>',  # required; your SageMaker execution role
    instance_type='ml.p3.2xlarge',
    instance_count=1,
    framework_version='1.8.0',
    py_version='py3',
    hyperparameters={'epochs': 20, 'batch-size': 64, 'learning-rate': 0.1},
    metric_definitions=[
        {'Name': 'training_loss', 'Regex': 'training loss ([0-9\\.]+)'}
    ]
)
Now, all that’s left is to start the training job by specifying the S3 locations of the training and test datasets.
pytorch_estimator.fit({'train': 's3://my-data-bucket/path/to/my/training/data',
                       'test': 's3://my-data-bucket/path/to/my/test/data'})
That’s it! To view the new custom metric in the SageMaker console, go to the "Training jobs" page and select your training job. Then, click on the "Metrics" tab to view a graph of your custom metric over time. The custom metrics will also be logged in CloudWatch, and if you are using SageMaker Experiments with this training job, they will also be available in the SageMaker Experiments console in SageMaker Studio.
Conclusion
Custom metrics are a powerful tool for tracking and monitoring specific metrics during SageMaker training jobs. With built-in integrations with Amazon CloudWatch and SageMaker Experiments, you can easily gain valuable insights into your model’s performance with minimal code changes.
FAQ
- How can users configure advanced settings for custom metrics in complex scenarios, such as multi-class classification problems?
In complex scenarios such as multi-class classification, you can compute each metric yourself in your training script (for example, a per-class F1 score), print each value to stdout in a predictable format, and add one Name/Regex pair per metric to your metric definitions, as sketched below.
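As an illustration, here is a hedged sketch of what that might look like for per-class F1 scores; the metric names and the f1_scores variable are hypothetical:

# In your training script, print one line per class (f1_scores is hypothetical):
#     print("f1_class_0 %f" % f1_scores[0])
#     print("f1_class_1 %f" % f1_scores[1])
# Then give each class its own Name/Regex pair in the metric definitions:
metric_definitions = [
    {'Name': 'f1_class_0', 'Regex': 'f1_class_0 ([0-9\\.]+)'},
    {'Name': 'f1_class_1', 'Regex': 'f1_class_1 ([0-9\\.]+)'}
]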
- Is it possible to monitor custom metrics in real time during the training process, and if so, how can this be achieved?
Custom metrics can be monitored in near real time during training in Amazon SageMaker through the integration with Amazon CloudWatch. Because metric values are published to CloudWatch while the job runs, you can track them as the model trains, gaining insight into performance and accuracy as adjustments are made; one programmatic approach is sketched below.
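For example, here is a hedged sketch of polling a custom metric with boto3 while a job runs; it assumes the metric is published under the /aws/sagemaker/TrainingJobs namespace with a TrainingJobName dimension (per the SageMaker CloudWatch documentation), and the job name is a placeholder:

import boto3
from datetime import datetime, timedelta

cloudwatch = boto3.client('cloudwatch')

# Fetch the last hour of datapoints for the custom metric.
response = cloudwatch.get_metric_statistics(
    Namespace='/aws/sagemaker/TrainingJobs',
    MetricName='training_loss',
    Dimensions=[{'Name': 'TrainingJobName', 'Value': 'my-training-job'}],
    StartTime=datetime.utcnow() - timedelta(hours=1),
    EndTime=datetime.utcnow(),
    Period=60,
    Statistics=['Average']
)
for point in sorted(response['Datapoints'], key=lambda p: p['Timestamp']):
    print(point['Timestamp'], point['Average'])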
- How do custom metrics in Amazon SageMaker integrate with other AWS services for enhanced monitoring and analysis?
Custom metrics in Amazon SageMaker integrate with other AWS services, such as Amazon CloudWatch and AWS Lambda, for enhanced monitoring, alerting, and automated responses. This enables more comprehensive analysis of training job metrics and more effective decision-making and optimization of machine learning models; one possible pattern is sketched below.
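As one hedged example, the sketch below creates a CloudWatch alarm on the custom training_loss metric that notifies an SNS topic (which could in turn invoke a Lambda function); the alarm name, threshold, job name, and topic ARN are all placeholders:

import boto3

cloudwatch = boto3.client('cloudwatch')

# Alarm if the average training loss stays above the threshold;
# all names and ARNs below are placeholders.
cloudwatch.put_metric_alarm(
    AlarmName='training-loss-too-high',
    Namespace='/aws/sagemaker/TrainingJobs',
    MetricName='training_loss',
    Dimensions=[{'Name': 'TrainingJobName', 'Value': 'my-training-job'}],
    Statistic='Average',
    Period=300,
    EvaluationPeriods=3,
    Threshold=1.0,
    ComparisonOperator='GreaterThanThreshold',
    AlarmActions=['arn:aws:sns:us-east-1:123456789012:my-alerts-topic']
)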
Author Spotlight:
Caitlin Berger