Table of contents
1. Introduction
2. The steps involved in hyperparameter tuning
  2.1. Set the hyperparameter tuning parameters for your training task
  2.2. gcloud
  2.3. Python
3. Checking the code in your training application
  3.1. Handling the command-line arguments
  3.2. Reporting your hyperparameter metric to AI Platform Training
  3.3. Keras
  3.4. Estimator
4. Hyperparameter tuning job details while running
  4.1. Getting Results
  4.2. Setting a limit on the number of trials
5. Handling failed trials
6. Running parallel trials
7. Stopping trials early
8. Resuming a completed hyperparameter tuning job
  8.1. gcloud
  8.2. Python
9. Hyperparameter tuning with Cloud TPU
10. Frequently Asked Questions
  10.1. Why is hyperparameter tuning important?
  10.2. What is Keras?
  10.3. What is a runtime error?
11. Conclusion
Last Updated: Mar 27, 2024

Using hyperparameter tuning

Author: Muskan Sharma

Introduction

This article demonstrates how to train your model on AI Platform Training using hyperparameter tuning. Hyperparameter tuning optimizes a target variable that you choose, called the hyperparameter metric.

The steps involved in hyperparameter tuning

To use hyperparameter tuning in your training job, you must take the following steps:

Include a HyperparameterSpec in your TrainingInput object to specify the hyperparameter tuning configuration for your training job.

The following code should be included in your training application:

To establish the hyperparameters for your training trial, parse the command-line arguments that indicate the hyperparameters you want to modify.

Include your hyperparameter metric in the graph's summary.

Set the hyperparameter tuning parameters for your training task.

To save the hyperparameter tuning configuration for your training task, create a HyperparameterSpec object, then include it as the hyperparameters object in your TrainingInput object.

The hyperparameter tuning job creates trial jobs. To speed up the trial jobs, you can define a custom machine type in the TrainingInput object. For instance, to create trial jobs that each use an n1-standard-8 VM, specify masterType as n1-standard-8 and leave the worker configuration blank, as in the sketch below.
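
A minimal sketch of the relevant TrainingInput fields; the rest of your training job configuration is unchanged:

# Each trial job runs on a single n1-standard-8 VM because no worker or
# parameter-server configuration is specified.
training_inputs = {
    'scaleTier': 'CUSTOM',
    'masterType': 'n1-standard-8',
    # Other TrainingInput fields (packageUris, pythonModule, region, ...)
    # are set as for any training job.
}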

Set the hyperparameterMetricTag in your HyperparameterSpec to a value corresponding to your chosen metric. If you don't specify a hyperparameterMetricTag, AI Platform Training looks for a metric named training/hptuning/metric. The example below shows how to create a configuration for a metric named metric1:

gcloud

Add the configuration details for your hyperparameters to your configuration YAML file. For a working configuration file, see hptuning_config.yaml in the census estimator sample. Here is an example:

trainingInput:
  scaleTier: CUSTOM
  masterType: complex_model_m
  workerType: complex_model_m
  parameterServerType: large_model
  workerCount: 9
  parameterServerCount: 3
  hyperparameters:
    goal: MAXIMIZE
    hyperparameterMetricTag: metric1
    maxTrials: 30
    maxParallelTrials: 1
    enableTrialEarlyStopping: True
    params:
    - parameterName: hidden1
      type: INTEGER
      minValue: 40
      maxValue: 400
      scaleType: UNIT_LINEAR_SCALE
    - parameterName: numRnnCells
      type: DISCRETE
      discreteValues:
      - 1
      - 2
      - 3
      - 4
    - parameterName: rnnCellType
      type: CATEGORICAL
      categoricalValues:
      - BasicLSTMCell
      - BasicRNNCell
      - GRUCell
      - LSTMCell
      - LayerNormBasicLSTMCell

Python

To add hyperparameter tuning to your training input, create a dictionary representing your HyperparameterSpec and add it to your TrainingInput. As in the training job configuration guide, the example below assumes that you have already created a TrainingInput dictionary named training_inputs.
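
A minimal sketch mirroring the gcloud YAML example above (the values are illustrative):

# Add hyperparameter tuning to the job config.
hyperparams = {
    'goal': 'MAXIMIZE',
    'hyperparameterMetricTag': 'metric1',
    'maxTrials': 30,
    'maxParallelTrials': 1,
    'enableTrialEarlyStopping': True,
    'params': [
        {'parameterName': 'hidden1',
         'type': 'INTEGER',
         'minValue': 40,
         'maxValue': 400,
         'scaleType': 'UNIT_LINEAR_SCALE'},
        {'parameterName': 'numRnnCells',
         'type': 'DISCRETE',
         'discreteValues': [1, 2, 3, 4]},
        {'parameterName': 'rnnCellType',
         'type': 'CATEGORICAL',
         'categoricalValues': [
             'BasicLSTMCell', 'BasicRNNCell', 'GRUCell',
             'LSTMCell', 'LayerNormBasicLSTMCell']}]}

# Add the hyperparameter specification to the training inputs dictionary.
training_inputs['hyperparameters'] = hyperparams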

Checking the code in your training application

Handle the hyperparameter command-line arguments in your application and report your hyperparameter metric to AI Platform Training.

Handling the command-line arguments

When calling your training application, AI Platform Training sets command-line arguments for the hyperparameters. To use them in your code, do the following (a sketch follows the list):

  1. Give each hyperparameter argument a name and parse it with your preferred argument parser. The names of the arguments must coincide with the parameter names you supplied in the task configuration, as previously mentioned.
  2. Give the hyperparameters in your training code the values from the command-line arguments.
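
A minimal sketch using Python's argparse and the parameter names from the YAML example above (hidden1, numRnnCells, rnnCellType); the default values are illustrative only:

import argparse

def parse_hyperparameter_args():
    # Argument names must match the parameterName values in the
    # HyperparameterSpec so that the values AI Platform Training passes
    # on the command line are picked up.
    parser = argparse.ArgumentParser()
    parser.add_argument('--hidden1', type=int, default=64,
                        help='Size of the first hidden layer.')
    parser.add_argument('--numRnnCells', type=int, default=1,
                        help='Number of RNN cells.')
    parser.add_argument('--rnnCellType', type=str, default='BasicLSTMCell',
                        help='Type of RNN cell to use.')
    # parse_known_args ignores any extra arguments the service passes.
    args, _ = parser.parse_known_args()
    return args

args = parse_hyperparameter_args()
# Use args.hidden1, args.numRnnCells and args.rnnCellType when building the model.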

Reporting your hyperparameter metric to AI Platform Training

How you report your hyperparameter metric to the AI Platform Training service depends on whether you use TensorFlow for training, and on whether you train with a runtime version or a custom container.

A runtime version of TensorFlow 

If you train with TensorFlow and use an AI Platform Training runtime version, you can report your hyperparameter metric to AI Platform Training by writing the metric to a TensorFlow summary, for example with tf.summary.scalar, as shown in the Keras example below.

The examples below show the basics of two different ways to write your hyperparameter metric to a summary. Both examples assume you are training a regression model, and both write the root-mean-square error between the ground-truth labels and the evaluation predictions as a hyperparameter metric named metric1.

Keras

import tensorflow as tf
from datetime import datetime

class MyMetricCallback(tf.keras.callbacks.Callback):

    def on_epoch_end(self, epoch, logs=None):
        # 'metric1' must match the hyperparameterMetricTag in the
        # HyperparameterSpec; the scalar is written to the default summary
        # writer set up below.
        tf.summary.scalar('metric1', logs['RootMeanSquaredError'], epoch)

logdir = "logs/scalars/" + datetime.now().strftime("%Y%m%d-%H%M%S")
file_writer = tf.summary.create_file_writer(logdir + "/metrics")
file_writer.set_as_default()

model = tf.keras.Sequential(
    [tf.keras.layers.Dense(1, activation='linear', input_dim=784)])
model.compile(
    optimizer='rmsprop',
    loss='mean_squared_error',
    # Name the metric explicitly so its key in `logs` matches the
    # 'RootMeanSquaredError' lookup in the callback above.
    metrics=[tf.keras.metrics.RootMeanSquaredError(name='RootMeanSquaredError')])

model.fit(
    x_train,
    y_train,
    batch_size=64,
    epochs=10,
    steps_per_epoch=5,
    verbose=0,
    callbacks=[MyMetricCallback()])

Estimator 

import tensorflow as tf

# Create the metric used for hyperparameter tuning.
def my_metric(labels, predictions):
    # Note that different types of estimator provide different keys on the
    # predictions Tensor. predictions['predictions'] is for regression output.
    pred_values = predictions['predictions']
    return {'metric1': tf.compat.v1.metrics.root_mean_squared_error(labels, pred_values)}

# estimator creation to train and evaluate
def train_and_evaluate(output_dir):

    estimator = tf.estimator.DNNLinearCombinedRegressor(...)

    estimator = tf.estimator.add_metrics(estimator, my_metric)

    train_spec = ...
    eval_spec = tf.estimator.EvalSpec(
        start_delay_secs = 60, # start evaluating after 60 seconds
        throttle_secs = 300,  # evaluate every 300 seconds
        ...)
    tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)

Hyperparameter tuning job details while running

You can keep track of hyperparameter adjustments by obtaining the detailed status of your ongoing training job.

A training job with hyperparameter tuning sets the following values for the TrainingOutput object in the response's Job resource:

isHyperparameterTuningJob set to True.

The trials field contains a list of HyperparameterOutput objects, one for each trial.

You can also find the trial ID in the TF_CONFIG environment variable; see the guide on getting details from TF_CONFIG. A minimal sketch of reading it is shown below.
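
For example, a training application can read the trial ID like this (the variable is only set when the job runs on AI Platform Training):

import json
import os

# AI Platform Training sets TF_CONFIG for each trial; the identifier of the
# current hyperparameter tuning trial is exposed under task.trial.
tf_config = json.loads(os.environ.get('TF_CONFIG', '{}'))
trial_id = tf_config.get('task', {}).get('trial', '')
print('Current hyperparameter tuning trial:', trial_id)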

Getting Results

When the training runs are finished, you can check the outcome of each trial in the Google Cloud console. You can also call projects.jobs.get to obtain the results, as in the sketch below. The TrainingOutput object in the Job resource contains the metrics for all runs, with the metrics for the best-tuned run identified.
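
A minimal sketch using the Google API Python client; the project and job names are placeholders, and application default credentials are assumed to be configured:

from googleapiclient import discovery

project_id = 'my-project'        # placeholder
job_name = 'my_hptuning_job'     # placeholder

# Build a client for the AI Platform Training v1 API and fetch the job.
ml = discovery.build('ml', 'v1')
job = ml.projects().jobs().get(
    name='projects/{}/jobs/{}'.format(project_id, job_name)).execute()

# trainingOutput.trials holds one HyperparameterOutput per trial, including
# the trial's hyperparameters and its finalMetric.
for trial in job['trainingOutput']['trials']:
    print(trial['trialId'], trial.get('hyperparameters'), trial.get('finalMetric'))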

The job description contains the results of each trial. In the Google Cloud console, you can filter trials by rmse, learning rate, and training steps. Find the trial that produced the best value of the hyperparameter metric. If the trial meets your criteria for model success, you can use the hyperparameter values shown for that trial in subsequent runs of your model.

A trial status of FAILED in HyperparameterOutput can mean either that training for that trial failed or that the trial failed to report the hyperparameter tuning metric. In the latter case, the parent job can still succeed even though the trial fails. You can check the trial log to see whether training failed for that trial.

Setting a limit on the number of trials

Set the maxTrials value in the HyperparameterSpec object to the number of trials you want the service to run.

When determining how many trials to permit, there are two competing interests to take into account:

  1. time (and consequently cost)
  2. accuracy

To get the most out of hyperparameter tuning, you shouldn't set your maximum to a value lower than ten times the number of hyperparameters you use.

Handling failed trials

If your hyperparameter tuning trials fail, you might want to end the training job early. Set the maxFailedTrials field in the HyperparameterSpec to the number of failed trials you want to allow; after this many trials fail, AI Platform Training ends the training job. The value of maxFailedTrials must be less than or equal to maxTrials. A minimal configuration sketch follows.
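
A sketch assuming the training_inputs dictionary from the earlier Python example (the values are illustrative):

hyperparams = {
    'goal': 'MAXIMIZE',
    'hyperparameterMetricTag': 'metric1',
    'maxTrials': 30,
    'maxParallelTrials': 1,
    'maxFailedTrials': 5,  # must be less than or equal to maxTrials
}
training_inputs['hyperparameters'] = hyperparams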

AI Platform Training handles failing trials according to the following principles if maxFailedTrials is zero or not set at all:

  • If the first trial of your job fails, AI Platform Training ends the job immediately. A failure in the first trial suggests a problem in your training code, so further trials are likely to fail as well. Ending the job lets you diagnose the problem without running additional trials or incurring additional cost.
  • If the first trial succeeds, AI Platform Training might end the job after failures in subsequent trials based on one or more of the following criteria:
    • The number of failed trials has grown too high.
    • The ratio of failed trials to successful trials has grown too high.

Running parallel trials

By specifying maxParallelTrials in the HyperparameterSpec object, you can determine how many trials will be performed concurrently.

Running parallel trials reduces the time the training job takes (elapsed time; the total processing time required typically does not change). However, running in parallel can reduce the effectiveness of the tuning job overall. That is because hyperparameter tuning uses the results of previous trials to inform the values it assigns to the hyperparameters of subsequent trials. When running in parallel, some trials start without the benefit of the results of trials still running.

Stopping trials early

To allow early termination of trials, set the enableTrialEarlyStopping value in the HyperparameterSpec to True.

Resuming a completed hyperparameter tuning job

A completed hyperparameter tuning job can be resumed so that it starts from a partially optimized state. This lets you reuse the knowledge gained during the previous hyperparameter tuning job.

To continue a hyperparameter tuning job, submit a new one with the setup shown below:

  • Set the value of resumePreviousJobId in the HyperparameterSpec to the job ID of the previous job.
  • Specify values for maxTrials and maxParallelTrials.

The resumePreviousJobId configuration is used in the examples below:

gcloud

trainingInput:
  scaleTier: CUSTOM
  masterType: complex_model_m
  workerType: complex_model_m
  parameterServerType: large_model
  workerCount: 9
  parameterServerCount: 3
  hyperparameters:
    enableTrialEarlyStopping: TRUE
    maxTrials: 30
    maxParallelTrials: 1
    resumePreviousJobId: [PREVIOUS_JOB_IDENTIFIER]

Python

# Add hyperparameter tuning to the job config.
hyperparams = {
    'enableTrialEarlyStopping': True,
    'maxTrials': 30,
    'maxParallelTrials': 1,
    'resumePreviousJobId': [PREVIOUS_JOB_IDENTIFIER]}

# Add the hyperparameter specification to training inputs dictionary.
training_inputs['hyperparameters'] = hyperparams

# Build the job spec.
job_spec = {'jobId': my_job_name, 'trainingInput': training_inputs}

Hyperparameter tuning with Cloud TPU

If you run your hyperparameter tuning job with Cloud TPU on AI Platform Training, it's recommended to use the eval_metrics property of TPUEstimatorSpec. A minimal sketch follows.
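
A sketch of a TF 1.x-style model_fn that reports the metric through eval_metrics; the single numeric feature 'x' and the model itself are assumptions for illustration only:

import tensorflow as tf

def metric_fn(labels, predictions):
    # Report RMSE under the name used as hyperparameterMetricTag.
    return {'metric1': tf.compat.v1.metrics.root_mean_squared_error(
        labels, predictions)}

def model_fn(features, labels, mode, params):
    # Minimal single-output regression model; labels are assumed to have
    # shape [batch_size, 1] to match the predictions.
    predictions = tf.compat.v1.layers.dense(features['x'], 1)
    loss = tf.compat.v1.losses.mean_squared_error(labels, predictions)

    if mode == tf.estimator.ModeKeys.EVAL:
        # eval_metrics is a (metric_fn, tensors) pair; the tensors are passed
        # to metric_fn when the metrics are computed.
        return tf.compat.v1.estimator.tpu.TPUEstimatorSpec(
            mode=mode,
            loss=loss,
            eval_metrics=(metric_fn, [labels, predictions]))

    optimizer = tf.compat.v1.tpu.CrossShardOptimizer(
        tf.compat.v1.train.AdamOptimizer())
    train_op = optimizer.minimize(
        loss, global_step=tf.compat.v1.train.get_global_step())
    return tf.compat.v1.estimator.tpu.TPUEstimatorSpec(
        mode=mode, loss=loss, train_op=train_op)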

Frequently Asked Questions

Why is hyperparameter tuning important?

A machine learning model's behavior is largely controlled by its hyperparameters, so tuning them is an effective way to improve the model's performance on your chosen metric.

What is Keras?

Keras is an open-source software library that provides a Python interface for artificial neural networks. Keras acts as an interface for the TensorFlow library.

What is a runtime error?

When a program is syntactically sound but has a bug only discovered during program execution, it is said to have a runtime error.

Conclusion

This blog has extensively discussed using hyperparameter tuning on AI Platform Training. We hope this blog has helped you learn how to use hyperparameter tuning. If you want to learn more, check out the excellent content on the Coding Ninjas Website:

Manage service accounts in GCP, Overview of hyperparameter tuning in GCP

Refer to our guided paths on the Coding Ninjas Studio platform to learn more about DSA, DBMS, Competitive Programming, Python, Java, JavaScript, etc. 

Refer to the links: problems, top 100 SQL problems, resources, and mock tests to enhance your knowledge.

For placement preparations, visit interview experiences and interview bundle.

Thank you

Do upvote our blog to help other ninjas grow. Happy Coding!
