Code360 powered by Coding Ninjas
Last Updated: Mar 27, 2024

Training scikit-learn and XGBoost Models on AI Platform



Ensemble models almost always outperform individual models in machine learning. An ensemble model combines several machine learning models into one. Before we can run and deploy our models in a workflow, however, we must train them. This tutorial shows how to train a machine learning model with XGBoost and scikit-learn.


In this post, we train a simple model on a dataset of Census Income Data. The model predicts an individual's income level. We adapt our model training code to read data from Cloud Storage and to upload the saved model file to Cloud Storage. We then build a training application package to run training on AI Platform Training.



Training scikit-learn and XGBoost models on AI Platform can be divided into two phases: the initial setup phase and the training phase.

Setup Phase

We should set up the GCP project and environment before we start training our model. We can configure the environment either locally or remotely in Cloud Shell. Cloud Shell offers a quick way to try AI Platform Training, but it is not suitable for ongoing development work. To run our model locally, we must make sure the frameworks are installed. Run the required command to install XGBoost, pandas, and scikit-learn.

We need a Cloud Storage bucket to store our training code and dependencies. For this tutorial, the simplest option is a dedicated Cloud Storage bucket in the same project we use for AI Platform Training.

If we use a bucket in a different project, we must make sure the AI Platform Training service account can access our training code and dependencies in Cloud Storage. Without the proper permissions, the training job fails.

The bucket should be located in the same region where we run training jobs. In this tutorial we work from India, so we select the "Mumbai (asia-south1)" region. We must also be careful when naming the new bucket: the name has to be unique across all Cloud Storage buckets.

The subsequent steps require the following variables.

  • TRAINER_PACKAGE_PATH <./census_training>: the path to the packaged training application, which gets staged on Google Cloud Storage. The model file created below is placed in this package directory.
  • MAIN_TRAINER_MODULE <census_training.train>: the module to run on AI Platform, in the format <folder_name.python_file_name>.
  • JOB_DIR <gs://$BUCKET_NAME/scikit_learn_job_dir>: the Google Cloud Storage location used for the job output.
  • RUNTIME_VERSION: the AI Platform runtime version to use for the job. If we do not specify one, the training service uses AI Platform runtime version 1.0 by default.
  • PYTHON_VERSION: the Python version to use for the job. Python 3.5 is available with runtime version 1.4 or later. If we do not specify one, the training service uses Python 2.7.

Replace the following placeholders with our own values:

  • PROJECT_ID <YOUR_PROJECT_ID>: the ID of our Google Cloud Platform project.
  • BUCKET_NAME <YOUR_BUCKET_NAME>: the name of the bucket we created earlier.
  • JOB_DIR <gs://YOUR_BUCKET_NAME/scikit_learn_job_dir>: the job output path, with YOUR_BUCKET_NAME replaced by the name of the bucket we created earlier.
  • REGION <REGION>: a Google Cloud region chosen from the options Google provides, or the default region. The model gets deployed in this region.


%env REGION asia-south1
%env TRAINER_PACKAGE_PATH ./census_training
%env MAIN_TRAINER_MODULE census_training.train
%env JOB_DIR gs://<BUCKET_NAME>/scikit_learn_job_dir
! mkdir census_training
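
To make the relationship between these settings concrete, here is a minimal sketch in plain Python. It only assembles strings and does not contact Google Cloud. The helper names `module_to_path` and `build_settings` are hypothetical (they are not part of any Google Cloud library), and `my-bucket` is a placeholder bucket name.

```python
from pathlib import PurePosixPath


def module_to_path(module_name: str) -> str:
    """Map a dotted module name (<folder_name.python_file_name>) to its file path."""
    parts = module_name.split(".")
    return str(PurePosixPath(*parts).with_suffix(".py"))


def build_settings(bucket_name: str, region: str = "asia-south1") -> dict:
    """Collect the variables used in this tutorial in one dictionary."""
    return {
        "REGION": region,
        "TRAINER_PACKAGE_PATH": "./census_training",
        "MAIN_TRAINER_MODULE": "census_training.train",
        # JOB_DIR is derived from the bucket name.
        "JOB_DIR": "gs://{}/scikit_learn_job_dir".format(bucket_name),
    }


settings = build_settings("my-bucket")  # 'my-bucket' is a placeholder
print(settings["JOB_DIR"])
print(module_to_path(settings["MAIN_TRAINER_MODULE"]))
```

Deriving JOB_DIR from the bucket name in one place helps keep the two values from drifting apart when the bucket changes.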


The Census Income Data Set used for training in this sample is made available by the UC Irvine Machine Learning Repository. With a typical development process and our own data, we would need to upload the data to GCS so that AI Platform can access it. In this case, the data has already been uploaded to GCS, so we do not have to download it from UC Irvine and upload it ourselves.

This dataset is provided by a third party. Coding Ninjas makes no representation, warranty, or other guarantee regarding the accuracy or any other aspect of this dataset.

Training Phase

After the initial setup is finished, we can begin training our model on AI Platform. We will carry out the following steps in order:

  • Create our Python training module (the Python model file).
  • Create a training application package.
  • Submit the training job.

Part 1 Create a Python Model File

The Python model file below is the first thing we create and later upload to AI Platform. It is similar to how a scikit-learn model is normally created, with two significant differences:

  • At the beginning of the file, we download the data from GCS so that AI Platform can access it.
  • At the end of the file, we export (save) the model to GCS so that we can use it for predictions.

The code in this file loads the data into a pandas DataFrame so that scikit-learn can use it, then fits the model to the training data. Finally, it saves the model, using the version of joblib bundled with sklearn, to a file that can be uploaded to AI Platform's prediction service.

In a typical scenario, we would first test the model locally on a small dataset to make sure it works, and only then run it on the larger dataset on AI Platform. This avoids wasting time and money.
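
A local smoke test of this kind could look like the sketch below. It is not the model file from this tutorial: the rows and labels are made up, and it swaps in `ColumnTransformer` with `OneHotEncoder` from current scikit-learn releases in place of the SelectKBest and LabelBinarizer approach used in the model file. It assumes scikit-learn, NumPy, and the standalone joblib package are installed.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Made-up rows shaped like (age, workclass, hours-per-week); not real census data.
rows = np.array([
    [39, "State-gov", 40],
    [50, "Self-emp", 13],
    [38, "Private", 40],
    [53, "Private", 40],
    [28, "Private", 45],
    [37, "State-gov", 60],
], dtype=object)
labels = [True, True, True, False, False, False]  # made-up targets

# One-hot encode the categorical column (index 1); pass the numeric ones through.
preprocess = ColumnTransformer(
    [("categorical", OneHotEncoder(handle_unknown="ignore"), [1])],
    remainder="passthrough",
    sparse_threshold=0,  # always produce a dense array
)
pipeline = Pipeline([
    ("preprocess", preprocess),
    ("classifier", RandomForestClassifier(n_estimators=10, random_state=0)),
])
pipeline.fit(rows, labels)

# Save and reload the fitted pipeline, mirroring what the training job does.
model_path = os.path.join(tempfile.mkdtemp(), "model.joblib")
joblib.dump(pipeline, model_path)
restored = joblib.load(model_path)

predictions = pipeline.predict(rows)
restored_predictions = restored.predict(rows)
print(len(predictions), (predictions == restored_predictions).all())
```

If the pipeline fits, saves, and reloads on a handful of rows, most wiring mistakes are caught before any cloud resources are used.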


In [2]:
%%writefile ./census_training/
# Setup starts here.
import datetime
import pandas as pd

from google.cloud import storage

from sklearn.ensemble import RandomForestClassifier
# sklearn.externals.joblib was removed in scikit-learn 0.23;
# on newer versions use `import joblib` instead.
from sklearn.externals import joblib
from sklearn.feature_selection import SelectKBest
from sklearn.pipeline import FeatureUnion
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelBinarizer

# REPLACE '<BUCKET_NAME>' with your GCS BUCKET_NAME.
BUCKET_NAME = '<BUCKET_NAME>'
# Setup ends here.

# ---------------------------------------
# 1. Download the data from Google Cloud Storage. We use the publicly hosted
#    census data here; it is then used to train our model on AI Platform.
# ---------------------------------------
# Data download starts here.
# This public bucket contains the census data.
bucket = storage.Client().bucket('cloud-samples-data')

# Path to the data inside the public bucket.
blob = bucket.blob('ml-engine/sklearn/census_data/')
# Download the data.
# Data download ends here.

# ---------------------------------------
# The model code goes here. Below is a sample model built with the census data.
# ---------------------------------------
# Data definition and loading start here.
# These columns are taken directly from the census data files. They describe
# the format of our input data, including columns we do not use.
# (Standard column order of the UCI Census Income data.)
COLUMNS = (
    'age', 'workclass', 'fnlwgt', 'education', 'education-num',
    'marital-status', 'occupation', 'relationship', 'race', 'sex',
    'capital-gain', 'capital-loss', 'hours-per-week', 'native-country',
    'income-level'
)

# Categorical columns are the ones that must be converted to numerical values.
CATEGORICAL_COLUMNS = (
    'workclass', 'education', 'marital-status', 'occupation',
    'relationship', 'race', 'sex', 'native-country'
)

# Load the training census dataset.
with open('./', 'r') as train_data:
    raw_training_data = pd.read_csv(train_data, header=None, names=COLUMNS)

# Remove the column we are trying to predict ('income-level') from the features
# and convert the DataFrame to a list of lists.
train_features = raw_training_data.drop('income-level', axis=1).values.tolist()
# Create the training labels list: True where income-level is ' <=50K'.
train_labels = (raw_training_data['income-level'] == ' <=50K').values.tolist()
# Data definition and loading end here.

# Categorical feature conversion starts here.
# The categorical features of the census data set must be converted to
# numerical values. Each categorical column is converted using a list of
# pipelines; these are then combined with FeatureUnion before the
# RandomForestClassifier is invoked.
categorical_pipelines = []

# Each categorical column must be extracted and converted to a numerical value
# separately. For each one, we use a pipeline that extracts a single feature
# column with SelectKBest(k=1) and converts the categorical value to a
# numerical one with LabelBinarizer(). The column to extract is chosen with a
# scores array (created below), built by iterating over COLUMNS and checking
# whether each column is in CATEGORICAL_COLUMNS.
for i, col in enumerate(COLUMNS[:-1]):
    if col in CATEGORICAL_COLUMNS:
        # Build a scores array that selects this categorical column. Example row:
        #  data = [39, 'Federal-gov', 77516, '11th', 13, 'Widowed', 'Sales',
        #          'Wife', 'Black', 'Female', 2174, 0, 40, 'Cambodia']
        scores = [0] * len(COLUMNS[:-1])
        # This is the categorical column we want to extract.
        scores[i] = 1
        skb = SelectKBest(k=1)
        skb.scores_ = scores

        # Convert the categorical column to a numerical value.
        lbn = LabelBinarizer()
        r = skb.transform(train_features)
        lbn.fit(r)

        # Create the pipeline that extracts the categorical feature.
        categorical_pipelines.append(
            ('categorical-{}'.format(i), Pipeline([
                ('SKB-{}'.format(i), skb),
                ('LBN-{}'.format(i), lbn)])))
# Categorical feature conversion ends here.
# Pipeline creation starts here.
# Create a pipeline to extract the numerical features.
skb = SelectKBest(k=6)
# Select the numerical features from COLUMNS.
skb.scores_ = [1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0]
categorical_pipelines.append(('numerical', skb))

# Combine all the features using FeatureUnion.
preprocess = FeatureUnion(categorical_pipelines)

# Create the classifier.
classifier = RandomForestClassifier()

# Transform the features and fit the classifier on them.
classifier.fit(preprocess.fit_transform(train_features), train_labels)

# Create the overall model as a single pipeline.
pipeline = Pipeline([
    ('union', preprocess),
    ('classifier', classifier)
])
# Pipeline creation ends here.

# ---------------------------------------
# 2. Export and save the model to Google Cloud Storage
# ---------------------------------------
# Export to GCS starts here.
# Export the model to a file.
model = 'model.joblib'
joblib.dump(pipeline, model)

# Upload the model to GCS under a timestamped folder.
bucket = storage.Client().bucket(BUCKET_NAME)
blob = bucket.blob('{}/{}'.format(
    datetime.datetime.now().strftime('census_%Y%m%d_%H%M%S'),
    model))
blob.upload_from_filename(model)
# Export to GCS ends here.
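
The scores-array trick used in the model file can be illustrated without scikit-learn. The sketch below is a rough stand-in for what SelectKBest does once its scores are set by hand; it is only meant for the case in this tutorial, where exactly k scores are non-zero. The helper names `one_hot_scores` and `select_top_k` are hypothetical, and the column list is shortened for illustration.

```python
COLUMNS = ("age", "workclass", "education-num", "sex")  # shortened, illustrative
CATEGORICAL_COLUMNS = ("workclass", "sex")


def one_hot_scores(columns, wanted):
    """Return a scores list with 1 at the position of `wanted` and 0 elsewhere."""
    return [1 if col == wanted else 0 for col in columns]


def select_top_k(row, scores, k):
    """Keep the k values of `row` with the highest scores, in original order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    keep = sorted(ranked[:k])
    return [row[i] for i in keep]


row = [39, "Federal-gov", 13, "Female"]

# One scores array per categorical column, each selecting a single column.
for col in CATEGORICAL_COLUMNS:
    scores = one_hot_scores(COLUMNS, col)
    print(col, scores, select_top_k(row, scores, k=1))

# A single scores array with several ones selects all numerical columns at once.
numerical_scores = [1, 0, 1, 0]
print(select_top_k(row, numerical_scores, k=2))
```

This is why the model file needs one pipeline per categorical column but only a single SelectKBest(k=6) for the six numerical columns.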

Part 2 Create Trainer Package

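Before we can run our trainer application on AI Platform, the code and any dependencies must be placed in a Google Cloud Storage location that our Google Cloud Platform project can access. The package directory also needs an __init__.py file, which may be empty, so that Python treats the directory as a package.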


In [3]:
%%writefile ./census_training/
The output will be similar to this:
Writing ./census_training/

Part 3 Submit Training Job

The job must then be submitted for training on AI Platform. We submit it through gcloud with the following flags:

  • job-name: the name of the job (mixed-case letters, numbers, and underscores only, starting with a letter). In this case it is: census_training_$(date +"%Y%m%d_%H%M%S")
  • job-dir: the path to a Google Cloud Storage location where the job output is stored.
  • package-path: the path to the packaged training application, which is staged in a Google Cloud Storage location. When we use the gcloud command-line tool, this step is largely automated.
  • module-name: the name of the main module in the trainer package. The main module is the Python file that is called to start the application. When submitting the job with the gcloud command, specify the main module name in the --module-name argument.
  • region: the Google Cloud Compute region where we want the job to run. It is best to run the training job in the same region as the Cloud Storage bucket that holds our training data. Choose a region from the list of available regions or keep the default, "us-central1".
  • runtime-version: the AI Platform runtime version to use for the job. If we do not specify one, the training service uses AI Platform runtime version 1.0 by default. See the list of runtime versions for further details.
  • python-version: the Python version to use for the job. Python 3.5 requires runtime version 1.4 or later. If no Python version is specified, the training service uses Python 2.7.
  • scale-tier: the type of processing cluster on which to run the job. This can be the CUSTOM scale tier, in which case we explicitly specify the number and type of machines to use.

Make sure gcloud is set to the current PROJECT_ID.


In [4]:
! gcloud config set project $PROJECT_ID
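
The submission step itself can be sketched as follows. This helper only assembles the argument list from the flags described above and prints it; it does not run gcloud. It assumes the `gcloud ai-platform jobs submit training` command (older gcloud releases expose the same command under the `ml-engine` group, as seen in the output in this article). The function name `build_submit_command`, the bucket name, and the version values "1.14" and "3.5" are placeholders chosen for illustration.

```python
import shlex
from datetime import datetime


def build_submit_command(job_dir, package_path, module_name, region,
                         runtime_version, python_version,
                         scale_tier="BASIC", now=None):
    """Build the argument list for submitting a training job with gcloud."""
    now = now or datetime.now()
    # Job names may contain only letters, numbers, and underscores.
    job_name = now.strftime("census_training_%Y%m%d_%H%M%S")
    return [
        "gcloud", "ai-platform", "jobs", "submit", "training", job_name,
        "--job-dir", job_dir,
        "--package-path", package_path,
        "--module-name", module_name,
        "--region", region,
        "--runtime-version", runtime_version,
        "--python-version", python_version,
        "--scale-tier", scale_tier,
    ]


command = build_submit_command(
    job_dir="gs://my-bucket/scikit_learn_job_dir",  # placeholder bucket
    package_path="./census_training",
    module_name="census_training.train",
    region="asia-south1",
    runtime_version="1.14",  # placeholder value
    python_version="3.5",
    now=datetime(2022, 9, 3, 9, 24, 12),
)
print(" ".join(shlex.quote(arg) for arg in command))
```

Passing the list to `subprocess.run(command, check=True)` would submit the job, provided gcloud is installed and authenticated.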

After submitting the job, we should see output similar to this:

Job [census_training_20220903_092412] submitted successfully.
The job is still active. We may view the status of our job with the command
 $ gcloud ml-engine jobs describe census_training_20220903_092412
or continue streaming the logs with the command
 $ gcloud ml-engine jobs stream-logs census_training_20220903_092412
jobId: census_training_20220903_092412
state: QUEUED
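
When scripting around this output, the job ID and state can be pulled out with a few lines of Python. This is a minimal sketch that parses text shaped like the output above. The helper name `parse_job_summary` is hypothetical, and the set of finished states is an assumption based on the job states AI Platform reports.

```python
# States after which a job no longer changes (assumed list).
FINISHED_STATES = {"SUCCEEDED", "FAILED", "CANCELLED"}


def parse_job_summary(text):
    """Extract the jobId and state fields from gcloud job output."""
    summary = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() in ("jobId", "state"):
            summary[key.strip()] = value.strip()
    return summary


sample_output = """Job [census_training_20220903_092412] submitted successfully.
jobId: census_training_20220903_092412
state: QUEUED
"""

summary = parse_job_summary(sample_output)
print(summary["jobId"], summary["state"])
print("finished:", summary["state"] in FINISHED_STATES)
```

A job in the QUEUED state has not started yet, so a script would keep polling until the state reaches one of the finished values.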

Frequently Asked Questions

Is XGBoost faster on GPU?

It can be. In one reported comparison, training on a GPU (an Nvidia GeForce GTX 1080) took about 13.1 seconds, which is approximately 4.4 times faster than training on the CPU.

Is XGBoost prone to overfitting?

It can be. Combined into an ensemble, these models reach remarkable predictive accuracy. The price for this performance is high model complexity, which makes them hard to interpret and may cause overfitting.

What distinguishes sklearn from Scikit-learn?

Both scikit-learn and sklearn refer to the same package: scikit-learn is the name of the project and of the package on PyPI, while sklearn is the name we import in Python. When installing with pip, use the scikit-learn identifier rather than sklearn.


In this article, we learnt how to train scikit-learn and XGBoost models on AI Platform. We set up the project and a Cloud Storage bucket, wrote a Python model file that reads the Census Income data from Cloud Storage and saves the trained model back to it, created the trainer package, and submitted the training job with gcloud.
