Azure Custom Vision: download all training images with Python and Power Shell

Alibek Jakupov
Oct 11, 2020
3 min read

Updated: Nov 19, 2021

In the previous tutorial we found out how to download training images from Azure Custom Vision. In this article we are going to learn how to download ALL the training images from your project by adding Power Shell. Up we go!

Before we start

A little reminder...

What is Azure Custom Vision?

Quote from the official web-site:

Customize and embed state-of-the-art computer vision for specific domains. Build frictionless customer experiences, optimize manufacturing processes, accelerate digital marketing campaigns—and more. No machine learning expertise is required.

There are a lot of well-written references explaining how to upload your images from the local storage to Azure Custom Vision workspace, using SDKs or simple HTTP requests. Most of them concern training classification/object detection models and managing iterations.

When may this happen?

Imagine you uploaded your images and after uploading them they have mysteriously dissappeared from your laptop. Or, which is more likely to happen, you used all the predictions to retrain your model and you need to re-work them using OpenCV. Sounds reasonable, right?

In the previous article we have seen an official API reference that allows doing many useful manipulations with your training data.

Step 1: Get the training images count

This is the simplest thing to do, either you go the portal to get the number of training images, or get programmatically using the API. Your request should be something like:

https://{endpoint}/customvision/v3.3/Training/projects/{projectId}/images/tagged/count[?iterationId][&tagIds]

And the response will look something like:

{
 "code": "NoError",
 "message": "string"
}

Step 2: prepare python script

If you remember from the previous article the download limit is 256 images per request, so we've decided to do the work manually, by changing the skip parameter. Thus, to download all the photos, we had to adjust the skip size which will always be equal to:

skip_size = (iteration_n - 1) * 256

E.G. first iteration, skip = 0, second - 256, and so on and so forth.

And dont forger to oder your images by setting

'orderBy': 'Oldest'

in your request.

However, programming is not about manual work, it's all about making the machine work for you. Moreover, in my last CV project I had 6000+ training images, so it would require running the script manually more than 24 times!

Even a rookie developer would find a more elegant solution, so here's what I've done. I simply added a command line argument corresponding to the skip parameter.

# construct the argument parser and parse the arguments
ap = argparse.ArgumentParser()
ap.add_argument("-s", "--skip", type=int)
args = vars(ap.parse_args())

That I then pass to the request

# query parameters
# here only several are used
params = urllib.parse.urlencode({
 # Format - int32. Maximum number of images to return. Defaults to 50, limited to 256.
 'take': take,
 # Format - int32. Number of images to skip before beginning the image batch. Defaults to 0.
 'skip': args["skip"],
 'orderBy': 'Oldest'
})

Look, but it still means that I will have to run the script manually from command line, and I still have to keep track of my skip and take, somewhere in the notebook, right?

Well, no! There is an excellent facility to run your script automatically using PowerShell, and of course I will provide you the code. But, at first, let's finish with our Python script.

Here's full code:

# coding: utf-8

"""Download training photos from Azure Custom Vision workspace
"""

import http.client
import urllib.request
import urllib.parse
import urllib.error
import base64
import json
import os
import argparse
import time


# construct the argument parser and parse the arguments
ap = argparse.ArgumentParser()
ap.add_argument("-s", "--skip", type=int)
args = vars(ap.parse_args())

take = 256

# create tag folders if they do not exist
tag_folders = ['ok', 'ko']
for folder in tag_folders:
 if not os.path.exists(folder):
        os.makedirs(folder)

# request headers
# only training key is needed
headers = {
 # Request headers
 'Training-Key': '<your-training-key>'
}

# query parameters
# here only several are used
params = urllib.parse.urlencode({
 # Format - int32. Maximum number of images to return. Defaults to 50, limited to 256.
 'take': take,
 # Format - int32. Number of images to skip before beginning the image batch. Defaults to 0.
 'skip': args["skip"],
 'orderBy': 'Oldest'
})


# base url
conn = http.client.HTTPSConnection(
 'southcentralus.api.cognitive.microsoft.com')
conn.request(
 "GET", "/customvision/v3.0/training/projects/<your-project-id>/images/tagged?%s" % params, "{body}", headers)
response = conn.getresponse()
# get response as a raw string
data = response.read()
# convert the string to a json object
data_json = json.loads(data)

for item in data_json:
 # uri for image download
    originalImageUri = item['originalImageUri']
 # image tag
    tag_name = item["tags"][0]["tagName"]
 # to not erase previously saved photos counter (file_counter) = number of photos in a folder + 1
    file_counter = len([name for name in os.listdir(
        tag_name) if os.path.isfile(os.path.join(tag_name, name))])
 # as the tag name corresponds to the folder name so just save a photo to a corresponding folder
    output_file = os.path.join(tag_name, str(file_counter) + '.png')
 # download image from uri
    print (output_file, tag_name)
 try:
        urllib.request.urlretrieve(originalImageUri, output_file)
 except:
        print ("Retry")
        urllib.request.urlretrieve(originalImageUri, output_file)
conn.close()

Save the script and give it a clear name. The mine is called download_from_cloud.py

Step 3: Prepare Power Shell script

Quote from the official documentation:

PowerShell is a cross-platform task automation and configuration management framework, consisting of a command-line shell and scripting language. Unlike most shells, which accept and return text, PowerShell is built on top of the .NET Common Language Runtime (CLR), and accepts and returns .NET objects. This fundamental change brings entirely new tools and methods for automation.

So we'll run our Python script automatically by changing the skip parameter at each iteration.

Here you are:

FOR /L %i IN (0,256,3000) DO (python download_from_cloud.py -s %i)

In my project I had about 2900 images so I've just put 3000 to be sure all the images are downloaded. You now may open the powershell window and go to the desired folder.

Tips and Tricks

Run the script from the VM. It will download the images way much faster than from your local, especially when you have an ADSL connection (like me).
To open the Power Shell window in the special directory, simply press Shift + Right-Click -> Open Power Shell window here