Alibek Jakupov

Azure OCR with PDF files

Updated: May 3



Azure OCR is an excellent tool for extracting text from an image through API calls.

Azure's Computer Vision service provides developers with access to advanced algorithms that process images and return information. To analyze an image, you can either upload an image or specify an image URL. The images processing algorithms can analyze content in several different ways, depending on the visual features you're interested in. For example, Computer Vision can determine if an image contains adult or racy content, or it can find all of the human faces in an image. You can use Computer Vision in your application by using either a native SDK or invoking the REST API directly. This page broadly covers what you can do with Computer Vision.

Quoted from the official documentation.
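
The two input modes mentioned in the quote (uploading an image or passing an image URL) differ only in the content type and the request body. Below is a minimal sketch of that difference. The helper name `build_ocr_request` is hypothetical and not part of any SDK; it only prepares the keyword arguments that would be passed to `requests.post`, so nothing is sent over the network.

```python
import json


def build_ocr_request(subscription_key, image_data=None, image_url=None):
    """Build keyword arguments for requests.post for one of the two input modes.

    Exactly one of image_data (raw bytes) or image_url (a public URL) is expected.
    build_ocr_request is a hypothetical helper used for illustration only.
    """
    if (image_data is None) == (image_url is None):
        raise ValueError("pass exactly one of image_data or image_url")

    headers = {"Ocp-Apim-Subscription-Key": subscription_key}
    if image_data is not None:
        # Uploading the image itself: the body is the raw bytes.
        headers["Content-Type"] = "application/octet-stream"
        return {"headers": headers, "data": image_data}

    # Pointing at a hosted image: the body is a small JSON document.
    headers["Content-Type"] = "application/json"
    return {"headers": headers, "data": json.dumps({"url": image_url})}
```

The script further down uses the first mode (raw bytes), because the page images are produced locally and are never hosted anywhere.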


What we are interested in here is Optical Character Recognition:

You can use Computer Vision to extract text from an image into a machine-readable character stream using optical character recognition (OCR). If needed, OCR corrects the rotation of the recognized text and provides the frame coordinates of each word. OCR supports 25 languages and automatically detects the language of the recognized text. You can also use the Read API to extract both printed and handwritten text from images and text-heavy documents. The Read API uses updated models and works for a variety objects with different surfaces and backgrounds, such as receipts, posters, business cards, letters, and whiteboards. Currently, English is the only supported language.
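
To make the "frame coordinates of each word" concrete, here is a small sketch that walks a response and pairs every word with its bounding box. `SAMPLE_RESPONSE` is a hand-written stand-in shaped like the regions / lines / words structure the OCR endpoint returns, not real service output, and `words_with_boxes` is a hypothetical helper name. It assumes each `boundingBox` is a comma-separated "left,top,width,height" string.

```python
# Hand-written sample in the shape of an OCR response (not real output).
SAMPLE_RESPONSE = {
    "language": "en",
    "orientation": "Up",
    "regions": [
        {
            "boundingBox": "20,30,200,40",
            "lines": [
                {
                    "boundingBox": "20,30,200,40",
                    "words": [
                        {"boundingBox": "20,30,90,40", "text": "Hello"},
                        {"boundingBox": "120,30,100,40", "text": "world"},
                    ],
                }
            ],
        }
    ],
}


def words_with_boxes(analysis):
    """Return (text, (left, top, width, height)) for every word in the response."""
    results = []
    for region in analysis.get("regions", []):
        for line in region.get("lines", []):
            for word in line.get("words", []):
                # The box arrives as one string; split it into four integers.
                box = tuple(int(v) for v in word["boundingBox"].split(","))
                results.append((word["text"], box))
    return results


for text, box in words_with_boxes(SAMPLE_RESPONSE):
    print(text, box)
```

The script in this post ignores the boxes and keeps only the text, but the same traversal is what you would extend if the position of a word on the page matters.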

However, if we want to analyze a PDF file with OCR, there is no direct way to do this. Below is fully working code that converts a PDF to images on the fly and extracts the text as an array of lines. No deep stuff. Enjoy.


# coding: utf-8
"""Convert every page of a given pdf to an image and send each image to Azure OCR
"""
import io
import os

import requests
from pdf2image import convert_from_path

def pil_to_array(pil_image):
    """convert a PIL image object to a byte array

    Arguments:
        pil_image {PIL} -- Pillow image object

    Returns:
        {bytes} -- PIL image object in a form of byte array
    """
    image_byte_array = io.BytesIO()
    pil_image.save(image_byte_array, format='PNG')
    image_data = image_byte_array.getvalue()
    return image_data


def image_to_text(image_data):
    """convert an image object to an array of text lines 

    Arguments:
        image_data {bytes} -- image byte array

    Returns:
        list -- array of strings representing lines
    """
    # azure subscription key (replace the placeholder with your own key)
    subscription_key = "your-subscription-key"
    assert subscription_key
    # azure vision api
    vision_base_url = "https://westeurope.api.cognitive.microsoft.com/vision/v2.0/"
    # ocr subsection
    ocr_url = vision_base_url + "ocr"
    headers = {'Ocp-Apim-Subscription-Key': subscription_key,
               'Content-Type': 'application/octet-stream'}
    params = {'language': 'unk', 'detectOrientation': 'true'}

    # get response from the server
    response = requests.post(ocr_url, headers=headers, params=params, data=image_data)
    response.raise_for_status()
    # get json data to parse it later
    analysis = response.json()
    # all the lines from a page, including noise
    full_text = []
    for region in analysis['regions']:
        for line in region['lines']:
            line_text = ' '.join([word['text'] for word in line['words']])
            full_text.append(line_text.lower())
    # clean array containing only important data;
    # no filter is applied here, so every line is kept
    user_requests = []
    for line in full_text:
        user_requests.append(line)

    return user_requests


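def get_information(input_path):
    """extract the text lines from every page of a pdf file

    Arguments:
        input_path {str} -- path to the pdf file

    Returns:
        list -- array of strings representing lines from all the pages
    """
    # points of interest from all the pages
    global_poi = []
    # get an array of PIL image objects -> an object per page
    images = convert_from_path(input_path)
    # convert each page to a byte array and send it to OCR
    for image in images:
        byte_array = pil_to_array(image)
        page_poi = image_to_text(byte_array)
        global_poi += page_poi
    return global_poi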

PATH = "your\\pdf-file\\path\\file.pdf"
poi = get_information(PATH)
items = poi
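
`get_information` returns one flat list, so the page each line came from is lost. If you need that, a possible variation is sketched below. `lines_by_page` is a hypothetical helper, not part of the script above; it takes the conversion and OCR steps as arguments so it can be tried without a PDF or an Azure key.

```python
def lines_by_page(pages, to_bytes, ocr):
    """Run OCR page by page and remember which page each line came from.

    pages    -- iterable of page images (e.g. the list from convert_from_path)
    to_bytes -- callable turning a page into bytes (e.g. pil_to_array)
    ocr      -- callable turning bytes into a list of lines (e.g. image_to_text)
    """
    result = {}
    for page_number, page in enumerate(pages, start=1):
        result[page_number] = ocr(to_bytes(page))
    return result


# Stand-ins so the sketch runs without a PDF or an Azure key.
fake_pages = ["page one text", "page two text"]
demo = lines_by_page(fake_pages, to_bytes=str.encode, ocr=lambda data: [data.decode()])
print(demo)
```

With the real functions from this post, the call would be `lines_by_page(convert_from_path(PATH), pil_to_array, image_to_text)`.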

Hope you will find this helpful.
