©2018 by macnabbs.

  • Alibek Jakupov

Azure OCR with PDF files

Azure OCR is an excellent tool for extracting text from an image through API calls.

Azure's Computer Vision service provides developers with access to advanced algorithms that process images and return information. To analyze an image, you can either upload an image or specify an image URL. The images processing algorithms can analyze content in several different ways, depending on the visual features you're interested in. For example, Computer Vision can determine if an image contains adult or racy content, or it can find all of the human faces in an image. You can use Computer Vision in your application by using either a native SDK or invoking the REST API directly. This page broadly covers what you can do with Computer Vision.

(Quoted from the official documentation.)
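The two ways of submitting an image mentioned in the quote (uploading it or passing a URL) differ only in the request body and content type. Here is a minimal sketch of both variants. The helper name `build_ocr_request` is hypothetical, and the endpoint is the West Europe v2.0 one used later in this post. The function only builds the arguments and does not send anything.

```python
# Endpoint taken from the script in this post; adjust the region to your resource.
OCR_URL = "https://westeurope.api.cognitive.microsoft.com/vision/v2.0/ocr"


def build_ocr_request(subscription_key, image_url=None, image_bytes=None):
    """Return keyword arguments for requests.post(); exactly one image source is required.

    Hypothetical helper for illustration, not part of any SDK.
    """
    if (image_url is None) == (image_bytes is None):
        raise ValueError("pass exactly one of image_url or image_bytes")
    kwargs = {
        "url": OCR_URL,
        "params": {"language": "unk", "detectOrientation": "true"},
        "headers": {"Ocp-Apim-Subscription-Key": subscription_key},
    }
    if image_url is not None:
        # remote image: the service downloads it from the URL given in a JSON body
        kwargs["headers"]["Content-Type"] = "application/json"
        kwargs["json"] = {"url": image_url}
    else:
        # local image: the raw bytes are sent as the request body
        kwargs["headers"]["Content-Type"] = "application/octet-stream"
        kwargs["data"] = image_bytes
    return kwargs
```

With the `requests` library installed, the result can be passed straight through, for example `requests.post(**build_ocr_request(key, image_bytes=data))`.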

What we are interested in here is Optical Character Recognition (OCR):

You can use Computer Vision to extract text from an image into a machine-readable character stream using optical character recognition (OCR). If needed, OCR corrects the rotation of the recognized text and provides the frame coordinates of each word. OCR supports 25 languages and automatically detects the language of the recognized text. You can also use the Read API to extract both printed and handwritten text from images and text-heavy documents. The Read API uses updated models and works for a variety objects with different surfaces and backgrounds, such as receipts, posters, business cards, letters, and whiteboards. Currently, English is the only supported language.
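To make the "frame coordinates of each word" a bit more concrete, here is a small sketch that walks an OCR response and yields each word with its box. The `sample` dictionary is hand-written for illustration and is not real service output. It assumes the v2.0 OCR layout of regions, lines and words, where each `boundingBox` is a `"left,top,width,height"` string. The helper name `word_boxes` is hypothetical.

```python
def word_boxes(analysis):
    """Yield (text, left, top, width, height) for every word in an OCR response.

    Assumes each word carries a "boundingBox" string of four comma-separated integers.
    """
    for region in analysis.get("regions", []):
        for line in region.get("lines", []):
            for word in line.get("words", []):
                left, top, width, height = (
                    int(value) for value in word["boundingBox"].split(",")
                )
                yield word["text"], left, top, width, height


# Hand-written sample in the assumed response shape (not real service output).
sample = {
    "language": "en",
    "textAngle": 0.0,
    "orientation": "Up",
    "regions": [{
        "boundingBox": "10,10,200,40",
        "lines": [{
            "boundingBox": "10,10,200,40",
            "words": [
                {"boundingBox": "10,10,90,40", "text": "Azure"},
                {"boundingBox": "110,10,100,40", "text": "OCR"},
            ],
        }],
    }],
}

boxes = list(word_boxes(sample))
```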

However, there is no direct way to analyze a PDF file with OCR. Here we provide fully working code that converts a PDF into images on the fly, one per page, and extracts the text as an array of lines. No deep stuff. Enjoy.

# coding: utf-8
"""Convert every page of a given PDF to an image and send each image to Azure OCR."""

import io

import requests
from pdf2image import convert_from_path


def pil_to_array(pil_image):
    """Convert a PIL image object to a byte array.

    Arguments:
        pil_image {PIL} -- Pillow image object

    Returns:
        {bytes} -- PIL image object in the form of a byte array
    """
    image_byte_array = io.BytesIO()
    pil_image.save(image_byte_array, format='PNG')
    image_data = image_byte_array.getvalue()
    return image_data


def image_to_text(image_data):
    """Convert an image to an array of text lines using the Azure Vision API.

    Arguments:
        image_data {bytes} -- image byte array

    Returns:
        list -- array of strings representing lines
    """
    # azure subscription key (replace the placeholder with your own key)
    subscription_key = "<your-subscription-key>"
    assert subscription_key
    # azure vision api
    vision_base_url = "https://westeurope.api.cognitive.microsoft.com/vision/v2.0/"
    # ocr subsection
    ocr_url = vision_base_url + "ocr"
    # request headers. Important: content should be a bytestream as we are sending a local image
    headers = {'Ocp-Apim-Subscription-Key': subscription_key,
               'Content-Type': 'application/octet-stream'}
    # request parameters: language is unknown, and we do detect orientation
    params = {'language': 'unk', 'detectOrientation': 'true'}
    # get response from the server
    response = requests.post(ocr_url, headers=headers, params=params, data=image_data)
    response.raise_for_status()
    # get json data to parse it later
    analysis = response.json()
    # all the lines from a page, including noise
    full_text = []
    for region in analysis['regions']:
        for element in region['lines']:
            line_text = ' '.join([word['text'] for word in element['words']])
            full_text.append(line_text.lower())
    # clean array containing only important data (no filtering is applied here)
    user_requests = []
    for line in full_text:
        user_requests.append(line)
    return user_requests


def get_information(input_path):
    """Get a PDF file and return an array of lines.

    Arguments:
        input_path {str} -- path to the PDF file

    Returns:
        list -- lines of text from all pages
    """
    # points of interest from all the pages
    global_poi = []
    # get an array of PIL image objects -> one object per page
    images = convert_from_path(input_path)
    # create a byte array for each page and run OCR on it
    for image in images:
        byte_array = pil_to_array(image)
        page_poi = image_to_text(byte_array)
        global_poi += page_poi
    return global_poi


PATH = "your\\pdf-file\\path\\file.pdf"
poi = get_information(PATH)
items = poi

Hope you will find this helpful.
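One possible variation: the script above merges the lines of all pages into a single flat list, so the page a line came from is lost. If that matters, the loop can tag each line with its page number. The sketch below uses a hypothetical helper, `lines_with_page_numbers`, and takes the OCR step as a parameter so it can be tried without calling the service. The fake pages and the fake OCR lookup are made up for illustration.

```python
def lines_with_page_numbers(pages, ocr_page):
    """Return (page_number, line) pairs, with pages numbered from 1.

    pages    -- iterable of per-page inputs (e.g. PNG byte arrays)
    ocr_page -- callable that turns one page input into a list of text lines
    """
    tagged = []
    for page_number, page in enumerate(pages, start=1):
        for line in ocr_page(page):
            tagged.append((page_number, line))
    return tagged


# Stand-ins for real page images and a real OCR call (illustration only).
fake_pages = ["page-1-bytes", "page-2-bytes"]
fake_ocr = {
    "page-1-bytes": ["invoice 42", "total 10 eur"],
    "page-2-bytes": ["thank you"],
}

result = lines_with_page_numbers(fake_pages, fake_ocr.__getitem__)
```

With the functions from the script, the call would look like `lines_with_page_numbers((pil_to_array(img) for img in convert_from_path(PATH)), image_to_text)`.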