Read Excel files from SFTP Server "on-fly"
Updated: May 1, 2020
In this short article we are going to cover some tips and tricks that may be useful in treating Excel files taken from SFTP server. Up we go!
There are plenty of tutorials explaining how establish a connection using Secure File Transfer Protocol (SFTP) between your Python client and a server. Moreover, there are enough of articles demonstrating how to read excel files using Python libraries. However, there is no information on how to bring both of them together. As I faced some challenges in setting this solution up and running, I find it important to share some workarounds that were found during the work.
Why may it be needed?
In cases when you cannot save your file locally I will definitely have to process a byte stream. For instance, if you are writing a serverless solution, like Azure Function, when there is no in-built permanent storage and you need to treat a file object. It may also be prohibited to save any file locally according to your client contract, this list of possible reasons is actually endless. Anyway, being able to work directly with octet data is quite useful and gives you even more tool to build elegant and powerful solutions.
Connect to server
Not surprisingly, we start from establishing a secure connections. Even if someone may find it extremely easy, there are some tricky step to cope with. For instance, if you are working on Python 3+ you may have probably seen the following message:
E:\Program Files (x86)\Anaconda3\lib\site-packages\pysftp__init__.py:61: UserWarning: Failed to load HostKeys from C:\Users\JohnCalvin.ssh\known_hosts. You will need to explicitly load HostKeys (cnopts.hostkeys.load(filename)) or disableHostKey checking (cnopts.hostkeys = None). warnings.warn(wmsg, UserWarning) Traceback (most recent call last): File "E:\OneDrive\Python\GIT\DigitalCloud\pysftp_tutorial.py", line 14, in push_file_to_server() File "E:\OneDrive\Python\GIT\DigitalCloud\pysftp_tutorial.py", line 7, in push_file_to_server s = sftp.Connection(host='18.104.22.168', username='root', password='********') File "E:\Program Files (x86)\Anaconda3\lib\site-packages\pysftp__init__.py", line 132, in init self._tconnect['hostkey'] = self._cnopts.get_hostkey(host) File "E:\Program Files (x86)\Anaconda3\lib\site-packages\pysftp__init__.py", line 71, in get_hostkey raise SSHException("No hostkey for host %s found." % host) paramiko.ssh_exception.SSHException: No hostkey for host 22.214.171.124 found. Exception ignored in: > Traceback (most recent call last): File "E:\Program Files (x86)\Anaconda3\lib\site-packages\pysftp__init__.py", line 1013, in del self.close() File "E:\Program Files (x86)\Anaconda3\lib\site-packages\pysftp__init__.py", line 784, in close if self._sftp_live: AttributeError: 'Connection' object has no attribute '_sftp_live'
So, in order to avoid this problem you need to do the following:
cnopts = pysftp.CnOpts() cnopts.hostkeys = None
The full code for this step:
import logging import pysftp import xlrd from io import BytesIO import os def main(): # FTP Server parameters USERNAME = "username" PASWORD = "password" SERVER = '0.0.0.0' # array with all the input files excel_array =  # workaround to set pysftp up and running on Python 3on # and avoid "AttributeError: 'Connection' object has no attribute '_sftp_live'" cnopts = pysftp.CnOpts() cnopts.hostkeys = None # establish a connection to the SFTP server with pysftp.Connection(host=SERVER, username=USERNAME, password=PASWORD, cnopts=cnopts) as sftp: logging.info("SFTP Connection succesfully established ... ")
You now should be able to get all the needed data from the server, so we can pass to the following step.
Get needed files
Right now we are going to switch to the needed folder. It is done pretty easily using one single command
which actually means Change Working Directory.
Next, we need to get the directory structure and filter out its content as we need only excel files
# Obtain structure of the remote directory '/var/www/vhosts' directory_structure = sftp.listdir_attr() # get all the files (no matter what extension) from the input folder for attr in directory_structure: # get a file extension extension = str(os.path.splitext(attr.filename)).lower() # filter on file extension ; pdf and images only if extension in ('.xls', '.xlsx'): excel_array.append(attr.filename)
Create a file object and read Excel date
We are now ready to iterate over our Excel workbooks array and create a file object for each of them.
To create a file object you need to use BytesIO class and write file contents to it:
# create file object from a file name (obtained from the GET request) to read the file directly to a variable file_object = BytesIO() sftp.getfo(unprocessed_file, file_object)
And here is the tip: if you have already seen an official xlrd documentation you may have noticed that the majority of cases suppose using local file storage. So it is expected that you provide a full path to your file. However, in our case it will not work out, because the xlrd library will look for a file in the directory that the script is stored in. So, we need to use file_contents parameter and, which is important, use getvalue() function in order to obtain octet data.
workbook = xlrd.open_workbook(file_contents=file_object.getvalue())
You are now free to do everything you can imagine with your workbooks.
for unprocessed_file in excel_array: # get a file name without extension clean_filename = os.path.splitext(unprocessed_file) # get a file extension extension = str(os.path.splitext(unprocessed_file)).lower() # create file object from a file name (obtained from the GET request) to read the file directly to a variable file_object = BytesIO() sftp.getfo(unprocessed_file, file_object) if extension in ('.xls', '.xlsx'): # if the file is an Excel workbook then get all text from all the cells # read workbook from binary data workbook = xlrd.open_workbook(file_contents=file_object.getvalue()) # get the first sheet only first_sheet = workbook.sheet_by_index(0) # get rows and columns rows = first_sheet.nrows columns = first_sheet.ncols # read all the content from all the active cells for i in range(columns): for j in range (rows): print(str(first_sheet.cell_value(j, i)).lower().strip())
Hope this was helpful.