Sentence Transformers on Databricks: Tips and Tricks
What is Sentence Transformers?
Sentence Transformers provide multilingual sentence, paragraph, and image embeddings built on BERT and related models. These embeddings are expected to improve performance on downstream NLP tasks such as semantic textual similarity (STS) and natural language inference (NLI), which require reasoning about inter-sentence relations.
A framework called sentence-transformers provides an easy way to compute dense vector representations for sentences, paragraphs, and images. The models are based on transformer networks such as BERT, RoBERTa, and XLM-RoBERTa, and achieve state-of-the-art performance in various tasks. Text is embedded in a vector space such that similar texts are close together and can be found efficiently using cosine similarity. The resulting embeddings are generally of high quality and typically work quite well for document-level embeddings.
We can use this framework to compute sentence and text embeddings for more than 100 languages. These embeddings can then be compared, e.g. with cosine similarity, to find sentences with a similar meaning. This can be useful for semantic textual similarity, semantic search, or paraphrase mining. The framework is based on PyTorch and Transformers and offers a large collection of pre-trained models tuned for various tasks. Further, it is easy to fine-tune your own models.
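To make the cosine-similarity comparison concrete, here is a minimal sketch in plain Python. The three-dimensional vectors are made up for illustration (real sentence embeddings have hundreds of dimensions), and the helper name cosine_similarity is our own, not part of the sentence-transformers API.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)

# Toy "embeddings", invented for this example only
query = [1.0, 2.0, 0.0]
candidates = {
    "parallel": [2.0, 4.0, 0.0],    # same direction as the query
    "orthogonal": [0.0, 0.0, 3.0],  # shares no direction with the query
}

# Rank candidates from most to least similar to the query
ranked = sorted(
    candidates,
    key=lambda name: cosine_similarity(query, candidates[name]),
    reverse=True,
)
```

A score close to 1 means the two vectors point in nearly the same direction, which is how "similar meaning" is measured here; a score of 0 means they are unrelated.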
Why use them?
A key disadvantage of BERT is that it does not compute independent sentence embeddings. As a higher level of abstraction, sentence embeddings can play a central role in achieving good performance on downstream tasks such as machine reading comprehension (MRC).
The specifics of NLP applications are well-abstracted by downstream tasks. For this reason, downstream performance is a good indicator for a language model. When pre-trained language models are used for downstream task evaluations, pre-trained models can generate additional feature representations in addition to being provided as a platform for fine-tuning.
I've also recently found an interesting approach that uses sentence transformers to build topic models. These models are extremely interesting because they capture the inner semantic relations of a document while providing a sparse representation of your documents. There's an excellent tutorial explaining how to implement this algorithm, step by step.
Azure Databricks is a data analytics tool designed specifically for Microsoft Azure cloud services. Azure Databricks SQL Analytics and Azure Databricks Workspace are two environments for creating data-intensive applications.
Azure Databricks SQL Analytics provides an easy-to-use platform for analysts who want to run SQL queries on their data lake, create multiple visualization types to explore query results from different perspectives, and build and share dashboards. Azure Databricks Workspace provides an interactive workspace that enables collaboration between data engineers, data scientists, and machine learning engineers. For a big data pipeline, the data (raw or structured) is ingested into Azure through Azure Data Factory in batches, or streamed near real-time using Apache Kafka, Event Hub, or IoT Hub. This data lands in a data lake for long term persisted storage, in Azure Blob Storage or Azure Data Lake Storage. As part of your analytics workflow, use Azure Databricks to read data from multiple data sources and turn it into breakthrough insights using Spark.
Thus, if you are running topic modeling over a huge dataset, you may need to parallelize the job on your Spark cluster. Moreover, you may simply want to use the Azure Databricks workspace as your working environment. So in this article we are going to see how to get your topic modeling experiment up and running in the Databricks environment.
According to the official documentation the process is quite straightforward. You can simply install the package via pip:
pip install -U sentence-transformers
As you may want to install the library permanently on the cluster, you can install it as a cluster library. Cluster libraries can be used by all notebooks running on a cluster. You can install a cluster library directly from a public repository such as PyPI or Maven, or create one from a previously installed workspace library.
Then you need to initialize your model:
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('distilbert-base-nli-mean-tokens')
embeddings = model.encode(data, show_progress_bar=True)
Where data is your corpus.
This is where we run into some issues. If you run the code above on Databricks, you will see the following error:
FileNotFoundError: [Errno 2] No such file or directory: '/root/.cache/torch/sentence_transformers/sbert.net_models_distilbert-base-nli-mean-tokens'
You can find the models here... download them and unzip... The code is not that well tested for Windows, as a lot of frameworks have sadly only limited support for Windows.
Thus we go to the cloud storage and select the needed model zip (distilbert-base-nli-mean-tokens in our case).
After downloading the model, we unzip the archive and upload the model folder to Databricks.
Then you run the following code:
model = SentenceTransformer('path/to/unzipped/model-folder')
This runs perfectly on Windows, but how do we upload the folder to Databricks? One solution would be to upload the folder to Azure Storage and provide a URL instead of the path, because if we look at the source code:
if model_path.startswith('http://') or model_path.startswith('https://'):
    model_url = model_path
    folder_name = model_url.replace("https://", "").replace("http://", "").replace("/", "_")[:250][0:-4]  # remove .zip file end
Thus, theoretically, it is possible. I will give it a try and will keep you updated.
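To see what that source snippet does with a URL, here is a small sketch that reproduces only its string handling. The function name cache_folder_name is ours, and the example URL is an assumption reconstructed from the folder name in the error message above; it is not taken from the library documentation.

```python
def cache_folder_name(model_url: str) -> str:
    """Mirror the quoted snippet: turn a model URL into a flat folder name."""
    name = model_url.replace("https://", "").replace("http://", "").replace("/", "_")
    # Cap the length at 250 characters, then drop the 4-character ".zip" suffix
    return name[:250][0:-4]

# Hypothetical URL, inferred from the path in the FileNotFoundError
example_url = "https://sbert.net/models/distilbert-base-nli-mean-tokens.zip"
folder = cache_folder_name(example_url)
print(folder)  # sbert.net_models_distilbert-base-nli-mean-tokens
```

The result matches the last component of the path in the FileNotFoundError. Note that the slice removes the final four characters unconditionally, so the snippet assumes the URL ends in .zip.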
Yet, there's a simpler solution!
Once again, if we refer to the official documentation:
You can install libraries in three modes: workspace, cluster-installed, and notebook-scoped.
We already know cluster-installed libraries, but what about workspace libraries?
Workspace libraries serve as a local repository from which you create cluster-installed libraries. A workspace library might be custom code created by your organization, or might be a particular version of an open-source library that your organization has standardized on.
Thus, we simply add the following cell:
!pip install -U sentence-transformers
Now we can run the following code on Databricks:
from sklearn.datasets import fetch_20newsgroups
from sentence_transformers import SentenceTransformer
import umap
import hdbscan

# Load the corpus and embed every document
data = fetch_20newsgroups(subset='all')['data']
model = SentenceTransformer('distilbert-base-nli-mean-tokens')
embeddings = model.encode(data, show_progress_bar=True)

# Reduce the embeddings to 5 dimensions, then cluster them
umap_embeddings = umap.UMAP(n_neighbors=15, n_components=5, metric='cosine').fit_transform(embeddings)
cluster = hdbscan.HDBSCAN(min_cluster_size=15, metric='euclidean', cluster_selection_method='eom').fit(umap_embeddings)
umap and hdbscan are already installed on the cluster; to install sentence_transformers, add the following cell before this code:
!pip install -U sentence-transformers
However, you will have to re-run this cell after each cluster shutdown. Consequently, Rookie keeps searching for a permanent solution. Nevertheless, if this resolved your issue like it did for Rookie, I will be completely satisfied.
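Since the install has to be repeated after every shutdown, one option is a guard cell that only calls pip when the package is actually missing. The sketch below is an assumption on our side rather than a documented Databricks recipe, and ensure_installed is a hypothetical helper name; it uses only the Python standard library.

```python
import importlib.util
import subprocess
import sys

def ensure_installed(module_name: str, pip_name: str) -> bool:
    """Install pip_name only if module_name cannot be imported.

    Returns True if an install was run, False if the module was already there.
    """
    if importlib.util.find_spec(module_name) is not None:
        return False
    # Use the interpreter that runs the notebook so the package lands in its environment
    subprocess.check_call([sys.executable, "-m", "pip", "install", "-U", pip_name])
    return True
```

Calling ensure_installed("sentence_transformers", "sentence-transformers") at the top of the notebook makes re-running it after a restart cheap: the install happens only on the first run of a fresh cluster.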
Hope you find it useful!