At the end of the day, the goals are simple: safety and security.
Jodi Rell
There's an excellent document providing guidance and approaches to securing access and connectivity to data in ADLS from Databricks. Six security patterns are proposed in this reference, and I do recommend trying each of them to better understand which one fits your needs best. In this tutorial we are going to cover one of these security patterns, which the reference authors call 'Session Scoped Service Principal'. Up we go!
Session Scoped Service Principal
This pattern allows governing access control at the session level, so that a cluster can be shared by multiple groups of users, each using its own set of service principal credentials. At first, I tried creating a service principal and a mount point that uses this service principal at the desired level, to better understand the underlying mechanism. As the reference assumes a certain level of expertise, it doesn't explain how to create a service principal, which would be quite useful for a rookie developer like myself. Here's the complete guide, hopefully you will find it useful. Below I will just provide a brief summary of the steps I implemented to create a service principal and mount it to the desired folder at the desired level.
Create Azure AD application and service principal
Sign in to your Azure Account through the Azure portal.
Select Azure Active Directory.
Select App registrations.
Select New registration.
Name the application. Select a supported account type, which determines who can use the application. Under Redirect URI, select Web for the type of application you want to create. Enter the URI to which the access token is sent. You can't create credentials for a Native application, and you can't use that type for an automated application. After setting the values, select Register.
Set up service principal to use it in scripts and apps
Go to your ADLS resource on the Azure portal
Select the particular folder/container to assign the application to.
Select Access control (IAM).
Select Add role assignment.
Select the role you wish to assign to the application. For example, to allow the application to execute actions like reboot, start and stop instances, select the Contributor role.
N.B. we've assigned the Storage Blob Data Contributor role to each of our service principals.
Get tenant and app ID values for signing in
Select Azure Active Directory.
From App registrations in Azure AD, select your application.
Copy the Directory (tenant) ID and store it in your application code. The directory (tenant) ID can also be found in the default directory overview page.
Copy the Application ID and store it in your application code.
Select Certificates & secrets.
Select Client secrets -> New client secret.
Provide a description of the secret, and a duration. When done, select Add. After saving the client secret, the value of the client secret is displayed. Copy this value because you won't be able to retrieve the key later. You will provide the key value with the application ID to sign in as the application. Store the key value where your application can retrieve it.
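Before moving on to Databricks, you can optionally sanity-check the three values with a short Python snippet using the azure-identity package: if the app registration and the client secret are valid, the service principal should be able to obtain a token for Azure Storage. This is just a sketch; the placeholders are the values copied in the steps above.
# Optional sanity check: request an OAuth token for Azure Storage with the new service principal.
# Requires: pip install azure-identity
from azure.identity import ClientSecretCredential

credential = ClientSecretCredential(
    tenant_id="<directory-id>",        # Directory (tenant) ID copied above
    client_id="<application-id>",      # Application (client) ID copied above
    client_secret="<client-secret>")   # client secret value copied above

# If the registration and secret are valid, this returns a bearer token for Azure Storage.
token = credential.get_token("https://storage.azure.com/.default")
print(token.expires_on)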
I then created a high concurrency cluster to ensure resources are shared fairly among a number of concurrent users and jobs.
A High Concurrency cluster is a managed cloud resource. The key benefits of High Concurrency clusters are that they provide Apache Spark-native fine-grained sharing for maximum resource utilization and minimum query latencies.
High Concurrency clusters work only for SQL, Python, and R. The performance and security of High Concurrency clusters is provided by running user code in separate processes, which is not possible in Scala. In addition, only High Concurrency clusters support table access control. To create a High Concurrency cluster, in the Cluster Mode drop-down select High Concurrency.
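For completeness, the same kind of cluster can also be created programmatically. Below is a minimal sketch against the Databricks Clusters API 2.0; the workspace URL, personal access token, runtime version and node type are placeholders you would replace with your own values.
# Sketch: create a High Concurrency cluster via the Clusters REST API (placeholders to replace).
import requests

resp = requests.post(
    "https://<databricks-instance>/api/2.0/clusters/create",
    headers={"Authorization": "Bearer <personal-access-token>"},
    json={
        "cluster_name": "shared-high-concurrency",
        "spark_version": "7.3.x-scala2.12",   # example runtime, pick your own
        "node_type_id": "Standard_DS3_v2",    # example Azure node type
        "num_workers": 2,
        "spark_conf": {
            # This profile is what makes the cluster High Concurrency.
            "spark.databricks.cluster.profile": "serverless",
            "spark.databricks.repl.allowedLanguages": "sql,python,r"
        },
        "custom_tags": {"ResourceClass": "Serverless"}
    })
print(resp.json())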
We are now able to access the desired folder with our service principal.
configs = {"fs.azure.account.auth.type": "OAuth",
"fs.azure.account.oauth.provider.type": \
"org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
"fs.azure.account.oauth2.client.id": \
"<application-id>",
"fs.azure.account.oauth2.client.secret": \
"<client-secret>",
"fs.azure.account.oauth2.client.endpoint": \
"https://login.microsoftonline.com/<directory-id>/oauth2/token"}
# Optionally, you can add <directory-name> to the source URI of your mount point.
dbutils.fs.mount(
source = "abfss://<file-system-name>@<storage-account-name>.\
dfs.core.windows.net/",
mount_point = "/mnt/<mount-name>",
extra_configs = configs)
N.B. <file-system-name> is simply your container name.
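Once the mount succeeds, the folder behaves like any other DBFS path, so a quick way to verify it (using the same placeholder names as above, plus a hypothetical <path-to-data> folder) is to list its contents and read something from it:
# Quick check that the mount works: list the mounted folder and read a file from it.
display(dbutils.fs.ls("/mnt/<mount-name>"))
df = spark.read.format("csv") \
    .option("header", "true") \
    .load("/mnt/<mount-name>/<path-to-data>")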
This already looks good: we've added some granularity and are able to control access to ADLS. Nevertheless, in this scenario the secret and the mount point are accessible to any user on any cluster in that workspace. Thus, you would have to create multiple workspaces to secure access for different groups of users with different permissions, which is unlikely to satisfy most security requirements - it is too coarse grained, much like RBAC on Blob containers.
Consequently, the most logical solution is to find a way to govern access control at the session level, so that a cluster may be shared by multiple groups of users, each using its own set of service principal credentials. The trick is that each user group's access to the folder is controlled through its ability to use the <client-secret>. Thus, each service principal is given the required level of access, and each Databricks user group is mapped to the secret scope storing the credential for that service principal (the <client-secret> in our case). Again, as the reference was written for experts, it doesn't explain what a secret scope is or how to configure one, so let's discuss that here.
Managing secrets begins with creating a secret scope. A secret scope is a collection of secrets identified by a name. A workspace is limited to a maximum of 100 secret scopes. To reference secrets stored in an Azure Key Vault, you can create a secret scope backed by Azure Key Vault. You can then leverage all of the secrets in the corresponding Key Vault instance from that secret scope.
Create a key vault using the Azure portal
Go to Azure Portal
From the Azure portal menu, or from the Home page, select Create a resource.
In the Search box, enter Key Vault.
From the results list, choose Key Vault.
On the Key Vault section, choose Create.
On the Create key vault section, provide the following information: Name, Subscription, Resource Group, Location. Leave the other options at their defaults.
After providing the information above, select Create.
Take note of the two properties listed below:
Vault Name
Vault URI
Add a secret to Key Vault
Navigate to your new key vault in the Azure portal
On the Key Vault settings pages, select Secrets.
Click on Generate/Import.
On the Create a secret screen choose the following values:
Upload options: Manual.
Name: Type a name for the secret. The secret name must be unique within a Key Vault. The name must be a 1-127 character string, starting with a letter and containing only 0-9, a-z, A-Z, and -. For more information on naming, see Key Vault objects, identifiers, and versioning
Value: Type a value for the secret. Key Vault APIs accept and return secret values as strings.
Leave the other values at their defaults. Click Create.
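If you prefer doing this step in code rather than in the portal, here is a minimal sketch using the azure-keyvault-secrets package; the vault URL and secret name are placeholders, and it assumes the identity you run it with is allowed to set secrets in the vault.
# Sketch: store the service principal's client secret in Key Vault programmatically.
# Requires: pip install azure-keyvault-secrets azure-identity
from azure.identity import DefaultAzureCredential
from azure.keyvault.secrets import SecretClient

client = SecretClient(vault_url="https://<vault-name>.vault.azure.net/",
                      credential=DefaultAzureCredential())

# The name/value pair mirrors what you would enter on the "Create a secret" screen.
client.set_secret("<secret-name>", "<client-secret>")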
Create an Azure Key Vault-backed secret scope
Verify that you have Contributor permission on the Azure Key Vault instance
Go to https://<databricks-instance>#secrets/createScope. This URL is case sensitive; scope in createScope must be uppercase.
Enter the name of the secret scope (case insensitive)
Use the Manage Principal drop-down to specify that only the Creator of the secret scope has MANAGE permission for this secret scope
Enter the DNS Name and Resource ID of your Key Vault
Click the Create button.
These properties are available from the Properties tab of an Azure Key Vault in your Azure portal.
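Once the scope exists, you can confirm from a notebook that it is visible and that the secret is listed. Note that dbutils.secrets only returns scope and secret names here, never the secret values; "secret-scope-name" is the placeholder used in the OAuth code further below.
# Confirm the new secret scope is visible and see which secret keys it contains.
print(dbutils.secrets.listScopes())
print(dbutils.secrets.list("secret-scope-name"))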
Add a Databricks user group
Go to the Admin Console and click the Groups tab.
Click + Create Group.
Enter a group name and click Confirm.
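The same group can also be created programmatically. Below is a minimal sketch against the workspace SCIM API, using the GrWritersA group name that appears later in this post; the workspace URL and personal access token are placeholders.
# Sketch: create the "GrWritersA" group via the workspace SCIM API instead of the UI.
import requests

resp = requests.post(
    "https://<databricks-instance>/api/2.0/preview/scim/v2/Groups",
    headers={"Authorization": "Bearer <personal-access-token>",
             "Content-Type": "application/scim+json"},
    json={"schemas": ["urn:ietf:params:scim:schemas:core:2.0:Group"],
          "displayName": "GrWritersA"})
print(resp.status_code, resp.json())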
Grant read permissions
First of all, you need to set up the Databricks CLI.
Run
pip install databricks-cli
using the appropriate version of pip for your Python installation.
Before you can run CLI commands, you must set up authentication. To authenticate to the CLI you use a personal access token. To set up the token:
Click the user profile icon in the upper right corner of your Databricks workspace.
Click User Settings.
Go to the Access Tokens tab.
Click the Generate New Token button.
Optionally enter a description (comment) and expiration period.
Click the Generate button.
Copy the generated token and store in a secure location.
To configure the CLI to use the personal access token, run databricks configure --token. The command issues the following prompts:
Databricks Host (should begin with https://):
Token:
After you complete the prompts, your access credentials are stored in the file ~/.databrickscfg. The file should contain entries like:
[DEFAULT]
host = https://<databricks-instance>
token = <personal-access-token>
Below are example CLI commands that grant read permission on the "SsWritersA" secret scope to the "GrWritersA" Databricks group and then verify the ACL.
databricks secrets put-acl --scope SsWritersA --principal GrWritersA --permission READ
databricks secrets get-acl --scope SsWritersA --principal GrWritersA
Principal     Permission
------------  ------------
GrWritersA    READ
And here is sample OAuth code, which is very similar to the code used above.
# Authenticate using a service principal and OAuth 2.0
spark.conf.set("fs.azure.account.auth.type", "OAuth")
spark.conf.set("fs.azure.account.oauth.provider.type",
               "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set("fs.azure.account.oauth2.client.id",
               "enter-your-service-principal-application-id-here")
# The client secret is resolved at run time from the secret scope, never hard-coded.
spark.conf.set("fs.azure.account.oauth2.client.secret",
               dbutils.secrets.get(scope = "secret-scope-name", key = "secret-name"))
spark.conf.set("fs.azure.account.oauth2.client.endpoint",
               "https://login.microsoftonline.com/enter-your-tenant-id-here/oauth2/token")

# Read data in Delta format
readdf = spark.read.format("delta") \
    .load("abfss://file-system-name@storage-account-name.dfs.core.windows.net/path-to-data")
Consequently, if User B doesn't belong to the GrWritersA group, they will be able to see the code, but will be unable to get the value of the client secret stored in secret-scope-name. Simple and elegant.
Hope this is useful.