Azure Databricks is a Unified Analytics Platform built by the creators of Apache Spark. It is the first Unified Analytics Platform that can handle all your analytical needs, from ETL to training AI models. Databricks is committed to a security-first approach in building the product, and this blog covers one such new security feature.
There are two methods to connect to Azure Data Lake:
- API Method
- Mount Method
To connect through either the API method or the mount method, a service principal ID and key are provided. The security concern with connecting through a service principal key is that anyone with access to the Databricks instance gains access to all the files in the Data Lake, even if they don't individually have access to those files.
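As a rough sketch of the API method with a service principal on ADLS Gen2, the connection is configured like this (all angle-bracket values, the container name, and the file path are placeholders, not values from this walkthrough):

```python
# Service principal (OAuth client credentials) configuration for ADLS Gen2.
# Placeholders: <storage-account>, <application-id>, <scope>, <key>, <tenant-id>.
spark.conf.set("fs.azure.account.auth.type.<storage-account>.dfs.core.windows.net", "OAuth")
spark.conf.set("fs.azure.account.oauth.provider.type.<storage-account>.dfs.core.windows.net",
               "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set("fs.azure.account.oauth2.client.id.<storage-account>.dfs.core.windows.net",
               "<application-id>")
spark.conf.set("fs.azure.account.oauth2.client.secret.<storage-account>.dfs.core.windows.net",
               dbutils.secrets.get(scope="<scope>", key="<key>"))
spark.conf.set("fs.azure.account.oauth2.client.endpoint.<storage-account>.dfs.core.windows.net",
               "https://login.microsoftonline.com/<tenant-id>/oauth2/token")

# Every user on this cluster can now read anything the service principal can read:
df = spark.read.csv("abfss://<container>@<storage-account>.dfs.core.windows.net/finance/report.csv")
```

Note that the secret belongs to the service principal, not to the user running the notebook, which is exactly why per-user access control is lost.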
That is a security concern, so how do we solve it? This is where Azure AD Credential Passthrough comes into the picture. Say we have marketing- and finance-related files in the Data Lake, and we do not want marketing to access the finance files. We need to implement role-based access control in Databricks, and we can use credential passthrough to achieve this. With this option enabled, Databricks passes the user's Azure AD access token to the Data Lake and fetches only the data that user has permission to read. This works with Databricks instances in the Premium tier, on High Concurrency clusters.
Under the cluster's Advanced Options, you'll find the Data Lake Credential Passthrough option. As you can see, it works for both Data Lake Gen1 and Gen2. Now, let's see how it works. For testing purposes, we have a file in the Data Lake that one user can access and another cannot.
First, we try to access that file with Finance User credentials and we can read the file.
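With passthrough enabled, the read itself needs no service-principal configuration at all; the cluster forwards the signed-in user's Azure AD token. A minimal sketch (the path is a placeholder, not the file from this test):

```python
# No OAuth spark.conf settings needed: the cluster passes the signed-in
# user's Azure AD token straight to the storage account.
df = spark.read.csv("abfss://<container>@<storage-account>.dfs.core.windows.net/finance/report.csv")
df.show()
```

Run by a user with access, this displays the file's contents; run by a user without access, the same call fails with an access-denied error.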
Now, we try the same with the Marketing User credentials and receive an access denied error.
As you can see, the user's AD credentials were used to obtain a token, which was passed to the Data Lake to check whether the user has access to the file. We can also implement this with a mounted path: when creating the mount, instead of supplying the regular service-principal configuration, use the passthrough configuration for the relevant storage generation.
ADLS Gen 1
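For Gen1, a sketch of the mount call with the passthrough token provider (angle-bracket values are placeholders):

```python
# ADLS Gen1 mount using credential passthrough instead of a service principal key.
# Placeholders: <storage-account>, <directory>, <mount-name>.
configs = {
    "fs.adl.oauth2.access.token.provider.type": "CustomAccessTokenProvider",
    "fs.adl.oauth2.access.token.custom.provider": spark.conf.get(
        "spark.databricks.passthrough.adls.tokenProviderClassName"),
}
dbutils.fs.mount(
    source="adl://<storage-account>.azuredatalakestore.net/<directory>",
    mount_point="/mnt/<mount-name>",
    extra_configs=configs)
```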
ADLS Gen 2
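For Gen2, the equivalent sketch uses the Gen2 token provider class (angle-bracket values are placeholders):

```python
# ADLS Gen2 mount using credential passthrough instead of a service principal key.
# Placeholders: <container>, <storage-account>, <mount-name>.
configs = {
    "fs.azure.account.auth.type": "CustomAccessToken",
    "fs.azure.account.custom.token.provider.class": spark.conf.get(
        "spark.databricks.passthrough.adls.gen2.tokenProviderClassName"),
}
dbutils.fs.mount(
    source="abfss://<container>@<storage-account>.dfs.core.windows.net/",
    mount_point="/mnt/<mount-name>",
    extra_configs=configs)
```

Because the token is resolved per user at read time, each user sees only what their own Azure AD identity permits, even through a shared mount point.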
For testing purposes, we removed the Finance User's access to the file and created two mount points: Vbitraining (using the service principal key) and vbitraining1 (using credential passthrough). When we try to access the file through the mount points, you can see the error.
There are some limitations with using this method:
- It's not supported in Scala; it's currently supported only in Python and SQL.
- It supports Data Lake Gen1 and Gen2 only; other storage options do not work with this method.
- It does not support some deprecated methods, but the `sc` and `spark` methods work without issues.
Now that we've seen how this method works, we can grant the marketing distribution list access to its own folder, so that only that team can access it.