If you’ve been working with CSV files in Databricks, you must be familiar with a very useful option called inferSchema while loading CSV files. It is the default option that is widely used by developers to identify the columns, data types, and nullability, automatically while reading the file.
In the below example, the .csv file is read through spark.read.csv function by providing file path, inferSchema option, and header.
By setting the header to ‘true’, Databricks uses the first row of the file for column names.
Below is the code executed in Databricks:
With the inferSchema option is set to true, Databricks will run a pass over the complete file and determine the column names, data types, and nullability. The obtained output is the schema of the data frame inferred by Databricks.
There are situations where the inferSchema option will not work as expected. It sometimes detects the data type and nullability state incorrectly. In the example above, the Account Number Field is detected as a long data type, but the source has the account number stored as a string.
An identifier field like account number would never be used for aggregation or simple addition and subtraction. Let’s assume it should be kept as a string. Also, the column may contain alpha characters in the future and in that case detecting the data type as a long would cause a failure when loading.
To overcome this, you can apply a User-Defined Schema in Databricks to a file.
In the below code, the pyspark.sql.types will be imported using specific data types listed in the method. Here, the Struct Field takes 3 arguments – FieldName, DataType, and Nullability. Once provided, pass the schema to the spark.cread.csv function for the DataFrame to use the custom schema.
In the obtained output, the schema of the DataFrame is as defined in the code:
Another advantage of using a User-Defined Schema in Databricks is improved performance. Spark by default loads the complete file to determine the data types and nullability to build a solid schema. If the file is too large, running a pass over the complete file would take a lot of time.
But, User-Defined Schema in Databricks avoids the pass over the file, hence, performance will have a significant improvement with large files.
Building a User-Defined Schema in Databricks manually is time-consuming, especially when the file has many columns. However, this method will help to maintain the intended schema and also help improving performance by a great extent.
Let us know your thoughts and feedback about working with User-Defined Schemas in Databricks
Click here to take a look at our Microsoft Azure Offerings.