SKIP N ROWS
Reading CSV in Databricks
Did you know you can easily skip N rows of a CSV file in Databricks? 🤔
Unfortunately, PySpark's own CSV reader has no option for skipping leading rows, so you have to drop down to the underlying RDD API and filter the rows out yourself. That is not very user-friendly for such a basic operation!
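For comparison, here is a minimal sketch of what that RDD workaround could look like. The helper names `is_past_skipped` and `read_csv_skipping_rows` are made up for illustration, and the sketch assumes a `spark` session is already available (as it is in a Databricks notebook) and that the first kept line is the header.

```python
def is_past_skipped(indexed_line, n):
    """True when a (line, index) pair from zipWithIndex lies after the first n lines."""
    _, index = indexed_line
    return index >= n


def read_csv_skipping_rows(spark, path, n):
    # Read the file as raw text lines so no CSV parsing happens yet.
    lines = spark.sparkContext.textFile(path)
    # zipWithIndex pairs each line with its 0-based position in the file.
    kept = (
        lines.zipWithIndex()
        .filter(lambda pair: is_past_skipped(pair, n))
        .map(lambda pair: pair[0])
    )
    # spark.read.csv also accepts an RDD of strings, so the remaining
    # lines can be parsed as CSV; the first kept line is used as the header.
    return spark.read.csv(kept, header=True, inferSchema=True)
```

In a notebook this would be called as `read_csv_skipping_rows(spark, 'skiprows.csv', 1)`. It works, but it is a lot of ceremony just to ignore one line.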
+-----------------------+--------------------+
|This row is unnecessary| and you can skip it|
+-----------------------+--------------------+
| Name| Age|
| John Doe| 28|
| Jane Smith| 34|
| Alice Brown| 45|
| Bob White| 22|
+-----------------------+--------------------+
This row is unnecessary, and you can skip it
Name, Age, Occupation
John Doe, 28, Software Engineer
Jane Smith, 34, Doctor
Alice Brown, 45, Teacher
Bob White, 22, Student
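skip = (
    spark.read.format("csv")
    .option("inferSchema", True)
    # the first row that is not skipped (Name, Age, Occupation) becomes the header
    .option("header", True)
    # the sample file is comma-separated
    .option("sep", ",")
    # ignore the first line of the file
    .option("skipRows", 1)
    .load("skiprows.csv")
)
display(skip)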
Incorrect output in Spark 2.4.1
🪄✨
Use the `skipRows` option
Databricks skips the unnecessary first row!