Using the spark.read.csv() method: This is the most common way to read a CSV file in PySpark. You can specify the file path, delimiter, header, and other options as parameters.
df = spark.read.csv("file_path.csv", header=True, inferSchema=True)
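Note that inferSchema=True makes Spark do an extra pass over the file to guess column types, which can be slow on large files. A minimal sketch of supplying an explicit schema instead (the column names and types here are hypothetical):
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, DoubleType

# Hypothetical schema: adjust field names and types to match your file
schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True),
    StructField("amount", DoubleType(), True),
])
df = spark.read.csv("file_path.csv", header=True, schema=schema)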
Using the spark.read.format("csv") method: This method allows you to specify the file format and read options as separate arguments.
df = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load("file_path.csv")
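Further option() calls can be chained for other CSV settings; for example (a sketch, where the delimiter and parse mode are assumptions about the file):
df = spark.read.format("csv") \
    .option("header", "true") \
    .option("sep", ";") \
    .option("mode", "DROPMALFORMED") \
    .load("file_path.csv")
# "sep" sets the delimiter for non-comma-separated files;
# "DROPMALFORMED" silently drops rows that fail to parse.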
Using the spark-csv package: On Spark 1.x, reading CSV files required the external spark-csv package from Databricks. It is added through Spark's --packages mechanism rather than pip, and it is unnecessary on Spark 2.0+, where CSV support is built in.
# Add the package when launching PySpark (Spark 1.x):
#   pyspark --packages com.databricks:spark-csv_2.11:1.5.0
# Read the CSV file
df = sqlContext.read.format('com.databricks.spark.csv').options(header='true', inferSchema='true').load('file_path.csv')
Using the spark.read.text() method: This method reads the CSV file as plain text, one line per row; you then split each line yourself, here via the underlying RDD. Note that a plain comma split breaks on quoted fields that contain commas. Here's an example:
rdd = spark.read.text("path/to/csv/file.csv").rdd
header = rdd.first()                          # the first row holds the column names
data = rdd.filter(lambda row: row != header)  # drop the header row
df = data.map(lambda row: row[0].split(",")).toDF(header[0].split(","))
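The same idea can also stay in the DataFrame API using PySpark's split function instead of dropping to the RDD. A sketch, assuming a file with exactly three columns (the column names are hypothetical):
from pyspark.sql.functions import split, col

lines = spark.read.text("path/to/csv/file.csv")
# Split each line on commas into an array column
fields = lines.select(split(col("value"), ",").alias("f"))
df = fields.select(col("f")[0].alias("col1"),
                   col("f")[1].alias("col2"),
                   col("f")[2].alias("col3"))
# Note: the header line is still a data row here and would need to be filtered out.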
Using third-party libraries: You can also use third-party libraries like pandas or dask to read CSV files into a DataFrame, and then convert that DataFrame to a PySpark DataFrame. Here’s an example using pandas:
import pandas as pd
pdf = pd.read_csv("path/to/csv/file.csv")
df = spark.createDataFrame(pdf)
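For larger pandas DataFrames, enabling Apache Arrow typically makes the createDataFrame conversion much faster. A sketch; the config key below is the Spark 3.x name (Spark 2.3–2.4 used spark.sql.execution.arrow.enabled):
# Enable Arrow-based conversion between pandas and Spark
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
df = spark.createDataFrame(pdf)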