Categories
apache-spark apache-spark-sql pandas pyspark python

How to make the first row as header when reading a file in PySpark and converting it to Pandas Dataframe

I am reading a file in PySpark and forming the rdd of it. I then convert it to a normal dataframe and then to pandas dataframe. The issue that I am having is that there is header row in my input file and I want to make this as the header of dataframe columns as well but they are read in as an additional row and not as header. This is my current code:

def extract(line):
return line
input_file = sc.textFile('file1.txt').zipWithIndex().filter(lambda (line,rownum): rownum>=0).map(lambda (line, rownum): line)
input_data = (input_file
.map(lambda line: line.split(";"))
.filter(lambda line: len(line) >=0 )
.map(extract)) # Map to tuples
df_normal = input_data.toDF()
df= df_normal.toPandas()

Now when I look at the df then the header row of text file becomes the first row of dataframe and there is additional header in df with 0,1,2... as header. How can I make the first row as header?