apache-spark json spark-dataframe

How to avoid generating crc files and SUCCESS files while saving a DataFrame?

I am using the following code to save a Spark DataFrame to a JSON file:


The output is:


  1. How do I generate a single JSON file and not a file per line?
  2. How can I avoid the *.crc files?
  3. How can I avoid the SUCCESS file?

If you want a single file, you need to coalesce the DataFrame to a single partition before calling write, i.e. call coalesce(1) on it first.
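A minimal sketch of the coalesce approach, assuming a Spark 2+ SparkSession running locally. The sample data, the object name and the output path /tmp/single-json-output are made up for illustration and are not from the question.

```scala
import org.apache.spark.sql.SparkSession

object SingleJsonFile {
  def main(args: Array[String]): Unit = {
    // Assumption: a local session, only for trying this out.
    val spark = SparkSession.builder()
      .appName("single-json-file")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical sample data standing in for the real DataFrame.
    val df = Seq((1, "a"), (2, "b"), (3, "c")).toDF("id", "value")

    df.coalesce(1)        // collapse to one partition, so one part file is written
      .write
      .mode("overwrite")  // replace the output directory if it already exists
      .json("/tmp/single-json-output")

    spark.stop()
  }
}
```

Note that the path is still written as a directory. With one partition it should contain a single part-* file holding one JSON object per line, rather than a file literally named after the path.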


Personally, I find it rather annoying that the number of output files depends on the number of partitions you have before calling write (especially if you do a write with a partitionBy), but as far as I know, there is currently no other way.
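To see how the partition count relates to the file count, something like the following sketch can be used. It assumes a local Spark 2+ SparkSession. The data and the path /tmp/partition-demo-output are hypothetical.

```scala
import org.apache.spark.sql.SparkSession

object PartitionsAndFiles {
  def main(args: Array[String]): Unit = {
    // Assumption: a local session, only for experimenting.
    val spark = SparkSession.builder()
      .appName("partitions-and-files")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical sample data.
    val df = (1 to 100).toDF("id")

    // Number of partitions Spark chose for this DataFrame.
    println(s"partitions before: ${df.rdd.getNumPartitions}")

    // Force four partitions; the write then produces up to four part files.
    val four = df.repartition(4)
    println(s"partitions after: ${four.rdd.getNumPartitions}")

    four.write.mode("overwrite").json("/tmp/partition-demo-output")

    spark.stop()
  }
}
```

Listing the output directory afterwards should show roughly one part-* file per partition. Depending on the Spark version, empty partitions may not produce a file.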

I don’t know of a way to disable the .crc files, but you can disable the _SUCCESS file by setting the following on the Hadoop configuration of the Spark context:

sc.hadoopConfiguration.set("mapreduce.fileoutputcommitter.marksuccessfuljobs", "false")

Note that if you write Parquet, you may also want to disable generation of the Parquet summary metadata files with:

sc.hadoopConfiguration.set("parquet.enable.summary-metadata", "false")

Apparently, generating the metadata files takes some time (see this blog post), but the files aren’t actually that important (according to this). Personally, I always disable them and I have had no issues.
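For context, here is a sketch of where the two settings might go in a complete job, assuming a Spark 2+ SparkSession (where the context is reached via spark.sparkContext rather than a bare sc). The object name, sample data and output path are hypothetical. I believe newer Spark versions already skip the Parquet summary files by default, so the second setting may be redundant there.

```scala
import java.io.File
import org.apache.spark.sql.SparkSession

object NoSuccessFile {
  def main(args: Array[String]): Unit = {
    // Assumption: a local session writing to the local filesystem.
    val spark = SparkSession.builder()
      .appName("no-success-file")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Set both options before any write is triggered.
    val hadoopConf = spark.sparkContext.hadoopConfiguration
    hadoopConf.set("mapreduce.fileoutputcommitter.marksuccessfuljobs", "false")
    hadoopConf.set("parquet.enable.summary-metadata", "false") // only relevant to Parquet output

    // Hypothetical output location and sample data.
    val outputDir = "/tmp/no-success-output"
    Seq((1, "a"), (2, "b")).toDF("id", "value")
      .coalesce(1)
      .write
      .mode("overwrite")
      .json(outputDir)

    // Print the names of the files that were actually written.
    Option(new File(outputDir).listFiles())
      .getOrElse(Array.empty[File])
      .map(_.getName)
      .sorted
      .foreach(println)

    spark.stop()
  }
}
```

The listing at the end is just a quick way to check whether _SUCCESS is gone. It only works as written when the output path is on the local filesystem.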