MetricSign
High severity · Data quality

Power BI Refresh Error:
BAD_FILE_FORMAT

What does this error mean?

Databricks could not read a file because its actual format does not match the declared format used in the read operation. A Parquet reader receiving a CSV file, or a Delta reader pointed at plain JSON files, will raise this error.

Common causes

  • The file extension and actual format do not match (e.g. a .parquet file that is actually gzip-compressed CSV)
  • An upstream ETL job wrote files in the wrong format to a location that Databricks reads as a specific format
  • A Parquet or ORC file was corrupted during write (incomplete footer or missing schema block)
  • A Delta table location contains non-Delta files that were placed there manually
  • Auto Loader inferred the format incorrectly during initial schema inference
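The first cause above can often be confirmed without Spark: the leading bytes of a file identify its real format regardless of its extension. A minimal sketch in plain Python (the magic-byte table is an assumption covering only the formats mentioned in this article):

```python
import gzip
import os
import tempfile

# Magic bytes for formats commonly confused in data lakes.
MAGIC = {
    b"PAR1": "parquet",
    b"ORC": "orc",
    b"\x1f\x8b": "gzip",  # gzip-compressed CSV often masquerades as .parquet
}

def sniff_format(path: str) -> str:
    """Return the real on-disk format based on the file's leading magic bytes."""
    with open(path, "rb") as f:
        head = f.read(4)
    for magic, fmt in MAGIC.items():
        if head.startswith(magic):
            return fmt
    return "unknown (possibly plain-text CSV/JSON)"

# Demo: a gzip-compressed CSV saved with a .parquet extension (cause 1 above).
tmp = tempfile.mkdtemp()
bad = os.path.join(tmp, "data.parquet")
with gzip.open(bad, "wb") as f:
    f.write(b"id,name\n1,alice\n")

print(sniff_format(bad))  # prints "gzip" despite the .parquet extension
```

Running this sniffer over a sample of files before pointing a Parquet reader at the path turns a mid-job BAD_FILE_FORMAT failure into an early, explicit check.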

How to fix it

  1. Verify the actual format of a suspect file: run file <filename> on Linux, or download the file and inspect its header bytes.
  2. Re-specify the format explicitly in the reader: spark.read.format('parquet').load(path).
  3. Remove or quarantine corrupted files from the source path before re-running the job.
  4. For Auto Loader, set cloudFiles.format explicitly instead of relying on inference.
  5. If a Delta table location contains stray non-Delta files, use VACUUM to remove untracked files and clean up the directory before reading.
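Steps 1 and 3 can be combined into a single quarantine pass before the job re-runs. A sketch under the assumption that the source directory should contain only Parquet files (quarantine_mismatched and EXPECTED_MAGIC are illustrative names, not a Databricks API):

```python
import os
import shutil
import tempfile

# Expected leading magic bytes per declared format (an assumption; extend as needed).
EXPECTED_MAGIC = {"parquet": b"PAR1", "orc": b"ORC"}

def quarantine_mismatched(src_dir: str, declared_format: str, quarantine_dir: str) -> list:
    """Move files whose header does not match the declared format
    into quarantine_dir; return the names of the files moved."""
    os.makedirs(quarantine_dir, exist_ok=True)
    magic = EXPECTED_MAGIC[declared_format]
    moved = []
    for name in os.listdir(src_dir):
        path = os.path.join(src_dir, name)
        with open(path, "rb") as f:
            head = f.read(len(magic))
        if head != magic:
            shutil.move(path, os.path.join(quarantine_dir, name))
            moved.append(name)
    return moved

# Demo: one Parquet-framed file and one stray CSV in the same source path.
src = tempfile.mkdtemp()
qdir = tempfile.mkdtemp()
with open(os.path.join(src, "part-0000.parquet"), "wb") as f:
    f.write(b"PAR1" + b"\x00" * 16 + b"PAR1")  # minimal Parquet-like framing
with open(os.path.join(src, "stray.csv"), "wb") as f:
    f.write(b"id,name\n1,alice\n")

moved = quarantine_mismatched(src, "parquet", qdir)
print(moved)  # prints ['stray.csv']
```

After the pass, the source path contains only files the Parquet reader can open, and the quarantined files are preserved for upstream debugging rather than deleted.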

Frequently asked questions

How do I detect format mismatches before they fail a job?

Auto Loader's cloudFiles.validateOptions setting (enabled by default) only validates option names, not file contents, so it will not catch a format mismatch on its own. To surface mismatches before a job fails mid-run, set cloudFiles.format explicitly and add a schema validation notebook that reads the first file in each batch before the main pipeline runs.

Can a corrupted Parquet footer cause BAD_FILE_FORMAT?

Yes. Parquet stores its schema and row-group metadata in a footer at the end of the file, so a missing or truncated footer makes the whole file unreadable to the Parquet reader. Rewrite the file from source or restore it from a backup.
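A quick integrity check for this case: a structurally complete Parquet file both begins and ends with the 4-byte magic PAR1, so a file whose trailer is missing was truncated mid-write. A minimal sketch (has_valid_parquet_framing is an illustrative helper; it checks the framing only, not the footer's contents):

```python
import os
import tempfile

def has_valid_parquet_framing(path: str) -> bool:
    """Check for the PAR1 magic at both ends of the file.
    Catches truncated footers; does not validate the schema block itself."""
    with open(path, "rb") as f:
        if f.read(4) != b"PAR1":
            return False
        f.seek(-4, os.SEEK_END)
        return f.read(4) == b"PAR1"

# Demo: a complete file vs. one whose write died before the footer.
tmp = tempfile.mkdtemp()
good = os.path.join(tmp, "good.parquet")
with open(good, "wb") as f:
    f.write(b"PAR1" + b"\x00" * 32 + b"PAR1")

truncated = os.path.join(tmp, "truncated.parquet")
with open(truncated, "wb") as f:
    f.write(b"PAR1" + b"\x00" * 32)  # footer never written

print(has_valid_parquet_framing(good), has_valid_parquet_framing(truncated))
# prints: True False
```

Running this over a source path isolates truncated files so only those need to be rewritten from source or restored from backup.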
