Severity: Medium. Category: Data quality.
Power BI Refresh Error:
MALFORMED_RECORD_IN_PARSING
What does this error mean?
The JSON or CSV parser encountered a record that does not conform to the expected schema or format. In FAILFAST mode, Databricks raises this error immediately; in PERMISSIVE mode (the default), the parser sets the malformed record's fields to NULL and, if the schema declares a _corrupt_record column, stores the raw record text there for inspection.
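The three parser modes (PERMISSIVE, DROPMALFORMED, FAILFAST) can be sketched in plain Python. This is a stdlib illustration of the semantics only, not Spark's implementation; the function name parse_records and the use of ValueError for FAILFAST are choices made here for the sketch.

```python
import json

def parse_records(lines, fields, mode="PERMISSIVE"):
    """Parse newline-delimited JSON, mimicking Spark's parser modes."""
    rows = []
    for line in lines:
        try:
            record = json.loads(line)
            rows.append({f: record.get(f) for f in fields})
        except json.JSONDecodeError:
            if mode == "FAILFAST":
                # Spark raises its own exception here; ValueError stands in for it
                raise ValueError(f"Malformed record: {line!r}")
            if mode == "DROPMALFORMED":
                continue  # silently skip the bad record
            # PERMISSIVE: NULL out the fields, keep the raw text for inspection
            row = {f: None for f in fields}
            row["_corrupt_record"] = line
            rows.append(row)
    return rows

lines = ['{"id": 1, "name": "ok"}', '{"id": 2, "name": ']  # second line truncated
print(parse_records(lines, ["id", "name"]))  # PERMISSIVE keeps both rows
```

PERMISSIVE preserves row count at the cost of NULLs, DROPMALFORMED preserves cleanliness at the cost of silent data loss, and FAILFAST surfaces the problem immediately, which is what produces this refresh error.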
Common causes
- A JSON field contains a value that cannot be cast to the declared schema type (e.g. a string where an integer is expected)
- A CSV row has a different number of fields than the header row
- A multi-line JSON record spans several physical lines and spark.read.json (or the CSV reader) is used without the multiLine option
- An upstream system changed its output format without the reader schema being updated
- A file is partially corrupted or truncated mid-record
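The CSV field-count mismatch above is cheap to detect before the pipeline runs. A minimal stdlib check might look like this (find_malformed_csv_rows is a hypothetical helper, not a Spark or Databricks API):

```python
import csv
import io

def find_malformed_csv_rows(text):
    """Return (line_number, row) pairs whose field count differs from the header."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    bad = []
    # Header is physical line 1, so data rows start at line 2
    for lineno, row in enumerate(reader, start=2):
        if len(row) != len(header):
            bad.append((lineno, row))
    return bad

sample = "id,name,amount\n1,alice,10\n2,bob\n3,carol,30,EXTRA\n"
print(find_malformed_csv_rows(sample))  # lines 3 and 4 have the wrong field count
```

Running a check like this on a sample of each incoming file turns a mid-refresh FAILFAST error into an actionable report of exactly which lines are bad.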
How to fix it
- Switch the reader to PERMISSIVE mode, e.g. spark.read.option('mode', 'PERMISSIVE'), and inspect the _corrupt_record column to see the raw malformed records.
- Use DROPMALFORMED to skip bad records silently; note that dropped records are not written to _corrupt_record, so run a separate PERMISSIVE pass if you need to investigate them.
- Validate schema compatibility between the reader's schema and a sample of new incoming files before processing.
- For multi-line JSON, add .option('multiLine', 'true') to the reader.
- Add a schema validation step upstream (e.g. Great Expectations or Delta Live Tables expectations) to reject malformed files before the pipeline runs.