Can Power BI's PDF connector handle scanned (image-based) PDFs?

No. The PDF connector in Power Query relies on text extraction via IFilter, not OCR. If the PDF contains scanned images of text, Pdf.Tables() returns zero rows without an error. You need to run OCR (e.g., Azure AI Document Intelligence or Tesseract) upstream and feed the extracted text to Power BI as a CSV or database table.

Why does my PDF-based dataset refresh fail on the Power BI Service but work in Desktop?

Three common causes: (1) The On-premises Data Gateway machine lacks the required PDF IFilter, so Pdf.Tables is unsupported in that environment. (2) The dataset uses a Python or R script for PDF parsing, which the gateway does not support for scheduled refresh. (3) The file path points to a local drive that the cloud service cannot access. Check the gateway logs under the PBIEgwService user profile for the specific Mashup engine exception.

How do I prevent silent data corruption when the PDF layout changes between months?

Add a schema validation step in your M query that checks whether all expected column names exist after Table.PromoteHeaders. If any are missing, raise an error with Error.Record() listing the missing columns. This converts a silent corruption into a visible refresh failure that your monitoring can catch.

Power BI's PDF Connector Parses Your File Once — Then Breaks on Every Structural Change

The PDF connector infers structure — it does not understand it

Power BI Desktop ships a PDF connector built on top of the IFilter-based PDF parsing in Power Query. When you point it at a file and select a table, the M engine calls Pdf.Tables(), which scans the page for rectangular regions that look like tabular data. It returns one or more tables with column names derived from what it identifies as a header row.

This works remarkably well when you have a single, cleanly formatted PDF. The trap is that the connector has no schema memory. It re-infers the table layout on every refresh. If next month's vendor report adds a subtotal row between the header and the first data row, the connector may interpret the subtotal as the new header. If a merged cell spans two columns, the column count shifts and every downstream column reference in your DAX breaks.

The M code generated when you first connect looks roughly like this:

let
    Source = Pdf.Tables(File.Contents("C:\Reports\vendor_q1.pdf")),
    Table1 = Source{[Id="Table1"]}[Data],
    Promoted = Table.PromoteHeaders(Table1, [PromoteAllScalars=true])
in
    Promoted

That Table.PromoteHeaders call assumes the first row is always the header. It is not a guarantee — it is a guess that happened to be correct the first time. When the PDF structure drifts, the promoted headers become data values and the real headers end up as a data row. Your measures still compute. They compute on garbage.

The connector also has a hard limitation: it cannot parse PDFs that are image-based scans rather than text-based documents. No OCR runs inside Power Query. The refresh succeeds with zero rows returned, and unless you check row counts, you will not notice.

Scheduled refresh fails silently when the file path moves

During development, you work with File.Contents("C:\Reports\...") — a local path. Scheduled refresh on the Power BI Service cannot reach local paths. You need either a gateway-backed file share or a cloud-hosted file (SharePoint, OneDrive for Business, Azure Blob Storage).

Here is where PDF extraction compounds the usual file-connector problems. The On-premises Data Gateway supports the PDF connector, but only if the machine running the gateway has the correct PDF IFilter registered. On Windows Server, this is typically the Adobe PDF IFilter 64-bit or the built-in Windows IFilter for PDF that ships with Windows Server 2019 and later. If the IFilter is missing, the gateway logs an error in the mashup engine container:

Microsoft.Mashup.Engine.Interface.ResourceAccessDeniedOrDoesNotExistException:
  The resource 'Pdf.Tables' is not supported in the current environment.

This error does not surface as a clear message in the Power BI Service refresh history. You see Data source error with a generic recommendation to check credentials. The actual cause — a missing system component on the gateway machine — requires you to open the gateway logs at C:\Users\PBIEgwService\AppData\Local\Microsoft\On-premises data gateway\*.log and search for the Mashup engine exception.

SharePoint-hosted PDFs introduce a different failure mode. If the file is checked out by another user, or if the library enforces approval workflows, the connector receives a 423 Locked HTTP status. Power Query translates this into DataSource.Error: The remote server returned an error: (423) Locked. The refresh fails, but the error message does not tell you which file or which library caused it. When your dataset pulls from multiple PDF files across different SharePoint sites, isolating the locked file requires disabling tables one at a time and re-running the refresh.

PDF-to-Power BI pipeline architecture

Python and R scripts parse better but break the gateway contract

Many teams abandon the native PDF connector and switch to Python-based extraction using libraries like tabula-py, camelot-py, or pdfplumber. These libraries offer substantially better control over table detection — you can specify the page number, the bounding box coordinates, and the parsing strategy (lattice vs. stream). Inside Power BI Desktop, this works through the Python script data source:

import pandas as pd
import tabula

tables = tabula.read_pdf(
    r"C:\Reports\vendor_q1.pdf",
    pages="all",
    lattice=True
)
df = pd.concat(tables, ignore_index=True)

The problem arrives at scheduled refresh time. Python and R script data sources are not supported by the On-premises Data Gateway. Microsoft documents this limitation explicitly: gateway refresh is unavailable for datasets that use Python or R as a data source. Your dataset refreshes manually from Desktop but will never refresh on a schedule through the Service unless you restructure the pipeline.

The workaround is to move the Python parsing outside of Power BI entirely. Run the extraction in an Azure Data Factory pipeline activity, a Databricks notebook, or a scheduled Azure Function. Write the parsed output to a staging table in a SQL database or a Parquet file in Azure Data Lake Storage Gen2. Then point Power BI at the clean, tabular output — not the raw PDF.

This architecture shift solves two problems at once. First, the schema is validated at the parsing stage. If the PDF structure changes and tabula returns a DataFrame with unexpected columns, you can raise an exception and halt the pipeline before bad data reaches the dataset. Second, the Power BI dataset now connects to a standard data source that the gateway handles without complications.

The tradeoff is operational complexity. You now have a pipeline component that must be monitored independently. The PDF might arrive late, the parsing might fail on a new layout, or the staging table might not get populated before the Power BI refresh schedule fires.

Schema drift between PDF versions poisons measures without errors

The most dangerous failure mode with PDF extraction is not a failed refresh — it is a successful one that loads structurally wrong data. Consider a monthly report where the vendor adds a new column between "Net Amount" and "Tax Amount." The PDF connector re-infers the table and assigns new column names. If the new column is "Discount Amount," the old "Tax Amount" column might now be mapped to what Power Query calls Column6 instead of its promoted header name.

Your DAX measure SUM('VendorData'[Tax Amount]) still runs. It sums whatever column still carries that header. If the header promotion shifted, it might now be summing the discount values, or the column might contain nulls from a misaligned parse. No error fires. The visual renders. The number is wrong.

To defend against this, add explicit schema validation steps in your M query:

let
    Source = Pdf.Tables(File.Contents(filePath)),
    Table1 = Source{[Id="Table1"]}[Data],
    Promoted = Table.PromoteHeaders(Table1),
    ExpectedColumns = {"Vendor", "Net Amount", "Tax Amount", "Total"},
    ActualColumns = Table.ColumnNames(Promoted),
    SchemaValid = List.ContainsAll(ActualColumns, ExpectedColumns),
    Validated = if SchemaValid then Promoted
               else error Error.Record("Schema Mismatch",
                    "Expected columns missing: " &
                    Text.Combine(List.Difference(ExpectedColumns, ActualColumns), ", "))
in
    Validated

This forces the refresh to fail with a descriptive error when columns drift. A failed refresh is far better than a successful refresh with corrupted data — the failed refresh gets noticed, the corrupted data sits in a dashboard for days until someone questions the numbers in a Monday morning meeting.

For datasets where the PDF source changes structure frequently, consider computing row-count and column-count checksums during each refresh and logging them to a separate tracking table. A sudden drop in row count from 500 to 12 is a reliable signal that the parser misidentified the table boundaries.

Move parsing upstream and monitor the refresh outcome downstream

The reliable architecture for PDF-to-Power BI pipelines separates three concerns: file acquisition, structural parsing, and dataset refresh. Each can fail independently, and each needs its own monitoring.

File acquisition means getting the PDF into a location your pipeline can reach. Whether that is a SharePoint document library, an Azure Blob container, or an SFTP drop, build a check that confirms the file arrived and has a non-zero byte size before triggering the parse. Azure Data Factory supports event-based triggers on Blob Storage that fire when a new file lands. For SharePoint, a Logic App or Power Automate flow can watch a library folder and copy the file to Blob Storage.

Structural parsing runs in a compute environment you control — a Databricks notebook, an ADF mapping data flow, or an Azure Function running pdfplumber. This step should validate the output schema, assert minimum row counts, and write to a staging area in a format Power BI handles natively: Parquet, Delta Lake, or a SQL table.

Dataset refresh is the final mile. With the parsed data already sitting in a clean tabular format, the Power BI refresh becomes a standard import or DirectQuery operation. The gateway supports it, scheduled refresh works, and incremental refresh policies can apply if the staging table is partitioned by date.

MetricSign monitors this final stage. When a Power BI dataset refresh fails — or succeeds but arrives late relative to its schedule — MetricSign raises a refresh_delayed signal with the exact error context from the Power BI Service API. For PDF-sourced datasets where the real risk is silent schema drift rather than outright failure, pairing the M-level schema validation from the previous section with MetricSign's refresh monitoring closes the gap: the validation forces a visible failure, and MetricSign ensures that failure reaches the right person within minutes, not days.