
Several Ways to Extract Table Data from PDF Files


admin

August 28, 2025, 1:38 · Views: 773

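We often come across published PDF files whose tables contain data we need to extract.

There are several ways to extract the data; here we use the simplest one, Python.

Create a Python project and add a file named readpdf.py with the following content: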

import tabula

# Check that the local Java environment is set up correctly
tabula.environment_info()
# jpype.startJVM(jpype.getDefaultJVMPath())
# Read the tables in the PDF file into a list of DataFrames
dfs = tabula.read_pdf("D:\\ai-hos\\doc\\my.pdf", pages='all',
                      multiple_tables=True, force_subprocess=True)

for i, df in enumerate(dfs):
    # Save each DataFrame as a CSV file
    df.to_csv(f"table_{i}.csv", index=False)

Install the supporting libraries:

pip install tabula-py

pip install jpype1 --no-cache-dir

Make sure the local JDK is installed correctly:

java -version

Running the program now parses the table data in the PDF file.
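Because tabula-py depends on a working Java runtime, it can help to run the `java -version` check from Python before calling `tabula.read_pdf`. The following is a minimal sketch using only the standard library; `java_version` is a hypothetical helper name and is not part of tabula-py.

```python
import shutil
import subprocess


def java_version():
    """Return the output of `java -version`, or None if Java is not usable."""
    java = shutil.which("java")  # look up the java executable on PATH
    if java is None:
        return None
    try:
        result = subprocess.run(
            [java, "-version"], capture_output=True, text=True, timeout=30
        )
    except (OSError, subprocess.TimeoutExpired):
        return None
    if result.returncode != 0:
        return None
    # Most JVMs print the version banner to stderr rather than stdout.
    return (result.stderr or result.stdout).strip()


if __name__ == "__main__":
    version = java_version()
    print(version if version else "Java not found; install a JDK and add it to PATH")
```

If the function returns None, the JDK is either missing or not on PATH, which is worth fixing before debugging the PDF itself.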

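Python can also extract PDF table data in the following ways:

1. Using the pdfplumber library

Install: pip install pdfplumber (the example below also uses pandas, plus openpyxl for the Excel export)

Example code: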

python

import pdfplumber
import pandas as pd


def extract_pdf_tables(pdf_path, start_page, end_page):
    """Collect every table on pages start_page..end_page (1-based, inclusive)."""
    all_dfs = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages[start_page - 1:end_page]:
            for table in page.extract_tables():
                if table:
                    # Treat the first row as the header and the rest as data
                    all_dfs.append(pd.DataFrame(table[1:], columns=table[0]))
    if not all_dfs:
        # pd.concat raises ValueError on an empty list
        return pd.DataFrame()
    return pd.concat(all_dfs, ignore_index=True)


# Usage example
pdf_path = "your_file.pdf"
start_page = 1  # first page to read
end_page = 10  # last page to read
data = extract_pdf_tables(pdf_path, start_page, end_page)
data.to_excel("output.xlsx", index=False)  # writing .xlsx requires openpyxl
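`page.extract_tables()` returns each table as a list of rows, where a cell may be `None` or contain embedded line breaks, and a table that spans several pages often repeats its header row. The sketch below shows one way to clean such rows before building a DataFrame. It uses only the standard library; `clean_rows` and the sample data are hypothetical and not part of pdfplumber.

```python
def clean_rows(table):
    """Normalize raw rows as returned by pdfplumber's extract_tables()."""
    if not table:
        return []
    cleaned = []
    for row in table:
        # Replace None with "" and collapse line breaks and extra whitespace in a cell.
        cells = [" ".join(str(cell).split()) if cell is not None else "" for cell in row]
        if any(cells):  # skip rows that are entirely empty
            cleaned.append(cells)
    if not cleaned:
        return []
    header = cleaned[0]
    # Drop later rows identical to the header (headers repeated on each page).
    return [header] + [row for row in cleaned[1:] if row != header]


# Made-up sample resembling two pages of the same table stacked together.
raw = [
    ["Name", "Unit\nprice"],
    ["Bolt", "0.10"],
    [None, None],
    ["Name", "Unit\nprice"],
    ["Nut", None],
]
print(clean_rows(raw))
```

The cleaned rows can then be turned into a DataFrame with `pd.DataFrame(rows[1:], columns=rows[0])`, the same pattern the function above uses.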

2. Using the camelot library

Install: pip install camelot-py[cv]

Example code:

python

import camelot
import pandas as pd


def extract_pdf_tables(pdf_path):
    # camelot.read_pdf reads only page 1 unless a pages argument is given
    tables = camelot.read_pdf(pdf_path)
    combined_df = pd.concat([table.df for table in tables], ignore_index=True)
    return combined_df


# Usage example
pdf_path = "your_file.pdf"
data = extract_pdf_tables(pdf_path)
data.to_csv("output.csv", index=False)

3. Using the tabula-py library

Install: pip install tabula-py jpype1

Example code:

python

import pandas as pd
import tabula


def extract_pdf_tables(pdf_path):
    # read_pdf returns a list of DataFrames, one per detected table
    dfs = tabula.read_pdf(pdf_path, pages='all', multiple_tables=True)
    combined_df = pd.concat(dfs, ignore_index=True)
    return combined_df


# Usage example
pdf_path = "your_file.pdf"
data = extract_pdf_tables(pdf_path)
data.to_excel("output.xlsx", index=False)  # writing .xlsx requires openpyxl
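Since the three approaches expose a similar "path in, tables out" entry point, one option is to try them in order and keep the first result that is not empty. The sketch below shows that pattern under stated assumptions: `first_successful` is a hypothetical helper, and the stand-in functions `broken`, `empty` and `works` only simulate extractors. In real use each library's function would need its own name, since all three examples above are called `extract_pdf_tables`.

```python
def first_successful(pdf_path, extractors):
    """Try each (name, function) pair in order; return the first non-empty result."""
    errors = {}
    for name, extract in extractors:
        try:
            result = extract(pdf_path)
        except Exception as exc:  # missing dependency or parse failure: try the next one
            errors[name] = exc
            continue
        if result is not None and len(result) > 0:
            return name, result
        errors[name] = "no tables found"
    raise RuntimeError(f"no extractor succeeded for {pdf_path}: {errors}")


# Stand-in extractors used only to demonstrate the control flow.
def broken(path):
    raise ValueError("cannot parse")


def empty(path):
    return []


def works(path):
    return [["a", "b"], ["1", "2"]]


name, rows = first_successful(
    "your_file.pdf", [("broken", broken), ("empty", empty), ("works", works)]
)
print(name, rows)
```

Returning the extractor's name alongside the data makes it easy to log which library handled which file.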

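Notes

File path: make sure the path to the PDF file is correct. Either an absolute or a relative path works.

Table format: the libraries differ in which table layouts they handle well. If the extracted result is poor, try another library or adjust the parameters.

Performance: for large PDF files, process the pages in batches or use multithreading to improve throughput.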
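The batching idea in the last note can be sketched as follows, using only the standard library. The names `page_ranges`, `extract_in_chunks` and `fake_extract` are hypothetical; a real worker might call the pdfplumber `extract_pdf_tables(pdf_path, start_page, end_page)` function shown earlier. Whether threads actually help depends on the library: they tend to pay off when the work happens in an external process, while pure-Python parsing is likely to benefit more from `ProcessPoolExecutor`.

```python
from concurrent.futures import ThreadPoolExecutor


def page_ranges(total_pages, chunk_size):
    """Split pages 1..total_pages into inclusive (start, end) ranges."""
    if total_pages < 0 or chunk_size < 1:
        raise ValueError("total_pages must be >= 0 and chunk_size >= 1")
    return [
        (start, min(start + chunk_size - 1, total_pages))
        for start in range(1, total_pages + 1, chunk_size)
    ]


def extract_in_chunks(total_pages, chunk_size, extract_range, max_workers=4):
    """Run extract_range(start, end) for every chunk and return results in page order."""
    ranges = page_ranges(total_pages, chunk_size)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() yields results in the order of the inputs, not completion order.
        return list(pool.map(lambda r: extract_range(*r), ranges))


# Stand-in worker used only for demonstration.
def fake_extract(start, end):
    return f"pages {start}-{end}"


print(page_ranges(25, 10))
print(extract_in_chunks(25, 10, fake_extract))
```

Keeping the chunking separate from the worker means the same splitting logic can be reused with any of the three libraries.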

This article was last edited on 2025/8/28 16:17:25.
