How to Extract Table Data from PDFs and Convert It to Excel with Python

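Overview

Extracting table data from PDF files and converting it to Excel format is a common need in business and data analysis. Python makes this straightforward. This article shows how to use Python libraries to extract table data from a PDF and write it to an Excel file with cell borders.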

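Use Cases

The generated Excel file is useful in situations such as:

Data analysis: edit the extracted data in Excel and build charts or pivot tables.
Report creation: extract data from PDF reports that are updated regularly and manage it centrally in Excel.
Data migration: receive data from a legacy system as PDF and move it to a new system in Excel format.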

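Required Python Libraries and Installation

The program requires the following Python libraries:

PyMuPDF (imported as fitz)
camelot-py
openpyxl
tkinterdnd2
PyPDF2 (2.10.5). Note: version 3.0.0 causes errors.

Install them with the following command (the quotes around camelot-py[cv] keep shells such as zsh from interpreting the square brackets):

pip install PyPDF2==2.10.5 "camelot-py[cv]" openpyxl tkinterdnd2 pymupdf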
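
After installing, it can help to confirm which versions actually ended up in the environment. The following is a minimal sketch using only the standard library. The helper names (installed_version, pypdf2_is_compatible, report) are hypothetical, and the rule "any PyPDF2 below 3.0 is acceptable" is an assumption extrapolated from the note above, which only states that 2.10.5 works and 3.0.0 does not.

```python
from importlib import metadata

# Distribution names as they are passed to pip in this article
REQUIRED = ["PyMuPDF", "camelot-py", "openpyxl", "tkinterdnd2", "PyPDF2"]


def installed_version(dist_name):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def pypdf2_is_compatible(version):
    """Assumption: any PyPDF2 release with a major version below 3 is acceptable."""
    return int(version.split(".")[0]) < 3


def report():
    """Build one status line per required library."""
    lines = []
    for name in REQUIRED:
        version = installed_version(name)
        status = version if version else "NOT INSTALLED"
        if name == "PyPDF2" and version and not pypdf2_is_compatible(version):
            status += " (3.x or later; this article pins 2.10.5)"
        lines.append(f"{name}: {status}")
    return lines


if __name__ == "__main__":
    print("\n".join(report()))
```

Running the script prints one line per library, so a missing or mismatched package is visible before the converter is started.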

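Usage

Install the required libraries. Open a terminal or command prompt and run the command above.
Copy the program code below and paste it into a text editor such as Notepad, then save the file as pdf_to_excel.py.
Run pdf_to_excel.py, either by double-clicking it or by entering python pdf_to_excel.py in a terminal or command prompt.
When the GUI appears, drag and drop a PDF file onto it, or use the "Browse" button to select one.
Choose the destination folder and click the "Convert" button to start the conversion.
When the conversion succeeds, an Excel file with cell borders is saved in the folder you selected.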
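
For batch jobs, the same conversion can be run without the GUI. The following is a minimal sketch under stated assumptions, not the program from this article: build_output_path and convert_without_gui are hypothetical names, each table is written to its own sheet named table_001, table_002, and so on, and borders and bookmark-based sheet names are omitted. It uses the same camelot "stream" flavor as the program below.

```python
from pathlib import Path


def build_output_path(pdf_path, destination_folder):
    """Return <destination_folder>/<pdf name without extension>.xlsx."""
    return Path(destination_folder) / (Path(pdf_path).stem + ".xlsx")


def convert_without_gui(pdf_path, destination_folder):
    """Extract every table camelot finds and write one sheet per table."""
    # Imported lazily so build_output_path works without these libraries
    import camelot
    from openpyxl import Workbook

    tables = camelot.read_pdf(str(pdf_path), flavor="stream", pages="all")

    workbook = Workbook()
    workbook.remove(workbook.active)  # Drop the default empty sheet
    for index, table in enumerate(tables, start=1):
        sheet = workbook.create_sheet(title=f"table_{index:03d}")
        for row in table.df.values.tolist():
            sheet.append(row)

    if not workbook.worksheets:
        raise ValueError("No tables were found in the PDF.")

    output_path = build_output_path(pdf_path, destination_folder)
    workbook.save(str(output_path))
    return output_path
```

A call such as convert_without_gui("report.pdf", "out") would then write out/report.xlsx, provided the out folder already exists.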

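Notes

Library versions: use the versions specified. In particular, PyPDF2 must be pinned to 2.10.5.
PDF content: when the table layout in the PDF is complex, tables may not be extracted correctly. In that case, check the output against the PDF and correct it manually.
Fonts: when the PDF contains Japanese text, characters may not display correctly because of font issues. In that case, change the font or correct the text manually.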
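
One way to decide which pages need the manual check mentioned above is to measure how many extracted cells came out empty. camelot reports a similar figure in each table's parsing_report. The sketch below computes it directly from the rows so that it can be tried without a PDF. The function names and the 0.5 threshold are hypothetical choices for illustration, not values recommended by camelot.

```python
def empty_cell_ratio(rows):
    """Return the fraction of cells that are empty or whitespace-only."""
    cells = [cell for row in rows for cell in row]
    if not cells:
        return 1.0  # Treat a table with no cells as fully empty
    empty = sum(1 for cell in cells if not str(cell).strip())
    return empty / len(cells)


def needs_manual_review(rows, threshold=0.5):
    """Flag a table when more than `threshold` of its cells are empty."""
    return empty_cell_ratio(rows) > threshold


# Example with hand-written rows; with camelot the rows would come from
# table.df.values.tolist()
sample = [["Item", "Price"], ["", ""], ["Apple", ""]]
print(f"{empty_cell_ratio(sample):.2f}", needs_manual_review(sample))
```

A high ratio often means that the stream parser split one table column into several, so those pages are a reasonable place to start checking by hand.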

Program

Copy the code below in full into a text editor such as Notepad and save it as a Python file (pdf_to_excel.py).

import fitz  # PyMuPDF
import openpyxl as px
from openpyxl.styles import Border, Side
import tkinter as tk
from tkinterdnd2 import TkinterDnD, DND_FILES
from tkinter import filedialog
from tkinter import messagebox
from tkinter import ttk
import os
import camelot

class PDFtoExcelConverter:
    def __init__(self, root):
        self.root = root
        self.root.title("PDF to Excel Converter")
        
        self.source_file_path = ""
        self.destination_folder_path = ""

        self.source_label = ttk.Label(root, text="Select PDF File:")
        self.source_label.grid(row=0, column=0, padx=5, pady=5, sticky="w")
        
        self.source_entry = ttk.Entry(root, width=50)
        self.source_entry.grid(row=0, column=1, padx=5, pady=5, sticky="ew")

        self.source_button = ttk.Button(root, text="Browse", command=self.select_source_file)
        self.source_button.grid(row=0, column=2, padx=5, pady=5)
        
        self.destination_label = ttk.Label(root, text="Select Destination Folder:")
        self.destination_label.grid(row=1, column=0, padx=5, pady=5, sticky="w")

        self.destination_entry = ttk.Entry(root, width=50)
        self.destination_entry.grid(row=1, column=1, padx=5, pady=5, sticky="ew")

        self.destination_button = ttk.Button(root, text="Browse", command=self.select_destination_folder)
        self.destination_button.grid(row=1, column=2, padx=5, pady=5)

        self.convert_button = ttk.Button(root, text="Convert", command=self.convert_pdf_to_excel)
        self.convert_button.grid(row=2, column=1, padx=5, pady=5)

        self.progress_label = ttk.Label(root, text="")
        self.progress_label.grid(row=3, column=0, columnspan=3, padx=5, pady=5)

        self.dnd_label = ttk.Label(root, text="Drag and Drop PDF File Here:")
        self.dnd_label.grid(row=4, column=0, columnspan=3, padx=5, pady=5)

        self.dnd_frame = ttk.Frame(root, relief="sunken", width=50, height=100)
        self.dnd_frame.grid(row=5, column=0, columnspan=3, padx=5, pady=5, sticky="ew")
        self.dnd_frame.drop_target_register(DND_FILES)
        self.dnd_frame.dnd_bind('<<Drop>>', self.on_drop)

    def select_source_file(self):
        self.source_file_path = filedialog.askopenfilename(filetypes=[("PDF Files", "*.pdf")])
        self.source_entry.delete(0, tk.END)
        self.source_entry.insert(0, self.source_file_path)

    def select_destination_folder(self):
        self.destination_folder_path = filedialog.askdirectory()
        self.destination_entry.delete(0, tk.END)
        self.destination_entry.insert(0, self.destination_folder_path)

    def on_drop(self, event):
        self.source_file_path = event.data.strip('{}')
        self.source_entry.delete(0, tk.END)
        self.source_entry.insert(0, self.source_file_path)

    def convert_pdf_to_excel(self):
        if not self.source_file_path:
            messagebox.showerror("Error", "Please select a PDF file.")
            return
        if not self.destination_folder_path:
            messagebox.showerror("Error", "Please select a destination folder.")
            return

        try:
            # Open the PDF and read its bookmarks (table of contents)
            pdf_document = fitz.open(self.source_file_path)
            writer = px.Workbook()
            writer.remove(writer.active)  # Remove the default sheet created with the workbook
            
            bookmarks = self.get_bookmarks(pdf_document)

            for page_num in range(len(pdf_document)):
                tables = camelot.read_pdf(self.source_file_path, flavor='stream', pages=str(page_num + 1))

                if tables:
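                    sheet_name = bookmarks.get(page_num, f"{page_num + 1:03d}")
                    # Excel rejects these characters in sheet titles, so replace them
                    sheet_name = "".join("_" if ch in '\\/?*[]:' else ch for ch in sheet_name)
                    if len(sheet_name) > 30:
                        sheet_name = f"{sheet_name[:10]}…{sheet_name[-10:]}-{page_num + 1:03d}"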
                    sheet = writer.create_sheet(title=sheet_name)

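                    # Border style applied to every table cell
                    thin_border = Border(
                        left=Side(style='thin'),
                        right=Side(style='thin'),
                        top=Side(style='thin'),
                        bottom=Side(style='thin')
                    )

                    # Write every table on this page to the sheet. Tables are stacked
                    # vertically with one blank row between them so that a later table
                    # does not overwrite an earlier one.
                    next_row = 1
                    for table in tables:
                        for row in table.df.values:
                            for col_idx, value in enumerate(row):
                                cell = sheet.cell(row=next_row, column=col_idx + 1)
                                cell.value = value
                                cell.border = thin_border  # Apply the border
                            next_row += 1
                        next_row += 1  # Blank row between tables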

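            # Nothing to save if no page produced a table
            if not writer.worksheets:
                self.progress_label.config(text="No tables were found in the PDF.")
                return

            # Save the Excel file to the destination folder
            output_file_path = os.path.join(self.destination_folder_path, os.path.splitext(os.path.basename(self.source_file_path))[0] + ".xlsx")
            writer.save(output_file_path)
            self.progress_label.config(text="Conversion successful!")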

        except Exception as e:
            error_message = f"Error: {str(e)}"
            print(error_message)
            self.progress_label.config(text=error_message)

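    def get_bookmarks(self, pdf_document):
        # Map zero-based page index -> bookmark title.
        # get_toc() returns [level, title, page] entries with 1-based page numbers;
        # if several bookmarks point at the same page, the last one wins.
        toc = pdf_document.get_toc()
        bookmarks = {}
        for level, title, page_num in toc:
            bookmarks[page_num - 1] = title
        return bookmarks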

if __name__ == "__main__":
    root = TkinterDnD.Tk()
    app = PDFtoExcelConverter(root)
    root.mainloop()
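
The sheet-naming step in the program takes bookmark titles as they are, but Excel does not allow the characters \ / ? * [ ] : in sheet titles, limits titles to 31 characters, and requires titles to be unique without regard to case. The helper below is a minimal sketch of one way to handle those rules. safe_sheet_name is a hypothetical name, and the truncation format simply mirrors the one the program already uses.

```python
INVALID_CHARS = set('\\/?*[]:')  # Characters Excel rejects in sheet titles
MAX_LEN = 31                     # Excel's sheet title length limit


def safe_sheet_name(title, page_number, used_names):
    """Return a title Excel accepts; record it in used_names (a set)."""
    cleaned = "".join("_" if ch in INVALID_CHARS else ch for ch in title).strip()
    if not cleaned:
        cleaned = f"{page_number:03d}"  # Fall back to the page number
    if len(cleaned) > MAX_LEN:
        # Same shortening scheme as the program: head, ellipsis, tail, page number
        cleaned = f"{cleaned[:10]}…{cleaned[-10:]}-{page_number:03d}"

    # Append _2, _3, ... until the name is unique (Excel ignores case here)
    candidate = cleaned
    counter = 2
    while candidate.lower() in used_names:
        suffix = f"_{counter}"
        candidate = cleaned[:MAX_LEN - len(suffix)] + suffix
        counter += 1
    used_names.add(candidate.lower())
    return candidate


used = set()
print(safe_sheet_name("Sales: Q1/Q2", 1, used))
print(safe_sheet_name("Sales: Q1/Q2", 2, used))
```

In the converter, one set would be created per workbook and passed to every call, with the returned value used as the title argument of create_sheet.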