Invoice Similarity Comparison

Overview

This project compares invoices based on their text content and visual layout. It extracts text from PDF invoices, analyzes key features, and calculates similarities using three methods:

Cosine Similarity: Measures the cosine of the angle between feature vectors.
Jaccard Similarity: Measures the overlap between sets of keywords.
Image Similarity: Compares visual layouts of invoices.

The goal is to find the most similar invoice from a database of training invoices to a given test invoice.

Requirements

Python 3.6 or higher
Libraries: PyPDF2, re, numpy, scikit-learn, Pillow, scikit-image

Installation

You can install the required libraries using pip:

bash pip install PyPDF2 numpy scikit-learn Pillow scikit-image

##Code Overview

#Cell 1: Import Required Libraries This cell imports all necessary libraries used for text extraction, feature extraction, and similarity calculations.

#Cell 2: Define Functions for Text Extraction and Feature Extraction Defines functions to extract text from PDFs and features such as invoice number, date, amount, and keywords.

#Cell 3: Define Functions for Calculating Similarity Contains functions to compute:

Cosine Similarity: Between feature vectors. Jaccard Similarity: Between sets of keywords. Image Similarity: Between visual representations of invoices.

#Cell 4: Extract Text and Features from Training PDFs Extracts text and features from a set of training PDF invoices.

#Cell 5: Extract Text and Features from Testing PDFs Extracts text and features from a set of testing PDF invoices.

#Cell 6: Store Training Data in In-Memory Database Stores extracted features of training invoices in an in-memory database.

#Cell 7: Define Function to Find the Most Similar Invoice Defines a function that finds the most similar invoice in the database based on text and image similarity.

#Cell 8: Find and Print the Most Similar Invoices for Test PDFs Finds and prints the most similar invoices for each test PDF with similarity scores displayed as percentages.

##Usage Update File Paths: Ensure the pdf_directory and test_pdfs paths in the code are correct and point to the directories containing your PDF files.

##Run the Code: Execute the cells sequentially in a Python environment (e.g., Jupyter Notebook).

##View Results: The most similar invoice for each test PDF will be printed along with the similarity scores in percentage format.

##Example Output Most similar invoice to 'invoice_77098.pdf': invoice_3.pdf with similarity score 85.6789% Most similar invoice to 'invoice_102857.pdf': invoice_1.pdf with similarity score 90.1234%

Name		Name	Last commit message	Last commit date
Latest commit History 4 Commits
Assignment.ipynb		Assignment.ipynb
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Invoice Similarity Comparison

Overview

Requirements

Installation

About

Releases

Packages

Languages

Sudhanshu1805/Document-Similarity-Matching

Folders and files

Latest commit

History

Repository files navigation

Invoice Similarity Comparison

Overview

Requirements

Installation

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages