A JavaScript library to find duplicate document images.
How does it work?
It extracts the text of each image using OCR, then uses the Levenshtein distance to calculate the similarity between two texts.
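The comparison step can be sketched as follows. This is a minimal illustration of turning a Levenshtein distance into a similarity score, not the library's actual implementation. The function names `levenshtein` and `similarity` and the normalization by the longer string's length are assumptions made for this example.

```javascript
// Hypothetical helper: Levenshtein distance via dynamic programming,
// keeping only the previous row of the edit-distance table.
function levenshtein(a, b) {
  if (a === b) return 0;
  if (a.length === 0) return b.length;
  if (b.length === 0) return a.length;

  // prev[j] = distance between the first i-1 chars of a and the first j chars of b
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(
        prev[j] + 1,        // deletion
        curr[j - 1] + 1,    // insertion
        prev[j - 1] + cost  // substitution (or match)
      );
    }
    prev = curr;
  }
  return prev[b.length];
}

// Hypothetical helper: similarity in [0, 1], where 1 means identical text.
// Dividing by the longer length is an assumed normalization.
function similarity(a, b) {
  const longest = Math.max(a.length, b.length);
  return longest === 0 ? 1 : 1 - levenshtein(a, b) / longest;
}

console.log(levenshtein('kitten', 'sitting')); // 3
```

Two images would then be treated as duplicates when the similarity of their OCR texts exceeds some threshold. The threshold the library uses is not stated here.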
- find
  Finds the duplicate images. You can optionally pass your own OCR results.
  async find(images: HTMLImageElement[], textLinesOfImages?: TextLine[][], progressCallback?: any): Promise<HTMLImageElement[]>
- TextLine
  export interface TextLine { x: number; y: number; width: number; height: number; text: string; }
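As a rough sketch of passing your own OCR results, the example below converts the output of some other OCR engine into the `TextLine` shape and forwards it to `find`. The input box format (`left`, `top`, `right`, `bottom`, `content`) and the helper names `toTextLine` and `findWithOwnOcr` are hypothetical. The `finder` argument is assumed to be any object that exposes the `find` method with the signature above; how such an object is created is not covered here.

```javascript
// Hypothetical helper: map one OCR box from an assumed external format
// into the TextLine shape (x, y, width, height, text).
function toTextLine(box) {
  return {
    x: box.left,
    y: box.top,
    width: box.right - box.left,
    height: box.bottom - box.top,
    text: box.content,
  };
}

// Hypothetical helper: call `find` with precomputed OCR results.
// ocrBoxesPerImage[i] holds the OCR boxes for images[i].
async function findWithOwnOcr(finder, images, ocrBoxesPerImage) {
  const textLinesOfImages = ocrBoxesPerImage.map((boxes) => boxes.map(toTextLine));
  return finder.find(images, textLinesOfImages);
}
```

Keeping the two arrays in the same order matters in this sketch, since each inner `TextLine[]` is assumed to describe the image at the same index.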
Via NPM:
npm install duplicate-document-image-finder
Via CDN:
<script type="module">
import { DuplicateDocumentImageFinder } from 'https://cdn.jsdelivr.net/npm/duplicate-document-image-finder/dist/duplicate-document-image-finder.js';
</script>
License: MIT