Indexing scanned PDF files



kiwiboi82
04-05-2011, 09:11 PM
Firstly, forgive me if there is already a post about this; I did search but didn't come up with anything...


On behalf of a friend who wants help with his business, we'd like to know if anyone is aware of any low-hassle ways of indexing PDF files that are scanned images of text. (E.g. he has received a fax and scanned it onto his computer as an image, but wants to be able to search for the text inside the document he has just scanned.)

We are aware of Adobe Acrobat Pro's OCR feature, but my friend has thousands of previously scanned images with text in them, so it isn't practical to go through every single file running the OCR feature.

We are also aware that Google has OCR built into its search indexing service, but my friend doesn't want to upload thousands of confidential PDF files to a public service just to get them indexed.

We have also found 2 programs online so far:
Nitro PDF Professional (http://www.nitropdf.com/professional/ocr.htm) and Evernote (http://www.evernote.com/about/premium/), but we are still confirming whether these programs can use OCR to index a large number of scanned files and make the text inside them instantly searchable.

So we'd like to know if anyone is aware of any different or simpler methods of doing this, or whether it is possible at all.

(Also forgive me if anything above is incorrect, we are just going off blog postings on other sites)

Thank you for any assistance anyone is able to provide :)

Barnabas
05-05-2011, 10:37 AM
try this

http://www.foxitsoftware.com/pdf/ifilter/

Erayd
05-05-2011, 11:34 AM
Quoting Barnabas: "try this http://www.foxitsoftware.com/pdf/ifilter/"

That product doesn't appear to be capable of OCR, which the OP needs.

Barnabas
05-05-2011, 12:54 PM
Yup, you're right... my bad, I didn't read the post properly.

I currently scan stuff into OneNote, which has built-in OCR for PDFs, JPGs etc. Probably not practical for thousands of documents though.

psycik
05-05-2011, 02:59 PM
A couple of really simple methods...

First, use the file name to record certain criteria (amounts etc.); the file date gives you the month of the bill. Then store everything in a directory structure by year-month (yyyy-mm) and use the built-in Windows search to find it.
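The yyyy-mm filing step could be scripted rather than done by hand. This is only a rough Python sketch: it assumes the scans land in one "inbox" folder and that each file's modified date is a good enough stand-in for the bill's month. The names `month_folder` and `file_by_month` and the folder layout are made up for illustration.

```python
import shutil
import time
from pathlib import Path


def month_folder(pdf: Path) -> str:
    """Return the yyyy-mm folder name for a file, based on its modified time."""
    return time.strftime("%Y-%m", time.localtime(pdf.stat().st_mtime))


def file_by_month(inbox: Path, archive: Path) -> list:
    """Move every PDF in `inbox` into archive/yyyy-mm/ and return the new paths."""
    moved = []
    # sorted() takes a snapshot, so moving files does not disturb the loop.
    for pdf in sorted(inbox.glob("*.pdf")):
        target_dir = archive / month_folder(pdf)
        target_dir.mkdir(parents=True, exist_ok=True)
        target = target_dir / pdf.name
        if target.exists():
            continue  # never overwrite; leave the clash in the inbox to sort out by hand
        shutil.move(str(pdf), str(target))
        moved.append(target)
    return moved
```

Called as `file_by_month(Path("C:/scans/inbox"), Path("C:/scans/archive"))` (hypothetical paths), a file last modified in May 2011 would end up under `archive/2011-05/`.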

Second, get an OCR-capable PDF generator. Personally I like Adobe Acrobat Professional for this; it works well and has scanning built in.

Then, coupled with the directory structure, you can use full-text PDF searching. But this is really slow, because it scans the text of every PDF.
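One way around the slowness is to read the text once and keep a word index, so a search never has to open the PDFs. A minimal Python sketch follows. It assumes the OCR step has already saved each PDF's text in a `.txt` file of the same name next to it; that layout, and the names `build_index` and `search`, are assumptions for illustration and not something any particular OCR product provides.

```python
import re
from collections import defaultdict
from pathlib import Path

WORD = re.compile(r"[a-z0-9]+")


def build_index(root: Path) -> dict:
    """Map each lowercase word to the set of PDFs whose sidecar text contains it."""
    index = defaultdict(set)
    for txt in root.rglob("*.txt"):
        pdf = txt.with_suffix(".pdf")  # assumed: scan.txt sits beside scan.pdf
        text = txt.read_text(encoding="utf-8", errors="ignore").lower()
        for word in WORD.findall(text):
            index[word].add(pdf)
    return index


def search(index: dict, query: str) -> set:
    """Return the PDFs that contain every word in the query."""
    words = WORD.findall(query.lower())
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results
```

Building the index is still a full pass over every file, but it only has to happen once (or when new scans arrive); after that each search is a handful of set lookups.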

After these methods you could start looking at packages. I couldn't really find any simple enough for the 2, maybe 3, indexes attached to a document type, so I ended up writing my own: just a database with document types (power bill, phone bill) that references a PDF file.

There's nothing stopping someone from doing something similar with MS Access or Excel, using a file:// hyperlink to reference the PDF.
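To give a feel for how small such a database can be, here is a sketch using Python's built-in sqlite3 module in place of Access. It is not the database described above: the table names, columns and helper functions (`open_catalogue`, `add_document`, `find_documents`) are all invented for illustration.

```python
import sqlite3


def open_catalogue(db_path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the catalogue: document types plus documents pointing at PDFs."""
    conn = sqlite3.connect(db_path)
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS doc_type (
            id   INTEGER PRIMARY KEY,
            name TEXT UNIQUE NOT NULL          -- e.g. 'power bill', 'phone bill'
        );
        CREATE TABLE IF NOT EXISTS document (
            id        INTEGER PRIMARY KEY,
            type_id   INTEGER NOT NULL REFERENCES doc_type(id),
            doc_date  TEXT NOT NULL,           -- ISO date, e.g. 2011-05-04
            reference TEXT,                    -- account or invoice number
            pdf_path  TEXT NOT NULL            -- where the scanned PDF lives
        );
    """)
    return conn


def add_document(conn, type_name, doc_date, reference, pdf_path):
    """Record one scanned PDF, creating its document type if needed."""
    conn.execute("INSERT OR IGNORE INTO doc_type (name) VALUES (?)", (type_name,))
    type_id = conn.execute(
        "SELECT id FROM doc_type WHERE name = ?", (type_name,)
    ).fetchone()[0]
    conn.execute(
        "INSERT INTO document (type_id, doc_date, reference, pdf_path) "
        "VALUES (?, ?, ?, ?)",
        (type_id, doc_date, reference, pdf_path),
    )
    conn.commit()


def find_documents(conn, type_name, month):
    """Return PDF paths of one document type for a yyyy-mm month, oldest first."""
    rows = conn.execute(
        "SELECT d.pdf_path FROM document d "
        "JOIN doc_type t ON t.id = d.type_id "
        "WHERE t.name = ? AND d.doc_date LIKE ? ORDER BY d.doc_date",
        (type_name, month + "-%"),
    )
    return [row[0] for row in rows]
```

For example, after `add_document(conn, "power bill", "2011-05-04", "ACC-1", "C:/scans/2011-05/power.pdf")`, calling `find_documents(conn, "power bill", "2011-05")` returns a list containing that one path.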

At the other end of the scale is something like EMC's Documentum, but I'd say that's way over the top (and it costs about US$100K), though its concepts are similar for the document types, storage etc.

Some other options are Knowledge Management Document System (I looked at this for home, and it is relatively cheap compared to Documentum), or going up to Microsoft SharePoint.

Good luck.