Scanning



JJJJJ
25-01-2005, 03:28 PM
How do commercial printers do such a good job with scans? Do they just have a better scanner than I do?
I have pages and pages of typewritten script and pages from old books. Probably printed in Post script. How do I get them into my computer in readable form?
They scan OK but what they look like is nothing like what the originals do. Words missed out and mis-spelled. Impossible to read. I could edit out the errors but that would take twice as long as typing the whole page.

What's got me beaten is why can I scan a page from a modern book, but not from an old one?
:badpc: :badpc: :badpc: :badpc: That's fixed the scanner,
Jack

Biggles
25-01-2005, 03:31 PM
Jack - are you talking about optical character recognition? It's not clear from your opening comment about commercial printers.

JJJJJ
25-01-2005, 03:38 PM
Hi Bruce. Welcome back.
Just a plain everyday A4 scanner. An HP 3670.
If OCR is the problem why does it do such a good job with a modern book?
Jack

Biggles
25-01-2005, 03:51 PM
OCR is dependent on being able to tell the letter forms from the background paper, blemishes etc etc. A modern book, with clean white pages and crisp typeface, is likely to OCR better than an old one where the pages have yellowed. Also, if the paper is light, so that text on the other side of the page can show through when the scanner light passes across the page, that can mess up OCR too. And if the paper is textured as opposed to smooth (as some poor-quality old paperback paper is) then that also can mess up the result, as the small indentations in the paper create shadows which get misinterpreted by the OCR software.

In other words, anything that reduces the distinction between the letters and the page makes OCR harder. We used to do a fair bit of OCR here back in the dark ages, and sometimes I found it easier to photocopy the page first, increasing contrast and "whitening" up the paper and then OCR the photocopy not the original.
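
For anyone who would rather do the "whitening" in software than on a photocopier, here is a rough sketch of the same idea. It is only an illustration: the function name, the black and white points, and the idea of holding the greyscale page as a plain list of rows (0 = black, 255 = white) are all assumptions, not what any particular scanner or OCR package does.

```python
def whiten_page(pixels, black_point=60, white_point=160):
    """Stretch the contrast of a greyscale page.

    pixels is a list of rows of ints (0 = black, 255 = white).
    Anything at or below black_point becomes pure black, anything at or
    above white_point becomes pure white, and values in between are
    spread evenly. The default points are guesses for yellowed paper.
    """
    span = white_point - black_point
    result = []
    for row in pixels:
        new_row = []
        for value in row:
            scaled = (value - black_point) * 255 // span
            new_row.append(max(0, min(255, scaled)))  # clamp to 0..255
        result.append(new_row)
    return result


# Hypothetical sample: yellowed paper around 165-172, ink at 70 and 90.
yellowed = [[170, 165, 90], [70, 172, 168]]
print(whiten_page(yellowed))
```

The paper values all end up at 255 while the ink stays dark, which is the extra distinction between letters and page that the OCR software needs.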

Some letter forms are easier to recognise than others too, and naturally the OCR package you use also has an effect - not all are as good as others.

And yes - commercial printers have much better scanners. Scanning for commercial printing is done on large drum scanners, rather than small flatbed scanners. You get what you pay for, but then who wants to spend $50,000 for a scanner at home!

Graham L
25-01-2005, 04:03 PM
Quite a few years ago the new technology craze struck Parliament. Why didn't they get an efficient way of producing Hansard?

So, they got the Hansard reporters typewriters which would produce OCR type when they transcribed the shorthand records. And of course, this nice clean copy could be scanned into the typesetting system.

But. :(

The politicians get to "correct" the Hansard reports of what they said. So they were given the nice clean typed sheets to proofread. They scribbled over them as they tried to convert the actual speeches into English, added dark rings from the tea cups (and whisky glasses) ...

The system was abandoned.

JJJJJ
25-01-2005, 04:08 PM
Just a sample of what I'm trying to copy.
.J
JJ'.
'" ~ t.
.
t"... g
~)~
. .
t
~.
~ ~:;
~ (\i ~"~
.
~
~ '\ ~ .,~
...,;~
1
r;
5>
~ -,
~~
t --'l~~
, ~'~
~,'
Ct-. ,
f\
, ' I
10 Beach Road, ! '- /
Wanganui, " //
19th December'1939.
Mr.G.H.Scholefield, Box 1369 J WET...LINGTON.
,Dear Sir,
. Replying to your letter of the 7th inst., and trust the
foregoingparticul4TS will be of assistance.
. ~
ijeI1'ryJi\'a~arrj! Born in London, August 16th.1816.
Arrived in Wanganui, Aug.19th.1841, died in Wanganui, Nov.3rd.1898. His wife who arrived with him, was born
in London, 28th.June 1816. Her maiden name was J~e ~e~m,e;aI'e]:. She died in'Wanganui. His eldest SOD,
Thomas Wellington Nathan, Was born in WellingtQD,
8th April ~841, this is the correct birth date,
( having received advice from my Sister, Mrs.J.H.Gratebatch)
he died in Palmerston North on the 2nd March 1909.
Anthony Nathan died in Taihape, a few ~ears ago. He
married Sarah Anne Harris, daughter of Samuel Gregory
Harris.
Joseph Nathan died in Palmerston North a few years ago;
he marr~ed Annie Penfold.
"""', , William Nathan died a few years ago, somewhere near Napier.
. William was married, but I never heard the name of his
wife.
George Nathan, I think is still living, but I do not know; he married Mary Connery.
Jane Nathan died in Wanganui, Sept!lst .1912; she married
,. James Rapley 2., Cie...~~ C>it'",,'J:6~f.
Mary Nathan died in Wellington a few years ago j she married Thomas Bush.
Winifred I have not heard~of for years; she married
James Morey. .
Susan died in Wellington; she married Mr .Coker..
Norah Margaret Carol died in Wanganui, April 1s t .1878 j she martied William Gardner.
. I have 'heard of my .grandparents speak of another ~on . ~tV-fA
..
.
named Charlie, who died when \!l1ite young. V,,(n"" -. tJyQ;
(''''Jane, the little girl who came out with her parentis, was
~'
L
taken by the Maoris, and was kept for some time before
(
being released. She could speak the Maori languag~ just as efficiently as a Maori. She was the aunt who took
r
my siater and myself, when our mother died in 1871. y/
i I understand that there are many relatives living
I in New Zealand at the present time, but I have not been
in contact with any for years.
.f'
I am,

JJJJJ
25-01-2005, 04:13 PM
Would this work?
If I buy a digital camera and photograph each page, I can get that into my computer. But will I be able to edit it in Word? Will it convert to *.doc?

And would a digital camera take a clear enough photo?
Jack

godfather
25-01-2005, 04:17 PM
Jack, to reinforce what BB is saying, a couple of years back I was working overseas and a company I was investigating was spending in excess of $100,000 NZ to get accurate scanning of text, in a commercial environment.

Much of that was for the software.

Unless you are prepared to spend serious money on hardware and software, it's not going to be an exact science. All you can do is try to get the best hard copy quality input possible, with the largest size of font. Even the font type significantly affects quality of OCR.

leonidas5
25-01-2005, 04:20 PM
The key, other than mentioned above about the quality of the original, is the ability of the OCR engine. I have been using Presto for some time and it does a pretty good job. The hardest thing to copy from is newsprint - thin and the reverse shows.

Presto is by an outfit called Newsoft - google will pull it up. I think they still have a trial period. Worth a look.

Leon

Metla
25-01-2005, 04:20 PM
How about a hand text scanner?

I priced one of these babies up for a guy a couple weeks back and they seemed pretty swish.

Although it would require more elbow grease to be applied, and I have no idea of the quality of the finished product.

Biggles
25-01-2005, 04:21 PM
Sorry Jack - I'm confused. What you've got there is obviously a page that has already been poorly OCRed. Are you proposing to OCR it again?

As for digicam versus a scanner - a scanner should do a better job than a digicam by simple dint of the fact that the page is flat and the image surface is "flat" in relation to it. I've copied pages with a camera when I haven't been able to take them to a scanner, and it can be damn hard to get a good clean shot - the biggest problem is you get variation in exposure across the page. That is, some of the page shows a white background, but at the edges of the image it often goes dark grey and that really messes up OCR, since the "whiteness" of the page varies significantly across the image.
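
One way to even out that varying "whiteness" afterwards is to estimate how bright the paper is around each pixel and divide by it. The sketch below is a minimal version of that idea under some loud assumptions: the function name and the radius are made up, the photo is held as a plain list of rows of greyscale values, and the brightest pixel in the neighbourhood is taken to be bare paper, which only holds if the window is wider than the pen strokes.

```python
def flatten_background(pixels, radius=2):
    """Even out exposure across a photographed page.

    pixels is a list of rows of ints (0 = black, 255 = white).
    For each pixel, the brightest value within `radius` pixels is taken
    as the local paper brightness (an assumption), and the pixel is
    rescaled so that this local paper brightness becomes 255.
    """
    height, width = len(pixels), len(pixels[0])
    result = []
    for y in range(height):
        row = []
        for x in range(width):
            background = max(
                pixels[ny][nx]
                for ny in range(max(0, y - radius), min(height, y + radius + 1))
                for nx in range(max(0, x - radius), min(width, x + radius + 1))
            )
            background = max(background, 1)  # avoid dividing by zero
            row.append(min(255, pixels[y][x] * 255 // background))
        result.append(row)
    return result


# Hypothetical strips: same ink on paper, one well lit, one in a dark corner.
well_lit = [[200, 50, 200]]
dark_corner = [[100, 25, 100]]
print(flatten_background(well_lit, radius=1))
print(flatten_background(dark_corner, radius=1))
```

Both strips come out the same after flattening, so the dark grey corner no longer looks different from the middle of the page.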

JJJJJ
25-01-2005, 04:37 PM
That's right. It was scanned or copied or whatever at National Archives.
I want to get it into my computer so I can include it or part of it in a book I'm writing.
A copy of an original document looks more realistic than an obviously new typewritten copy.

Yes I should have mentioned that everything I am trying to scan is a copy or a scan of an original document. All the originals are in the Archives or the Turnbull Library and I don't think they would give them to me.

Jack

Biggles
25-01-2005, 04:44 PM
OK - so you'd rather have an image of the original, than an OCR copy of the text on it - right?

My mum uses her digicam for this purpose and with practice you can get a reasonable result - which can be much improved by adjusting the image in an image editor afterwards (assuming that is that you want to clean the image up to look good not to OCR well). The problem is you
1] may not be allowed to use a flash
2] may not want to anyhow since it will "blow out" the resulting image

so a tripod may be necessary to get a good image without camera shake blurring it.

A camera that does good work at high ISO settings - ISO400 and up - would be a useful camera for this kind of work.

Metla
25-01-2005, 04:49 PM
uh....why not just scan it as an image?

Scouse
25-01-2005, 05:50 PM
Hi JJJJ. Agree with Metla. If it is a copy of the original you want for authenticity, etc., give the OCR a miss and just scan the thing. If you want to play with OCR, there are several freebies around, all of which create different results, each requiring different amounts of correction and editing. ;)

Graham L
25-01-2005, 06:17 PM
~ ~:;
~ (\i ~"~
.
~
~ '\ ~ .,~
...,;~
1
r;
5>
~ -,

Looks pretty good OCR results to me. Can't you read that? ;)

To get scanned pictures of (photocopied?) originals, it will pay to experiment with the scanner settings. You might find that black&white gives better contrast than grey-scale or colour.
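
As a rough picture of what the black&white setting does compared with grey-scale, here is a small sketch. The function name and the threshold of 128 are assumptions for illustration; a real scanner driver picks its own cut-off, and that cut-off is usually the setting worth experimenting with.

```python
def to_black_and_white(pixels, threshold=128):
    """Turn a greyscale page into pure black and white.

    pixels is a list of rows of ints (0 = black, 255 = white).
    Every value at or above the threshold becomes white (255),
    everything below it becomes black (0).
    """
    return [[255 if value >= threshold else 0 for value in row] for row in pixels]


# Hypothetical strip: a faint grey smudge (200), ink (90), greyish paper (130).
print(to_black_and_white([[200, 90, 130]]))
```

The smudge and the greyish paper both turn white and only the ink is left, which is where the extra contrast comes from. Set the threshold too high and faint ink disappears as well.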

Nomad
25-01-2005, 07:49 PM
if you want to use a camera, here is an article.
Link. (http://www.nikon.co.jp/main/eng/photo_world/kumon_dsc/kd02_e.htm#2.1.1)

Murray P
25-01-2005, 08:21 PM
I agree with those that say, why not scan in an image of the document. From there you have an image that you can enhance to your heart's content in an image editor.

I do this sort of thing all the time with some really messed up documents, including photo copies of photo copies of photo copies, poorly done diazo plans or photo copies of same or photo copies of folded, used and abused diazo plans (the worst).

To get the best results, scan at around 200-300 dpi in grey scale, save a copy in an image editor's native format. Then use the image editor to adjust contrast/brightness/flashfill, rub out spots, creases and copier speckling. Save back to a lighter weight format like JPEG (reduce size/resolution to suit) for inserting into your document or printing off or try the OCR software on it now if you want to insert it as text (a quote) rather than as a doc within a doc.
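
To give a feel for what 200 versus 300 dpi means for an A4 page, here is a back-of-envelope calculation. The function name is made up, and the one byte per pixel assumes 8-bit grey scale with no compression, so the saved file will normally be smaller.

```python
MM_PER_INCH = 25.4


def scan_size(width_mm, height_mm, dpi, bytes_per_pixel=1):
    """Return (width_px, height_px, uncompressed_bytes) for a scan.

    bytes_per_pixel=1 assumes 8-bit grey scale; use 3 for 24-bit colour.
    """
    width_px = round(width_mm / MM_PER_INCH * dpi)
    height_px = round(height_mm / MM_PER_INCH * dpi)
    return width_px, height_px, width_px * height_px * bytes_per_pixel


# A4 is 210 mm x 297 mm.
for dpi in (200, 300):
    width_px, height_px, size = scan_size(210, 297, dpi)
    print(f"A4 at {dpi} dpi: {width_px} x {height_px} px, "
          f"about {size / 1_000_000:.1f} MB uncompressed")
```

Going from 200 to 300 dpi more than doubles the amount of data per page, which is why it pays to reduce the size again before dropping the image into a document.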

godfather
25-01-2005, 10:11 PM
I too use an image (not OCR) of documents embedded into Word for reports. Just scan as a picture, not as a document?