How good is OCR?

Subject: How good is OCR?
From: "Hart, Geoff" <Geoff-H -at- MTL -dot- FERIC -dot- CA>
To: "TECHWR-L" <techwr-l -at- lists -dot- raycomm -dot- com>
Date: Thu, 13 Nov 2003 09:06:21 -0500


Marc Santacroce wonders: <<A competitor is offering another product that
consists of an MS Word outline into which customers can cut-and-paste
portions of their existing manuals. I can see this working for those who
have an MS Word version of their manuals, but many of the customer base just
have a hardcopy version. Has OCR software improved such that this is a
viable option?>>

OCR has gotten pretty good, but it's still topping out at an accuracy of
99.9% or thereabouts, with much lower rates if you don't know how to work
your scanner or the software properly. That still means an error rate of 1
in 1000 characters--or a typo every 200 words or so. That's probably
acceptable for quick and dirty work, but less so if you're trying to produce
a really professional-looking product.
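
To make that arithmetic concrete, here is a minimal Python sketch. It
assumes an average of about five characters per word, which is a rough
rule of thumb rather than a measured figure, and the function names are
purely illustrative.

```python
def words_per_error(char_accuracy, chars_per_word=5.0):
    """Average number of words between errors at a given per-character accuracy.

    chars_per_word is an assumed average (about 5 for English prose).
    """
    error_rate = 1.0 - char_accuracy          # e.g. 0.001 at 99.9% accuracy
    chars_per_error = 1.0 / error_rate        # characters between errors
    return chars_per_error / chars_per_word   # convert characters to words


def expected_errors(word_count, char_accuracy, chars_per_word=5.0):
    """Expected number of character errors in a document of word_count words."""
    return word_count * chars_per_word * (1.0 - char_accuracy)


if __name__ == "__main__":
    # 99.9% is the accuracy quoted above; 50,000 words is a hypothetical manual.
    print(round(words_per_error(0.999)))
    print(round(expected_errors(50_000, 0.999)))
```

At 99.9% accuracy this works out to roughly one error every 200 words, so
a hypothetical 50,000-word manual would need something like 250
corrections. Drop the accuracy to 99% and the same manual would need about
ten times as many.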

It can also be difficult to work with threaded multi-column layouts because
you have to manually define the text flow--and in poorly designed layouts,
that flow isn't always obvious even to the reader. Shouldn't be a problem
with manuals, but might be for fancy white papers and "tool tips"
newsletters, for instance.

One thing I can't say is how well current OCR handles mixed text and
numbers. In many fonts, the number 1 and the lower-case L are pretty much
indistinguishable to the human eye, so I can't imagine the software doing a
much better job. Shouldn't generally be a problem when the text and numbers
are separate (e.g., tables vs. body text), but might pose the occasional
problem elsewhere, such as in scientific or engineering manuals.
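
One way to catch the worst of these after scanning is to flag any token
that mixes digits with look-alike letters and send it to a human for
review. The following Python sketch illustrates the idea; the list of
confusable characters and the function name are assumptions chosen for
illustration, not part of any OCR package.

```python
import re

# Letters that are easily mistaken for digits (illustrative, not exhaustive).
CONFUSABLE = {"l": "1", "I": "1", "O": "0", "o": "0"}

# A token is any unbroken run of ASCII letters and digits.
TOKEN = re.compile(r"[A-Za-z0-9]+")


def suspicious_tokens(text):
    """Return tokens that contain both a digit and a look-alike letter."""
    flagged = []
    for token in TOKEN.findall(text):
        has_digit = any(ch.isdigit() for ch in token)
        has_lookalike = any(ch in CONFUSABLE for ch in token)
        if has_digit and has_lookalike:
            flagged.append(token)
    return flagged


if __name__ == "__main__":
    sample = "Set valve to l5 psi and torque to 40 Nm; see model 2O0."
    print(suspicious_tokens(sample))
```

This only flags candidates; it cannot tell which reading is correct, and
it will also flag legitimate part numbers that happen to combine letters
and digits. It also misses a lone "l" standing in for "1", so it narrows
the proofreading job rather than replacing it.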

<<I can't see this working for a pdf import.>>

Actually, it should work nearly flawlessly for PDF because the characters
are all clearly defined--there's no guesswork deciding what characters were
actually entered. Then again, I've edited manuscripts in which the author
typed capital-O instead of zero, lower-case L instead of 1, and ` (the grave
accent) instead of an apostrophe, so let's change the "flawlessly" to "quite
well given the limits of human technology". <g>

--Geoff Hart, ghart -at- [delete]videotron -dot- ca
Forest Engineering Research Institute of Canada
580 boul. St-Jean
Pointe-Claire, Que., H9R 3J9 Canada

"I have always wished that my computer would be as easy to use as my
telephone. My wish has come true. I no longer know how to use my
telephone."--Bjarne Stroustrup (originator of C++ programming language)



