<!-- received="Thu Nov 13 04:04:51 1997 PST" -->
<!-- sent="Thu, 13 Nov 1997 12:55:15 +0100 (MET)" -->
<!-- name="becka@rz.uni-duesseldorf.de" -->
<!-- email="becka@rz.uni-duesseldorf.de" -->
<!-- subject="Re: OCR Software..?!" -->
<!-- id="m0xVxrf-000BW0C@charon.beck-sw.de" -->
<!-- inreplyto="199711130017.TAA14813@lemur.magnet.com" -->
<title>sane-devel: Re: OCR Software..?!</title>
<h1>Re: OCR Software..?!</h1>
<a href="mailto:becka@rz.uni-duesseldorf.de"><i>becka@rz.uni-duesseldorf.de</i></a><br>
<i>Thu, 13 Nov 1997 12:55:15 +0100 (MET)</i>
<p>
<ul>
<li> <b>Messages sorted by:</b> <a href="date.html#107">[ date ]</a><a href="index.html#107">[ thread ]</a><a href="subject.html#107">[ subject ]</a><a href="author.html#107">[ author ]</a>
<!-- next="start" -->
<li> <b>Next message:</b> <a href="0108.html">Joey Nelson: "Re: Problem Scanning with UMAX S6E."</a>
<li> <b>Previous message:</b> <a href="0106.html">Michael Burghart: "Re: OCR Software..?!"</a>
<li> <b>In reply to:</b> <a href="0100.html">Andrew Kuchling: "Re: OCR Software..?!"</a>
<!-- nextthread="start" -->
<li> <b>Next in thread:</b> <a href="0105.html">Jonathan Buzzard: "Re: OCR Software..?!"</a>
<!-- reply="end" -->
</ul>
<!-- body="start" -->
<i>> my idea was that scanned data would wind up in a Tk</i><br>
<i>> text editing box, with possible errors (where the confidence value of</i><br>
<i>> the recognition is low) highlighted in red.</i><br>
<p>
You might eventually need a "segmentation preview" that (optionally) lets<br>
you manually adjust the separation of text and graphics and the<br>
sequence in which the text boxes are to be processed.<br>
<p>
Moreover, it would be nice if you could turn every manual step on and<br>
off. That way you could simply run a "quick-and-dirty" mass conversion<br>
and correct errors the next morning, once the stack of sheets has been<br>
fed through the scanner, and still have interactive operation available.<br>
<p>
<i>> Recognition is the complicated part, of course. First you need to</i><br>
<i>> scan the image, then it's usually converted from grey-scale to 2-level</i><br>
<i>> black-and-white. Documents are often not perfectly aligned when</i><br>
<i>> they're scanned, so the angle at which they're tilted (called the</i><br>
<i>> "skew angle") has to be measured and compensated for.</i><br>
<p>
Yeah. If you want to compensate on the image side, do so before converting<br>
to b/w. That way less quality is lost.<br>
<p>
Moreover, a "de-noise" filter would be appropriate to remove speckles.<br>
<p>
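A minimal sketch of such a de-noise step, assuming the image is a plain list of rows of grayscale values. The 3x3 median filter and the name median_filter_3x3 are choices made for illustration; the text above does not name a specific filter.

```python
def median_filter_3x3(image):
    """Return a despeckled copy of a grayscale image.

    `image` is a list of equally long rows of integers (0 = black,
    255 = white). Every interior pixel is replaced by the median of
    its 3x3 neighbourhood; border pixels are copied unchanged.
    """
    height = len(image)
    width = len(image[0]) if height else 0
    result = [row[:] for row in image]  # copy, so the input stays untouched
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            window = sorted(
                image[y + dy][x + dx]
                for dy in (-1, 0, 1)
                for dx in (-1, 0, 1)
            )
            result[y][x] = window[4]  # middle of the 9 sorted values
    return result
```

An isolated dark pixel on a white page is outvoted by its eight neighbours and disappears, while a straight black/white edge survives because the majority on each side of the edge does not change.
<p>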
At small text sizes, it might be nice to keep a grayscale image<br>
(though this considerably complicates the algorithms). At least you should<br>
use an appropriate combined sharpening/smoothing filter (one that preserves<br>
edges, but smooths areas) to get a good image of the letters.<br>
<p>
<i>> Then the image has to be segmented into words, and words into letters; </i><br>
Or digraphs (ligatures). Many printed typefaces use them. An example is the<br>
combination "fi". In printed form, the dot of the i is often formed by a dot<br>
attached to the upper end of the f. Typeset a word containing this combination<br>
with TeX to see what I mean.<br>
<p>
<i>> each letter is then recognized, and usually a confidence value is </i><br>
<i>> attached to each letter.</i><br>
Yep. The same should happen at the word level.<br>
<p>
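One way to get a word-level value is to combine the per-letter confidences. The sketch below is only an assumption about how that could be done: it reports the weakest letter and the geometric mean, and the function name word_confidence is hypothetical.

```python
import math


def word_confidence(letter_confidences):
    """Combine per-letter confidences (each in [0, 1]) into word scores.

    Returns a tuple (weakest, geometric_mean). The weakest letter shows
    whether any single character is doubtful; the geometric mean gives
    an overall score that does not shrink just because a word is long.
    """
    if not letter_confidences:
        return 0.0, 0.0
    weakest = min(letter_confidences)
    if weakest <= 0.0:
        return 0.0, 0.0  # one impossible letter makes the word impossible
    log_sum = sum(math.log(c) for c in letter_confidences)
    geometric_mean = math.exp(log_sum / len(letter_confidences))
    return weakest, geometric_mean
```

A front end could then highlight a whole word when either value drops below a threshold of its choosing.
<p>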
<i>> Often there's a post-processing step which uses a language dictionary </i><br>
<i>> to correct errors; for example, if you're scanning English text, 'rn' </i><br>
<i>> might be a scanning error for "m".</i><br>
<p>
Yes. The matching algorithm for the dictionary search needs to be<br>
chosen in a way that takes typical scanning/matching errors into account.<br>
<p>
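A minimal sketch of a dictionary lookup that knows about typical misreadings. Only the 'rn' for "m" pair comes from the quoted text; the other entries in CONFUSIONS, and the names candidate_words and correct, are assumptions for illustration. It only tries one substitution per word, so it is a starting point rather than a full matcher.

```python
# (misread, intended) pairs. Only ("rn", "m") is taken from the thread;
# the remaining pairs are hypothetical examples.
CONFUSIONS = [("rn", "m"), ("cl", "d"), ("1", "l"), ("0", "O")]


def candidate_words(word, confusions=CONFUSIONS):
    """Yield `word` itself, then every variant with one misreading undone."""
    seen = {word}
    yield word
    for misread, intended in confusions:
        start = word.find(misread)
        while start != -1:
            variant = word[:start] + intended + word[start + len(misread):]
            if variant not in seen:
                seen.add(variant)
                yield variant
            start = word.find(misread, start + 1)


def correct(word, dictionary, confusions=CONFUSIONS):
    """Return the first candidate found in `dictionary`, else `word`."""
    for candidate in candidate_words(word, confusions):
        if candidate in dictionary:
            return candidate
    return word
```

Because the unchanged word is tried first, a correctly recognized "corn" is never turned into "com" as long as "corn" is in the dictionary.
<p>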
At the letter level, you could use language-specific hidden Markov chains to<br>
predict the probability of certain next letters, which can be helpful for<br>
deciding between several possibilities. E.g. if the last recognized<br>
character was "q", the probability of the next one being "u" is orders of<br>
magnitude higher than that of it being "n".<br>
<p>
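The "q" then "u" example can be sketched with a plain letter-pair (bigram) model, which is a simpler stand-in for the Markov approach mentioned above, not a hidden Markov model. The training word list, the add-one smoothing, and all function names are assumptions for illustration.

```python
from collections import Counter, defaultdict


def train_bigrams(words):
    """Count how often each letter follows each other letter."""
    counts = defaultdict(Counter)
    for word in words:
        word = word.lower()
        for previous, current in zip(word, word[1:]):
            counts[previous][current] += 1
    return counts


def next_letter_probability(counts, previous, candidate, alphabet_size=26):
    """Add-one smoothed estimate of P(candidate | previous)."""
    following = counts.get(previous, Counter())
    total = sum(following.values())
    return (following[candidate] + 1) / (total + alphabet_size)


def pick_letter(counts, previous, scored_candidates):
    """Choose among (letter, recognizer_confidence) pairs.

    Each recognizer confidence is weighted by how likely the letter is
    to follow `previous` according to the bigram counts.
    """
    best = max(
        scored_candidates,
        key=lambda item: item[1] * next_letter_probability(counts, previous, item[0]),
    )
    return best[0]
```

With counts trained on a few words containing "qu", a recognizer that slightly prefers "n" after a "q" is overruled in favour of "u".
<p>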
<i>> The two major techniques for recognizing letters seems to be either</i><br>
<i>> neural networks, or making a vector from easily measured</i><br>
<i>> characteristics of the bitmap containing a letter; for example, xocr</i><br>
<i>> takes a histogram of the letter at 128 different angles. This</i><br>
<i>> technique dates back at least to the 1970s, but neural networks seem</i><br>
<i>> to be what all modern systems use.</i><br>
<p>
The XOCR technique is not good. Unless it has changed since I last looked,<br>
it _counted_pixels_ (!) from these angles. This doesn't even distinguish<br>
an O from a dot. Using the number of black/white transitions is a better<br>
measure.<br>
<p>
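A small sketch of why transitions are the better measure, using horizontal scan lines only (a single angle, chosen for brevity) and two hand-drawn glyphs. The glyphs and function names are made up for illustration and are not taken from XOCR.

```python
RING = [
    "..##..",
    ".#..#.",
    ".#..#.",
    "..##..",
]

DOT = [
    "..##..",
    "..##..",
    "..##..",
    "..##..",
]


def to_bits(glyph):
    """Convert rows of '#'/'.' characters into rows of 1/0 integers."""
    return [[1 if ch == "#" else 0 for ch in line] for line in glyph]


def pixel_profile(bits):
    """Number of black pixels on each horizontal scan line."""
    return [sum(row) for row in bits]


def transition_profile(bits):
    """Number of black/white changes along each horizontal scan line."""
    return [
        sum(1 for left, right in zip(row, row[1:]) if left != right)
        for row in bits
    ]
```

For these two glyphs the pixel profiles are identical, so counting pixels cannot tell the ring from the solid dot. The transition profiles differ on the two middle rows, where the scan line crosses the ring twice.
<p>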
But do not make the standard OCR mistake of simply feeding the character<br>
matrix to a neural net and then trying to train it like mad.<br>
<p>
Feature recognition is still the most important part of a good OCR<br>
program. Whether you classify the features using a neural net or something<br>
simpler, such as weighted vector matching, isn't too important. If your<br>
feature recognition is not good, neither of them will work well.<br>
<p>
Neural nets can compensate a bit better for a bad recognizer, but<br>
at the price of additional training time and possibly less predictable<br>
behaviour.<br>
<p>
<i>> We should approach him, and get a freeware-OCR mailing list set up.</i><br>
Definitely a good idea. It is one of the few things missing in freeware.<br>
<p>
CU, Andy<br>
<p>
<pre>
--
Andreas Beck | Email : &lt;<a href="mailto:becka@sunserver1.rz.uni-duesseldorf.de">becka@sunserver1.rz.uni-duesseldorf.de</a>&gt;
<p>
<pre>
--
Source code, list archive, and docs: <a href="http://www.mostang.com/sane/">http://www.mostang.com/sane/</a>
To unsubscribe: echo unsubscribe sane-devel | mail <a href="mailto:majordomo@mostang.com">majordomo@mostang.com</a>
</pre>
<!-- body="end" -->
<p>