
 <!-- received="Thu Nov 13 04:04:51 1997 PST" -->
 <!-- sent="Thu, 13 Nov 1997 12:55:15 +0100 (MET)" -->
 <!-- name="becka@rz.uni-duesseldorf.de" -->
 <!-- email="becka@rz.uni-duesseldorf.de" -->
 <!-- subject="Re: OCR Software..?!" -->
 <!-- id="m0xVxrf-000BW0C@charon.beck-sw.de" -->
 <!-- inreplyto="199711130017.TAA14813@lemur.magnet.com" -->
 <title>sane-devel: Re: OCR Software..?!</title>
 <h1>Re: OCR Software..?!</h1>
 <a href="mailto:becka@rz.uni-duesseldorf.de"><i>becka@rz.uni-duesseldorf.de</i></a><br>
 <i>Thu, 13 Nov 1997 12:55:15 +0100 (MET)</i>
 <p>
 <ul>
 <li> <b>Messages sorted by:</b> <a href="date.html#107">[ date ]</a><a href="index.html#107">[ thread ]</a><a href="subject.html#107">[ subject ]</a><a href="author.html#107">[ author ]</a>
 <!-- next="start" -->
 <li> <b>Next message:</b> <a href="0108.html">Joey Nelson: "Re: Problem Scanning with UMAX S6E."</a>
 <li> <b>Previous message:</b> <a href="0106.html">Michael Burghart: "Re: OCR Software..?!"</a>
 <li> <b>In reply to:</b> <a href="0100.html">Andrew Kuchling: "Re: OCR Software..?!"</a>
 <!-- nextthread="start" -->
 <li> <b>Next in thread:</b> <a href="0105.html">Jonathan Buzzard: "Re: OCR Software..?!"</a>
 <!-- reply="end" -->
 </ul>
 <!-- body="start" -->
 <i>&gt; my idea was that scanned data would wind up in a Tk</i><br>
 <i>&gt; text editing box, with possible errors (where the confidence value of</i><br>
 <i>&gt; the recognition is low) highlighted in red.</i><br>
 <p>
 You might also need a "segmentation preview" which optionally lets the user<br>
 adjust the separation of text and graphics by hand, as well as the<br>
 sequence in which the text boxes are processed.<br>
 <p>
 Moreover, it would be nice if every manual step could be switched on and<br>
 off. That would allow a "quick-and-dirty" mass conversion, with errors<br>
 corrected the next morning once the stack of sheets has been fed through<br>
 the scanner, as well as fully interactive operation.<br>
 <p>
 <i>&gt; Recognition is the complicated part, of course.  First you need to</i><br>
 <i>&gt; scan the image, then it's usually converted from grey-scale to 2-level</i><br>
 <i>&gt; black-and-white.  Documents are often not perfectly aligned when</i><br>
 <i>&gt; they're scanned, so the angle at which they're tilted (called the</i><br>
 <i>&gt; "skew angle") has to be measured and compensated for.</i><br>
 <p>
 Yeah. If you want to compensate on the image side, do so before converting<br>
 to b/w. Rotating the greyscale image loses less quality.<br>
 <p>
 Moreover a "de-noise" filter would be appropriate to remove speckles.<br>
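 <p>
 As a rough illustration of such a de-noise step, here is a minimal sketch of a 3x3 median filter in Python. It is only one possible choice of filter, not taken from any existing OCR program; the image is assumed to be a plain list of rows of grey values, and the name median_filter_3x3 is made up for this example.<br>

```python
from statistics import median


def median_filter_3x3(img):
    """Return a copy of img (list of rows of grey values) in which every
    pixel is replaced by the median of its 3x3 neighbourhood.
    Coordinates outside the image are clamped to the nearest edge pixel."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            window = [
                img[min(max(y + dy, 0), h - 1)][min(max(x + dx, 0), w - 1)]
                for dy in (-1, 0, 1)
                for dx in (-1, 0, 1)
            ]
            out[y][x] = median(window)
    return out


# Hypothetical 5x3 page fragment: white paper with one dark speckle.
page = [
    [255, 255, 255, 255, 255],
    [255, 255,   0, 255, 255],
    [255, 255, 255, 255, 255],
]
cleaned = median_filter_3x3(page)
```

 An isolated speckle is outvoted by its eight neighbours, while larger dark areas such as letter strokes mostly survive. Very thin strokes can suffer, so the window size would need tuning against the scan resolution.<br>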
 <p>
 At small text sizes, it might be nice to keep a grayscale image<br>
 (though this complicates the algorithms considerably). At the very least you<br>
 should use an appropriate combined sharpening/smoothing filter (one that<br>
 preserves edges but smooths areas) to get a good image of the letters.<br>
 <p>
 <i>&gt; Then the image has to be segmented into words, and words into letters; </i><br>
 Or into ligatures (digraphs set as one glyph). Many printed typefaces use<br>
 them. An example is the combination "fi". In printed form, the dot of the i<br>
 is often made up of a dot attached to the upper end of the f. Set a word<br>
 containing this combination with TeX to see what I mean.<br>
 <p>
 <i>&gt; each letter is then recognized, and usually a confidence value is </i><br>
 <i>&gt; attached to each letter.</i><br>
 Yep. The same should happen on word level.<br>
 <p>
 <i>&gt; Often there's a post-processing step which uses a language dictionary </i><br>
 <i>&gt; to correct errors; for example, if you're scanning English text, 'rn' </i><br>
 <i>&gt; might be a scanning error for "m".</i><br>
 <p>
 Yes. The matching algorithm for the dictionary search needs to be<br>
 chosen in a way that takes typical scanning/matching errors into account.<br>
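 <p>
 One simple way to do this, sketched below in Python, is to try undoing known confusions before giving up on a word. The confusion table, the tiny dictionary and the function names are all assumptions made for this example; a real system would derive the table from the recognizer's actual error statistics and would probably allow more than one substitution per word.<br>

```python
# Hypothetical table: (what the recognizer tends to emit, what was printed).
CONFUSIONS = [("rn", "m"), ("cl", "d"), ("1", "l"), ("0", "o"), ("vv", "w")]


def candidates(word):
    """Yield the word itself, then every variant obtained by undoing
    exactly one known confusion at one position."""
    yield word
    for wrong, right in CONFUSIONS:
        start = word.find(wrong)
        while start != -1:
            yield word[:start] + right + word[start + len(wrong):]
            start = word.find(wrong, start + 1)


def correct(word, dictionary):
    """Return the first candidate found in the dictionary, else the
    word unchanged."""
    for cand in candidates(word):
        if cand in dictionary:
            return cand
    return word


DICTIONARY = {"modern", "scanner", "hello", "corner"}
fixed = correct("rnodern", DICTIONARY)
untouched = correct("corner", DICTIONARY)
```

 Because the word itself is tried first, a correctly recognized "corner" is not turned into "comer" even if that were in the dictionary.<br>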
 <p>
 On letter level you could use language-specific hidden Markov chains to<br>
 predict the probability of certain next letters, which can be helpful for<br>
 deciding between several candidates. E.g. if the last recognized<br>
 character was "q", the probability of the next one being "u" is orders of<br>
 magnitude higher than of it being "n".<br>
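 <p>
 A minimal sketch of the idea in Python, using a plain letter-bigram table rather than a full hidden Markov model. The training sentence is a made-up stand-in for a real language corpus, and the function names are chosen for this example only.<br>

```python
from collections import Counter, defaultdict


def train_bigrams(text):
    """Count how often each letter follows each other letter.
    Non-letters are dropped, so pairs across word gaps are counted too."""
    counts = defaultdict(Counter)
    letters = [c for c in text.lower() if c.isalpha()]
    for prev, nxt in zip(letters, letters[1:]):
        counts[prev][nxt] += 1
    return counts


def next_letter_prob(counts, prev, nxt, alphabet_size=26):
    """Estimate P(nxt | prev) with add-one smoothing, so unseen pairs
    get a small but non-zero probability."""
    total = sum(counts[prev].values())
    return (counts[prev][nxt] + 1) / (total + alphabet_size)


def pick(counts, prev, options):
    """Choose the option that is most likely to follow prev."""
    return max(options, key=lambda c: next_letter_prob(counts, prev, c))


# Hypothetical stand-in for a real training corpus.
SAMPLE = "the quick queen quietly requested a unique quilt"
model = train_bigrams(SAMPLE)
best = pick(model, "q", ["u", "n"])
```

 In practice this probability would be combined with the recognizer's own confidence value for each candidate rather than used alone.<br>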
 <p>
 <i>&gt; The two major techniques for recognizing letters seems to be either</i><br>
 <i>&gt; neural networks, or making a vector from easily measured</i><br>
 <i>&gt; characteristics of the bitmap containing a letter; for example, xocr</i><br>
 <i>&gt; takes a histogram of the letter at 128 different angles.  This</i><br>
 <i>&gt; technique dates back at least to the 1970s, but neural networks seem</i><br>
 <i>&gt; to be what all modern systems use.</i><br>
 <p>
 The XOCR technique is not good. Unless it has changed since I last looked,<br>
 it _counted_pixels_ (!) from these angles. This doesn't even distinguish<br>
 an O from a dot. Using the number of black/white transitions is a better<br>
 measure.<br>
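 <p>
 To make the difference concrete, here is a small Python sketch for a single angle (horizontal scan lines). It is an illustration only, not XOCR's actual code; the two bitmaps are hand-made so that every row holds the same number of black pixels.<br>

```python
# '#' is a black pixel, '.' is a white pixel.
LETTER_O = [
    ".####.",
    "##..##",
    "##..##",
    ".####.",
]
DOT = [
    ".####.",
    ".####.",
    ".####.",
    ".####.",
]


def black_counts(bitmap):
    """Number of black pixels in each row (the pixel-counting feature)."""
    return [row.count("#") for row in bitmap]


def transitions(bitmap):
    """Number of black/white changes along each row. Each row is padded
    with white on both sides so strokes touching the border count."""
    result = []
    for row in bitmap:
        padded = "." + row + "."
        result.append(sum(a != b for a, b in zip(padded, padded[1:])))
    return result


same_counts = black_counts(LETTER_O) == black_counts(DOT)
o_transitions = transitions(LETTER_O)
dot_transitions = transitions(DOT)
```

 Both shapes give identical pixel counts per row, but the rows crossing the hole of the O show four transitions where the filled dot shows two.<br>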
 <p>
 But do not make the standard OCR mistake of simply feeding the character<br>
 matrix to a neural net and then trying to train it like mad.<br>
 <p>
 Feature recognition is still the most important part of a good OCR<br>
 program. Whether you classify the features using a neural net or something<br>
 simpler like weighted vector matching isn't too important. If your<br>
 feature recognition is not good, neither of them will work well.<br>
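 <p>
 For comparison, weighted vector matching can be as small as the Python sketch below. The three features, the prototype values and the weights are invented for this example; in a real program they would come out of the feature recognizer and a training set.<br>

```python
def weighted_distance(a, b, weights):
    """Weighted squared Euclidean distance between two feature vectors."""
    return sum(w * (x - y) ** 2 for x, y, w in zip(a, b, weights))


def classify(features, prototypes, weights):
    """Return the label whose prototype vector is closest to features."""
    return min(
        prototypes,
        key=lambda label: weighted_distance(features, prototypes[label], weights),
    )


# Hypothetical features: (number of holes, width/height ratio,
# mean black/white transitions per row).
PROTOTYPES = {
    "O": (1, 1.0, 3.0),
    ".": (0, 1.0, 2.0),
    "l": (0, 0.2, 2.0),
}
# The hole count is weighted most heavily because it is assumed to be
# the most reliable of the three features.
WEIGHTS = (4.0, 1.0, 1.0)

label_ring = classify((1, 0.9, 3.2), PROTOTYPES, WEIGHTS)
label_bar = classify((0, 0.25, 2.1), PROTOTYPES, WEIGHTS)
```

 The classifier itself is trivial here; all of the discriminating power sits in how well the features are extracted.<br>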
 <p>
 Neural nets can compensate a bit better for a bad recognizer, but<br>
 at the price of additional training time and possibly less predictable<br>
 behaviour.<br>
 <p>
 <i>&gt; We should approach him, and get a freeware-OCR mailing list set up.</i><br>
 Definitely a good idea. It is one of the few things missing in freeware.<br>
 <p>
 CU, Andy<br>
 <p>
 <pre>
 --
 Andreas Beck              |  Email :  &lt;<a href="mailto:becka@sunserver1.rz.uni-duesseldorf.de">becka@sunserver1.rz.uni-duesseldorf.de</a>&gt;
 </pre>
 <p>
 <pre>
 --
 Source code, list archive, and docs: <a href="http://www.mostang.com/sane/">http://www.mostang.com/sane/</a>
 To unsubscribe: echo unsubscribe sane-devel | mail <a href="mailto:majordomo@mostang.com">majordomo@mostang.com</a>
 </pre>
 <!-- body="end" -->
