use UTF-8 encoding

Michele Guerini Rocco 2019-05-15 18:20:12 +02:00
parent 7fe8e1ffe0
commit 1c3623d719
Signed by: rnhmjoj
GPG Key ID: 91BE884FBA4B591A
1 changed files with 27 additions and 27 deletions
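
For context, a re-encoding of this kind can be sketched in a few lines of Python. This assumes the README was previously stored as Latin-1 (ISO 8859-1), which the commit itself does not state; the `reencode` helper is hypothetical and is not necessarily how this commit was produced.

```python
from pathlib import Path


def reencode(path, source_encoding="latin-1"):
    """Rewrite a text file as UTF-8 (hypothetical helper).

    source_encoding is an assumption; Latin-1 is only a guess at the
    README's previous encoding.
    """
    p = Path(path)
    # Decode the raw bytes with the assumed old encoding...
    text = p.read_bytes().decode(source_encoding)
    # ...and write the same characters back as UTF-8.
    p.write_bytes(text.encode("utf-8"))
    return text
```

Under that assumption, only lines containing non-ASCII characters (such as the section sign and the list bullets) change at the byte level, and every ASCII-only line is left untouched.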

README

@@ -25,7 +25,7 @@ cryptographic software is subject to U.S. export control laws and
regulations. The new 1997 Commerce Department Export Administration
Regulations (EAR) explicitly provide that "A printed book or other printed
material setting forth encryption source code is not itself subject to the
EAR." (see 15 C.F.R. §734.3(b)(2)). PGP, in an overabundance of caution,
has only made available its source code in a form that is not subject to
those regulations. So, books containing cryptographic source code may be
published, and after they are published they may be exported, but only
@@ -167,24 +167,24 @@ The first step to getting OmniPage 7 to work well is to set it up with
options to disable all of its more advanced features for preserving font
changes and formatting. Look in the Settings menu.
· Create a Zone Contents File with all of ASCII in it, plus the extra
bullet, currency, yen and pilcrow symbols. Name it "Source Code".
· Create a Source Code style set. Within it, create a Source Code zone style
and make it the default.
· Set the font to something fixed-width, like Courier.
· Set a fixed font size (10 point) and plain text, left-aligned.
· Set the tab character to a space.
· Set the text flow to hard line returns.
· Set the margins to their widest.
· The font mapping options are irrelevant.
Go to the settings panel and:
· Under Scanner, set the brightness to manual. With careful setting of the
threshold, this generates much better results than either the automatic
threshold or the 3D OCR. Around 144 has been a good setting for us; you
may want to start there.
· Under OCR, you'll build a training file to use later, but turn off
automatic page orientation and select your Source Code style set in the
Output Options. Also set a reasonable reject character. (For testing, we
used the pi symbol, which came across from the Macintosh as a weird
@@ -228,26 +228,26 @@ specific Latin-1 characters to be processed.
The characters most in need of training are as follows:
· Zero is printed 'slashed.'
· Lowercase L has a curled tail to distinguish it clearly from other
vertical characters like 1 and I.
· The or-bar or pipe symbol '|' is printed "broken" with a gap in the
middle to distinguish it similarly.
· The underscore character has little "serifs" on the end to distinguish
it from a minus sign. We also raised it just a tad higher than the
normal underscore character, which was too low in the character cell to
be reliably seen by OmniPage.
· Tabs are printed as a hollow right-pointing triangle, followed by blanks
to the correct alignment position. If not trained enough, OmniPage
guesses this is a capital D. You should train OmniPage to recognize this
symbol as a currency symbol (Latin-1 244; this and the codes below are octal).
· Any spaces in the original that follow a space, or a blank on the printed
page, are printed as a tiny black triangle. You should train OmniPage to
recognize this as a center dot or bullet (Latin-1 267). We didn't use a
standard center dot because OmniPage confused it with a period.
· Any form feeds in the original are printed as a yen currency symbol
(Latin-1 245).
· Lines over 80 columns long are broken after 79 columns by appending a big
ugly black block. You should train OmniPage to recognize this as a
pilcrow (paragraph symbol, Latin-1 266). We did this because after
deciding something black and visible was suitable, we found out the font
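
The marker conventions in this list can be undone mechanically once a page has been recognized. The following is a minimal Python sketch, not the tool the authors used: the function names, the 8-column tab stops, and the assumption that the OCR output preserves column positions are all illustrative. The Latin-1 codes quoted in the list are read as octal (244, 267, 245 and 266 are U+00A4, U+00B7, U+00A5 and U+00B6).

```python
# Marker characters, taken from the training notes (octal Latin-1 codes).
TAB_MARK = "\u00a4"       # currency sign, octal 244: start of a printed tab
SPACE_MARK = "\u00b7"     # center dot, octal 267: a space after a space/blank
FORMFEED_MARK = "\u00a5"  # yen sign, octal 245: form feed
WRAP_MARK = "\u00b6"      # pilcrow, octal 266: appended to a broken long line


def decode_line(line, tab_width=8):
    """Undo tab, space and form-feed markers on one printed line.

    tab_width=8 is an assumption; the README does not state the tab stops.
    """
    out = []
    col = 0
    while col < len(line):
        ch = line[col]
        col += 1
        if ch == TAB_MARK:
            out.append("\t")
            # Skip the padding blanks that fill up to the next tab stop.
            next_stop = ((col - 1) // tab_width + 1) * tab_width
            while col < next_stop and col < len(line) and line[col] == " ":
                col += 1
        elif ch == SPACE_MARK:
            out.append(" ")
        elif ch == FORMFEED_MARK:
            out.append("\f")
        else:
            out.append(ch)
    return "".join(out)


def decode_page(text, tab_width=8):
    """Decode every line and rejoin lines that end with the wrap marker.

    Tab stops on continuation lines are counted from the start of the
    printed line, which is a simplifying assumption.
    """
    lines = []
    pending = ""
    for printed in text.split("\n"):
        if printed.endswith(WRAP_MARK):
            pending += decode_line(printed[:-1], tab_width)
        else:
            lines.append(pending + decode_line(printed, tab_width))
            pending = ""
    if pending:
        lines.append(pending)
    return "\n".join(lines)
```

For example, `decode_line("\u00a4       x")` (a tab marker, seven padding blanks, then `x`) yields a tab followed by `x`, and a line ending in the pilcrow is glued to the line that follows it.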
@@ -264,16 +264,16 @@ to train on, use that.
Other things that need training:
· ~ (tilde), ^ (caret), ` (backquote) and ' (quote). These get dropped
frequently unless you train them.
· i, j and ; (semicolon). These get mixed up.
· 3 and S. These also get mixed up.
· Q can fail to be recognized.
· C and [ can be confused.
· c/C, o/O, p/P, s/S, u/U, v/V, w/W, y/Y and z/Z are often confused. This
can be helped by some training.
· r gets confused with c and n. I don't understand c, but it happens.
· f gets confused with i.
The OCR training pages have lots of useful examples of troublesome
characters. Scan a few pages of material, training each page, then scan a