translit
|
1251-alt.rus,
1251-k8.rus,
Makefile,
Makefile.os2,
Makefile.unx,
alt-1251.rus,
alt-gos.rus,
alt-koi8.rus,
announcement,
binaries_for_SunOS_5.4,
example.alt,
example.alt.uu,
example.ko8.uu,
example.pho,
example.tex,
gos-alt.rus,
gos-koi8.rus,
hex-text.rus,
k8-1251.rus,
k8-tavtt.rus,
koi7-8.rus,
koi7nl-8.rus,
koi8-7.rus,
koi8-alt.rus,
koi8-gos.rus,
koi8-lc.rus,
koi8-ltx.rus,
koi8-phg.rus,
koi8-php.rus,
koi8-tex.rus,
koi8-win.rus,
old-version-1.00,
old-version-1.01,
old-version-1.02,
paths.h,
phg-koi8.rus,
pho-8sim.rus,
pho-koi8.rus,
php-koi8.rus,
readme.doc,
reg_exp.c,
reg_exp.h,
reg_sub.c,
tex-koi8.rus,
translit-sun4,
translit.1,
translit.c,
translit.ps,
translit.tar.Z,
translit.txt,
translit.zip,
|
|
|
.TH TRANSLIT JKL "22-Jan-1997" JKL "Version 1.03"
.DA 22 Jan 1997
.SH NAME
.IP \fITRANSLIT\fR
Program to transliterate texts in different character sets. The program
converts input character codes (or sequences of codes) to a different set
of output character codes (or sequences of codes). Intended for
transliteration to/from phonetic representation of foreign letters with
Latin letters from/to special national codes used for these letters.
It supports simple matches, character lists and flexible matches via
regular expressions. New transliteration schemes are easily added
by creating simple transliteration tables. Multiple character sets
are supported for input and output. It does not yet support UNICODE,
but some day it will.
.SH COPYRIGHT
Copyright (c) 1993 Jan Labanowski and JKL Enterprises, Inc.
.br
You may distribute the Software only as a complete set of files.
You may distribute the modified Software only if you retain the
Copyright notice and you do not delete original code, data, documentation
and associated files.
The Software is copyrighted. You may not sell the software or incorporate
it in the commercial product without written permission from
Jan Labanowski or JKL Enterprises, Inc. You are allowed to charge for media
and copying if you distribute the whole unaltered package.
.SH SYNOPSIS
.B translit
[
.B -i
.I inpfile
][
.B -o
.I outfile
][
.B -d
][
.B -t
.I transtbl \|\||\|\| transtbl
]
.br
.SH OPTIONS
.IP "\fB-i\fP \fIinpfile\fP"
.I inpfile
is the name of the input file to be transliterated.
If "\fB-i\fP" is not specified, the input is taken from
standard input.
.IP "\fB-o\fP \fIoutfile\fP"
.I outfile
is the output file in which the transliterated
text is stored. If "\fB-o\fP" is not specified, the output is
directed to the standard output. The program will not overwrite an existing
file. If the file exists, you need to delete it first.
.IP "\fB-d\fP"
Some information on the character codes read from the transliteration table file
is sent to standard error ("\fIstderr\fP"). Useful when developing
new transliteration tables.
.IP "\fB-t\fP \fItranstbl\fP"
.I transtbl
is a transliteration table file which you want to use. The "\fB-t\fP"
option may be omitted if the \fItranstbl\fR
is specified as the last parameter on the
command line. The program first tries to locate \fItranstbl\fR
file in the current directory, and if not found, it
searches the directory chosen at compilation/installation time in
"\fIpaths.h\fP". If no "\fItranstbl\fP" is given, the default file name
specified in "\fIpaths.h\fP" is taken. The compile/installation
time defaults in
"\fIpaths.h\fR" for the search directory and the default
file name can be overridden
by setting environment variables: TRANSP and TRANSF, respectively (see below).
.SH ENVIRONMENT VARIABLES
The default path to the directory holding transliteration tables can
be overridden by setting the environment variable TRANSP. The default name
for the transliteration table can be overridden by setting the TRANSF environment
variable. However, a transliteration file given on the command line
overrides both the defaults and the environment settings.
Here are some examples of setting environment
variables for different operating systems:
.sp
.in +2m
.br
\fIUN*X System\fR
.br
.nf
If you are using \fIcsh\fR (C-shell):
setenv TRANSP /home/john/translit/
setenv TRANSF koi8-tex.rus
If you are using \fIsh\fR (Bourne Shell):
TRANSP=/home/john/translit/
export TRANSP
TRANSF=koi8-tex.rus
export TRANSF
\fIVAX-VMS System\fR
TRANSP:==SYS$USER:[JOHN.TRANSLIT]
TRANSF:==KOI8-TEX.TBL
\fIPC-DOS or MS-DOS\fR
SET TRANSP=C:\|\\\|JOHN\|\\\|TRANSLIT\|\\
SET TRANSF=KOI8-TEX.TBL
.fi
.in -2m
Note that the directory path has to end with the trailing
slash, \|\\\| or \|/\|\|.
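.sp
The lookup rules described above can be summarized by the following Python
sketch. It is an illustration only, not the code of the program: the
function name and the two DEFAULT values are hypothetical stand-ins for
whatever is compiled in from paths.h, and it assumes that the default
table name is searched for in the same order as a name given on the
command line.
.nf
```python
import os

# Hypothetical stand-ins for the compile-time defaults kept in paths.h.
DEFAULT_DIR = "/usr/local/lib/translit/"
DEFAULT_TABLE = "koi8-tex.rus"

def locate_table(cli_name=None, environ=None, exists=os.path.exists):
    """Return the path of the table file to use, or None if none is found."""
    environ = os.environ if environ is None else environ
    # A name given on the command line wins over TRANSF and the default.
    name = cli_name or environ.get("TRANSF", DEFAULT_TABLE)
    directory = environ.get("TRANSP", DEFAULT_DIR)
    # The current directory is tried first. The directory must already end
    # with its separator, so plain concatenation is used.
    for candidate in (name, directory + name):
        if exists(candidate):
            return candidate
    return None
```
.fi
The plain concatenation of directory and file name is the reason why
TRANSP has to end with a slash.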
.SH EXAMPLES
.ta 5m
.br
cat text.koi8 \|\||\|\| translit koi8-tex.rus > text.tex
.br
in UN*X is equivalent to:
.sp 1
translit -t koi8-tex.rus -o text.tex -i text.koi8
.br
and converts file text.koi8 to file text.tex using transliteration
specified in the file koi8-tex.rus.
.sp 1
translit -i text.koi8 koi8-lc.rus
.br
displays the converted text from file text.koi8 on your terminal. The
conversion table is koi8-lc.rus (KOI8 --> Library of Congress).
.sp 1
translit -i text.alt -t alt-koi8.rus \|\||\|\| translit -o text.tex -t koi8-ltx.rus
.br
is essentially equivalent to the following two commands in UN*X or MS-DOS:
.br
translit -i text.alt -o junkfile -t alt-koi8.rus
.br
translit -i junkfile -o text.tex -t koi8-ltx.rus
.br
and converts the file in ALT character set to a LaTeX file for printing.
.sp
translit -i russ.txt pho-koi8.rus \|\||\|\| translit -o russ.tex koi8-ltx.rus
.br
converts file russ.txt from phonetic transliteration to LaTeX file russ.tex
for printing.
.sp 2
.SH TRANSLITERATION TABLES
The following transliteration files are available with the current
distribution. Consult the comments in the individual files for details.
.IP \fIkoi8-tex.rus\fP
Conversion table which changes the file in KOI8 (8 bit character set
used by RELCOM news service) to a Plain TeX file for printing with
\fIAMS\fR WNCYR fonts.
.IP \fIkoi8-ltx.rus\fP
Conversion table which changes the file in KOI8 (8 bit character set
used by RELCOM news service) to LaTeX file for printing with
\fIAMS\fR WNCYR fonts.
.IP \fIltx-koi8.rus\fP
Conversion table for the LaTeX to KOI8 conversion. Note that it will not
handle complicated cases, since LaTeX is a program, and only TeX can
convert a LaTeX source to the characters. However, it should work OK
for simple cases of text only files, and may need some editing for
complicated cases.
.IP \fIk8-tavtt.rus\fP
Converts KOI8 to Bill Tavolga cyrttf truetype font mapping.
.IP \fIhex-text.rus\fP
Converts hexcodes to actual codes. Some e-mail programs convert characters
with codes larger than 127 to hexadecimal numbers like =AB, =9C, etc.
This table converts hexadecimal numbers back to codes.
.IP \fIalt-gos.rus\fP
This is a transliteration data file for converting from ALT (Bryabrins
alternativnyj variant used in many popular wordprocessors)
to GOSTSCII 84 (approx. ISO-8859-5?)
.IP \fIalt-koi8.rus\fP
This is a transliteration data file for converting from ALT to KOI8.
KOI8 is meant to be GOST 19768-74 (as used by RELCOM).
.IP \fIgos-alt.rus\fP
This is a transliteration data file for converting GOSTSCII 84
(approx. ISO-8859-5?) to ALT (Bryabrin's alternativnyj variant).
.IP \fIgos-koi8.rus\fP
This is a transliteration data file for converting GOSTSCII 84
(approx. ISO-8859-5?) to KOI8 as used by RELCOM.
KOI8 is meant to be GOST 19768-74.
.IP \fIkoi8-alt.rus\fP
This is a transliteration data file for converting from KOI8
to ALT (Bryabrin's alternativnyj variant).
KOI8 is meant to be GOST 19768-74.
.IP \fIkoi8-gos.rus\fP
This is a transliteration data file for converting from KOI8 (Relcom)
to GOSTSCII 84 (approx. ISO-8859-5).
KOI8 is meant to be GOST 19768-74.
.IP \fIkoi8-7.rus\fP
This file converts from KOI8 to KOI7.
.IP \fIkoi7-8.rus\fP
This file converts from KOI7 to KOI8. Before you attempt the conversion,
you might need to perform a simple edit on your file. You MUST read the
comments in \fIkoi7-8.rus\fR before you attempt this conversion.
.IP \fIkoi7nl-8.rus\fP
This file assumes that there are only Russian letters (no Latin)
in the input file. If you have Latin letters, and you inserted SHIFT-OUT/IN
characters, use file \fIkoi7-8.rus\fP.
.IP \fIkoi8-lc.rus\fP
This file converts KOI8 to the Library of Congress transliteration.
Some extensions are added.
.IP \fIkoi8-php.rus\fP
This file converts KOI8 to the Pokrovsky transliteration.
.IP \fIphp-koi8.rus\fP
This file converts from Pokrovsky transliteration to KOI8.
.IP \fIkoi8-phg.rus\fP
This file converts from KOI8 to GOST transliteration.
.IP \fIphg-koi8.rus\fP
This file converts from GOST transliteration to KOI8.
.IP \fIpho-koi8.rus\fP
This is a table which will convert from many "phonetic" transliteration
schemes to KOI8. It is elaborate and it takes a lot of time to
transliterate the file using this table. Some transliterations are
hopeless and internally inconsistent (as humans...), so the results
cannot be bug free.
You might want to modify the file if your transliteration
patterns are different from those assumed in this file. You may also want
to simplify this file if the phonetic transliteration you are converting
is a sound one (most are not, e.g., they use e for je and e oborotnoye,
ts for c and t-s, h for kha, i for i-kratkoe, etc.).
.sp
.SH INTRODUCTION
If you do not intend to write your own transliteration tables, you may
skip this description and go directly to the installation and
copyright sections. However, you might want to read this material anyhow,
to better understand the traps and complexities of transliteration.
It is frequently necessary to transliterate text, i.e., to change one set
of characters (or composite characters, phonemes, etc.) to another set.
.PP
On computers, the transliteration operation consists of converting the input
file in some character set to the output file in another character set.
.PP
In the simplest case, single characters are transliterated, i.e., their
codes are changed according to some transliteration table. This is called
remapping and, assuming a one-to-one mapping, the task can be accomplished
by a simple pseudo program:
.br
new_char_code = character_map[old_char_code];
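.sp
As a runnable illustration of such a remapping (a Python sketch, not the
code of the program), the table below is a 256-entry list that is the
identity by default. The three sample entries map the KOI8 codes 193, 195
and 200, quoted elsewhere in this manual, to the Latin letters a, c and h.
.nf
```python
# Illustrative only: a 256-entry lookup table, identity by default.
character_map = list(range(256))

# Three sample entries based on KOI8 codes quoted in this manual:
# 193 -> "a", 195 -> "c", 200 -> "h".
character_map[193] = ord("a")
character_map[195] = ord("c")
character_map[200] = ord("h")

def remap(data):
    """Return a copy of data with every byte replaced via character_map."""
    return bytes(character_map[code] for code in data)

print(remap(bytes([200, 193, 33])))
```
.fi
Codes without an entry of their own (here 33, the exclamation mark) pass
through unchanged, so the call above prints b'ha!'.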
.PP
If the one-to-one correspondence does not exist (i.e., some codes may
be present in one set, but do not have corresponding codes in another set),
precise transliteration is not possible. In such cases there are 3 obvious
possibilities:
.br
1. skip characters which do not have counterparts,
.br
2. retain unchanged codes of these characters,
.br
3. convert the codes to multicharacter sequences.
.br
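.sp
The three possibilities can be compared side by side in a short Python
sketch. It is an illustration only: the policy names and the "<NNN>"
fallback sequence are invented for this example and are not features of
the program.
.nf
```python
# Illustrative only: three ways to treat codes that have no counterpart.
# The table maps an input code to a tuple of output codes.
table = {193: (ord("a"),), 200: (ord("h"),)}

def as_sequence(code):
    """Hypothetical multicharacter fallback: the code written as <NNN>."""
    return tuple(("<%d>" % code).encode("ascii"))

def convert(data, policy):
    out = []
    for code in data:
        if code in table:
            out.extend(table[code])
        elif policy == "skip":        # 1. drop the character
            continue
        elif policy == "retain":      # 2. keep the code unchanged
            out.append(code)
        elif policy == "expand":      # 3. emit a multicharacter sequence
            out.extend(as_sequence(code))
        else:
            raise ValueError("unknown policy: " + policy)
    return bytes(out)
```
.fi
.sp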
In some cases, the file can contain more than one character set, e.g.,
the file can contain Latin characters (e.g. English text) and Cyrillic
characters (e.g. Russian text). If the character codes assigned to
characters in different sets do not overlap, this is still a simple mapping
problem. This is the case with the KOI8 or GOSTSCII character tables for Russian,
which reserve codes 0-127 for standard ASCII (which includes
all Latin characters) and use codes above 127 for Cyrillic letters.
.PP
If character codes overlap, there is a SHIFT-OUT/SHIFT-IN technique in
which the meaning of a series of characters is determined by the
SHIFT-OUT character (or sequence of character codes) which precedes
them. The SHIFT-IN character (or sequence) following the
series of characters returns the "reader" to the default or previous status.
Two schemes are used:
.br
(char_set_1)(SHIFT-IN[1])(SHIFT-OUT[2])(char_set_2)...
.br
or
.br
(char_set_1)(SHIFT-OUT[2])(char_set_2)(SHIFT-OUT[1])(char_set_1)...
.br
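.sp
A minimal Python sketch of reading such a stream is given below. It is an
illustration only and assumes two character sets switched by the standard
ASCII control codes SO (CTRL-N, hex 0E) and SI (CTRL-O, hex 0F); the
function name is hypothetical.
.nf
```python
SHIFT_OUT = 0x0E   # ASCII SO (CTRL-N): switch to character set 2
SHIFT_IN = 0x0F    # ASCII SI (CTRL-O): return to character set 1

def split_runs(data):
    """Split data into (set_number, bytes) runs, dropping the shift codes."""
    runs = []
    current_set = 1          # the default set before any shift code is seen
    chunk = bytearray()
    for code in data:
        if code in (SHIFT_OUT, SHIFT_IN):
            if chunk:
                runs.append((current_set, bytes(chunk)))
                chunk = bytearray()
            current_set = 2 if code == SHIFT_OUT else 1
        else:
            chunk.append(code)
    if chunk:
        runs.append((current_set, bytes(chunk)))
    return runs
```
.fi
Each run can then be remapped with the table that belongs to its set.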
.sp 1
Since computer keyboards, screens, printers, software, etc., are by necessity
language specific (the most popular being ASCII), there is a problem of typing
foreign language text which contains letters not found in the standard Latin
alphabet. For this reason, many transliteration schemes use several Latin
letters to represent a single letter of a foreign alphabet, for example:
.br
zh is used to represent the cyrillic letter zhe, \|\\\|"o may be used to
represent the o umlaut, etc.
If there is a one-to-one mapping of such sequences to another alphabet, it
is also easy to process. However, it is necessary to substitute the longest
sequences first. For example, a frequently used transliteration
for cyrillic letters:
.br
.ta 2mL 7mL 11mL 24mL
\fIshch\fR --- letter \fBshcha\fR 221 (decimal KOI8 code)
.br
\fIsh\fR --- letter \fBsha\fR 219
.br
\fIch\fR --- letter \fBche\fR 222
.br
\fIc\fR --- letter \fBtse\fR 195
.br
\fIh\fR --- letter \fBkha\fR 200
.br
\fIa\fR --- letter \fBa\fR 193
.PP
Obviously, in this case, we should proceed first with converting all \fIshch\fR
sequences to the \fBshcha\fR letter, then the two-character sequences \fIsh\fR
and \fIch\fR, and then the single
characters \fIc\fR and \fIh\fR.
Generally, for one-to-one transliteration, the longest
sequences should be processed first, and the order of conversion within
sequences of the same length makes no difference.
For example, converting the word "shchah" to KOI8 should proceed in the following
way:
.br
\fIshchah\fR --> (221)\fIah\fR, (221)\fIah\fR --> (221)(193)\fIh\fR, (221)(193)\fIh\fR --> (221)(193)(200)
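.sp
The same result can be reproduced with a short Python sketch. It is an
illustration only: it scans the text from left to right and tries the
longest sequence at each position, which agrees with the steps above for
this example. Whether the program itself works in passes or in a single
scan is not stated here.
.nf
```python
# Sequence -> decimal KOI8 code, as listed in the table above.
RULES = {"shch": 221, "sh": 219, "ch": 222, "c": 195, "h": 200, "a": 193}

def to_koi8(text):
    """Greedy conversion: at each position try the longest sequence first."""
    ordered = sorted(RULES, key=len, reverse=True)
    codes = []
    pos = 0
    while pos < len(text):
        for seq in ordered:
            if text.startswith(seq, pos):
                codes.append(RULES[seq])
                pos += len(seq)
                break
        else:
            # No rule applies: pass the character code through unchanged.
            codes.append(ord(text[pos]))
            pos += 1
    return codes

print(to_koi8("shchah"))
```
.fi
The call prints [221, 193, 200]. If the single letters were tried first,
the word would instead come out as five separate letters.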
.br
There is a multitude of reasons why transliteration is done. I wrote this
program having in mind the following ones:
.br
1) to print cyrillic text using TeX/LaTeX and cyrillic fonts
.br
2) to read KOI8 encoded messages from Russia on my ASCII terminal.
.br
However, I was trying to make it flexible to accommodate other uses.
.SH PROGRAM OPERATION
The program converts the input file to an output file using
transliteration rules from the transliteration rule file which
you specify with option \fB-t\fR.
Some examples of transliteration rule files are enclosed.
Before the program can be used, the transliteration rules need to be specified.
.PP
These are given as a file which consists of the following parts:
.br
.in +2m
.in +5m
.ti -5m
1) File format number (it is 1 at this moment)
.ti -5m
2) Delimiters used to enclose a) simple strings, b) character lists,
c) regular expressions
.ti -5m
3) Starting sequence for output
.ti -5m
4) Ending sequence for output
.ti -5m
5) Number of input "character sets"
.ti -5m
6) SHIFT-OUT/SHIFT-IN sequences for each input character set
.ti -5m
7) Number of output "character sets"
.ti -5m
8) SHIFT-OUT/SHIFT-IN sequences for each output character set
.ti -5m
9) Transliteration table
.in -5m
.in -2m
.PP
\fIGENERAL COMMENTS\fR
.br
The transliteration rules file consists of comments and data.
The comments may be included in the file as:
.in +5m
.ti -2m
a) line comments --- lines starting with ! or # character (# or ! must be
in the first column of a line) are treated as comments and are not
read in by the program.
.ti -2m
b) comments following all required entries on the line. They must be
separated by at least one space from the last data entry on the line
and need not start with any particular character. These comments cannot
be used within multiline sequences.
.br
.in -5m
.PP
The data entries consist of integer numbers and strings.
The strings may represent:
.br
a) plain strings
.br
b) character lists
.br
c) regular expressions
.br
.PP
All strings which appear in the file, are processed through the
"string processor", which allows entering unprintable characters as codes.
The character code is specified as a backslash "\|\\\|" followed by at least
2 digits (i.e., \|\\\|01 produces code=1, but \|\|\\\|1 is passed unchanged). The
following formats are supported:
.br
\|\\\|0123 character of octal code 123 (when leading zero present)
.br
\|\\\|123 character of decimal code 123 (when leading digit is not zero)
.br
\|\\\|0o123 or \|\\\|0O123 character of octal code 123
.br
\|\\\|0d123 or \|\\\|0D123 character of decimal code 123
.br
\|\\\|0xA3 or \|\\\|0XA3 or \|\\\|0xa3 character of hexadecimal code A3
.br
.PP
The allowed digits are 0-7 for octal codes, 0-9 for decimal codes and
0-F (and/or 0-f) for hexadecimal codes.
When a code has to be followed by a character that could be read as one more
digit of that code, you need to enter that character as a code too.
E.g., if you want character \|\\\|0xA3 followed by a letter C
(which is also a hexadecimal digit),
you need to specify letter C as a code (\|\\\|0x43 or \|\\\|0103 or \|\\\|0o103 or \|\\\|0d67)
and type the sequence as, e.g., \|\\\|0xA3\|\\\|0103.
A character resulting in a code 0 (zero) (e.g., \|\\\|00) is special. It tells:
"skip everything that follows me in this string".
It does not make sense to use it, since you can always terminate the
sequence with a delimiter. When you use an empty string as a matching
sequence, remember that it does not match anything.
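.sp
The code formats listed above can be illustrated by the Python sketch
below. It is not the string processor of the program: it assumes that a
code swallows every following character that is a valid digit in its base
(which is why a digit after a code must itself be written as a code), and
it ignores the special code 0 and line continuation. The backslash is
written as chr(92) to keep the listing free of escape characters.
.nf
```python
BS = chr(92)                       # the backslash character
DIGITS = {8: "01234567", 10: "0123456789", 16: "0123456789abcdefABCDEF"}
PREFIXES = {"o": 8, "O": 8, "d": 10, "D": 10, "x": 16, "X": 16}

def decode(text):
    """Replace character codes in text by the characters they denote."""
    out = []
    pos = 0
    while pos < len(text):
        if text[pos] != BS:
            out.append(text[pos])
            pos += 1
            continue
        start = pos + 1
        if text[start:start + 1] == "0" and text[start + 1:start + 2] in PREFIXES:
            # Explicit base: 0o / 0d / 0x followed by at least one digit.
            base = PREFIXES[text[start + 1]]
            start += 2
            minimum = 1
        else:
            # No base letter: leading zero means octal, otherwise decimal,
            # and at least two digits are required.
            base = 8 if text[start:start + 1] == "0" else 10
            minimum = 2
        end = start
        while end < len(text) and text[end] in DIGITS[base]:
            end += 1
        if end - start < minimum:
            out.append(BS)         # not a code: pass the backslash through
            pos += 1
            continue
        out.append(chr(int(text[start:end], base)))
        pos = end
    return "".join(out)
```
.fi
With these assumptions, decode(BS + "0d67") gives the letter C, while a
backslash followed by the single digit 1 is returned unchanged.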
.sp
If the line with entries is too long, you can break it between the
fields.
If the string is too long to fit a line, you can break it before any nonblank
character by the \|\\\| (backslash) followed by white space (i.e., new lines,
spaces, tabs, etc.). The \|\\\| and the following white space will be removed
from the string by the string preprocessor. However, you are not allowed
to break the individual character codes (and you probably would never
do it anyway, for aesthetic reasons).
For example:
.br
"experi\\
.br
mental design"
.br
is equivalent to:
.br
"experimental design"
.br
while:
.br
"experimental\\
.br
design"
.br
is equivalent to:
.br
"experimentaldesign"
.br
If you need to have \|\\\| followed by a space in your string, you need to
enter either a backslash or a space following it as an explicit character
code, for example:
.br
"\|\\\|\|\\\|0o40"
.br
will produce a \|\\\| followed by the space, while the string:
.br
"\|\\\| "
.br
will be empty.
.sp 1
The preprocessor knows only about comments, plain characters, character codes,
and continuation lines. However, some characters and their combinations
may have a special meaning in lists and regular expressions.
.sp 2
\fIDETAILS OF FILE STRUCTURE\fR
.sp
.PP
.in +3m
.ti -3m
Ad.1) File format number. This is simply a digit 1 on a line by itself at the
moment. This entry is included to allow future extensions of the
transliteration description file without the need to modify older
transliteration descriptions (program will read data according to
the current file format number given in the file).
.sp
.ti -3m
Ad.2) String delimiters. The subsequent 3 lines specify pairs of
single character delimiters for 3 types of text data.
The line format is:
.br
opening_character closing_character.
.br
These are needed to mark the beginning/end and the type of the text data.
Each string (text datum) is saved starting from the first character after
opening delimiter, and ends at the last character before the closing
delimiter. If you need to use the closing delimiter within a string,
you need to specify it as its code (e.g., if you are using () pair as
delimiters, specify ")" as \|\\\|0x29). The opening delimiter may be the same
or different from the closing delimiter.
.sp
.in +2m
.ti -2m
a) The first line contains characters used to enclose (bracket)
a \fIplain string\fR. Plain strings are directly matched to input data or
directly sent to output.
I suggest sticking to the " " pair for plain strings.
The ASCII code for " is \|\\\|0d34 = \|\\\|0x22 = \|\\\|0o42 if you need it inside the
string itself.
.sp
.ti -2m
b) The second line contains characters to mark the beginning and the end
of the \fIlist\fR. Lists are used to translate single character codes.
I suggest [ and ] delimiters for the list (ASCII code of "]" is:
\|\\\|0d93 = \|\\\|0x5D = \|\\\|0o135). The lists may include ranges, for example:
[a-zA-Z0-9] will include all Latin letters (small and capital) and digits.
Note that order is important: [a-d] is equivalent to [abcd], while
[d-a] will result in an error. If you want to include "-" (minus) in the
list, you need to place it as the first or the last character. There are only
two special characters on the list, the "-" described above, and the "]"
character. You need to enter the "]" as its code. E.g., for
ASCII character table [*--] is equivalent to [*+,-], is equivalent to
[\|\\\|42\|\\\|43\|\\\|44\|\\\|45]. The order of characters in the list does not matter
unless the input list corresponds to the output list (this will be
explained later). Empty lists do not make sense.
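.sp
The range rules for lists can be illustrated by a short Python sketch. It
is an illustration only, with a hypothetical function name; it takes the
text between the list delimiters after character codes have already been
replaced by characters.
.nf
```python
def expand_list(body):
    """Expand the text between [ and ] into the characters it stands for."""
    chars = []
    pos = 0
    while pos < len(body):
        first = body[pos]
        # A "-" is a range operator only between two other characters,
        # so a "-" in the first or last position stays a plain minus.
        if pos + 2 < len(body) and body[pos + 1] == "-":
            last = body[pos + 2]
            if ord(last) < ord(first):
                raise ValueError("reversed range: %s-%s" % (first, last))
            chars.extend(chr(c) for c in range(ord(first), ord(last) + 1))
            pos += 3
        else:
            chars.append(first)
            pos += 1
    return "".join(chars)
```
.fi
Here expand_list("a-d") gives "abcd", expand_list("*--") gives "*+,-",
and expand_list("d-a") raises an error.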
.sp
.ti -2m
c) The third line of delimiter specification contains delimiters for
\fIregular expression\fRs and \fIsubstitution expression\fRs.
These strings are used for "flexible" matches
to the text in the input file. They are very similar to the ones used in
UN*X for searching text in utilities like: grep, sed, vi, awk, etc., though
only a subset of full UN*X regular expression syntax is used here.
I suggest enclosing them within braces { and } (ASCII code for } is
\|\\\|0d125 = \|\\\|0x7D = \|\\\|0o175). Actually, regular expressions can only
be used for input sequences, and for output sequences the {} are
used to enclose substitution sequences. This will be explained
below. The description of the
syntax for regular/substitution expressions is
adapted from the documentation for the regexp package of Henry
Spencer, University of Toronto --- this regular expression package
was incorporated, after minute modifications, into the program.
.br
.sp 2
.ce
\fBREGULAR EXPRESSION SYNTAX\fR
.br
A regular expression is zero or more branches, separated by
`\|\||\|\|'. It matches anything that matches one of the branches.
The `\|\||\|\|' simply means "or".
.ti +2m
A branch is zero or more pieces, concatenated. It matches a
match for the first, followed by a match for the second,
etc.
.ti +2m
A piece is an atom possibly followed by `*', `+', or `?'.
An atom followed by `*' matches a sequence of 0 or more
matches of the atom. An atom followed by `+' matches a
sequence of 1 or more matches of the atom. An atom followed
by `?' matches zero or one occurrence of the atom.
.ti +2m
An atom is a regular expression in parentheses (matching a
match for the regular expression), a range (see below), `.'
(matching any single character), a `\|\\\|' followed by
a single character (matching that character), or a
single character with no other significance (matching that
character).
.ti +2m
A range is a sequence of characters enclosed in `[\|\|]'. It
normally matches any single character from the sequence. If
the sequence begins with `^', it matches any single character
not from the rest of the sequence. If two characters in
the sequence are separated by `-', this is shorthand for the
full list of ASCII characters between them (e.g. `[0-9]'
matches any decimal digit). To include a literal `]' in the
sequence, make it the first character (following a possible
`^'). To include a literal `-', make it the first or last
character. The regular expression can contain subexpressions
which are enclosed in a (\|\|) pair. These subexpressions are numbered
1 to 9 and can be nested. The numbering of subexpressions is
given in the order of their opening parentheses "(". For
example:
.br
.ta 6mL
(111)...(22(333)222(444)222)...(555)
.br
Note that expression 2 contains within itself expressions 3 and 4.
.br
These subexpressions can be referenced in the substitution string, which
is described below, or can be used to delimit
atoms.
.in +2m
Examples:
.in +2m
.ti -2m
{[\|\\\|0d32\|\\\|0d09]\|\\\|0d10} --- will match space or tab followed by new line
.ti -2m
{[Tt][Ss]} --- will match TS, Ts, tS and ts
.ti -2m
{TS\|\||\|\|Ts\|\||\|\|tS\|\||\|\|ts} --- same as above
.ti -2m
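{[\|\\\|0d09-\|\\\|0d15 ][^hH][^uU][a-zA-Z]*[\|\\\|0d09-\|\\\|0d15 ]} --- words whose
first character is not h or H and whose second character is not u or U.
This excludes words starting with hu, Hu, hU, HU, but also words such as
hat or but. There is a space between
\|\\\|0d15 and ].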
.br
Note that specifying expressions like {.*} (i.e., match all characters)
does not make much sense, since it would mean here: match the whole input
file. However, expressions like {A.*B} should be acceptable, since they
match a pair of A and B, and everything in between them, e.g. for a
string like: "This is Mr. Allen and this is Mr. Brown." this expression
should match the string: "Allen and this is Mr. B".
.br
.in -4m
Remember to put a backslash "\|\\\|" in front of the following
characters: .\|\|[\|\|(\|\|)\|\||\|\|?\|\|+\|\|*\|\|\|\\\| if you want
their literal meaning outside the
range enclosed in [\|\|]. Inside the range they have their literal meaning.
If you know the syntax of UN*X regular expressions, please note that
\|\|^\|\| and \|$\| anchors are not supported and are treated as normal
characters (with the exception of \|\|^\|\| negation within [\|\|]).
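.sp
Two of the examples above can be tried with the Python sketch below. It is
an illustration only: Python's \fIre\fR module is a different regular
expression engine from the one built into the program (for instance, it
does treat ^ and $ as anchors), so only constructs common to both are used.
.nf
```python
import re

text = "This is Mr. Allen and this is Mr. Brown."

# {A.*B}: "*" is greedy, so the match runs from the first A to the last B.
found = re.search("A.*B", text).group(0)
print(found)

# {[Tt][Ss]} and {TS|Ts|tS|ts} accept the same four strings.
by_range = re.compile("[Tt][Ss]")
by_branch = re.compile("TS|Ts|tS|ts")
for word in ("TS", "Ts", "tS", "ts", "st"):
    print(word, bool(by_range.fullmatch(word)), bool(by_branch.fullmatch(word)))
```
.fi
The first print shows the text from Allen up to the B of Brown, and the
loop reports a match for the first four words only.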
.sp
.ce
\fBSUBSTITUTION EXPRESSIONS\fR
.br
After finding a match for a regular expression in the input text,
a substitution is made.
It can be a simple substitution where the whole matching string
is replaced by another string, or it may reuse a portion or
the whole matching string. The subexpressions (the ones enclosed
in parentheses) within the regular
expression which matched the input text can be referenced in the
substitution expression.
Only the following characters have special meaning within substitution
expression:
.in +4m
.ta 3m
.br
.ti -2m
& --- will put the whole matching string.
.ti -2m
\|\\\|1 --- will put the match for the 1st subexpression in (\|\|).
.ti -2m
\|\\\|2 --- will put the string which matched 2nd subexpression,
etc.
.ti -2m
\|\\\|9 --- will place in a replacement string the 9th
subexpression (provided that there were 9 (\|\|) pairs in
the regular expression)
.in -4m
.sp
Only 9 subexpressions are allowed.
All other characters and sequences within the substitution expression
will be placed in a substitution string as written. To be able to put
a single backslash there, you need to put two of them.
To be able to place the unchanged codes of the
above characters (i.e., to make them literals), you need to precede them
with a backslash "\|\\\|", i.e., to get & in the output string
you need to write it as \|\\\|&. Similarly, to place literal
\|\\\|1, \|\\\|2, etc., you need to enter it as \|\\\|\|\\\|1, \|\\\|\|\\\|2, etc.
Note that characters .+[]()^, etc. which had a special meaning in
the regular expressions, do not have any special meaning in the
substitution expression and will be output as written.
.in +2m
Example:
.br
The regular expression:
.in +2m
.ti -2m
{([Tt])([Ss])} and the corresponding substitution expression {\|\\\|1.\|\\\|2}
puts a period
between adjoining letters t and s preserving their letter case.
.br
The expression:
.ti -2m
{([A-Za-z]+)-[ \|\\\|0x09]*([\|\\\|0x0A-\|\\\|0x0D]+)[ \|\\\|0x09]*([A-Za-z,.?;:"\|\\\|)'`!]+)[ \|\\\|0x09]}
.br
and the substitution expression {\|\\\|1\|\\\|3\|\\\|2} dehyphenates words (when you
understand this one, you are a guru...). For example:
con- (NL)cert is changed to concert(NL), where NL stands for New
Line. It looks for one or more letters (saves them as substring 1)
followed by a hyphen (which may be followed by zero or more spaces
or tabs). The hyphen must be followed by a NewLine (ASCII characters
0A-0D hex form various new line sequences) and saves NewLine sequence
as a subexpression 2.
Then it looks for zero or more tabs and spaces (at the beginning of
the line). Then it looks for the rest of the hyphenated word and
saves it as substring 3. The word may have punctuation attached.
Then it looks again for some spaces or tabs. The substitution expression
junks all sequences which were not within (), i.e., hyphen and
spaces/tabs and inserts only substrings but in a different
order. The \|\\\|1 (word beginning) is followed by \|\\\|3 (word end) and
followed by the NewLine --- \|\\\|2. The {\|\\\|2\|\\\|1\|\\\|3} would
be probably equally good, though you would need to move the punctuation
matching to the beginning of the regular expression.
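.sp
The substitution rules can be illustrated by the Python sketch below,
which expands a substitution expression for a match found by Python's
\fIre\fR module. It is an illustration only, not the code of the program;
the function name is hypothetical, and the backslash, tab and new line
characters are built with chr() to keep the listing free of escape
characters.
.nf
```python
import re

BS = chr(92)                      # the backslash character
TAB, NL, CR = chr(9), chr(10), chr(13)

def expand(template, match):
    """Build the replacement for one match from a substitution expression."""
    out = []
    pos = 0
    while pos < len(template):
        ch = template[pos]
        nxt = template[pos + 1:pos + 2]
        if ch == "&":
            out.append(match.group(0))               # whole matching string
        elif ch == BS and nxt and nxt in "123456789":
            out.append(match.group(int(nxt)) or "")  # n-th subexpression
            pos += 1
        elif ch == BS and nxt and nxt in "&" + BS:
            out.append(nxt)                          # literal & or backslash
            pos += 1
        else:
            out.append(ch)
        pos += 1
    return "".join(out)

# First example: a period between adjoining t and s, case preserved.
dotted = re.sub("([Tt])([Ss])",
                lambda m: expand(BS + "1." + BS + "2", m), "cats, TSAR")

# Second example: the dehyphenation expression, reordered as 1, 3, 2.
BLANKS = "[ " + TAB + "]"
PUNCT = ",.?;:" + chr(34) + ")" + chr(39) + "`!"
DEHYPHEN = ("([A-Za-z]+)-" + BLANKS + "*([" + NL + "-" + CR + "]+)"
            + BLANKS + "*([A-Za-z" + PUNCT + "]+)" + BLANKS)
joined = re.sub(DEHYPHEN,
                lambda m: expand(BS + "1" + BS + "3" + BS + "2", m),
                "a con- " + NL + "cert tonight")
print(dotted)
print(joined)
```
.fi
The first result is "cat.s, T.SAR". In the second, "con-" and "cert" are
joined into "concert" and the new line moves behind the word.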
.in -6m
.ti -3m
Ad.3) Starting sequence. This sequence will be sent to the output before
any text. It is enclosed in the pair of string delimiters. I use it
to output LaTeX preamble. However, it can be empty, if not used.
The sequence may contain any characters, including new lines, etc.
.nf
.ta 2m 4m
Example:
"" # empty sequence
.sp
Example:
"\|\\\|documentstyle{article}
\|\\\|input cyracc
\|\\\|begin{document}
"
is right (note a new line at the end), but
.br
"\|\\\|documentstyle{article}
\|\\\|input cyracc # this comment will be included!
\|\\\|begin{document}" # while this will not
is wrong.
.sp
.fi
.ti -3m
Ad.4) Ending sequence. Similar to 3), but will be appended at the end of the
output file.
.nf
For example:
"\|\\\|end{document}
"
.fi
.sp
.ti -3m
Ad.5) Number of input character sets. For example, in some incarnations of
KOI7, there are two character sets: Latin and Cyrillic. Cyrillic
character sequence follows SHIFT-OUT character (CTRL-N), \|\\\|0x0e,
and is terminated by SHIFT-IN character (CTRL-O), \|\\\|0x0f.
Another way of looking at it is that Latin characters follow
CTRL-O and cyrillic ones follow CTRL-N.
.sp
If there is only one character set on input you should specify 0
as a number of input char sets,
since the input file obviously does not contain any SHIFT-OUT/IN
sequences.
.sp
.ti -3m
Ad.6) SHIFT-OUT/SHIFT-IN sequences for each input character set.
These lines appear only if you specified a nonzero number of character sets.
These lines also contain "nesting sequences", which will be
explained later in this section.
Nesting sequences are rarely needed, so let us assume
for the moment that the nesting data are empty strings.
The strings or regular expressions specified here are matched
against the contents of the input text. If a match is found, the matching sequence
is usually deleted from the input text and:
.in +4m
.ti -2m
a) for SHIFT-OUT sequence: the current input character set number is changed
to the new one corresponding to the SHIFT-OUT sequence, or
.ti -2m
b) for SHIFT-IN sequence: the previous input character set number is restored,
(i.e., the one which preceded the SHIFT-OUT sequence for the current set).
Note that only the SHIFT-IN sequence for the current set is matched.
The SHIFT-IN sequences for other character sets than the current set are
not matched.
The bracketing of sets is assumed
perfect. If the SHIFT-IN sequence for the current set is an empty string,
the input set number is changed when SHIFT-OUT sequence of the new set
is detected.
.in -4m
For each input character set, you have to specify a line consisting
of 6 strings/expressions separated by spaces:
.br
SO-match SO-subs NEST-up NEST-down SI-match SI-subs
.br
where:
.br
.in +2m
.ti -2m
SO-match --- the string or regular expression for the SHIFT-OUT sequence
for the current character set. If detected, the input character set is
changed to this set.
.ti -2m
SO-subs --- this is usually an empty string (i.e., the input sequence
matching SO-match is removed). But it can be a replacement string or
a substitution expression, which will substitute the original matching
SHIFT-OUT sequence.
.ti -2m
NEST-up --- this string (or regular expression) is usually an empty
string. However, it can be used to count brackets for detection of the SHIFT-IN
bracket, if the SHIFT-IN sequence is not unique. Its use is explained below.
.ti -2m
NEST-down --- a counterpart of NEST-up. It is explained later.
.ti -2m
SI-match --- when a sequence in an input file matches the string or regular
expression given as SI-match for a current input character set, the
input character set number is restored to the previous set. Note, that
only SI-match for a current set is matched with input characters.
.ti -2m
SI-subs --- this is usually an empty string (i.e., input sequence which
matched SI-match is removed), but if it is not, the input characters which
matched the SI-match are replaced with the SI-subs.
.sp
.in -2m
.br
The KOI7 case described above may be specified as:
.nf
.ta 5m 10m 15m 20m 25m
.nf
2 # 2 input sets
""\0\0\0\0 ""\0\0\0\0 ""\0\0\0\0 ""\0\0\0\0 ""\0\0\0\0 ""\0\0\0\0 # Latin(set 1)
"\|\\\|016" ""\0\0\0\0 ""\0\0\0\0 ""\0\0\0\0 "\|\\\|017" ""\0\0\0\0 # Cyrillic(set 2)
or
2 # 2 sets
"\|\\\|017" ""\0\0\0\0 ""\0\0\0\0 ""\0\0\0\0 ""\0\0\0\0 ""\0\0\0\0 # Latin(set 1)
"\|\\\|016" ""\0\0\0\0 ""\0\0\0\0 ""\0\0\0\0 ""\0\0\0\0 ""\0\0\0\0 # Cyrillic(set 2)
.fi
.br
Before the input is processed, the program is initialized to the first
character set. In the above case this is important, since the declaration:
.nf
2 # 2 sets
"\|\\\|016" ""\0\0\0\0 ""\0\0\0\0 ""\0\0\0\0 ""\0\0\0\0 ""\0\0\0\0 # Cyrillic(set 1)
"\|\\\|017" ""\0\0\0\0 ""\0\0\0\0 ""\0\0\0\0 ""\0\0\0\0 ""\0\0\0\0 # Latin(set 2)
.br
.fi
would be wrong and would mess up the Latin characters preceding
the first Cyrillic sequence.
.sp 1
The nesting sequences are used only for specific situations. I needed them
to write a transliteration table from LaTeX to KOI8.
In LaTeX the { } pair is used for grouping and appears frequently in
the text. A sequence of Cyrillic characters is also a group
in LaTeX.
The SHIFT-OUT sequence for Russian letters in LaTeX is (at least in
my case): "{\|\\\|cyr ", and the end
of the Russian letters is marked by "}", but the "}" has to be the
bracket matching the opening "{" in "{\|\\\|cyr ", not just any bracket.
For this reason, my SHIFT-OUT/IN entry was in this case:
.br
"{\|\\\|cyr " "" "{" "}" "}" "" # Cyrillic codes
.br
Whenever "{\|\\\|cyr " is found, the program zeroes the counter.
It adds 1 to it when the NEST-up sequence (i.e., the "{" here) is found, and
subtracts 1 from it when the NEST-down sequence (i.e., the "}") is found.
The check for a SHIFT-IN sequence (i.e., the "}") for the Cyrillic set
is done only when
the counter value is zero (i.e., all pairs inside the Cyrillic text are
matched). In fact, the process is more
complicated than that (the counter for an opened character set is
placed on the stack), but these are details you can find in the code
itself.
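.sp
The counting described above can be illustrated with a short Python
sketch. It is not taken from translit.c: the function name
find_shift_in and the sample text are invented for this illustration,
only plain strings (no regular expressions) are handled, and the stack
of counters mentioned above is left out.
.nf
```python
def find_shift_in(text, start, nest_up="{", nest_down="}", si_match="}"):
    """Return the index of the SHIFT-IN that closes the set opened
    just before `start`, or -1 if there is none (hypothetical helper)."""
    counter = 0                      # zeroed when SHIFT-OUT is found
    i = start
    while i < len(text):
        # SHIFT-IN is checked only while the counter is zero
        if counter == 0 and text.startswith(si_match, i):
            return i
        if text.startswith(nest_up, i):
            counter += 1             # a nested group opens
        elif text.startswith(nest_down, i):
            counter -= 1             # a nested group closes
        i += 1
    return -1


so = "{\\cyr "                       # SHIFT-OUT sequence from the example
sample = "a {\\cyr privet {\\it mir} poka} b"
begin = sample.index(so) + len(so)
end = find_shift_in(sample, begin)
print(sample[begin:end])
```
.fi
The inner group is skipped because its closing bracket is seen while the
counter is 1, so only the last bracket is taken as the SHIFT-IN sequence.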
.br
What if the SHIFT-IN and SHIFT-OUT sequences are the same character?
Starting with version 1.01, TRANSLIT also works in such cases.
Let us assume that the SHIFT-IN and SHIFT-OUT sequence is a single
character "%" which switches between two character sets. Also,
if we want to use it in the text, we have to double it,
i.e., "%%" will not be a SHIFT-IN/OUT sequence but will denote a literal
percent sign. We can do it in the following way:
.br
"" "" "" "" "" "" # Latin letters
.br
{%([^%])} {\|\|\\\|\|1} "" "" {%([^%])} {\|\|\\\|\|1} # Cyrillic codes
.br
and later in the transliteration table (see below) we should put a line:
.br
0 "%%" 0 "%" # change doubled % to a single one
.br
The same effect, for identical SHIFT-IN/OUT sequences, can be accomplished
with a -3 character set code and will be described below.
.sp
.ti -3m
Ad.7) Number of output "character sets". This is analogous to the input case.
The characters sent to output may belong to different sets. For example,
when the character (or the sequence) from set 2 is followed by the character
(or the sequence) from set 1,
the program first sends the SHIFT-IN sequence for set 2 (if it is not
empty) and then the SHIFT-OUT sequence for set 1 (if it is not empty). If the
output character (or sequence) is assigned to set 0, then no SHIFT-IN/SHIFT-OUT
sequences are sent to output.
.br
If there is only one set of output characters, you should specify 0.
Note that you may have several input sets and several output sets, though
this is rare. Usually, you have one input set and many
output character sets, or vice versa. Again, if you have only one output set,
you do not have any SHIFT-IN/SHIFT-OUT sequences, since those are
sent to output only when a set number is changed.
But you are free to experiment.
.sp
.ti -3m
Ad.8) SHIFT-OUT/SHIFT-IN sequences for each output character set. It is
similar to the input case; however, the NEST-up and NEST-down sequences
are not used here. Again, before any text is sent to output, the
character set specified as the first one is assumed. If SHIFT-OUT/IN
sequences are not used (i.e., you have only one output character set),
you will not have any SHIFT-OUT/SHIFT-IN data lines.
The KOI8 (single character set containing all Latin and Russian letters)
to KOI7 (the set using overlapping codes switched by SHIFT-OUT/IN sequences)
conversion could therefore be accomplished by the following table:
.br
2 # 2 output sets
.br
""\0\0\0\0 ""\0\0\0\0 # Latin Letters
.br
"\|\\\|016" "\|\\\|017" # Russian Letters
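.sp
The way SHIFT-IN/SHIFT-OUT sequences are sent to output can be sketched
in Python as follows. This is only an illustration under assumptions:
the function name emit, the shifts dictionary and the sample tokens are
invented here and do not come from translit.c. Codes 14 and 15 are the
SO (octal 016) and SI (octal 017) characters of the table above.
.nf
```python
SO = chr(14)   # octal 016
SI = chr(15)   # octal 017


def emit(tokens, shifts):
    """tokens: (output_set_no, output_seq) pairs.
    shifts: set number -> (SHIFT-OUT, SHIFT-IN) strings (hypothetical layout)."""
    out = []
    current = 1                          # the first set is assumed initially
    for set_no, seq in tokens:
        if set_no != 0 and set_no != current:
            out.append(shifts[current][1])   # SHIFT-IN of the set being left
            out.append(shifts[set_no][0])    # SHIFT-OUT of the set being entered
            current = set_no
        out.append(seq)                  # set 0 never changes the current set
    return "".join(out)


shifts = {1: ("", ""), 2: (SO, SI)}      # Latin letters, Russian letters
result = emit([(1, "ab"), (0, " "), (2, "pq"), (0, " "), (1, "cd")], shifts)
print(repr(result))
```
.fi
The spaces assigned to set 0 pass through without triggering any shift,
so the SI character appears only when a set 1 sequence follows set 2.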
.sp
.ti -3m
Ad.9) Transliteration table for individual characters or their sequences.
It is the core of your transliteration data.
There are 4 columns in the transliteration
table:
.br
.in +3m
(inp_set_no) (inp_seq) (out_set_no) (out_seq)
.br
.in -3m
These 4 columns are separated by spaces. The (input_set_number)
corresponds to the input character set number as specified above for
input SHIFT-OUT/SHIFT-IN data, or zero.
If zero is used (even if the number of input sets is not zero), the
(input_sequence) will always be matched, irrespective of the current
input character set imposed by the SHIFT-OUT sequence. This is useful,
since some characters are universal (e.g., new lines, spaces, pluses,
minuses, etc.) irrespective of the current character set.
The (input_sequence) is the sequence of characters to be matched with
characters in the input file, and if found (within the character set
specified) it is replaced by the (output_sequence) and sent to output
(i.e., the matching is interrupted, the (output_sequence) is sent to output,
the input file pointer is moved to the first character after the
matched sequence and matching resumes).
The (output_set_number) specifies the output character set. When the
output character set changes during transliteration, the appropriate SHIFT-IN
sequence of the previous set and the current set's SHIFT-OUT sequence are sent
to output. The (output_set_number) may also be zero (even if the number of
output sets is not zero). In this case, the current output set status
is not changed, and no SHIFT-IN/OUT sequences are sent to output. Lastly, the
output set code may be -1, -2 or -3.
In this case, the substitution is performed
within the input string that matched, but the output sequence is not sent to
the output yet. Depending on the code, the following action is performed:
.in +4m
.ti -2m
-1 --- program makes the substitution in the input string (i.e., replaces
the matching string with the output sequence in the input buffer).
It does not send the output sequence to the output, but
continues matching input sequences following the currently
matched one.
.ti -2m
-2 --- like code -1, but matching is resumed from the first sequence on
the list.
.ti -2m
-3 --- like code -1, but matching is resumed from the input SHIFT-OUT/IN
sequences.
.in -4m
E.g., if the unprocessed text in the input file is:
.br
mental procedure was not successful since..........
.br
and there was a line in transliteration table:
.br
0 "me" -1 "you"
.br
the input text would be changed to:
.br
yountal procedure was not successful since..........
.br
and all remaining matching data would be applied to this text, rather than
original text.
The -2 code backsteps to the point where the matching of
transliteration starts.
The -3 code backsteps even further, to the point where the
input SHIFT-OUT and SHIFT-IN sequences are matched.
Since the order of sequences to match
is crucial here, for the case of output set code -1/-2/-3
even one-character input sequences are matched in the order specified.
BE CAREFUL HERE. You may create infinite loops. If you use
code -2/-3, be sure that the resulting sequence after substitution
with the code -2/-3, will not match previous sequences
with codes -2/-3.
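.sp
The behaviour of code -1 in the "me"/"you" example above can be
sketched in Python. This is a simplified illustration, not the logic
of translit.c: the function apply_at and the second rule ("yo") are
invented here, only plain strings are handled, and codes -2 and -3 are
not modelled.
.nf
```python
def apply_at(buffer, pos, rules):
    """Try the rules in order at buffer[pos:].
    Returns (buffer, emitted_text, new_pos). Hypothetical helper."""
    for pattern, code, replacement in rules:
        if buffer.startswith(pattern, pos):
            if code == -1:
                # rewrite the input buffer, send nothing to output,
                # and keep trying the rules that follow this one
                buffer = buffer[:pos] + replacement + buffer[pos + len(pattern):]
                continue
            return buffer, replacement, pos + len(pattern)
    # nothing (else) matched: the current character is output as is
    return buffer, buffer[pos], pos + 1


rules = [("me", -1, "you"), ("yo", 0, "YO")]
buf, out, pos = apply_at("mental", 0, rules)
print(buf, out, pos)
```
.fi
After the first rule rewrites the buffer to "yountal", the second rule
is matched against the rewritten text rather than the original text.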
.br
The (output_sequence)
is a sequence which substitutes the corresponding (input_sequence).
If (output_sequence) is "" (i.e., empty string) then (input_sequence)
is effectively deleted.
The (input_sequence)s are compared with input in the order specified
unless backstepping -2/-3 code is used (the matching is done from the
first sequence again). I use the code -1 e.g.,
to dehyphenate words when changing to LaTeX.
Code -2 is useful if you want to skip next comparisons, and the resulting
substitution string will match earlier matching expressions.
I do not see many uses for the code -3, but it can be used to resolve
"toggle" SHIFT-IN/OUT sequence, as described in an example further
below.
The order for multicharacter sequences is
therefore important (the single character sequences are always compared
after all multicharacter sequences, and can be therefore put anywhere).
The longer multicharacter sequences should be specified before
shorter ones, unless they are some "preprocessing" steps with codes
-1/-2/-3. The order may sometimes be crucial.
If you need single character sequences matched in a specific order,
enter them as regular expressions, i.e., as {c} instead of "c".
In short, the multicharacter input sequences and regular expressions
are matched to input text in the order specified. For the sake of
efficiency, the single character input sequences (with exception of
output set code -1/-2/-3) and input lists are handled as a case of remapping
and are matched in the order of character codes associated with them.
If you specify the same single input character twice for a given input set,
the program will complain.
The following combinations of input and output sequences are allowed:
.nf
.ta 2m 24m
Input Sequence Output Sequence
"\fIplain string\fR" only "\fIplain string\fR"
[\fIlist\fR] [\fIlist\fR] or "\fIplain string\fR"
{\fIregular expression\fR} {\fIsubstitution expression\fR} or
.br
"\fIplain string\fR"
.br
.fi
When a match is found, the matching sequence is removed and substituted
with an output sequence. If this results in changing the current output
character set, the appropriate SHIFT-IN/SHIFT-OUT pair is sent to the
output before the transliterated output sequence. If a list is
used as the input sequence, you may either use:
.br
.in +2m
.ti -2m
a) plain string as output
sequence. In this case, if current input character belongs to the input list,
it is replaced by the output string. I use it to delete ranges of
characters which do not have any corresponding characters in the output
set (e.g., some graphics characters). In this case, the order of
characters on the input list is not important.
.ti -2m
b) if the output string is also a
list then it has to contain exactly the same number of characters as
the input list. In this case, the 1st character from the input list
is replaced by the 1st character from the output list, the 2nd one
by the 2nd one, etc. Therefore, the order of characters is important.
.br
.in -2m
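.sp
Cases a) and b) above behave like a per-character remapping. A minimal
Python sketch follows, using the standard str.maketrans and
str.translate; the helper names list_to_list and list_to_string are
invented for this illustration, and the lists are written out as plain
strings rather than in the list syntax of the table.
.nf
```python
def list_to_list(inp, out):
    """Case b): the n-th input character maps to the n-th output character."""
    if len(inp) != len(out):
        raise ValueError("lists must have the same number of characters")
    return str.maketrans(inp, out)


def list_to_string(inp, replacement):
    """Case a): every character of the input list maps to one plain string."""
    return {ord(ch): replacement for ch in inp}


mapped = "cabbage".translate(list_to_list("abc", "xyz"))
stripped = "cabbage".translate(list_to_string("aeiou", ""))
print(mapped, stripped)
```
.fi
In case a) the order of the input list does not matter, while in case b)
swapping two characters in either list changes the result.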
Theoretically, if there is one-to-one correspondence between characters
in the input set and characters in the output set,
you can make the conversion by
using a single line consisting of two lists. But it looks ugly... And is
difficult to read.
And for the program, the substitution takes the same time, if
the characters are specified separately, or when they are specified
as matching lists.
If a regular expression is used to match the input characters, the matching
sequence may be replaced by a plain string or a substitution string,
which was described above.
.in +3m
Examples:
.br
.ta 3m 10m 20m 30m 40m
2 "CCCP" 0 ""\0\0\0\0
.br
will delete all occurrences of CCCP from the input file (but not Cccp or
CCCp) for input set 2.
.sp 1
0 "\|\\\|0xD1" 0 "ya"
.br
will replace all occurrences of character of the code \|\\\|0xD1 with a two
letter sequence "ya".
.sp 1
0 \|\\\|0xD1 2 q
.br
will replace all characters \|\\\|0xD1 with a character "q" and output
SHIFT-IN/OUT sequence if necessary.
.sp 1
2 "q" 0 "\|\\\|0xD1"
.br
will replace letter q (if the current input set is 2) with a code \|\\\|0xD1.
.sp 1
0 "\|\\\|0xD1" 2 "ya"
.br
will replace code \|\\\|0xD1 with a sequence ya (assuming that SHIFT-OUT
and SHIFT-IN sequences
for output set 2 are: {\|\\\|cyr and }, respectively, you will get {\|\\\|cyr ya}).
.sp
If a character is not specified in the transliteration table, it will
be output as is, i.e., it corresponds to a line:
.br
0 "c" 0 "c"
.br
where c is the character. If you want to delete certain characters, you
need to explicitly specify this, e.g.:
.br
0 [a-z] 0 ""
.br
will delete all lower case Latin letters from the text.
.br
Below is an example of solving the identical SHIFT-IN/OUT sequences problem
using character set code -3, which I promised above. Assume that you
have 2 character sets in the input file, but switching between them is
accomplished by a "toggle" character. That is, if the toggle character is
found, you should switch to the other set. Also, if you want to use the
toggle character in the set, you need to double it. Let us also assume that
we have 2 character codes which will never, ever appear. We can fool
translit by changing the toggle character to a unique character and backstepping
with character set code -3 to check for SHIFT-IN/OUT sequences again. Let the
% sign be the toggle character, and assume that we have two codes (for example codes
\|\|\\\|\|254 and \|\|\\\|\|255) which will never appear in our text.
The appropriate entries in the transliteration table may look like:
.br
1 {%([^%])} -3 {\|\|\\\|\|254\|\|\\\|\|1}
.br
2 {%([^%])} -3 {\|\|\\\|\|255\|\|\\\|\|1}
.br
0 "%%" 0 "%"
.br
i.e., when the single % is seen in set 1, produce SHIFT-OUT sequence
for set 2; and when a single % is seen in set 2, produce SHIFT-IN
sequence for set 1. The appropriate input character set definitions will be:
.br
2 # number of input character sets
.br
"\|\|\\\|\|255" "" "" "" "" ""
.br
"\|\|\\\|\|254" "" "" "" "" ""
.br
However, be warned. I never tried this. If this trick does not work,
please let me know.
.sp 1
.in -3m
Before you decide to create your own transliteration file, please examine
existing transliteration files. Do yourself (and others) a favor --- put
as many comments as possible there. If you allow others to use your
transliteration files, please include your name and e-mail address
and file creation date.
.in -4m
.sp 2
The program matches the sequences in a specific order:
.in +4m
.ti -2m
\01) If the NEST counter is zero, match/substitute the current set SHIFT-IN sequence
.ti -2m
\02) If matched, restore previous set number
.ti -2m
\03) If matched, restore previous set nest counter
.ti -2m
\04) Match/substitute input SHIFT-OUT sequences
.ti -2m
\05) If matched, save current set and start new one
.ti -2m
\06) If matched, zero nest counter for NEST sequences
.ti -2m
\07) Match/substitute transliteration sequences
.ti -2m
\08) If matched and code = -1 make substitution in input buffer and
continue matching the next sequence.
.ti -2m
\09) If matched and code = -2 make substitution and goto 7)
.ti -2m
10) If matched and code = -3 make substitution and goto 1)
.ti -2m
11) Match (no substitution) NEST-up and NEST-down to input buffer
.ti -2m
12) If NEST-up matched, increment counter for current set
.ti -2m
13) If NEST-down matched, decrement counter for current set
.ti -2m
14) If match in 7) send substitute sequence to output
.ti -2m
15) If no match in 7) (or code -1) output current input character
.ti -2m
16) Advance input pointer to point at new characters
.ti -2m
17) If End of File, break
.ti -2m
18) Goto 1)
.br
.fi
.PP
.SH ASCII CHARACTER CODES
.nf
.ta 2m 6m 9m 13m 16m 20m 22m 26m 29m 33m 36m 40m
dec hx oct ch dec hx oct ch
\0\00 00 000 ^@ NUL \064 40 100 @
\0\01 01 001 ^A SOH \065 41 101 A
\0\02 02 002 ^B STX \066 42 102 B
\0\03 03 003 ^C ETX \067 43 103 C
\0\04 04 004 ^D EOT \068 44 104 D
\0\05 05 005 ^E ENQ \069 45 105 E
\0\06 06 006 ^F ACK \070 46 106 F
\0\07 07 007 ^G BEL \071 47 107 G
\0\08 08 010 ^H BS \072 48 110 H
\0\09 09 011 ^I HT \073 49 111 I
\010 0a 012 ^J LF \074 4a 112 J
\011 0b 013 ^K VT \075 4b 113 K
\012 0c 014 ^L FF \076 4c 114 L
\013 0d 015 ^M CR \077 4d 115 M
\014 0e 016 ^N SO \078 4e 116 N
\015 0f 017 ^O SI \079 4f 117 O
\016 10 020 ^P DLE \080 50 120 P
\017 11 021 ^Q DC1 \081 51 121 Q
\018 12 022 ^R DC2 \082 52 122 R
\019 13 023 ^S DC3 \083 53 123 S
\020 14 024 ^T DC4 \084 54 124 T
\021 15 025 ^U NAK \085 55 125 U
\022 16 026 ^V SYN \086 56 126 V
\023 17 027 ^W ETB \087 57 127 W
\024 18 030 ^X CAN \088 58 130 X
\025 19 031 ^Y EM \089 59 131 Y
\026 1a 032 ^Z SUB \090 5a 132 Z
\027 1b 033 ^[ ESC \091 5b 133 [
\028 1c 034 ^\\ FS \092 5c 134 \\
\029 1d 035 ^] GS \093 5d 135 ]
\030 1e 036 ^^ RS \094 5e 136 ^
\031 1f 037 ^_ US \095 5f 137 _
\032 20 040 SP \096 60 140 `
\033 21 041 ! \097 61 141 a
\034 22 042 " \098 62 142 b
\035 23 043 # \099 63 143 c
\036 24 044 $ 100 64 144 d
\037 25 045 % 101 65 145 e
\038 26 046 & 102 66 146 f
\039 27 047 ' 103 67 147 g
\040 28 050 ( 104 68 150 h
\041 29 051 ) 105 69 151 i
\042 2a 052 * 106 6a 152 j
\043 2b 053 + 107 6b 153 k
\044 2c 054 , 108 6c 154 l
\045 2d 055 - 109 6d 155 m
\046 2e 056 . 110 6e 156 n
\047 2f 057 / 111 6f 157 o
\048 30 060 0 112 70 160 p
\049 31 061 1 113 71 161 q
\050 32 062 2 114 72 162 r
\051 33 063 3 115 73 163 s
\052 34 064 4 116 74 164 t
\053 35 065 5 117 75 165 u
\054 36 066 6 118 76 166 v
\055 37 067 7 119 77 167 w
\056 38 070 8 120 78 170 x
\057 39 071 9 121 79 171 y
\058 3a 072 : 122 7a 172 z
\059 3b 073 ; 123 7b 173 {
\060 3c 074 < 124 7c 174 |
\061 3d 075 = 125 7d 175 }
\062 3e 076 > 126 7e 176 ~
\063 3f 077 ? 127 7f 177 DEL
.br
.SH CONVERSION: DECIMAL<-->OCTAL<-->HEX.
.nf
.cs R 24
000 000 00 064 100 40 128 200 80 192 300 C0
001 001 01 065 101 41 129 201 81 193 301 C1
002 002 02 066 102 42 130 202 82 194 302 C2
003 003 03 067 103 43 131 203 83 195 303 C3
004 004 04 068 104 44 132 204 84 196 304 C4
005 005 05 069 105 45 133 205 85 197 305 C5
006 006 06 070 106 46 134 206 86 198 306 C6
007 007 07 071 107 47 135 207 87 199 307 C7
008 010 08 072 110 48 136 210 88 200 310 C8
009 011 09 073 111 49 137 211 89 201 311 C9
010 012 0A 074 112 4A 138 212 8A 202 312 CA
011 013 0B 075 113 4B 139 213 8B 203 313 CB
012 014 0C 076 114 4C 140 214 8C 204 314 CC
013 015 0D 077 115 4D 141 215 8D 205 315 CD
014 016 0E 078 116 4E 142 216 8E 206 316 CE
015 017 0F 079 117 4F 143 217 8F 207 317 CF
016 020 10 080 120 50 144 220 90 208 320 D0
017 021 11 081 121 51 145 221 91 209 321 D1
018 022 12 082 122 52 146 222 92 210 322 D2
019 023 13 083 123 53 147 223 93 211 323 D3
020 024 14 084 124 54 148 224 94 212 324 D4
021 025 15 085 125 55 149 225 95 213 325 D5
022 026 16 086 126 56 150 226 96 214 326 D6
023 027 17 087 127 57 151 227 97 215 327 D7
024 030 18 088 130 58 152 230 98 216 330 D8
025 031 19 089 131 59 153 231 99 217 331 D9
026 032 1A 090 132 5A 154 232 9A 218 332 DA
027 033 1B 091 133 5B 155 233 9B 219 333 DB
028 034 1C 092 134 5C 156 234 9C 220 334 DC
029 035 1D 093 135 5D 157 235 9D 221 335 DD
030 036 1E 094 136 5E 158 236 9E 222 336 DE
031 037 1F 095 137 5F 159 237 9F 223 337 DF
032 040 20 096 140 60 160 240 A0 224 340 E0
033 041 21 097 141 61 161 241 A1 225 341 E1
034 042 22 098 142 62 162 242 A2 226 342 E2
035 043 23 099 143 63 163 243 A3 227 343 E3
036 044 24 100 144 64 164 244 A4 228 344 E4
037 045 25 101 145 65 165 245 A5 229 345 E5
038 046 26 102 146 66 166 246 A6 230 346 E6
039 047 27 103 147 67 167 247 A7 231 347 E7
040 050 28 104 150 68 168 250 A8 232 350 E8
041 051 29 105 151 69 169 251 A9 233 351 E9
042 052 2A 106 152 6A 170 252 AA 234 352 EA
043 053 2B 107 153 6B 171 253 AB 235 353 EB
044 054 2C 108 154 6C 172 254 AC 236 354 EC
045 055 2D 109 155 6D 173 255 AD 237 355 ED
046 056 2E 110 156 6E 174 256 AE 238 356 EE
047 057 2F 111 157 6F 175 257 AF 239 357 EF
048 060 30 112 160 70 176 260 B0 240 360 F0
049 061 31 113 161 71 177 261 B1 241 361 F1
050 062 32 114 162 72 178 262 B2 242 362 F2
051 063 33 115 163 73 179 263 B3 243 363 F3
052 064 34 116 164 74 180 264 B4 244 364 F4
053 065 35 117 165 75 181 265 B5 245 365 F5
054 066 36 118 166 76 182 266 B6 246 366 F6
055 067 37 119 167 77 183 267 B7 247 367 F7
056 070 38 120 170 78 184 270 B8 248 370 F8
057 071 39 121 171 79 185 271 B9 249 371 F9
058 072 3A 122 172 7A 186 272 BA 250 372 FA
059 073 3B 123 173 7B 187 273 BB 251 373 FB
060 074 3C 124 174 7C 188 274 BC 252 374 FC
061 075 3D 125 175 7D 189 275 BD 253 375 FD
062 076 3E 126 176 7E 190 276 BE 254 376 FE
063 077 3F 127 177 7F 191 277 BF 255 377 FF
.cs R
.br
.sp
.fi
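.sp
Any row of the conversion table above can be reproduced with a one-line
formatting expression. The Python sketch below is an illustration only;
the function name code_row is invented here.
.nf
```python
def code_row(n):
    """Format a character code as decimal, octal and hexadecimal,
    in the same column order as the table above."""
    return "%03d %03o %02X" % (n, n, n)


print(code_row(209))
print(code_row(14))
```
.fi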
.SH INSTALLATION
The program is distributed in source form. It has been compiled and run under
UN*X, VMS and MS-DOS systems. The file \fIreadme.doc\fR contains the details
on how to obtain the whole package. You can retrieve this file
from anonymous ftp on www.ccl.net in the directory /pub/russian/translit.
You can also obtain it via e-mail by sending a message:
.br
get translit/readme.doc from russian
.br
to OSCPOST@ccl.net or OSCPOST@OHSTPY.BITNET.
.sp
The source of the program consists of several files:
.br
.IP \fIpaths.h\fR
must be edited before compilation. It contains
comments explaining what to do. The defines in this file relate to the operating
system you are using and the default path for searching transliteration
tables.
.br
.IP \fItranslit.c\fR
It contains the main program.
This was intended to be a portable code.
.br
.IP \fIreg_exp.h\fR
the include file for regular expression matching
library of Henry Spencer from the University of Toronto. This regular
expression package was posted to comp.sources.misc (volume 3). Also 4 patches
were posted (in volumes: 3, 4, 4, 10). I applied the patches to the original
code and made small modifications to the code, which are marked in the
source code.
.br
.IP \fIreg_exp.c\fR
the regular expression library for compilation and
matching of regular expressions.
.br
.IP \fIreg_sub.c\fR
the regular expression substitution routine.
.br
.sp
.PP
Before you compile this program you have to edit \fIpaths.h\fR.
Read comments in the file.
During compilation, all source code should reside in the
current directory.
.br
Then you may compile the program under UN*X as (for example):
.br
cc -o translit translit.c reg_exp.c reg_sub.c
.br
and copy the program \fItranslit\fR to some standard directory which is
in users' path (for example: /usr/local/bin). Then you need to copy
transliteration tables to the directory which you have chosen in \fIpaths.h\fR.
If you get errors during compilation, please report them to the author (with
all the gory details: error message, line number, machine, operating system,
etc.).
.sp
Under VMS (VAXes) you need to compile it as:
.br
cc translit
.br
cc reg_exp
.br
cc reg_sub
.br
link translit+reg_exp+reg_sub,sys$library:vaxcrtl/lib
.br
and before you can use the program, you need to type (or better put into your
LOGIN.COM file) a line:
.br
translit == "$SYS$USER:[ME.TRA]TRANSLIT.EXE"
.br
or whatever is the full path to the \fItranslit\fR executable image which
you created with LINK. Note the quotes and the $ sign in front of program
path.
.sp
On an IBM-PC I used Microsoft C 5.1 as:
.br
.in +2m
.ti -1m
cl /FeTRANSLIT /AL /FPc /W1 /F 5000 /Ox /Gs translit.c reg_exp.c reg_sub.c
.in -2m
.sp 2
.SH RULES, CONDITIONS AND AUTHOR'S WISHES
You can distribute this code and associated files under these conditions:
.br
.in +4m
.ti -2m
1) You will distribute all files (even if you
think that they are garbage). You may get the complete set from anonymous
ftp at www.ccl.net in /pub/russian/translit. You can also get the program
and associated files via e-mail. To get the instructions for e-mail
distribution send a line:
.br
send translit/readme.doc from russian
.br
to OSCPOST@ccl.net or OSCPOST@OHSTPY.BITNET.
You are not allowed to distribute an incomplete package. The following
files should be present in the distribution:
.ta 2m 22n
.nf
alt-gos.rus - ALT to GOSTCII table
alt-koi8.rus - ALT to KOI8 table
example.alt.uu - uuencoded example in ALT
example.ko8.uu - uuencoded example in KOI8
example.pho - phonetic transliteration example
example.tex - LaTeX example
gos-alt.rus - GOSTCII to ALT table
gos-koi8.rus - GOSTCII to KOI8 table
koi7-8.rus - KOI7 to KOI8 table
koi7nl-8.rus - KOI7 (no Latin) to KOI8 table
koi8-7.rus - KOI8 to KOI7 table
koi8-alt.rus - KOI8 to ALT table
koi8-gos.rus - KOI8 to GOSTCII table
koi8-lc.rus - KOI8 to Library of Congress table
koi8-phg.rus - KOI8 to GOST transliteration
koi8-php.rus - KOI8 to Pokrovsky transliteration
koi8-ltx.rus - KOI8 to LaTeX conversion
koi8-tex.rus - KOI8 to Plain TeX conversion
order.txt - Order form for ordering the program
paths.h - Include file for translit.c
phg-koi8.rus - GOST transliteration to KOI8
pho-8sim.rus - Simple phonetic to KOI8
pho-koi8.rus - Various phonetic to KOI8
php-koi8.rus - Pokrovsky to KOI8
readme.doc - short description of the files
reg_exp.c - regular expression code by Henry Spencer
reg_exp.h - include for reg_exp.c and reg_sub.c
reg_sub.c - regular expression code by H. Spencer
ltx-koi8.rus - LaTeX to KOI8
translit.c - TRANSLIT main program
translit.ps - TRANSLIT manual in PostScript
translit.1 - TRANSLIT manual in *roff
translit.txt - Plain ASCII TRANSLIT manual
.sp 1
.fi
.ti -2m
2) You may expand/change the files and the program and distribute modified
files, provided that you do
not delete anything (you can always comment the unnecessary portions out)
and clearly mark your changes. Please send a copy of the modified
version to the author, though you are not required to do so.
I will give you all the credit for your enhancements. I simply wish
to keep a single point of distribution for this code, so that it is maintained
to some extent. If you create additional transliteration definition files,
please send them to the author if you can. I will add them to the program
distribution. I want to fix bugs and expand/optimize this code,
but I need your help.
I need your transliteration files for languages which I do not know or
do not use currently.
Your suggestions for improving documentation are most welcome (I am not
a native English speaker).
.ti -2m
3) You will not charge money for the program and/or associated files,
except for media and copying costs. If you want to sell it, contact the author
first. Bear in mind
that the regular expression package by Henry Spencer has some
copyright restrictions.
But there are other regular expression packages which do not have these
restrictions (which are not violated by this offering).
.ti -2m
4) I will gladly help you with advice on compiling this software and
try to fix bugs when time allows. However, if you want a ready to run
executable, you need to order it for a very nominal fee from
\fIJKL ENTERPRISES, INC.\fR as described in the file \fIorder.txt\fR
which must be a part of a complete distribution.
.in -4m
.SH AUTHOR
Jan Labanowski, P.O. Box 21821, Columbus, OH 43221-0821, USA.
E-mail: jkl@ccl.net, JKL@OHSTPY.BITNET.