old-version-1.01
|
INSTALL,
Makefile,
Makefile.os2,
Makefile.unx,
alt-gos.rus,
alt-koi8.rus,
announcement,
example.alt.uu,
example.ko8.uu,
example.pho,
example.tex,
gos-alt.rus,
gos-koi8.rus,
hex-koi8.rus,
koi7-8.rus,
koi7nl-8.rus,
koi8-7.rus,
koi8-alt.rus,
koi8-gos.rus,
koi8-lc.rus,
koi8-phg.rus,
koi8-php.rus,
koi8-tex.rus,
order.txt,
paths.h,
phg-koi8.rus,
pho-8sim.rus,
pho-koi8.rus,
php-koi8.rus,
readme.doc,
reg_exp.c,
reg_exp.h,
reg_sub.c,
tex-koi8.rus,
translit.1,
translit.c,
translit.ps,
translit.tar.Z,
translit.tar.z.uu,
translit.txt,
translit.zip,
translit.zip.uu,
|
|
|
TRANSLIT(JKL) Version 1.01 TRANSLIT(JKL)
NAME
TRANSLIT
Program to transliterate texts in different character
sets. The program converts input character codes (or
sequences of codes) to a different set of output char-
acter codes (or sequences of codes). Intended for
transliteration to/from phonetic representation of
foreign letters with Latin letters from/to special
national codes used for these letters. It supports
simple matches, character lists and flexible matches
via regular expressions. The new transliteration
schemes are easily added by creating simple transli-
teration tables. Multiple character sets are supported
for input and output. It does not yet support UNICODE,
but some day it will.
COPYRIGHT
Copyright (c) 1993 Jan Labanowski and JKL Enterprises, Inc.
You may distribute the Software only as a complete set of
files. You may distribute the modified Software only if you
retain the Copyright notice and you do not delete original
code, data, documentation and associated files. The
Software is copyrighted. You may not sell the software or
incorporate it in the commercial product without written
permission from Jan Labanowski or JKL Enterprises, Inc. You
are allowed to charge for media and copying if you distri-
bute the whole unaltered package.
SYNOPSIS
translit [ -i inpfile ][ -o outfile ][ -d ][ -t transtbl |
transtbl ]
OPTIONS
-i inpfile
inpfile is the name of the input file to be transliterated.
If "-i" is not specified, the input is taken from
standard input.
-o outfile
outfile is the output file where the transliterated
text is stored. If "-o" is not specified, the output
is directed to standard output. The program will not
overwrite an existing file. If the file exists, you need
to delete it first.
-d Some information on the character codes read from the
transliteration table file is sent to standard error
("stderr"). Useful when developing new transliteration
tables.
JKL Last change: 30-Mar-1993 1
-t transtbl
transtbl is a transliteration table file which you want
to use. The "-t" option may be omitted if the transtbl
is specified as the last parameter on the command line.
The program first tries to locate transtbl file in the
current directory, and if not found, it searches the
directory chosen at compilation/installation time in
"paths.h". If no "transtbl" is given, the default file
name specified in "paths.h" is taken. The
compile/installation time defaults in "paths.h" for the
search directory and the default file name can be
overridden by setting the environment variables TRANSP and
TRANSF, respectively (see below).
ENVIRONMENT VARIABLES
The default path to the directory holding transliteration
tables can be overridden by setting the environment variable
TRANSP. The default name of the transliteration table can
be overridden by setting the TRANSF environment variable.
However, when the transliteration file is given on the command
line, it overrides both the defaults and the environment
settings. Here are some examples of setting environment
variables for different operating systems:
UN*X System
If you are using csh (C-shell):
setenv TRANSP /home/john/translit/
setenv TRANSF koi8-tex.rus
If you are using sh (Bourne Shell):
TRANSP=/home/john/translit/
export TRANSP
TRANSF=koi8-tex.rus
export TRANSF
VAX-VMS System
TRANSP:==SYS$USER:[JOHN.TRANSLIT]
TRANSF:==KOI8-TEX.TBL
PC-DOS or MS-DOS
SET TRANSP=C:\JOHN\TRANSLIT\
SET TRANSF=KOI8-TEX.TBL
Note that the directory path has to end with the path
separator (\ or /).
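The lookup order described above can be sketched in Python as follows. This is only an illustration and not the code in translit.c: DEFAULT_DIR and DEFAULT_FILE are hypothetical stand-ins for the compile-time values in "paths.h", and locate_table is a made-up helper name. The sketch also assumes the directory and the file name are simply concatenated, which is why the path needs its trailing separator.

```python
import os

# Hypothetical stand-ins for the compile-time defaults in "paths.h".
DEFAULT_DIR = "/usr/local/lib/translit/"
DEFAULT_FILE = "koi8-tex.rus"


def locate_table(cli_name=None, environ=os.environ, exists=os.path.exists):
    """Return the path of the table file to use, or None if none is found."""
    # A name given on the command line wins over TRANSF and the default.
    name = cli_name or environ.get("TRANSF", DEFAULT_FILE)
    # TRANSP overrides the compiled-in search directory.
    directory = environ.get("TRANSP", DEFAULT_DIR)
    # The current directory is tried first, then the search directory.
    for candidate in (name, directory + name):
        if exists(candidate):
            return candidate
    return None
```

The exists parameter is there only so the lookup can be exercised without touching the file system.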
EXAMPLES
cat text.koi8 | translit koi8-tex.rus > text.tex
in UN*X is equivalent to:
translit -t koi8-tex.rus -o text.tex -i text.koi8
and converts file text.koi8 to file text.tex using transli-
teration specified in the file koi8-tex.rus.
translit -i text.koi8 koi8-lc.rus
displays the converted text from file text.koi8 on your
terminal. The conversion table is koi8-lc.rus (KOI8 --> Library
of Congress).
translit -i text.alt -t alt-koi8.rus | translit -o text.tex -t koi8-tex.rus
is essentially equivalent to the following two commands in
UN*X or MS-DOS:
translit -i text.alt -o junkfile -t alt-koi8.rus
translit -i junkfile -o text.tex -t koi8-tex.rus
and converts the file in ALT character set to a LaTeX file
for printing.
translit -i russ.txt pho-koi8.rus | translit -o russ.tex koi8-tex.rus
converts file russ.txt from phonetic transliteration to
LaTeX file russ.tex for printing.
TRANSLITERATION TABLES
The following transliteration files are available with the
current distribution. Consult the comments in the individual
files for details.
koi8-tex.rus
Conversion table which changes the file in KOI8 (8 bit
character set used by RELCOM news service) to a LaTeX
file for printing with AMS WNCYR fonts.
tex-koi8.rus
Conversion table for the LaTeX to KOI8 conversion. Note
that it will not handle complicated cases, since LaTeX
is a program, and only TeX can convert a LaTeX source
to the characters. However, it should work OK for sim-
ple cases of text only files, and may need some editing
for complicated cases.
alt-gos.rus
This is a transliteration data file for converting from
ALT (Bryabrin's alternativnyj variant used in many popular
wordprocessors) to GOSTSCII 84 (approx. ISO-8859-5?).
alt-koi8.rus
This is a transliteration data file for converting from
ALT to KOI8. KOI8 is meant to be GOST 19768-74 (as
used by RELCOM).
gos-alt.rus
This is a transliteration data file for converting from
GOSTSCII 84 (approx. ISO-8859-5?) to ALT (Bryabrin's
alternativnyj variant).
gos-koi8.rus
This is a transliteration data file for converting from
GOSTSCII 84 (approx. ISO-8859-5?) to KOI8 as used by
RELCOM. KOI8 is meant to be GOST 19768-74.
koi8-alt.rus
This is a transliteration data file for converting from
KOI8 (meant to be GOST 19768-74) to ALT (Bryabrin's
alternativnyj variant).
koi8-gos.rus
This is a transliteration data file for converting from
KOI8 (Relcom; meant to be GOST 19768-74) to GOSTSCII 84
(approx. ISO-8859-5).
koi8-7.rus
This file converts from KOI8 to KOI7.
koi7-8.rus
This file converts from KOI7 to KOI8. Before you
attempt the conversion, you might need to perform a
simple edit on your file. You MUST read the comments in
koi7-8.rus before you attempt this conversion.
koi7nl-8.rus
This file assumes that there are only Russian letters
(no Latin) in the input file. If you have Latin
letters, and you inserted SHIFT-OUT/IN characters, use
file koi7-8.rus.
koi8-lc.rus
This file converts KOI8 to the Library of Congress
transliteration. Some extensions are added.
koi8-php.rus
This file converts KOI8 to the Pokrovsky translitera-
tion.
php-koi8.rus
This file converts from Pokrovsky transliteration to
KOI8.
koi8-phg.rus
This file converts from KOI8 to GOST transliteration.
phg-koi8.rus
This file converts from GOST transliteration to KOI8.
pho-koi8.rus
This is a table which will convert from many "phonetic"
transliteration schemes to KOI8. It is elaborate and it
takes a lot of time to transliterate the file using
this table. Some transliterations are hopeless and
internally inconsistent (as humans...), so the results
cannot be bug free. You might want to modify the file,
if your transliteration patterns are different than
those assumed in this file. You may also want to sim-
plify this file if the phonetic transliteration you are
converting is a sound one (most are not, e.g., they use
e for je and e oborotnoye, ts for c and t-s, h for kha,
i for i-kratkoe, etc.).
INTRODUCTION
If you do not intend to write your own transliteration
tables, you may skip this description and go directly to the
installation and copyright sections. However, you might want
to read this material anyhow, to better understand the traps
and complexities of transliteration. It is frequently
necessary to transliterate text, i.e., to change one set of
characters (or composite characters, phonemes, etc.) to
another set.
On computers, the transliteration operation consists of con-
verting the input file in some character set to the output
file in another character set.
In the simplest case, single characters are transliterated,
i.e., their codes are changed according to some
transliteration table. This is called remapping and, assuming
a one-to-one mapping, the task can be accomplished by
a simple pseudo program:
new_char_code = character_map[old_char_code];
If the one-to-one correspondence does not exist (i.e., some
codes may be present in one set, but do not have correspond-
ing codes in another set), precise transliteration is not
possible. In such cases there are 3 obvious possibilities:
1. skip characters which do not have counterparts,
2. retain unchanged codes of these characters,
3. convert the codes to multicharacter sequences.
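A minimal Python sketch of remapping with the three fallbacks listed above. The function name remap and the tiny KOI8_TO_LATIN table are illustrative assumptions (the three codes are taken from the KOI8 examples elsewhere in this manual); this is not how translit.c stores its tables.

```python
def remap(data: bytes, table: dict, missing: str = "keep") -> bytes:
    """Remap every byte of data through table.

    table maps an input code to a bytes value, which may hold one
    code or several (possibility 3). Codes without an entry are
    dropped when missing == "skip" (possibility 1) or copied
    unchanged when missing == "keep" (possibility 2).
    """
    if missing not in ("keep", "skip"):
        raise ValueError("missing must be 'keep' or 'skip'")
    out = bytearray()
    for code in data:
        if code in table:
            out += table[code]
        elif missing == "keep":
            out.append(code)
    return bytes(out)


# Illustrative fragment only: three KOI8 codes and their Latin spellings.
KOI8_TO_LATIN = {193: b"a", 200: b"h", 221: b"shch"}
```

For example, remap(bytes([221, 193, 200]), KOI8_TO_LATIN) gives b"shchah".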
In some cases, the file can contain more than one character
set, e.g., Latin characters (e.g. English text) and Cyrillic
characters (e.g. Russian text). If the character codes assigned
to characters in different sets do not overlap, this is still a
simple mapping problem. This is the case with the KOI8 and
GOSTSCII character tables for Russian, which reserve the lower
128 codes (0-127) for standard ASCII codes (which include all
Latin characters) and
characters with codes above 127 for Cyrillic letters.
If character codes overlap, there is a SHIFT-OUT/SHIFT-IN
technique in which the meaning of the character sequence is
determined by an opening code (or sequence of characters
codes). In this case, the meaning of the series of charac-
ters is determined by the SHIFT-OUT character (or sequence)
which precedes them. The SHIFT-IN character (or sequence)
following the series of characters returns the "reader" to
the default or previous status. Two schemes are used:
(char_set_1)(SHIFT-IN[1])(SHIFT-OUT[2])(char_set_2)...
or
(char_set_1)(SHIFT-OUT[2])(char_set_2)(SHIFT-OUT[1])char_set_1...
Since computer keyboards, screens, printers, software, etc.,
are by necessity language specific (the most popular being
ASCII), there is a problem of typing foreign language text
which contains letters different than standard Latin alpha-
bet. For this reason, many transliteration schemes use
several Latin letters to represent a single letter of
foreign alphabet, for example:
zh is used to represent cyrillic letter zhe, \"o may be
used to represent the o umlaut, etc.
If there is one-to-one mapping of such sequences to another
alphabet, it is also easy to process. However, it is necessary
to substitute the longest sequences first. For example, a
frequently used transliteration for cyrillic letters:
shch --- letter shcha  221 (decimal KOI8 code)
sh   --- letter sha    219
ch   --- letter che    222
c    --- letter tse    195
h    --- letter kha    200
a    --- letter a      193
Obviously, in this case, we should first convert all shch
sequences to the letter shcha, then the two-character
sequences sh and ch, and then the single characters c and h.
Generally, for one-to-one transliteration, the longest
sequences should be processed first, and the order of
conversion within sequences of the same length makes no
difference. For example, converting the word "shchah" to
KOI8 should proceed in the following way:
shchah --> (221)ah, (221)ah --> (221)(193)h, (221)(193)h
--> (221)(193)(200)
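The longest-first rule can be sketched in Python as a series of passes, one per sequence, with the longest sequences handled first. The names TABLE and to_koi8 are illustrative, the table holds only the six entries listed above, and the sketch assumes that any text left unconverted is plain ASCII.

```python
# The six sample entries from the table above (decimal KOI8 codes).
TABLE = {"shch": 221, "sh": 219, "ch": 222, "c": 195, "h": 200, "a": 193}


def to_koi8(text: str, table: dict = TABLE) -> bytes:
    """Convert text by substituting the longest sequences first."""
    # A piece is either a str (not yet converted) or an int (KOI8 code).
    pieces = [text]
    for key in sorted(table, key=len, reverse=True):
        next_pieces = []
        for piece in pieces:
            if isinstance(piece, int):
                next_pieces.append(piece)
                continue
            # Split the unconverted text at every occurrence of key.
            for n, part in enumerate(piece.split(key)):
                if n:
                    next_pieces.append(table[key])
                if part:
                    next_pieces.append(part)
        pieces = next_pieces
    out = bytearray()
    for piece in pieces:
        if isinstance(piece, int):
            out.append(piece)
        else:
            out += piece.encode("ascii")  # unmatched text passes through
    return bytes(out)
```

Already converted codes are kept as integers, so a later pass for a shorter sequence cannot touch them.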
There is a multitude of reasons why transliteration is done.
I wrote this program having in mind the following ones:
1) to print cyrillic text using TeX/LaTeX and cyrillic
fonts
2) to read KOI8 encoded messages from Russia on my ASCII
terminal.
However, I was trying to make it flexible to accommodate
other uses.
PROGRAM OPERATION
The program converts the input file to an output file using
transliteration rules from the transliteration rule file
which you specify with option -t. Some examples of
transliteration rule files are enclosed. Before the program
can be used, the transliteration rules need to be specified.
These are given as a file which consists of the following
parts, described below:
1) File format number (it is 1 at this moment)
2) Delimiters used to enclose a) simple strings, b) char-
acter lists, c) regular expressions
3) Starting sequence for output
4) Ending sequence for output
5) Number of input "character sets"
6) SHIFT-OUT/SHIFT-IN sequences for each input character
set
7) Number of output "character sets"
8) SHIFT-OUT/SHIFT-IN sequences for each output character
set
9) Transliteration table
GENERAL COMMENTS
The transliteration rules file consists of comments and
data. The comments may be included in the file as:
a) line comments --- lines starting with ! or # character
(# or ! must be in the first column of a line) are
treated as comments and are not read in by the program.
b) comments following all required entries on the line.
They must be separated by at least one space from the
last data entry on the line and need not start with any
particular character. These comments cannot be used
within multiline sequences.
The data entries consist of integer numbers and strings.
The strings may represent:
a) plain strings
b) character lists
c) regular expressions
All strings which appear in the file are processed through
the "string processor", which allows entering unprintable
characters as codes. The character code is specified as a
backslash "\" followed by at least two digits (i.e., \01
produces code=1, but \1 is passed unchanged). The following
formats are supported:
\0123 character of octal code 123 (when leading zero
present)
\123 character of decimal code 123 (when leading digit
is not zero)
\0o123 or \0O123 character of octal code 123
\0d123 or \0D123 character of decimal code 123
\0xA3 or \0XA3 or \0xa3 character of hexadecimal code
A3
The allowed digits are 0-7 for octal codes, 0-9 for decimal
codes and 0-F (and/or 0-f) for hexadecimal codes. In a
situation when a code has to be followed by a digit character,
you need to enter the digit as a code. E.g., if you want
character \0xA3 followed by a letter C, you need to specify
letter C as a code (\0x43 or \0103 or \0o103 or \0d67) and
type the sequence as, e.g., \0xA3\0103. A character resulting
in a code 0 (zero) (e.g., \00) is special. It tells: "skip
everything that follows me in this string". It does not
make sense to use it, since you can always terminate the
sequence with a delimiter. When you use an empty string as
a matching sequence, remember that it does not match
anything.
If the line with entries is too long, you can break it
between the fields. If the string is too long to fit a
line, you can break it before any nonblank character by the
\ (backslash) followed by white space (i.e., new lines,
spaces, tabs, etc.). The \ and the following white space
will be removed from the string by the string preprocessor.
However, you are not allowed to break the individual character
codes (and you probably would not ever do it, for
aesthetic reasons). For example:
"experi\
mental design"
is equivalent to:
"experimental design"
while:
"experimental\
design"
is equivalent to:
"experimentaldesign"
If you need to have \ followed by a space in your string,
you need to enter either a backslash or a space following it
as an explicit character code, for example:
"\\0o40"
will produce a \ followed by the space, while the string:
"\ "
will be empty.
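The string processor rules above (character codes, the special code 0, and continuation lines) can be sketched in Python as follows. This is an illustration under stated assumptions, not the routine from translit.c: expand_string is a made-up name, a code is assumed to swallow every following digit that is valid for its base, and codes above 255 are rejected.

```python
import re

# One alternative per supported code format; exactly one group matches.
_CODE = re.compile(
    r"\\(?:0[oO]([0-7]+)"      # \0o123  octal
    r"|0[dD]([0-9]+)"          # \0d123  decimal
    r"|0[xX]([0-9A-Fa-f]+)"    # \0xA3   hexadecimal
    r"|(0[0-7]+)"              # \0123   octal (leading zero present)
    r"|([1-9][0-9]+))"         # \123    decimal (leading digit not zero)
)
_BASES = (8, 10, 16, 8, 10)


def expand_string(raw: str) -> str:
    """Expand character codes and continuation lines in raw."""
    # A backslash followed by white space is a continuation: drop both.
    raw = re.sub(r"\\\s+", "", raw)
    out = []
    pos = 0
    for m in _CODE.finditer(raw):
        out.append(raw[pos:m.start()])
        value = int(m.group(m.lastindex), _BASES[m.lastindex - 1])
        if value > 255:
            raise ValueError("code out of range: " + m.group(0))
        if value == 0:
            # Code 0 means: skip everything that follows in this string.
            return "".join(out)
        out.append(chr(value))
        pos = m.end()
    out.append(raw[pos:])
    return "".join(out)
```

With these assumptions, a backslash followed by a single digit (such as \1) matches none of the alternatives and is passed through unchanged.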
The preprocessor knows only about comments, plain charac-
ters, character codes, and continuation lines. However, some
characters and their combinations may have a special meaning
in lists and regular expressions.
DETAILS OF FILE STRUCTURE
Ad.1) File format number. This is simply a digit 1 on a line
by itself at the moment. This entry is included to allow
future extensions of the transliteration description file
without the need to modify older transliteration descrip-
tions (program will read data according to the current
file format number given in the file).
Ad.2) String delimiters. The subsequent 3 lines specify
pairs of single character delimiters for 3 types of text
data. The line format is:
opening_character closing_character.
These are needed to mark the beginning/end and the type
of the text data. Each string (text datum) is saved
starting from the first character after opening delim-
iter, and ends at the last character before the closing
delimiter. If you need to use the closing delimiter
within a string, you need to specify it as its code
(e.g., if you are using () pair as delimiters, specify
")" as \0x29). The opening delimiter may be the same or
different from the closing delimiter.
a) The first line contains characters used to enclose
(bracket) a plain string. Plain strings are directly
matched to input data or directly sent to output. I
suggest sticking to the " " pair for plain strings. The
ASCII code for " is \0d34 = \0x22 = \0o42 if you need
it inside the string itself.
b) The second line contains characters to mark the begin-
ning and the end of the list. Lists are used to
translate single character codes. I suggest [ and ]
delimiters for the list (ASCII code of "]" is: \0d93 =
\0x5D = \0o135). The lists may include ranges, for
example: [a-zA-Z0-9] will include all Latin letters
(small and capital) and digits. Note that order is
important: [a-d] is equivalent to [abcd], while [d-a]
will result in an error. If you want to include "-"
(minus) in the list, you need to place it as the first
or the last character. There are only two special char-
acters on the list, the "-" described above, and the
"]" character. You need to enter the "]" as its code.
E.g., for ASCII character table [*--] is equivalent to
[*+,-], is equivalent to [\42\43\44\45]. The order of
characters in the list does not matter unless the input
list corresponds to the output list (this will be
explained later). Empty lists do not make sense.
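A small Python sketch of how the range notation in a list could be expanded. It assumes that character codes have already been replaced by the string processor and that the delimiters have been stripped; expand_list is an illustrative name, not a function of the program.

```python
def expand_list(body: str) -> str:
    """Expand the text between the list delimiters into its characters."""
    chars = []
    i = 0
    while i < len(body):
        # "x-y" is a range; a "-" in the first or last position is literal.
        if i + 2 < len(body) and body[i + 1] == "-":
            low, high = ord(body[i]), ord(body[i + 2])
            if low > high:
                raise ValueError("bad range " + body[i:i + 3])
            chars.extend(chr(code) for code in range(low, high + 1))
            i += 3
        else:
            chars.append(body[i])
            i += 1
    return "".join(chars)
```

For example, expand_list("a-d") gives "abcd", expand_list("*--") gives "*+,-", and expand_list("d-a") raises an error.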
c) The third line of delimiter specification contains
delimiters for regular expressions and substitution
expressions. These strings are used for "flexible"
matches to the text in the input file. They are very
similar to the ones used in UN*X for searching text in
utilities like: grep, sed, vi, awk, etc., though only
a subset of full UN*X regular expression syntax is used
here. I suggest enclosing them within braces { and }
(ASCII code for } is \0d125 = \0x7D = \0o175). Actu-
ally, regular expressions can only be used for input
sequences, and for output sequences the {} are used to
enclose substitution sequences. This will be explained
below. The description of the syntax for
regular/substitution expressions is adapted from the
documentation for the regexp package of Henry Spencer,
University of Toronto --- this regular expression pack-
age was incorporated, after minute modifications, into
the program.
REGULAR EXPRESSION SYNTAX
A regular expression is zero or more branches,
separated by `|'. It matches anything that matches
one of the branches. The `|' simply means "or".
A branch is zero or more pieces, concatenated. It
matches a match for the first, followed by a match
for the second, etc.
A piece is an atom possibly followed by `*', `+',
or `?'. An atom followed by `*' matches a
sequence of 0 or more matches of the atom. An atom
followed by `+' matches a sequence of 1 or more
matches of the atom. An atom followed by `?' matches
zero or one occurrences of atom.
An atom is a regular expression in parentheses
(matching a match for the regular expression), a
range (see below), `.' (matching any single charac-
ter), a `\' followed by a single character (matching
that character), or a single character with no other
significance (matching that character).
A range is a sequence of characters enclosed in
`[]'. It normally matches any single character from
the sequence. If the sequence begins with `^', it
matches any single character not from the rest of the
sequence. If two characters in the sequence are
separated by `-', this is shorthand for the full list
of ASCII characters between them (e.g. `[0-9]'
matches any decimal digit). To include a literal `]'
in the sequence, make it the first character (follow-
ing a possible `^'). To include a literal `-', make it
the first or last character. The regular expression
can contain subexpressions which are enclosed in a ()
pair. These subexpressions are numbered 1 to 9 and can
be nested. The numbering of subexpressions is given in
the order of their opening parentheses "(". For
example:
(111)...(22(333)222(444)222)...(555)
Note that expression 2 contains within itself expres-
sions 3 and 4.
These subexpressions can be referenced in the substitution
string, which is described in the section below, or can be
used to delimit atoms.
Examples:
{[\0d32\0d09]\0d10} --- will match space or tab fol-
lowed by new line
{[Tt][Ss]} --- will match TS, Ts, tS and ts
{TS|Ts|tS|ts} --- same as above
{[\0d09-\0d15 ][^hH][^uU][a-zA-Z]*[\0d09-\0d15 ]} ---
all words which do not start with hu, Hu, hU, HU.
There is a space between \0d15 and ].
Note that specifying expressions like {.*} (i.e.,
match all characters) does not make much sense,
since it would mean here: match the whole input
file. However, expressions like {A.*B} should be
acceptable, since they match a pair of A and B, and
everything in between them, e.g. for a string like:
"This is Mr. Allen and this is Mr. Brown." this
expression should match the string: "Allen and this
is Mr. B".
Remember to put a backslash "\" in front of the follow-
ing characters: .[()|?+*\ if you want their literal
meaning outside the range enclosed in []. Inside the
range they have their literal meaning. If you know the
syntax of UN*X regular expressions, please note that ^
and $ anchors are not supported and are treated as nor-
mal characters (with the exception of ^ negation within
[]).
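Some of the examples above can be tried out with Python's re module. Note the assumption: Python's regular expressions are a different and larger dialect than the Henry Spencer package used by the program (for instance, ^ and $ are anchors in Python), so this only illustrates the constructs the two have in common.

```python
import re

text = "This is Mr. Allen and this is Mr. Brown."

# A.*B is greedy: it runs from the first A to the last B it can reach.
greedy = re.search(r"A.*B", text).group(0)

# A pair of ranges and a list of branches match the same four spellings.
by_range = re.findall(r"[Tt][Ss]", "TS Ts tS ts tz")
by_branch = re.findall(r"TS|Ts|tS|ts", "TS Ts tS ts tz")

# Subexpressions are numbered by the position of their opening "(".
groups = re.search(r"(a)(b(c)d(e)f)(g)", "abcdefg").groups()
```

Here greedy is "Allen and this is Mr. B", and groups is ("a", "bcdef", "c", "e", "g"), so expression 2 contains expressions 3 and 4.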
SUBSTITUTION EXPRESSIONS
After finding a match for a regular expression in the
input text, a substitution is made. It can be a simple
substitution where the whole matching string is
replaced by another string, or it may reuse a portion
or the whole matching string. The subexpressions (the
ones enclosed in parentheses) within the regular
expression which matched the input text can be refer-
enced in the substitution expression. Only the follow-
ing characters have special meaning within substitution
expression:
& --- will put the whole matching string.
\1 --- will put the match for the 1st subexpression
in ().
\2 --- will put the string which matched 2nd subex-
pression, etc.
\9 --- will place in the replacement string the 9th
subexpression (provided that there were 9 () pairs
in the regular expression)
Only 9 subexpressions are allowed. All other charac-
ters and sequences within the substitution expression
will be placed in a substitution string as written. To
be able to put a single backslash there, you need to
put two of them. To be able to place the unchanged
codes of the above characters (i.e., to make them
literals), you need to precede them with a backslash
"\", i.e., to get & in the output string you need to
write it as \&. Similarly, to place literal \1, \2,
etc., you need to enter it as \\1, \\2, etc. Note that
characters .+[]()^, etc. which had a special meaning in
the regular expressions, do not have any special mean-
ing in the substitution expression and will be output
as written.
Example:
The regular expression:
{([Tt])([Ss])} and the corresponding substitution
expression {\1.\2} puts a period between adjoining
letters t and s preserving their letter case.
The expression:
{([A-Za-z]+)-[ \0x09]*([\0x0A-\0x0D]+)[ \0x09]*([A-Za-z,.?;:"\)'`!]+)[ \0x09]}
and the substitution expression {\1\3\2} dehyphenate
words (when you understand this one, you are a
guru...). For example: con- (NL)cert is changed
to concert(NL), where NL stands for New Line. It
looks for one or more letters (saves them as sub-
string 1) followed by a hyphen (which may be fol-
lowed by zero or more spaces or tabs). The hyphen
must be followed by a NewLine (ASCII characters
0A-0D hex form various new line sequences) and
saves NewLine sequence as a subexpression 2. Then
it looks for zero or more tabs and spaces (at the
beginning of the line). Then it looks for the rest
of the hyphenated word and saves it as substring 3.
The word may have punctuation attached. Then it
looks again for some spaces or tabs. The substitu-
tion expression junks all sequences which were not
within (), i.e., hyphen and spaces/tabs and inserts
only substrings but in a different order. The \1
(word beginning) is followed by \3 (word end) and
followed by the NewLine --- \2. The {\2\1\3} would
be probably equally good, though you would need to
move the punctuation matching to the beginning of
the regular expression.
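The substitution rules can be sketched in Python as below. The helper names substitute and apply_rule are illustrative, Python's re module stands in for the regexp package of the program, and the punctuation list in the dehyphenation pattern is abbreviated; treat it as a sketch of the rules, not as the program's code.

```python
import re


def substitute(template: str, match: re.Match) -> str:
    """Build the replacement for match from a substitution expression."""
    out = []
    i = 0
    while i < len(template):
        ch = template[i]
        nxt = template[i + 1] if i + 1 < len(template) else ""
        if ch == "&":
            out.append(match.group(0))        # the whole matching string
        elif ch == "\\" and nxt and nxt in "123456789":
            n = int(nxt)                      # \1 ... \9: a subexpression
            if n <= match.re.groups:
                out.append(match.group(n) or "")
            i += 1
        elif ch == "\\" and nxt and nxt in "\\&":
            out.append(nxt)                   # \\ and \& are literals
            i += 1
        else:
            out.append(ch)                    # everything else as written
        i += 1
    return "".join(out)


def apply_rule(pattern: str, template: str, text: str) -> str:
    """Replace every match of pattern in text using template."""
    return re.sub(pattern, lambda m: substitute(template, m), text)


# The dehyphenation rule from above, with a shortened punctuation list.
DEHYPHENATE = r"([A-Za-z]+)-[ \t]*([\n-\r]+)[ \t]*([A-Za-z,.?;:!]+)[ \t]"
```

For example, apply_rule(r"([Tt])([Ss])", r"\1.\2", "Tsar cats") gives "T.sar cat.s", and apply_rule(DEHYPHENATE, r"\1\3\2", "con- \ncert next") gives "concert\nnext".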
Ad.3) Starting sequence. This sequence will be sent to the
output before any text. It is enclosed in the pair of
string delimiters. I use it to output LaTeX preamble.
However, it can be empty, if not used. The (sequence)
may contain any characters, including new lines, etc.
Example:
"" # empty sequence
Example:
"\documentstyle{article}
\input cyracc
\begin{document}
"
is right (note a new line at the end), but
"\documentstyle{article}
\input cyracc # this comment will be included!
\begin{document}" # while this will not
is wrong.
Ad.4) Ending sequence. Similar to 3), but will be appended
at the end of the output file.
For example:
"\end{document}
"
Ad.5) Number of input character sets. For example, in some
incarnation of KOI7, there are two character sets: Latin
and Cyrillic. Cyrillic character sequence follows SHIFT-
OUT character (CTRL-N), \0x0e, and is terminated by
SHIFT-IN character (CTRL-O), \0x0f. Another way of look-
ing at it is that Latin characters follow CTRL-O and
cyrillic ones follow CTRL-N.
If there is only one character set on input you should
specify 0 as a number of input char sets, since the input
file obviously does not contain any SHIFT-OUT/IN
sequences.
Ad.6) SHIFT-OUT/SHIFT-IN sequences for each input character
set. These lines appear only if you specified nonzero
number of character sets. These lines contain also "nest-
ing sequences", which will be explained later in this
section. You do not use "nesting sequences" frequently,
and let us assume for a moment that nesting data are
empty strings. The strings or regular expressions specified
here are matched with the contents of the input text. If
a match is found, the matching sequence is usually deleted
from the input text and:
a) for SHIFT-OUT sequence: the current input character
set number is changed to the new one corresponding to
the SHIFT-OUT sequence, or
b) for SHIFT-IN sequence: the previous input character
set number is restored, (i.e., the one which preceded
the SHIFT-OUT sequence for the current set). Note
that only the SHIFT-IN sequence for the current set
is matched. The SHIFT-IN sequences for other charac-
ter sets than the current set are not matched. The
bracketing of sets is assumed perfect. If the SHIFT-
IN sequence for the current set is an empty string,
the input set number is changed when SHIFT-OUT
sequence of the new set is detected.
For each input character set, you have to specify a line
consisting of 6 strings/expressions separated by spaces:
SO-match SO-subs NEST-up NEST-down SI-match SI-subs
where:
SO-match --- the string or regular expression for the
SHIFT-OUT sequence for the current character set. If
detected, the input character set is changed to this
set.
SO-subs --- this is usually an empty string (i.e., the
input sequence matching SO-match is removed). But it
can be a replacement string or a substitution expres-
sion, which will substitute the original matching
SHIFT-OUT sequence.
NEST-up --- this string (or a regular expression) is usually
an empty string. However, it can be used to count
brackets for detection of the SHIFT-IN bracket, if the SHIFT-IN
sequence is not unique. Its use is explained below.
NEST-down --- a counterpart of NEST-up. It is explained
later.
SI-match --- when a sequence in an input file matches the
string or regular expression given as SI-match for a
current input character set, the input character set
number is restored to the previous set. Note, that only
SI-match for a current set is matched with input char-
acters.
SI-subs --- this is usually an empty string (i.e., input
sequence which matched SI-match is removed), but if it
is not, the input characters which matched the SI-match
are replaced with the SI-subs.
The KOI7 case described above may be specified as:
2 # 2 input sets
"" "" "" "" "" "" # Latin(set 1)
"\016" "" "" "" "\017" "" # Cyrillic(set 2)
or
2 # 2 sets
"\017" "" "" "" "" "" # Latin(set 1)
"\016" "" "" "" "" "" # Cyrillic(set 2)
Before the input is processed, the program is initialized
to the character set of the first set. In the above case,
it is important, since declaration:
2 # 2 sets
"\016" "" "" "" "" "" # Cyrillic(set 1)
"\017" "" "" "" "" "" # Latin(set 2)
would be wrong and would mess up the Latin characters
preceding first Cyrillic sequence.
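The first KOI7 declaration above (Latin as set 1, Cyrillic as set 2 bracketed by \016 and \017) can be pictured as a two-state scanner. The Python sketch below only splits the input into runs tagged with their set number; split_koi7 is an illustrative name, and the real program handles the general case with arbitrary strings and regular expressions.

```python
SHIFT_OUT = 0x0E  # CTRL-N, written \016 in the table: Cyrillic follows
SHIFT_IN = 0x0F   # CTRL-O, written \017 in the table: back to Latin


def split_koi7(data: bytes) -> list:
    """Split data into (set_number, bytes) runs, dropping the shift codes."""
    runs = []
    current = 1  # processing starts in the first declared set
    buf = bytearray()
    for code in data:
        if code == SHIFT_OUT and current == 1:
            new_set = 2
        elif code == SHIFT_IN and current == 2:
            new_set = 1  # only the SHIFT-IN of the current set is matched
        else:
            buf.append(code)
            continue
        if buf:
            runs.append((current, bytes(buf)))
        buf = bytearray()
        current = new_set
    if buf:
        runs.append((current, bytes(buf)))
    return runs
```

Because the scanner starts in set 1, Latin text that precedes the first SHIFT-OUT is tagged correctly, which is the point of declaring the Latin set first.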
The nesting sequences are used only for specific situa-
tions. I needed them to write a transliteration table
from LaTeX to KOI8. In LaTeX the { } pair is used for
grouping and appears frequently in the text. The sequence
of cyrillic characters is also a group in LaTeX. The
SHIFT-OUT sequence for Russian letters in LaTeX is (at
least in my case): "{\cyr ", and the end of the Russian
letters is marked by "}", but the "}" has to be the
bracket matching the opening "{" in "{\cyr ", not just
any bracket. For this reason, my SHIFT-OUT/IN entry was
in this case:
"{\cyr " "" "{" "}" "}" "" # Cyrillic codes
Whenever the "{\cyr " is found, the program zeroes the
counter. It adds 1 to it when the NEST-up sequence (i.e.,
the "{" here) is found, and subtracts 1 from it when the
NEST-down sequence (i.e., the "}") is found. The checking
for a SHIFT-IN sequence (i.e., the "}") for the cyrillic
set is done only when the counter value is zero (i.e.,
all pairs inside the cyrillic text are matched). In fact,
the process is more complicated than that (the counter
for an opened character set is placed on the stack), but
these are details you can find in the code itself.
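The counter logic can be sketched in Python for the "{\cyr " case. This is a simplified illustration with a single counter and plain strings only; as noted above, the program itself keeps counters on a stack and also accepts regular expressions. The name split_sets and its parameters are assumptions made for the example.

```python
def split_sets(text, so="{\\cyr ", nest_up="{", nest_down="}", si="}"):
    """Split text into (set_number, text) runs; set 2 is the cyrillic group."""
    runs = []
    buf = []
    current = 1
    depth = 0
    i = 0
    while i < len(text):
        if current == 1 and text.startswith(so, i):
            runs.append((1, "".join(buf)))
            buf = []
            current, depth = 2, 0          # SHIFT-OUT found: zero the counter
            i += len(so)
        elif current == 2 and depth == 0 and text.startswith(si, i):
            runs.append((2, "".join(buf)))  # SHIFT-IN counts only at depth 0
            buf = []
            current = 1
            i += len(si)
        else:
            if current == 2:
                if text.startswith(nest_up, i):
                    depth += 1
                elif text.startswith(nest_down, i):
                    depth -= 1
            buf.append(text[i])
            i += 1
    runs.append((current, "".join(buf)))
    return [(number, run) for number, run in runs if run]
```

For the input "abc {\cyr privet {\it mir} poka} def" the inner "}" after "mir" only lowers the counter, and the final "}" ends the cyrillic run.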
What if the SHIFT-IN and SHIFT-OUT sequences are the same
character? Starting from version 1.01, TRANSLIT also
works in such cases. Let us assume that the SHIFT-IN
and SHIFT-OUT sequence is a single character "%" which
switches between two character sets. Also, if we want to
use it in the text, we have to double it, i.e., "%%" will
not be a SHIFT-IN/OUT sequence but will denote a literal
percent sign. We can do it in the following way:
"" "" "" "" "" "" # Latin letters
{%([^%])} {\1} "" "" {%([^%])} {\1} # Cyrillic codes
and later in the transliteration table (see below) we
should put a line:
0 "%%" 0 "%" # change doubled % to a single one
The same effect, for identical SHIFT-IN/OUT sequences,
can be accomplished with a -3 character set code and will
be described below.
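To see what the expression {%([^%])} buys us, here is a small Python sketch of the same toggle convention. It only imitates the behaviour described above using Python's re module; the function name split_by_toggle is hypothetical, and set 1 is assumed to be active at the start of the text.

```python
import re

# A single % followed by one character which is not %
TOGGLE = re.compile(r"%([^%])")


def split_by_toggle(text):
    """Return (set_number, chunk) pairs for a text using % as a toggle."""
    current, chunks, buf, i = 1, [], "", 0
    while i < len(text):
        if text.startswith("%%", i):
            buf += "%"                  # doubled % is a literal percent sign
            i += 2
        elif TOGGLE.match(text, i):
            if buf:
                chunks.append((current, buf))
            current = 2 if current == 1 else 1
            buf = ""
            i += 1                      # drop the %, keep the next character
        else:
            buf += text[i]
            i += 1
    if buf:
        chunks.append((current, buf))
    return chunks


print(split_by_toggle("abc%def%ghi 50%% off"))
# prints: [(1, 'abc'), (2, 'def'), (1, 'ghi 50% off')]
```

The character captured by the parentheses is kept in the input, which is the job of the {\1} substitution in the table entry.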
Ad.7) Number of output "character sets". This is analogous
to the input case. The characters sent to output may
belong to different sets. For example, when the character
(or the sequence) from set 2 is followed by the character
(or the sequence) from set 1, the program first sends the
SHIFT-IN sequence for set 2 (if it is not empty) and then
the SHIFT-OUT sequence for set 1 (if it is not empty). If
the output character (or sequence) is assigned to set 0,
then no SHIFT-IN/SHIFT-OUT sequences are sent to output.
If there is only one set of output characters, you should
specify 0. Note that you may have several input sets and
several output sets, though this is rare. Usually, you
have one input set and many output character sets, or
vice versa. Again, if you have only one output set, you
do not have any SHIFT-IN/SHIFT-OUT sequences, since those
are sent to output only when a set number is changed.
But you are free to experiment.
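The rule for sending SHIFT-IN/SHIFT-OUT sequences to output can be sketched in a few lines of Python. This is an illustration under assumptions, not the program's code: the name emit and the dictionary OUTPUT_SETS are hypothetical, and the control codes \016 and \017 are used only as sample sequences for a second set.

```python
# Hypothetical output sets: number -> (SHIFT-OUT, SHIFT-IN)
OUTPUT_SETS = {
    1: ("", ""),              # first set, no sequences needed
    2: ("\016", "\017"),      # second set, switched on by SO and off by SI
}


def emit(pieces, first_set=1):
    """pieces: list of (out_set_no, out_seq); set 0 leaves the state alone."""
    current, out = first_set, []
    for set_no, seq in pieces:
        if set_no != 0 and set_no != current:
            out.append(OUTPUT_SETS[current][1])   # SHIFT-IN of the previous set
            out.append(OUTPUT_SETS[set_no][0])    # SHIFT-OUT of the new set
            current = set_no
        out.append(seq)
    return "".join(out)


result = emit([(1, "ab"), (2, "cd"), (0, " "), (2, "e"), (1, "f")])
print(repr(result))           # prints: 'ab\x0ecd e\x0ff'
```

Note that the piece assigned to set 0 (the space) passes through without any shift sequence, and that empty sequences simply add nothing to the output.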
Ad.8) SHIFT-OUT/SHIFT-IN sequences for each output character
set. It is similar to the input case; however, the NEST-up
and NEST-down sequences are not used here. Again, before
any text is sent to output, the character set specified
as the first one is assumed. If SHIFT-OUT/IN sequences
are not used (i.e., you have only one output character
set), you will not have any SHIFT-OUT/SHIFT-IN data
lines. The KOI8 (single character set containing all
Latin and Russian letters) to KOI7 (the set using over-
lapping codes switched by SHIFT-OUT/IN sequences) conver-
sion could be therefore accomplished by the following
table:
2 # 2 output sets
"" "" # Latin Letters
"\016" "\017" # Russian Letters case
Ad.9) Transliteration table for individual characters or
their sequences. It is the core of your transliteration
data. There are 4 columns in the transliteration table:
(inp_set_no) (inp_seq) (out_set_no) (out_seq)
These 4 columns are separated by spaces. The
(input_set_number) corresponds to the input character set
number as specified above for input SHIFT-OUT/SHIFT-IN
data, or zero. If zero is used (even if the number of input
sets is not zero), the (input_sequence) will always be
matched, irrespective of the current input character
set imposed by the SHIFT-OUT sequence. This is useful,
since some characters are universal (e.g., new lines,
spaces, pluses, minuses, etc.) irrespective of the
current character set. The (input_sequence) is the
sequence of characters to be matched with characters in
the input file, and if found (within the character set
specified) it is replaced by the (output_sequence) and
sent to output (i.e., the matching is interrupted, the
(output_sequence) is sent to output, the input file pointer
is moved to the first character after the matched
sequence and matching resumes). The (output_set_number)
specifies the output character set. When the output
character set changes during transliteration, the
SHIFT-IN sequence of the previous set and the SHIFT-OUT
sequence of the current set are sent to output. The
(output_set_number) may also be zero (even if the number of
output sets is not zero). In this case, the current output
set status is not changed, and no SHIFT-IN/OUT
sequences are sent to output. Lastly, the output set code
may be -1, -2 or -3. In this case, the substitution is
performed within the input string that matched, but the
output sequence is not sent to the output yet. Depending on the
code, the following action is performed:
-1 --- the program makes the substitution in the input
string (i.e., replaces the matched string with the
output sequence in the input buffer). It does not
send the output sequence to the output, but continues
matching with the input sequences which follow the
currently matched one.
-2 --- like code -1, but matching is resumed from the
first sequence on the list.
-3 --- like code -1, but matching is resumed from the
input SHIFT-OUT/IN sequences.
E.g., if the unprocessed text in the input file is:
mental procedure was not successful since..........
and there was a line in transliteration table:
0 "me" -1 "you"
the input text would be changed to:
yountal procedure was not successful since..........
and all remaining matching data would be applied to this
text, rather than original text. The -2 code backsteps
to the point where the matching of transliteration
starts. The -3 code backsteps even further, to the point
where the input SHIFT-OUT and SHIFT-IN sequences are
matched. Since the order of sequences to match is cru-
cial here, for the case of output set code -1/-2/-3 even
one-character input sequences are matched in the order
specified. BE CAREFUL HERE. You may create infinite
loops. If you use code -2/-3, be sure that the resulting
sequence after substitution with the code -2/-3, will not
match previous sequences with codes -2/-3.
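The behaviour of code -1 on the "mental" example can be imitated with the Python sketch below. It is a simplified model, not the program's code: the function name step is hypothetical, set numbers are left out of the table entries, and the backstepping codes -2 and -3 are not modelled.

```python
def step(buf, pos, table):
    """Process one position of `buf`; return (new_buf, new_pos, emitted).

    table: list of (inp_seq, code, out_seq) with code 0 or -1.
    """
    for inp, code, out in table:
        if buf.startswith(inp, pos):
            if code == -1:
                # rewrite the input buffer and keep trying later entries
                buf = buf[:pos] + out + buf[pos + len(inp):]
                continue
            return buf, pos + len(inp), out     # ordinary entry: emit it
    return buf, pos + 1, buf[pos]               # no match: copy one character


print(step("mental procedure", 0, [("me", -1, "you")]))
# prints: ('yountal procedure', 1, 'y')

print(step("mental", 0, [("me", -1, "you"), ("yo", 0, "YO")]))
# prints: ('yountal', 2, 'YO')
```

In the second call the entry for "yo" matches only because the -1 entry before it has already changed the input buffer, which is why the order of entries matters here.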
The (output_sequence) is a sequence which substitutes the
corresponding (input_sequence). If (output_sequence) is
"" (i.e., empty string) then (input_sequence) is effec-
tively deleted. The (input_sequence)s are compared with
input in the order specified unless backstepping -2/-3
code is used (the matching is done from the first
sequence again). I use the code -1 e.g., to dehyphenate
words when changing to LaTeX. Code -2 is useful if you
want to skip next comparisons, and the resulting substi-
tution string will match earlier matching expressions. I
do not see many uses for the code -3, but it can be used
to resolve "toggle" SHIFT-IN/OUT sequence, as described
in an example further below. The order for multicharac-
ter sequences is therefore important (the single charac-
ter sequences are always compared after all multicharac-
ter sequences, and can be therefore put anywhere). The
longer multicharacter sequences should be specified
before shorter ones, unless they are some "preprocessing"
steps with codes -1/-2/-3. The order may sometimes be
crucial. If you need single character sequences matched
in a specific order, enter them as regular expressions,
i.e., as {c} instead of "c". In short, the multicharac-
ter input sequences and regular expressions are matched
to input text in the order specified. For the sake of
efficiency, the single character input sequences (with
exception of output set code -1/-2/-3) and input lists
are handled as a case of remapping and are matched in the
order of character codes associated with them. If you
specify the same single input character twice for a given
input set, the program will complain. The following com-
binations of input and output sequences are allowed:
Input Sequence            Output Sequence
"plain string"            only "plain string"
[list]                    [list] or "plain string"
{regular expression}      {substitution expression} or
                          "plain string"
When a match is found, the matching sequence is removed and
replaced with the output sequence. If this results in
changing the current output character set, the appropriate
SHIFT-IN/SHIFT-OUT pair is sent to the output before
the transliterated output sequence. If a list is used as
the input sequence, you may use either:
a) plain string as output sequence. In this case, if
current input character belongs to the input list, it
is replaced by the output string. I use it to delete
ranges of characters which do not have any correspond-
ing characters in the output set (e.g., some graphics
characters). In this case, the order of characters on
the input list is not important.
b) if the output string is also a list then it has to
contain exactly the same number of characters as the
input list. In this case, the 1st character from the
input list is replaced by the 1st character from the
output list, the 2nd one by the 2nd one, etc. There-
fore, the order of characters is important.
Theoretically, if there is one-to-one correspondence
between characters in the input set and characters in the
output set, you can make the conversion by using a single
line consisting of two lists. But it looks ugly... And is
difficult to read. And for the program, the substitution
takes the same time, if the characters are specified
separately, or when they are specified as matching lists.
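The two ways of using an input list correspond closely to character remapping as found in many languages. The Python sketch below uses the built-in str.maketrans and str.translate to show both cases; the lists themselves are made up for illustration and do not come from any of the distributed tables.

```python
# Hypothetical lists, for illustration only
inp_list = "abc"
out_list = "xyz"

# Case b) list -> list: the 1st character maps to the 1st, the 2nd to the
# 2nd, and so on, so both lists must have the same length
assert len(inp_list) == len(out_list)
remap = str.maketrans(inp_list, out_list)

# Case a) list -> plain string: every listed character is replaced by the
# same string; an empty string deletes the character
delete_digits = {ord(ch): "" for ch in "0123456789"}

print("a1b2c3".translate(remap))           # prints: x1y2z3
print("a1b2c3".translate(delete_digits))   # prints: abc
```

As in case a) above, the order of characters in the deleted list does not matter, while in case b) swapping two characters of one list changes the result.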
If a regular expression is used to match the input
characters, the matching sequence may be replaced by a plain
string or a substitution string, which was described
above.
Examples:
2 "CCCP" 0 ""
will delete all occurrences of CCCP from the input
file (but not Cccp or CCCp) for input set 2.
0 "\0xD1" 0 "ya"
will replace all occurrences of the character with code
\0xD1 with the two-letter sequence "ya".
0 \0xD1 2 q
will replace all characters \0xD1 with a character "q"
and output SHIFT-IN/OUT sequence if necessary.
2 "q" 0 "\0xD1"
will replace letter q (if the current input set is 2)
with a code \0xD1.
0 "\0xD1" 2 "ya"
will replace code \0xD1 with a sequence ya (assuming
that SHIFT-OUT and SHIFT-IN sequences for output set 2
are: {\cyr and }, respectively, you will get {\cyr
ya}).
If a character is not specified in the transliteration
table, it will be output as is, i.e., it corresponds
to a line:
0 "c" 0 "c"
where c is the character. If you want to delete cer-
tain characters, you need to explicitly specify this,
e.g.:
0 [a-z] 0 ""
will delete all lower case Latin letters from the
text.
Below is an example of solving the identical SHIFT-IN/OUT
sequences problem using character set code -3,
which I promised above. Assume that you have 2 character
sets in the input file, but switching between
them is accomplished by a "toggle" character. That is,
if the toggle character is found, you should switch to
the other set. Also, if you want to use the toggle
character in the set, you need to double it. Let us also
assume that we have 2 character codes which will
never, ever appear. We can fool translit by changing
the toggle character to a unique character and
backstepping with character code -3 to check for
SHIFT-IN/OUT sequences again. Let the % sign be the toggle
character, and suppose we have two codes (for example
codes \254 and \255) which will never appear in our
text. The appropriate entries in the transliteration
table may look like:
1 {%([^%])} -3 {\254\1}
2 {%([^%])} -3 {\255\1}
0 "%%" 0 "%"
i.e., when the single % is seen in set 1, produce
SHIFT-OUT sequence for set 2; and when a single % is
seen in set 2, produce SHIFT-IN sequence for set 1.
The appropriate input character set definitions will
be:
2 # number of input character sets
"\255" "" "" "" "" ""
"\254" "" "" "" "" ""
However, be warned. I never tried this. If this trick
does not work, please let me know.
Before you decide to create your own transliteration
file, please examine existing transliteration files. Do
yourself (and others) a favor --- put as many comments as
possible there. If you allow others to use your transli-
teration files, please include your name and e-mail
address and file creation date.
The program matches the sequences in a specific order:
1) If NEST counter is zero, match/substitute current set
SHIFT-IN sequence
2) If matched, restore previous set number
3) If matched, restore previous set nest counter
4) Match/substitute input SHIFT-OUT sequences
5) If matched, save current set and start new one
6) If matched, zero nest counter for NEST sequences
7) Match/substitute transliteration sequences
8) If matched and code = -1 make substitution in input
buffer and continue matching the next sequence.
9) If matched and code = -2 make substitution and goto 7)
10) If matched and code = -3 make substitution and goto 1)
11) Match (no substitution) NEST-up and NEST-down to input
buffer
12) If NEST-up matched, increment counter for current set
13) If NEST-down matched, decrement counter for current set
14) If match in 7) send substitute sequence to output
15) If no match in 7) (or code -1) output current input
character
16) Advance input pointer to point at new characters
17) If End of File, break
18) Goto 1)
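The order of steps above can be followed more easily in a condensed form. The Python sketch below is a deliberately simplified model, not a transcription of translit.c: the function name transliterate and its data layout are hypothetical, SHIFT sequences are simply dropped (empty substitution), only plain-string table entries with ordinary output sets are handled (no codes -1/-2/-3, no lists, no regular expressions), and after a SHIFT sequence is consumed the loop restarts from step 1.

```python
def transliterate(text, sets, table):
    """sets:  {n: (shift_out, nest_up, nest_down, shift_in)}
    table: list of (inp_set_no, inp_seq, out_seq), non-empty plain strings."""
    stack = [[1, 0]]          # [set number, nest counter]; set 1 is assumed first
    out, i = [], 0
    while i < len(text):                              # steps 17-18
        cur, nest = stack[-1]
        _, nest_up, nest_down, shift_in = sets[cur]
        # steps 1-3: SHIFT-IN of the current set, only when its counter is zero
        if (len(stack) > 1 and nest == 0 and shift_in
                and text.startswith(shift_in, i)):
            stack.pop()                               # restore previous set
            i += len(shift_in)
            continue
        # steps 4-6: a SHIFT-OUT of another set opens it with a zeroed counter
        opened = next((n for n, s in sets.items()
                       if n != cur and s[0] and text.startswith(s[0], i)), None)
        if opened is not None:
            stack.append([opened, 0])
            i += len(sets[opened][0])
            continue
        # step 7: first table entry that matches in set 0 or the current set
        hit = next(((rep, len(inp)) for set_no, inp, rep in table
                    if set_no in (0, cur) and text.startswith(inp, i)), None)
        # steps 11-13: NEST-up / NEST-down only adjust the counter
        if nest_up and text.startswith(nest_up, i):
            stack[-1][1] += 1
        elif nest_down and text.startswith(nest_down, i):
            stack[-1][1] -= 1
        # steps 14-16: send the substitute (or the character itself), advance
        rep, width = hit if hit else (text[i], 1)
        out.append(rep)
        i += width
    return "".join(out)


sets = {1: ("", "", "", ""), 2: ("{\\cyr ", "{", "}", "}")}
table = [(2, "a", "A"), (2, "b", "B")]
print(transliterate(r"a{\cyr a{b}a}b", sets, table))   # prints: aA{B}Ab
```

In the sample run the letters are changed only inside the {\cyr ...} group, the inner { } pair is copied through, and the final "}" is recognized as SHIFT-IN because the nest counter is back to zero.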
ASCII CHARACTER CODES
dec hx oct ch dec hx oct ch
0 00 000 ^@ NUL 64 40 100 @
1 01 001 ^A SOH 65 41 101 A
2 02 002 ^B STX 66 42 102 B
3 03 003 ^C ETX 67 43 103 C
4 04 004 ^D EOT 68 44 104 D
5 05 005 ^E ENQ 69 45 105 E
6 06 006 ^F ACK 70 46 106 F
7 07 007 ^G BEL 71 47 107 G
8 08 010 ^H BS 72 48 110 H
9 09 011 ^I HT 73 49 111 I
10 0a 012 ^J LF 74 4a 112 J
11 0b 013 ^K VT 75 4b 113 K
12 0c 014 ^L FF 76 4c 114 L
13 0d 015 ^M CR 77 4d 115 M
14 0e 016 ^N SO 78 4e 116 N
15 0f 017 ^O SI 79 4f 117 O
16 10 020 ^P DLE 80 50 120 P
17 11 021 ^Q DC1 81 51 121 Q
18 12 022 ^R DC2 82 52 122 R
19 13 023 ^S DC3 83 53 123 S
20 14 024 ^T DC4 84 54 124 T
21 15 025 ^U NAK 85 55 125 U
22 16 026 ^V SYN 86 56 126 V
23 17 027 ^W ETB 87 57 127 W
24 18 030 ^X CAN 88 58 130 X
25 19 031 ^Y EM 89 59 131 Y
26 1a 032 ^Z SUB 90 5a 132 Z
27 1b 033 ^[ ESC 91 5b 133 [
28 1c 034 ^\ FS 92 5c 134 \
29 1d 035 ^] GS 93 5d 135 ]
30 1e 036 ^^ RS 94 5e 136 ^
31 1f 037 ^_ US 95 5f 137 _
32 20 040 SP 96 60 140 `
33 21 041 ! 97 61 141 a
34 22 042 " 98 62 142 b
35 23 043 # 99 63 143 c
36 24 044 $ 100 64 144 d
37 25 045 % 101 65 145 e
38 26 046 & 102 66 146 f
39 27 047 ' 103 67 147 g
40 28 050 ( 104 68 150 h
41 29 051 ) 105 69 151 i
42 2a 052 * 106 6a 152 j
43 2b 053 + 107 6b 153 k
44 2c 054 , 108 6c 154 l
45 2d 055 - 109 6d 155 m
46 2e 056 . 110 6e 156 n
47 2f 057 / 111 6f 157 o
48 30 060 0 112 70 160 p
49 31 061 1 113 71 161 q
50 32 062 2 114 72 162 r
51 33 063 3 115 73 163 s
52 34 064 4 116 74 164 t
53 35 065 5 117 75 165 u
54 36 066 6 118 76 166 v
55 37 067 7 119 77 167 w
56 38 070 8 120 78 170 x
57 39 071 9 121 79 171 y
58 3a 072 : 122 7a 172 z
59 3b 073 ; 123 7b 173 {
60 3c 074 < 124 7c 174 |
61 3d 075 = 125 7d 175 }
62 3e 076 > 126 7e 176 ~
63 3f 077 ? 127 7f 177 DEL
CONVERSION: DECIMAL<-->OCTAL<-->HEX.
000 000 00 064 100 40 128 200 80 192 300 C0
001 001 01 065 101 41 129 201 81 193 301 C1
002 002 02 066 102 42 130 202 82 194 302 C2
003 003 03 067 103 43 131 203 83 195 303 C3
004 004 04 068 104 44 132 204 84 196 304 C4
005 005 05 069 105 45 133 205 85 197 305 C5
006 006 06 070 106 46 134 206 86 198 306 C6
007 007 07 071 107 47 135 207 87 199 307 C7
008 010 08 072 110 48 136 210 88 200 310 C8
009 011 09 073 111 49 137 211 89 201 311 C9
010 012 0A 074 112 4A 138 212 8A 202 312 CA
011 013 0B 075 113 4B 139 213 8B 203 313 CB
012 014 0C 076 114 4C 140 214 8C 204 314 CC
013 015 0D 077 115 4D 141 215 8D 205 315 CD
014 016 0E 078 116 4E 142 216 8E 206 316 CE
015 017 0F 079 117 4F 143 217 8F 207 317 CF
016 020 10 080 120 50 144 220 90 208 320 D0
017 021 11 081 121 51 145 221 91 209 321 D1
018 022 12 082 122 52 146 222 92 210 322 D2
019 023 13 083 123 53 147 223 93 211 323 D3
020 024 14 084 124 54 148 224 94 212 324 D4
021 025 15 085 125 55 149 225 95 213 325 D5
022 026 16 086 126 56 150 226 96 214 326 D6
023 027 17 087 127 57 151 227 97 215 327 D7
024 030 18 088 130 58 152 230 98 216 330 D8
025 031 19 089 131 59 153 231 99 217 331 D9
026 032 1A 090 132 5A 154 232 9A 218 332 DA
027 033 1B 091 133 5B 155 233 9B 219 333 DB
028 034 1C 092 134 5C 156 234 9C 220 334 DC
029 035 1D 093 135 5D 157 235 9D 221 335 DD
030 036 1E 094 136 5E 158 236 9E 222 336 DE
031 037 1F 095 137 5F 159 237 9F 223 337 DF
032 040 20 096 140 60 160 240 A0 224 340 E0
033 041 21 097 141 61 161 241 A1 225 341 E1
034 042 22 098 142 62 162 242 A2 226 342 E2
035 043 23 099 143 63 163 243 A3 227 343 E3
036 044 24 100 144 64 164 244 A4 228 344 E4
037 045 25 101 145 65 165 245 A5 229 345 E5
038 046 26 102 146 66 166 246 A6 230 346 E6
039 047 27 103 147 67 167 247 A7 231 347 E7
040 050 28 104 150 68 168 250 A8 232 350 E8
041 051 29 105 151 69 169 251 A9 233 351 E9
042 052 2A 106 152 6A 170 252 AA 234 352 EA
043 053 2B 107 153 6B 171 253 AB 235 353 EB
044 054 2C 108 154 6C 172 254 AC 236 354 EC
045 055 2D 109 155 6D 173 255 AD 237 355 ED
046 056 2E 110 156 6E 174 256 AE 238 356 EE
047 057 2F 111 157 6F 175 257 AF 239 357 EF
048 060 30 112 160 70 176 260 B0 240 360 F0
049 061 31 113 161 71 177 261 B1 241 361 F1
050 062 32 114 162 72 178 262 B2 242 362 F2
051 063 33 115 163 73 179 263 B3 243 363 F3
052 064 34 116 164 74 180 264 B4 244 364 F4
053 065 35 117 165 75 181 265 B5 245 365 F5
054 066 36 118 166 76 182 266 B6 246 366 F6
055 067 37 119 167 77 183 267 B7 247 367 F7
056 070 38 120 170 78 184 270 B8 248 370 F8
057 071 39 121 171 79 185 271 B9 249 371 F9
058 072 3A 122 172 7A 186 272 BA 250 372 FA
059 073 3B 123 173 7B 187 273 BB 251 373 FB
060 074 3C 124 174 7C 188 274 BC 252 374 FC
061 075 3D 125 175 7D 189 275 BD 253 375 FD
062 076 3E 126 176 7E 190 276 BE 254 376 FE
063 077 3F 127 177 7F 191 277 BF 255 377 FF
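If the table is not at hand, any row of it can be recomputed. The short Python sketch below formats a code in the same three columns (decimal, octal, hexadecimal); the function name conversion_row is hypothetical.

```python
def conversion_row(code):
    """Format one character code as decimal, octal and hex columns."""
    return f"{code:03d} {code:03o} {code:02X}"


print(conversion_row(209))      # prints: 209 321 D1
print(conversion_row(14))       # prints: 014 016 0E

# Going the other way: parse an octal or hex code back to its decimal value
print(int("016", 8), int("D1", 16))   # prints: 14 209
```

This is handy when moving between the octal and hexadecimal notations which appear in the examples in this manual, such as 016 (octal) and D1 (hex).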
INSTALLATION
The program is distributed in source form. It has been compiled
and run under UN*X, VMS and MS-DOS systems. The file readme.doc contains
the details on how to obtain the whole package. You can
retrieve this file from anonymous ftp on www.ccl.net in
the directory /pub/russian/translit. You can also obtain it
via e-mail by sending a message:
get translit/readme.doc from russian
to OSCPOST@ccl.net or OSCPOST@OHSTPY.BITNET.
The source of the program consists of several files:
paths.h
must be edited before compilation. It contains its own
comments on what to do. The defines in this file relate to
the operating system you are using and the default path
for searching transliteration table.
translit.c
contains the main program. It is intended to be
portable code.
reg_exp.h
the include file for regular expression matching
library of Henry Spencer from the University of
Toronto. This regular expression package was posted to
comp.sources.misc (volume 3). Also 4 patches were
posted (in volumes: 3, 4, 4, 10). I applied the patches
to the original code and made small modifications to
the code, which are marked in the source code.
reg_exp.c
the regular expression library for compilation and
matching of regular expressions.
reg_sub.c
the regular expression substitution routine.
Before you compile this program you have to edit paths.h.
Read comments in the file. During compilation, all source
code should reside in the current directory.
Then you may compile the program under UN*X as (for exam-
ple):
cc -o translit translit.c reg_exp.c reg_sub.c
and copy the program translit to some standard directory
which is in users' path (for example: /usr/local/bin). Then
you need to copy transliteration tables to the directory
which you have chosen in paths.h. If you get errors, then
it is not OK. Please, report them to the author (with all
the gory details: error message, line number, machine,
operating system, etc.).
Under VMS (VAXes) you need to compile it as:
cc translit
cc reg_exp
cc reg_sub
link translit+reg_exp+reg_sub,sys$library:vaxcrtl/lib
and before you can use the program, you need to type (or
better put into your LOGIN.COM file) a line:
translit == "$SYS$USER:[ME.TRA]TRANSLIT.EXE"
or whatever is the full path to the translit executable
image which you created with LINK. Note the quotes and the $
sign in front of program path.
On an IBM-PC I used Microsoft C 5.1 as:
cl /FeTRANSLIT /AL /FPc /W1 /F 5000 /Ox /Gs translit.c reg_exp.c reg_sub.c
RULES, CONDITIONS AND AUTHOR'S WISHES
You can distribute this code and associated files under
these conditions:
1) You will distribute all files (even if you think that
they are garbage). You may get the complete set from
anonymous ftp at www.ccl.net in
/pub/russian/translit. You can also get the program and
associated files via e-mail. To get the instructions for
e-mail distribution send a line:
send translit/readme.doc from russian
to OSCPOST@ccl.net or OSCPOST@OHSTPY.BITNET. You are
not allowed to distribute an incomplete distribution.
The following files should be present in the distribu-
tion:
alt-gos.rus - ALT to GOSTCII table
alt-koi8.rus - ALT to KOI8 table
example.alt.uu - uuencoded example in ALT
example.ko8.uu - uuencoded example in KOI8
example.pho - phonetic transliteration example
example.tex - LaTeX example
gos-alt.rus - GOSTCII to ALT table
gos-koi8.rus - GOSTCII to KOI8 table
koi7-8.rus - KOI7 to KOI8 table
koi7nl-8.rus - KOI7 (no Latin) to KOI8 table
koi8-7.rus - KOI8 to KOI7 table
koi8-alt.rus - KOI8 to ALT table
koi8-gos.rus - KOI8 to GOSTCII table
koi8-lc.rus - KOI8 to Library of Congress table
koi8-phg.rus - KOI8 to GOST transliteration
koi8-php.rus - KOI8 to Pokrovsky transliteration
koi8-tex.rus - KOI8 to LaTeX conversion
order.txt - Order form for ordering the program
paths.h - Include file for translit.c
phg-koi8.rus - GOST transliteration to KOI8
pho-8sim.rus - Simple phonetic to KOI8
pho-koi8.rus - Various phonetic to KOI8
php-koi8.rus - Pokrovsky to KOI8
readme.doc - short description of the files
reg_exp.c - regular expression code by Henry Spencer
reg_exp.h - include for reg_exp.c and reg_sub.c
reg_sub.c - regular expression code by H. Spencer
tex-koi8.rus - LaTeX to KOI8
translit.c - TRANSLIT main program
translit.ps - TRANSLIT manual in PostScript
translit.1 - TRANSLIT manual in *roff
translit.txt - Plain ASCII TRANSLIT manual
2) You may expand/change the files and the program and
distribute modified files, provided that you do not
delete anything (you can always comment the unnecessary
portions out) and clearly mark your changes. Please send
the copy of the modified version to the author, though
you are not required to do so. I will give you all the
credit for your enhancements. I simply wish that there
is a single point of distribution for this code, so it
is maintained to some extent. If you create additional
transliteration definition files, please send them to
the author if you can. I will add them to the program
distribution. I want to fix bugs and expand/optimize
this code, but I need your help. I need your transli-
teration files for languages which I do not know or do
not use currently. Your suggestions for improving docu-
mentation are most welcome (I am not a native English
speaker).
3) You will not charge money for the program and/or asso-
ciated files, except for media and copying costs. If you
want to sell it, contact the author first. Bear in mind
that the regular expression package by Henry Spencer has
some copyright restrictions. But there are other regu-
lar expression packages which do not have these restric-
tions (which are not violated by this offering).
4) I will gladly help you with advice on compiling this
software and try to fix bugs when time allows. However,
if you want a ready to run executable, you need to order
it for a very nominal fee from JKL ENTERPRISES, INC. as
described in the file order.txt which must be a part of
a complete distribution.
AUTHOR
Jan Labanowski, P.O. Box 21821, Columbus, OH 43221-0821,
USA. E-mail: jkl@ccl.net, JKL@OHSTPY.BITNET.
JKL Last change: 30-Mar-1993 26
|