From jkl@ccl.net Tue Dec 1 16:14:49 1992
From: jkl@ccl.net (Jan Labanowski)
Date: Tue, 1 Dec 1992 21:14:49 -0500
Message-Id: <199212020214.AA05755@krakow.ccl.net>
To: chemistry@ccl.net
Subject: At last... Searching CCL archives.

Dear Subscribers,

Here is something new for the list.

Jan
jkl@ccl.net

------------- HOW TO SEARCH COMPUTATIONAL CHEMISTRY ARCHIVES -------------

This file can be obtained from anonymous ftp at www.ccl.net as
pub/chemistry/help.search or via e-mail by sending a message:
   send help.search from chemistry
to OSCPOST@ccl.net or OSCPOST@OHSTPY.BITNET

Computational chemistry archives can be searched by sending a search query
to the address: chemistry-search@ccl.net. The list of files which satisfy
the query will be sent back via e-mail to the originator of the search
request. This is an experimental service and may be improved (or
discontinued) in the future, so please send your comments and ideas to the
author: jkl@ccl.net.

The following document describes the format of the search query. The format
is not simple; however, it allows for precise and elaborate search queries.
If you have suggestions on how to make it simpler without losing its
generality, please tell me. I will appreciate your comments.

The search query specifies:
  1. How many text patterns to look for,
  2. Text patterns to look for --- so called "regular expressions",
  3. Logical relation between patterns which needs to be satisfied for
     including the file in the search result --- logical expression.

1. REGULAR EXPRESSIONS
======================

In its simplest form, a regular expression is just a word which you want to
find in a file name and/or in the text of the file itself. This word has to
be bracketed with a unique character (i.e., a character which does not
appear inside the regular expression) which will be called a "delimiter".
Since spaces at the beginning and at the end of the line are removed during
processing, the delimiter is the only way to distinguish between significant
and insignificant spaces. Some characters (so called "metacharacters") and
groups of characters have a special meaning within the regular expression,
and if their ordinary meaning is needed, they have to be "quoted" by
preceding them with a backslash character "\". On the other hand, quoting
some ordinary letters may attach a special meaning to them, so use the
backslash judiciously. Note that the list below follows the Perl convention
for regular expressions, which is different from the one used in UNIX
regular expressions.

Remember also that:
  a) The file appears to the searching program as one long line of text
     where words are separated by single spaces (new lines are replaced
     with spaces; multiple spaces, tabs and other white space are contracted
     into a single space; words hyphenated at the end of a line are joined).
     There are two exceptions to this rule:
       1) file names, if searched, are treated as separate pieces of text,
       2) when searching files containing archived messages posted to the
          list on a given date, each message is searched separately, as if
          it were a separate file. If the logical expression is satisfied
          for the message, then the name of the file is reported.
     Please note that only text files are scanned for text (the contents of
     binary files are not). However, all file names (of binary as well as
     text files) are scanned, if requested.
  b) The search is lettercase insensitive, i.e., searching for the words
     charge, Charge, cHaRgE, etc., will produce the same result.
  c) The description below includes the full metacharacter definitions;
     however, some of them will never match in the text, since there are no
     new lines and tabs, A-Z is the same as a-z, etc.

Constructs in regular expressions:

  . --- Period matches any character except a new line (remember, there
         are no new lines here).
  [ ] --- Any character within the square brackets matches, e.g. [a,b]
         matches a, a comma, or b. Ranges are also allowed: [0-9a-z] will
         match any digit or letter. Note that if you are searching for "-",
         you must put it just before the right ], or it will be treated as
         a range. Negation within square brackets is achieved with a caret
         character immediately following the left bracket "[^". [^a-z]
         means: match everything but letters, [^_] means everything but
         underscore. Note that within brackets, the characters .?*+|()$^{}
         should not be quoted, while [, \, ] should be entered as
         \[, \\, \].
  \d --- Matches any digit (i.e., is a shorthand for [0-9]).
  \D --- Matches everything but a digit (i.e., is a shorthand for [^0-9]).
  \w --- Matches "word" characters, i.e., letters, digits and underscore
         (same as [a-zA-Z0-9_]).
  \W --- Matches a "nonword" character (same as [^a-zA-Z0-9_]).
  \s --- Matches white space (i.e., space, new line, tab, etc.).
  \S --- Matches a non-white-space character (i.e., characters which use
         pigment in your printer).
  \n --- Matches a new line (do not use here).
  \r --- Matches a carriage return (do not use here).
  \t --- Matches a tab (do not use here, tabs are converted to spaces).
  \f --- Matches a formfeed (do not use).
  \b --- Matches a backspace (inside [] only; do not use).
  \xxx --- Matches the character with the ASCII octal code xxx (xxx are
         digits).
  () --- Parentheses are special characters used to quote substrings for
         substitution. We do not do substitutions here. If you search for
         parentheses, use \( and \) outside square brackets.
  \1, \2 ... \9 --- Used only in substitution strings. Do not use here.
  x? --- Matches 0 or 1 occurrences of character x (or of any other
         construct, e.g., [a-z]? matches 0 or 1 letter).
  x* --- Matches 0 or more occurrences of character x.
  x+ --- Matches 1 or more occurrences of character x.
  x{m,n} --- Matches at least m, but no more than n occurrences of x,
         e.g., [0-9.+-]{2,3} will match: 1.2, .2, +11, -1, 123, +-.
  | --- Alternative: \son\s|\sin\s|\sup\s|\sat\s will match the words:
         on, in, up, and at.
  \b --- Matches a word boundary (outside [] only). It corresponds to
         white space, punctuation marks and the very beginning and end of
         the line.
  \B --- Matches where there is no word boundary (e.g., between two
         letters inside a word).
  ^ --- Outside [] marks the beginning of the string. Do not use here.
  $ --- Outside [] marks the end of the string. Do not use here.

Note that in UNIX, parentheses and braces are quoted to get their special
meaning, while here they need to be quoted to get their ordinary meaning.

A regular expression, bracketed with a delimiter character, can be
optionally preceded with a label and a search scope identifier followed by a
colon ":". The label is an integer number and the scope is one of the
letters:
   T - text only (file names will not be matched to a regular expression),
   F - file name only (text inside the file will not be scanned),
   B - both file name and text will be scanned for matching (default).
Both the label and the scope can be omitted, but if either is present, the
colon must be present too. For the purpose of the search, all the
expressions below are equivalent:
   1B : /[MA][MO]PAC/
   2: #ampac|mopac#
   : +MOPAC|AMPAC+
   B4: *Ampac|Mopac*
   -[am][mo]pac-
Note that the numerical value of the integer label is disregarded by the
program, which numbers the regular expressions in the order in which they
were specified. The label is there only for your convenience.

2. LOGICAL RELATION
===================

Once you have specified your regular expressions, you need to specify a
logical relation between them which qualifies the file for reporting. The
logical relation can contain only the operators:
   &  --- AND (you can also use && if you are a UNIX or C fan)
   |  --- OR  (you can also use || if you are a UNIX or C fan)
   !  --- NOT
parentheses, ordinal numbers of regular expressions and spaces.
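
As an illustration only (this is not the program that runs the search), the
following Python sketch evaluates such a relation once it is known which
regular expressions were found. The function name evaluate_relation and the
dictionary of statuses are made up for this example, and the sketch assumes
the usual precedence: ! binds tightest, then &, then |.

```python
import re


def evaluate_relation(relation, found):
    """Evaluate a relation such as '1 & (2 | 3)'.

    `found` maps the ordinal number of each regular expression to
    True (FOUND) or False (NOT_FOUND). Hypothetical helper, for illustration.
    """
    tokens = re.findall(r"\d+|&&?|\|\|?|[!()]", relation)
    # Reject anything other than numbers, operators, parentheses and spaces.
    if "".join(tokens) != re.sub(r"\s+", "", relation):
        raise ValueError("unexpected character in relation")
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        return tok

    def parse_or():  # lowest precedence: |
        value = parse_and()
        while peek() in ("|", "||"):
            take()
            rhs = parse_and()
            value = value or rhs
        return value

    def parse_and():  # middle precedence: &
        value = parse_not()
        while peek() in ("&", "&&"):
            take()
            rhs = parse_not()
            value = value and rhs
        return value

    def parse_not():  # highest precedence: !, parentheses, numbers
        tok = peek()
        if tok == "!":
            take()
            return not parse_not()
        if tok == "(":
            take()
            value = parse_or()
            if peek() != ")":
                raise ValueError("missing closing parenthesis")
            take()
            return value
        if tok is None or not tok.isdigit():
            raise ValueError("expected a regular expression number")
        take()
        return found[int(tok)]

    result = parse_or()
    if pos != len(tokens):
        raise ValueError("unexpected trailing input")
    return result
```

For instance, evaluate_relation("1 & (2 | 3)", {1: True, 2: False, 3: True})
gives True, because expression 1 was found and so was at least one of
expressions 2 and 3.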
Each number stands for the status of the corresponding regular expression
(FOUND or NOT_FOUND) after searching its scope (i.e., file text and/or file
name). For example:
   1 & !2
means that the file should be reported if regular expression number 1
matched and expression number 2 did not. You can make your relation as
complicated as you wish and use nested parentheses, but remember that:
!(1 & 2) is equivalent to !1 | !2, and !(1 | 2) is equivalent to !1 & !2.
Remember also that in logic the & takes precedence over |, and the ! is a
unary operator and takes precedence over both of them, but use parentheses
for better readability.

3. QUERY FORMAT AND EXAMPLES
============================

The complete query has the following format:
   Number of regular expressions (N)
   regular expression 1
   regular expression 2
   ....
   regular expression N
   logical relation

The query should be sent to chemistry-search@ccl.net, and when the search is
finished, the resulting list of files satisfying your query will be sent to
you automatically. Do not be impatient and do not send your next query
before you get the results of the first one --- your request will be denied.
Please remember that this is a flat file search and it is very demanding as
far as I/O and CPU are concerned. Therefore only one query will be running
at any given moment. Before you send a query, try to analyze it, and make it
specific. You do not want to get a listing of the whole archive. Queries
which search only file names are much faster than queries which scan the
whole file (i.e., you get your results faster, but you might not like what
you get). If you look for a single term, you can use the abbreviated
one-line format which consists of a regular expression only. Now, a few
examples.

Example 1.
----------
   1
   1B: /\bMM[23]\b|MMP2\b/
   1
which is equivalent to saying:
   B: /\bMM[23]\b|MMP2\b/
or
   /\bMM[23]\b|MMP2\b/
since the default is both text and file names.
This searches for all text files which refer to MM2, MM3 or MMP2, and for
file names which contain MM2, MM3 or MMP2.

Example 2.
----------
   /basis\sset/
Note that this example is not equivalent to the query:
   2
   /basis/
   /set/
   1 & 2
since in the first case the words "basis" and "set" must be side by side,
while in the latter case they may be separated by many words, and in fact
"set" may be found before "basis". Also, the latter query will find all the
file names having "basis" or "set" in them, while there are no file names in
the archive which have a space embedded in them.

Example 3.
----------
   3
   1T: /\bMOPAC[^a-z]|\bAMPAC[^a-z]|\bAM1\b|\bPM3\b|\bMNDO\b|\bMINDO/
   2T: /\bCHARGE/
   3T: /\bHYDROGEN[\s_-]?BOND[si\s.,;]|\bH[\s_-]?BOND[si\s.,;]/
   1 & (2 | 3)
Three regular expressions were specified. Expression 1 looks for the words
MOPAC, AMPAC, AM1, PM3, MNDO, MINDO. Note that the text may say either MOPAC
or MOPAC6, so it is safer to require a non-letter after MOPAC rather than a
space or word boundary. The 2nd expression looks for "CHARGE ", "CHARGES",
"CHARGE,", "CHARGE.", "CHARGE=", "CHARGE-", etc. The 3rd one is a challenge
for you. Note that people may say: HYDROGEN BOND, HYDROGEN-BOND,
HYDROGEN_BOND, HYDROGENBOND, H-BOND, H BOND, H_BOND, HBOND, and may say
BONDS, BONDING, and may put .,; after BOND. Note that all the regular
expressions given above request searching the text of the file only, not its
name. The logical relation simply says: "find me the files which mention
MOPAC or AMPAC or AM1 or PM3 or MNDO or MINDO and also say something about
CHARGes or HYDROGEN BONDs".

In short:
  1. prepare your search query as described above
  2. send it to chemistry-search@ccl.net
  3. wait (for a long, long time) for an answer

-----------------------------

I will welcome suggestions for improving this description and corrections
to my spelling and grammar (as you know, I am not a native English speaker).

Jan Labanowski
jkl@ccl.net
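
Since a search takes a long time, it can help to try the patterns on a
sample of text on your own machine before sending the query. The Python
sketch below is only an approximation of what the searching program does,
not its actual code. The names normalize, found and example3_matches are
hypothetical, Python's re module is assumed to be close enough to the Perl
convention for these patterns, and the hyphen is placed last inside the
brackets, as the rule for square brackets requires. It uses the patterns and
the relation of Example 3 and does not model the message-by-message handling
of archived postings.

```python
import re


def normalize(text):
    """Approximate the preprocessing described in this document: join words
    hyphenated at the end of a line, then contract white space."""
    text = re.sub(r"-[ \t]*\n\s*", "", text)  # join hyphenated line ends
    return re.sub(r"\s+", " ", text).strip()  # one long line, single spaces


def found(pattern, scope, file_name, text):
    """Return True if `pattern` matches within `scope`:
    'T' text only, 'F' file name only, 'B' both (the default)."""
    targets = []
    if scope in ("F", "B"):
        targets.append(file_name)
    if scope in ("T", "B"):
        targets.append(normalize(text))
    # The search is lettercase insensitive.
    return any(re.search(pattern, t, re.IGNORECASE) for t in targets)


# The three regular expressions of Example 3, all with scope T.
PATTERNS = {
    1: ("T", r"\bMOPAC[^a-z]|\bAMPAC[^a-z]|\bAM1\b|\bPM3\b|\bMNDO\b|\bMINDO"),
    2: ("T", r"\bCHARGE"),
    3: ("T", r"\bHYDROGEN[\s_-]?BOND[si\s.,;]|\bH[\s_-]?BOND[si\s.,;]"),
}


def example3_matches(file_name, text):
    """Apply the relation of Example 3, 1 & (2 | 3), to one file."""
    status = {
        number: found(pattern, scope, file_name, text)
        for number, (scope, pattern) in PATTERNS.items()
    }
    return status[1] and (status[2] or status[3])


sample = "We ran MOPAC 6 to study hydro-\ngen bonding."
print(example3_matches("notes.txt", sample))  # prints True
```

A file whose name contains "mopac" but whose text mentions none of the terms
is not reported by this sketch, because all three expressions have scope T.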