
Filtering your CCL Mail

If you are a CCL subscriber, you can filter CCL messages before the CCL server delivers them to your address. Filtering requires that you fill out a complicated Web form after you have seriously analyzed which messages you want to get from CCL. The content of your filter will not be knowingly released to anyone. We will not tell anyone what you like and what you hate. However, CCL administrators may want to share some of your filtering prescriptions (anonymously) for the common good.

Before you get anywhere with this, please read this manual carefully. Please help me make it easier to read; I would appreciate your comments. Next, you need to learn Perl regular expressions (unless you are an expert, you can read my page: http://www.ccl.net/chemistry/resources/tips/regular_expressions.shtml). Then you will need to fill out a Web form. Finally, you need to monitor the messages for a while and compare them with the archive of all messages at: http://www.ccl.net/chemistry/resources/messages/index.shtml to see if your filter does what it is supposed to do. You can always go back to the filter setup form and tune it. If you are too busy or too lazy to do this, do not even start. At the same time, for what it is worth, learning Perl regular expressions will make you more productive, and your Return On Learning Investment (ROLI -- a four letter word) will pay off many times over in all aspects of your computational work.

Before you are allowed to set up the filter, you will need to authenticate -- have a recent CCL message handy and look at its header (namely, the To: and the Message-Id: header lines). Also, open the regular expression testing form at http://www.ccl.net/cgi-bin/ccl/regexp/test_re.pl and test your regular expressions before you enter them into the filter. It will save you a lot of frustration. The filter setup Web form asks you to specify regular expressions and assign a numeric value (a positive or negative score) to each of them. You also need to tell which part of the message should be matched (header, body, or both). Before a message is sent to you, the CCL server checks whether you have created a filter. If so, the software retrieves the filter specs and matches your regular expressions against the message. The numeric values assigned to the regular expressions that matched are added together. If the sum is greater than or equal to zero, the message is delivered to you. If the sum is negative, you will not get the message. To avoid rounding errors, a sum that falls within +/- 0.0001 is considered zero (i.e., the message will be delivered).
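The scoring rule described above can be sketched in a few lines. This is only an illustration, not the CCL server code: it uses Python's re module as a stand-in for Perl regular expressions, and the function names, the sample rules, and the way "header+body" is built by joining the two parts with a blank line are all assumptions made for the example.

```python
import re

TOLERANCE = 0.0001  # sums within +/- 0.0001 count as zero


def filter_score(rules, header, body):
    """Add up the scores of all rules whose regex matches its scope."""
    # Assumption: "header+body" is matched against both parts joined together.
    scopes = {
        "header": header,
        "body": body,
        "header+body": header + "\n\n" + body,
    }
    total = 0.0
    for pattern, scope, score in rules:
        if pattern.search(scopes[scope]):
            total += score
    return total


def is_delivered(total):
    """Deliver when the sum is positive or zero (within the tolerance)."""
    return total >= -TOLERANCE


# Hypothetical rules: like DFT postings, dislike job advertisements.
rules = [
    (re.compile(r"dft", re.IGNORECASE), "body", 2.5),
    (re.compile(r"^Subject:.*job", re.IGNORECASE | re.MULTILINE), "header", -2.5),
]
header = "Subject: Job opening\nTo: chemistry@ccl.net"
body = "Postdoc position in DFT method development."

total = filter_score(rules, header, body)  # both rules match: 2.5 - 2.5
print(total, is_delivered(total))
```

Both rules match here, the scores cancel, and a sum of zero means the message is still delivered.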
 
The matching is performed on the original message, as it was received from the author, rather than on the one that you receive from CCL. The CCL server alters messages slightly before sending them: it obfuscates e-mail addresses, removes the id of the sender if present, and tries to remove repeated occurrences of the footer in replies. Each message consists of a header and a body. The header, the top part of the message, is followed by a blank line and then the body of the message. The header begins with a message start line (address and time stamp) like:
From owner-chemistry@ccl.net Thu Nov 3 12:01:18 2005
which is not really useful, since in the CCL case it does not provide the real address of the message author. This line is followed by lines that start with a keyword followed by a colon. If a given header line is too long, it is continued on the next line, which starts with a blank character (tab or space). An example message follows:


From owner-chemistry@ccl.net Thu Nov  3 12:01:18 2005
Received: from server.example.com (server.example.com [192.168.1.23])
        by server.ccl.net (8.13.1/8.13.1) with ESMTP id jA3H1GJv008592
        for <chemistry@ccl.net>; Thu, 3 Nov 2005 12:01:16 -0500
Received: from localhost.localdomain (server.example.com [192.168.1.23])
        by server.example.com (8.13.0/8.13.0) with ESMTP id jA3H1C4W003410
        for <chemistry@ccl.net>; Thu, 3 Nov 2005 12:01:13 -0500
Message-Id: <200511031701.jA3H1C4W003410@server.example.com>
Content-Type: text/plain
Content-Disposition: inline
Content-Transfer-Encoding: binary
Date: Thu, 03 Nov 2005 12:01:12 -0500
MIME-Version: 1.0
Reply-To: "John M. Smith" <John.M.Smith@example.com>
Organization: Example.Com Corporation
Subject: Spin contamination papers?
From: "John M. Smith" <John.M.Smith@example.com>
To: chemistry@ccl.net

I am looking for papers on spin contamination
that present results of calculations on polyradicals with
unrestricted KS as implemented in QuickMol.

John M. Smith

This is the simplest possible text message. Some messages will have encoded parts and attachments. Currently we do not address these complications and treat all messages as plain text. In most cases it will not matter much, but on rare occasions you can get a message that you did not want, or lose a message on your favorite topic, due to character encoding. For your matching convenience, each header line is converted to a single line (continuation lines are joined with the first line). Using the message above as an example, the filter

Regular expression    Message Match Scope      Score
/spin/i                       [body]               10.0
/quickmol/i                   [header+body]        -5.0
/smith\@example\.com/i        [header]            -10.0
/johns\@other\.com/i          [header]            -10.0
/jerk\@hootmail\.com/i        [header]           -100.0
/^Subject:\s.*Lottery.*$/im   [header]          -1000.0

will stop the example message (the first three expressions matched and the sum is -5.0), while the filter:

Regular expression    Message Match Scope      Score
/(spin|radical|esr)/i     [body]               10.0
/unrestricted|uhf|uks/i   [body]               10.0
/quickmol/i               [header+body]        -5.0
/smith\@example\.com/i    [header]            -10.0
/johns\@other\.com/i      [header]            -10.0
/jerk\@hootmail\.com/i    [header]            -10.0

will pass the message since the sum of scores of the regular expressions that matched is positive (+5.0).
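The two filters can be replayed against the example message with a short script. Treat it as a rough sketch under stated assumptions rather than the actual CCL implementation: Python's re module stands in for Perl, the example header is abridged, the helper names are made up, and the address patterns are taken as case-insensitive, because the address in the example header is written John.M.Smith@example.com with a capital S.

```python
import re

# Abridged copy of the example message (some header lines left out).
RAW = """From owner-chemistry@ccl.net Thu Nov  3 12:01:18 2005
Received: from server.example.com (server.example.com [192.168.1.23])
        by server.ccl.net (8.13.1/8.13.1) with ESMTP id jA3H1GJv008592
        for <chemistry@ccl.net>; Thu, 3 Nov 2005 12:01:16 -0500
Message-Id: <200511031701.jA3H1C4W003410@server.example.com>
Subject: Spin contamination papers?
From: "John M. Smith" <John.M.Smith@example.com>
To: chemistry@ccl.net

I am looking for papers on spin contamination
that present results of calculations on polyradicals with
unrestricted KS as implemented in QuickMol.

John M. Smith
"""


def split_message(raw):
    """Split at the first blank line and join header continuation lines."""
    header, _, body = raw.partition("\n\n")
    # A continuation line starts with a space or a tab.
    header = re.sub(r"\n[ \t]+", " ", header)
    return header, body


I = re.IGNORECASE
FILTER_1 = [
    (r"spin", I, "body", 10.0),
    (r"quickmol", I, "header+body", -5.0),
    (r"smith\@example\.com", I, "header", -10.0),
    (r"johns\@other\.com", I, "header", -10.0),
    (r"jerk\@hootmail\.com", I, "header", -100.0),
    (r"^Subject:\s.*Lottery.*$", I | re.MULTILINE, "header", -1000.0),
]
FILTER_2 = [
    (r"(spin|radical|esr)", I, "body", 10.0),
    (r"unrestricted|uhf|uks", I, "body", 10.0),
    (r"quickmol", I, "header+body", -5.0),
    (r"smith\@example\.com", I, "header", -10.0),
    (r"johns\@other\.com", I, "header", -10.0),
    (r"jerk\@hootmail\.com", I, "header", -10.0),
]


def score(rules, raw):
    """Sum the scores of the rules that match their part of the message."""
    header, body = split_message(raw)
    text = {"header": header, "body": body, "header+body": header + "\n\n" + body}
    return sum(s for pat, flags, scope, s in rules if re.search(pat, text[scope], flags))


print(score(FILTER_1, RAW))  # -5.0: negative, the message is stopped
print(score(FILTER_2, RAW))  # 5.0: positive, the message is delivered
```

Without the case-insensitive flag, the smith pattern would not match this header, and the second sum would come out as 15.0 rather than 5.0. This is exactly the kind of detail worth checking in the regular expression testing form first.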

This is a new feature, and there is no experience yet with using it in practice. My way of thinking is that you should initially focus on rejecting messages that you do not want rather than prioritizing messages by interest. Assuming (an abstract case, since I have to be politically correct) that you hate mail coming from a country called Buenita (ISO country code: BU) that is a source of messages promising fraudulent business deals, and also the mail from one CCL subscriber (jerk@jerks.example.com) whom you despise personally, you can make a simple filter like:

Regular expression    Message Match Scope      Score
/\.bu\W/i                     [header]             -100.0
/jerk\@jerks\.example\.com/i  [header]             -100.0

and rest assured that these messages will not be sent to you by CCL. You can also use filters to protect your mailbox from overflowing when we have a "flame war" on CCL and then release the brakes when it is over. To stop all messages from CCL without unsubscribing, you could do:

Regular expression    Message Match Scope      Score
/./                      [header]             -100.0

but make sure that you keep at least one recent message that you received from CCL so that you can remove this brake, since the filter setup Web page requires your CCL Message-Id and/or your CCL subscriber Id for authentication. It is probably easier, though, to unsubscribe temporarily using the appropriate Web form available from the CCL Web page. If you FUBAR or FUMTU, please contact me, and I will fix it for you. Now you are ready... Go to the page: http://www.ccl.net/cgi-bin/ccl/enter_preferences and set your CCL mail preferences.
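Before entering rejection patterns like the ones above into the form, it can help to try them on a few header lines offline. The snippet below is a hedged illustration only: it uses Python's re module in place of Perl, and the sample header lines and helper names are invented for the test.

```python
import re

COUNTRY = re.compile(r"\.bu\W", re.IGNORECASE)  # the /\.bu\W/i pattern
BLOCK_ALL = re.compile(r".")                    # the /./ pattern


def hits_country(line):
    """True when the line contains '.bu' followed by a non-word character."""
    return COUNTRY.search(line) is not None


def hits_block_all(text):
    """True for any text that contains at least one non-newline character."""
    return BLOCK_ALL.search(text) is not None


# Invented header lines for checking the country pattern.
samples = [
    ('From: "A. Sender" <sender@mail.example.bu>', True),
    ("Received: from relay.example.BU (relay.example.BU [192.0.2.7])", True),
    ("From: someone@chem.buffalo.example", False),  # '.bu' followed by 'f'
    ("To: user@host.example.bu", False),            # nothing follows '.bu'
]
for line, expected in samples:
    assert hits_country(line) == expected

assert hits_block_all("Subject: anything at all")
```

The last sample shows a corner case: \W needs an actual character after ".bu", so an address sitting at the very end of the matched text slips through. In Python, a pattern such as \.bu(\W|$) would cover that case too; whether it matters for the CCL filter depends on how the server assembles the header text, which is not specified here.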

Maybe at some point I will build upon your experience and come up with some typical prescriptions. In the long run, it would be much better to come up with a set of categories/topics to which each message can be assigned (say: quantum chemistry, drug design, molecular dynamics, spectroscopy, molecular graphics, etc.) and let you subscribe to one or more topics. This is an interesting (and far from trivial) research problem, and there are numerous ways of attacking it. One possible approach would be to create a file for each topic with text expressions (abbreviations, keywords, names, text snippets, etc.), then calculate some text similarity index between each file and the message (e.g., a normalized overlapping n-gram count), and assign the message to one or more topics depending on the degree of similarity. But again... this is a good topic for a Ph.D. thesis or something useful to do after retirement. Since the income that CCL brings is negligible, I am not sure when and if I will approach this issue.
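To make the n-gram idea a bit more concrete, here is one possible reading of a "normalized overlapping n-gram count", written as a small Python sketch. Nothing like this exists in CCL: the choice of character trigrams, the normalization by the message's own n-gram count, the topic texts, and all names are assumptions made purely for illustration.

```python
from collections import Counter


def ngrams(text, n=3):
    """Count overlapping character n-grams of lowercased, space-normalized text."""
    cleaned = " ".join(text.lower().split())
    return Counter(cleaned[i:i + n] for i in range(len(cleaned) - n + 1))


def similarity(message, topic_text, n=3):
    """Shared n-gram count divided by the message's n-gram count (0.0 to 1.0)."""
    msg, topic = ngrams(message, n), ngrams(topic_text, n)
    total = sum(msg.values())
    if total == 0:
        return 0.0
    shared = sum((msg & topic).values())  # minimum count per shared n-gram
    return shared / total


# Hypothetical topic files, reduced to one line of keywords each.
TOPICS = {
    "quantum chemistry": "spin contamination unrestricted UHF UKS DFT basis set",
    "molecular dynamics": "trajectory force field thermostat simulation box timestep",
}


def rank_topics(message, topics):
    """Return topic names sorted from most to least similar to the message."""
    return sorted(topics, key=lambda name: similarity(message, topics[name]), reverse=True)


message = "papers on spin contamination with unrestricted KS"
print(rank_topics(message, TOPICS))
```

A real system would still need a similarity threshold for assigning a message to a topic (or to none), and picking it sensibly is part of what makes this a research problem.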
 
Jan K. Labanowski, Ph.D.
Manager
Computational Chemistry List, Ltd.
jkl at ccl.net

Modified: Mon Nov 28 00:01:45 2005 GMT