CCL: Re: Software for pattern recognition in QSAR studies?
- From: Michel Petitjean <ptitjean*at*itodys.jussieu.fr>
- Subject: CCL: Re: Software for pattern recognition in QSAR studies?
- Date: Mon, 6 Dec 2004 17:16:00 +0100 (MET)
To: chemistry*at*ccl.net
Subj: CCL: Re: Software for pattern recognition in QSAR studies?
> > Renxiao Wang wrote:
> > > I am looking for a program that can apply standard pattern recognition
> > > techniques. Basically, I want to study a number of samples, each of
> > > which can be characterized by some properties. I would like to classify
> > > these samples into several groups based on these properties, and then
> > > derive a QSAR model for each group.
> > >...
> >
> > When all properties are non-numerical (e.g. property 1 takes values
> > A or B or C, property 2 takes value red or green or blue, property 3
> > takes values: alpha or beta or gamma or delta, etc...), there is
> > a classification method able to compute the optimal partition,
> > including the number of classes: very few methods can do this.
> > Freeware with reference and documentation:
> > http://petitjeanmichel.free.fr/itoweb.petitjean.freeware.html#POP
> "E.L. Willighagen" <e.willighagen*at*science.ru.nl> replied:
> Classification and Regression Trees (CART) can do that... there are two or
> three packages (one is tree) for R available. See http://cran.r-project.org/.
Hierarchical classification methods need to cut the tree. The problem
is that the cut relies on arbitrary parameters, which are NOT computed
from the data alone. E.g., in CART, the decision to split a group needs
a test, which is most often based on an arbitrary value set by the
user. This means that the final number of classes depends on an
arbitrary selection of values made by the user, and even an experienced
user cannot be sure of selecting suitable parameters. The POP freeware
above works without "external" parameters and computes the number of
classes from the data only. This dependence on external parameters
occurs in many molecular modeling problems, and in many other fields
as well.
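
To make the dependence on a user-set parameter concrete, here is a
minimal sketch. It is a toy stand-in, not CART itself: sorted 1-D
values are cut wherever the gap between neighbours exceeds a threshold,
which amounts to cutting a single-linkage tree at that height. The
function name cut_by_gap, the data and the thresholds are all
hypothetical and chosen only for illustration.

```python
def cut_by_gap(values, threshold):
    """Group 1-D values, starting a new group whenever the gap between
    sorted neighbours exceeds the user-chosen threshold."""
    ordered = sorted(values)
    groups = [[ordered[0]]]
    for previous, current in zip(ordered, ordered[1:]):
        if current - previous > threshold:
            groups.append([current])      # gap too large: open a new group
        else:
            groups[-1].append(current)    # otherwise extend the current group
    return groups


# Hypothetical data: the same six values, cut at four different heights.
data = [1.0, 1.2, 1.9, 3.0, 3.1, 5.0]
for threshold in (0.5, 1.0, 1.5, 2.0):
    print(threshold, len(cut_by_gap(data, threshold)))
```

With these values the number of groups goes from 4 down to 1 as the
threshold grows, although the data never change: the answer is set by
the user as much as by the samples.
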
Currently, a number of descriptive statisticians are working on this
problem in the case of numerical variables. It is far from being solved.
The known solution for categorical variables is due to
F. Marcotorchino (1981). The scientific community is still waiting
for an elegant solution in the numerical case.
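
For the categorical case, the following is a minimal sketch of how a
partition and its number of classes can be derived from the data alone.
It assumes the criterion is the Condorcet criterion usually associated
with Marcotorchino's relational analysis: each pair of samples placed
in the same class contributes (number of properties on which they
agree) minus (number on which they disagree). That this is what POP
computes is an assumption, not something stated in this message, and
the exhaustive search is for illustration only, since it is feasible
just for a handful of samples. All names and data are hypothetical.

```python
from itertools import combinations


def partitions(items):
    """Yield every partition of a list as a list of blocks."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for smaller in partitions(rest):
        # Put `first` into each existing block in turn...
        for k in range(len(smaller)):
            yield smaller[:k] + [[first] + smaller[k]] + smaller[k + 1:]
        # ...or into a new block of its own.
        yield [[first]] + smaller


def condorcet_score(partition, samples):
    """Sum, over pairs in the same block, of agreements minus disagreements.
    (Assumed criterion; equal to the Condorcet criterion up to a constant.)"""
    n_properties = len(samples[0])
    score = 0
    for block in partition:
        for i, j in combinations(block, 2):
            agree = sum(a == b for a, b in zip(samples[i], samples[j]))
            score += 2 * agree - n_properties
    return score


# Hypothetical samples described by three categorical properties.
samples = [
    ("A", "red", "alpha"),
    ("A", "red", "beta"),
    ("A", "red", "alpha"),
    ("B", "blue", "gamma"),
    ("B", "blue", "delta"),
    ("C", "green", "gamma"),
]

indices = list(range(len(samples)))
best = max(partitions(indices), key=lambda p: condorcet_score(p, samples))
print(len(best), sorted(sorted(block) for block in best))
```

No number of classes and no threshold is supplied: on this toy data the
best-scoring partition has three classes, {0, 1, 2}, {3, 4} and {5}.
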
Michel Petitjean, Email: petitjean*at*itodys.jussieu.fr
ITODYS (CNRS, UMR 7086) ptitjean*at*ccr.jussieu.fr
1 rue Guy de la Brosse Phone: +33 (0)1 44 27 48 57
75005 Paris, France. FAX : +33 (0)1 44 27 68 14
http://petitjeanmichel.free.fr/itoweb.petitjean.html