CCL: extract chemical information from PDF tables
- From: Jesse Gordon <jesse.gordon _ dotmatics.com>
- Subject: CCL: extract chemical information from PDF tables
- Date: Fri, 6 Apr 2012 08:00:45 -0400
Yes, I've used CLiDE to pull structures out of PDF files. There are
three versions:
- CLiDE standard is one-at-a-time; you draw a box around the structure
  and CLiDE interprets it.
- CLiDE pro recognizes all the structures in a multi-page PDF and
  produces a list.
- CLiDE batch takes a whole stack of documents.
The quality is comparable to OCR (optical character recognition), and
I've always called this method "Chemical OCR." As anyone familiar with
OCR knows, it makes a LOT of mistakes. For example, CLiDE almost always
fails to recognize iodine, which it interprets as a methyl group with
implicit carbon and hydrogen. Or vice versa: if your methyl group happens
to be vertical, it often gets misinterpreted as an iodine atom. Rings
sometimes end up "broken", i.e. CLiDE interprets a 6-membered ring as a
chain of 6 atoms arranged in a hexagon. So it's not perfect. As with text
OCR, you have to read the result and correct it manually.
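To make the two failure modes above concrete, here is a minimal sketch of a
pre-screen that points a human reviewer at the suspicious spots. It is not
part of CLiDE and does not read CLiDE output: the atom list, the bond list,
and the names ring_count, review_flags and expected_rings are all
assumptions made for illustration. The ring count uses the standard graph
identity rings = bonds - atoms + connected components, and expected_rings
is whatever the reviewer counts in the original drawing.

```python
from collections import Counter


def ring_count(n_atoms, bonds):
    """Independent rings = bonds - atoms + connected components.

    Assumes `bonds` holds each bond once, as a pair of atom indices.
    """
    parent = list(range(n_atoms))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    components = n_atoms
    for a, b in bonds:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb
            components -= 1
    return len(bonds) - n_atoms + components


def review_flags(elements, bonds, expected_rings=None):
    """Return human-readable hints; every hint still needs a visual check."""
    degree = Counter()
    for a, b in bonds:
        degree[a] += 1
        degree[b] += 1

    flags = []
    for i, element in enumerate(elements):
        if element == "I":
            flags.append(f"atom {i}: iodine, check it is not a vertical methyl")
        elif element == "C" and degree[i] == 1:
            flags.append(f"atom {i}: terminal carbon, check it is not an iodine")

    if expected_rings is not None:
        found = ring_count(len(elements), bonds)
        if found != expected_rings:
            flags.append(f"rings: found {found}, expected {expected_rings}")
    return flags


# Hypothetical example: a six-carbon "ring" whose closing bond was lost,
# with one iodine substituent on atom 0.
elements = ["C"] * 6 + ["I"]
bonds = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (0, 6)]
for flag in review_flags(elements, bonds, expected_rings=1):
    print(flag)
```

A check like this only narrows down where to look. It cannot tell a real
iodine from a misread methyl, so the comparison against the original
drawing is still needed.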
I found that using CLiDE, whether in one-at-a-time form or batch form,
requires manually comparing the original with the chemical-OCR-interpreted
result. With text OCR you can usually skip referring back to the original,
since you know all the words in your head. With chemical OCR, the two
errors I described above are obvious upon inspection, but many others are
not. The best way to inspect is to assign names to the resulting
structures using an automatic structure-to-name converter. That exposes
subtle issues, like an atom sitting "close" to a methyl group when it
should be connected to it by a bond. Compare the automatically generated
name to your visual inspection of the original, and the process is pretty
quick.
On 5 April 2012 12:42, Alex Allardyce aa|a|chemaxon.com
<owner-chemistry-,-ccl.net> wrote:
Sent to CCL by: Alex Allardyce [aa : chemaxon.com]
ChemAxon supports text and scanned PDF (and doc, ppt, pptx etc). It is
integrated throughout our technology but is probably easiest to try out in
MarvinView (free for the desktop): just 'open' the PDF (or doc or ppt
etc) and all extracted structures are shown. You can also access all of the
functionality through the API/command line as well as KNIME and Pipeline Pilot
nodes.
Cheers
Alex
On Tue, Apr 3, 2012 at 17:55, Brian Bennion bennion1=-=llnl.gov
<owner-chemistry[*]ccl.net> wrote:
Sent to CCL by: "Brian Bennion" [bennion1=llnl.gov]
Hello,
Does anyone have/know of code to parse pdf tables for chemical structure and
activity data?
Searching the web did not result in much so I may not be searching with the
correct terms. One interesting hit was the clide code from simbiosis.
Has anyone used this for pulling structures out of pdf files?
I want to populate a repository with chemical structures and annotate the
entries with the activity data given in an associated table located in the same
pdf document.
Thanks
Brian
--
=====================================
Jesse Gordon
Application Scientist
Dotmatics Limited
400 West Cummings Park #5450, Woburn MA 01801
T: +1 781-305-3114
M: +1-617-320-6989
Skype: jessegordon
======================================