CCL: extract chemical information from PDF tables



Yes, I've used CLiDE to pull structures out of PDF files. There are three versions:
- CLiDE standard works one structure at a time: you draw a box around the structure and CLiDE interprets it.
- CLiDE pro recognizes all the structures in a multi-page PDF and produces a list.
- CLiDE batch takes a whole stack of documents.
 
The quality is comparable to OCR (optical character recognition), and I've always called this method "Chemical OCR." If you're familiar with OCR, you know it makes a LOT of mistakes. For example, CLiDE almost always fails to recognize iodine, which it interprets as a methyl group with implicit carbon and hydrogen. Or vice versa: if your methyl group happens to be vertical, it often gets misinterpreted as an iodine atom. Rings sometimes end up "broken", i.e. CLiDE interprets a 6-membered ring as an open chain of 6 atoms arranged in a hexagon. So it's not perfect. As with text OCR, you have to read the result and correct it manually.
 
I found that using CLiDE, whether in one-at-a-time form or batch form, requires manually comparing the original with the chemical-OCR-interpreted result. With text OCR you can usually skip referring back to the original, since you know all the words in your head. With chemical OCR, the two errors I described above are obvious upon inspection, but many others are not. The best way to inspect is to assign names to the resulting structures using some automatic structure-to-name converter. That exposes subtle issues, such as an atom merely sitting "close" to a methyl group when it should be connected to it by a bond. Compare the automatically generated name to your visual inspection of the original, and the process is pretty quick.
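
To decide which structures to compare first, the generated names can be screened for fragments tied to the iodine/methyl confusion. What follows is only a minimal Python sketch under assumptions: the function name flag_for_review, the fragment list, and the (id, name) input format are hypothetical, and the names are assumed to come from whatever structure-to-name converter you already use. It is not part of CLiDE or any other product.

```python
# Name fragments tied to the iodine <-> methyl confusion described above.
# This list is an assumption for illustration; extend it with your own error patterns.
SUSPECT_FRAGMENTS = ("iodo", "iodide", "methyl")


def flag_for_review(named_structures):
    """Return (structure_id, name, hits) for names containing a suspect fragment.

    named_structures: iterable of (structure_id, generated_name) pairs.
    """
    flagged = []
    for structure_id, name in named_structures:
        lowered = name.lower()
        hits = [frag for frag in SUSPECT_FRAGMENTS if frag in lowered]
        if hits:
            flagged.append((structure_id, name, hits))
    return flagged


if __name__ == "__main__":
    # Hypothetical identifiers and names, for demonstration only.
    examples = [
        ("cmpd-1", "4-iodotoluene"),
        ("cmpd-2", "benzene"),
        ("cmpd-3", "2-Methylpyridine"),
    ]
    for structure_id, name, hits in flag_for_review(examples):
        print(f"{structure_id}: {name} -> check {', '.join(hits)}")
```

A flagged name only means "look at this one first." Every structure still needs the side-by-side comparison with the original, since a simple text match like this will not catch broken rings or missing bonds.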
 
 
 
On 5 April 2012 12:42, Alex Allardyce aa|a|chemaxon.com <owner-chemistry-,-ccl.net> wrote:

Sent to CCL by: Alex Allardyce [aa : chemaxon.com]
ChemAxon supports text and scanned pdf (and doc, ppt, pptx etc). It is integrated throughout our technology but is probably easiest to try out in MarvinView (free for the desktop): just 'open' the pdf (or doc or ppt etc) and all extracted structures are shown. You can also access all of the functionality through the API/command line as well as KNIME and Pipeline Pilot nodes.

Cheers
Alex

On Tue, Apr 3, 2012 at 17:55, Brian Bennion bennion1=-=llnl.gov
<owner-chemistry[*]ccl.net>  wrote:
Sent to CCL by: "Brian  Bennion" [bennion1=llnl.gov]
Hello,

Does anyone have/know of code to parse pdf tables for chemical structure and activity data?

Searching the web did not result in much so I may not be searching with the correct terms.  One interesting hit was the clide code from simbiosis.

Has anyone used this for pulling structures out of pdf files?

I want to populate a repository with chemical structures and annotate the entries with the activity data given in an associated table located in the same pdf document.

Thanks
Brian




--
=====================================
Jesse Gordon
Application Scientist
Dotmatics Limited
400 West Cummings Park #5450, Woburn MA 01801
T: +1 781-305-3114
M: +1-617-320-6989
Email: jesse.gordon-,-dotmatics.com
Skype: jessegordon
======================================
 
