Class DocumentToStructure
- java.lang.Object
-
- chemaxon.naming.document.DocumentToStructure
-
@PublicAPI public class DocumentToStructure extends Object
A convenience class for dealing with String-based documents, such as TXT, HTML, or XML files. It also defines a number of property keys for values that can typically be extracted from documents and made available on the resulting molecules.
See the constructors of MolImporter for more general and common document-to-structure operations.
- Since:
- 5.9
-
-
Field Summary
Fields
Modifier and Type, Field, Description
static String BYTE - For internal usage only.
static String CHARACTER - Molecule property key on results: key of the molecule property which contains the starting character offset since the beginning of the document, for text formats (html, xml, txt).
static String CONFIDENCE - Molecule property key on results: the confidence that the structure is correct.
static String CONTEXT - Molecule property key on results: the context of the structure recognized in the text.
static String CONTEXT_INDEX - Molecule property key on results: index of the hit inside the context.
static String DOC_AUTHOR - Molecule property key on results: name of the principal author(s) of a document.
static String DOC_CREATION_DATE - Molecule property key on results: the date on which the document was created.
static String DOC_LAST_AUTHOR - Molecule property key on results: name of the last (most recent) author of a document.
static String DOC_PATENT_ASSIGNEES - Molecule property key on results: the assignees of the patent, separated by newline characters.
static String DOC_PATENT_ID - Molecule property key on results: the patent identifier.
static String DOC_PATENT_INVENTORS - Molecule property key on results: the inventors of the patent, separated by newline characters.
static String DOC_PATENT_IPC - Molecule property key on results: the IPC classification(s) for the patent, separated by newline characters.
static String DOC_PATENT_IPCR - Molecule property key on results: the IPCR classification(s) for the patent, separated by newline characters.
static String DOC_TITLE - Molecule property key on results: the title of the document.
static String DOCUMENT - Molecule property key on results: the file name of the source document.
static String END_CHARACTER - Molecule property key on results: key of the molecule property which contains the ending character offset since the beginning of the document, for text formats (html, xml, txt).
static String IDENTIFIER
static String PAGE - Molecule property key on results: the page number, if applicable (e.g. for a PDF document).
static String SECTION - Molecule property key on results: the section of the document where the structure was found.
static String SOURCE_TEXT - Molecule property key on results: the source text, as it appears in the original document.
static String TYPE - Molecule property key on results: the type of source for the structure.
static String TYPE_CAS - Possible value for the TYPE property: the source is a CAS Registry Number®.
static String TYPE_CDX - Possible value for the TYPE property: the source is an embedded ChemDraw structure.
static String TYPE_COMMON - Possible value for the TYPE property: the source is a common name.
static String TYPE_EC - Possible value for the TYPE property: the source is an EC Number.
static String TYPE_GENERIC - Possible value for the TYPE property: the source is a generic name, for instance "C1-C4 alkyl".
static String TYPE_INCHI - Possible value for the TYPE property: the source is an InChI string.
static String TYPE_ION - Possible value for the TYPE property: the source is an ion abbreviation, for instance K+ or Ca2+.
static String TYPE_MRV - Possible value for the TYPE property: the source is an embedded Chemaxon MRV structure.
static String TYPE_OSR - Possible value for the TYPE property: the source is a structure image recognized by Optical Structure Recognition.
static String TYPE_PEPTIDE - Possible value for the TYPE property: the source is a peptide notation, for instance Val-Gly-Ser-Ala.
static String TYPE_SMILES - Possible value for the TYPE property: the source is a SMILES string.
static String TYPE_SYMYX - Possible value for the TYPE property: the source is an embedded Symyx/ISIS draw structure.
static String TYPE_SYSTEMATIC - Possible value for the TYPE property: the source is a systematic name.
-
Constructor Summary
Constructors
Constructor, Description
DocumentToStructure()
-
Method Summary
Modifier and Type, Method, Description
static boolean isMetadataMol(Molecule m)
static MolImporter process(String text) - Creates a MolImporter instance to import structures from a given text using the default format options.
static MolImporter process(String text, String options) - Creates a MolImporter instance to import structures from a given text.
-
-
-
Field Detail
-
SOURCE_TEXT
public static final String SOURCE_TEXT
Molecule property key on results: the source text, as it appears in the original document.
- See Also:
- Constant Field Values
-
DOCUMENT
public static final String DOCUMENT
Molecule property key on results: the file name of the source document.
- See Also:
- Constant Field Values
-
PAGE
public static final String PAGE
Molecule property key on results: the page number, if applicable (e.g. for a PDF document).
- See Also:
- Constant Field Values
-
CHARACTER
public static final String CHARACTER
Molecule property key on results: key of the molecule property which contains the starting character offset since the beginning of the document, for text formats (html, xml, txt).
- See Also:
- Constant Field Values
-
END_CHARACTER
public static final String END_CHARACTER
Molecule property key on results: key of the molecule property which contains the ending character offset since the beginning of the document, for text formats (html, xml, txt).
- See Also:
- Constant Field Values
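The two offset properties can be used to locate a hit in the original text. The following is a minimal sketch in plain Java and is not part of the ChemAxon API: the class HitLocator and its slice method are hypothetical, the CHARACTER and END_CHARACTER values are assumed to have been read from a result molecule as strings, and the offsets are assumed to be 0-based with an exclusive end. This page does not state those conventions, so compare the result against the SOURCE_TEXT property before relying on it.

```java
import java.util.Optional;

/** Hypothetical helper: recovers a hit from the document text using offset property values. */
final class HitLocator {

    /**
     * Returns the text between the two offsets.
     * ASSUMPTION: offsets are 0-based and the end offset is exclusive.
     */
    static Optional<String> slice(String document, String startValue, String endValue) {
        if (document == null || startValue == null || endValue == null) {
            return Optional.empty(); // property missing, e.g. for a non-text format
        }
        try {
            int start = Integer.parseInt(startValue.trim());
            int end = Integer.parseInt(endValue.trim());
            if (start < 0 || end < start || end > document.length()) {
                return Optional.empty(); // offsets do not fit this document
            }
            return Optional.of(document.substring(start, end));
        } catch (NumberFormatException e) {
            return Optional.empty(); // value is not an integer
        }
    }

    public static void main(String[] args) {
        String document = "Dissolve aspirin in ethanol.";
        // Hypothetical property values for a hit on the word "aspirin".
        System.out.println(slice(document, "9", "16").orElse("<no match>"));
    }
}
```

Returning an empty Optional instead of throwing keeps a batch loop over many hits running when a single property is missing or malformed.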
-
BYTE
public static final String BYTE
For internal usage only.
- See Also:
- Constant Field Values
-
IDENTIFIER
public static final String IDENTIFIER
- See Also:
- Constant Field Values
-
DOC_AUTHOR
public static final String DOC_AUTHOR
Molecule property key on results: name of the principal author(s) of a document.
- See Also:
- Constant Field Values
-
DOC_LAST_AUTHOR
public static final String DOC_LAST_AUTHOR
Molecule property key on results: name of the last (most recent) author of a document.
- See Also:
- Constant Field Values
-
DOC_TITLE
public static final String DOC_TITLE
Molecule property key on results: the title of the document.
- See Also:
- Constant Field Values
-
DOC_CREATION_DATE
public static final String DOC_CREATION_DATE
Molecule property key on results: the date on which the document was created.
- See Also:
- Constant Field Values
-
DOC_PATENT_ID
public static final String DOC_PATENT_ID
Molecule property key on results: the patent identifier.
- See Also:
- Constant Field Values
-
DOC_PATENT_IPC
public static final String DOC_PATENT_IPC
Molecule property key on results: the IPC classification(s) for the patent, separated by newline characters.
- See Also:
- Constant Field Values
-
DOC_PATENT_IPCR
public static final String DOC_PATENT_IPCR
Molecule property key on results: the IPCR classification(s) for the patent, separated by newline characters.
- See Also:
- Constant Field Values
-
DOC_PATENT_ASSIGNEES
public static final String DOC_PATENT_ASSIGNEES
Molecule property key on results: the assignees of the patent, separated by newline characters.
- See Also:
- Constant Field Values
-
DOC_PATENT_INVENTORS
public static final String DOC_PATENT_INVENTORS
Molecule property key on results: the inventors of the patent, separated by newline characters.
- See Also:
- Constant Field Values
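Several of the patent properties (DOC_PATENT_IPC, DOC_PATENT_IPCR, DOC_PATENT_ASSIGNEES, DOC_PATENT_INVENTORS) pack multiple entries into one newline-separated value. A minimal sketch of splitting such a value, in plain Java: the class PatentFieldValues is hypothetical, the sample names are made up, and the value is assumed to have already been read from a result molecule as a String.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: splits a newline-separated property value into its entries. */
final class PatentFieldValues {

    /** Returns the non-blank entries in order; a missing (null) value yields an empty list. */
    static List<String> split(String value) {
        List<String> items = new ArrayList<>();
        if (value == null) {
            return items;
        }
        // \R matches any line break sequence, so \n and \r\n are both handled.
        for (String line : value.split("\\R")) {
            String item = line.trim();
            if (!item.isEmpty()) {
                items.add(item);
            }
        }
        return items;
    }

    public static void main(String[] args) {
        // Made-up sample value in the shape described for DOC_PATENT_INVENTORS.
        for (String inventor : split("Alice Example\nBob Sample")) {
            System.out.println(inventor);
        }
    }
}
```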
-
CONFIDENCE
public static final String CONFIDENCE
Molecule property key on results: the confidence that the structure is correct. A value of 0 or less means very little confidence; a value of 1 or more means high confidence.
This is currently set for image recognition, that is, Optical Structure Recognition (OSR), also known as "chemical OCR".
- See Also:
- Constant Field Values
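The thresholds described for CONFIDENCE can be turned into a simple classification. This is a sketch only: the class ConfidenceLevels and its Level enum are hypothetical, the value is assumed to be available as a decimal string, and the INTERMEDIATE band for values strictly between 0 and 1 is an interpretation that this page does not name explicitly.

```java
/** Hypothetical helper: maps a CONFIDENCE property value to a coarse level. */
final class ConfidenceLevels {

    enum Level { VERY_LOW, INTERMEDIATE, HIGH, UNKNOWN }

    static Level classify(String value) {
        if (value == null) {
            return Level.UNKNOWN; // property not set, e.g. the hit did not come from OSR
        }
        double confidence;
        try {
            confidence = Double.parseDouble(value.trim());
        } catch (NumberFormatException e) {
            return Level.UNKNOWN;
        }
        if (Double.isNaN(confidence)) {
            return Level.UNKNOWN;
        }
        if (confidence <= 0.0) {
            return Level.VERY_LOW; // "0 or less means very little confidence"
        }
        if (confidence >= 1.0) {
            return Level.HIGH; // "1 or more means high confidence"
        }
        return Level.INTERMEDIATE; // ASSUMPTION: values between the two documented bounds
    }

    public static void main(String[] args) {
        System.out.println(classify("1.2"));
        System.out.println(classify("-0.5"));
    }
}
```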
-
SECTION
public static final String SECTION
Molecule property key on results: the section of the document where the structure was found. This is currently supported only for US patents in the USPTO XML format, in which case the value of the property can be "abstract", "citation", "description" or "claim N".
- See Also:
- Constant Field Values
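Because claim hits are reported as "claim N", the claim number has to be parsed out of the SECTION value. A minimal sketch in plain Java: the class SectionValues is hypothetical, and it assumes that N is a decimal integer separated from the word "claim" by whitespace, which this page implies but does not spell out.

```java
import java.util.OptionalInt;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: extracts the claim number from a SECTION property value. */
final class SectionValues {

    // ASSUMPTION: claim sections look like "claim 12" (lowercase, decimal number).
    private static final Pattern CLAIM = Pattern.compile("claim\\s+(\\d+)");

    /** Returns the claim number, or empty for "abstract", "citation", "description", or null. */
    static OptionalInt claimNumber(String section) {
        if (section == null) {
            return OptionalInt.empty();
        }
        Matcher matcher = CLAIM.matcher(section.trim());
        if (!matcher.matches()) {
            return OptionalInt.empty();
        }
        try {
            return OptionalInt.of(Integer.parseInt(matcher.group(1)));
        } catch (NumberFormatException e) {
            return OptionalInt.empty(); // digits too long to fit in an int
        }
    }

    public static void main(String[] args) {
        System.out.println(claimNumber("claim 12").isPresent());
        System.out.println(claimNumber("abstract").isPresent());
    }
}
```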
-
CONTEXT
public static final String CONTEXT
Molecule property key on results: the context of the structure recognized in the text.
- See Also:
- CONTEXT_INDEX, Constant Field Values
-
CONTEXT_INDEX
public static final String CONTEXT_INDEX
Molecule property key on results: index of the hit inside the context.
- See Also:
- CONTEXT, Constant Field Values
-
TYPE
public static final String TYPE
Molecule property key on results: the type of source for the structure.
-
TYPE_SYSTEMATIC
public static final String TYPE_SYSTEMATIC
Possible value for the TYPE property: the source is a systematic name.
- See Also:
- Constant Field Values
-
TYPE_COMMON
public static final String TYPE_COMMON
Possible value for the TYPE property: the source is a common name.
- See Also:
- Constant Field Values
-
TYPE_GENERIC
public static final String TYPE_GENERIC
Possible value for the TYPE property: the source is a generic name, for instance "C1-C4 alkyl".
- See Also:
- Constant Field Values
-
TYPE_SMILES
public static final String TYPE_SMILES
Possible value for the TYPE property: the source is a SMILES string.
- See Also:
- Constant Field Values
-
TYPE_INCHI
public static final String TYPE_INCHI
Possible value for the TYPE property: the source is an InChI string.
- See Also:
- Constant Field Values
-
TYPE_CAS
public static final String TYPE_CAS
Possible value for the TYPE property: the source is a CAS Registry Number®.
- See Also:
- Constant Field Values
-
TYPE_EC
public static final String TYPE_EC
Possible value for the TYPE property: the source is an EC Number.
- See Also:
- Constant Field Values
-
TYPE_ION
public static final String TYPE_ION
Possible value for the TYPE property: the source is an ion abbreviation, for instance K+ or Ca2+.
- See Also:
- Constant Field Values
-
TYPE_PEPTIDE
public static final String TYPE_PEPTIDE
Possible value for the TYPE property: the source is a peptide notation, for instance Val-Gly-Ser-Ala.
- See Also:
- Constant Field Values
-
TYPE_CDX
public static final String TYPE_CDX
Possible value for the TYPE property: the source is an embedded ChemDraw structure.
- See Also:
- Constant Field Values
-
TYPE_MRV
public static final String TYPE_MRV
Possible value for the TYPE property: the source is an embedded Chemaxon MRV structure.
- See Also:
- Constant Field Values
-
TYPE_SYMYX
public static final String TYPE_SYMYX
Possible value for the TYPE property: the source is an embedded Symyx/ISIS draw structure.
- See Also:
- Constant Field Values
-
TYPE_OSR
public static final String TYPE_OSR
Possible value for the TYPE property: the source is a structure image recognized by Optical Structure Recognition.
- See Also:
- Constant Field Values
-
-
Method Detail
-
process
public static MolImporter process(String text)
Creates a MolImporter instance to import structures from a given text using the default format options. A shorthand for process(text, null).
-
process
public static MolImporter process(String text, String options)
Creates a MolImporter instance to import structures from a given text. Generally, the text is treated as plain text. However, for convenience, text that starts immediately with an XML or HTML prologue is recognized as such instead of plain text. For complete documents, a direct call to a MolImporter constructor is often more appropriate than loading the whole document into a String object.
The returned MolImporter instance does no actual resource management, so closing it is not necessary.
- Parameters:
- text - the plain text or HTML/XML to process
- options - the "d2s" format options passed to MolImporter, or null if the default options should be used. Starting the String with "d2s:" is optional.
- Returns:
- a MolImporter that can be used to read the structures found in the text.
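Since the "d2s:" prefix is optional and null selects the defaults, code that logs or compares option strings may want a single canonical form. The following sketch shows one way to do that in plain Java. The class D2sOptions is hypothetical, and "someOption" is a placeholder, not a real format option; consult the Document to Structure format documentation for the actual option names.

```java
/** Hypothetical helper: brings a "d2s" options string into one canonical form. */
final class D2sOptions {

    private static final String PREFIX = "d2s:";

    /** Adds the optional "d2s:" prefix when absent; null (default options) stays null. */
    static String withPrefix(String options) {
        if (options == null) {
            return null; // null means "use the default options"
        }
        return options.startsWith(PREFIX) ? options : PREFIX + options;
    }

    public static void main(String[] args) {
        // "someOption" is a placeholder, not a documented option name.
        String options = withPrefix("someOption");
        System.out.println(options);
        // With the ChemAxon library on the classpath, the value would then be passed as:
        // MolImporter importer = DocumentToStructure.process(text, options);
    }
}
```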
-
isMetadataMol
public static boolean isMetadataMol(Molecule m)
-
-