Class DocumentToStructure


  • @PublicAPI
    public class DocumentToStructure
    extends Object
    A convenience class for dealing with String-based documents, such as TXT, HTML, or XML files. Also contains a number of property keys which can typically be extracted from documents and made available on the resulting molecules.

    See the constructors of MolImporter for more general and common document to structure operations.

    Since:
    5.9
    • Field Summary

      Fields 
      Modifier and Type Field Description
      static String BYTE
      For internal usage only.
      static String CHARACTER
      Molecule property key on results: the starting character offset from the beginning of the document, for text formats (HTML, XML, TXT).
      static String CONFIDENCE
      Molecule property key on results: the confidence that the structure is correct.
      static String CONTEXT
      Molecule property key on results: the context of the structure recognized in the text.
      static String CONTEXT_INDEX
      Molecule property key on results: index of the hit inside the context.
      static String DOC_AUTHOR
      Molecule property key on results: name of the principal author(s) of a document.
      static String DOC_CREATION_DATE
      Molecule property key on results: the date on which the document was created.
      static String DOC_LAST_AUTHOR
      Molecule property key on results: name of the last (most recent) author of a document.
      static String DOC_PATENT_ASSIGNEES
      Molecule property key on results: the assignees of the patent, separated by newline characters.
      static String DOC_PATENT_ID
      Molecule property key on results: the patent identifier.
      static String DOC_PATENT_INVENTORS
      Molecule property key on results: the inventors of the patent, separated by newline characters.
      static String DOC_PATENT_IPC
      Molecule property key on results: the IPC classification(s) for the patent, separated by newline characters.
      static String DOC_PATENT_IPCR
      Molecule property key on results: the IPCR classification(s) for the patent, separated by newline characters.
      static String DOC_TITLE
      Molecule property key on results: the title of the document.
      static String DOCUMENT
      Molecule property key on results: the file name of the source document.
      static String END_CHARACTER
      Molecule property key on results: the ending character offset from the beginning of the document, for text formats (HTML, XML, TXT).
      static String IDENTIFIER  
      static String PAGE
      Molecule property key on results: the page number, if applicable (e.g. for a PDF document).
      static String SECTION
      Molecule property key on results: the section of the document where the structure was found.
      static String SOURCE_TEXT
      Molecule property key on results: the source text, as it appears in the original document.
      static String TYPE
      Molecule property key on results: the type of source for the structure.
      static String TYPE_CAS
      Possible value for the TYPE property: the source is a CAS Registry Number®.
      static String TYPE_CDX
      Possible value for the TYPE property: the source is an embedded ChemDraw structure.
      static String TYPE_COMMON
      Possible value for the TYPE property: the source is a common name.
      static String TYPE_EC
      Possible value for the TYPE property: the source is an EC Number.
      static String TYPE_GENERIC
      Possible value for the TYPE property: the source is a generic name, for instance "C1-C4 alkyl".
      static String TYPE_INCHI
      Possible value for the TYPE property: the source is an InChI string.
      static String TYPE_ION
      Possible value for the TYPE property: the source is an ion abbreviation, for instance K+ or Ca2+.
      static String TYPE_MRV
      Possible value for the TYPE property: the source is an embedded Chemaxon MRV structure.
      static String TYPE_OSR
      Possible value for the TYPE property: the source is a structure image recognized by Optical Structure Recognition.
      static String TYPE_PEPTIDE
      Possible value for the TYPE property: the source is a peptide notation, for instance Val-Gly-Ser-Ala.
      static String TYPE_SMILES
      Possible value for the TYPE property: the source is a SMILES string.
      static String TYPE_SYMYX
      Possible value for the TYPE property: the source is an embedded Symyx/ISIS draw structure.
      static String TYPE_SYSTEMATIC
      Possible value for the TYPE property: the source is a systematic name.
    • Field Detail

      • SOURCE_TEXT

        public static final String SOURCE_TEXT
        Molecule property key on results: the source text, as it appears in the original document.
        See Also:
        Constant Field Values
      • DOCUMENT

        public static final String DOCUMENT
        Molecule property key on results: the file name of the source document.
        See Also:
        Constant Field Values
      • PAGE

        public static final String PAGE
        Molecule property key on results: the page number, if applicable (e.g. for a PDF document).
        See Also:
        Constant Field Values
      • CHARACTER

        public static final String CHARACTER
        Molecule property key on results: the starting character offset from the beginning of the document, for text formats (HTML, XML, TXT).
        See Also:
        Constant Field Values
      • END_CHARACTER

        public static final String END_CHARACTER
        Molecule property key on results: the ending character offset from the beginning of the document, for text formats (HTML, XML, TXT).
        See Also:
        Constant Field Values
      • DOC_AUTHOR

        public static final String DOC_AUTHOR
        Molecule property key on results: name of the principal author(s) of a document.
        See Also:
        Constant Field Values
      • DOC_LAST_AUTHOR

        public static final String DOC_LAST_AUTHOR
        Molecule property key on results: name of the last (most recent) author of a document.
        See Also:
        Constant Field Values
      • DOC_TITLE

        public static final String DOC_TITLE
        Molecule property key on results: the title of the document.
        See Also:
        Constant Field Values
      • DOC_CREATION_DATE

        public static final String DOC_CREATION_DATE
        Molecule property key on results: the date on which the document was created.
        See Also:
        Constant Field Values
      • DOC_PATENT_ID

        public static final String DOC_PATENT_ID
        Molecule property key on results: the patent identifier.
        See Also:
        Constant Field Values
      • DOC_PATENT_IPC

        public static final String DOC_PATENT_IPC
        Molecule property key on results: the IPC classification(s) for the patent, separated by newline characters.
        See Also:
        Constant Field Values
      • DOC_PATENT_IPCR

        public static final String DOC_PATENT_IPCR
        Molecule property key on results: the IPCR classification(s) for the patent, separated by newline characters.
        See Also:
        Constant Field Values
      • DOC_PATENT_ASSIGNEES

        public static final String DOC_PATENT_ASSIGNEES
        Molecule property key on results: the assignees of the patent, separated by newline characters.
        See Also:
        Constant Field Values
      • DOC_PATENT_INVENTORS

        public static final String DOC_PATENT_INVENTORS
        Molecule property key on results: the inventors of the patent, separated by newline characters.
        See Also:
        Constant Field Values
      • CONFIDENCE

        public static final String CONFIDENCE
        Molecule property key on results: the confidence that the structure is correct.

        0 or less means very little confidence. 1 or more means high confidence.

        This is currently set for structures found by image recognition, that is, Optical Structure Recognition (OSR), also known as "chemical OCR".

        See Also:
        Constant Field Values
      • SECTION

        public static final String SECTION
        Molecule property key on results: the section of the document where the structure was found.

        This is currently supported only for US patents in the USPTO XML format, in which case the value of the property can be "abstract", "citation", "description" or "claim N".

        See Also:
        Constant Field Values
      • TYPE_SYSTEMATIC

        public static final String TYPE_SYSTEMATIC
        Possible value for the TYPE property: the source is a systematic name.
        See Also:
        Constant Field Values
      • TYPE_GENERIC

        public static final String TYPE_GENERIC
        Possible value for the TYPE property: the source is a generic name, for instance "C1-C4 alkyl".
        See Also:
        Constant Field Values
      • TYPE_CAS

        public static final String TYPE_CAS
        Possible value for the TYPE property: the source is a CAS Registry Number®.
        See Also:
        Constant Field Values
      • TYPE_ION

        public static final String TYPE_ION
        Possible value for the TYPE property: the source is an ion abbreviation, for instance K+ or Ca2+.
        See Also:
        Constant Field Values
      • TYPE_PEPTIDE

        public static final String TYPE_PEPTIDE
        Possible value for the TYPE property: the source is a peptide notation, for instance Val-Gly-Ser-Ala.
        See Also:
        Constant Field Values
      • TYPE_CDX

        public static final String TYPE_CDX
        Possible value for the TYPE property: the source is an embedded ChemDraw structure.
        See Also:
        Constant Field Values
      • TYPE_MRV

        public static final String TYPE_MRV
        Possible value for the TYPE property: the source is an embedded Chemaxon MRV structure.
        See Also:
        Constant Field Values
      • TYPE_SYMYX

        public static final String TYPE_SYMYX
        Possible value for the TYPE property: the source is an embedded Symyx/ISIS draw structure.
        See Also:
        Constant Field Values
      • TYPE_OSR

        public static final String TYPE_OSR
        Possible value for the TYPE property: the source is a structure image recognized by Optical Structure Recognition.
        See Also:
        Constant Field Values
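As an illustration of how some of the property values documented above might be interpreted once they have been read from a result molecule, here is a minimal sketch in plain Java. The class D2sPropertyValues and its methods are hypothetical helpers, not part of the Chemaxon API. The sketch assumes the values have already been obtained as strings (and, for CONFIDENCE, parsed to a double). The "intermediate" label for confidence values between 0 and 1 is also an assumption, since only the two ends of the scale are described.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.OptionalInt;

/** Hypothetical helpers for interpreting property values; not part of the Chemaxon API. */
final class D2sPropertyValues {

    private D2sPropertyValues() {}

    /**
     * Splits a newline-separated value (as described for DOC_PATENT_ASSIGNEES,
     * DOC_PATENT_INVENTORS, DOC_PATENT_IPC and DOC_PATENT_IPCR) into
     * trimmed, non-empty entries. A null value yields an empty list.
     */
    static List<String> splitLines(String value) {
        List<String> entries = new ArrayList<>();
        if (value == null) {
            return entries;
        }
        for (String line : value.split("\\R")) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty()) {
                entries.add(trimmed);
            }
        }
        return entries;
    }

    /**
     * Returns N for a SECTION value of the form "claim N", and an empty
     * result for any other value ("abstract", "citation", "description", ...).
     */
    static OptionalInt claimNumber(String section) {
        String prefix = "claim ";
        if (section == null || !section.startsWith(prefix)) {
            return OptionalInt.empty();
        }
        try {
            return OptionalInt.of(Integer.parseInt(section.substring(prefix.length()).trim()));
        } catch (NumberFormatException e) {
            return OptionalInt.empty();
        }
    }

    /**
     * Maps a CONFIDENCE value onto a coarse label: 0 or less is "low",
     * 1 or more is "high". The "intermediate" label is an assumption.
     */
    static String confidenceLabel(double confidence) {
        if (confidence <= 0) {
            return "low";
        }
        if (confidence >= 1) {
            return "high";
        }
        return "intermediate";
    }

    public static void main(String[] args) {
        System.out.println(splitLines("Alice Smith\nBob Jones"));
        System.out.println(claimNumber("claim 12"));
        System.out.println(confidenceLabel(0.4));
    }
}
```

In practice the input strings would come from the molecule properties stored under the keys DOC_PATENT_INVENTORS, SECTION and CONFIDENCE. How the confidence value is stored and parsed is not specified here, so the double parameter is a simplification.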
    • Constructor Detail

      • DocumentToStructure

        public DocumentToStructure()
    • Method Detail

      • process

        public static MolImporter process(String text)
        Creates a MolImporter instance to import structures from a given text using the default format options.

        A shorthand for process(text, null).

      • process

        public static MolImporter process(String text,
                                          String options)
        Creates a MolImporter instance to import structures from a given text.

        Generally, the text is treated as plain text. However, for convenience, text that starts immediately with an XML or HTML prologue is recognized as such instead of plain text. For complete documents, a direct call to a MolImporter constructor is often more appropriate than loading the whole document into a String object.

        The returned MolImporter instance does no actual resource management, so closing it is not necessary.

        Parameters:
        text - the plain text or HTML/XML to process
        options - the "d2s" format options passed to MolImporter, or null if the default options should be used. Starting the String with "d2s:" is optional.
        Returns:
        a MolImporter that can be used to read the structures found in the text.
      • isMetadataMol

        public static boolean isMetadataMol(Molecule m)