Class DocumentToStructure

java.lang.Object
chemaxon.naming.document.DocumentToStructure

@PublicApi public class DocumentToStructure extends Object
A convenience class for dealing with String-based documents, such as TXT, HTML, or XML files. It also defines keys for the properties that are typically extracted from documents and made available on the resulting molecules.

See the constructors of MolImporter for more general and common document-to-structure operations.

Since:
5.9
  • Field Summary

    Fields
    Modifier and Type
    Field
    Description
    static final String
    BYTE
    For internal usage only.
    static final String
    CHARACTER
    Molecule property key on results: the starting character offset from the beginning of the document, for text formats (HTML, XML, TXT).
    static final String
    CONFIDENCE
    Molecule property key on results: the confidence that the structure is correct.
    static final String
    CONTEXT
    Molecule property key on results: the context of the structure recognized in the text.
    static final String
    CONTEXT_INDEX
    Molecule property key on results: index of the hit inside the context.
    static final String
    DOC_AUTHOR
    Molecule property key on results: name of the principal author(s) of a document.
    static final String
    DOC_CREATION_DATE
    Molecule property key on results: the date on which the document was created.
    static final String
    DOC_LAST_AUTHOR
    Molecule property key on results: name of the last (most recent) author of a document.
    static final String
    DOC_PATENT_ASSIGNEES
    Molecule property key on results: the assignees of the patent, separated by newline characters.
    static final String
    DOC_PATENT_ID
    Molecule property key on results: the patent identifier.
    static final String
    DOC_PATENT_INVENTORS
    Molecule property key on results: the inventors of the patent, separated by newline characters.
    static final String
    DOC_PATENT_IPC
    Molecule property key on results: the IPC classification(s) for the patent, separated by newline characters.
    static final String
    DOC_PATENT_IPCR
    Molecule property key on results: the IPCR classification(s) for the patent, separated by newline characters.
    static final String
    DOC_TITLE
    Molecule property key on results: the title of the document.
    static final String
    DOCUMENT
    Molecule property key on results: the file name of the source document.
    static final String
    END_CHARACTER
    Molecule property key on results: the ending character offset from the beginning of the document, for text formats (HTML, XML, TXT).
    static final String
    IDENTIFIER
     
    static final String
    PAGE
    Molecule property key on results: the page number, if applicable (e.g. for a PDF document).
    static final String
    SECTION
    Molecule property key on results: the section of the document where the structure was found.
    static final String
    SOURCE_TEXT
    Molecule property key on results: the source text, as it appears in the original document.
    static final String
    TYPE
    Molecule property key on results: the type of source for the structure.
    static final String
    TYPE_CAS
    Possible value for the TYPE property: the source is a CAS Registry Number®.
    static final String
    TYPE_CDX
    Possible value for the TYPE property: the source is an embedded ChemDraw structure.
    static final String
    TYPE_COMMON
    Possible value for the TYPE property: the source is a common name.
    static final String
    TYPE_EC
    Possible value for the TYPE property: the source is an EC Number.
    static final String
    TYPE_GENERIC
    Possible value for the TYPE property: the source is a generic name, for instance "C1-C4 alkyl".
    static final String
    TYPE_INCHI
    Possible value for the TYPE property: the source is an InChI string.
    static final String
    TYPE_ION
    Possible value for the TYPE property: the source is an ion abbreviation, for instance K+ or Ca2+.
    static final String
    TYPE_MRV
    Possible value for the TYPE property: the source is an embedded Chemaxon MRV structure.
    static final String
    TYPE_OSR
    Possible value for the TYPE property: the source is a structure image recognized by Optical Structure Recognition.
    static final String
    TYPE_PEPTIDE
    Possible value for the TYPE property: the source is a peptide notation, for instance Val-Gly-Ser-Ala.
    static final String
    TYPE_SMILES
    Possible value for the TYPE property: the source is a SMILES string.
    static final String
    TYPE_SYMYX
    Possible value for the TYPE property: the source is an embedded Symyx/ISIS draw structure.
    static final String
    TYPE_SYSTEMATIC
    Possible value for the TYPE property: the source is a systematic name.
  • Constructor Summary

    Constructors
    Constructor
    Description
    DocumentToStructure()
     
  • Method Summary

    Modifier and Type
    Method
    Description
    static boolean
    isMetadataMol(Molecule m)
     
    static MolImporter
    process(String text)
    Creates a MolImporter instance to import structures from a given text using the default format options.
    static MolImporter
    process(String text, String options)
    Creates a MolImporter instance to import structures from a given text.

    Methods inherited from class java.lang.Object

    clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
  • Field Details

    • SOURCE_TEXT

      public static final String SOURCE_TEXT
      Molecule property key on results: the source text, as it appears in the original document.
    • DOCUMENT

      public static final String DOCUMENT
      Molecule property key on results: the file name of the source document.
    • PAGE

      public static final String PAGE
      Molecule property key on results: the page number, if applicable (e.g. for a PDF document).
    • CHARACTER

      public static final String CHARACTER
      Molecule property key on results: the starting character offset from the beginning of the document, for text formats (HTML, XML, TXT).
    • END_CHARACTER

      public static final String END_CHARACTER
      Molecule property key on results: the ending character offset from the beginning of the document, for text formats (HTML, XML, TXT).
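The CHARACTER and END_CHARACTER offsets can be used to locate a hit in the original text. The sketch below is plain Java and does not call the Chemaxon API. It assumes the two property values have already been read from a result molecule as strings, that they are decimal integers, and that the end offset is exclusive. None of these details is stated in this page, so they should be checked against real output. OffsetSlice and slice are hypothetical names.

```java
/** Hypothetical helper: cuts a hit out of the document text by its offsets. */
final class OffsetSlice {

    private OffsetSlice() {}

    /**
     * Returns the part of the document between the two offsets.
     * Assumption: both values are decimal integers and the end offset is exclusive.
     */
    static String slice(String document, String startValue, String endValue) {
        int start = Integer.parseInt(startValue.trim());
        int end = Integer.parseInt(endValue.trim());
        if (start < 0 || end < start || end > document.length()) {
            throw new IllegalArgumentException(
                    "Offsets out of range: " + start + ".." + end);
        }
        return document.substring(start, end);
    }

    public static void main(String[] args) {
        String document = "Take 5 mg of aspirin daily.";
        // Offsets chosen by hand for this illustration, not produced by the library.
        System.out.println(slice(document, "13", "20"));
    }
}
```

If the end offset turns out to be inclusive in practice, the substring call would need end + 1 instead.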
    • BYTE

      public static final String BYTE
      For internal usage only.
    • IDENTIFIER

      public static final String IDENTIFIER
    • DOC_AUTHOR

      public static final String DOC_AUTHOR
      Molecule property key on results: name of the principal author(s) of a document.
    • DOC_LAST_AUTHOR

      public static final String DOC_LAST_AUTHOR
      Molecule property key on results: name of the last (most recent) author of a document.
    • DOC_TITLE

      public static final String DOC_TITLE
      Molecule property key on results: the title of the document.
    • DOC_CREATION_DATE

      public static final String DOC_CREATION_DATE
      Molecule property key on results: the date on which the document was created.
    • DOC_PATENT_ID

      public static final String DOC_PATENT_ID
      Molecule property key on results: the patent identifier.
    • DOC_PATENT_IPC

      public static final String DOC_PATENT_IPC
      Molecule property key on results: the IPC classification(s) for the patent, separated by newline characters.
    • DOC_PATENT_IPCR

      public static final String DOC_PATENT_IPCR
      Molecule property key on results: the IPCR classification(s) for the patent, separated by newline characters.
    • DOC_PATENT_ASSIGNEES

      public static final String DOC_PATENT_ASSIGNEES
      Molecule property key on results: the assignees of the patent, separated by newline characters.
    • DOC_PATENT_INVENTORS

      public static final String DOC_PATENT_INVENTORS
      Molecule property key on results: the inventors of the patent, separated by newline characters.
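The DOC_PATENT_IPC, DOC_PATENT_IPCR, DOC_PATENT_ASSIGNEES and DOC_PATENT_INVENTORS properties each hold several values separated by newline characters. A minimal sketch of splitting such a value into a list follows. It is plain Java and assumes the raw property value has already been obtained as a String; MultiValueProperty is a hypothetical helper, and the sample names are made up.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for properties whose values are newline-separated lists. */
final class MultiValueProperty {

    private MultiValueProperty() {}

    /** Splits a raw property value on line breaks, dropping blank entries. */
    static List<String> split(String rawValue) {
        List<String> values = new ArrayList<>();
        if (rawValue == null) {
            return values; // treat a missing property as an empty list
        }
        // \R matches any line break sequence (\n, \r\n, ...).
        for (String line : rawValue.split("\\R")) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty()) {
                values.add(trimmed);
            }
        }
        return values;
    }

    public static void main(String[] args) {
        // Made-up value standing in for a DOC_PATENT_INVENTORS property.
        System.out.println(split("Jane Doe\nJohn Smith"));
    }
}
```

Trimming each entry and skipping blank lines is a defensive choice of this sketch, not documented behavior of the library.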
    • CONFIDENCE

      public static final String CONFIDENCE
      Molecule property key on results: the confidence that the structure is correct.

      A value of 0 or less means very little confidence. A value of 1 or more means high confidence.

      This property is currently set for structures produced by image recognition, that is, Optical Structure Recognition (OSR), also known as "chemical OCR".

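The two documented thresholds for CONFIDENCE can be turned into a coarse label as sketched below. The code is plain Java with a hypothetical ConfidenceLevel helper. The MEDIUM label for values strictly between 0 and 1 is an assumption of this sketch, since that range is not described here, and so is the idea that the property value parses as a double.

```java
/** Hypothetical helper that buckets a CONFIDENCE value. */
final class ConfidenceLevel {

    private ConfidenceLevel() {}

    /**
     * Maps a confidence value to a coarse label using the two documented
     * thresholds. The "MEDIUM" bucket for values strictly between 0 and 1
     * is an assumption; the documentation does not describe that range.
     */
    static String classify(double confidence) {
        if (confidence <= 0.0) {
            return "LOW";
        }
        if (confidence >= 1.0) {
            return "HIGH";
        }
        return "MEDIUM";
    }

    public static void main(String[] args) {
        // Assumption: the property value was read as text and parses as a double.
        System.out.println(classify(Double.parseDouble("0.4")));
    }
}
```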
    • SECTION

      public static final String SECTION
      Molecule property key on results: the section of the document where the structure was found.

      This is currently supported only for US patents in the USPTO XML format, in which case the value of the property can be "abstract", "citation", "description" or "claim N".

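Because a SECTION value can take the form "claim N", client code may want the claim number on its own. A minimal sketch follows, assuming N is a decimal number separated from the word "claim" by whitespace. SectionValue is a hypothetical helper written in plain Java; it does not call the Chemaxon API.

```java
import java.util.OptionalInt;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper for interpreting SECTION property values. */
final class SectionValue {

    // Matches "claim N" where N is a run of up to nine decimal digits (assumed form).
    private static final Pattern CLAIM = Pattern.compile("claim\\s+(\\d{1,9})");

    private SectionValue() {}

    /** Returns the claim number for a "claim N" value, or empty for any other value. */
    static OptionalInt claimNumber(String section) {
        if (section == null) {
            return OptionalInt.empty();
        }
        Matcher matcher = CLAIM.matcher(section.trim());
        if (!matcher.matches()) {
            return OptionalInt.empty();
        }
        return OptionalInt.of(Integer.parseInt(matcher.group(1)));
    }

    public static void main(String[] args) {
        System.out.println(claimNumber("claim 12"));
        System.out.println(claimNumber("abstract"));
    }
}
```

Values such as "abstract", "citation" and "description" yield an empty result, so callers can tell claims apart from the other sections.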
    • CONTEXT

      public static final String CONTEXT
      Molecule property key on results: the context of the structure recognized in the text.
    • CONTEXT_INDEX

      public static final String CONTEXT_INDEX
      Molecule property key on results: index of the hit inside the context.
    • TYPE

      public static final String TYPE
      Molecule property key on results: the type of source for the structure.
    • TYPE_SYSTEMATIC

      public static final String TYPE_SYSTEMATIC
      Possible value for the TYPE property: the source is a systematic name.
    • TYPE_COMMON

      public static final String TYPE_COMMON
      Possible value for the TYPE property: the source is a common name.
    • TYPE_GENERIC

      public static final String TYPE_GENERIC
      Possible value for the TYPE property: the source is a generic name, for instance "C1-C4 alkyl".
    • TYPE_SMILES

      public static final String TYPE_SMILES
      Possible value for the TYPE property: the source is a SMILES string.
    • TYPE_INCHI

      public static final String TYPE_INCHI
      Possible value for the TYPE property: the source is an InChI string.
    • TYPE_CAS

      public static final String TYPE_CAS
      Possible value for the TYPE property: the source is a CAS Registry Number®.
    • TYPE_EC

      public static final String TYPE_EC
      Possible value for the TYPE property: the source is an EC Number.
    • TYPE_ION

      public static final String TYPE_ION
      Possible value for the TYPE property: the source is an ion abbreviation, for instance K+ or Ca2+.
    • TYPE_PEPTIDE

      public static final String TYPE_PEPTIDE
      Possible value for the TYPE property: the source is a peptide notation, for instance Val-Gly-Ser-Ala.
    • TYPE_CDX

      public static final String TYPE_CDX
      Possible value for the TYPE property: the source is an embedded ChemDraw structure.
    • TYPE_MRV

      public static final String TYPE_MRV
      Possible value for the TYPE property: the source is an embedded Chemaxon MRV structure.
    • TYPE_SYMYX

      public static final String TYPE_SYMYX
      Possible value for the TYPE property: the source is an embedded Symyx/ISIS draw structure.
    • TYPE_OSR

      public static final String TYPE_OSR
      Possible value for the TYPE property: the source is a structure image recognized by Optical Structure Recognition.
  • Constructor Details

    • DocumentToStructure

      public DocumentToStructure()
  • Method Details

    • process

      public static MolImporter process(String text)
      Creates a MolImporter instance to import structures from a given text using the default format options.

      A shorthand for process(text, null).

    • process

      public static MolImporter process(String text, String options)
      Creates a MolImporter instance to import structures from a given text.

      Generally, the text is treated as plain text. However, for convenience, text that begins directly with an XML or HTML prologue is recognized and processed as XML or HTML instead. For complete documents, a direct call to a MolImporter constructor is often more appropriate than loading the whole document into a String object.

      The returned MolImporter instance performs no actual resource management, so closing it is not necessary.

      Parameters:
      text - the plain text or HTML/XML to process
      options - the "d2s" format options passed to MolImporter, or null if the default options should be used. Starting the String with "d2s:" is optional.
      Returns:
      a MolImporter that can be used to read the structures found in the text.
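Since the "d2s:" prefix is optional, two option strings that differ only by that prefix describe the same options. The sketch below shows one way client code might normalize such strings before comparing or logging them. D2sOptions is a hypothetical helper in plain Java, not part of the Chemaxon API, and "someOption" is a placeholder rather than a real d2s option name.

```java
import java.util.Objects;

/** Hypothetical helper illustrating the optional "d2s:" prefix. */
final class D2sOptions {

    private static final String PREFIX = "d2s:";

    private D2sOptions() {}

    /** Returns the options without a leading "d2s:", or null if options is null. */
    static String withoutPrefix(String options) {
        if (options == null) {
            return null; // null stands for the default options
        }
        return options.startsWith(PREFIX)
                ? options.substring(PREFIX.length())
                : options;
    }

    /** True when both strings denote the same options, ignoring the optional prefix. */
    static boolean sameOptions(String first, String second) {
        return Objects.equals(withoutPrefix(first), withoutPrefix(second));
    }

    public static void main(String[] args) {
        // "someOption" is a placeholder, not a documented d2s option.
        System.out.println(sameOptions("d2s:someOption", "someOption"));
    }
}
```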
    • isMetadataMol

      public static boolean isMetadataMol(Molecule m)