Class ECFPFeatureLookup


  • @PublicAPI
    public class ECFPFeatureLookup
    extends Object
    Class for retrieving the substructural features of ECFP fingerprints. ECFPs are represented either as lists of integer identifiers or as fixed-length bit strings, in which the identifiers and bit positions account for particular substructural features of the input molecule. This class provides a lookup service for both kinds of ECFP representations.

    A related class, ECFPFeature, represents the substructural features of ECFP fingerprints. More precisely, each ECFPFeature instance captures a circular atom neighborhood of the input molecule by recording a central atom and a diameter. The ECFP generation process assigns integer identifiers to these substructural features by a hashing procedure. The positions of the 1 bits in the fixed-length bit string representation are derived from these identifiers. This lookup class provides methods to obtain the ECFP features represented by a given identifier or bit position.

    Note that there is no one-to-one relationship between the substructural features and the generated identifiers. Therefore, the lookup methods of this class return a list of corresponding ECFPFeature objects for the given identifier or bit position. As expected, atom neighborhoods that are equivalent with respect to the considered atom properties are represented by the same identifier and bit position. However, unwanted collisions may also occur, especially in the fixed-length bit string representation. That is, completely different substructural features may be represented by the same bit position due to the applied hashing method (folding). In such cases, the lookup methods list all represented features. (These collisions are an inevitable consequence of the limited representation capability of fixed-length fingerprints.)
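
    The effect of folding can be illustrated with a small standalone sketch. It assumes, purely for illustration, that folding is "identifier modulo bit-string length"; the folding function actually applied by the ECFP implementation is not stated here and may differ. The class name, the method names, and the identifier values are all hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: models folding as "identifier modulo bit-string length".
// This is an assumption, not the documented ECFP folding function.
class FoldingCollisionSketch {

    // Hypothetical folding rule: maps an identifier to a position in [0, length).
    static int fold(int identifier, int length) {
        return Math.floorMod(identifier, length);
    }

    // Groups identifiers by the bit position they fold to, in encounter order.
    static Map<Integer, List<Integer>> groupByBit(int[] identifiers, int length) {
        Map<Integer, List<Integer>> byBit = new LinkedHashMap<>();
        for (int id : identifiers) {
            byBit.computeIfAbsent(fold(id, length), k -> new ArrayList<>()).add(id);
        }
        return byBit;
    }

    public static void main(String[] args) {
        // Made-up identifiers; real ECFP identifiers are hash values.
        int[] ids = {1027, 3, -1021, 2050};
        // Three of the four identifiers land on bit position 3.
        System.out.println(groupByBit(ids, 1024));
    }
}
```

    Under this assumed rule, several distinct identifiers share one bit position, which is why a lookup by bit position can return more features than a lookup by identifier.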

    Apart from collisions, it is also possible that two different identifiers represent the same atom neighborhood, each originating from a different central atom. In such cases, the fingerprint generation method eliminates the redundancy by keeping only one representation according to a specific rule. For example, in the ECFP fingerprints of CO, only three of the four generated identifiers (bits) are kept.
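
    One way to read the CO example: each of the two heavy atoms yields a diameter-0 neighborhood (the atom itself) and a diameter-2 neighborhood, and both diameter-2 neighborhoods cover the same two atoms. The sketch below models that reading. The Neighborhood class and the "keep the first one seen" rule are hypothetical stand-ins; the rule actually applied by the fingerprint generator is not specified here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Illustrative only: removes neighborhoods that cover the same set of atoms.
class DuplicateNeighborhoodSketch {

    // Hypothetical stand-in for a feature: center atom index, diameter, covered atoms.
    static final class Neighborhood {
        final int center;
        final int diameter;
        final Set<Integer> atoms;

        Neighborhood(int center, int diameter, Integer... atoms) {
            this.center = center;
            this.diameter = diameter;
            this.atoms = new TreeSet<>(Arrays.asList(atoms));
        }
    }

    // Assumed rule: keep the first neighborhood seen for each distinct atom set.
    static List<Neighborhood> removeDuplicates(List<Neighborhood> generated) {
        Set<Set<Integer>> seen = new HashSet<>();
        List<Neighborhood> kept = new ArrayList<>();
        for (Neighborhood n : generated) {
            if (seen.add(n.atoms)) {
                kept.add(n);
            }
        }
        return kept;
    }

    // Four neighborhoods of a two-heavy-atom molecule (atom 0 = C, atom 1 = O).
    static List<Neighborhood> twoAtomExample() {
        return Arrays.asList(
            new Neighborhood(0, 0, 0),      // C alone
            new Neighborhood(1, 0, 1),      // O alone
            new Neighborhood(0, 2, 0, 1),   // C and its neighbor O
            new Neighborhood(1, 2, 0, 1));  // O and its neighbor C: same atoms
    }

    public static void main(String[] args) {
        List<Neighborhood> generated = twoAtomExample();
        List<Neighborhood> kept = removeDuplicates(generated);
        System.out.println(kept.size() + " of " + generated.size() + " kept");
    }
}
```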

    This class requires ECFP configuration parameters, which determine both the generation process of ECFP features and the standardization actions that should be applied to the input molecule. Use exactly the same configuration parameters for fingerprint generation and feature retrieval to ensure correct results.

    For more information about ECFPs, see the related HTML documentation.

    Typical usage

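        // mol: the input Molecule; id: an integer identifier from its ECFP
        ECFPFeatureLookup lookup = new ECFPFeatureLookup();
        lookup.processMolecule(mol);
        // Print every substructural feature represented by the identifier
        for (ECFPFeature f : lookup.getFeaturesFromIdentifier(id)) {
            System.out.println(f.getSubstructure().toFormat("SMARTS"));
        }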
     

    Since:
    JChem 5.5
    See Also:
    ECFPFeature, ECFP
    • Constructor Detail

      • ECFPFeatureLookup

        public ECFPFeatureLookup()
        Creates a new ECFPFeatureLookup instance with the default ECFP configuration parameters.
      • ECFPFeatureLookup

        public ECFPFeatureLookup(String configString)
        Creates a new ECFPFeatureLookup instance with the given ECFP configuration parameters.
        Parameters:
        configString - ECFP configuration string in XML
      • ECFPFeatureLookup

        public ECFPFeatureLookup(ECFPParameters params)
        Creates a new ECFPFeatureLookup instance with the given ECFP configuration parameters.
        Parameters:
        params - ECFP parameters object
    • Method Detail

      • processMolecule

        public void processMolecule(Molecule mol)
        Performs the necessary preprocessing for the given molecule.
        Parameters:
        mol - the molecule
      • getFeaturesFromIdentifier

        public List<ECFPFeature> getFeaturesFromIdentifier(int id)
        Returns the substructural features represented by the given integer identifier. If no such feature is found, this method returns an empty list.
        Parameters:
        id - the identifier
        Returns:
        the list of ECFP features
      • getFeaturesFromBitPosition

        public List<ECFPFeature> getFeaturesFromBitPosition(int bitPos)
        Returns the substructural features represented by the given bit position. If no such feature is found, this method returns an empty list.
        Parameters:
        bitPos - the position in the fixed-length bit string
        Returns:
        the list of ECFP features
      • getIdentifier

        public Integer getIdentifier(MolAtom atom,
                                     int diameter)
                              throws IllegalArgumentException
        Returns the corresponding integer identifier for the given atom neighborhood.

        Note that the generated identifier is often removed by the fingerprint generation process because the same atom neighborhood is represented by another center atom and diameter. In these cases, this function returns null.

        Parameters:
        atom - the center atom of the circular neighborhood. It must be a chemical atom that is not removed in the standardization phase.
        diameter - the diameter of the circular neighborhood. It must be an even number between zero and the maximum diameter specified by the ECFP configuration parameters.
        Returns:
        the integer identifier or null if no identifier corresponds to the given neighborhood in the generated fingerprint.
        Throws:
        IllegalArgumentException - if the central atom or the diameter is illegal (e.g., the given atom is an explicit hydrogen, which is removed by the applied standardizer).
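
        The documented constraint on the diameter argument can be checked before the call. The sketch below is standalone and assumes that both bounds are inclusive; maxDiameter is a hypothetical parameter standing for the maximum diameter taken from the ECFP configuration.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: mirrors the documented diameter constraint
// (an even number between zero and the configured maximum).
class DiameterCheckSketch {

    // Assumption: both bounds are inclusive.
    static boolean isValidDiameter(int diameter, int maxDiameter) {
        return diameter >= 0 && diameter <= maxDiameter && diameter % 2 == 0;
    }

    // Lists every diameter that passes the check, in increasing order.
    static List<Integer> validDiameters(int maxDiameter) {
        List<Integer> result = new ArrayList<>();
        for (int d = 0; d <= maxDiameter; d += 2) {
            result.add(d);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(validDiameters(4));      // prints [0, 2, 4]
        System.out.println(isValidDiameter(3, 4));  // prints false
    }
}
```

        Iterating over such a list is one way to query every neighborhood of an atom without triggering the IllegalArgumentException for an odd or out-of-range diameter.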
      • getBitPosition

        public Integer getBitPosition(MolAtom atom,
                                      int diameter)
                               throws IllegalArgumentException
        Returns the corresponding bit position for the given atom neighborhood.

        Note that the identifier generated for the neighborhood, and with it the bit position, is often removed by the fingerprint generation process because the same atom neighborhood is represented by another center atom and diameter. In these cases, this function returns null.

        Parameters:
        atom - the center atom of the circular neighborhood. It must be a chemical atom that is not removed in the standardization phase.
        diameter - the diameter of the circular neighborhood. It must be an even number between zero and the maximum diameter specified by the ECFP configuration parameters.
        Returns:
        the bit position or null if no identifier corresponds to the given neighborhood in the generated fingerprint.
        Throws:
        IllegalArgumentException - if the central atom or the diameter is illegal (e.g., the given atom is an explicit hydrogen, which is removed by the applied standardizer).
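
        Because the return type is Integer rather than int, assigning the result directly to an int variable throws a NullPointerException when the method returns null. The standalone sketch below shows one way to handle the null case explicitly; lookupBitPosition is a hypothetical stand-in for the real lookup call, and the value 42 is made up.

```java
import java.util.Optional;

// Illustrative only: handling an Integer result that may be null.
class NullableResultSketch {

    // Hypothetical stand-in for a lookup that returns null when the
    // neighborhood is not present in the generated fingerprint.
    static Integer lookupBitPosition(boolean present) {
        return present ? Integer.valueOf(42) : null;
    }

    // Converts the nullable result into a message without unboxing null.
    static String describe(Integer bitPosition) {
        return Optional.ofNullable(bitPosition)
                .map(p -> "bit " + p)
                .orElse("not present in the fingerprint");
    }

    public static void main(String[] args) {
        System.out.println(describe(lookupBitPosition(true)));   // prints bit 42
        System.out.println(describe(lookupBitPosition(false)));  // prints not present in the fingerprint
    }
}
```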
      • getBitPosition

        public int getBitPosition(int id)
        Returns the corresponding bit position for the given integer identifier.
        Parameters:
        id - the identifier
        Returns:
        the bit position