Class ECFPFeatureLookup

java.lang.Object
chemaxon.descriptors.ECFPFeatureLookup

@PublicApi public class ECFPFeatureLookup extends Object
Class for retrieving the substructural features of ECFP fingerprints. ECFPs are represented either as lists of integer identifiers or as fixed-length bit strings, in which the identifiers and bit positions account for particular substructural features of the input molecule. This class provides a lookup service for both kinds of ECFP representations.

A related class, ECFPFeature, represents the substructural features of ECFP fingerprints. More precisely, each ECFPFeature instance captures a circular atom neighborhood of the input molecule by recording a central atom and a diameter. The ECFP generation process assigns integer identifiers to these substructural features by a hashing procedure. The positions of the 1 bits in the fixed-length bit string representation are derived from these identifiers. This lookup class provides methods to obtain the ECFP features represented by a given identifier or bit position.

Note that there is no one-to-one relationship between the substructural features and the generated identifiers. Therefore, the lookup methods of this class return a list of corresponding ECFPFeature objects for the given identifier or bit position. By design, atom neighborhoods that are equivalent with respect to the considered atom properties are represented by the same identifier and bit position. However, unwanted collisions may also occur, especially in the fixed-length bit string representation. That is, completely different substructural features may be represented by the same bit position because of the applied hashing method (folding). In such cases, the lookup methods list all represented features. (These collisions are an inevitable consequence of the limited representation capacity of fixed-length fingerprints.)
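
The collision behavior described above can be illustrated with a toy model that does not use the JChem library. It assumes, purely for illustration, that an identifier is folded into a bit position by taking it modulo the bit string length; the hashing actually applied during ECFP generation may differ. The class FoldingSketch, its methods, and the sample identifiers are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Toy model of folding identifiers into a fixed-length bit string (not JChem code). */
final class FoldingSketch {

    /** Assumed folding rule: non-negative remainder modulo the bit string length. */
    static int fold(int identifier, int length) {
        return Math.floorMod(identifier, length);
    }

    /** Groups identifiers by folded bit position, so one position may map to several of them. */
    static Map<Integer, List<Integer>> reverseIndex(int[] identifiers, int length) {
        Map<Integer, List<Integer>> byBit = new TreeMap<>();
        for (int id : identifiers) {
            byBit.computeIfAbsent(fold(id, length), k -> new ArrayList<>()).add(id);
        }
        return byBit;
    }

    public static void main(String[] args) {
        int[] ids = {5, 1029, -3, 42};   // made-up identifiers
        // With length 1024, identifiers 5 and 1029 share bit 5: a folding collision.
        System.out.println(reverseIndex(ids, 1024));
    }
}
```

A lookup by bit position in this model returns every identifier stored under that position, which mirrors why the lookup methods of this class return lists rather than single features.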

Apart from collisions, two different identifiers may also represent the same atom neighborhood, each originating from a different central atom. In such cases, the fingerprint generation method eliminates the redundancy by keeping only one representation according to a specific rule. For example, in the ECFP fingerprint of CO, only three of the four generated identifiers (bits) are kept.
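
The CO example can be reproduced with a small self-contained sketch that does not use the JChem library. It enumerates the circular neighborhoods of the two heavy atoms for diameters 0 and 2 and treats two neighborhoods as redundant when they cover the same set of atoms. Both the comparison by atom set and the "keep the first one encountered" behavior are simplifying assumptions; the rule actually applied by the generator is not specified here. The class NeighborhoodSketch and its methods are hypothetical.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.TreeSet;

/** Toy enumeration of circular atom neighborhoods (not JChem code). */
final class NeighborhoodSketch {

    /** Atoms reachable from the center atom within the given number of bonds. */
    static Set<Integer> atomsWithin(int[][] adjacency, int center, int radius) {
        Set<Integer> covered = new TreeSet<>();
        covered.add(center);
        for (int step = 0; step < radius; step++) {
            Set<Integer> next = new TreeSet<>(covered);
            for (int atom : covered) {
                for (int neighbor : adjacency[atom]) {
                    next.add(neighbor);
                }
            }
            covered = next;
        }
        return covered;
    }

    /** Returns {generated, kept}: all (center, diameter) pairs versus distinct atom sets. */
    static int[] countNeighborhoods(int[][] adjacency, int maxDiameter) {
        Set<Set<Integer>> kept = new HashSet<>();
        int generated = 0;
        for (int diameter = 0; diameter <= maxDiameter; diameter += 2) {
            for (int center = 0; center < adjacency.length; center++) {
                generated++;
                // A neighborhood covering an already seen atom set is treated as redundant.
                kept.add(atomsWithin(adjacency, center, diameter / 2));
            }
        }
        return new int[] {generated, kept.size()};
    }

    public static void main(String[] args) {
        int[][] co = {{1}, {0}};   // the two heavy atoms of CO: atom 0 bonded to atom 1
        int[] counts = countNeighborhoods(co, 2);
        System.out.println(counts[0] + " generated, " + counts[1] + " kept");   // prints "4 generated, 3 kept"
    }
}
```

The two diameter-2 neighborhoods cover the same pair of atoms, so only one of them survives, which gives three kept neighborhoods out of four.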

This class requires ECFP configuration parameters, which determine both the generation process of ECFP features and the standardization actions applied to the input molecule. Use exactly the same configuration parameters for fingerprint generation and for feature retrieval; otherwise the retrieved features may not correspond to the generated fingerprint.

For more information about ECFPs, see the related HTML documentation.

Typical usage

    ECFPFeatureLookup lookup = new ECFPFeatureLookup();
    lookup.processMolecule(mol);
    for (ECFPFeature f : lookup.getFeaturesFromIdentifier(id)) {
        System.out.println(f.getSubstructure().toFormat("SMARTS"));
    }
 

Since:
JChem 5.5
  • Constructor Details

    • ECFPFeatureLookup

      public ECFPFeatureLookup()
      Creates a new ECFPFeatureLookup instance with the default ECFP configuration parameters.
    • ECFPFeatureLookup

      public ECFPFeatureLookup(String configString)
      Creates a new ECFPFeatureLookup instance with the given ECFP configuration parameters.
      Parameters:
      configString - ECFP configuration string in XML
    • ECFPFeatureLookup

      public ECFPFeatureLookup(ECFPParameters params)
      Creates a new ECFPFeatureLookup instance with the given ECFP configuration parameters.
      Parameters:
      params - ECFP parameters object
  • Method Details

    • processMolecule

      public void processMolecule(Molecule mol)
      Performs the necessary preprocessing for the given molecule.
      Parameters:
      mol - the molecule
    • getFeaturesFromIdentifier

      public List<ECFPFeature> getFeaturesFromIdentifier(int id)
      Returns the substructural features represented by the given integer identifier. If no such feature is found, this method returns an empty list.
      Parameters:
      id - the identifier
      Returns:
      the list of ECFP features
    • getFeaturesFromBitPosition

      public List<ECFPFeature> getFeaturesFromBitPosition(int bitPos)
      Returns the substructural features represented by the given bit position. If no such feature is found, this method returns an empty list.
      Parameters:
      bitPos - the position in the fixed-length bit string
      Returns:
      the list of ECFP features
    • getIdentifier

      public Integer getIdentifier(MolAtom atom, int diameter) throws IllegalArgumentException
      Returns the corresponding integer identifier for the given atom neighborhood.

      Note that the generated identifier is often removed by the fingerprint generation process because the same atom neighborhood is already represented by another center atom and diameter. In these cases, this method returns null.

      Parameters:
      atom - the center atom of the circular neighborhood. It must be a chemical atom that is not removed in the standardization phase.
      diameter - the diameter of the circular neighborhood. It must be an even number between zero and the maximum diameter specified by the ECFP configuration parameters.
      Returns:
      the integer identifier or null if no identifier corresponds to the given neighborhood in the generated fingerprint.
      Throws:
      IllegalArgumentException - if the central atom or the diameter is illegal (e.g., the given atom is an explicit hydrogen, which is removed by the applied standardizer).
    • getBitPosition

      public Integer getBitPosition(MolAtom atom, int diameter) throws IllegalArgumentException
      Returns the corresponding bit position for the given atom neighborhood.

      Note that the generated identifier is often removed by the fingerprint generation process because the same atom neighborhood is already represented by another center atom and diameter. In these cases, this method returns null.

      Parameters:
      atom - the center atom of the circular neighborhood. It must be a chemical atom that is not removed in the standardization phase.
      diameter - the diameter of the circular neighborhood. It must be an even number between zero and the maximum diameter specified by the ECFP configuration parameters.
      Returns:
      the bit position or null if no identifier corresponds to the given neighborhood in the generated fingerprint.
      Throws:
      IllegalArgumentException - if the central atom or the diameter is illegal (e.g., the given atom is an explicit hydrogen, which is removed by the applied standardizer).
    • getBitPosition

      public int getBitPosition(int id)
      Returns the corresponding bit position for the given integer identifier.
      Parameters:
      id - the identifier
      Returns:
      the bit position
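
The argument and return conventions of getIdentifier(MolAtom, int) and getBitPosition(MolAtom, int) can be sketched without the JChem library. Because both methods return an Integer that may be null, assigning the result directly to an int would throw a NullPointerException when the neighborhood was removed as redundant. The helper class below is hypothetical; it assumes that the documented diameter range includes both zero and the configured maximum.

```java
import java.util.Optional;

/** Toy helpers mirroring the documented argument and return conventions (not JChem code). */
final class LookupConventionsSketch {

    /** Documented diameter rule: an even number between zero and the maximum (assumed inclusive). */
    static boolean isValidDiameter(int diameter, int maxDiameter) {
        return diameter >= 0 && diameter <= maxDiameter && diameter % 2 == 0;
    }

    /** Wraps a nullable Integer result so that callers cannot unbox null by accident. */
    static Optional<Integer> wrap(Integer nullableResult) {
        return Optional.ofNullable(nullableResult);
    }

    public static void main(String[] args) {
        Integer removed = null;    // stands in for a neighborhood dropped as redundant
        Integer present = 12345;   // made-up identifier
        System.out.println(wrap(removed).isPresent());
        System.out.println(wrap(present).get());
        System.out.println(isValidDiameter(3, 4));
    }
}
```

Checking the diameter before the call separates the two failure modes: an invalid argument leads to an IllegalArgumentException, whereas a valid but redundant neighborhood leads to a null result.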