chemaxon.pandasutil.dataframeutil

def mol_to_svg_formatter(mol: chemaxon.molecule.Molecule) -> str:

Converts a Molecule object to its SVG representation.

Parameters
  • mol: Molecule - The molecule to be converted to SVG.
Returns

str - SVG representation of the molecule.

def load_molecules_for_pandas( file_name: str, mol_format: str = '', molecule_column: str = 'Molecule', molecule_str_column: str = None, read_properties_to_columns: bool = False) -> dict:

Loads molecules from a file and prepares them for use in a pandas DataFrame.

Parameters
  • file_name: The file containing the molecules
  • mol_format: The format of the input file. If not set, the format is detected automatically
  • molecule_column: The string label of the column containing the Molecule object inside the DataFrame
  • molecule_str_column: The label of the column containing the CXSMILES representation of the Molecule object
  • read_properties_to_columns: If True, the properties of the Molecule objects will be added as separate columns to the DataFrame
Returns

A Python dict populated with Molecule data, which can be passed to the pandas.DataFrame constructor.
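
As a rough illustration of how the two column-label parameters might shape the returned dict, here is a minimal plain-Python sketch. It does not call the library: sketch_columns is a hypothetical stand-in, plain strings stand in for Molecule objects, and both the column-oriented layout (label mapped to a list of values) and the omission of the string column when molecule_str_column is None are assumptions rather than documented behavior.

```python
from typing import Any, Dict, Iterable, List, Optional


def sketch_columns(
    molecules: Iterable[Any],
    molecule_column: str = "Molecule",
    molecule_str_column: Optional[str] = None,
) -> Dict[str, List[Any]]:
    """Hypothetical stand-in: build a column-oriented dict from molecules."""
    mols = list(molecules)
    # One column holds the molecule objects themselves (strings in this sketch).
    data: Dict[str, List[Any]] = {molecule_column: mols}
    # Assumption: the string column is only added when a label is given.
    if molecule_str_column is not None:
        data[molecule_str_column] = [str(m) for m in mols]
    return data


only_objects = sketch_columns(["CCO", "c1ccccc1"])
with_strings = sketch_columns(
    ["CCO", "c1ccccc1"], molecule_column="Mol", molecule_str_column="cxsmiles"
)
```

A dict of equal-length lists like this is one of the input shapes the pandas.DataFrame constructor accepts, with each key becoming a column label.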

def load_molecules_for_pandas_batches( file_name: str, mol_format: str = '', batch_size: int = 1000, molecule_column: str = 'Molecule', molecule_str_column: str = None, read_properties_to_columns: bool = False) -> Generator[dict, Any, NoneType]:

Generator that yields dictionary inputs for pandas.DataFrame in batches. This avoids loading the entire file into memory, which is useful for files with millions of molecules.

Parameters
  • file_name: The file containing the molecules
  • mol_format: The format of the input file. If not set, the format is detected automatically
  • batch_size: The number of molecules in each batch
  • molecule_column: The string label of the column containing the Molecule object inside the DataFrame
  • molecule_str_column: The label of the column containing the CXSMILES representation of the Molecule object
  • read_properties_to_columns: If True, the properties of the Molecule objects will be added as separate columns to the DataFrame
Returns

A Python generator that yields dict objects, one per batch. Each dict can be passed to the pandas.DataFrame constructor.
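
The batching pattern described above can be sketched in plain Python. This is not the library's implementation: iter_column_batches is a hypothetical helper, integers stand in for Molecule objects, and the single-column dict layout is an assumption. It only shows why at most batch_size items need to be held in memory at a time.

```python
from itertools import islice
from typing import Any, Dict, Iterable, Iterator, List


def iter_column_batches(
    records: Iterable[Any],
    batch_size: int = 1000,
    column: str = "Molecule",
) -> Iterator[Dict[str, List[Any]]]:
    """Hypothetical helper: yield column-oriented dicts of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(records)
    while True:
        # Pull the next chunk lazily; earlier chunks can be garbage collected.
        chunk = list(islice(iterator, batch_size))
        if not chunk:
            return
        yield {column: chunk}


batches = list(iter_column_batches(range(5), batch_size=2, column="Mol"))
batch_sizes = [len(batch["Mol"]) for batch in batches]
```

The last batch may be shorter than batch_size when the input length is not an exact multiple of it.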

Examples

Build one DataFrame from all batches (convenient, but fully materializes in memory):

import pandas as pd
import chemaxon as cxn

batch_iter = cxn.load_molecules_for_pandas_batches(
    "./resources/nci_random_992.smiles",
    batch_size=200,
    molecule_column="Mol",
    molecule_str_column="cxsmiles"
)

df = pd.concat((pd.DataFrame(batch) for batch in batch_iter), ignore_index=True)
print(df.shape)

Process batches incrementally (memory-friendly):

import pandas as pd
import chemaxon as cxn

for i, batch in enumerate(
    cxn.load_molecules_for_pandas_batches("./resources/nci_random_992.smiles", batch_size=200)
):
    df_batch = pd.DataFrame(batch)
    print(f"batch={i}, rows={len(df_batch)}")
    # process df_batch, then release it

def prepare_molecules_for_pandas( mol_list: list[chemaxon.molecule.Molecule], molecule_column: str = 'Molecule', molecule_str_column: str = None, read_properties_to_columns: bool = False) -> dict:

Prepares a dict object to be loaded into a pandas DataFrame from a list of Molecule objects.

Note: When properties are added as separate columns to the returned dictionary, the property keys are collected from all of the molecules. If a molecule does not have a given property, None is added to the corresponding column.

Parameters
  • mol_list: The list of Molecule objects
  • molecule_column: The string label of the column containing the Molecule object inside the DataFrame
  • molecule_str_column: The label of the column containing the string (CXSMILES) representation of the Molecule object
  • read_properties_to_columns: If True, the properties of the Molecule objects will be added as separate columns to the DataFrame
Returns

A Python dict populated with Molecule data, which can be passed to the pandas.DataFrame constructor.
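
The note about property columns can be illustrated with a small plain-Python sketch. It does not use the library: properties_to_columns is a hypothetical helper, each molecule is represented only by a plain dict of its properties, and the first-seen column order is an assumption. It shows the described behavior of collecting the keys from all molecules and filling missing values with None.

```python
from typing import Any, Dict, List, Optional


def properties_to_columns(
    property_dicts: List[Dict[str, Any]],
) -> Dict[str, List[Optional[Any]]]:
    """Hypothetical helper: turn per-molecule property dicts into columns."""
    # Collect every property key seen on any molecule, in first-seen order.
    keys: List[str] = []
    for props in property_dicts:
        for key in props:
            if key not in keys:
                keys.append(key)
    # dict.get returns None when a molecule lacks the property.
    return {key: [props.get(key) for props in property_dicts] for key in keys}


columns = properties_to_columns(
    [{"logP": 1.2, "name": "a"}, {"name": "b"}, {"mw": 46.07}]
)
```

Every resulting column has one entry per molecule, so the columns stay aligned with the molecule column when loaded into a DataFrame.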