chemaxon.pandasutil.dataframeutil
Converts a Molecule object to its SVG representation.
Parameters
- mol: Molecule - The molecule to be converted to SVG.
Returns
str - SVG representation of the molecule.
Loads molecules from a file and prepares them for use in a pandas DataFrame.
Parameters
- file_name: The file containing the molecules.
- mol_format: The format of the input file. If not set, the format is recognized automatically.
- molecule_column: The string label of the column containing the Molecule object inside the DataFrame.
- molecule_str_column: The label of the column containing the CXSMILES representation of the Molecule object.
- read_properties_to_columns: If True, the properties of the Molecule objects will be added as separate columns to the DataFrame.
Returns
A Python dict object filled with Molecule data that can be passed to a pandas.DataFrame constructor.
Generator that yields dictionary inputs for pandas.DataFrame in batches. It avoids loading the entire file into memory, which is useful for files with millions of molecules.
Parameters
- file_name: The file containing the molecules.
- mol_format: The format of the input file. If not set, the format is recognized automatically.
- batch_size: The number of Molecule objects in each batch.
- molecule_column: The string label of the column containing the Molecule object inside the DataFrame.
- molecule_str_column: The label of the column containing the CXSMILES representation of the Molecule object.
- read_properties_to_columns: If True, the properties of the Molecule objects will be added as separate columns to the DataFrame.
Returns
A Python Generator[dict] object. Each yielded dict is filled with Molecule data and can be passed to a pandas.DataFrame constructor.
Examples
Build one DataFrame from all batches (convenient, but fully materializes in memory):
import pandas as pd
import chemaxon as cxn
batch_iter = cxn.load_molecules_for_pandas_batches(
    "./resources/nci_random_992.smiles",
    batch_size=200,
    molecule_column="Mol",
    molecule_str_column="cxsmiles",
)
df = pd.concat((pd.DataFrame(batch) for batch in batch_iter), ignore_index=True)
print(df.shape)
Process batches incrementally (memory-friendly):
import pandas as pd
import chemaxon as cxn
for i, batch in enumerate(
    cxn.load_molecules_for_pandas_batches("./resources/nci_random_992.smiles", batch_size=200)
):
    df_batch = pd.DataFrame(batch)
    print(f"batch={i}, rows={len(df_batch)}")
    # process df_batch, then release it
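The batching pattern used in the examples above can be illustrated without the library. The sketch below is a generic stand-in, not ChemAxon's implementation: the helper iter_column_batches and its record format are hypothetical, and it assumes each batch is a column-oriented dict (column label mapped to a list of values), which is one of the input shapes the pandas.DataFrame constructor accepts.

```python
from itertools import islice
from typing import Any, Dict, Iterable, Iterator, List


def iter_column_batches(
    records: Iterable[Dict[str, Any]],
    batch_size: int,
) -> Iterator[Dict[str, List[Any]]]:
    """Yield column-oriented dicts holding at most batch_size records each.

    Hypothetical helper for illustration only; records are plain dicts
    that all share the same keys.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(records)
    while True:
        # Pull only the next batch_size records; the rest stay unread.
        chunk = list(islice(iterator, batch_size))
        if not chunk:
            return
        # Pivot row dicts into {column label: list of values}.
        yield {key: [row[key] for row in chunk] for key in chunk[0]}


# Stand-in records; real input would come from a molecule file.
rows = ({"cxsmiles": f"C{'C' * i}", "index": i} for i in range(5))
batches = list(iter_column_batches(rows, batch_size=2))
batch_sizes = [len(batch["index"]) for batch in batches]
```

Because the source is consumed lazily, only one batch of records is held at a time as long as the caller does not collect the batches into a list, as the last lines do here purely for inspection.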
Prepares a dict object to be loaded into a pandas DataFrame from a list of Molecule objects.
Note: If properties are added as separate columns to the returned dictionary, the property keys
are collected from all the molecules. If a molecule does not have a specific property, None is added to the
corresponding column.
Parameters
- mol_list: A list of Molecule objects.
- molecule_column: The string label of the column containing the Molecule object inside the DataFrame.
- molecule_str_column: The label of the column containing the str (as CXSMILES) representation of the Molecule object.
- read_properties_to_columns: If True, the properties of the Molecule objects will be added as separate columns to the DataFrame.
Returns
A Python dict object filled with Molecule data that can be passed to a pandas.DataFrame constructor.
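The note above on property columns (keys collected from all molecules, None where a molecule lacks a property) can be sketched in plain Python. This is an illustration of the described behavior, not the library's code: the helper properties_to_columns is hypothetical, plain dicts stand in for the properties of Molecule objects, and the sample values are placeholders.

```python
from typing import Any, Dict, List, Optional


def properties_to_columns(
    property_maps: List[Dict[str, Any]],
) -> Dict[str, List[Optional[Any]]]:
    """Build one column per property key seen in any molecule."""
    # Collect keys in first-seen order so the column order is deterministic.
    keys: List[str] = []
    for props in property_maps:
        for key in props:
            if key not in keys:
                keys.append(key)
    # dict.get returns None for molecules that lack the property.
    return {key: [props.get(key) for props in property_maps] for key in keys}


# Stand-in property maps for three molecules (placeholder values).
property_maps = [
    {"ID": "m1", "logP": 1.2},
    {"ID": "m2"},
    {"logP": 0.5, "source": "example"},
]
columns = properties_to_columns(property_maps)
```

Every column ends up with one entry per molecule, so the resulting dict has equal-length lists and fits the column-oriented input form of pandas.DataFrame.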