This open-source repository provides the Copolymer Descriptor Database (CopDDB). The database contains parameter sets for radical-monomer pairs that can be used as descriptors of copolymers. Details of each descriptor and its application to polymer informatics can be found in our preprint on ChemRxiv.
The dataset is available as a CSV file, which includes the following descriptors.
Descriptor name | Description |
---|---|
Radical | SMILES for a radical (M1*) |
Monomer | SMILES for a monomer (M2) |
DE_tail | Reaction energy for the addition of a model initiator radical (Me*) to M1 at the tail position |
DE_head | Reaction energy for the addition of a model initiator radical (Me*) to M1 at the head position, which affords M1* |
DE_precursor | Relative energy of the precursor from the dissociation limit (M1* and M2) |
DE_TS | Relative energy of the TS of C-C bond formation from the dissociation limit (M1* and M2) |
DE_product | Relative energy of the product from the dissociation limit (M1* and M2) |
DE_barrier | Activation barrier for the C-C bond formation (i.e., the energy difference between the precursor and the TS) |
DE_reaction | Reaction energy for the C-C bond formation (i.e., the energy difference between the precursor and the product) |
E_Rad_SOMO | SOMO energy of M1* |
E_Rad_LUMO | LUMO energy of M1* |
E_Mon_HOMO | HOMO energy of M2 |
E_Mon_LUMO | LUMO energy of M2 |
DE_SHgap | Energy difference between SOMO of M1* and HOMO of M2 |
DE_SLgap | Energy difference between SOMO of M1* and LUMO of M2 |
VBur_R228_Rad | %VBur within 2.28 Å of the reactive carbon atom of M1* |
VBur_R350_Rad | %VBur within 3.50 Å of the reactive carbon atom of M1* |
VBur_R228_Mon | %VBur within 2.28 Å of the reactive carbon atom of M2 |
VBur_R350_Mon | %VBur within 3.50 Å of the reactive carbon atom of M2 |
Volume_Rad | Volume of M1* |
Volume_Mon | Volume of M2 |
CCdist_TS | Reactive C-C bond distance at the TS structure |
Dihedral_TS | Dihedral angle around the reactive C-C at the TS structure |
Sum_MW | Sum of molecular weights of M1* and M2 |
logP_Rad | Partition coefficient logP of M1* |
logP_Mon | Partition coefficient logP of M2 |
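The two derived energies in the table, DE_barrier and DE_reaction, follow from the relative energies measured against the dissociation limit. The arithmetic can be sketched as below. The numbers are made-up placeholders rather than CopDDB entries, and the sign convention (final state minus precursor) is an assumption based on the descriptions above.

```python
# Placeholder relative energies (not taken from CopDDB), all measured
# from the dissociation limit (M1* and M2).
de_precursor = -0.010
de_ts = -0.004
de_product = -0.040

# Assumed convention: barrier = TS - precursor, reaction = product - precursor.
de_barrier = de_ts - de_precursor
de_reaction = de_product - de_precursor

print(f"DE_barrier  = {de_barrier:.3f}")
print(f"DE_reaction = {de_reaction:.3f}")
```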
List of monomers.
CopDDB includes Python modules for reading and processing the CSV files. We confirmed that the code works with Python 3.10.12, numpy 1.25.2, pandas 1.5.3, and RDKit 2023.09.5. Sample code that runs on Google Colab is available.
git clone https://github.com/hatanaka-lab/CopDDB
To obtain the list of SMILES of the monomers available in CopDDB, use the copddb.datasets.get_available_smiles() function.
>>> from CopDDB import copddb
>>> copddb.datasets.get_available_smiles()
['C=CC(=O)OC(C)(C)C', 'C=CC(=O)OCCCCCCCCCCCCCCCCCC', 'C=C(C)C(=O)OC12CC3CC(O)(CC(O)(C3)C1)C2', 'C=CC(=O)NC(C)(C)CS(=O)(=O)O', 'C=CC(=O)OC1C[C@@H]2CC[C@@]1(C)C2(C)C', 'C=C(C)C(=O)OC', 'C=CC(=O)OCC(C)O', 'C=C(C)C(=O)OC(C)(C)C', 'C=C(C)C(=O)OCC(C)C', 'C=C(C)C(=O)OCc1ccccc1', 'C=C(C)C(=O)O', 'C=C(C)C(=O)OCCCCCCCCCCCCCCCCCC', 'C=COCCOCCOC(=O)C=C', 'C=C(C)C(=O)OCCN(CC)CC', 'C=C(C)C(=O)OCC(CC)CCCC', 'C=C(C)C(=O)OCC(C)O', 'C=C(C)C(=O)OCCCC', 'C=CC(=O)OCCOC', 'C=CC(=O)OCC1(CC)COC1', 'C=CC(=O)OCCCCOCC1CO1', 'C=CC(=O)OCCOc1ccccc1', 'C=CC(=O)OCCCCCCC(C)C', 'C=Cc1ccc(OC(C)=O)cc1', 'C=CC(N)=O', 'C=C(C)C(=O)OC1CCCCC1', 'C=C(C)C(=O)OCCO', 'C=C(C)C(=O)O[C@@]12C[C@H]3C[C@@H](C1)C[C@](O)(C3)C2', 'C=CC(=O)OCCCCCCCCCCCC', 'C=C(C)C(=O)OCCCCCCCCCCCC', 'C=Cc1ccccc1', 'C=CC(=O)OCC1CCCO1', 'C=C(C)C(=O)OCC1(CC)COC1', 'C=C(C)C(=O)OC1CC2CC1C1CCCC21', 'C=C(C)C(=O)OCCOC1CC2CC1C1C=CCC21', 'C=CC(=O)OCC(C)C', 'C=C(C)C(=O)OCCN(C)C', 'C=C(C)C(=O)OCC', 'C=CC(=O)OCC1CCC(CO)CC1', 'C=C(C)C(=O)OCC1CO1', 'C=CC(=O)OCCOC1CC2CC1C1CC=CC21', 'C=CC(=O)O', 'C=CC(=O)OC', 'C=CC(=O)OCCO', 'CCOC(=O)C=C(OCC)OCC', 'C=CC(=O)OCCCCCCCCCCCCCCCC(C)C', 'C=CC(=O)OCCCCCCCC', 'C=C(C)C(=O)OCC1CCCO1', 'CO/C=C\\C(=O)OC', 'C=CC(=O)OCCCCO', 'C=CC(=O)O[C@@]12C[C@H]3C[C@@H](C1)C[C@](O)(C3)C2']
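Before querying descriptors, it can help to check which radical-monomer pairs are built only from listed SMILES. The sketch below uses a hypothetical helper, split_pairs, which is not part of CopDDB. It assumes that SMILES must match the listed strings exactly and that every pair of listed monomers is present in the database. Neither assumption is confirmed here.

```python
# A few SMILES copied from the list above; in practice the full list
# returned by copddb.datasets.get_available_smiles() would be used.
available = {"C=CC(=O)O", "C=C(C)C(=O)OC", "C=Cc1ccccc1"}


def split_pairs(pairs, available):
    """Split [radical, monomer] pairs into covered and uncovered ones."""
    covered, uncovered = [], []
    for rad, mon in pairs:
        if rad in available and mon in available:
            covered.append([rad, mon])
        else:
            uncovered.append([rad, mon])
    return covered, uncovered


pairs = [["C=C(C)C(=O)OC", "C=CC(=O)O"], ["C=C", "C=CC(=O)O"]]
covered, uncovered = split_pairs(pairs, available)
print("covered:", covered)
print("uncovered:", uncovered)
```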
To obtain the names of the descriptors registered in CopDDB, use the copddb.datasets.get_available_descriptors() function.
>>> copddb.datasets.get_available_descriptors()
['Radical', 'Monomer', 'DE_decomposition_tail', 'DE_decomposition_head', 'DE_precursor', 'DE_TS', 'DE_product', 'DE_barrier', 'DE_reaction', 'E_Rad_SOMO', 'E_Rad_LUMO', 'E_Mon_HOMO', 'E_Mon_LUMO', 'DE_SHgap', 'DE_SLgap', 'VBur_R228_Mon', 'VBur_R350_Mon', 'VBur_R228_Rad', 'VBur_R350_Rad', 'Volume_MonteCarlo_Mon', 'Volume_MonteCarlo_Rad', 'CCdist_TS']
The most basic usage is to get the descriptors for a radical-monomer pair with the copddb.datasets.descriptors_from_smiles() function. The following example returns the descriptors as a pandas.DataFrame from the SMILES of a radical, smi_rad, and a monomer, smi_mon.
from CopDDB import copddb
smi_rad = "C=CC(=O)OCCCCCCCCCCCC"
smi_mon = "C=CC(=O)O"
descriptor = copddb.datasets.descriptors_from_smiles(smi_rad, smi_mon)
The output of the descriptors is as follows.
>>> descriptor
DE_decomposition_tail DE_decomposition_head ... Volume_MonteCarlo_Rad CCdist_TS
2078 0.042541 0.058434 ... 227.414 2.268076
[1 rows x 20 columns]
If you input a SMILES string that is not listed in CopDDB, an empty DataFrame is returned. For example, with ethylene ("C=C") as the radical:
>>> descriptor = copddb.datasets.descriptors_from_smiles("C=C", smi_mon)
The output of descriptors is as follows.
>>> descriptor
Empty DataFrame
Columns: [DE_decomposition_tail, DE_decomposition_head, DE_precursor, DE_TS, DE_product, DE_barrier, DE_reaction, E_Rad_SOMO, E_Rad_LUMO, E_Mon_HOMO, E_Mon_LUMO, DE_SHgap, DE_SLgap, VBur_R228_Mon, VBur_R350_Mon, VBur_R228_Rad, VBur_R350_Rad, Volume_MonteCarlo_Mon, Volume_MonteCarlo_Rad, CCdist_TS]
Index: []
If you want to explicitly include missing values, use the with_nan option (which is False by default) as follows.
descriptor = copddb.datasets.descriptors_from_smiles("C=C", smi_mon, with_nan=True)
The output of descriptors is as follows.
>>> descriptor
DE_decomposition_tail DE_decomposition_head ... Volume_MonteCarlo_Rad CCdist_TS
2500 NaN NaN ... NaN NaN
[1 rows x 20 columns]
To include the input SMILES in the returned value, use the with_smiles option (which is False by default).
descriptor = copddb.datasets.descriptors_from_smiles("C=C", smi_mon, with_nan=True, with_smiles=True)
The output of descriptors is as follows.
>>> descriptor
Radical Monomer ... Volume_MonteCarlo_Rad CCdist_TS
2500 C=C C=CC(=O)O ... NaN NaN
[1 rows x 22 columns]
When you need to input multiple radical-monomer pairs, pass them as a list as follows.
smi_list = [
["C=C(C)C(=O)OC", "C=C(C)C(=O)OC"],
["C=C(C)C(=O)OC", "C=CC(=O)O"],
[r"CO/C=C\C(=O)OC", "C=Cc1ccccc1"]
]
descriptors = copddb.datasets.descriptors_from_smiles(smi_list)
The output of descriptors is as follows.
>>> descriptors
DE_decomposition_tail DE_decomposition_head ... Volume_MonteCarlo_Rad CCdist_TS
0 0.038534 0.061518 ... 103.2494 2.254882
28 0.038534 0.061518 ... 103.2494 2.248237
152 0.045085 0.045173 ... 103.2451 2.409667
[3 rows x 20 columns]
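When the list of pairs becomes long, it may be easier to generate it than to type it. The sketch below enumerates every ordered [radical, monomer] combination of a few monomers with itertools.product. The resulting smi_list has the same nested-list shape as the one passed to descriptors_from_smiles() above. The choice of monomers is only illustrative. A raw string is used for the SMILES that contains a backslash, so that the backslash is not interpreted as the start of an escape sequence.

```python
from itertools import product

# Raw string keeps the SMILES backslash without relying on "\C"
# being passed through as an unrecognized escape sequence.
monomers = ["C=C(C)C(=O)OC", "C=CC(=O)O", r"CO/C=C\C(=O)OC"]

# Every ordered [radical, monomer] combination, including self-pairs.
smi_list = [[rad, mon] for rad, mon in product(monomers, repeat=2)]

print(len(smi_list))
for pair in smi_list[:3]:
    print(pair)
```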
When you have radical-monomer pairs and their corresponding target variables, use the copddb.datasets.build_dataset_from_smiles_and_y() function to create a dataset that includes both descriptors and target variables. By default, the function removes pairs with missing descriptor values together with their targets. The resulting dataset is returned as a Bunch object.
from CopDDB import copddb
smi_list = [
["C=C(C)C(=O)OC", "C=C(C)C(=O)OC"],
["C=C(C)C(=O)OC", "C=CC(=O)O"],
[r"CO/C=C\C(=O)OC", "C=Cc1ccccc1"],
["C=C", "C=C"] # SMILES that result in missing values
]
target = [1, 2, 3, 4] # Target variables
new_dataset = copddb.datasets.build_dataset_from_smiles_and_y(smi_list, target)
The created Bunch object new_dataset contains the descriptors (data) and the target variables (target) as follows.
>>> new_dataset.keys()
dict_keys(['data', 'target'])
>>> new_dataset["data"]
DE_decomposition_tail DE_decomposition_head ... Volume_MonteCarlo_Rad CCdist_TS
0 0.038534 0.061518 ... 103.2494 2.254882
28 0.038534 0.061518 ... 103.2494 2.248237
152 0.045085 0.045173 ... 103.2451 2.409667
[3 rows x 20 columns]
>>> new_dataset["target"]
array([1, 2, 3])
As shown in Example 1, you can keep missing values explicitly by using the with_nan parameter, which is False by default.
>>> new_dataset = copddb.datasets.build_dataset_from_smiles_and_y(smi_list, target, with_nan=True)
>>> new_dataset["data"]
DE_decomposition_tail DE_decomposition_head ... Volume_MonteCarlo_Rad CCdist_TS
0 0.038534 0.061518 ... 103.2494 2.254882
28 0.038534 0.061518 ... 103.2494 2.248237
152 0.045085 0.045173 ... 103.2451 2.409667
2501 NaN NaN ... NaN NaN
[4 rows x 20 columns]
>>> new_dataset["target"]
array([1, 2, 3, 4])
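Conceptually, the default behavior (with_nan=False) amounts to dropping rows with missing descriptors while keeping the targets aligned with the remaining rows. The plain-Python sketch below illustrates that idea only. It is not CopDDB's implementation, drop_missing is a hypothetical helper, and the descriptor values are placeholders.

```python
import math

# Placeholder descriptor rows (values are illustrative, not CopDDB data);
# the last row mimics a pair that is missing from the database.
rows = [
    {"DE_TS": -0.005, "CCdist_TS": 2.25},
    {"DE_TS": 0.008, "CCdist_TS": 2.24},
    {"DE_TS": float("nan"), "CCdist_TS": float("nan")},
]
target = [1, 2, 3]


def drop_missing(rows, target):
    """Keep only rows without NaN, and the targets at the same positions."""
    kept_rows, kept_target = [], []
    for row, y in zip(rows, target):
        if not any(math.isnan(v) for v in row.values()):
            kept_rows.append(row)
            kept_target.append(y)
    return kept_rows, kept_target


clean_rows, clean_target = drop_missing(rows, target)
print(len(clean_rows), clean_target)
```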
To apply the descriptors of radical-monomer pairs to build an ML model for copolymers, preprocessing of the descriptors is required.
When focusing on the reactivity ratio, the relevant radical-monomer pairs are (M1*, M1) and (M1*, M2).
To combine the descriptors of these two radical-monomer pairs, use the m1m2list_to_11_12() function. With this function, the label of the corresponding radical or monomer (1 or 2) is appended to each descriptor name.
(For example, E_Rad_SOMO of M1* and DE_TS of the (M1*, M2) pair are converted to E_Rad_SOMO_1 and DE_TS_12, respectively.)
from CopDDB import copddb
smi_list = [
["C=C(C)C(=O)OC", "C=C(C)C(=O)OC"],
["C=C(C)C(=O)OC", "C=CC(=O)O"],
[r"CO/C=C\C(=O)OC", "C=Cc1ccccc1"]
]
new_descriptors = copddb.datasets.m1m2list_to_11_12(smi_list)
The dataset of descriptors and target variables can also be prepared with the build_11_12_variables_from_smiles_and_y() function as follows.
smi_list = [
["C=C(C)C(=O)OC", "C=C(C)C(=O)OC"],
["C=C(C)C(=O)OC", "C=CC(=O)O"],
[r"CO/C=C\C(=O)OC", "C=Cc1ccccc1"],
["C=C", "C=C"] # SMILES that result in missing values
]
target = [1, 2, 3, 4] # Target variables
new_dataset = copddb.datasets.build_11_12_variables_from_smiles_and_y(smi_list, target)
The contents of new_dataset are as follows.
>>> new_dataset["data"]
DE_TS_11 DE_TS_12 ... Volume_MonteCarlo_Rad_1 Volume_MonteCarlo_Rad_2
0 -0.005547 -0.005547 ... 103.2494 103.2494
1 -0.005547 0.008555 ... 103.2494 68.5728
2 -0.001421 -0.003731 ... 103.2451 108.9815
[3 rows x 40 columns]
>>> new_dataset["data"].keys()
Index(['DE_TS_11', 'DE_TS_12', 'DE_product_11', 'DE_product_12',
'DE_barrier_11', 'DE_barrier_12', 'DE_reaction_11', 'DE_reaction_12',
'DE_SHgap_11', 'DE_SHgap_12', 'DE_SLgap_11', 'DE_SLgap_12',
'CCdist_TS_11', 'CCdist_TS_12', 'DE_decomposition_tail_1',
'DE_decomposition_tail_2', 'DE_decomposition_head_1',
'DE_decomposition_head_2', 'DE_precursor_1', 'DE_precursor_2',
'E_Rad_SOMO_1', 'E_Rad_SOMO_2', 'E_Rad_LUMO_1', 'E_Rad_LUMO_2',
'E_Mon_HOMO_1', 'E_Mon_HOMO_2', 'E_Mon_LUMO_1', 'E_Mon_LUMO_2',
'VBur_R228_Mon_1', 'VBur_R228_Mon_2', 'VBur_R350_Mon_1',
'VBur_R350_Mon_2', 'VBur_R228_Rad_1', 'VBur_R228_Rad_2',
'VBur_R350_Rad_1', 'VBur_R350_Rad_2', 'Volume_MonteCarlo_Mon_1',
'Volume_MonteCarlo_Mon_2', 'Volume_MonteCarlo_Rad_1',
'Volume_MonteCarlo_Rad_2'],
dtype='object')
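The column names in the listing above follow a regular pattern: some descriptors receive the pair labels 11 and 12, while the others receive the single labels 1 and 2. The sketch below reproduces the 40 names from that pattern. The grouping of descriptors is inferred from the printed output only (an assumption, not read from the CopDDB source code), and suffixed is a hypothetical helper.

```python
# Grouping inferred from the column listing above (an assumption, not
# read from the CopDDB source code).
pair_level = ["DE_TS", "DE_product", "DE_barrier", "DE_reaction",
              "DE_SHgap", "DE_SLgap", "CCdist_TS"]
species_level = ["DE_decomposition_tail", "DE_decomposition_head",
                 "DE_precursor", "E_Rad_SOMO", "E_Rad_LUMO",
                 "E_Mon_HOMO", "E_Mon_LUMO",
                 "VBur_R228_Mon", "VBur_R350_Mon",
                 "VBur_R228_Rad", "VBur_R350_Rad",
                 "Volume_MonteCarlo_Mon", "Volume_MonteCarlo_Rad"]


def suffixed(names, suffixes):
    """Append each suffix to each name, keeping the names grouped."""
    return [f"{name}_{s}" for name in names for s in suffixes]


columns = suffixed(pair_level, ["11", "12"]) + suffixed(species_level, ["1", "2"])
print(len(columns))
print(columns[:4])
```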
When focusing on copolymers consisting of two monomers, for instance M1 (= St, GMA, PACS, THFMA, or CHMA) and M2 (= MMA), the descriptors of three radical-monomer pairs, (M1*, M1), (M1*, MMA), and (MMA*, M1), can be used as the descriptors of the monomer pair M1 and MMA. These three descriptor sets can be assembled with the m1list_and_m2_to_11_12_21() function as follows.
from CopDDB import copddb
m1list = [
"C=Cc1ccccc1", # St
"C=C(C)C(=O)OCC1CO1", # GMA
"C=Cc1ccc(OC(C)=O)cc1", # PACS
"C=C(C)C(=O)OCC1CCCO1", # THFMA
"C=C(C)C(=O)OC1CCCCC1", # CHMA
]
m2 = "C=C(C)C(=O)OC" # MMA
new_dataset = copddb.datasets.m1list_and_m2_to_11_12_21(m1list, m2)
The contents of new_dataset are as follows.
>>> new_dataset.keys()
dict_keys(['data', 'm1s', 'm2'])
>>> new_dataset.data
DE_tail_11 DE_tail_12 DE_tail_21 DE_head_11 DE_head_12 DE_head_21 ... logP_Rad_11 logP_Rad_12 logP_Rad_21 logP_Mon_11 logP_Mon_12 logP_Mon_21
0 0.038749 0.038749 0.038534 0.063061 0.063061 0.061518 ... 2.7 2.7 1.0 2.7 1.0 2.7
1 0.038719 0.038719 0.038534 0.062134 0.062134 0.061518 ... 0.6 0.6 1.0 0.6 1.0 0.6
2 0.038493 0.038493 0.038534 0.063246 0.063246 0.061518 ... 2.3 2.3 1.0 2.3 1.0 2.3
3 0.037938 0.037938 0.038534 0.061257 0.061257 0.061518 ... 1.1 1.1 1.0 1.1 1.0 1.1
4 0.039386 0.039386 0.038534 0.061804 0.061804 0.061518 ... 2.5 2.5 1.0 2.5 1.0 2.5
[5 rows x 72 columns]
>>> new_dataset.m1s
['C=Cc1ccccc1', 'C=C(C)C(=O)OCC1CO1', 'C=Cc1ccc(OC(C)=O)cc1', 'C=C(C)C(=O)OCC1CCCO1', 'C=C(C)C(=O)OC1CCCCC1']
>>> new_dataset.m2
'C=C(C)C(=O)OC'
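The three radical-monomer pairs behind the _11, _12, and _21 columns can be enumerated explicitly, which may help when checking which database entries a given row draws on. The sketch below uses a hypothetical helper, pairs_11_12_21, which is not part of CopDDB. As in the examples above, the first SMILES of each pair is that of the monomer from which the radical is derived.

```python
m1list = ["C=Cc1ccccc1", "C=C(C)C(=O)OCC1CO1"]  # St, GMA
m2 = "C=C(C)C(=O)OC"                            # MMA


def pairs_11_12_21(m1list, m2):
    """Return, per M1, the [radical, monomer] pairs labeled 11, 12, 21."""
    return [
        {"11": [m1, m1], "12": [m1, m2], "21": [m2, m1]}
        for m1 in m1list
    ]


for entry in pairs_11_12_21(m1list, m2):
    print(entry)
```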