melting point database

There are currently more than 40,000 compounds and more than 45,000 synthesis references in the database. The GSE water solubility model, which is based on E-state indices and thus requires fewer computational resources, was made publicly available on the OCHEM web site. The same accuracy was obtained regardless of whether the consensus model or the model based on the E-state descriptors was used. The prediction of MP remains an important task for cheminformatics studies for a number of reasons [2, 11-17]. Inorganic compounds are often ionic, and so have very high melting points. We have shown that automated tools for the analysis of chemical information have reached a mature stage, allowing for the extraction and collection of high-quality data to enable the development of structure-activity relationship models. AJW initiated the study, DL extracted and curated the data, and IVT performed the modeling and statistical analysis. Moreover, such information can also be useful for the handling of chemical compounds. The first approach was averaging by model accuracy. The 211 resulting descriptors range from 0D descriptors (such as MW or atom counts) to 1D, 2D, and various 3D descriptors. J Chem Inf Model 50:742-754, BIOVIA Pipeline Pilot Overview. The ratio of identified outliers to that expected by chance corresponds to the signal-to-noise ratio (SNR). Thus, neither of the studied strategies provided an improvement over a simple arithmetic average of the models. Chemical Synthesis Database: ChemSynthesis is a freely accessible database of chemicals. ECFP4 circular fingerprint descriptors [41] were calculated using ChemAxon software v. 5.10.4. The former may have low affinity and specificity while the latter are likely to be insoluble. Thus enlargement of the training set increased the predictive power of the models according to the CV protocol.
Providing categorized protein sequences and structures as psychrophilic, mesophilic and thermophilic makes this database useful for the development of new tools in protein stability prediction. The RMSEs for the PATENTS set were reported for the whole set. This distance to model corresponds to the disagreement (standard deviation) of the individual predictions of the models in the consensus model [44]. The final consensus model was compared (Table 6) with the model developed using the COMBINED set in our previous publication [29]. The bagging models were developed using N = 64 models. J Chem Inf Comput Sci 41:1488-1493, Tetko IV, Tanchuk VY (2002) Application of associative neural networks for prediction of lipophilicity in ALOGPS 2.1 program. http://accelrys.com/products/collaborative-science/biovia-pipeline-pilot/ (5 Aug 2015), Bender A, Mussa HY, Glen RC, Reiling S (2004) Similarity searching of chemical databases using atom environment descriptors (MOLPRINT 2D): evaluation of performance. J Chem Inf Comput Sci 37:705-714, Williams A, Lowe D, Tetko I (2015) Melting point and pyrolysis point data for tens of thousands of chemicals. Our database will be updated periodically. A list of pharmaceutical APIs and excipients with their melting points. Consensus modeling was shown to be essential for achieving high prediction accuracy in the previous study [11]. Solubilities are in water and are reported as grams solvent/100 grams water. The pyrolysis compounds were excluded from the development of the MP model, but for better comparison with the previous model the analysis in Table 3 was performed without separating the two classes. Descriptor selection bias. Moreover, for the drug-like region the estimated experimental accuracy (33.3 °C) and the calculated error (CV RMSE = 33 °C) were also very similar.
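The distance to model mentioned above, i.e. the standard deviation of the individual model predictions within a consensus, can be sketched as follows. This is a minimal illustration and not the OCHEM implementation: the function name `consensus_with_std` is hypothetical, and the use of the population standard deviation is an assumption, since the text does not say which estimator is used.

```python
from statistics import mean, pstdev


def consensus_with_std(predictions):
    """Return (consensus prediction, disagreement) for one compound.

    `predictions` holds the MP predicted by each individual model.
    The consensus is the simple arithmetic average; the disagreement is
    the standard deviation of the individual predictions
    (assumption: population form).
    """
    if len(predictions) < 2:
        raise ValueError("need predictions from at least two models")
    return mean(predictions), pstdev(predictions)


# Hypothetical predictions (in degrees Celsius) from three models.
consensus, disagreement = consensus_with_std([100.0, 110.0, 120.0])
print(f"consensus = {consensus:.1f}, disagreement = {disagreement:.2f}")
```

A large disagreement suggests that the compound lies far from what the individual models agree on, so its prediction should be trusted less.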
Curr Comput Aided Drug Des 4:191-198, Skvortsova MI, Baskin II, Skvortsov LA, Palyulin VA, Zefirov NS, Stankevich IV (1999) Chemical graphs and their basis invariants. This result was due to the absence of molecules with MP < 0 °C in the PATENTS set. The predictions are calculated for samples which were not included in the respective training sets and are averaged over all calculated models. A number of data points (see Table 1) from the PATENTS collection contained annotations about the thermal decomposition (pyrolysis) of chemical structures. Front Environ Sci. The EFG were selected as the set having the smallest number of non-zero values (Table 2). For example, the distribution of MP values from the PATENTS literature had peaks at 250 and 350 °C, indicating either that measurements were stopped at these temperatures and threshold values were reported, or that an estimated value within a fairly broad range was entered at these temperatures. The improvement in model performance for the whole COMBINED set was larger than that calculated for the drug-like subsets. This step requires significant computational time. The SetCompare tool identified that molecules containing acids (carboxylic, phosphonic and α-amino acids), primary amines, tetrazoles, and a number of other groups were overrepresented in the group of compounds that decomposed on heating. Thus, the presence of one of these groups increased the probability that a compound would decompose more than tenfold. [68] The same change decreases the number of rings as well as the number of atoms in the largest -chain (relative to the overall size of the molecule) as well as other electronic parameters of the molecule.
This customization consisted of adding support for tokens containing spaces (such that an MP measurement could be treated as a single token) and the integration of LeadMine to identify chemical entities and MPs. has extensive experience with the extraction of chemistry-related information from patents and previous investigations have examined the extraction of chemical reactions [3]. ToxAlert [38] extended functional groups (EFG) [39] included 583 groups covering different functional features of molecules. Even these calculations required about 15,000 core-hours. NIST produces the Nation's Standard Reference Data (SRD). Volume 8, Article number: 2 (2016). In total 498,985 associations were found in patent grants and 172,886 associations were found in the patent applications. As an example, a non-registered and a validated registered user can submit models with up to 1000 and 10,000 molecules per task, respectively. We showed that the estimated accuracy varied as a function of temperature and achieved the lowest error of 32 °C for the drug-like region of the dataset. the MP appears at the end of the experimental section along with any other characterization data. Redrawing of chemical compounds can be difficult and in many cases they are not available as structure depictions but only in the form of chemical names. Complicating data extraction, the format used by the USPTO has varied over time with four significantly different formats being employed (one textual, one SGML and two XML formats). http://onswebservices.wikispaces.com/meltingpoint (5 Aug 2015), Open modeling of melting point data. This result is in agreement with the known problem of decreasing solubility of compounds in drug discovery for large molecules. The compounds from the Bergström dataset had the second largest MPs.
A consensus model based on the average of five models achieved the lowest RMSE = 42.3 °C. The MW and number of non-hydrogen atoms of decomposing structures were practically identical to those of other molecules. The EFG, despite their high dimensionality, had only 3.1 million non-zero values, and provided the fastest calculations. The application of the SVM method required the optimization of three parameters, including C. Patent grants were available for the entirety of this period, while patent applications were available only from 2001 onwards. J Chem Inf Model 52:2310-2316, Salmina E, Haider N, Tetko IV (2016) Extended functional groups (EFG): an efficient set for chemical characterization and structure-activity relationship studies of chemical compounds. In the absence of the MP values a default value is frequently used. The MLRA model developed with both these descriptors, MP = 117 + 0.142 MW - 0.79 nC, achieved an RMSE = 64.7 °C. This flag was set for cases where the value was a range in which the second temperature was lower than the first temperature. Its prediction from chemical structure remains a highly challenging task for quantitative structure-activity relationship studies. Examples of such algorithms include neural networks, multiple linear regression analysis and partial least squares. A comparison of MLRA and SVM results developed using exactly the same sets of descriptors indicated significantly higher accuracy of the SVM models. These heuristics aimed to detect cases where the patent text was likely to be in error. Compounds with nitroso groups are well known for their ability to decompose with a release of high energy, which makes them very important for the development of explosives (including dynamite). This website contains substances with their synthesis references and physical properties such as melting point, boiling point and density. an accurate MP was not required per se, see Fig.
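The range check described above (flagging a reported range whose second temperature is lower than the first) can be sketched as follows. This is only an illustration under stated assumptions: the regular expression, the function name `parse_range` and the returned dictionary are hypothetical and are not the grammar or the LeadMine-based pipeline used in the study; only ranges given in degrees Celsius are handled.

```python
import re

# Hypothetical pattern: two numbers separated by "-" or "to",
# followed by an optional degree sign and "C".
RANGE_RE = re.compile(
    r"(-?\d+(?:\.\d+)?)\s*(?:-|to)\s*(-?\d+(?:\.\d+)?)\s*\u00b0?\s*C\b"
)


def parse_range(text):
    """Extract the first Celsius range in `text` and flag suspicious ones.

    Returns None when no range is found. A range is marked suspicious
    when its second temperature is lower than its first.
    """
    match = RANGE_RE.search(text)
    if match is None:
        return None
    first = float(match.group(1))
    second = float(match.group(2))
    return {"first": first, "second": second, "suspicious": second < first}


print(parse_range("mp 120-118 C"))
print(parse_range("230.9-231.8C"))
```

Flagged records are candidates for manual review rather than automatic deletion, since the underlying patent text may still contain a usable value.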
The implementation of ASNN did not offer this feature. Int J Pharm 373:24-40, Varnek A, Kireeva N, Tetko IV, Baskin II, Solov'ev VP (2007) Exhaustive QSPR studies of a large diverse set of ionic liquids: how accurately can we predict melting points? It should be noted that the calculation of large models requires significant CPU resources. The training of a model with a hundred thousand descriptors is infeasible with algorithms that operate on the full matrix. However, the training of large datasets requires significant computational resources and can take a long time. The resulting training set is thus double the size of the smaller class. Table 1 and Fig. RMSE of LibSVM models calculated with different sets of descriptors. This is an open melting point database of pharmaceutical APIs and excipients. Significant amounts of MP data are freely available within the patent literature and, if they were available in the appropriate form, could potentially be used to develop predictive models. Journal of Cheminformatics. The grammar can be summarized as: FromLiterature? Search Open Melting Point Data: thirteen thousand experimental melting points for slightly over eight thousand chemical structures. where RMSE was the root mean squared error of the model. doi:10.3389/fenvs.2016.00002, Zhu H, Tropsha A, Fourches D, Varnek A, Papa E, Gramatica P, Oberg T, Dao P, Cherkasov A, Tetko IV (2008) Combinatorial QSAR modeling of chemical toxicants tested against Tetrahymena pyriformis. Indeed, since the accuracies of individual models were very similar, their weighted combination did not improve results compared to the simple average.
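The balanced sampling mentioned above, where the training set for each bagging model ends up twice the size of the smaller class, can be sketched as follows. This is a minimal sketch that assumes sampling with replacement from each class, as is usual in bagging; the function name `balanced_bootstrap` and the example labels are hypothetical.

```python
import random


def balanced_bootstrap(smaller_class, larger_class, rng):
    """Draw one balanced training set for a single bagging model.

    Both classes are sampled with replacement (assumption), and the
    number of samples drawn from the larger class is limited to the
    size of the smaller class, so the result has 2 * len(smaller_class)
    samples.
    """
    n = len(smaller_class)
    from_smaller = [rng.choice(smaller_class) for _ in range(n)]
    from_larger = [rng.choice(larger_class) for _ in range(n)]
    return from_smaller + from_larger


rng = random.Random(0)  # fixed seed so the sketch is reproducible
decomposing = ["d1", "d2", "d3"]
melting = ["m1", "m2", "m3", "m4", "m5", "m6", "m7", "m8"]
training_set = balanced_bootstrap(decomposing, melting, rng)
print(len(training_set), training_set)
```

Repeating the draw for every model in the ensemble gives each model a different balanced view of the data.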
The RMSE calculated for the Bergström set is the lowest published value for this set and it is about 30% smaller than the 44.6 °C reported in the original study of Bergström et al. Molecules 21:1. doi:10.3390/molecules21010001, Haider N (2010) Functionality pattern matching as an efficient complementary structure/reaction search tool: an open-source approach. The estimation of logS with an error of <0.5 log units is on the level of the experimental measurement accuracy [75] and thus is very valuable for the pharma industry. Compounds with MP < 0 °C, most of which were data processing errors, were excluded. Some of the problems with collected values were difficult to recognize and eliminate. is an individual prediction was used to develop the consensus model in that study. Thus the results for the PATENTS set combined the prediction of molecules from the validation sets and the prediction of the outlying molecules using the final models developed with the respective training set. The Bergström dataset contained drug-like molecules [17]. The new model provided similar or lower errors for the drug-like subsets compared to the consensus models developed with individual OCHEM or Enamine sets. J Cheminform 8, 2 (2016). Lewis, Boca Raton, p xxii. The descriptors are calculated by splitting the respective string into all possible contiguous substrings of a fixed length. during the process of registering, uploading data, developing and publishing models and participating in data moderation). 236 instead of Mp.
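The substring descriptors described above (all contiguous substrings of a fixed length taken from a string representation of the molecule) can be sketched as follows. The excerpt does not say which string representation is used, so the SMILES-like input below is an assumption, and the function name `substring_counts` is hypothetical.

```python
from collections import Counter


def substring_counts(text, length):
    """Count every contiguous substring of `text` with the given length.

    Each distinct substring acts as one descriptor and its count is the
    descriptor value. Strings shorter than `length` yield no descriptors.
    """
    if length <= 0:
        raise ValueError("length must be positive")
    return Counter(
        text[start:start + length]
        for start in range(len(text) - length + 1)
    )


# Assumed example input: a SMILES-like string for ethanol.
print(substring_counts("CCO", 2))
```

Because most substrings never occur in a given molecule, descriptors of this kind are naturally stored in a sparse format.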
It is also interesting that only a few compounds in this set had MP values of >250 °C, indicating the difficulty of identifying reproducible measurements for high MP values. These 2D descriptors are calculated with the help of the ISIDA fragmenter tool [32]. Such an approach could enable widespread use of the GSE to estimate the solubility of chemical compounds. In these cases the assumption is made that the MP applies to the compound being synthesized in that paragraph. The prediction of MP itself has limited practical value. The increase in accuracy of 0.1-0.3 log units for both sets was not statistically significant. A large number of MP measurements were duplicated across different patents. J Med Chem 39:2887-2893, Yang Y, Chen H, Nilsson I, Muresan S, Engkvist O (2010) Investigation of the relationship between topology and selectivity for druglike molecules. According to this equation, the prediction of MP with an RMSE of 30 °C contributes 0.3 log units to the error of the solubility prediction. He was particularly interested in the quality of experimental MPs reported in the literature and those reported by chemical vendors [6]. Molecular frameworks. The support of a sparse data format is efficiently realized in LibSVM, making this method easily applicable to this type of data. 4 confirm this observation and indicate that about 90% of compounds from the PATENTS, Enamine and Bergström data sets are covered by this temperature interval. J Chem Inf Comput Sci 43:493-500, ChemAxon Kft. out of seven molecules identified for this p-value, only one can be explained by statistical properties of the data. The compounds with MP from the PATENTS dataset contributed molecules with the largest MW and thus MP. Contains information on approximately 11,000 substances, including melting point, boiling point, density, solubility and refractive index.
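The statement above that an MP error of 30 °C contributes 0.3 log units to the solubility error follows from the general solubility equation (GSE). The excerpt does not reproduce the equation, so the sketch below assumes its commonly cited form, logS = 0.5 - 0.01 (MP - 25) - logP, with MP in °C and the melting term set to zero for compounds that are liquid at 25 °C. The function names are hypothetical.

```python
def gse_log_s(mp_celsius, log_p):
    """Estimate logS with the assumed form of the GSE.

    logS = 0.5 - 0.01 * (MP - 25) - logP, where the melting term is
    taken as zero for compounds melting at or below 25 degrees Celsius.
    """
    melting_term = max(mp_celsius - 25.0, 0.0)
    return 0.5 - 0.01 * melting_term - log_p


def log_s_error_from_mp_error(mp_error_celsius):
    """Propagate an MP error (degrees Celsius) to logS via the 0.01 slope."""
    return 0.01 * mp_error_celsius


# Hypothetical compound: MP = 125 degrees Celsius, logP = 2.0.
print(gse_log_s(125.0, 2.0))
print(log_s_error_from_mp_error(30.0))
```

Since logS depends linearly on MP with a slope of 0.01 per degree, any error in the predicted MP maps directly onto the solubility estimate, which is why the MP model accuracy matters for this application.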
http://usefulchem.blogspot.com/2011/06/my-talk-at-sla-on-trust-in-science-and.html (5 Aug 2015), Open Melting Point Collection Book Edition 1. http://usefulchem.blogspot.com/2011/08/open-melting-point-collection-book.html (5 Aug 2015), Melting Point Web Services. The use of the WEKA [58] implementation of decision trees (J48) improved balanced accuracy for Fragmentor descriptors from 75.8 to 77.6%. However, the aforementioned effect is not the only one contributing to the MP of compounds. This database contains data needed for a better understanding of protein thermostability and stability engineering. The correlation of melting points of the two most common systems, disubstituted imidazolium tetrafluoroborate and disubstituted imidazolium hexafluorophosphate, was carried out using a. The final consensus model achieved a precision similar to the estimated experimental accuracy. The entry point to the database is the search form, which allows browsing in two major ways: (i) a simple full-text search for querying the database using protein name, UniProt accession codes, PDB . The outlying compounds were therefore again enriched with decomposing compounds. The melting range of bupropion hydrochloride (polymorph unknown) was experimentally determined to be 230.9-231.8 °C using a Krüss M5000 apparatus (Hamburg, Germany). 383C, which was a misprint of the minus sign. Because of the limitation on computational resources, the grid search to select SVM parameters was done using only one set of descriptors, EFG, which contained the smallest number of non-zero values. Further progress in the prediction of MPs can be advanced by improvement in the accuracy of experimental measurements, as well as by prediction of MP for different polymorphic and amorphous forms.
The dashed lines indicate a defined drug-like region, which covers the MP of >90% of the drug (Bergström) and chemical provider (Enamine) sets as well as 87% of the compounds from the PATENTS set. Adriana.Code v.2.2.6 [28] (3D), developed by Molecular Networks GmbH, calculates a variety of physicochemical properties of a molecule. The CV RMSE for a subset of molecules that decomposed was 47.7 °C. 75 °C, 200 °F, one hundred degrees Celsius. Validating the measured property in any meaningful way is difficult but manual inspection can highlight obvious errors with the parameters as captured (vide infra). While some inorganic compounds are solids with accessible melting points, and some are liquids with reasonable boiling points, there are not the exhaustive tabulations of melting/boiling point data for inorganic compounds that exist for organics. Our current melting temperature database contains 9,375 materials, out of which 982 compounds are high-melting-temperature materials with melting points above 2,000 K. The database consists of chemical compositions (i.e., elements and concentrations) or equivalently chemical formula, of the materials, and their . The duplicated measurements (N = 18,058) were used to estimate the experimental accuracy of MP measurements, which was estimated to be 38 °C. As an illustrative example we applied the GSE to predict logS for N = 1311 molecules from our previous study [78]. references will be available. Melting point (MP) is an important property with regard to the solubility of chemical compounds. His interests concerned the value of MP in helping to predict temperature-dependent solubility for solvent selection [4], as well as assembling measured experimental properties as part of an Open Notebook Challenge [5]. If instead of N=23 we detect e.g.
While this procedure has only been reported for MP extraction and modeling in this work, we can imagine utilizing the same procedure for other physicochemical properties such as multi-solvent solubilities, logP and other available parameters. https://creativecommons.org/licenses/by/3.0/ (24 Nov 2015), Palmer DS, Mitchell JB (2014) Is experimental data quality the limiting factor in predicting the aqueous solubility of druglike molecules? J Med Chem 49:6429-6434, Sushko I, Novotarskyi S, Korner R, Pandey AK, Rupp M, Teetz W, Brandmaier S, Abdelaziz A, Prokopenko VV, Tanchuk VY, Todeschini R, Varnek A, Marcou G, Ertl P, Potemkin V, Grishina M, Gasteiger J, Schwab C, Baskin II, Palyulin VA, Radchenko EV, Welsh WJ, Kholodovych V, Chekmarev D, Cherkasov A, Aires-de-Sousa J, Zhang QY, Bender A, Nigsch F, Patiny L, Williams A, Tkachenko V, Tetko IV (2011) Online chemical modeling environment (OCHEM): web platform for data storage, model development and publishing of chemical information. The original text was retained for reference. Since development of models with E-state counts was faster, the counts were used. We used the average or threshold value for the development of the LibSVM models. J Comput Aided Mol Des 25:533-554, OCHEM Molecular descriptors. A number of technical challenges were solved to curate the data and transform the information from the text to computer-readable formats. With this in mind we decided to investigate the data mining of property data from an openly available patent corpus, with a focus on the extraction, curation and modeling of MP data. GSFrag and GSFrag-L [33] are used to calculate 2D descriptors representing fragments of length \(k = 2 \ldots 10\) or \(k = 2 \ldots 7\), respectively. A binned plot of the accuracy as a function of the MP temperature indicates that measurements with higher and lower temperatures were less reproducible. alert) in two analyzed sets could happen by chance [25].
By repeating the model building five times one can calculate predictions for all molecules from the initial dataset. All of these descriptor types are implemented within the OCHEM platform [29]. These predictions are used to estimate the CV accuracy of the model. In the second approach a consensus model was developed using the predictions of individual models as descriptors for a multiple linear regression analysis (MLRA) model. When using 1% of a randomly selected training data set we found that, surprisingly, the same parameters (C = 64, with values of 1 and 0.00391 for the other two parameters) were optimal for 10 out of 13 descriptor sets. This indicates the high quality of patent-mined data, which is similar to that of manually curated data from the literature. Once one of these options is used the model can be submitted to perform calculations without a need to specify any other parameters and will use exactly the same workflow as the original model. This grid optimization procedure is implemented as part of OCHEM. if the enthalpy change of melting or ΔHfusion were zero). The second one is its complement, which indicates the number of free electrons for the carbon atoms, which do not participate in the overlap. 4). Melting point standard 121-123 °C, benzoic acid, is an analytical standard used to calibrate melting point instruments in the thermodynamic mode of operation. In: Strasbourg summer school on chemoinformatics: cheminfoS3. http://www.csie.ntu.edu.tw/~cjlin/libsvm (10 Nov 2015), Tetko IV, Baskin II, Varnek A (2008) Tutorial on machine learning. The terms melting range, melting point, and melting temperature are all used in pharmacopeial contexts for the temperature at which a given solid material changes from a solid state to a liquid, or melts. Its melting point value is determined as an average of 6 to 12 measurements using a Büchi B-545 instrument that is calibrated against primary standards. The entities to associate are shown above.
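The five-fold procedure described above, in which the model is built five times so that every molecule receives a prediction from a model that did not see it, can be sketched as follows. This is a minimal sketch and not the OCHEM workflow: the "model" here is a placeholder that simply predicts the mean MP of its training fold, and the function names are hypothetical.

```python
import random
from statistics import mean


def out_of_fold_predictions(y, n_folds=5, seed=0):
    """Return one cross-validated prediction per sample.

    Samples are shuffled and split into `n_folds` folds. For each fold a
    placeholder model (the mean of the remaining samples) is fitted and
    used to predict only the held-out samples.
    """
    indices = list(range(len(y)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::n_folds] for i in range(n_folds)]
    predictions = [None] * len(y)
    for held_out in folds:
        held = set(held_out)
        training = [y[i] for i in indices if i not in held]
        fitted = mean(training)  # placeholder for a real model
        for i in held_out:
            predictions[i] = fitted
    return predictions


def rmse(observed, predicted):
    """Root mean squared error between two equally long sequences."""
    squared = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return (sum(squared) / len(squared)) ** 0.5


# Hypothetical MP values in degrees Celsius.
mp_values = [float(v) for v in range(50, 150, 10)]
cv_predictions = out_of_fold_predictions(mp_values)
print(f"CV RMSE = {rmse(mp_values, cv_predictions):.1f}")
```

The resulting CV RMSE estimates how the model performs on molecules outside its training data, which is why it is reported separately from the fit on the training set.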
Before the development of models, descriptors that had two or fewer non-zero values for the whole training set were eliminated. The aggregation and curation of such datasets can be very exacting in terms of extraction of the data from the literature. One of the authors (D.L.) (Value|Range|MeasurementError) OutcomeQualifier? 6 CRC handbook of chemistry and physics: citable and reliable. In addition, the NIST Electron Elastic-Scattering Cross-Section Database (SRD 64) and the NIST Database of Cross Sections for Inner-Shell Ionization by Electron or Positron Impact (SRD 164) provide data for Monte Carlo simulations of electron transport in matter and for applications in atomic physics, plasma physics, radiation physics, and materials analysis by electron-probe microanalysis. Following data upload to OCHEM we performed modeling and reviewed outlier molecules. where logS is the intrinsic molar solubility and logP is the octanol/water partition coefficient. The same procedure is also used for the larger class but the number of selected samples is limited to that of the smaller class. The CV RMSE for different subsets of the final model as a function of MP. Compilation of the Melting Points of the Metal Oxides, U.S. Department of Commerce, National Bureau of Standards.
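The descriptor filtering step described above, which removes descriptors with two or fewer non-zero values over the whole training set, can be sketched as follows. This is a minimal illustration on a small dense matrix; the function name `drop_sparse_descriptors` and the example data are hypothetical, and a real implementation would work on a sparse format.

```python
def drop_sparse_descriptors(rows, min_nonzero=3):
    """Remove descriptor columns with fewer than `min_nonzero` non-zero values.

    `rows` is a list of equally long descriptor vectors, one per molecule.
    With the default threshold, columns having two or fewer non-zero
    values are eliminated. Returns (kept column indices, filtered rows).
    """
    n_columns = len(rows[0])
    kept = [
        column
        for column in range(n_columns)
        if sum(1 for row in rows if row[column] != 0) >= min_nonzero
    ]
    filtered = [[row[column] for column in kept] for row in rows]
    return kept, filtered


# Hypothetical descriptor matrix: four molecules, three descriptors.
matrix = [
    [1, 0, 2],
    [1, 0, 0],
    [1, 3, 0],
    [0, 0, 5],
]
kept_columns, filtered_matrix = drop_sparse_descriptors(matrix)
print(kept_columns, filtered_matrix)
```

Descriptors that are non-zero for only one or two molecules cannot support a generalizable relationship, so dropping them reduces the dimensionality at little cost.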
The patent-mined data from this study are publicly downloadable from the same web site and are also available from FigShare [70] under a CC-BY license [71]. Therefore, after initial analysis LibSVM was used to develop all models using a radial basis function (RBF) kernel. They were present in only 0.4% of compounds (0.1% phosphonic and 0.3% α-amino acids) in the whole set but contributed about 4% of all compounds in the decomposing set. The number of rings and the resonance count (number of resonance structures of a molecule) were also two highly correlated descriptors (R=0.355 and R=0.354) calculated using ChemAxon. Thus, the developed consensus model achieved the experimental accuracy of the MP data. Further branching, such as with the isomer . The theme of this memorial issue is focused on the contributions of Jean-Claude Bradley to Open Science. Dr. Bradley had a particular interest in the quality of MP data and he invested significant effort in investigating this property. Thus, the COMBINED set has about the same percentage of decomposing compounds as the PATENTS set. significantly larger compared to the 36.5 °C calculated for the subset of molecules without the decomposition. These results indicate the separation of molecules into two classes. This may be done using other forms of analysis, such as gas chromatography-mass spectrometry coupled with a database. The groups are based on classifications provided by the CheckMol software [40], which was extended to cover new groups, in particular heterocycles [39]. Three descriptor sets (E-state, CDK and Fragmentor) had balanced accuracy above 75.5%, with the best one, E-state, reaching 78.1%. The selection of samples is repeated for each developed model used in the bagging protocol. DOI: https://doi.org/10.1186/s13321-016-0113-y. In this study, we used the sequence fragments composed of atoms and bonds.
RIVM, National Institute of Public Health and the Environment, Bilthoven, Delaney JS (2005) Predicting aqueous solubility from structure. For each molecule we selected one record, the one with an MP closest to the median experimental value for that molecule. The descriptor packages analyzed in this study calculated different numbers of descriptors (see Table 2). Workflow for extraction of melting point data. A consensus model was built as a simple average of all models with the exception of the three aforementioned models, which had CV RMSE > 50 °C. solubility assessment [19] or as a parameter of multiple solvent models to simulate the accumulation and degradation of chemicals in different solvents, based on a number of explicit mathematical models for the transfer and degradation of molecules [62, 63]. Mol Pharmacol 11:2962-2972, Hughes LD, Palmer DS, Nigsch F, Mitchell JB (2008) Why are some properties more difficult to predict than others? The propensities of a compound to decompose and to melt are different properties. following pages: text search or structure search. Manahan SE (2003) Toxicological chemistry and biochemistry, 3rd edn. It provides the descriptor engine, which calculates 246 descriptors containing topological, geometric, electronic, molecular, and constitutional descriptors. IVT is CEO of BigChem GmbH, which licenses OCHEM software. The developed consensus model estimates both the applicability domain [59] and the accuracy of the prediction for new compounds based on the CONSENSUS-STD distance to model [44, 59]. Unsaturation and saturation indexes Ui (R=0.349) and Uc (R=0.325) were the two most highly correlated molecular property descriptors calculated by the Dragon software.
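The selection of a single record per molecule described above (the record whose MP is nearest to the median of all experimental values for that molecule) can be sketched as follows. The function name `select_representative` and the record layout are hypothetical; ties are resolved in favor of the first record, which is an assumption.

```python
from statistics import median


def select_representative(records):
    """Pick the record whose MP is closest to the median MP.

    `records` is a list of (record_id, mp_celsius) tuples for a single
    molecule. When two records are equally close, the first one wins.
    """
    if not records:
        raise ValueError("at least one record is required")
    median_mp = median(mp for _, mp in records)
    return min(records, key=lambda record: abs(record[1] - median_mp))


# Hypothetical duplicated measurements for one molecule.
measurements = [("patent_a", 100.0), ("patent_b", 104.0), ("patent_c", 150.0)]
print(select_representative(measurements))
```

Using the median rather than the mean keeps a single outlying measurement, such as the 150 °C value in this example, from shifting the selected record.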
