# NLP-Summarizer / text_summary.py
text = '''
Machine learning methods have in the past decades witnessed an unprecedented technological evolution enabling a plethora of applications, some of which have become daily companions in our lives.1−3 Applications of ML include technological fields, such as web search, translation, natural language processing, self-driving vehicles, control architectures, and in the sciences, for example, medical diagnostics,4−8 particle physics,9 nano sciences,10 bioinformatics,11,12 braincomputer interfaces,13 social media analysis,14 robotics,15,16 and team, social, or board games.17−19 These methods have also become popular for accelerating the discovery and design of new materials, chemicals, and chemical processes.20 At the same time, we have witnessed hype, criticism, and misunderstanding about how ML tools are to be used in chemical research. OF-DFT is very computationally efficient (these methods should scale linearly with system size153,154) but these formulations have not yet been developed to rival the accuracy or transferability of wavefunction methods, though they have been used for studying different classes of chemical and materials systems.155−157 OF-DFT methods are also used in exciting applications modeling chemistry and materials under extreme conditions.158−160 One should expect that once highly accurate forms are developed and matured, accurate CompChem calculations on electronic structures on systems having more than a million atoms might become commonplace. The ML workflow typically includes the following stages: 1 Gathering and preparing the data 2 Choosing a representation 3 Training the model 3a Train model candidates 3b Evaluate model accuracy 3c Tune hyperparameters 4 Testing the model out of sample Note, that the progression to a good ML model is not necessarily linear and some steps (except the out of sample test) may require reiteration as we learn about the problem at hand. 
databases containing calculated properties of over 625k materials510 large computational DFT database, which consists of more than 20 M off equilibrium conformations for 57.5k small organic molecules511,512 ANI-1x contains multiple QM properties from 5 M DFT calculations, while ANI-1ccx contains 500k data points obtained with an accurate CCSD/CBS extrapolation513 measured binding affinities focusing on interactions of proteins considered to be candidates as drug-targets; 1 200 000 binding data for 5500 proteins and over 520 000 drug-like molecules514 contains ∼10 000 000 molecular motifs of potential interest which cover small molecule organic photovoltaics and oligomer sequences for polymeric materials515 database containing over 4700 porous structures of metal−organic frameworks with publicly available atomic coordinates; includes important physical and chemical properties516 experimental and calculated hydration free energies for neutral molecules in water517 GDB-11, GDB-13, and GDB-17; together these databases contain billions of small organic molecules following simple chemical stability and synthetic feasibility rules518 contains approximately 1 M zeolite structures519 data sets in this package range in size from 150k to nearly 1 M conformational geometries; all trajectories are calculated at a temperature of 500 K and a resolution of 0.5 fs372 contains data on the properties of over 700k compounds521 1.2 M molecular relaxations with results from over 250 M DFT calculations relevant for renewable energy storage522 consists of DFT predicted crystallographic parameters and formation energies for over 200k experimentally observed crystal structures523 provides 221 million molecular structures optimized with the PM6 method and several electronic properties computed at the same level of theory524 provides ∼3 million molecular structures optimized by DFT and excited states for over 2 million molecules using TD-DFT525 comprehensive data set of 42 physicochemical 
properties for ∼4.2 M equilibrium and nonequilibrium structures of small organic molecules with up to seven non-hydrogen (C, N, O, S, Cl) atoms526 geometric, energetic, electronic, and thermodynamic properties for 134k stable small organic molecules out of GDB-17527 with a single reference calculation.495 von Lilienfeld and coworkers have investigated how the choice of regressors and molecular representations for ML models impacts accuracy, and their findings suggest ways that ML models may be trained to be more accurate and less computationally expensive than hybrid DFT methods.496 Burke and co-workers have studied how ML methods can result in improved understanding and more physical exact KS-DFT181,497−499 and OFDFT functionals.161 Brockherde et al. have presented an approach, where ML models can directly learn the Hohenberg−Kohn map from the one-body potential efficiently to find the functional and its derivative.162,184 Akashi and co-workers have also reported the out-of-training transferability of NNs that capture total energies, which shows a path forward to generalizable methods.500 Toward predictive insights, there are many other approaches that are broadly useful. 
For example, ML models can obtain knowledge from failed experimental data more reliably than humans who are more susceptible to survivor bias,545 and it can also be used to distill physical laws and fundamental equations using experimental 363 and computational data.546 ML models can also be used to reliably predict SMILES representations (a string-based representation of molecular graphs) that allow encoded information to be derived from low-resolution images found in the literature.547 ML models can interpret experimental X-ray absorption near edge structure data and predict real space information about coordination environments.548 Likewise, scanning tunneling microscopy data can be used to classify structural and rotational states on surfaces,549 and name indicators can be used to predict in tandem mass As a consequence, generative models offer exciting new avenues in drug and materials design.596,597 Generative methods in CompChem include recurrent neural networks , which can be used for the sequential generation of molecules encoded as SMILES strings.598−600 Segler et al. demonstrated how such a recurrent model can first learn general molecular motifs and then be fine-tuned to sample molecules exhibiting activity against a variety of medical targets.599 Autoencoders are another frequently used ML method for molecular generation. drive generative processes toward certain objectives, allowing for the targeted generation of molecules with particular properties.611−614 RL in general is a promising alternative strategy for generative models,615,616 and they offer the possibility for tight integration into drug design cycles.617 Alternative approaches combine autoregressive models with graph convolution networks.618,619 While these methods use SMILES or graphs to encode molecular structures, generative models have recently been extended to operate on 3D coordinates of molecules and materials.620,621 Gebauer et al. 
proposed an autoregressive generative model based on the SchNet architecture, called gSchNet.622 Once trained on the QM9 data set, g-SchNet was able to generate equilibrium structures without the need for optimization procedures. For other physical insights, new approaches by Kulik, Getman, and co-workers have also focused on developing ML models appropriate for elucidating complex d-orbital participation in homogeneous catalysis.690 Rappe and co-workers have used regularized random forests to analyze how local chemical pressure effects adsorbate states on surface sites for the hydrogen evolution reaction.691 Almost trivially simple ML approaches can be used in catalysis studies to deduce insights into interaction trends between single metal atoms and oxide supports,692 to identify the significance of features , where CompChem theories break down,693 or they can be used to identify trends that result in optimal catalysis across multiple objectives, such as activity and cost .694 ML is also opening opportunities for CompChem+ML studies on highly detailed and complex networks of reactions.695−700 Such models in principle can then significantly extend the range of utility of microkinetics modeling for predictions of products from catalysis.701,702 ML also enables studies of complicated reaction networks that can allow predictions of regioselective products based on CompChem data,703 asymmetric catalysis important for natural product
'''
import subprocess
import sys
from collections import Counter

import spacy


def setup():
    """Download the spaCy transformer model; return True on success."""
    try:
        # Use the running interpreter so the model installs into the active
        # environment; pass "download" and the model name as separate argv
        # items, and set check=True so a failed download actually raises.
        subprocess.run(
            [sys.executable, "-m", "spacy", "download", "en_core_web_trf"],
            check=True,
        )
        return True
    except (subprocess.CalledProcessError, OSError):
        return False
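Since setup() shells out to the downloader unconditionally, a common refinement is to skip the download when the model is already importable. The sketch below uses only the standard library; `model_installed` and `ensure_model` are hypothetical helpers, not part of this file (spaCy models install as regular Python packages, which is why `find_spec` can detect them).

```python
import importlib.util
import subprocess
import sys


def model_installed(name="en_core_web_trf"):
    # spaCy models are installed as importable packages, so find_spec
    # returns a spec (truthy) when the model is present.
    return importlib.util.find_spec(name) is not None


def ensure_model(name="en_core_web_trf"):
    # Hypothetical convenience wrapper: download only when missing.
    if model_installed(name):
        return True
    result = subprocess.run([sys.executable, "-m", "spacy", "download", name])
    return result.returncode == 0
```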
def summarize(text):
    nlp = spacy.load("en_core_web_trf")
    doc = nlp(text)
    # Frequency of each content word (alphabetic, non-stopword),
    # normalized so the most frequent word scores 1.0.
    word_freq = Counter(
        word.text.lower() for word in doc if word.is_alpha and not word.is_stop
    )
    max_freq = max(word_freq.values(), default=1)
    word_freq = {word: freq / max_freq for word, freq in word_freq.items()}
    # Score each sentence by the summed frequency of its words,
    # normalized by sentence length so long sentences are not favored.
    sentences = []
    for i, sentence in enumerate(doc.sents):
        sentence_score = sum(
            word_freq.get(word.text.lower(), 0) for word in sentence if word.is_alpha
        )
        normalized_score = sentence_score / (len(sentence) + 1e-5)  # avoid division by zero
        sentences.append((i, sentence.text.strip(), normalized_score))
    # Keep the top-scoring sentences, then restore their original order.
    sorted_sentences = sorted(sentences, key=lambda x: -x[2])
    num_sentences = 75  # number of sentences to keep in the summary
    selected_sentences = sorted(sorted_sentences[:num_sentences], key=lambda x: x[0])
    return " ".join(sentence[1] for sentence in selected_sentences)
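The scoring scheme in summarize() can be illustrated without spaCy. The dependency-free sketch below (`tiny_summarize` is not part of this file) applies the same idea — normalized word frequencies, length-normalized sentence scores, top-k selection with original order restored — but uses naive regex tokenization and omits stop-word filtering, so it is a toy illustration rather than a replacement.

```python
import re
from collections import Counter


def tiny_summarize(text, num_sentences=2):
    # Naive sentence split on terminal punctuation; spaCy's doc.sents
    # handles abbreviations and edge cases far better.
    sents = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    # Word frequencies over the whole text, normalized to max 1.0.
    words = re.findall(r"[a-z]+", text.lower())
    freq = Counter(words)
    top = max(freq.values(), default=1)
    freq = {w: f / top for w, f in freq.items()}
    # Length-normalized sentence scores, keeping the original index.
    scored = []
    for i, s in enumerate(sents):
        toks = re.findall(r"[a-z]+", s.lower())
        score = sum(freq.get(t, 0) for t in toks) / (len(toks) + 1e-5)
        scored.append((i, s, score))
    # Take the top-scoring sentences, then restore document order.
    best = sorted(sorted(scored, key=lambda x: -x[2])[:num_sentences],
                  key=lambda x: x[0])
    return " ".join(s for _, s, _ in best)
```

For instance, on "Cats are great. Cats purr. Dogs bark loudly sometimes." the word "cats" dominates the frequency table, so the two cat sentences outscore the dog sentence and are returned in their original order.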