{ "nbformat": 4, "nbformat_minor": 0, "metadata": { "colab": { "provenance": [], "gpuType": "T4" }, "kernelspec": { "name": "python3", "display_name": "Python 3" }, "language_info": { "name": "python" }, "accelerator": "GPU" }, "cells": [ { "cell_type": "markdown", "source": [ "# The NLP Pipeline" ], "metadata": { "id": "BK6CXSOk6K-X" } }, { "cell_type": "markdown", "source": [ "![NLP Pipeline](https://images.prismic.io/turing/65980b21531ac2845a272614_Natural_language_processing_pipeline_e3608ff95c.webp?auto=format,compress)" ], "metadata": { "id": "-Q-ktKxN6QFc" } }, { "cell_type": "markdown", "source": [ "## 1. Sentence Segmentation 💬\n", "\n", "This initial step involves **breaking down raw text into individual sentences**. It's crucial because many subsequent NLP tasks operate at the sentence level. We'll use a library like **NLTK's `punkt` tokenizer**, which is trained to recognize sentence boundaries.\n", "\n", "* **Colab Demo:**\n", " * **Input:** A multi-sentence paragraph.\n", " * **Code:** `from nltk.tokenize import sent_tokenize; text = \"Your sample paragraph here.\"; sentences = sent_tokenize(text)`\n", " * **Output:** A list where each element is a separate sentence." ], "metadata": { "id": "4E-bWQxq6unO" } }, { "cell_type": "markdown", "source": [ "## 2. Word Tokenization 🏷️\n", "\n", "After segmentation, we further break down each sentence into **individual words or \"tokens.\"** Punctuation is often treated as separate tokens. This process creates the fundamental units of text that NLP models will analyze. We'll again use an NLTK tokenizer for this.\n", "\n", "* **Colab Demo:**\n", " * **Input:** A single sentence (from the previous step's output).\n", " * **Code:** `from nltk.tokenize import word_tokenize; sentence = \"Your sample sentence here.\"; words = word_tokenize(sentence)`\n", " * **Output:** A list of individual words and punctuation marks." ], "metadata": { "id": "AuhW2n-h7jHe" } }, { "cell_type": "markdown", "source": [ "## 3. 
Stemming 🌳\n", "\n", "Stemming is a basic technique to **reduce words to their root or \"stem\" form** by stripping suffixes with fixed rules. The resulting stem might not be a linguistically valid word (for example, \"natural\" becomes \"natur\"), but it helps group together variations of a word. It's often used for information retrieval. We'll demonstrate with **NLTK's Porter Stemmer**.\n", "\n", "* **Colab Demo:**\n", " * **Input:** A list of words (e.g., \"running,\" \"runs,\" \"ran,\" \"runner\").\n", " * **Code:** `from nltk.stem import PorterStemmer; stemmer = PorterStemmer(); stemmed_words = [stemmer.stem(word) for word in words_list]`\n", " * **Output:** The list of words with their stemmed versions (e.g., \"run,\" \"run,\" \"ran,\" \"runner\")." ], "metadata": { "id": "uN-UpLCq7oUf" } }, { "cell_type": "markdown", "source": [ "## 4. Lemmatization\n", "\n", "Lemmatization is a more sophisticated process that **reduces words to their base or dictionary form (lemma)**. Unlike a stem, a lemma is a valid dictionary word. It uses morphological analysis and often requires knowing the word's part of speech for accuracy. We'll use **NLTK's WordNetLemmatizer**, which treats every word as a noun unless a `pos` argument is given.\n", "\n", "* **Colab Demo:**\n", " * **Input:** A list of words (e.g., \"better,\" \"cars,\" \"geese,\" \"ran\").\n", " * **Code:** `from nltk.stem import WordNetLemmatizer; lemmatizer = WordNetLemmatizer(); lemmatized_words = [lemmatizer.lemmatize(word) for word in words_list]`\n", " * **Output:** With the default noun setting, the result is \"better,\" \"car,\" \"goose,\" \"ran.\" Passing the part of speech, as in `lemmatizer.lemmatize(\"better\", pos=\"a\")` and `lemmatizer.lemmatize(\"ran\", pos=\"v\")`, yields \"good\" and \"run.\"" ], "metadata": { "id": "hQ7ZDOs37s5F" } }, { "cell_type": "markdown", "source": [ "## 5. Stop Word Removal 🚫\n", "\n", "**Stop words are common words** (like \"the,\" \"a,\" \"is,\" \"and\") that often carry little significant meaning and can be removed without losing much context. Removing them helps reduce noise and focus on more important terms for analysis. 
We'll use NLTK's predefined list of English stop words.\n", "\n", "* **Colab Demo:**\n", " * **Input:** A sentence or list of words containing stop words.\n", " * **Code:** `from nltk.corpus import stopwords; stop_words = set(stopwords.words('english')); filtered_words = [word for word in word_list if word.lower() not in stop_words]`\n", " * **Output:** The list of words with stop words removed." ], "metadata": { "id": "fRjDQklN7xbz" } }, { "cell_type": "markdown", "source": [], "metadata": { "id": "oOrr6EPQ7z5h" } }, { "cell_type": "markdown", "source": [ "## 6. Dependency Parsing 🔗\n", "\n", "Dependency parsing analyzes the **grammatical relationships between words in a sentence**. Each word is attached to a head word by a labeled relation (for example, `nsubj` or `det`), and together these links form a tree with a single `ROOT` token. This helps us understand the syntactic structure and how words relate to each other's meanings. We'll use **spaCy** for its efficient dependency parser.\n", "\n", "* **Colab Demo:**\n", " * **Input:** A simple sentence.\n", " * **Code:** `import spacy; nlp = spacy.load(\"en_core_web_sm\"); doc = nlp(\"The quick brown fox jumps over the lazy dog.\")`, followed on its own line by the loop `for token in doc: print(f\"{token.text} -- {token.dep_} -- {token.head.text}\")`\n", " * **Output:** One line per token showing the word, its dependency relation, and its head word. spaCy's built-in visualizer, `displacy.render(doc, style=\"dep\")`, can draw the same tree." ], "metadata": { "id": "HR6r2uLy75JP" } }, { "cell_type": "markdown", "source": [ "## 7. Part-of-Speech Tagging 🏷️\n", "\n", "Part-of-Speech (POS) tagging is the process of **assigning a grammatical category to each word** in a sentence. In spaCy's coarse-grained scheme (`token.pos_`) this includes tags like noun (`NOUN`), verb (`VERB`), adjective (`ADJ`), and adverb (`ADV`); the fine-grained Penn Treebank tags such as NN, VB, JJ, and RB are available as `token.tag_`. It's a fundamental step that helps subsequent analyses understand the role of each word. 
We'll use **spaCy** for this.\n", "\n", "* **Colab Demo:**\n", " * **Input:** A sentence.\n", " * **Code:** `import spacy; nlp = spacy.load(\"en_core_web_sm\"); doc = nlp(\"The quick brown fox jumps over the lazy dog.\")`, followed on its own line by the loop `for token in doc: print(f\"{token.text} -- {token.pos_}\")`\n", " * **Output:** Each word followed by its assigned POS tag." ], "metadata": { "id": "n87eCqiG752y" } }, { "cell_type": "markdown", "source": [ "---" ], "metadata": { "id": "2WUj-RLl8EyW" } }, { "cell_type": "markdown", "source": [ "# Natural Language Processing (NLP) Pipeline & Sentiment Analysis Demo" ], "metadata": { "id": "x7L3ZGft8GEi" } }, { "cell_type": "markdown", "source": [ "This Google Colab notebook demonstrates fundamental steps in a Natural Language Processing (NLP) pipeline,\n", "followed by a practical example of sentiment analysis.\n", "\n", "We will cover:\n", "1. **NLP Pipeline Steps:**\n", " * Sentence Segmentation\n", " * Word Tokenization\n", " * Stemming\n", " * Lemmatization\n", " * Stop Word Removal\n", " * Dependency Parsing\n", " * Part-of-Speech Tagging\n", "2. **Sentiment Analysis:**\n", " * Using NLTK's VADER (Valence Aware Dictionary and sEntiment Reasoner)\n", "\n", "Let's get started!" 
], "metadata": { "id": "ujVtZFGRGSxI" } }, { "cell_type": "code", "source": [ "%%bash\n", "pip install nltk huggingface_hub transformers spacy gensim fastai==2.7.12 fastcore==1.5.29 inltk==0.5.1\n", "\n", "# Download necessary NLTK data\n", "python -c \"import nltk; nltk.download('punkt'); nltk.download('wordnet'); nltk.download('stopwords'); nltk.download('averaged_perceptron_tagger'); nltk.download('vader_lexicon')\"\n", "\n", "# Download necessary spaCy model\n", "python -m spacy download en_core_web_sm" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "Yffv35qsBiqf", "outputId": "d070ca40-b633-4e85-d7b5-882b149ee2c1" }, "execution_count": 2, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "Requirement already satisfied: nltk in /usr/local/lib/python3.11/dist-packages (3.9.1)\n", "Requirement already satisfied: huggingface_hub in /usr/local/lib/python3.11/dist-packages (0.33.4)\n", "Requirement already satisfied: transformers in /usr/local/lib/python3.11/dist-packages (4.53.2)\n", "Requirement already satisfied: spacy in /usr/local/lib/python3.11/dist-packages (3.8.7)\n", "Collecting gensim\n", " Downloading gensim-4.3.3-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (8.1 kB)\n", "Collecting fastai==2.7.12\n", " Downloading fastai-2.7.12-py3-none-any.whl.metadata (9.6 kB)\n", "Collecting fastcore==1.5.29\n", " Downloading fastcore-1.5.29-py3-none-any.whl.metadata (3.5 kB)\n", "Collecting en-core-web-sm==3.8.0\n", " Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)\n", " ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12.8/12.8 MB 106.5 MB/s eta 0:00:00\n", "\u001b[38;5;2m✔ Download and installation successful\u001b[0m\n", "You can now load the package via spacy.load('en_core_web_sm')\n", "\u001b[38;5;3m⚠ Restart to reload dependencies\u001b[0m\n", "If you are in a Jupyter or Colab notebook, you may need to 
restart Python in\n", "order to load all the package's dependencies. You can do this by selecting the\n", "'Restart kernel' or 'Restart runtime' option.\n" ] }, { "output_type": "stream", "name": "stderr", "text": [ "ERROR: Could not find a version that satisfies the requirement inltk==0.5.1 (from versions: 0.0.2, 0.0.3, 0.0.4, 0.0.5, 0.0.6, 0.0.7, 0.0.8, 0.1.0, 0.2.0, 0.3.0, 0.4.0, 0.5.0, 0.6.0, 0.6.1, 0.7, 0.7.1, 0.7.2, 0.7.3, 0.7.4, 0.7.5, 0.8, 0.8.1, 0.9)\n", "ERROR: No matching distribution found for inltk==0.5.1\n", "[nltk_data] Downloading package punkt to /root/nltk_data...\n", "[nltk_data] Unzipping tokenizers/punkt.zip.\n", "[nltk_data] Downloading package wordnet to /root/nltk_data...\n", "[nltk_data] Downloading package stopwords to /root/nltk_data...\n", "[nltk_data] Unzipping corpora/stopwords.zip.\n", "[nltk_data] Downloading package averaged_perceptron_tagger to\n", "[nltk_data] /root/nltk_data...\n", "[nltk_data] Unzipping taggers/averaged_perceptron_tagger.zip.\n", "[nltk_data] Downloading package vader_lexicon to /root/nltk_data...\n" ] } ] }, { "cell_type": "markdown", "metadata": { "id": "4ecccc97" }, "source": [ "## NLP Pipeline Demonstration (English)" ] }, { "cell_type": "code", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "c71fac50", "outputId": "436a9190-607a-4e6f-e51a-1d555776f6b4" }, "source": [ "import nltk\n", "import spacy\n", "from nltk.tokenize import sent_tokenize, word_tokenize\n", "from nltk.stem import PorterStemmer, WordNetLemmatizer\n", "from nltk.corpus import stopwords\n", "from nltk.sentiment.vader import SentimentIntensityAnalyzer\n", "\n", "# Download necessary NLTK data (already done in setup but good to have here too)\n", "# These lines are typically run once in the setup cell.\n", "# For this standalone snippet, uncomment if running in a fresh environment\n", "# nltk.download('punkt')\n", "# nltk.download('wordnet')\n", "# nltk.download('stopwords')\n", "# 
nltk.download('averaged_perceptron_tagger') # Needed for default POS tagging with NLTK, but we use spaCy here\n", "# nltk.download('vader_lexicon')\n", "# nltk.download('punkt_tab') # Required by sent_tokenize/word_tokenize on NLTK 3.9 and later, which load 'punkt_tab' instead of 'punkt'\n", "print(\"NLTK data resources checked/downloaded...\")\n", "\n", "\n", "# Load spaCy model\n", "try:\n", " nlp_en = spacy.load(\"en_core_web_sm\")\n", " print(\"\\nEnglish Core Web spaCy Model Loaded successfully.\\n\")\n", "except OSError:\n", " print(\"SpaCy model 'en_core_web_sm' not found. Downloading...\")\n", " !python -m spacy download en_core_web_sm\n", " nlp_en = spacy.load(\"en_core_web_sm\")\n", " print(\"\\nEnglish Core Web spaCy Model Downloaded and Loaded.\\n\")\n", "\n", "\n", "text_en = \"\"\"Natural Language Processing (NLP) is a fascinating and rapidly evolving field at the intersection of computer science, artificial intelligence, and linguistics. At its core, NLP enables computers to understand, interpret, and generate human language in a valuable and meaningful way. It's about bridging the communication gap between humans and machines, allowing us to interact with technology using our most natural form of expression: language.\\n\n", "The utility of NLP spans a vast array of applications that touch our daily lives, often without us even realizing it. From the moment you ask a virtual assistant a question, to the automatic translation of a webpage, or even the spam filter protecting your inbox, NLP is hard at work. It's the engine behind search engines that understand your queries, recommendation systems that suggest content, and grammar checkers that refine your writing.\\n\n", "One of NLP's crucial applications is in sentiment analysis, where it determines the emotional tone behind a piece of text. Businesses use this to gauge customer feedback from social media, reviews, and surveys, allowing them to understand public perception of their products or services. 
This insight is invaluable for strategic decision-up making, product development, and customer relationship management.\\n\n", "Machine translation is another cornerstone of NLP, breaking down language barriers across the globe. Services like Google Translate utilize sophisticated NLP models to convert text or speech from one language to another, facilitating international communication, trade, and cultural exchange.11 While still imperfect, these systems are constantly improving, striving for more nuanced and contextually accurate translations.\\n\n", "The rise of chatbots and virtual assistants is heavily reliant on NLP. These AI-powered entities process user queries, understand their intent, and generate coherent and relevant responses, simulating human-like conversation.14 They are increasingly deployed in customer service, healthcare, and education, providing instant support and information, thereby enhancing user experience and operational efficiency.\\n\n", "NLP also plays a pivotal role in information extraction, where it identifies and pulls specific data points from unstructured text. This can involve extracting names, dates, locations, or key facts from legal documents, research papers, or news articles. It transforms vast quantities of raw text into structured, actionable data, significantly reducing the manual effort required for data analysis and knowledge discovery.\\n\n", "The importance of NLP cannot be overstated in today's data-driven world. As the volume of digital text data explodes, NLP provides the tools to make sense of this information, transforming it into valuable insights. It empowers organizations to automate tasks, improve decision-making, enhance customer interactions, and uncover hidden patterns in textual data that would otherwise be impossible to analyze at scale.\\n\n", "Furthermore, NLP is critical for accessibility and inclusion. 
By enabling text-to-speech and speech-to-text functionalities, it assists individuals with disabilities in accessing information and communicating more effectively. It also helps bridge linguistic divides, allowing people from different language backgrounds to interact and share knowledge seamlessly.\\n\n", "The advancements in NLP are largely driven by breakthroughs in machine learning and deep learning, particularly with the advent of transformer models like BERT, GPT, and others. These models have revolutionized the field, pushing the boundaries of what's possible in language understanding and generation, leading to more accurate translations, more coherent text generation, and more sophisticated conversational AI.\\n\n", "In conclusion, NLP is not just a technological innovation; it's a transformative force that is reshaping how humans interact with technology and each other. Its continuous evolution promises to unlock even more sophisticated applications, further integrating intelligent language capabilities into every facet of our digital and real-world experiences, making information more accessible and interactions more intuitive.\\n\"\"\"\n", "\n", "# --- Original Text Display ---\n", "print(\"\\n\" + \"=\"*50)\n", "print(\" ORIGINAL TEXT\")\n", "print(\"=\"*50)\n", "print(f\"\\n{text_en}\\n\")\n", "print(\"=\"*50)\n", "\n", "\n", "# --- 1. Sentence Segmentation ---\n", "print(\"\\n\\n\" + \"=\"*50)\n", "print(\" 1. SENTENCE SEGMENTATION\")\n", "print(\"=\"*50)\n", "sentences_en = sent_tokenize(text_en)\n", "print(\"\\nDetected Sentences:\")\n", "for i, sentence in enumerate(sentences_en):\n", " print(f\" [{i+1}] {sentence}\")\n", "print(\"-\" * 50)\n", "\n", "\n", "# --- 2. Word Tokenization ---\n", "print(\"\\n\\n\" + \"=\"*50)\n", "print(\" 2. 
WORD TOKENIZATION\")\n", "print(\"=\"*50)\n", "# Using the first sentence for demonstration\n", "words_en = word_tokenize(sentences_en[0])\n", "print(f\"\\nSentence for Tokenization: '{sentences_en[0]}'\")\n", "print(f\"Tokens: {words_en}\")\n", "print(\"-\" * 50)\n", "\n", "\n", "# --- 3. Stemming (using Porter Stemmer) ---\n", "print(\"\\n\\n\" + \"=\"*50)\n", "print(\" 3. STEMMING (Porter Stemmer)\")\n", "print(\"=\"*50)\n", "stemmer = PorterStemmer()\n", "stemmed_words_en = [stemmer.stem(word) for word in words_en]\n", "print(f\"\\nOriginal Tokens: {words_en}\")\n", "print(f\"Stemmed Tokens: {stemmed_words_en}\")\n", "print(\"-\" * 50)\n", "\n", "\n", "# --- 4. Lemmatization (using WordNetLemmatizer) ---\n", "print(\"\\n\\n\" + \"=\"*50)\n", "print(\" 4. LEMMATIZATION (WordNetLemmatizer)\")\n", "print(\"=\"*50)\n", "lemmatizer = WordNetLemmatizer()\n", "# Note: Lemmatization often benefits from POS tagging for accuracy\n", "# For a simple demo, we'll just use the default (noun)\n", "lemmatized_words_en = [lemmatizer.lemmatize(word) for word in words_en]\n", "print(f\"\\nOriginal Tokens: {words_en}\")\n", "print(f\"Lemmatized Tokens: {lemmatized_words_en}\")\n", "print(\"-\" * 50)\n", "\n", "\n", "# --- 5. Stop Word Analysis ---\n", "print(\"\\n\\n\" + \"=\"*50)\n", "print(\" 5. STOP WORD REMOVAL\")\n", "print(\"=\"*50)\n", "stop_words_en = set(stopwords.words('english'))\n", "filtered_words_en = [word for word in words_en if word.lower() not in stop_words_en]\n", "print(f\"\\nOriginal Tokens: {words_en}\")\n", "print(f\"Tokens after Stop Word Removal: {filtered_words_en}\")\n", "print(\"-\" * 50)\n", "\n", "\n", "# --- 6. Dependency Parsing (using spaCy) ---\n", "print(\"\\n\\n\" + \"=\"*50)\n", "print(\" 6. 
DEPENDENCY PARSING (using spaCy)\")\n", "print(\"=\"*50)\n", "# Corrected: Process the first sentence (a string) with spaCy\n", "doc_en_parsed = nlp_en(sentences_en[0])\n", "print(f\"\\nSentence for Dependency Parsing: '{sentences_en[0]}'\")\n", "print(\"\\n{:<15} {:<20} {:<15} {:<10}\".format(\"Word\", \"Dependency Relation\", \"Head Word\", \"Head POS\"))\n", "print(\"-\" * 70) # Adjusted length for better visual separation\n", "for token in doc_en_parsed:\n", " print(f\"{token.text:<15} {token.dep_:<20} {token.head.text:<15} {token.head.pos_:<10}\")\n", "print(\"-\" * 70)\n", "\n", "\n", "# --- 7. Part-of-Speech Tagging (using spaCy) ---\n", "print(\"\\n\\n\" + \"=\"*50)\n", "print(\" 7. PART-OF-SPEECH TAGGING (using spaCy)\")\n", "print(\"=\"*50)\n", "# Using the same doc_en_parsed from the previous step for consistency\n", "print(f\"\\nSentence for POS Tagging: '{sentences_en[0]}'\")\n", "print(\"\\n{:<15} {:<15} {:<25}\".format(\"Word\", \"POS Tag\", \"Explanation\"))\n", "print(\"-\" * 55)\n", "for token in doc_en_parsed:\n", " print(f\"{token.text:<15} {token.pos_:<15} {spacy.explain(token.pos_):<25}\")\n", "print(\"-\" * 55)\n", "\n", "print(\"\\n\\n\" + \"=\"*50)\n", "print(\" NLP Pipeline Demo Complete!\")\n", "print(\"=\"*50)\n" ], "execution_count": 1, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "NLTK data resources checked/downloaded...\n", "\n", "English Core Web spaCy Model Loaded successfully.\n", "\n", "\n", "==================================================\n", " ORIGINAL TEXT\n", "==================================================\n", "\n", "Natural Language Processing (NLP) is a fascinating and rapidly evolving field at the intersection of computer science, artificial intelligence, and linguistics. At its core, NLP enables computers to understand, interpret, and generate human language in a valuable and meaningful way. 
It's about bridging the communication gap between humans and machines, allowing us to interact with technology using our most natural form of expression: language.\n", "\n", "The utility of NLP spans a vast array of applications that touch our daily lives, often without us even realizing it. From the moment you ask a virtual assistant a question, to the automatic translation of a webpage, or even the spam filter protecting your inbox, NLP is hard at work. It's the engine behind search engines that understand your queries, recommendation systems that suggest content, and grammar checkers that refine your writing.\n", "\n", "One of NLP's crucial applications is in sentiment analysis, where it determines the emotional tone behind a piece of text. Businesses use this to gauge customer feedback from social media, reviews, and surveys, allowing them to understand public perception of their products or services. This insight is invaluable for strategic decision-up making, product development, and customer relationship management.\n", "\n", "Machine translation is another cornerstone of NLP, breaking down language barriers across the globe. Services like Google Translate utilize sophisticated NLP models to convert text or speech from one language to another, facilitating international communication, trade, and cultural exchange.11 While still imperfect, these systems are constantly improving, striving for more nuanced and contextually accurate translations.\n", "\n", "The rise of chatbots and virtual assistants is heavily reliant on NLP. 
These AI-powered entities process user queries, understand their intent, and generate coherent and relevant responses, simulating human-like conversation.14 They are increasingly deployed in customer service, healthcare, and education, providing instant support and information, thereby enhancing user experience and operational efficiency.\n", "\n", "NLP also plays a pivotal role in information extraction, where it identifies and pulls specific data points from unstructured text. This can involve extracting names, dates, locations, or key facts from legal documents, research papers, or news articles. It transforms vast quantities of raw text into structured, actionable data, significantly reducing the manual effort required for data analysis and knowledge discovery.\n", "\n", "The importance of NLP cannot be overstated in today's data-driven world. As the volume of digital text data explodes, NLP provides the tools to make sense of this information, transforming it into valuable insights. It empowers organizations to automate tasks, improve decision-making, enhance customer interactions, and uncover hidden patterns in textual data that would otherwise be impossible to analyze at scale.\n", "\n", "Furthermore, NLP is critical for accessibility and inclusion. By enabling text-to-speech and speech-to-text functionalities, it assists individuals with disabilities in accessing information and communicating more effectively. It also helps bridge linguistic divides, allowing people from different language backgrounds to interact and share knowledge seamlessly.\n", "\n", "The advancements in NLP are largely driven by breakthroughs in machine learning and deep learning, particularly with the advent of transformer models like BERT, GPT, and others. 
These models have revolutionized the field, pushing the boundaries of what's possible in language understanding and generation, leading to more accurate translations, more coherent text generation, and more sophisticated conversational AI.\n", "\n", "In conclusion, NLP is not just a technological innovation; it's a transformative force that is reshaping how humans interact with technology and each other. Its continuous evolution promises to unlock even more sophisticated applications, further integrating intelligent language capabilities into every facet of our digital and real-world experiences, making information more accessible and interactions more intuitive.\n", "\n", "\n", "==================================================\n", "\n", "\n", "==================================================\n", " 1. SENTENCE SEGMENTATION\n", "==================================================\n", "\n", "Detected Sentences:\n", " [1] Natural Language Processing (NLP) is a fascinating and rapidly evolving field at the intersection of computer science, artificial intelligence, and linguistics.\n", " [2] At its core, NLP enables computers to understand, interpret, and generate human language in a valuable and meaningful way.\n", " [3] It's about bridging the communication gap between humans and machines, allowing us to interact with technology using our most natural form of expression: language.\n", " [4] The utility of NLP spans a vast array of applications that touch our daily lives, often without us even realizing it.\n", " [5] From the moment you ask a virtual assistant a question, to the automatic translation of a webpage, or even the spam filter protecting your inbox, NLP is hard at work.\n", " [6] It's the engine behind search engines that understand your queries, recommendation systems that suggest content, and grammar checkers that refine your writing.\n", " [7] One of NLP's crucial applications is in sentiment analysis, where it determines the emotional tone behind a 
piece of text.\n", " [8] Businesses use this to gauge customer feedback from social media, reviews, and surveys, allowing them to understand public perception of their products or services.\n", " [9] This insight is invaluable for strategic decision-up making, product development, and customer relationship management.\n", " [10] Machine translation is another cornerstone of NLP, breaking down language barriers across the globe.\n", " [11] Services like Google Translate utilize sophisticated NLP models to convert text or speech from one language to another, facilitating international communication, trade, and cultural exchange.11 While still imperfect, these systems are constantly improving, striving for more nuanced and contextually accurate translations.\n", " [12] The rise of chatbots and virtual assistants is heavily reliant on NLP.\n", " [13] These AI-powered entities process user queries, understand their intent, and generate coherent and relevant responses, simulating human-like conversation.14 They are increasingly deployed in customer service, healthcare, and education, providing instant support and information, thereby enhancing user experience and operational efficiency.\n", " [14] NLP also plays a pivotal role in information extraction, where it identifies and pulls specific data points from unstructured text.\n", " [15] This can involve extracting names, dates, locations, or key facts from legal documents, research papers, or news articles.\n", " [16] It transforms vast quantities of raw text into structured, actionable data, significantly reducing the manual effort required for data analysis and knowledge discovery.\n", " [17] The importance of NLP cannot be overstated in today's data-driven world.\n", " [18] As the volume of digital text data explodes, NLP provides the tools to make sense of this information, transforming it into valuable insights.\n", " [19] It empowers organizations to automate tasks, improve decision-making, enhance customer 
interactions, and uncover hidden patterns in textual data that would otherwise be impossible to analyze at scale.\n", " [20] Furthermore, NLP is critical for accessibility and inclusion.\n", " [21] By enabling text-to-speech and speech-to-text functionalities, it assists individuals with disabilities in accessing information and communicating more effectively.\n", " [22] It also helps bridge linguistic divides, allowing people from different language backgrounds to interact and share knowledge seamlessly.\n", " [23] The advancements in NLP are largely driven by breakthroughs in machine learning and deep learning, particularly with the advent of transformer models like BERT, GPT, and others.\n", " [24] These models have revolutionized the field, pushing the boundaries of what's possible in language understanding and generation, leading to more accurate translations, more coherent text generation, and more sophisticated conversational AI.\n", " [25] In conclusion, NLP is not just a technological innovation; it's a transformative force that is reshaping how humans interact with technology and each other.\n", " [26] Its continuous evolution promises to unlock even more sophisticated applications, further integrating intelligent language capabilities into every facet of our digital and real-world experiences, making information more accessible and interactions more intuitive.\n", "--------------------------------------------------\n", "\n", "\n", "==================================================\n", " 2. 
WORD TOKENIZATION\n", "==================================================\n", "\n", "Sentence for Tokenization: 'Natural Language Processing (NLP) is a fascinating and rapidly evolving field at the intersection of computer science, artificial intelligence, and linguistics.'\n", "Tokens: ['Natural', 'Language', 'Processing', '(', 'NLP', ')', 'is', 'a', 'fascinating', 'and', 'rapidly', 'evolving', 'field', 'at', 'the', 'intersection', 'of', 'computer', 'science', ',', 'artificial', 'intelligence', ',', 'and', 'linguistics', '.']\n", "--------------------------------------------------\n", "\n", "\n", "==================================================\n", " 3. STEMMING (Porter Stemmer)\n", "==================================================\n", "\n", "Original Tokens: ['Natural', 'Language', 'Processing', '(', 'NLP', ')', 'is', 'a', 'fascinating', 'and', 'rapidly', 'evolving', 'field', 'at', 'the', 'intersection', 'of', 'computer', 'science', ',', 'artificial', 'intelligence', ',', 'and', 'linguistics', '.']\n", "Stemmed Tokens: ['natur', 'languag', 'process', '(', 'nlp', ')', 'is', 'a', 'fascin', 'and', 'rapidli', 'evolv', 'field', 'at', 'the', 'intersect', 'of', 'comput', 'scienc', ',', 'artifici', 'intellig', ',', 'and', 'linguist', '.']\n", "--------------------------------------------------\n", "\n", "\n", "==================================================\n", " 4. 
LEMMATIZATION (WordNetLemmatizer)\n", "==================================================\n", "\n", "Original Tokens: ['Natural', 'Language', 'Processing', '(', 'NLP', ')', 'is', 'a', 'fascinating', 'and', 'rapidly', 'evolving', 'field', 'at', 'the', 'intersection', 'of', 'computer', 'science', ',', 'artificial', 'intelligence', ',', 'and', 'linguistics', '.']\n", "Lemmatized Tokens: ['Natural', 'Language', 'Processing', '(', 'NLP', ')', 'is', 'a', 'fascinating', 'and', 'rapidly', 'evolving', 'field', 'at', 'the', 'intersection', 'of', 'computer', 'science', ',', 'artificial', 'intelligence', ',', 'and', 'linguistics', '.']\n", "--------------------------------------------------\n", "\n", "\n", "==================================================\n", " 5. STOP WORD REMOVAL\n", "==================================================\n", "\n", "Original Tokens: ['Natural', 'Language', 'Processing', '(', 'NLP', ')', 'is', 'a', 'fascinating', 'and', 'rapidly', 'evolving', 'field', 'at', 'the', 'intersection', 'of', 'computer', 'science', ',', 'artificial', 'intelligence', ',', 'and', 'linguistics', '.']\n", "Tokens after Stop Word Removal: ['Natural', 'Language', 'Processing', '(', 'NLP', ')', 'fascinating', 'rapidly', 'evolving', 'field', 'intersection', 'computer', 'science', ',', 'artificial', 'intelligence', ',', 'linguistics', '.']\n", "--------------------------------------------------\n", "\n", "\n", "==================================================\n", " 6. 
DEPENDENCY PARSING (using spaCy)\n", "==================================================\n", "\n", "Sentence for Dependency Parsing: 'Natural Language Processing (NLP) is a fascinating and rapidly evolving field at the intersection of computer science, artificial intelligence, and linguistics.'\n", "\n", "Word Dependency Relation Head Word Head POS \n", "----------------------------------------------------------------------\n", "Natural compound Language PROPN \n", "Language compound Processing PROPN \n", "Processing nsubj is AUX \n", "( punct Processing PROPN \n", "NLP appos Processing PROPN \n", ") punct Processing PROPN \n", "is ROOT is AUX \n", "a det field NOUN \n", "fascinating amod field NOUN \n", "and cc fascinating ADJ \n", "rapidly advmod evolving VERB \n", "evolving conj fascinating ADJ \n", "field attr is AUX \n", "at prep is AUX \n", "the det intersection NOUN \n", "intersection pobj at ADP \n", "of prep intersection NOUN \n", "computer compound science NOUN \n", "science pobj of ADP \n", ", punct science NOUN \n", "artificial amod intelligence NOUN \n", "intelligence conj science NOUN \n", ", punct intelligence NOUN \n", "and cc intelligence NOUN \n", "linguistics conj intelligence NOUN \n", ". punct is AUX \n", "----------------------------------------------------------------------\n", "\n", "\n", "==================================================\n", " 7. 
PART-OF-SPEECH TAGGING (using spaCy)\n", "==================================================\n", "\n", "Sentence for POS Tagging: 'Natural Language Processing (NLP) is a fascinating and rapidly evolving field at the intersection of computer science, artificial intelligence, and linguistics.'\n", "\n", "Word POS Tag Explanation \n", "-------------------------------------------------------\n", "Natural PROPN proper noun \n", "Language PROPN proper noun \n", "Processing PROPN proper noun \n", "( PUNCT punctuation \n", "NLP PROPN proper noun \n", ") PUNCT punctuation \n", "is AUX auxiliary \n", "a DET determiner \n", "fascinating ADJ adjective \n", "and CCONJ coordinating conjunction \n", "rapidly ADV adverb \n", "evolving VERB verb \n", "field NOUN noun \n", "at ADP adposition \n", "the DET determiner \n", "intersection NOUN noun \n", "of ADP adposition \n", "computer NOUN noun \n", "science NOUN noun \n", ", PUNCT punctuation \n", "artificial ADJ adjective \n", "intelligence NOUN noun \n", ", PUNCT punctuation \n", "and CCONJ coordinating conjunction \n", "linguistics NOUN noun \n", ". 
PUNCT punctuation \n", "-------------------------------------------------------\n", "\n", "\n", "==================================================\n", " NLP Pipeline Demo Complete!\n", "==================================================\n" ] } ] }, { "cell_type": "markdown", "metadata": { "id": "a9a5ba97" }, "source": [ "### Sentiment Analysis Demonstration" ] }, { "cell_type": "code", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "343fe3ab", "outputId": "cb3b0f40-ce74-4fda-daae-faeb1765790f" }, "source": [ "from nltk.sentiment.vader import SentimentIntensityAnalyzer\n", "\n", "# --- Sentiment Analysis ---\n", "print(\"=\"*60)\n", "print(\" SENTIMENT ANALYSIS (using NLTK's VADER)\")\n", "print(\"=\"*60)\n", "analyzer_en = SentimentIntensityAnalyzer()\n", "\n", "sentences_for_sentiment_en = [\n", "    \"This is a great movie!\",\n", "    \"I really disliked that experience.\",\n", "    \"The weather is neutral today.\",\n", "    \"This product is amazing and I love it!\",\n", "    \"It was okay, nothing special.\"\n", "]\n", "\n", "# Print table header with full words\n", "print(\"\\n{:<45} {:>10} {:>10} {:>10} {:>12} {:>12}\".format(\n", "    \"Sentence\", \"Negative\", \"Neutral\", \"Positive\", \"Compound\", \"Sentiment\"\n", "))\n", "print(\"-\" * 105)  # Rule under the header row\n", "\n", "# Analyze and print each sentence in a table row\n", "for sentence in sentences_for_sentiment_en:\n", "    vs = analyzer_en.polarity_scores(sentence)\n", "    # Map the compound score to a label using the usual +/-0.05 thresholds\n", "    if vs['compound'] >= 0.05:\n", "        sentiment = 'Positive'\n", "    elif vs['compound'] <= -0.05:\n", "        sentiment = 'Negative'\n", "    else:\n", "        sentiment = 'Neutral'\n", "    print(\"{:<45} {:>10.3f} {:>10.3f} {:>10.3f} {:>12.3f} {:>12}\".format(\n", "        f\"'{sentence}'\",\n", "        vs['neg'], vs['neu'], vs['pos'], vs['compound'], sentiment\n", "    ))\n", "\n", "print(\"\\n\" + \"=\"*60)\n", "print(\" NLP Pipeline 
& Sentiment Analysis Demo Complete!\")\n", "print(\"=\"*60)" ], "execution_count": 5, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "============================================================\n", " SENTIMENT ANALYSIS (using NLTK's VADER)\n", "============================================================\n", "\n", "Sentence                                        Negative    Neutral   Positive     Compound    Sentiment\n", "---------------------------------------------------------------------------------------------------------\n", "'This is a great movie!'                           0.000      0.406      0.594        0.659     Positive\n", "'I really disliked that experience.'               0.499      0.501      0.000       -0.458     Negative\n", "'The weather is neutral today.'                    0.000      1.000      0.000        0.000      Neutral\n", "'This product is amazing and I love it!'           0.000      0.376      0.624        0.852     Positive\n", "'It was okay, nothing special.'                    0.315      0.419      0.265       -0.092     Negative\n", "\n", "============================================================\n", " NLP Pipeline & Sentiment Analysis Demo Complete!\n", "============================================================\n" ] } ] }, { "cell_type": "markdown", "source": [ "### 😃 What is Sentiment Analysis?\n", "\n", "Sentiment analysis, also known as opinion mining, is a Natural Language Processing (NLP) technique used to determine the **emotional tone** of a piece of text. Its goal is to classify the sentiment expressed in text as positive, negative, or neutral. 
This technology helps computers understand subjective information, which makes it useful for:\n", "\n", " * **Understanding customer feedback:** Analyzing reviews, social media comments, and support tickets to gauge satisfaction.\n", " * **Market research:** Tracking public opinion about products, brands, or political candidates.\n", " * **Brand monitoring:** Identifying mentions of a brand and understanding the sentiment associated with them.\n", " * **Customer service:** Prioritizing urgent or negative feedback.\n", "\n", "\n", "### 🛠️ How Sentiment Analysis Works in Our Code (using VADER)\n", "\n", "Our code uses **NLTK's VADER (Valence Aware Dictionary and sEntiment Reasoner)** for sentiment analysis. VADER is a rule-based sentiment analysis model specifically attuned to sentiments expressed in social media contexts. Unlike machine learning models, it does not rely on training data; instead it uses a lexicon (a dictionary of scored words) and a set of rules.\n", "\n", "Here's a breakdown of how it works in our code:\n", "\n", "#### 1\\. The VADER Lexicon\n", "\n", "VADER comes with a **pre-built lexicon** containing a list of words, each associated with a sentiment score (valence). For example:\n", "\n", " * \"good\" might have a positive score.\n", " * \"bad\" might have a negative score.\n", " * \"amazing\" would have a higher positive score than \"good.\"\n", "\n", "It also considers:\n", "\n", " * **Punctuation:** Exclamation marks (e.g., \"amazing\\!\\!\\!\") increase intensity.\n", " * **Capitalization:** All-caps words (e.g., \"AWFUL\") increase intensity.\n", " * **Degree modifiers and negation:** Intensifiers like \"very\" strengthen the word that follows (\"very good\" is stronger than \"good\"), while negations like \"not\" flip its polarity (\"not good\" reads as negative).\n", " * **Conjunctions:** Words like \"but\" can shift sentiment focus.\n", "\n", "#### 2\\. 
The `SentimentIntensityAnalyzer()`\n", "\n", "In our code:\n", "\n", "```python\n", "analyzer_en = SentimentIntensityAnalyzer()\n", "```\n", "\n", "This line **initializes the VADER sentiment analyzer**. It loads the VADER lexicon and rules, preparing the `analyzer_en` object to process text.\n", "\n", "#### 3\\. Analyzing Sentences with `polarity_scores()`\n", "\n", "For each sentence in the `sentences_for_sentiment_en` list:\n", "\n", "```python\n", "vs = analyzer_en.polarity_scores(sentence)\n", "```\n", "\n", "The `polarity_scores()` method takes a sentence as input and returns a dictionary (`vs`) containing four scores:\n", "\n", " * **`'neg'` (Negative):** The proportion of text that expresses **negative** sentiment.\n", " * **`'neu'` (Neutral):** The proportion of text that expresses **neutral** sentiment.\n", " * **`'pos'` (Positive):** The proportion of text that expresses **positive** sentiment.\n", " * *Note: The `neg`, `neu`, and `pos` scores for a sentence sum to approximately 1.0.*\n", " * **`'compound'` (Compound):** This is the most important score. It's a normalized, weighted composite score ranging from **-1 (most extreme negative)** to **+1 (most extreme positive)**. It's derived by summing the valence scores of each word found in the lexicon, adjusting for rules (like intensity boosters or negations), and then normalizing the result.\n", "\n", "#### 4\\. 
Interpreting the `compound` Score\n", "\n", "The code then uses the `compound` score to classify the overall sentiment:\n", "\n", "```python\n", "if vs['compound'] >= 0.05:\n", "    sentiment = 'Positive'\n", "elif vs['compound'] <= -0.05:\n", "    sentiment = 'Negative'\n", "else:\n", "    sentiment = 'Neutral'\n", "```\n", "\n", "This logic applies common thresholds for interpreting the `compound` score:\n", "\n", " * If `compound` is **0.05 or greater**, the sentiment is considered **Positive**.\n", " * If `compound` is **-0.05 or less**, the sentiment is considered **Negative**.\n", " * If `compound` is **between -0.05 and 0.05** (exclusive of the bounds), the sentiment is considered **Neutral**.\n", "\n", "This simple rule-based approach makes VADER a popular choice for quick and relatively accurate sentiment analysis, especially for informal text like social media posts." ], "metadata": { "id": "75mhvP_9S4Lo" } }, { "cell_type": "markdown", "source": [ "## NLP Pipeline (Part Two)" ], "metadata": { "id": "8-9T4A-OTl5P" } }, { "cell_type": "code", "source": [ "# Install required libraries\n", "!pip install nltk spacy transformers sentencepiece --quiet\n", "!python -m spacy download en_core_web_sm" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "VExPAM_hVm5k", "outputId": "1580689d-ff26-4f07-d9f0-96313bc5807b" }, "execution_count": 6, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "Collecting en-core-web-sm==3.8.0\n", " Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)\n", "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m12.8/12.8 MB\u001b[0m \u001b[31m55.5 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", "\u001b[?25h\u001b[38;5;2m✔ Download and installation successful\u001b[0m\n", "You can now load the package via spacy.load('en_core_web_sm')\n", "\u001b[38;5;3m⚠ Restart to reload 
dependencies\u001b[0m\n", "If you are in a Jupyter or Colab notebook, you may need to restart Python in\n", "order to load all the package's dependencies. You can do this by selecting the\n", "'Restart kernel' or 'Restart runtime' option.\n" ] } ] }, { "cell_type": "code", "source": [ "import nltk\n", "import spacy\n", "nltk.download('punkt')\n", "nltk.download('stopwords')\n", "nltk.download('wordnet')\n", "nltk.download('averaged_perceptron_tagger')\n", "nltk.download('vader_lexicon')" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "0I5io_qWVnMg", "outputId": "9093102a-de9b-4054-d2e9-2bd3022631cf" }, "execution_count": 2, "outputs": [ { "output_type": "stream", "name": "stderr", "text": [ "[nltk_data] Downloading package punkt to /root/nltk_data...\n", "[nltk_data] Package punkt is already up-to-date!\n", "[nltk_data] Downloading package stopwords to /root/nltk_data...\n", "[nltk_data] Package stopwords is already up-to-date!\n", "[nltk_data] Downloading package wordnet to /root/nltk_data...\n", "[nltk_data] Package wordnet is already up-to-date!\n", "[nltk_data] Downloading package averaged_perceptron_tagger to\n", "[nltk_data] /root/nltk_data...\n", "[nltk_data] Package averaged_perceptron_tagger is already up-to-\n", "[nltk_data] date!\n", "[nltk_data] Downloading package vader_lexicon to /root/nltk_data...\n", "[nltk_data] Package vader_lexicon is already up-to-date!\n" ] }, { "output_type": "execute_result", "data": { "text/plain": [ "True" ] }, "metadata": {}, "execution_count": 2 } ] }, { "cell_type": "markdown", "source": [ "2. NLP Pipeline Steps\n", "\n", " 2.1. Sentence Segmentation" ], "metadata": { "id": "2VvjnJjvVzkr" } }, { "cell_type": "code", "source": [ "from nltk.tokenize import sent_tokenize\n", "\n", "text = \"Natural Language Processing is fascinating. 
It enables computers to understand human language!\"\n", "\n", "sentences = sent_tokenize(text)\n", "print(\"Sentence Segmentation:\", sentences)" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "5hlX2ilrVviA", "outputId": "6fc7284c-ba2a-45f3-d20d-db7c028471d8" }, "execution_count": 3, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "Sentence Segmentation: ['Natural Language Processing is fascinating.', 'It enables computers to understand human language!']\n" ] } ] }, { "cell_type": "markdown", "source": [ "2.2. Word Tokenization" ], "metadata": { "id": "0dltR_hIV8rS" } }, { "cell_type": "code", "source": [ "from nltk.tokenize import word_tokenize\n", "\n", "tokens = word_tokenize(text)\n", "print(\"Word Tokens:\", tokens)" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "L4NAcXGGV4q9", "outputId": "4b1f2b61-03aa-4757-fd8d-df86e46a217b" }, "execution_count": 4, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "Word Tokens: ['Natural', 'Language', 'Processing', 'is', 'fascinating', '.', 'It', 'enables', 'computers', 'to', 'understand', 'human', 'language', '!']\n" ] } ] }, { "cell_type": "markdown", "source": [ "2.3 Stemming" ], "metadata": { "id": "3LyiR8XcWCBh" } }, { "cell_type": "code", "source": [ "from nltk.stem import PorterStemmer\n", "\n", "stemmer = PorterStemmer()\n", "stems = [stemmer.stem(token) for token in tokens]\n", "print(\"Stems:\", stems)" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "q4-hPHXvV_e3", "outputId": "5b8adead-be41-41f8-9715-aa70f7360ed0" }, "execution_count": 5, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "Stems: ['natur', 'languag', 'process', 'is', 'fascin', '.', 'it', 'enabl', 'comput', 'to', 'understand', 'human', 'languag', '!']\n" ] } ] }, { "cell_type": "markdown", "source": [ "2.4 Lemmatization" ], "metadata": { "id": "EFMk66vjWGiT" } }, { "cell_type": "code", "source": [ "from 
nltk.stem import WordNetLemmatizer\n", "\n", "lemmatizer = WordNetLemmatizer()\n", "lemmas = [lemmatizer.lemmatize(token) for token in tokens]\n", "print(\"Lemmas:\", lemmas)" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "SHbByxuBWDu2", "outputId": "89087832-c6df-4033-f67b-3356d675bf48" }, "execution_count": 6, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "Lemmas: ['Natural', 'Language', 'Processing', 'is', 'fascinating', '.', 'It', 'enables', 'computer', 'to', 'understand', 'human', 'language', '!']\n" ] } ] }, { "cell_type": "markdown", "source": [ "2.5 Stop Words Removal" ], "metadata": { "id": "62aKLv4cWK4l" } }, { "cell_type": "code", "source": [ "from nltk.corpus import stopwords\n", "\n", "stop_words = set(stopwords.words('english'))\n", "filtered_tokens = [token for token in tokens if token.lower() not in stop_words]\n", "print(\"Tokens after Stop Word Removal:\", filtered_tokens)" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "HGsGcS97WIc8", "outputId": "5d572f5c-461c-4f69-e08d-2c7ae3445d1f" }, "execution_count": 7, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "Tokens after Stop Word Removal: ['Natural', 'Language', 'Processing', 'fascinating', '.', 'enables', 'computers', 'understand', 'human', 'language', '!']\n" ] } ] }, { "cell_type": "markdown", "source": [ "2.6. 
Dependency Parsing" ], "metadata": { "id": "l6cvxTbqWRTL" } }, { "cell_type": "code", "source": [ "nlp = spacy.load(\"en_core_web_sm\")\n", "doc = nlp(text)\n", "\n", "print(\"Dependency Parsing:\")\n", "for token in doc:\n", " print(f\"\\n{token.text} --> {token.dep_} --> {token.head.text}\")" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "AGWREIQ5WOOp", "outputId": "2ea03a79-1180-41ea-d7cc-555f2388cac3" }, "execution_count": 8, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "Dependency Parsing:\n", "\n", "Natural --> compound --> Language\n", "\n", "Language --> compound --> Processing\n", "\n", "Processing --> nsubj --> is\n", "\n", "is --> ROOT --> is\n", "\n", "fascinating --> acomp --> is\n", "\n", ". --> punct --> is\n", "\n", "It --> nsubj --> enables\n", "\n", "enables --> ROOT --> enables\n", "\n", "computers --> nsubj --> understand\n", "\n", "to --> aux --> understand\n", "\n", "understand --> ccomp --> enables\n", "\n", "human --> amod --> language\n", "\n", "language --> dobj --> understand\n", "\n", "! --> punct --> enables\n" ] } ] }, { "cell_type": "markdown", "source": [ "2.7 Parts of Speech Tagging" ], "metadata": { "id": "Pzgbg8UwWumt" } }, { "cell_type": "code", "source": [ "print(\"Part-of-Speech Tagging:\\n\")\n", "for token in doc:\n", " print(f\"{token.text} --> {token.pos_}\")" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "-JN5V8haWT0S", "outputId": "236d32ba-4f5d-4e5f-ff3b-853eaa0fce40" }, "execution_count": 9, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "Part-of-Speech Tagging:\n", "\n", "Natural --> PROPN\n", "Language --> PROPN\n", "Processing --> NOUN\n", "is --> AUX\n", "fascinating --> ADJ\n", ". --> PUNCT\n", "It --> PRON\n", "enables --> VERB\n", "computers --> NOUN\n", "to --> PART\n", "understand --> VERB\n", "human --> ADJ\n", "language --> NOUN\n", "! --> PUNCT\n" ] } ] }, { "cell_type": "markdown", "source": [ "3. 
Sentiment Analysis with NLTK's VADER" ], "metadata": { "id": "Ltd-DcosW_ix" } }, { "cell_type": "code", "source": [ "from nltk.sentiment import SentimentIntensityAnalyzer\n", "\n", "sia = SentimentIntensityAnalyzer()\n", "sentiment = sia.polarity_scores(text)\n", "print(\"VADER Sentiment Scores:\", sentiment)" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "qkuhRYrOW0jV", "outputId": "cac8f9b7-34d5-4cff-ea97-a257c1458722" }, "execution_count": 10, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "VADER Sentiment Scores: {'neg': 0.0, 'neu': 0.614, 'pos': 0.386, 'compound': 0.7424}\n" ] } ] }, { "cell_type": "markdown", "source": [ "4. Translation using Huggingface Transformers" ], "metadata": { "id": "AIVJfTwCXJXM" } }, { "cell_type": "code", "source": [ "from transformers import pipeline\n", "\n", "# English to French translation\n", "translator_fr = pipeline(\"translation_en_to_fr\", model=\"Helsinki-NLP/opus-mt-en-fr\")\n", "translation = translator_fr(text)\n", "print(\"\\n\\nTranslation (EN->FR):\", translation[0]['translation_text'])\n", "\n", "print(\"\\n\\n\")" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "FaKMSZ7IXE5i", "outputId": "aeb55cc3-cdd0-4242-b464-bbf2a603368e" }, "execution_count": 11, "outputs": [ { "output_type": "stream", "name": "stderr", "text": [ "/usr/local/lib/python3.11/dist-packages/transformers/models/marian/tokenization_marian.py:175: UserWarning: Recommended: pip install sacremoses.\n", " warnings.warn(\"Recommended: pip install sacremoses.\")\n", "Device set to use cpu\n" ] }, { "output_type": "stream", "name": "stdout", "text": [ "\n", "\n", "Translation (EN->FR): Le traitement du langage naturel est fascinant. Il permet aux ordinateurs de comprendre le langage humain!\n", "\n", "\n", "\n" ] } ] }, { "cell_type": "markdown", "source": [ "5. 
Text Generation Model Creation (Word-LSTM Model)" ], "metadata": { "id": "VeqTijUvXURn" } }, { "cell_type": "code", "source": [ "!pip install --upgrade datasets gcsfs fsspec --quiet" ], "metadata": { "id": "xGyym_LsgNrj" }, "execution_count": 7, "outputs": [] }, { "cell_type": "code", "source": [ "# Imports, setting up devices and seed\n", "# !pip install datasets tqdm --quiet\n", "\n", "import torch\n", "import torch.nn as nn\n", "import torch.optim as optim\n", "import numpy as np\n", "from tqdm import tqdm\n", "from datasets import load_dataset\n", "import re\n", "import os\n", "import matplotlib.pyplot as plt\n", "from collections import Counter\n", "from sklearn.model_selection import train_test_split\n", "\n", "torch.manual_seed(42)\n", "np.random.seed(42)\n", "device = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\")\n", "print(\"Using device:\", device)" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "8ETP7K3geUEN", "outputId": "f80f7c93-a3d3-4b0c-d520-a23d256776f2" }, "execution_count": 1, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "Using device: cuda\n" ] } ] }, { "cell_type": "code", "source": [ "# Load the first 90% of the train split and tokenize\n", "\n", "ds = load_dataset(\"nirajandhakal/Mahabharata-HHGTTG-Text\", split=\"train[:90%]\")\n", "corpus = \" \".join(ds['text'])\n", "\n", "def tokenize(text):\n", "    # Lowercase, then keep words and individual punctuation marks as separate tokens\n", "    return re.findall(r\"\\b\\w+\\b|[^\\w\\s]\", text.lower())\n", "\n", "tokens = tokenize(corpus)\n", "print(f\"Number of tokens: {len(tokens)}\")\n", "print(\"Sample tokens:\", tokens[:20])" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "vKBU1auceYXm", "outputId": "4984aacc-b077-426f-b966-9e5938556d60" }, "execution_count": 22, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "Number of tokens: 3634450\n", "Sample tokens: ['the', 'epicurean', 'paradox', 'the', 'epicurean', 'paradox', 'is', 'a', 
'philosophical', 'argument', 'that', 'has', 'intrigued', 'thinkers', 'for', 'centuries', '.', 'it', 'is', 'an']\n" ] } ] }, { "cell_type": "code", "source": [ "# Build vocabulary and Encode\n", "vocab_size = 10000\n", "most_common = Counter(tokens).most_common(vocab_size-2)\n", "vocab = [w for w, _ in most_common]\n", "word2idx = {w: i+2 for i, w in enumerate(vocab)}\n", "word2idx[\"<PAD>\"] = 0  # padding token (matches padding_idx=0 in the embedding)\n", "word2idx[\"<UNK>\"] = 1  # out-of-vocabulary token\n", "idx2word = {i: w for w, i in word2idx.items()}\n", "\n", "# Encode tokens\n", "encoded = [word2idx.get(w, 1) for w in tokens] # 1 is <UNK>\n", "print(f\"Vocabulary size: {len(word2idx)}\")" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "75flr-GKec1H", "outputId": "555722ca-8a85-4155-9bb3-89126f63b1f8" }, "execution_count": 3, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "Vocabulary size: 10000\n" ] } ] }, { "cell_type": "code", "source": [ "# Prepare data for training\n", "\n", "seq_length = 15 # Number of context tokens used to predict the next word\n", "step = 1 # Stride 1: every position yields a training example\n", "\n", "sequences = []\n", "next_words = []\n", "for i in range(0, len(encoded) - seq_length, step):\n", "    sequences.append(encoded[i:i+seq_length])\n", "    next_words.append(encoded[i+seq_length])\n", "\n", "X = np.array(sequences, dtype=np.int32)\n", "y = np.array(next_words, dtype=np.int32)\n", "\n", "# Train/validation split\n", "X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.1, random_state=42)\n", "\n", "print(f\"Train samples: {len(X_train)}, Val samples: {len(X_val)}\")\n", "\n", "from torch.utils.data import TensorDataset, DataLoader\n", "\n", "batch_size = 256\n", "train_dataset = TensorDataset(torch.tensor(X_train, dtype=torch.long), torch.tensor(y_train, dtype=torch.long))\n", "val_dataset = TensorDataset(torch.tensor(X_val, dtype=torch.long), torch.tensor(y_val, dtype=torch.long))\n", "train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)\n", "val_loader = 
DataLoader(val_dataset, batch_size=batch_size)" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "lkhqyVW2esLf", "outputId": "8bfec172-9e2d-4ee2-e01a-69aa2d5447bf" }, "execution_count": 6, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "Train samples: 391758, Val samples: 43529\n" ] } ] }, { "cell_type": "code", "source": [ "# Define the Model\n", "\n", "class WordLSTM(nn.Module):\n", "    def __init__(self, vocab_size, embed_size, hidden_size, num_layers=3, dropout=0.3):\n", "        super().__init__()\n", "        self.embedding = nn.Embedding(vocab_size, embed_size, padding_idx=0)\n", "        self.lstm = nn.LSTM(embed_size, hidden_size, num_layers, batch_first=True, dropout=dropout)\n", "        self.fc = nn.Linear(hidden_size, vocab_size)\n", "\n", "    def forward(self, x, hidden=None):\n", "        x = self.embedding(x)\n", "        out, hidden = self.lstm(x, hidden)\n", "        # Predict the next word from the last time step only\n", "        out = self.fc(out[:, -1, :])\n", "        return out, hidden\n", "\n", "embed_size = 256\n", "hidden_size = 512\n", "num_layers = 3\n", "dropout = 0.3\n", "\n", "model = WordLSTM(len(word2idx), embed_size, hidden_size, num_layers, dropout).to(device)\n", "loss_fn = nn.CrossEntropyLoss()\n", "optimizer = optim.Adam(model.parameters(), lr=0.002)" ], "metadata": { "id": "VdxeA4U7evTC" }, "execution_count": 7, "outputs": [] }, { "cell_type": "code", "source": [ "# Train the model and record loss and accuracy per epoch\n", "\n", "epochs = 10 # Number of passes over the training set\n", "train_losses, val_losses = [], []\n", "train_accuracies, val_accuracies = [], []\n", "\n", "def accuracy(preds, targets):\n", "    return (preds.argmax(dim=1) == targets).float().mean().item()\n", "\n", "for epoch in range(epochs):\n", "    model.train()\n", "    total_loss, total_acc, total_count = 0, 0, 0\n", "    for xb, yb in tqdm(train_loader, desc=f\"Train Epoch {epoch+1}/{epochs}\"):\n", "        xb, yb = xb.to(device), yb.to(device)\n", "        optimizer.zero_grad()\n", "        output, _ = model(xb)\n", "        loss = loss_fn(output, yb)\n", "        loss.backward()\n", "        optimizer.step()\n", 
"        total_loss += loss.item() * xb.size(0)\n", "        total_acc += accuracy(output, yb) * xb.size(0)\n", "        total_count += xb.size(0)\n", "    avg_loss = total_loss / total_count\n", "    avg_acc = total_acc / total_count\n", "    train_losses.append(avg_loss)\n", "    train_accuracies.append(avg_acc)\n", "\n", "    # Validation\n", "    model.eval()\n", "    val_loss, val_acc, val_count = 0, 0, 0\n", "    with torch.no_grad():\n", "        for xb, yb in val_loader:\n", "            xb, yb = xb.to(device), yb.to(device)\n", "            output, _ = model(xb)\n", "            loss = loss_fn(output, yb)\n", "            val_loss += loss.item() * xb.size(0)\n", "            val_acc += accuracy(output, yb) * xb.size(0)\n", "            val_count += xb.size(0)\n", "    val_losses.append(val_loss / val_count)\n", "    val_accuracies.append(val_acc / val_count)\n", "    print(f\"Epoch {epoch+1}: Train Loss={avg_loss:.4f}, Val Loss={val_losses[-1]:.4f}, Train Acc={avg_acc:.4f}, Val Acc={val_accuracies[-1]:.4f}\")" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "08g9D-RlfAVt", "outputId": "7f0a6668-54bb-4ef1-d680-248be4285512" }, "execution_count": 8, "outputs": [ { "output_type": "stream", "name": "stderr", "text": [ "Train Epoch 1/10: 100%|██████████| 1531/1531 [01:16<00:00, 20.14it/s]\n" ] }, { "output_type": "stream", "name": "stdout", "text": [ "Epoch 1: Train Loss=6.3144, Val Loss=6.2408, Train Acc=0.0783, Val Acc=0.0786\n" ] }, { "output_type": "stream", "name": "stderr", "text": [ "Train Epoch 2/10: 100%|██████████| 1531/1531 [01:17<00:00, 19.84it/s]\n" ] }, { "output_type": "stream", "name": "stdout", "text": [ "Epoch 2: Train Loss=6.1643, Val Loss=5.7447, Train Acc=0.0869, Val Acc=0.1387\n" ] }, { "output_type": "stream", "name": "stderr", "text": [ "Train Epoch 3/10: 100%|██████████| 1531/1531 [01:17<00:00, 19.65it/s]\n" ] }, { "output_type": "stream", "name": "stdout", "text": [ "Epoch 3: Train Loss=5.2669, Val Loss=4.9505, Train Acc=0.1835, Val Acc=0.2157\n" ] }, { "output_type": 
"stream", "name": "stderr", "text": [ "Train Epoch 4/10: 100%|██████████| 1531/1531 [01:18<00:00, 19.56it/s]\n" ] }, { "output_type": "stream", "name": "stdout", "text": [ "Epoch 4: Train Loss=4.6935, Val Loss=4.7041, Train Acc=0.2291, Val Acc=0.2411\n" ] }, { "output_type": "stream", "name": "stderr", "text": [ "Train Epoch 5/10: 100%|██████████| 1531/1531 [01:17<00:00, 19.73it/s]\n" ] }, { "output_type": "stream", "name": "stdout", "text": [ "Epoch 5: Train Loss=4.3577, Val Loss=4.5889, Train Acc=0.2505, Val Acc=0.2542\n" ] }, { "output_type": "stream", "name": "stderr", "text": [ "Train Epoch 6/10: 100%|██████████| 1531/1531 [01:17<00:00, 19.68it/s]\n" ] }, { "output_type": "stream", "name": "stdout", "text": [ "Epoch 6: Train Loss=4.1002, Val Loss=4.5632, Train Acc=0.2639, Val Acc=0.2616\n" ] }, { "output_type": "stream", "name": "stderr", "text": [ "Train Epoch 7/10: 100%|██████████| 1531/1531 [01:18<00:00, 19.58it/s]\n" ] }, { "output_type": "stream", "name": "stdout", "text": [ "Epoch 7: Train Loss=3.8865, Val Loss=4.5903, Train Acc=0.2751, Val Acc=0.2651\n" ] }, { "output_type": "stream", "name": "stderr", "text": [ "Train Epoch 8/10: 100%|██████████| 1531/1531 [01:17<00:00, 19.68it/s]\n" ] }, { "output_type": "stream", "name": "stdout", "text": [ "Epoch 8: Train Loss=3.7002, Val Loss=4.6680, Train Acc=0.2864, Val Acc=0.2686\n" ] }, { "output_type": "stream", "name": "stderr", "text": [ "Train Epoch 9/10: 100%|██████████| 1531/1531 [01:17<00:00, 19.75it/s]\n" ] }, { "output_type": "stream", "name": "stdout", "text": [ "Epoch 9: Train Loss=3.5370, Val Loss=4.7425, Train Acc=0.2969, Val Acc=0.2683\n" ] }, { "output_type": "stream", "name": "stderr", "text": [ "Train Epoch 10/10: 100%|██████████| 1531/1531 [01:17<00:00, 19.71it/s]\n" ] }, { "output_type": "stream", "name": "stdout", "text": [ "Epoch 10: Train Loss=3.3925, 
Val Loss=4.8263, Train Acc=0.3082, Val Acc=0.2688\n" ] } ] }, { "cell_type": "code", "source": [ "print(model)" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "IXtEKalQjAEG", "outputId": "4628b64f-b075-4f88-8efd-e95cd2a87067" }, "execution_count": 9, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "WordLSTM(\n", "  (embedding): Embedding(10000, 256, padding_idx=0)\n", "  (lstm): LSTM(256, 512, num_layers=3, batch_first=True, dropout=0.3)\n", "  (fc): Linear(in_features=512, out_features=10000, bias=True)\n", ")\n" ] } ] }, { "cell_type": "code", "source": [ "# Plot Training and Validation Curves\n", "\n", "plt.figure(figsize=(12,5))\n", "plt.subplot(1,2,1)\n", "plt.plot(train_losses, label='Train Loss')\n", "plt.plot(val_losses, label='Val Loss')\n", "plt.xlabel('Epoch')\n", "plt.ylabel('Loss')\n", "plt.title('Loss Curve')\n", "plt.legend()\n", "plt.grid(True)\n", "\n", "plt.subplot(1,2,2)\n", "plt.plot(train_accuracies, label='Train Acc')\n", "plt.plot(val_accuracies, label='Val Acc')\n", "plt.xlabel('Epoch')\n", "plt.ylabel('Accuracy')\n", "plt.title('Accuracy Curve')\n", "plt.legend()\n", "plt.grid(True)\n", "\n", "plt.tight_layout()\n", "plt.show()" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 507 }, "id": "vHm3p94_hDm5", "outputId": "308aa0f2-63aa-4ff4-9f77-7d02843bbede" }, "execution_count": 10, "outputs": [ { "output_type": "display_data", "data": { "text/plain": [ "
" ], "image/png": "\n" }, "metadata": {} } ] }, { "cell_type": "code", "source": [ "# Save the model\n", "\n", "save_path = \"word_lstm_standard.pth\"\n", "torch.save({\n", "    'model_state_dict': model.state_dict(),\n", "    'word2idx': word2idx,\n", "    'idx2word': idx2word,\n", "    'vocab_size': len(word2idx),\n", "    'embed_size': embed_size,\n", "    'hidden_size': hidden_size,\n", "    'num_layers': num_layers,\n", "    'dropout': dropout,\n", "    'seq_length': seq_length\n", "}, save_path)\n", "print(f\"Model saved to {save_path}\")" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "ejY0IN1QfFz4", "outputId": "5b48210a-4a1f-4954-a0f9-e0c50f701ee7" }, "execution_count": 11, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "Model saved to word_lstm_standard.pth\n" ] } ] }, { "cell_type": "code", "source": [ "# Load the Model\n", "\n", "def load_model(path):\n", "    checkpoint = torch.load(path, map_location=device)\n", "    model = WordLSTM(\n", "        checkpoint['vocab_size'],\n", "        checkpoint['embed_size'],\n", "        checkpoint['hidden_size'],\n", "        checkpoint['num_layers'],\n", "        checkpoint['dropout']\n", "    ).to(device)\n", "    model.load_state_dict(checkpoint['model_state_dict'])\n", "    model.eval()\n", "    return model, checkpoint['word2idx'], checkpoint['idx2word'], checkpoint['seq_length']\n", "\n", "loaded_model, loaded_word2idx, loaded_idx2word, loaded_seq_length = load_model(save_path)\n", "print(\"Model loaded!\")" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "Za9zPxsLfTAq", "outputId": "52a645fc-8a56-44da-aba1-da290dc76946" }, "execution_count": 12, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "Model loaded!\n" ] } ] }, { "cell_type": "code", "source": [ "# Generate Text\n", "\n", "def tokenize(text):\n", "    return re.findall(r\"\\b\\w+\\b|[^\\w\\s]\", text.lower())\n", "\n", "def generate_text(model, word2idx, idx2word, seq_length, seed, length=40, temperature=1.0):\n", "    model.eval()\n", 
"    seed_tokens = tokenize(seed.lower())\n", "    seed_encoded = [word2idx.get(w, 1) for w in seed_tokens]\n", "    # Left-pad short seeds with index 0, or keep only the last seq_length tokens\n", "    if len(seed_encoded) < seq_length:\n", "        seed_encoded = [0]*(seq_length-len(seed_encoded)) + seed_encoded\n", "    else:\n", "        seed_encoded = seed_encoded[-seq_length:]\n", "    generated = seed_tokens.copy()\n", "    inp = torch.tensor([seed_encoded], dtype=torch.long, device=device)\n", "    hidden = None\n", "    for _ in range(length):\n", "        out, hidden = model(inp, hidden)\n", "        out = out[0].detach().cpu().numpy()\n", "        # Temperature-scaled softmax over the vocabulary\n", "        out = out / temperature\n", "        exp_out = np.exp(out - np.max(out))\n", "        probs = exp_out / np.sum(exp_out)\n", "        idx = np.random.choice(range(len(idx2word)), p=probs)\n", "        next_word = idx2word.get(idx, \"<UNK>\")\n", "        generated.append(next_word)\n", "        # Slide the input window forward by one token\n", "        inp = torch.cat([inp[:, 1:], torch.tensor([[idx]], device=device)], dim=1)\n", "    return \" \".join(generated)\n", "\n", "seed_text = \"The universe is such a place where Krishna is the creator\"\n", "print(\"Generated text (temperature=0.2):\\n\")\n", "print(generate_text(loaded_model, loaded_word2idx, loaded_idx2word, loaded_seq_length, seed_text, length=100, temperature=0.2))" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "ek3K3dBLfXzW", "outputId": "d05d7b63-687b-4ba0-b95c-483bcde8875b" }, "execution_count": 21, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "Generated text (temperature=0.2):\n", "\n", "the universe is such a place where krishna is the creator of the universe . it is possible that the gospels is that weΓ’ Β€ Β™ t areas of multiple , and that the universe is that the gospels is that the values is the product of the universe . the number of nasaΓ’ is the temperature . the number of slokas composed by the gods and the kaikeyas , the planets , the , and the andhakas , the planets , the , and the , the , the , and the , and the , the , and the ,\n" ] } ] } ] }