# media_stores.ipynb
> A notebook for storing all types of media as vector stores

In this notebook, we'll implement the functionality required to interact with many types of media stores. This covers not just text files and PDFs, but also images, audio, and video.

Below are some references for integration of different media types into vector stores.

- YouTube: https://python.langchain.com/docs/modules/data_connection/document_loaders/integrations/youtube_audio
- Websites:
  - https://js.langchain.com/docs/modules/indexes/document_loaders/examples/web_loaders/
  - https://python.langchain.com/docs/modules/data_connection/document_loaders/integrations/web_base
  - Extracting relevant information from website: https://www.oncrawl.com/technical-seo/extract-relevant-text-content-from-html-page/

:::{.callout-caution}
These notebooks are development notebooks, meaning that they are meant to be run locally or in an environment that supports navigating a full repository (in other words, not Google Colab, unless you clone the entire repository to Drive and then mount it). If you're able to do all of those steps, you're likely also able to figure out the pip installs required for development there.
:::


In [None]:
#| default_exp MediaVectorStores

In [None]:
#| export
# import libraries here
import os
import itertools

from langchain.embeddings import OpenAIEmbeddings

from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.document_loaders.unstructured import UnstructuredFileLoader
from langchain.document_loaders.generic import GenericLoader
from langchain.document_loaders.parsers import OpenAIWhisperParser
from langchain.document_loaders.blob_loaders.youtube_audio import YoutubeAudioLoader
from langchain.document_loaders import WebBaseLoader, UnstructuredURLLoader
from langchain.docstore.document import Document

from langchain.vectorstores import Chroma
from langchain.chains import RetrievalQAWithSourcesChain

Note that we will not export the following packages to our module, either because we have decided to go with the langchain implementations in this exploration or because they are only used for testing.

In [None]:
#exploration
import trafilatura
import requests
import justext

## Media to Text Converters
In this section, we provide a set of converters that can either read text and split it into document segments, or read YouTube videos or websites and convert them into text.

### Standard Text Splitter
Here we define a standard text splitter. This can be used on any text.

In [None]:
#| export
def rawtext_to_doc_split(text, chunk_size=1500, chunk_overlap=150):
  
  # Quick type checking
  if not isinstance(text, list):
    text = [text]

  # Create splitter
  text_splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size,
                                                 chunk_overlap=chunk_overlap,
                                                 add_start_index = True)
  
  # Split into doc segments
  if isinstance(text[0], Document):
    doc_segments = text_splitter.split_documents(text)
  else:
    # create_documents already splits raw strings into chunked Documents,
    # so a second split_documents pass is not needed
    doc_segments = text_splitter.create_documents(text)

  # Make into one big list
  doc_segments = list(itertools.chain(*doc_segments)) if isinstance(doc_segments[0], list) else doc_segments

  return doc_segments

In [None]:
# test basic functionality
rawtext_to_doc_split(["This is a sentence. This is another sentence.", "This is a third sentence."], chunk_size=10, chunk_overlap=5)

[Document(page_content='This is a', metadata={}),
 Document(page_content='sentence.', metadata={}),
 Document(page_content='This is', metadata={}),
 Document(page_content='another', metadata={}),
 Document(page_content='sentence.', metadata={}),
 Document(page_content='This is a', metadata={}),
 Document(page_content='a third', metadata={}),
 Document(page_content='sentence.', metadata={})]
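
To build intuition for how `chunk_size` and `chunk_overlap` interact, here is a simplified fixed-window sketch. It is not the algorithm used by `RecursiveCharacterTextSplitter` (which prefers to break on separators such as newlines and spaces); the helper name `sliding_window_chunks` is hypothetical and only illustrates the overlap idea.

```python
def sliding_window_chunks(text, chunk_size=10, chunk_overlap=5):
    """Split text into fixed-width character windows that overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    # Each new window starts this many characters after the previous one
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks


chunks = sliding_window_chunks("abcdefghijklmnopqrstuvwxyz", chunk_size=10, chunk_overlap=5)
print(chunks)
# ['abcdefghij', 'fghijklmno', 'klmnopqrst', 'pqrstuvwxy', 'uvwxyz']
```

Each chunk repeats the last `chunk_overlap` characters of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk when the overlap is large enough.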

We'll write a quick function to do a unit test on the function we just wrote.

In [None]:
def test_split_texts():
    
    # basic behavior
    text = "This is a sample text that we will use to test the splitter function."
    expected_output = ["This is a sample text that we will use to test the splitter function."]
    out_splits = [doc.page_content for doc in rawtext_to_doc_split(text)]
    assert all([target==expected for target, expected in zip(expected_output, out_splits)]), ('The basic splitter functionality is incorrect, and does not correctly ' +
                                                                                              'use chunk_size and chunk_overlap on chunks <1500.')
    
    # try a known result with variable chunk_length and chunk_overlap
    text = ("This is a sample text that we will use to test the splitter function. It should split the " +
            "text into multiple chunks of size 1500 with an overlap of 150 characters. This is the second chunk.")
    expected_output = ['This is a sample text that we will use to test the',
                       'test the splitter function. It should split the',
                       'split the text into multiple chunks of size 1500',
                       'size 1500 with an overlap of 150 characters. This',
                       'This is the second chunk.']
    out_splits = [doc.page_content for doc in rawtext_to_doc_split(text, 50, 10)]
    assert all([target==expected for target, expected in zip(expected_output, out_splits)]), 'The splitter does not correctly use chunk_size and chunk_overlap.'

# Run test
test_split_texts()

The following function is used for testing to make sure that both single files and lists of files can be accommodated, and that a flat list of documents is returned in both cases.

In [None]:
# a set of tests to make sure that this works on both lists and single inputs
def test_converters_inputs(test_fcn, files_list=None):
    if files_list is None:
        single_file = 'The cat was super cute and adorable'
        multiple_files = [single_file, 'The dog was also cute and her wet nose is always so cold!']
    elif isinstance(files_list, str):
        single_file = files_list
        multiple_files = [single_file, single_file]
    elif isinstance(files_list, list):
        single_file = files_list[0]
        multiple_files = files_list
    else:
        raise TypeError("You've passed in a files_list which is not a string, a list, or None")

    # test for single file
    res = test_fcn(single_file)
    assert isinstance(res, list), f'FAILED ASSERT in {test_fcn.__name__}. A single file should return a list.'
    assert not isinstance(res[0], list), f'FAILED ASSERT in {test_fcn.__name__}. A single file should return a 1-dimensional list.'

    # test for multiple files
    res = test_fcn(multiple_files)
    assert isinstance(res, list), f'FAILED ASSERT in {test_fcn.__name__}. A list of files should return a list.'
    assert not isinstance(res[0], list), f'FAILED ASSERT in {test_fcn.__name__}. A list of files should return a 1-dimensional list with all documents combined.'

    # test that the return type of elements should be Document
    assert all([isinstance(doc, Document) for doc in res]), f'FAILED ASSERT in {test_fcn.__name__}. The return type of elements should be Document.'

In [None]:
# test behavior of standard text splitter
test_converters_inputs(rawtext_to_doc_split)

### File or Files
Functions which load a single file or files from a directory, including pdfs, text files, html, images, and more. See [Unstructured File Documentation](https://python.langchain.com/docs/integrations/document_loaders/unstructured_file) for more information.

In [None]:
#| export
## A single File
def _file_to_text(single_file, chunk_size = 1000, chunk_overlap=150):

  # Create loader and get segments
  loader = UnstructuredFileLoader(single_file)
  doc_segments = loader.load_and_split(RecursiveCharacterTextSplitter(chunk_size=chunk_size,
                                                                      chunk_overlap=chunk_overlap,
                                                                      add_start_index=True))
  return doc_segments


## Multiple files
def files_to_text(files_list, chunk_size=1000, chunk_overlap=150):
  
  # Quick type checking
  if not isinstance(files_list, list):
    files_list = [files_list]

  # This is currently a fix because the UnstructuredFileLoader expects a list of files yet can't split them correctly yet
  all_segments = [_file_to_text(single_file, chunk_size=chunk_size, chunk_overlap=chunk_overlap) for single_file in files_list]
  all_segments = list(itertools.chain(*all_segments)) if isinstance(all_segments[0], list) else all_segments

  return all_segments
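
The flattening step in `files_to_text` turns one list of segments per file into a single flat list. A minimal sketch of that pattern follows, using plain strings as stand-ins for `Document` objects; the helper name `flatten_segments` is hypothetical. Unlike the `all_segments[0]` check above, `itertools.chain.from_iterable` also handles an empty input without raising an `IndexError`.

```python
import itertools


def flatten_segments(per_file_segments):
    """Flatten one level of nesting: [[a, b], [c]] -> [a, b, c]."""
    return list(itertools.chain.from_iterable(per_file_segments))


# Strings stand in for the Document objects returned per file
per_file = [["poem part 1", "poem part 2"], ["paper part 1"]]
flat = flatten_segments(per_file)
print(flat)                  # ['poem part 1', 'poem part 2', 'paper part 1']
print(flatten_segments([]))  # []
```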

In [None]:
# ensure basic behavior
res = files_to_text(['../roadnottaken.txt', '../roadnottaken.txt'], chunk_size=100, chunk_overlap=20)
res[:11]

[Document(page_content='Two roads diverged in a yellow wood,\rAnd sorry I could not travel both\rAnd be one traveler, long I', metadata={'source': '../roadnottaken.txt', 'start_index': 0}),
 Document(page_content='traveler, long I stood\rAnd looked down one as far as I could\rTo where it bent in the', metadata={'source': '../roadnottaken.txt', 'start_index': 82}),
 Document(page_content='it bent in the undergrowth;\r\rThen took the other, as just as fair,\rAnd having perhaps the better', metadata={'source': '../roadnottaken.txt', 'start_index': 152}),
 Document(page_content='perhaps the better claim,\rBecause it was grassy and wanted wear;\rThough as for that the passing', metadata={'source': '../roadnottaken.txt', 'start_index': 230}),
 Document(page_content='that the passing there\rHad worn them really about the same,\r\rAnd both that morning equally lay\rIn', metadata={'source': '../roadnottaken.txt', 'start_index': 309}),
 Document(page_content='equally lay\rIn leaves no step had t

In [None]:
test_converters_inputs(files_to_text, '../roadnottaken.txt')

### YouTube
This works by first transcribing the video to text.

In [None]:
#| export
def youtube_to_text(urls, save_dir = "content"):
  # Transcribe the videos to text
  # save_dir: directory to save audio files

  if not isinstance(urls, list):
    urls = [urls]
  
  youtube_loader = GenericLoader(YoutubeAudioLoader(urls, save_dir), OpenAIWhisperParser())
  youtube_docs = youtube_loader.load()
  
  return youtube_docs

Now, let's demonstrate functionality using some existing YouTube videos.

In [None]:
# Two Karpathy lecture videos
urls = ["https://youtu.be/kCc8FmEb1nY", "https://youtu.be/VMj-3S1tku0"]
youtube_text = youtube_to_text(urls)
youtube_text

Other YouTube helper functions that help with getting the full features of YouTube videos are included below. These two grab and save the text of the transcripts.

<p style="color:red"><strong>Note that in this stage of development, the following cannot be tested due to YouTube download errors.</strong></p>

In [None]:
#| export
def save_text(text, text_name = None):
  if not text_name:
    text_name = text[:20]
  text_path = os.path.join("/content",text_name+".txt")
  
  with open(text_path, "x") as f:
    f.write(text)
  # Return the location at which the transcript is saved
  return text_path
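
Two details of `save_text` are worth illustrating: the default name is taken from the first 20 characters of the text, which may contain characters that are awkward in file names, and mode `"x"` refuses to overwrite an existing file. The sketch below is an assumption-laden variant, not the exported function: `safe_text_name` and `save_text_to` are hypothetical helpers, and the output directory is passed in rather than fixed to `/content`.

```python
import os
import re
import tempfile


def safe_text_name(text, max_len=20):
    """Build a file name stem from the first characters of the text."""
    # Replace every run of characters outside [A-Za-z0-9_-] with one underscore
    stem = re.sub(r"[^A-Za-z0-9_-]+", "_", text[:max_len]).strip("_")
    return stem or "untitled"


def save_text_to(text, out_dir, text_name=None):
    """Write text to <out_dir>/<text_name>.txt and return the path."""
    os.makedirs(out_dir, exist_ok=True)
    text_path = os.path.join(out_dir, (text_name or safe_text_name(text)) + ".txt")
    # Mode "x" creates the file and raises FileExistsError if it already exists
    with open(text_path, "x") as f:
        f.write(text)
    return text_path


with tempfile.TemporaryDirectory() as tmp_dir:
    path = save_text_to("Hello, world! This is a transcript.", tmp_dir)
    print(os.path.basename(path))  # Hello_world_This_i.txt
```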

In [None]:
#| export
def get_youtube_transcript(yt_url, save_transcript = False, temp_audio_dir = "sample_data"):
  # Transcribe the video(s) to text and, if save_transcript is True, save the transcript to file in /content
  # temp_audio_dir: directory to save audio files

  youtube_docs = youtube_to_text(yt_url, save_dir = temp_audio_dir)
  
  # Combine doc
  combined_docs = [doc.page_content for doc in youtube_docs]
  combined_text = " ".join(combined_docs)
  
  # Save text to file
  video_path = youtube_docs[0].metadata["source"]
  youtube_name = os.path.splitext(os.path.basename(video_path))[0]

  save_path = None
  if save_transcript:
    save_path = save_text(combined_text, youtube_name)
  
  return youtube_docs, save_path

### Websites
We have a few different approaches to reading website text. Some approaches are specifically provided through langchain and some are other packages that seem to be performant. We'll show the pros/cons of each approach below.

#### Langchain: WebBaseLoader

In [None]:
#| export
def website_to_text_web(url, chunk_size = 1500, chunk_overlap=100):
  
    # Url can be a single string or list
    website_loader = WebBaseLoader(url)
    website_raw = website_loader.load()

    website_data = rawtext_to_doc_split(website_raw, chunk_size = chunk_size, chunk_overlap=chunk_overlap)
  
    # Return the split document segments
    return website_data

Now for a quick test to ensure functionality...

In [None]:
demo_urls = ["https://www.espn.com/", "https://www.vanderbilt.edu/undergrad-datascience/faq"]

In [None]:
# get the results
res_web = website_to_text_web(demo_urls)

res_web

[Document(page_content="ESPN - Serving Sports Fans. Anytime. Anywhere.\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n        Skip to main content\n    \n\n        Skip to navigation\n    \n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n<\n\n>\n\n\n\n\n\n\n\n\n\nMenuESPN\n\n\nSearch\n\n\n\nscores\n\n\n\nNFLMLBNBANHLSoccerGolf…Women's World CupNCAAFNCAAMNCAAWSports BettingBoxingCFLNCAACricketF1HorseMMANASCARNBA G LeagueOlympic SportsPLLRacingRN BBRN FBRugbyTennisWNBAWWEX GamesXFLMore ESPNFantasyListenWatchESPN+\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n  \n\nSUBSCRIBE NOW\n\n\n\n\n\nPaul vs. Diaz (ESPN+ PPV)\n\n\n\n\n\n\n\nPGA TOUR LIVE\n\n\n\n\n\n\n\nLittle League Baseball: Regionals\n\n\n\n\n\n\n\nMLB: Select Games\n\n\n\n\n\n\n\nCrossFit Games\n\n\n\n\n\n\n\nSlamBall\n\n\n\n\n\n\n\nThe Ultimate Fighter: Season 31\n\n\n\n\n\n\n\nFantasy Foot

In [None]:
#unit testbed
test_converters_inputs(website_to_text_web, demo_urls)

Something we notice here is the proliferation of newline characters: long runs of `\n` take up space in each chunk without adding any content.
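
One possible way to reduce this noise is to post-process the page text before splitting. The following is only a sketch under that assumption: `collapse_blank_lines` is a hypothetical helper that is not part of the exported module, and the sample string is a shortened imitation of the output above.

```python
import re


def collapse_blank_lines(text, max_newlines=2):
    """Strip spaces/tabs around each line and cap runs of consecutive newlines."""
    # Remove leading and trailing spaces/tabs from every line
    lines = [line.strip(" \t") for line in text.split("\n")]
    text = "\n".join(lines)
    # Replace any run longer than max_newlines with exactly max_newlines
    pattern = r"\n{%d,}" % (max_newlines + 1)
    return re.sub(pattern, "\n" * max_newlines, text).strip("\n")


raw = "ESPN - Serving Sports Fans.\n\n\n\n\n        Skip to main content\n    \n\n        Skip to navigation\n"
cleaned = collapse_blank_lines(raw)
print(repr(cleaned))
# 'ESPN - Serving Sports Fans.\n\nSkip to main content\n\nSkip to navigation'
```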

#### Langchain: UnstructuredURLLoader

In [None]:
#| export
def website_to_text_unstructured(web_urls, chunk_size = 1500, chunk_overlap=100):

    # Make sure it's a list
    if not isinstance(web_urls, list):
        web_urls = [web_urls]
  
    # Url can be a single string or list
    website_loader = UnstructuredURLLoader(web_urls)
    website_raw = website_loader.load()

    website_data = rawtext_to_doc_split(website_raw, chunk_size = chunk_size, chunk_overlap=chunk_overlap)
  
    # Return individual docs or list
    return website_data

In [None]:
# get the results
res_unstructured = website_to_text_unstructured(demo_urls)
res_unstructured

[Document(page_content="Menu\n\nESPN\n\nSearch\n\n\n\nscores\n\nNFL\n\nMLB\n\nNBA\n\nNHL\n\nSoccer\n\nGolf\n\n…Women's World CupNCAAFNCAAMNCAAWSports BettingBoxingCFLNCAACricketF1HorseMMANASCARNBA G LeagueOlympic SportsPLLRacingRN BBRN FBRugbyTennisWNBAWWEX GamesXFL\n\nMore ESPN\n\nFantasy\n\nListen\n\nWatch\n\nESPN+\n\nSUBSCRIBE NOW\n\nPaul vs. Diaz (ESPN+ PPV)\n\nPGA TOUR LIVE\n\nLittle League Baseball: Regionals\n\nMLB: Select Games\n\nCrossFit Games\n\nSlamBall\n\nThe Ultimate Fighter: Season 31\n\nFantasy Football: Top Storylines, Rookies, Sleepers\n\nQuick Links\n\nWomen's World Cup\n\nNHL Free Agency\n\nNBA Free Agency Buzz\n\nNBA Trade Machine\n\nThe Basketball Tournament\n\nFantasy Football: Sign Up\n\nHow To Watch PGA TOUR\n\nFavorites\n\nManage Favorites\n\nCustomize ESPN\n\nESPN Sites\n\nESPN Deportes\n\nAndscape\n\nespnW\n\nESPNFC\n\nX Games\n\nSEC Network\n\nESPN Apps\n\nESPN\n\nESPN Fantasy\n\nFollow ESPN\n\nFacebook\n\nX/Twitter\n\nInstagram\n\nSnapchat\n\nTikTok\n\nYou

In [None]:
#unit testbed
test_converters_inputs(website_to_text_unstructured, demo_urls)

We also see here that the unstructured approach is more conservative in the number of newline characters it emits while still appearing to preserve content. However, the gain is not overly significant.
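
To put a number on this comparison, one option is to measure what fraction of each output consists of newline characters. This is a sketch only: `newline_ratio` is a hypothetical helper, and the two sample strings are invented stand-ins for the `page_content` of the two loaders, not real results.

```python
def newline_ratio(text):
    """Fraction of characters in text that are newline characters."""
    if not text:
        return 0.0
    return text.count("\n") / len(text)


# Invented samples imitating the two styles of output shown above
web_like = "Menu\n\n\n\n\n\nESPN\n\n\n\nSearch"
unstructured_like = "Menu\n\nESPN\n\nSearch"

print(round(newline_ratio(web_like), 2))           # 0.42
print(round(newline_ratio(unstructured_like), 2))  # 0.22
```

In the notebook, the same function could be applied to the `page_content` of each document in `res_web` and `res_unstructured`.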

#### Trafilatura Parsing

[Trafilatura](https://trafilatura.readthedocs.io/en/latest/) is a Python and command-line utility which attempts to extract the most relevant information from a given website.

In [None]:
def website_trafilatura(url):
  downloaded = trafilatura.fetch_url(url)
  return trafilatura.extract(downloaded)

In [None]:
trafilatura_text = website_trafilatura(demo_urls[0])
print('Total number of characters in example:', len(trafilatura_text), '\n')
trafilatura_text

Total number of characters in example: 1565 



'|\n|\n|\n|\n|\n||\n|\n|\n||\nPHI\nMIA\n||\n56-49\n57-49\n||\n||\n||\n||\n||\n6:40 PM ET\n||\n|\n|\n|\n|\n|\n|\n||\n|\n|\n||\nMIL\nWSH\n||\n57-49\n44-62\n||\n||\n||\n||\n||\n7:05 PM ET\n||\n|\n|\n|\n|\n|\n|\n||\n|\n|\n||\nTB\nNYY\n||\n64-44\n55-50\n||\n||\n||\n||\n||\n7:05 PM ET\n||\n|\n|\n|\n|\n|\n|\n||\n|\n|\n||\nBAL\nTOR\n||\n64-41\n59-47\n||\n||\n||\n||\n||\n7:07 PM ET\n||\n|\n|\n|\n|\n|\n|\n||\n|\n|\n||\nLAA\nATL\n||\n55-51\n67-36\n||\n||\n||\n||\n||\n7:20 PM ET\n||\n|\n|\n|\n|\n|\n|\n||\n|\n|\n||\nCIN\nCHC\n||\n58-49\n53-52\n||\n||\n||\n||\n||\n8:05 PM ET\n||\n|\n|\n|\n|\n|\n|\n||\n|\n|\n||\nCLE\nHOU\n||\n53-53\n59-47\n||\n||\n||\n||\n||\n8:10 PM ET\n||\n|\n|\n|\n|\n|\n|\n||\n|\n|\n||\nSD\nCOL\n||\n52-54\n41-64\n||\n||\n||\n||\n||\n8:40 PM ET\n||\n|\n|\n|\n|\n|\n|\n||\n|\n|\n||\nBOS\nSEA\n||\n56-49\n54-51\n||\n||\n||\n||\n||\n9:40 PM ET\n||\n|\n|\n|\n|\n|\n|\n||\n|\n|\n||\nARI\nSF\n||\n56-50\n58-48\n||\n||\n||\n||\n||\n9:45 PM ET\n||\n|\n|\n|\n|\n|\n|\n||\n|\n|\n||\nJPN\nESP\n||\

This output is substantially shorter, at 1565 characters. However, the problem is that the main article on the page isn't captured at all.

#### jusText

[jusText](https://pypi.org/project/jusText/) is another Python library for extracting content from a website.

In [None]:
def website_justext(url):
  response = requests.get(url)
  paragraphs = justext.justext(response.content, justext.get_stoplist("English"))
  content = [paragraph.text for paragraph in paragraphs \
            if not paragraph.is_boilerplate]
  text = " ".join(content)
  return text

In [None]:
# Ensure behavior
justext_text = website_justext(demo_urls[0])
justext_text

''

In [None]:
# Try a different URL to see if behavior improves
justext_text = website_justext(demo_urls[1])
justext_text

'Declaring the Minor While minor declarations can be made any time, DS courses will give some preference to students who have officially declared the Data Science Minor. So we recommend declaring the minor sooner rather than later. It is always possible to drop a declared minor. Minor declarations must be submitted at least two weeks before registration begins. Otherwise, the minor declaration will not be processed until after registration. No preference will be given during registration for an “intent” to declare because the minor declaration was made too late. First, preference for students who have declared the minor only applies to DS courses, not other courses. Second, if you declared the minor within two weeks of registration, your minor declaration will. not show up on YES, and you will not have preference. Third, while we try to hold as many seats for students who have declared the minor as we can, not all seats are reserved. Yes. While A&S students are usually prevented from d

Here, we see that we may prefer to stick with the langchain implementations. The first jusText example returned an empty string, although previous work demonstrates that on a different day it worked well (note that ESPN's content was different then). With the second URL, parts of the website, particularly the headers, are missing.

## Creating Document Segments
Now, the precursor to creating vector stores/embeddings is to create document segments. Since we have a variety of sources, we will keep this in mind as we develop the following function.

:::{.callout-warning}
Note that `get_document_segments` is currently meant to be used in a single pass, with `context_info` consisting entirely of a single data type. [Issue #150](https://github.com/vanderbilt-data-science/lo-achievement/issues/150) is meant to expand this functionality so that the software will be able to handle many uploaded files.
:::

In [None]:
#| export
def get_document_segments(context_info, data_type, chunk_size = 1500, chunk_overlap=100):

    load_fcn = None
    addtnl_params = {'chunk_size': chunk_size, 'chunk_overlap': chunk_overlap}

    # Define function use to do the loading
    if data_type == 'text':
        load_fcn = rawtext_to_doc_split
    elif data_type == 'web_page':
        load_fcn = website_to_text_unstructured
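    elif data_type == 'youtube_video':
        load_fcn = youtube_to_text
        # youtube_to_text does not accept chunking parameters, so pass none
        addtnl_params = {}
    else:
        # Any other data_type is treated as a file path or list of file paths
        load_fcn = files_to_text
    
    # Get the document segments
    doc_segments = load_fcn(context_info, **addtnl_params)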

    return doc_segments
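
The if/elif chain above is a dispatch on `data_type`. A dictionary-based version of the same idea is sketched below. Everything in it is a stand-in: `load_text`, `load_youtube`, `load_files`, and `dispatch` are hypothetical names that return tagged strings instead of calling the real loaders. The point is that each entry records whether its loader accepts chunking parameters.

```python
def load_text(items, chunk_size=1500, chunk_overlap=100):
    """Stand-in for a loader that accepts chunking parameters."""
    return [f"text:{item}" for item in items]


def load_youtube(items):
    """Stand-in for a loader that takes no chunking parameters."""
    return [f"youtube:{item}" for item in items]


def load_files(items, chunk_size=1500, chunk_overlap=100):
    """Stand-in for the fallback file loader."""
    return [f"file:{item}" for item in items]


# Map each data type to (loader, whether it accepts chunking parameters)
LOADERS = {
    "text": (load_text, True),
    "youtube_video": (load_youtube, False),
}
DEFAULT_LOADER = (load_files, True)


def dispatch(context_info, data_type, chunk_size=1500, chunk_overlap=100):
    """Pick a loader by data_type and pass only the arguments it supports."""
    if not isinstance(context_info, list):
        context_info = [context_info]
    loader, accepts_chunking = LOADERS.get(data_type, DEFAULT_LOADER)
    kwargs = {"chunk_size": chunk_size, "chunk_overlap": chunk_overlap} if accepts_chunking else {}
    return loader(context_info, **kwargs)


print(dispatch("some raw text", "text"))             # ['text:some raw text']
print(dispatch(["https://youtu.be/abc"], "youtube_video"))
print(dispatch("paper.pdf", "other"))                # ['file:paper.pdf']
```

A table like this keeps the list of supported types in one place, which may make it easier to add new media types later.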

## Creating Vector Stores from Document Segments
The last step here is the creation of vector stores from the provided document segments. The intent is to allow either Chroma or DeepLake; the function below currently implements only Chroma, and it enforces OpenAIEmbeddings.

In [None]:
#| export
def create_local_vector_store(document_segments, **retriever_kwargs):
    embeddings = OpenAIEmbeddings()
    db = Chroma.from_documents(document_segments, embeddings)
    retriever = db.as_retriever(**retriever_kwargs)
    
    return db, retriever
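
Conceptually, a vector store embeds every segment as a vector and answers a query by returning the segments whose vectors are closest to the query's vector. The toy sketch below illustrates only that idea. It uses word counts in place of OpenAI embeddings and a plain sort in place of Chroma's index, so it says nothing about how either library works internally. All function names are hypothetical.

```python
import math
from collections import Counter


def embed(text):
    """Toy embedding: lowercase word counts (real embeddings are dense vectors)."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def similarity_search(query, segments, k=1):
    """Return the k segments most similar to the query."""
    query_vec = embed(query)
    ranked = sorted(segments,
                    key=lambda seg: cosine_similarity(query_vec, embed(seg)),
                    reverse=True)
    return ranked[:k]


segments = ["two roads diverged in a yellow wood",
            "prompt patterns improve model outputs"]
top = similarity_search("which of the two roads should i take", segments, k=1)
print(top)  # ['two roads diverged in a yellow wood']
```

Unlike this word-overlap toy, learned embeddings can match text with similar meaning even when no words are shared.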

### Unit test of vector store and segment creation

In [None]:
from langchain.chat_models import ChatOpenAI
from getpass import getpass

In [None]:
openai_api_key = getpass()
os.environ["OPENAI_API_KEY"] = openai_api_key

llm = ChatOpenAI(model_name = 'gpt-3.5-turbo-16k')

In [None]:
test_files = ['../roadnottaken.txt', '../2302.11382.pdf']

#get vector store
segs = get_document_segments(test_files, data_type='other', chunk_size = 1000, chunk_overlap = 100)
chroma_db, vs_retriever = create_local_vector_store(segs)

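#create test retrieval QA chain (uses the imported RetrievalQAWithSourcesChain and the llm defined above)
qa_chain = RetrievalQAWithSourcesChain.from_chain_type(llm=llm, chain_type="stuff", retriever=vs_retriever,
                                                       return_source_documents=True)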

In [None]:
# check for functionality
chroma_db.similarity_search('The street was forked and I did not know which way to go')

[Document(page_content='Two roads diverged in a yellow wood,\rAnd sorry I could not travel both\rAnd be one traveler, long I stood\rAnd looked down one as far as I could\rTo where it bent in the undergrowth;\r\rThen took the other, as just as fair,\rAnd having perhaps the better claim,\rBecause it was grassy and wanted wear;\rThough as for that the passing there\rHad worn them really about the same,\r\rAnd both that morning equally lay\rIn leaves no step had trodden black. Oh, I kept the first for another day! Yet knowing how way leads on to way,\rI doubted if I should ever come back. I shall be telling this with a sigh\rSomewhere ages and ages hence:\rTwo roads diverged in a wood, and IэI took the one less traveled by,\rAnd that has made all the difference.', metadata={'source': '../roadnottaken.txt', 'start_index': 0}),
 Document(page_content='any unnecessary steps,” is useful in ﬂagging inaccuracies in the user’s original request so that the ﬁnal recipe is efﬁcient.', metadata={'sou

In [None]:
#check qa chain for functionality
ans = qa_chain({'question':'What is the best prompt to use when I want the model to take on a certain attitude of a person?'})

In [None]:
#show result
ans

{'question': 'What is the best prompt to use when I want the model to take on a certain attitude of a person?',
 'answer': 'The best prompt to use when you want the model to take on a certain attitude of a person is to provide a persona for the model to embody. This can be expressed as a job description, title, fictional character, historical figure, or any other attributes associated with a well-known type of person. The prompt should specify the outputs that this persona would create. Additionally, personas can also represent inanimate or non-human entities, such as a Linux terminal or a database. In this case, the prompt should specify how the inputs should be delivered to the entity and what outputs the entity should produce. It is also possible to provide a better version of the question and prompt the model to ask if the user would like to use the better version instead.\n',
 'sources': '../2302.11382.pdf',
 'source_documents': [Document(page_content='4) Example Implementation: A

In conclusion, this is looking pretty solid. Let's leverage this functionality within the code base.