Outline
RAG (Retrieval-Augmented Generation) is touted as the decisive way to turn capable AI models into productive assistants. The general idea is simple: provide the AI model with relevant context and it will produce higher-quality results. I have seen and used a variety of data sources for this, including Google Search, vector databases, meeting transcripts, and Salesforce, but I have yet to run across a concise example of using YouTube as the source in a RAG system. In this article, I will discuss the benefits of using YouTube and show how to use it with a working code example.
Why YouTube
The content on YouTube is quite varied, but one area where it really excels is the "How-To" sector. Ever heard the common phrase, "There's a YouTube video for that..."? A few examples are:
- How to fix a leaky pipe?
- How to optimize my PPC campaign?
- How to make fried rice?
Beyond the "How-To" sector, YouTube also has recent news postings, product reviews, and slews of tutorials that provide great context!
I think many developers have stayed away from YouTube as a data source for their AI tools mostly because it is all video-based. Thankfully, there is a handy video transcript package on PyPI (youtube_transcript_api). It is simple to work with and returns a full-text transcript for almost any YouTube video, ready for AI ingestion.
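As a minimal sketch of what working with the transcript package can look like, the snippet below fetches the timed segments for one video and flattens them into a single string. It assumes the youtube_transcript_api package is installed; its interface has changed between releases (older versions expose a static `get_transcript`, newer ones an instance method `fetch`), so check the version you have. The helper names `fetch_segments` and `segments_to_text` are illustrative and not taken from the repository.

```python
def fetch_segments(video_id):
    """Return transcript segments as a list of dicts with a "text" key.

    Assumption: youtube_transcript_api is installed. The import is done
    lazily so the text helper below works without the package.
    """
    from youtube_transcript_api import YouTubeTranscriptApi

    if hasattr(YouTubeTranscriptApi, "get_transcript"):
        # Older releases: static method returning a list of dicts.
        return YouTubeTranscriptApi.get_transcript(video_id)
    # Newer releases: instance method returning a transcript object.
    return YouTubeTranscriptApi().fetch(video_id).to_raw_data()


def segments_to_text(segments, max_chars=None):
    """Join segment texts into one string, optionally capped in length."""
    text = " ".join(seg["text"] for seg in segments)
    # Collapse newlines and repeated whitespace left over from captions.
    text = " ".join(text.split())
    if max_chars is not None:
        text = text[:max_chars]
    return text
```

Calling `segments_to_text(fetch_segments(video_id))` with a real 11-character video ID would give a plain-text transcript; the optional `max_chars` cap is a crude way to keep long videos from dominating the prompt.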
Example Usage
GitHub Repository: LINK
- CLI-based chatbot with a custom YouTube "plugin". We first take in the user's input and generate a natural language search query. We send this query to Google Search together with site:youtube.com to ensure we only get back relevant YouTube videos. Next, we use the above-mentioned Python YouTube transcript tool to extract the text transcripts. Last but not least, we pass all of this back into the AI model to produce a final result. (Note: the chat history is mutated/copied at points to keep token counts down and retain only pertinent information.)
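The steps described above can be sketched roughly as follows. This is not the repository's code: the function names, the three-video limit, and the per-video character cap are assumptions made for illustration, and the actual HTTP request to Google and the call to the AI model are left out. It only shows how the search can be restricted to YouTube, how video IDs can be pulled from result links, and how transcripts can be packed into one prompt with citable sources.

```python
import re
from urllib.parse import quote_plus, unquote

# Matches the 11-character ID in watch URLs and youtu.be short links.
VIDEO_ID_RE = re.compile(
    r"(?:youtube\.com/watch\?v=|youtu\.be/)([A-Za-z0-9_-]{11})"
)


def build_search_url(search_query):
    """Restrict a Google search to YouTube using the site: operator."""
    return "https://www.google.com/search?q=" + quote_plus(
        f"{search_query} site:youtube.com"
    )


def extract_video_ids(text, limit=3):
    """Return unique video IDs found in result URLs or HTML, in order."""
    ids = []
    # Result links are often percent-encoded, so decode before matching.
    for video_id in VIDEO_ID_RE.findall(unquote(text)):
        if video_id not in ids:
            ids.append(video_id)
    return ids[:limit]


def build_prompt(question, transcripts, max_chars_per_video=4000):
    """Combine transcripts and the question into a single prompt.

    transcripts maps video ID -> transcript text. Each transcript is
    truncated as a rough stand-in for keeping token counts down.
    """
    parts = []
    for video_id, transcript in transcripts.items():
        url = f"https://www.youtube.com/watch?v={video_id}"
        parts.append(f"Source: {url}\n{transcript[:max_chars_per_video]}")
    context = "\n\n".join(parts)
    return (
        "Answer using the YouTube transcripts below and cite the source URLs."
        f"\n\n{context}\n\nQuestion: {question}"
    )
```

Including the source URL next to each transcript is what lets the model cite the videos it drew from in its final answer.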
Setup
- Clone the repository to your local device.
- Replace <YOUR API KEY HERE> with your actual OpenAI API key.
- Open your terminal.
- Install the required packages: pip install openai beautifulsoup4 requests youtube_transcript_api
- Run: python youtube.py
You are now running the CLI locally :)
CLI Commands
- q | quit: Quit the program
- h | history: Show the chat history
- d | delete: Delete the chat history
- c | clear: Clear the terminal
- y | youtube: Toggle YouTube mode
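One way such commands could be wired up is a small lookup table that maps both the short and long form to an action, with anything else treated as a normal chat message. This is a hedged sketch rather than the repository's implementation: the `COMMANDS` table, the `handle_input` function, and the shape of the `state` dictionary are all hypothetical.

```python
# Map both the short and long form of each command to one action name.
COMMANDS = {
    "q": "quit", "quit": "quit",
    "h": "history", "history": "history",
    "d": "delete", "delete": "delete",
    "c": "clear", "clear": "clear",
    "y": "youtube", "youtube": "youtube",
}


def handle_input(line, state):
    """Return the action for a line of input, updating state if needed.

    state is assumed to hold "history" (a list of messages) and
    "youtube_mode" (a bool). Unrecognized input is returned as "chat".
    """
    action = COMMANDS.get(line.strip().lower(), "chat")
    if action == "delete":
        state["history"].clear()
    elif action == "youtube":
        state["youtube_mode"] = not state["youtube_mode"]
    return action
```

The calling loop would then act on the returned name, for example exiting on "quit", printing `state["history"]` on "history", or sending the line to the model on "chat".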
Chat Example
For my chat example, I chose to ask a question about the Jets vs. Browns game from this past Thursday (Dec. 28, 2023), something the AI model would know nothing about. As you can see below, it found relevant YouTube videos, cited its sources, and provided a decent breakdown of the football game and the reasons for the Jets' loss.
Conclusion
Please share any use cases you find for this, or any questions you may have, in the comments below. I hope this has sparked your interest in using YouTube as a RAG data source, in conjunction with others, for a more well-rounded knowledge base!