Welcome & Goals

Welcome! If you are reading this, you are probably navigating the vast and fascinating world of Big Data.

In recent years, the explosion of Large Language Models (LLMs) like ChatGPT has completely changed how we interact with information. However, LLMs have a significant limitation: they only know what they were trained on, and they cannot securely access your private or strictly up-to-date data.

This is where Vector Databases and the RAG (Retrieval-Augmented Generation) architecture come into play. They are the missing link that allows AI models to dynamically read, understand, and use your specific data to generate accurate answers.

The Goal of This Guide

This tutorial was created as a practical, hands-on resource for our Big Data course. The main objective is to move from theory to practice without getting bogged down in unnecessary complexity.

By the end of this guide, you will be able to:

  • Understand what a Vector Database is and why traditional relational databases (SQL) aren’t enough for AI applications.
  • Set up and interact with Weaviate, a leading open-source Vector Database.
  • Build a foundational RAG pipeline from scratch using Python.

Prerequisites

To get the most out of this tutorial, you don’t need to be a machine learning expert. You only need:

  • A basic understanding of Python.
  • A terminal and a code editor (like VS Code).
  • Curiosity about how AI actually searches and retrieves information under the hood.

Ready to build your first AI-powered search engine? Let’s dive in!

Repository

Inside the GitHub repository you can find a fully working implementation of the code you will see in this guide.


Credits

This guide was written by Andrea Moschetto as an academic course project under the supervision of Professor Alfredo Pulvirenti.

What is a Vector Database?

To understand what a Vector Database is, we first need to understand how traditional databases fail when it comes to Artificial Intelligence.

In a standard relational database (like PostgreSQL or MySQL), data is stored in rows and columns. If you want to find information, you query it using exact matches or specific keywords. If you search for “dog”, the database looks for the exact word “dog”. It won’t return results containing “puppy” or “canine” unless you explicitly tell it to. Traditional databases lack context.

AI models, however, don’t read words like humans do; they read numbers.

The Magic of Embeddings

To make an AI understand text, images, or audio, we must translate that data into arrays of numbers called Vectors (or Embeddings).

Imagine a massive multidimensional map. An embedding model assigns coordinates to every piece of data. Concepts that share similar meanings are placed physically close to each other on this map, while unrelated concepts are placed far apart.

For example, the vector coordinates for “apple” will be very close to “banana” (both are fruits), but very far from “spaceship”.
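To make this concrete, here is a toy sketch using cosine similarity, a common way to measure how close two vectors are. The 3-dimensional vectors below are invented for illustration; real embedding models produce vectors with hundreds or thousands of dimensions.

```python
import math

# Hypothetical 3-dimensional "embeddings" (real ones are much longer).
vectors = {
    "apple":     [0.90, 0.80, 0.10],
    "banana":    [0.85, 0.75, 0.15],
    "spaceship": [0.10, 0.20, 0.95],
}

def cosine_similarity(a, b):
    """How closely two vectors point in the same direction (1.0 = identical)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity(vectors["apple"], vectors["banana"]))     # close to 1.0
print(cosine_similarity(vectors["apple"], vectors["spaceship"]))  # much lower
```

Similar concepts score close to 1.0; unrelated concepts score much lower.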

Why Do We Need a Vector Database?

A Vector Database is purpose-built to store, manage, and query these massive lists of numbers efficiently.

While a traditional database searches for exact keyword matches, a Vector Database performs a Semantic Search (or Similarity Search). When you ask a question, the system converts your question into a vector and looks for the stored vectors that are physically closest to it in that multidimensional space.

In summary, a Vector Database allows you to:

  • Search by meaning: You can search for “how to fix a flat tire” and find an article titled “repairing a punctured wheel,” even if they share zero keywords.
  • Handle unstructured data: Easily manage raw text, PDFs, images, and audio without forcing them into rigid tables.
  • Power AI applications: Quickly retrieve the most relevant information to feed into an LLM (which is exactly what we will do with RAG).
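Under the hood, similarity search boils down to "find the stored vectors closest to the query vector." Here is a naive brute-force sketch with made-up vectors; real vector databases use approximate indexes (like HNSW) so they never have to compare the query against every stored vector.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

# A tiny in-memory "vector database": (text, vector) pairs with invented vectors.
database = [
    ("repairing a punctured wheel", [0.90, 0.10, 0.30]),
    ("baking sourdough bread",      [0.10, 0.90, 0.20]),
    ("bicycle maintenance basics",  [0.80, 0.20, 0.40]),
]

def semantic_search(query_vector, top_k=2):
    # Score every stored vector against the query, then keep the closest ones.
    scored = [(text, cosine_similarity(query_vector, vec)) for text, vec in database]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]

# Pretend this is the embedding of "how to fix a flat tire".
query = [0.85, 0.15, 0.35]
for text, score in semantic_search(query):
    print(f"{score:.3f}  {text}")
```

Notice that "repairing a punctured wheel" wins even though it shares zero keywords with "how to fix a flat tire"; only the vectors are compared.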

Now that we know what a Vector Database does, it’s time to meet the tool we will be using to build our project.

Why Weaviate?

The Vector Database ecosystem has exploded recently. You might have heard of other tools like Pinecone, Milvus, Qdrant, or Chroma. So, why did we choose Weaviate for this Big Data project?

Weaviate is an open-source vector database that is incredibly robust, highly scalable, and exceptionally developer-friendly. It is designed not just to store vectors, but to actively build AI-powered search engines with minimal friction.

Here are the main reasons Weaviate is the perfect fit for our tutorial:

1. Built-in Vectorization (Modules)

This is arguably Weaviate’s “killer feature.” In many vector databases, you have to manually convert your text into embeddings using a separate Python script before inserting them into the database.

Weaviate handles this for you. Through its integration modules (like text2vec-openai, text2vec-cohere, or text2vec-huggingface), you simply send raw text to Weaviate. The database itself reaches out to the embedding provider, generates the vector, and stores everything automatically. It does the exact same thing when you perform a search query.

2. Open-Source and Flexible Deployment

As students, we need tools that are accessible. Weaviate is open-source, meaning you can run it entirely on your own machine using Docker. Alternatively, if your laptop struggles with heavy workloads, they offer a generous free tier on Weaviate Cloud (WCD) (formerly WCS), which gives you a fully managed sandbox cluster in the cloud for free.

3. Hybrid Search Out-of-the-Box

Semantic search (vectors) is amazing, but sometimes you still need exact keyword matches (like searching for a specific serial number or name). Weaviate natively supports Hybrid Search, which combines the best of both worlds: vector search and BM25 (traditional keyword search). You can even adjust the weight of each method using a simple slider parameter called alpha.
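Conceptually, alpha controls how much weight each method gets when the two result sets are merged. The sketch below only illustrates the role of alpha as a weighted blend; Weaviate's actual fusion algorithms are more sophisticated than this.

```python
def blend(vector_score, bm25_score, alpha=0.5):
    # alpha = 1.0 -> pure vector search, alpha = 0.0 -> pure keyword (BM25) search
    return alpha * vector_score + (1 - alpha) * bm25_score

print(blend(0.9, 0.2, alpha=1.0))  # 0.9 (vector signal only)
print(blend(0.9, 0.2, alpha=0.0))  # 0.2 (keyword signal only)
print(blend(0.9, 0.2, alpha=0.5))  # a balanced mix of the two
```

In the Weaviate v4 Python client, you pass this parameter directly to the hybrid query, e.g. `collection.query.hybrid(query="...", alpha=0.5)`.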

4. The Python Client v4

Weaviate recently released the v4 version of their Python client. It is entirely rewritten to be extremely “Pythonic”, strongly typed, and intuitive. You get great IDE autocomplete support (like in VS Code), making it much harder to make mistakes when writing your RAG pipeline.


Weaviate takes away the heavy lifting of managing embeddings so we can focus on building the actual RAG logic. In the next chapter, we will look at how Weaviate organizes data under the hood.

Core Concepts

Before we start writing Python code, we need to understand how Weaviate structures and organizes data. If you have ever used a relational database (like MySQL) or a document database (like MongoDB), these concepts will feel very familiar, but with a specific focus on AI and vectors.

1. Collections

In Weaviate, the highest level of data organization is a Collection (previously called a Class in older versions of Weaviate). You can think of a Collection as a “Table” in a SQL database. If you are building a movie recommendation RAG, you might have a Movie collection and a Review collection.

When you define a Collection, you also define its configuration, such as which embedding model to use (the Vectorizer) and what kind of data it will hold.

2. Objects and Properties

Inside a Collection, you store Objects. An Object is equivalent to a “Row” in SQL or a “Document” in MongoDB. It represents a single item of data, like one specific movie.

Each Object is made up of Properties (equivalent to SQL “Columns”). Properties hold the actual data values. For example, a Movie object might have the following properties:

  • title (text)
  • release_year (integer)
  • description (text)

3. Vectorizers

This is where Weaviate differs from traditional databases. When you create a Collection, you usually assign a Vectorizer to it (e.g., text2vec-openai or text2vec-cohere).

When you insert a new Object into that Collection, the Vectorizer automatically takes the text from specific Properties (like the description), sends it to the AI model to get the numerical embeddings, and stores those vectors alongside your Object. You don’t have to calculate the vectors yourself!

4. Cross-References

Just like foreign keys in SQL, Weaviate allows you to link Objects together using Cross-References. You can link a Review object directly to a Movie object. This is incredibly powerful for complex searches, allowing you to retrieve a movie based on the vectors of its connected reviews.


Quick Comparison: SQL vs. Weaviate

To summarize, here is how Weaviate’s terminology maps to traditional relational databases:

Traditional SQL DB | Weaviate Vector DB  | Description
-------------------|---------------------|------------------------------------------------------------
Table              | Collection          | A logical grouping of data with a specific schema.
Row                | Object              | A single data record.
Column             | Property            | A defined field within an object (e.g., text, int, boolean).
Index              | Vector Index (HNSW) | The algorithm used to make searching through millions of vectors incredibly fast.
Foreign Key        | Cross-Reference     | A directional link between objects.

Now that we understand the vocabulary, we are ready to move on to the architectural pattern we will be building: RAG.

Beyond LLM Limits

Large Language Models (LLMs) are incredibly powerful engines for understanding and generating human language. However, if you try to build a reliable, production-ready application relying only on an LLM, you will quickly run into severe roadblocks.

To understand why the tech industry has widely adopted the RAG (Retrieval-Augmented Generation) pattern, we must first understand the four main limitations of standalone LLMs:

1. The Knowledge Cutoff

Training an LLM takes massive computational power and months of time. Because of this, a model’s knowledge is “frozen” at the exact moment its training data was collected. If you ask a standard LLM about news from yesterday or a newly released software version, it simply won’t know the answer.

2. The Data Privacy Problem

Public models are trained on publicly available internet data. They know absolutely nothing about your university’s internal documents, your company’s private financial records, or your proprietary codebases.

3. Hallucinations

LLMs are probabilistic models; fundamentally, they predict the next most likely word in a sequence. When they lack the specific knowledge to answer a question, they often do not admit ignorance. Instead, they confidently invent plausible-sounding but entirely fake information. This is known as a hallucination, and it is unacceptable in any professional or academic environment.

4. The Context Window Bottleneck

A common question is: “Why not just copy and paste all my private documents into the prompt?” Every LLM has a context window—a strict limit on how much text it can process at one time (measured in tokens). While these windows are growing larger, pasting a database of a million documents into a single prompt is simply infeasible: it would blow far past any context window, and even approaching that limit is extremely slow and prohibitively expensive.


The Solution: Grounding the AI

We cannot teach the LLM everything in advance, and we cannot paste everything into the prompt. Instead, we need a way to dynamically find only the exact pieces of information relevant to the user’s question, and feed just those small pieces to the LLM.

We need to give the AI an external, searchable memory. This is exactly what RAG does, and it uses a Vector Database as its core engine.

How RAG Works

Retrieval-Augmented Generation (RAG) might sound like a complex academic term, but the concept is actually very straightforward. Think of it as giving an AI an “open-book exam.” Instead of relying on its flawed internal memory, the AI is allowed to look up the exact facts in a trusted database before answering.

A standard RAG pipeline is always divided into three distinct phases: Ingestion, Retrieval, and Generation.

Phase 1: Ingestion (Data Preparation)

Before we can search for anything, we need to populate our database. This happens behind the scenes before the user even asks a question.

  1. Load Data: We gather our private documents (PDFs, text files, website scrapes).
  2. Chunking: LLMs and embedding models have limits on how much text they can process at once. We break our large documents into smaller, meaningful pieces called “chunks” (e.g., a few paragraphs each).
  3. Embedding: We pass these chunks through an AI model to convert the text into numerical vectors.
  4. Storage: We save the original text and its corresponding vector into our Vector Database (Weaviate).
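The chunking step (step 2) can be sketched with a simple word-count splitter. This is a minimal illustration; real pipelines often use sentence-aware splitters, and the overlap keeps context that would otherwise be cut in half at a chunk boundary.

```python
def chunk_text(text, max_words=50, overlap=10):
    """Split text into overlapping chunks of at most max_words words."""
    words = text.split()
    chunks = []
    step = max_words - overlap  # advance by less than max_words to create overlap
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # the final chunk reached the end of the document
    return chunks

document = "word " * 120  # a stand-in for a long document of 120 words
chunks = chunk_text(document, max_words=50, overlap=10)
print(len(chunks))  # 3 chunks: words 0-49, 40-89, 80-119
```

Each chunk is small enough for an embedding model to process, and the 10-word overlap means no sentence is ever split without context.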

Now the system is ready for the user.

Phase 2: Retrieval (The Search)

  1. The Query: The user asks a question in natural language (e.g., “What is the university’s policy on remote exams?”).
  2. Query Embedding: The system takes the user’s question and converts it into a vector using the exact same embedding model used in Phase 1.
  3. Semantic Search: Weaviate compares the question’s vector against all the document vectors in the database. It instantly retrieves the “Top K” (e.g., the top 3) most mathematically similar chunks. These chunks represent the most relevant information needed to answer the question.

Phase 3: Generation (The Answer)

This is where the magic happens. We don’t just show the raw retrieved chunks to the user; we use an LLM to synthesize a perfect answer.

  1. Prompt Construction: We build a prompt that combines the user’s original question with the text of the retrieved chunks. We give the LLM strict instructions: “Answer the user’s question using ONLY the provided context.”
  2. Final Output: The LLM reads the context, understands it, and generates a natural, accurate, and conversational response for the user. No hallucinations, just facts!
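The prompt construction in step 1 boils down to simple string assembly. Here is a minimal sketch; the exact instruction wording and the sample chunks are just illustrative.

```python
def build_rag_prompt(question, retrieved_chunks):
    # Join the retrieved chunks into a single context block.
    context = "\n".join(f"- {chunk}" for chunk in retrieved_chunks)
    return (
        "Answer the user's question using ONLY the provided context.\n"
        "If the context does not contain the answer, say you do not know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )

prompt = build_rag_prompt(
    "What is the policy on remote exams?",
    ["Remote exams require a webcam.", "Online exams are proctored."],
)
print(prompt)
```

The LLM never sees the whole database—only the question plus the handful of chunks the retrieval step selected.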

💡 The Librarian Analogy

To easily remember this, think of RAG as a team effort between a Librarian and a Scholar:

  • The Vector Database (Weaviate) is the ultra-fast Librarian. When you ask a question, the Librarian runs into the archives, finds the three most relevant book pages, and hands them to the Scholar.
  • The LLM (ChatGPT, Llama, etc.) is the brilliant Scholar. The Scholar doesn’t need to memorize the entire library. They just read the three pages the Librarian provided and write a beautiful, accurate essay based on them.

With this theory out of the way, we are finally ready to get our hands dirty with code!

Python Environment

It is time to transition from theory to practice. In this section, we will set up a clean, isolated Python workspace for our RAG project.

Keeping your project dependencies isolated is a fundamental best practice in data engineering and software development. We will achieve this using Python’s built-in virtual environments.

1. Create a Virtual Environment

Open your terminal, navigate to the root folder of your project (where you plan to write your code, outside of the docs folder), and run the following command to create a virtual environment named venv:

python -m venv venv

2. Activate the Environment

Before installing any packages, you must activate the environment. The command depends on your operating system:

  • macOS and Linux:
source venv/bin/activate

  • Windows (Command Prompt):
venv\Scripts\activate.bat

  • Windows (PowerShell):
venv\Scripts\Activate.ps1

(You will know it worked if you see (venv) at the beginning of your terminal prompt).

3. Install Dependencies

With the environment activated, we need to install the required libraries. We will be using the official Weaviate v4 Python client, which is completely rewritten to be highly intuitive and strongly typed. We also need python-dotenv to securely manage our API keys.

Run this command:

pip install -U weaviate-client python-dotenv

Optional: If you want to use OpenAI’s LLMs for the final text generation step of our RAG pipeline, you should also install their official client:

pip install openai

4. Secure Your API Keys

Since Weaviate will be reaching out to external AI models (like OpenAI, Cohere, or Google) to generate vectors, it needs an API key. Never hardcode API keys directly into your Python scripts! If you push hardcoded keys to GitHub, they will be compromised immediately.

Instead, create a file named exactly .env in the root of your project folder. Add your keys there like this:

# .env file
OPENAI_API_KEY="sk-your-openai-api-key-here"

Next, create a .gitignore file in the same folder and add .env to it. This ensures Git will ignore your secrets.
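For example, a minimal .gitignore for this project could contain:

```
# Secrets and local environment
.env
venv/
```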

We will use the python-dotenv library to load these variables securely into our Python scripts later.

✨ Want to use Google Gemini instead?

If you prefer to use Google Gemini instead of OpenAI for your embeddings and text generation (which is a great choice and offers a generous free tier for developers!), the setup is very similar.

First, you might want to install the Google GenAI SDK (optional, but useful if you want to test the model outside of Weaviate):

pip install google-genai

Then, in your .env file, simply add your Gemini API key instead:

# .env file
GEMINI_API_KEY="your-gemini-api-key-here"

Later in the code, you will pass this key to Weaviate to enable its built-in Google integration modules (text2vec-google and generative-google).


Your local machine is now prepped and ready for coding! Next, we need to actually get our Vector Database up and running.

Weaviate Infrastructure

Weaviate is not just a Python library; it is a full-fledged database server. This means you need to have a Weaviate instance running somewhere before your Python code can interact with it.

For this tutorial, you have two choices for setting up your infrastructure. We highly recommend Option A for this course, as it requires zero setup and won’t consume your laptop’s resources.


Option A: Weaviate Cloud Sandbox (Recommended)

Weaviate Cloud (WCD) provides a free “Sandbox” tier. This is a fully managed, serverless instance of Weaviate that lives in the cloud and lasts for 14 days (perfect for a university project).

Step-by-step setup:

  1. Go to the Weaviate Cloud Console and create a free account.
  2. Click on Create cluster.
  3. Select the Free Sandbox tier.
  4. Give your cluster a name (e.g., bigdata-rag-project) and click Create.
  5. Wait a minute or two for the cluster to provision. Once it shows a green “Ready” status, click on Details.

You will need two crucial pieces of information from this details page:

  • REST Endpoint: The URL of your cluster (e.g., https://bigdata-rag-project-xyz.weaviate.network).
  • Admin API Key: Click on the “API Keys” button to reveal it.

Open the .env file you created in the previous chapter and add these new variables:

# .env file
OPENAI_API_KEY="sk-your-openai-api-key-here"
WEAVIATE_URL="https://your-cluster-url.weaviate.network"
WEAVIATE_API_KEY="your-weaviate-admin-key"


Option B: Local Docker Deployment (Advanced)

If you prefer to run everything locally on your own machine, Weaviate provides official Docker images.

Ensure you have Docker Desktop installed and running. Then, create a file named docker-compose.yml in the root of your project folder and paste the following configuration:

---
services:
  weaviate:
    command:
    - --host
    - 0.0.0.0
    - --port
    - '8080'
    - --scheme
    - http
    image: cr.weaviate.io/semitechnologies/weaviate:1.24.4
    ports:
    - 8080:8080
    - 50051:50051
    environment:
      QUERY_DEFAULTS_LIMIT: 25
      AUTHENTICATION_ANONYMOUS_ACCESS_ENABLED: 'true'
      PERSISTENCE_DATA_PATH: '/var/lib/weaviate'
      DEFAULT_VECTORIZER_MODULE: 'text2vec-openai'
      ENABLE_MODULES: 'text2vec-openai,generative-openai'
      CLUSTER_HOSTNAME: 'node1'

Open your terminal in the same folder as the docker-compose.yml file and run:

docker compose up -d

Your local Weaviate instance will now be running at http://localhost:8080 (without needing a Weaviate API key).


Now that our Python environment is ready and our database is running, the real fun begins. Let’s write our first Python script!

Connecting to the Database

With our environment set up and our database running, it is time to write our first Python script.

Create a new folder in your project called src (or code), and inside it, create a file named 01_setup_connection.py.

The Python v4 Client

Weaviate’s v4 Python client is designed around Context Managers (the with statement in Python). This is a best practice because it automatically opens the connection to the database when you enter the block and, most importantly, gracefully closes the connection when you exit the block, preventing memory leaks.

Here is the code to connect to your Weaviate Cloud (WCD) instance and verify that it is ready to receive commands:

import weaviate
from weaviate.classes.init import Auth
import os
from dotenv import load_dotenv

# 1. Load the environment variables from the .env file
load_dotenv()

# 2. Retrieve the keys securely
weaviate_url = os.getenv("WEAVIATE_URL")
weaviate_api_key = os.getenv("WEAVIATE_API_KEY")
gemini_api_key = os.getenv("GEMINI_API_KEY")

# 3. Establish the connection using a context manager
with weaviate.connect_to_weaviate_cloud(
    cluster_url=weaviate_url,
    auth_credentials=Auth.api_key(weaviate_api_key),
    headers={
        "X-Goog-Studio-Api-Key": gemini_api_key  # We pass this so Weaviate can use Google Gemini
    }
) as client:

    # 4. Check if the connection was successful
    if client.is_ready():
        print("Successfully connected to Weaviate!")
    else:
        print("Failed to connect to Weaviate.")

    # As soon as we exit this 'with' block, the client automatically disconnects.

Local Docker Connection Alternative

If you chose Option B (Local Docker) in the previous chapter, your connection script is even simpler because you do not need a Weaviate API key or URL. You only need to pass the Gemini key in the headers.

Replace the connect_to_weaviate_cloud block with this:

with weaviate.connect_to_local(
    headers={
        "X-Goog-Studio-Api-Key": gemini_api_key
    }
) as client:
    
    if client.is_ready():
        print("Successfully connected to local Weaviate!")

Running the Script

Open your terminal, ensure your virtual environment is still activated, and run the script:

python src/01_setup_connection.py

If everything is configured correctly, you should see Successfully connected to Weaviate! printed in your terminal.

Now that we have an active line of communication, we can start shaping how our data will be stored.

Creating a Collection

Now that we have successfully connected to our database, we need to create a place to store our data. In Weaviate, this is called a Collection.

Think of a Collection as a blueprint. We need to tell Weaviate what properties (columns) our data will have, and more importantly, which AI model it should use to generate the vector embeddings.

Create a new file in your src folder named 02_create_collection.py and add the following code:

import weaviate
import weaviate.classes.config as wvcc
from weaviate.classes.init import Auth
import os
from dotenv import load_dotenv

load_dotenv()

with weaviate.connect_to_weaviate_cloud(
    cluster_url=os.getenv("WEAVIATE_URL"),
    auth_credentials=Auth.api_key(os.getenv("WEAVIATE_API_KEY")),
    headers={"X-Goog-Studio-Api-Key": os.getenv("GEMINI_API_KEY")}
) as client:

    collection_name = "Article"

    if client.collections.exists(collection_name):
        client.collections.delete(collection_name)
        print(f"Old collection '{collection_name}' deleted.")

    print(f"Creating collection '{collection_name}'...")
    client.collections.create(
        name=collection_name,
        properties=[
            wvcc.Property(name="title", data_type=wvcc.DataType.TEXT),
            wvcc.Property(name="content", data_type=wvcc.DataType.TEXT),
        ],

        # 3. Vectorizer for Google Gemini
        vector_config=wvcc.Configure.Vectors.text2vec_google_gemini(),

        # 4. Generative Module: SPECIFYING THE GEMINI MODEL!
        generative_config=wvcc.Configure.Generative.google_gemini(
            model="gemini-2.5-flash" # this model name could change in the future, check the documentation for the latest Gemini models available in Weaviate
        )
    )

    print(f"Collection '{collection_name}' created successfully with Gemini integrations!")

Breaking Down the Code

Let’s look at the most important parts of this script:

  • wvcc.Property(...): Here we define the shape of our data. For this simple RAG example, our Article collection will only have a title and the main content. Both are defined as text.
  • vector_config: This is where Weaviate shines. By setting this to text2vec_google_gemini(), we instruct Weaviate to take the text we insert, automatically call Google’s embedding API in the background, and store the resulting vector coordinates alongside our text.
  • generative_config: Weaviate also has built-in RAG capabilities. By defining a generative module here and passing a specific model like gemini-2.5-flash, we allow Weaviate to take search results and pass them directly to the LLM to generate an answer, all in a single database query.

Run the script from your terminal:

python src/02_create_collection.py

With our Article blueprint successfully created, our database is finally ready to receive data.

Loading Documents

Our Article collection is created and waiting. Now, we need to populate it with some actual text.

In a real-world Big Data scenario, you would be loading thousands of rows from a JSON file, a CSV, or scraping a company wiki. To keep things simple and focused on the mechanics, we will define a small list of dictionaries directly in our Python script.

Create a new file named 03_insert_data.py in your src folder and add the following code:

import weaviate
from weaviate.classes.init import Auth
import os
from dotenv import load_dotenv

load_dotenv()

documents = [
    {
        "title": "Introduction to Vector Databases",
        "content": "A vector database is a type of database that stores data as high-dimensional vectors. This allows for similarity search, where the database can find data that is semantically similar to a given query, rather than just exact keyword matches."
    },
    {
        "title": "Understanding RAG",
        "content": "Retrieval-Augmented Generation (RAG) is a technique that enhances large language models by retrieving relevant information from an external knowledge base before generating a response. This reduces hallucinations and keeps the information up to date."
    },
    {
        "title": "Weaviate Architecture",
        "content": "Weaviate is an open-source vector database. It uses modular architecture, allowing users to plug in different machine learning models for vectorization, such as Google Gemini, Hugging Face, or Cohere. It stores both objects and vectors."
    }
]

with weaviate.connect_to_weaviate_cloud(
    cluster_url=os.getenv("WEAVIATE_URL"),
    auth_credentials=Auth.api_key(os.getenv("WEAVIATE_API_KEY")),
    headers={"X-Goog-Studio-Api-Key": os.getenv("GEMINI_API_KEY")}
) as client:
    
    collection = client.collections.get("Article")
    
    print(f"Inserting {len(documents)} documents into Weaviate...")
    
    # Batch insertion
    response = collection.data.insert_many(documents)
    
    if response.has_errors:
        print("Errors occurred during insertion:")
        for error in response.errors:
            print(error)
    else:
        print("Successfully inserted all documents!")
        print("Notice: We did not calculate any vectors manually. Weaviate called Gemini in the background.")

The Magic of Automatic Vectorization

Run the script from your terminal:

python src/03_insert_data.py

If it succeeds, pause for a moment and look at the code. Did you notice what is missing?

Nowhere in this script did we import an AI library to calculate embeddings. We didn’t write code to convert our text strings into arrays of floats. We just threw standard Python strings at Weaviate.

Because we configured the Collection with text2vec_google_gemini(), Weaviate intercepted our raw data, securely sent it to Google AI Studio to get the vector coordinates, and stored both the text and the vectors together.

This completes the Ingestion phase. Our database is now fully populated and vectorized. Next, we will learn how to extract this knowledge using Semantic Search.

Semantic Search

We have reached the Retrieval phase of our RAG pipeline. Our database is populated with vectorized documents, and now we want to extract the most relevant information using a natural language question.

In a traditional SQL database, if you search for “How does the AI access external information?”, the database looks for those exact words. If the document says “Retrieval-Augmented Generation,” the SQL database will return zero results.

A Vector Database, however, maps the meaning of your question and finds the documents closest to it in the vector space.

Writing the Search Script

Create a new file named 04_vector_search.py in your src folder and add the following code:

import weaviate
from weaviate.classes.init import Auth
import os
from dotenv import load_dotenv

load_dotenv()

with weaviate.connect_to_weaviate_cloud(
    cluster_url=os.getenv("WEAVIATE_URL"),
    auth_credentials=Auth.api_key(os.getenv("WEAVIATE_API_KEY")),
    headers={"X-Goog-Studio-Api-Key": os.getenv("GEMINI_API_KEY")}
) as client:
    
    collection = client.collections.get("Article")
    
    user_query = "What happens when an AI model invents fake facts?"
    
    print(f"Executing semantic search for: '{user_query}'\n")
    print("-" * 50)
    
    # The 'near_text' function now automatically vectorizes our query using Gemini
    response = collection.query.near_text(
        query=user_query,
        limit=2,
        return_properties=["title", "content"]
    )
    
    for i, obj in enumerate(response.objects, 1):
        print(f"Result {i}:")
        print(f"Title: {obj.properties['title']}")
        print(f"Content: {obj.properties['content']}")
        print("-" * 50)

Run the script from your terminal:

python src/04_vector_search.py

Understanding the Output

Look closely at the results in your terminal. You asked about an AI model inventing “fake facts.” Weaviate should have returned the article titled “Understanding RAG” as the top result.

Why? Because that article mentions the word “hallucinations.” The AI embedding model knows that “inventing fake facts” and “hallucinations” share a very similar semantic meaning, placing their vectors close together.

Weaviate seamlessly handled the vectorization of your search query and the complex K-Nearest Neighbors (KNN) math required to find the closest matches.

Now we have the retrieved context. The final step is to pass this context to an LLM to generate a clean, conversational answer for the user.

Connecting the Dots

We have successfully ingested our data and learned how to retrieve it using semantic search. Now, it is time to complete our RAG (Retrieval-Augmented Generation) pipeline.

The final step is Generation. We need to take the raw documents we retrieved and pass them, along with the user’s original question, to a Large Language Model (LLM). The LLM will use those documents as its single source of truth to formulate a conversational, accurate answer.

While you could manually write code to build a prompt and send it to the Google Gemini API yourself, Weaviate makes this incredibly easy. Because we added a generative_config when we created our Collection, Weaviate can perform the retrieval and the generation in a single database query!

The Complete RAG Script

Create a new file named 05_rag_pipeline.py in your src folder and add the following code:

import weaviate
from weaviate.classes.init import Auth
import os
from dotenv import load_dotenv

load_dotenv()

with weaviate.connect_to_weaviate_cloud(
    cluster_url=os.getenv("WEAVIATE_URL"),
    auth_credentials=Auth.api_key(os.getenv("WEAVIATE_API_KEY")),
    headers={"X-Goog-Studio-Api-Key": os.getenv("GEMINI_API_KEY")}
) as client:
    
    collection = client.collections.get("Article")
    
    user_query = "Can you explain what a vector database is and how RAG improves AI models?"
    
    system_prompt = "Answer the user's question using ONLY the provided information. Be concise and clear."
    
    print(f"User Question: {user_query}\n")
    print("Thinking with Gemini...\n")
    
    # Weaviate will find the documents and pass them to Gemini to generate the answer
    response = collection.generate.near_text(
        query=user_query,
        limit=2,
        grouped_task=system_prompt
    )
    
    print("🤖 AI Generated Answer:")
    print("-" * 50)
    print(response.generative.text)
    print("-" * 50)
    
    print("\n📚 Sources Used:")
    for i, obj in enumerate(response.objects, 1):
        print(f" {i}. {obj.properties['title']}")

Running the Full Pipeline

Run the final script from your terminal:

python src/05_rag_pipeline.py

Reviewing the Output

When you look at your terminal, you will see a beautifully formatted, natural language answer.

Behind the scenes, in a fraction of a second, Weaviate:

  1. Vectorized your question using Google’s embedding model.
  2. Searched the vector space for the most relevant chunks of text.
  3. Extracted the title and content properties of those chunks.
  4. Assembled a hidden prompt containing your grouped_task instructions, your original question, and the retrieved text.
  5. Sent that prompt to Google’s text generation model (like Gemini 2.5 Flash).
  6. Returned the clean, finalized string to your Python script.

Congratulations! You have just built a fully functional Retrieval-Augmented Generation AI system from scratch.

Summary

Congratulations! You have successfully reached the end of this practical guide and built a fully functional, AI-powered search engine from scratch.

Let’s take a moment to review the core Big Data and AI concepts we covered in this tutorial:

  • The Problem with LLMs: We learned that while Large Language Models are brilliant at generating text, they suffer from hallucinations, have strict knowledge cutoffs, and cannot access private data.
  • Vector Databases & Embeddings: We explored how AI translates raw text into arrays of numbers (vectors) and maps them in a multidimensional space. This allows us to search by meaning rather than just exact keywords.
  • Weaviate Infrastructure: We set up a dedicated Vector Database instance and connected to it using the modern Weaviate v4 Python client.
  • Automated Data Ingestion: We defined a schema (Collection) and configured Weaviate to automatically handle the heavy lifting of generating embeddings using Google’s models in the background.
  • Semantic Search (Retrieval): We successfully extracted relevant documents by asking natural language questions, proving that our database understands context.
  • The RAG Pipeline (Generation): Finally, we connected the dots. We used Weaviate’s built-in generative modules to retrieve context and seamlessly pass it to an LLM, forcing it to generate a factual answer based only on our private data.

By completing this project, you have moved beyond simply chatting with an AI. You have built the foundational architecture that powers modern, enterprise-grade AI applications.

You now have a working template that you can expand with your own datasets, whether they are course notes, research papers, or gigabytes of unstructured Big Data.

Next Steps (Advanced RAG)

What we have built so far is often referred to as Naive RAG. It is incredibly powerful, but in complex, real-world Big Data environments, data is messy. Documents are massive, vocabularies are highly technical, and a simple semantic search might not always retrieve the perfect context.

If you want to take your project to the next level, here are the core concepts of Advanced RAG you should explore next:

1. Advanced Chunking Strategies

In our example, we ingested tiny, perfectly sized documents. In reality, you will be parsing 100-page PDFs. You cannot embed an entire book into a single vector, nor should you split it randomly in the middle of a sentence.

  • Goal: Look into intelligent text splitting techniques (like Recursive Character Chunking or Semantic Chunking) to ensure that each piece of text retains its full context before being vectorized.
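To give a flavor of the idea, here is a highly simplified recursive splitter: it tries the coarsest separator first (paragraphs, then lines, then sentences, then words) so chunks break at natural boundaries. Production splitters such as LangChain's RecursiveCharacterTextSplitter also handle chunk overlap and preserve separators; this toy version drops the separator it split on:

```python
def recursive_split(text, max_len, separators=("\n\n", "\n", ". ", " ")):
    """Split text into chunks of at most max_len characters, preferring
    natural boundaries (paragraphs before lines before sentences)."""
    if len(text) <= max_len:
        return [text]
    for sep in separators:
        parts = text.split(sep)
        if len(parts) > 1:
            chunks, current = [], ""
            for part in parts:
                candidate = current + sep + part if current else part
                if len(candidate) <= max_len:
                    current = candidate  # greedily pack parts into one chunk
                else:
                    if current:
                        chunks.append(current)
                    current = part
            if current:
                chunks.append(current)
            # Recurse in case a single part is still too long
            return [c for chunk in chunks
                      for c in recursive_split(chunk, max_len, separators)]
    # No separator helped: hard cut as a last resort
    return [text[i:i + max_len] for i in range(0, len(text), max_len)]

doc = ("Vector databases store embeddings.\n\n"
       "RAG retrieves context. Then an LLM answers.")
for chunk in recursive_split(doc, max_len=40):
    print(repr(chunk))
```

Each resulting chunk stays under the length limit and, crucially, never cuts a sentence in half, so each piece remains a coherent unit before being vectorized.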

2. Hybrid Search (Keywords + Vectors)

Semantic search is brilliant for understanding concepts, but it sometimes struggles with exact identifiers (e.g., searching for a specific product code like “TX-9902” or a specific person’s name).

  • Goal: Weaviate natively supports Hybrid Search, which runs a traditional keyword search (BM25) and a vector search simultaneously, merging the results. You can easily implement this by changing collection.query.near_text() to collection.query.hybrid().
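Conceptually, hybrid search must merge two independently ranked result lists into one. A common fusion method (one of the options Weaviate offers) is Reciprocal Rank Fusion, sketched below with made-up document IDs and result orderings:

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked result lists: each document earns
    1 / (k + rank) per list it appears in, summed across lists."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Made-up results: BM25 nails the exact code, vectors find the concept
keyword_results = ["spec_TX-9902", "intro_rag", "sql_basics"]
vector_results  = ["intro_rag", "vector_dbs", "spec_TX-9902"]

fused = reciprocal_rank_fusion([keyword_results, vector_results])
print(fused)
```

Documents that rank well in both lists float to the top, so the exact-match product spec and the semantically relevant article both survive the merge.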

3. Re-Ranking (Two-Stage Retrieval)

When you query a massive database, retrieving the top 100 results is fast, but the absolute best document might end up at position #15 instead of #1.

  • Goal: A Re-ranker is a specialized AI model that takes the initial broad results from Weaviate and carefully scores and re-orders them based on ultimate relevance to the user’s prompt, ensuring the LLM gets only the absolute best context.

4. Orchestration Frameworks

While writing pure Python code with the Weaviate client is the best way to learn the fundamentals, building complex AI agents often requires higher-level tools.

  • Goal: Explore frameworks like LangChain or LlamaIndex. They have built-in integrations for Weaviate and offer pre-built modules for parsing complex PDFs, managing conversational memory (chat history), and routing queries to different databases.

Final Thoughts

The field of AI and Big Data is evolving at a breakneck pace, but the underlying mechanics of Vector Databases and Retrieval-Augmented Generation are here to stay.

You now have the fundamental knowledge to not just use AI, but to actively build with it. Happy coding, and good luck with your Big Data projects!