
DS1 Quickstart Tutorial

This guide walks you through building an intelligent chatbot using DS1 embeddings and Retrieval-Augmented Generation (RAG). You'll learn how to prepare your data, generate embeddings with DS1, implement semantic search, and combine everything with a language model to create a contextually-aware chatbot.

Brief Overview of the RAG Stack

RAG systems work by combining document retrieval with generative AI to produce informed, accurate responses. Here's the workflow:

  • When a user asks a question, the system first converts it into a vector using DS1 embeddings.
  • This query vector is then compared against pre-computed document vectors stored in your database.
  • The system identifies the most semantically relevant documents by measuring vector similarity.
  • These retrieved documents provide context that gets combined with the original question and sent to a large language model (like Claude or GPT-4), which generates a response grounded in your specific data.

Figure 1: RAG Workflow

Question → DS1 Query Embedding → Vector Similarity Search → Top Documents → Context + Question → LLM → Answer

This approach ensures your chatbot responds with information from your actual knowledge base rather than relying solely on the LLM's training data.

Prerequisites

To use DS1, you'll need:

  • AWS credentials configured with access to your SageMaker endpoint (see the SageMaker Notebook guide for further details)
  • Python 3.8 or higher
  • boto3 library installed

Install required packages:

bash
pip install boto3 numpy
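
Optionally, you can confirm that your AWS profile resolves to valid credentials before continuing. This is a quick sanity check using STS; "your-profile-name" is a placeholder for the profile you configured:

python
import boto3

# Sanity check: confirm the AWS profile resolves to valid credentials
session = boto3.Session(profile_name="your-profile-name")
identity = session.client("sts").get_caller_identity()
print(f"Authenticated as: {identity['Arn']}")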

Prepare Data

Every RAG system starts with a knowledge base - a collection of documents your chatbot will reference. For this tutorial, we'll use six example documents covering diverse topics. In production, this would be your company's documentation, product information, or any domain-specific content you want your chatbot to be knowledgeable about.

You can use the following set of documents as a starting point:

python
documents = [
    "The Amazon rainforest produces about 20% of the world's oxygen and is home to over 10% of all species on Earth, playing a crucial role in regulating global climate.",
    "Quantum computers use quantum bits or qubits that can exist in multiple states simultaneously, potentially solving complex problems exponentially faster than classical computers.",
    "The Mediterranean diet emphasizes fish, olive oil, vegetables, and whole grains, and has been linked to reduced risk of heart disease and improved cognitive function.",
    "The Apollo 11 mission successfully landed humans on the Moon on July 20, 1969, with Neil Armstrong becoming the first person to walk on the lunar surface.",
    "Machine learning models learn patterns from data through training algorithms, enabling applications like image recognition, natural language processing, and predictive analytics.",
    "The Great Barrier Reef, spanning over 2,300 kilometers off Australia's coast, is the world's largest coral reef system and faces threats from climate change and ocean acidification."
]

Vectorise/Embed the Documents

Now we'll convert our text documents into numerical vectors using DS1. Since DS1 is hosted on SageMaker, we'll interact with it through AWS's boto3 SDK rather than a dedicated client library.

Set Up the SageMaker Client

python
import boto3
import json
import numpy as np

profile_name = "your-profile-name"

# Initialise SageMaker runtime client
boto_session = boto3.Session(profile_name=profile_name)
sagemaker_runtime = boto_session.client('sagemaker-runtime', region_name='eu-west-2')  # Replace with your region

# Your DS1 endpoint name
ENDPOINT_NAME = 'your-ds1-endpoint-name'  # Replace with your actual endpoint name

Embed Documents

python
def get_ds1_embeddings(texts):
    """
    Get embeddings from DS1 SageMaker endpoint
    
    Args:
        texts: List of strings to embed
    
    Returns:
        List of embeddings
    """
    # Prepare the payload
    payload = {
        "inputs": texts
    }
    
    # Invoke the endpoint
    response = sagemaker_runtime.invoke_endpoint(
        EndpointName=ENDPOINT_NAME,
        ContentType='application/json',
        Body=json.dumps(payload)
    )
    
    # Parse the response
    result = json.loads(response['Body'].read().decode())
    if isinstance(result, list):
        embeddings = result  # Embeddings returned directly
    elif isinstance(result, dict) and "embeddings" in result:
        embeddings = result["embeddings"]
    else:
        raise ValueError(f"Unexpected response format: {type(result)}")
    
    return embeddings

# Embed all documents
documents_embeddings = get_ds1_embeddings(documents)
print(f"Created {len(documents_embeddings)} document embeddings")
print(f"Embedding dimension: {len(documents_embeddings[0])}")

Batch Processing for Large Document Sets

When working with thousands of documents, process them in batches to optimise performance and avoid timeouts. The batch size depends on your endpoint configuration, but 16-32 documents per batch typically works well:

python
def embed_documents_in_batches(documents, batch_size=32):
    """
    Embed documents in batches to handle large datasets
    """
    all_embeddings = []
    
    for i in range(0, len(documents), batch_size):
        batch = documents[i:i + batch_size]
        batch_embeddings = get_ds1_embeddings(batch)
        all_embeddings.extend(batch_embeddings)
        print(f"Processed batch {i//batch_size + 1}/{(len(documents)-1)//batch_size + 1}")
    
    return all_embeddings

# For large document sets
# documents_embeddings = embed_documents_in_batches(documents, batch_size=32)

A Minimalist Retrieval System

Embeddings transform text into numerical vectors that preserve semantic relationships. When two pieces of text have similar meanings, their embedding vectors will be close together in vector space. This mathematical property enables semantic search - finding relevant documents based on meaning rather than just keyword matching.

We measure the "closeness" between vectors with the dot product (since DS1 vectors are L2 normalised, this is equivalent to cosine similarity). Higher scores indicate greater semantic similarity.

Let's see how to find the most relevant document for a user's question:

python
query = "What makes quantum computers different from regular computers?"

# Get the embedding of the query
query_embedding = get_ds1_embeddings([query])[0]

Find the closest embedding among the documents based on similarity:

python
# Convert to numpy arrays
doc_embed = np.array(documents_embeddings)
query_embed = np.array(query_embedding)

# Compute the similarity using dot product
# Since DS1 embeddings are L2 normalised (unit length), dot product 
# is mathematically equivalent to cosine similarity but faster to compute
similarities = np.dot(doc_embed, query_embed)
retrieved_id = np.argmax(similarities)

print("Most relevant document:")
print(documents[retrieved_id])
print(f"Similarity score: {similarities[retrieved_id]:.4f}")

k-Nearest Neighbors Search (k-NN)

Often you'll want to retrieve multiple relevant documents, not just the single best match. The k-nearest neighbors algorithm finds the top k documents with the highest similarity scores:

python
def k_nearest_neighbors(query_embedding, documents_embeddings, k=5):
    """
    Find k most similar documents to the query using dot product
    
    Since DS1 embeddings are L2 normalized, dot product is equivalent 
    to cosine similarity but computationally more efficient.
    
    Args:
        query_embedding: Query vector
        documents_embeddings: List of document vectors
        k: Number of top documents to retrieve
    
    Returns:
        top_k_embeddings: Top k document embeddings
        top_k_indices: Indices of top k documents
    """
    # Convert to numpy arrays
    query_embedding = np.array(query_embedding)
    documents_embeddings = np.array(documents_embeddings)
    
    # Calculate dot product for each document (equivalent to cosine similarity for L2 normalized vectors)
    similarities = np.dot(documents_embeddings, query_embedding)
    
    # Sort by similarity in descending order
    sorted_indices = np.argsort(similarities)[::-1]
    
    # Take top k
    top_k_indices = sorted_indices[:k]
    top_k_embeddings = documents_embeddings[top_k_indices]
    
    return top_k_embeddings, top_k_indices

# Retrieve top 3 most relevant documents
top_k_embeddings, top_k_indices = k_nearest_neighbors(
    query_embedding, 
    documents_embeddings, 
    k=3
)

print("Top 3 most relevant documents:")
for i, idx in enumerate(top_k_indices):
    similarity = np.dot(query_embedding, documents_embeddings[idx])
    print(f"\n{i+1}. {documents[idx]}")
    print(f"   Similarity: {similarity:.4f}")

L2 Normalisation and Distance Metrics

DS1 embeddings are L2 normalised - each vector has unit length (magnitude = 1). This simplifies similarity calculations:

python
# For unit-length vectors:
# cosine_similarity(a, b) = (a · b) / (||a|| × ||b||) 
#                         = (a · b) / (1 × 1)
#                         = a · b

# Therefore: dot product equals cosine similarity
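
You can confirm this equivalence numerically with a small standalone check (not part of the RAG pipeline, just two random vectors normalised to unit length):

python
import numpy as np

# Numerical check: for unit-length vectors, dot product equals cosine similarity
rng = np.random.default_rng(0)
a, b = rng.normal(size=512), rng.normal(size=512)
a, b = a / np.linalg.norm(a), b / np.linalg.norm(b)  # L2 normalise

dot_product = np.dot(a, b)
cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(f"Dot product: {dot_product:.6f}")
print(f"Cosine similarity: {cosine:.6f}")  # identical values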

Why this matters:

  • Faster computation: Skip magnitude calculations, just compute dot products
  • Identical rankings: Same results as cosine similarity, better performance
  • Database optimisation: Vector DBs can use specialized indices for inner product search

Cosine Similarity

Cosine similarity measures the angle between two vectors, with values ranging from -1 to 1. The formula is:

cosine_similarity(q, d) = (q · d) / (||q|| × ||d||)

Where:

  • q · d represents the dot product
  • ||q|| and ||d|| are the vector magnitudes

Higher values (closer to 1) indicate greater semantic similarity. Vectors pointing in the same direction have high cosine similarity, even if they have different magnitudes.
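
For reference, the formula translates directly into a small NumPy helper (a general-purpose version; with DS1 you can skip the magnitude terms because the vectors are already unit length):

python
import numpy as np

def cosine_similarity(q, d):
    """Cosine similarity between two vectors, in the range [-1, 1]."""
    q, d = np.asarray(q), np.asarray(d)
    return np.dot(q, d) / (np.linalg.norm(q) * np.linalg.norm(d))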

Nearest Neighbor Search

To find the most relevant document, we calculate similarity scores between the query vector and all document vectors, then select the document with the highest score:

best_match = document with max(similarity(query, document_i)) for all i

For k-nearest neighbors, we rank all documents by their similarity scores and select the top k results.
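
For larger in-memory collections, a full sort of every score is unnecessary. One option (a sketch, not required for the six-document example) is np.argpartition, which selects the k highest scores in linear time and only sorts those k:

python
import numpy as np

def top_k_indices(similarities, k):
    """Return indices of the k highest similarity scores without a full sort."""
    similarities = np.asarray(similarities)
    k = min(k, len(similarities))
    candidates = np.argpartition(similarities, -k)[-k:]             # top k, unordered
    return candidates[np.argsort(similarities[candidates])[::-1]]   # order by score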

Vector Databases

When your document collection grows beyond a few thousand items, in-memory search becomes impractical. Vector databases solve this by using approximate nearest neighbor (ANN) algorithms that trade a small amount of accuracy for massive speed improvements - finding similar vectors in milliseconds even across millions of documents.

Popular vector database options:

  • Pinecone: Fully managed, cloud-native vector database
  • Weaviate: Open-source with GraphQL API and hybrid search
  • FAISS: Meta's high-performance library, runs locally or in your infrastructure
  • Qdrant: Rust-based engine with filtering and payload support
  • MongoDB Atlas: Vector search added to your existing MongoDB deployment
  • PostgreSQL + pgvector: Extension bringing vector search to Postgres

Configuring Vector Databases for DS1

Critical: Since DS1 embeddings are L2 normalised, configure your vector database to use the dot product or inner product distance metric instead of cosine similarity. They produce identical rankings but dot product is more efficient.

Example configurations:

Pinecone:

python
import pinecone

# Classic pinecone-client interface; newer SDK versions expose the same options
# through the Pinecone class instead of module-level functions
pinecone.init(api_key="YOUR_PINECONE_API_KEY", environment="YOUR_ENVIRONMENT")

# Use 'dotproduct' metric for L2 normalised embeddings
pinecone.create_index(
    name="ds1-index",
    dimension=512,
    metric="dotproduct"  # Use dot product for L2 normalised vectors
)

Weaviate:

python
# Class definition to register with the Weaviate schema
document_class = {
    "class": "Document",
    "vectorizer": "none",  # We're providing embeddings ourselves
    "vectorIndexConfig": {
        "distance": "dot"  # Use dot product
    }
}

FAISS:

python
import faiss
import numpy as np

# Use inner product (dot product) index
dimension = 512
index = faiss.IndexFlatIP(dimension)  # IP = Inner Product

# FAISS expects a float32 matrix; the DS1 vectors are already L2 normalised
index.add(np.array(documents_embeddings, dtype='float32'))
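
Searching the index returns inner-product scores alongside document indices. The snippet below assumes the query_embedding and documents list from the earlier retrieval example:

python
# Search the FAISS index for the 3 most similar documents
query_vector = np.array([query_embedding], dtype='float32')  # shape (1, dimension)
scores, indices = index.search(query_vector, 3)
for score, idx in zip(scores[0], indices[0]):
    print(f"{score:.4f}  {documents[idx]}")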

Qdrant:

python
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams

client = QdrantClient("localhost", port=6333)

client.create_collection(
    collection_name="ds1_documents",
    vectors_config=VectorParams(
        size=512,
        distance=Distance.DOT  # Use dot product
    )
)

pgvector (PostgreSQL):

sql
-- Create table with vector column
CREATE TABLE documents (
    id SERIAL PRIMARY KEY,
    content TEXT,
    embedding vector(512)
);

-- Create index using inner product (dot product)
CREATE INDEX ON documents USING ivfflat (embedding vector_ip_ops);

-- Query using the inner product operator
-- Note: <#> returns the *negative* inner product, so ascending order puts the
-- most similar rows first; negate the value to report a positive similarity score
SELECT content, -(embedding <#> query_embedding) AS similarity
FROM documents
ORDER BY embedding <#> query_embedding
LIMIT 10;

Why this matters:

  • Using cosine similarity when embeddings are already L2 normalised wastes computation
  • Dot product gives identical rankings but is 15-30% faster
  • Some databases (like FAISS) are specifically optimised for inner product search
  • Wrong metric configuration won't break functionality but reduces efficiency

A Minimalist RAG Chatbot

Now we'll build a complete RAG system that combines DS1's retrieval capabilities with a large language model's generation abilities. This hybrid approach produces responses that are both contextually accurate (thanks to retrieved documents) and naturally written (thanks to the LLM).

The process: retrieve relevant context from your knowledge base, then ask the LLM to answer the question using that specific information. This grounds the AI's responses in your data rather than relying on potentially outdated or generic training knowledge.

Retrieve Relevant Context

python
# Use the k-NN search to find top 3 relevant documents
query = "What makes quantum computers different from regular computers?"
query_embedding = get_ds1_embeddings([query])[0]
top_k_embeddings, top_k_indices = k_nearest_neighbors(
    query_embedding, 
    documents_embeddings, 
    k=3
)

# Get the most relevant document
retrieved_doc = documents[top_k_indices[0]]
print(f"Retrieved document: {retrieved_doc}")

Generate Response with Claude

With our retrieved context in hand, we'll construct a prompt that gives Claude both the user's question and the relevant background information:

python
import anthropic

# Initialize Anthropic client
client = anthropic.Anthropic(api_key="YOUR_ANTHROPIC_API_KEY")

# Create a prompt with the retrieved context
prompt = f"Based on the information: '{retrieved_doc}', generate a response to: {query}"

# Generate response
message = client.messages.create(
    model="claude-sonnet-4-5-20250929",
    max_tokens=1024,
    messages=[
        {"role": "user", "content": prompt}
    ]
)

print("RAG Response:")
print(message.content[0].text)

Output with DS1 retrieval:

Quantum computers differ from regular computers in their fundamental computing units. While classical 
computers use bits that can be either 0 or 1, quantum computers use quantum bits (qubits) that can 
exist in multiple states simultaneously through a property called superposition. This allows quantum 
computers to potentially solve complex problems exponentially faster than classical computers by 
processing multiple possibilities at once.

Output without using DS1 retrieved documents:

Quantum computers differ from classical computers in several ways, primarily in how they process 
information. While I can provide some general information, for the most accurate and specific details 
about quantum computing technology and its advantages, I'd recommend consulting recent technical 
resources or academic papers on the subject.

Notice how the RAG approach produces a more specific, confident answer using your knowledge base, while the non-RAG response is more general and cautious.

Using OpenAI GPT-4o

DS1 works equally well with other language models. Here's the same example using GPT-4o:

python
from openai import OpenAI

# Initialize OpenAI client
client = OpenAI(api_key="YOUR_OPENAI_API_KEY")

# Create prompt with retrieved context
prompt = f"Based on the information: '{retrieved_doc}', generate a response to: {query}"

# Generate response
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": prompt}
    ]
)

print("RAG Response:")
print(response.choices[0].message.content)

Complete Example

Here's everything we've covered assembled into a single, runnable script. This demonstrates the full RAG pipeline from document embedding through answer generation:

python
import boto3
import json
import numpy as np
import anthropic

# Configuration
SAGEMAKER_ENDPOINT = 'your-ds1-endpoint-name'
AWS_REGION = 'eu-west-2'
ANTHROPIC_API_KEY = 'your-anthropic-api-key'

# Initialize clients
profile_name = "your-profile-name"
boto_session = boto3.Session(profile_name=profile_name)
sagemaker_runtime = boto_session.client("sagemaker-runtime", region_name=AWS_REGION)
anthropic_client = anthropic.Anthropic(api_key=ANTHROPIC_API_KEY)

# Sample documents
documents = [
    "The Amazon rainforest produces about 20% of the world's oxygen and is home to over 10% of all species on Earth, playing a crucial role in regulating global climate.",
    "Quantum computers use quantum bits or qubits that can exist in multiple states simultaneously, potentially solving complex problems exponentially faster than classical computers.",
    "The Mediterranean diet emphasizes fish, olive oil, vegetables, and whole grains, and has been linked to reduced risk of heart disease and improved cognitive function.",
    "The Apollo 11 mission successfully landed humans on the Moon on July 20, 1969, with Neil Armstrong becoming the first person to walk on the lunar surface.",
    "Machine learning models learn patterns from data through training algorithms, enabling applications like image recognition, natural language processing, and predictive analytics.",
    "The Great Barrier Reef, spanning over 2,300 kilometers off Australia's coast, is the world's largest coral reef system and faces threats from climate change and ocean acidification."
]

def get_ds1_embeddings(texts):
    """Get embeddings from DS1 SageMaker endpoint"""
    payload = {"inputs": texts}
    response = sagemaker_runtime.invoke_endpoint(
        EndpointName=SAGEMAKER_ENDPOINT,
        ContentType='application/json',
        Body=json.dumps(payload)
    )
    result = json.loads(response['Body'].read().decode())
    # Handle both response formats (plain list or {"embeddings": [...]})
    if isinstance(result, dict) and "embeddings" in result:
        return result["embeddings"]
    return result

def k_nearest_neighbors(query_embedding, documents_embeddings, k=3):
    """Find k most similar documents using dot product"""
    query_embedding = np.array(query_embedding)
    documents_embeddings = np.array(documents_embeddings)
    similarities = np.dot(documents_embeddings, query_embedding)
    top_k_indices = np.argsort(similarities)[::-1][:k]
    return top_k_indices

def rag_chatbot(query, documents, documents_embeddings):
    """Complete RAG pipeline"""
    # 1. Embed the query
    query_embedding = get_ds1_embeddings([query])[0]
    
    # 2. Find most relevant documents
    top_k_indices = k_nearest_neighbors(query_embedding, documents_embeddings, k=3)
    retrieved_doc = documents[top_k_indices[0]]
    
    # 3. Create prompt with context
    prompt = f"Based on the information: '{retrieved_doc}', generate a response to: {query}"
    
    # 4. Generate response with LLM
    message = anthropic_client.messages.create(
        model="claude-sonnet-4-5-20250929",
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}]
    )
    
    return message.content[0].text, retrieved_doc

# Main execution
if __name__ == "__main__":
    # Step 1: Embed all documents
    print("Embedding documents...")
    documents_embeddings = get_ds1_embeddings(documents)
    print(f"Created {len(documents_embeddings)} embeddings\n")
    
    # Step 2: Query the chatbot
    query = "What makes quantum computers different from regular computers?"
    print(f"Query: {query}\n")
    
    # Step 3: Get RAG response
    response, context = rag_chatbot(query, documents, documents_embeddings)
    
    print(f"Retrieved Context: {context}\n")
    print(f"RAG Response: {response}")

Error Handling and Best Practices

Production systems need robust error handling. SageMaker endpoints can experience transient failures, rate limits, or timeouts. Here's how to handle these gracefully:

Handle SageMaker Endpoint Errors

python
def get_ds1_embeddings_safe(texts, max_retries=3):
    """
    Get embeddings with error handling and retries
    """
    import time
    
    for attempt in range(max_retries):
        try:
            payload = {"inputs": texts}
            response = sagemaker_runtime.invoke_endpoint(
                EndpointName=ENDPOINT_NAME,
                ContentType='application/json',
                Body=json.dumps(payload)
            )
            result = json.loads(response['Body'].read().decode())
            # Handle both response formats (plain list or {"embeddings": [...]})
            if isinstance(result, dict) and "embeddings" in result:
                return result["embeddings"]
            return result
            
        except Exception as e:
            if attempt < max_retries - 1:
                wait_time = 2 ** attempt  # Exponential backoff
                print(f"Error: {e}. Retrying in {wait_time} seconds...")
                time.sleep(wait_time)
            else:
                print(f"Failed after {max_retries} attempts: {e}")
                raise

Best Practices

  1. Optimize Batch Size: Test different batch sizes (16-32 documents) to find the sweet spot between throughput and latency for your endpoint
  2. Cache Document Embeddings: Compute document embeddings once and store them. Only generate new embeddings when documents change (a simple file-based sketch follows this list)
  3. Implement Robust Error Handling: Network issues happen. Use retry logic with exponential backoff to handle transient failures gracefully
  4. Monitor Endpoint Performance: Track SageMaker metrics like invocation count, latency, and error rates to identify bottlenecks
  5. Right-size Your Instance: Use SageMaker Inference Recommender to select the most cost-effective instance type for your workload

Next Steps

Ready to take your RAG system further? Here are some directions to explore:

  • Production-Scale Retrieval: Move beyond in-memory search by integrating vector databases like Pinecone, Weaviate, or FAISS for sub-millisecond searches across millions of documents
  • Hybrid Search Strategies: Combine semantic search (using DS1) with traditional keyword search (BM25) to capture both conceptual and exact matches
  • Multimodal Applications: Extend your RAG system to handle images, PDFs, and other content types alongside text
  • Quality Metrics: Implement evaluation frameworks to measure retrieval precision, answer accuracy, and end-to-end system performance

Additional Resources

For deeper dives into the technologies used in this guide:


Questions about DS1? Contact our technical support team 🤓