DS1 Quickstart Tutorial
This guide walks you through building an intelligent chatbot using DS1 embeddings and Retrieval-Augmented Generation (RAG). You'll learn how to prepare your data, generate embeddings with DS1, implement semantic search, and combine everything with a language model to create a contextually-aware chatbot.
Brief Overview of the RAG Stack
RAG systems work by combining document retrieval with generative AI to produce informed, accurate responses. Here's the workflow:
- When a user asks a question, the system first converts it into a vector using DS1 embeddings.
- This query vector is then compared against pre-computed document vectors stored in your database.
- The system identifies the most semantically relevant documents by measuring vector similarity.
- These retrieved documents provide context that gets combined with the original question and sent to a large language model (like Claude or GPT-4), which generates a response grounded in your specific data.
Figure 1: RAG Workflow
Question → DS1 Query Embedding → Vector Similarity Search → Top Documents → Context + Question → LLM → Answer
This approach ensures your chatbot responds with information from your actual knowledge base rather than relying solely on the LLM's training data.
Prerequisites
To use DS1, you'll need:
- AWS credentials configured with access to your SageMaker endpoint (for more information, see the SageMaker Notebook guide)
- Python 3.8 or higher
- boto3 library installed
Install required packages:
pip install boto3 numpy
Prepare Data
Every RAG system starts with a knowledge base - a collection of documents your chatbot will reference. For this tutorial, we'll use six example documents covering diverse topics. In production, this would be your company's documentation, product information, or any domain-specific content you want your chatbot to be knowledgeable about.
You can use the following set of documents as a starting point:
documents = [
"The Amazon rainforest produces about 20% of the world's oxygen and is home to over 10% of all species on Earth, playing a crucial role in regulating global climate.",
"Quantum computers use quantum bits or qubits that can exist in multiple states simultaneously, potentially solving complex problems exponentially faster than classical computers.",
"The Mediterranean diet emphasizes fish, olive oil, vegetables, and whole grains, and has been linked to reduced risk of heart disease and improved cognitive function.",
"The Apollo 11 mission successfully landed humans on the Moon on July 20, 1969, with Neil Armstrong becoming the first person to walk on the lunar surface.",
"Machine learning models learn patterns from data through training algorithms, enabling applications like image recognition, natural language processing, and predictive analytics.",
"The Great Barrier Reef, spanning over 2,300 kilometers off Australia's coast, is the world's largest coral reef system and faces threats from climate change and ocean acidification."
]
Vectorise/Embed the Documents
Now we'll convert our text documents into numerical vectors using DS1. Since DS1 is hosted on SageMaker, we'll interact with it through AWS's boto3 SDK rather than a dedicated client library.
Setup SageMaker Client
import boto3
import json
import numpy as np
profile_name = "your-profile-name"
# Initialise SageMaker runtime client
boto_session = boto3.Session(profile_name=profile_name)
sagemaker_runtime = boto_session.client('sagemaker-runtime', region_name='eu-west-2') # Replace with your region
# Your DS1 endpoint name
ENDPOINT_NAME = 'your-ds1-endpoint-name' # Replace with your actual endpoint name
Embed Documents
def get_ds1_embeddings(texts):
"""
Get embeddings from DS1 SageMaker endpoint
Args:
texts: List of strings to embed
Returns:
List of embeddings
"""
# Prepare the payload
payload = {
"inputs": texts
}
# Invoke the endpoint
response = sagemaker_runtime.invoke_endpoint(
EndpointName=ENDPOINT_NAME,
ContentType='application/json',
Body=json.dumps(payload)
)
# Parse the response
result = json.loads(response['Body'].read().decode())
if isinstance(result, list):
embeddings = result # Embeddings returned directly
elif isinstance(result, dict) and "embeddings" in result:
embeddings = result["embeddings"]
else:
raise ValueError(f"Unexpected response format: {type(result)}")
return embeddings
# Embed all documents
documents_embeddings = get_ds1_embeddings(documents)
print(f"Created {len(documents_embeddings)} document embeddings")
print(f"Embedding dimension: {len(documents_embeddings[0])}")
Batch Processing for Large Document Sets
When working with thousands of documents, process them in batches to optimise performance and avoid timeouts. The batch size depends on your endpoint configuration, but 16-32 documents per batch typically works well:
def embed_documents_in_batches(documents, batch_size=32):
"""
Embed documents in batches to handle large datasets
"""
all_embeddings = []
for i in range(0, len(documents), batch_size):
batch = documents[i:i + batch_size]
batch_embeddings = get_ds1_embeddings(batch)
all_embeddings.extend(batch_embeddings)
print(f"Processed batch {i//batch_size + 1}/{(len(documents)-1)//batch_size + 1}")
return all_embeddings
# For large document sets
# documents_embeddings = embed_documents_in_batches(documents, batch_size=32)
A Minimalist Retrieval System
Embeddings transform text into numerical vectors that preserve semantic relationships. When two pieces of text have similar meanings, their embedding vectors will be close together in vector space. This mathematical property enables semantic search - finding relevant documents based on meaning rather than just keyword matching.
We measure the "closeness" between vectors using dot product (since DS1 vectors are L2 normalised). Higher scores indicate greater semantic similarity.
Single Query Search
Let's see how to find the most relevant document for a user's question:
query = "What makes quantum computers different from regular computers?"
# Get the embedding of the query
query_embedding = get_ds1_embeddings([query])[0]
Nearest Neighbor Search
Find the closest embedding among the documents based on similarity:
# Convert to numpy arrays
doc_embed = np.array(documents_embeddings)
query_embed = np.array(query_embedding)
# Compute the similarity using dot product
# Since DS1 embeddings are L2 normalised (unit length), dot product
# is mathematically equivalent to cosine similarity but faster to compute
similarities = np.dot(doc_embed, query_embed)
retrieved_id = np.argmax(similarities)
print("Most relevant document:")
print(documents[retrieved_id])
print(f"Similarity score: {similarities[retrieved_id]:.4f}")
k-Nearest Neighbors Search (k-NN)
Often you'll want to retrieve multiple relevant documents, not just the single best match. The k-nearest neighbors algorithm finds the top k documents with the highest similarity scores:
def k_nearest_neighbors(query_embedding, documents_embeddings, k=5):
"""
Find k most similar documents to the query using dot product
Since DS1 embeddings are L2 normalized, dot product is equivalent
to cosine similarity but computationally more efficient.
Args:
query_embedding: Query vector
documents_embeddings: List of document vectors
k: Number of top documents to retrieve
Returns:
top_k_embeddings: Top k document embeddings
top_k_indices: Indices of top k documents
"""
# Convert to numpy arrays
query_embedding = np.array(query_embedding)
documents_embeddings = np.array(documents_embeddings)
# Calculate dot product for each document (equivalent to cosine similarity for L2 normalized vectors)
similarities = np.dot(documents_embeddings, query_embedding)
# Sort by similarity in descending order
sorted_indices = np.argsort(similarities)[::-1]
# Take top k
top_k_indices = sorted_indices[:k]
top_k_embeddings = documents_embeddings[top_k_indices]
return top_k_embeddings, top_k_indices
# Retrieve top 3 most relevant documents
top_k_embeddings, top_k_indices = k_nearest_neighbors(
query_embedding,
documents_embeddings,
k=3
)
print("Top 3 most relevant documents:")
for i, idx in enumerate(top_k_indices):
similarity = np.dot(query_embedding, documents_embeddings[idx])
print(f"\n{i+1}. {documents[idx]}")
print(f" Similarity: {similarity:.4f}")
Understanding Similarity Metrics and Vector Search
L2 Normalisation and Distance Metrics
DS1 embeddings are L2 normalised - each vector has unit length (magnitude = 1). This simplifies similarity calculations:
# For unit-length vectors:
# cosine_similarity(a, b) = (a · b) / (||a|| × ||b||)
# = (a · b) / (1 × 1)
# = a · b
# Therefore: dot product equals cosine similarity
Why this matters:
- Faster computation: Skip magnitude calculations, just compute dot products
- Identical rankings: Same results as cosine similarity, better performance
- Database optimisation: Vector DBs can use specialized indices for inner product search
Cosine Similarity
Cosine similarity measures the angle between two vectors, with values ranging from -1 to 1. The formula is:
cosine_similarity(q, d) = (q · d) / (||q|| × ||d||)
Where:
- q · d represents the dot product
- ||q|| and ||d|| are the vector magnitudes
Higher values (closer to 1) indicate greater semantic similarity. Vectors pointing in the same direction have high cosine similarity, even if they have different magnitudes.
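As a quick sanity check, here is the formula in numpy, together with a confirmation that for unit-length vectors it reduces to a plain dot product. This is a minimal sketch that reuses two of the document embeddings computed earlier:
def cosine_similarity(a, b):
    """cosine_similarity(a, b) = (a · b) / (||a|| × ||b||)"""
    a, b = np.array(a), np.array(b)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array(documents_embeddings[0])
b = np.array(documents_embeddings[1])
print(f"Cosine similarity: {cosine_similarity(a, b):.6f}")
print(f"Dot product:       {np.dot(a, b):.6f}")  # matches (up to floating point) because DS1 vectors have unit length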
Nearest Neighbor Search
To find the most relevant document, we calculate similarity scores between the query vector and all document vectors, then select the document with the highest score:
best_match = document with max(similarity(query, document_i)) for all i
For k-nearest neighbors, we rank all documents by their similarity scores and select the top k results.
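For larger collections a full sort is unnecessary; numpy's argpartition selects the top k in roughly linear time. Here is a minimal sketch of a drop-in alternative to the argsort-based ranking used above (it assumes numpy is imported as np, as in the earlier sections):
def top_k_indices_fast(query_embedding, documents_embeddings, k=3):
    """Return indices of the k highest-scoring documents without sorting every score."""
    similarities = np.dot(np.array(documents_embeddings), np.array(query_embedding))
    # argpartition places the k largest scores in the last k positions (in no particular order)
    candidates = np.argpartition(similarities, -k)[-k:]
    # Sort only those k candidates by descending similarity
    return candidates[np.argsort(similarities[candidates])[::-1]]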
Vector Databases
When your document collection grows beyond a few thousand items, in-memory search becomes impractical. Vector databases solve this by using approximate nearest neighbor (ANN) algorithms that trade a small amount of accuracy for massive speed improvements - finding similar vectors in milliseconds even across millions of documents.
Popular vector database options:
- Pinecone: Fully managed, cloud-native vector database
- Weaviate: Open-source with GraphQL API and hybrid search
- FAISS: Meta's high-performance library, runs locally or in your infrastructure
- Qdrant: Rust-based engine with filtering and payload support
- MongoDB Atlas: Vector search added to your existing MongoDB deployment
- PostgreSQL + pgvector: Extension bringing vector search to Postgres
Configuring Vector Databases for DS1
Critical: Since DS1 embeddings are L2 normalised, configure your vector database to use the dot product or inner product distance metric instead of cosine similarity. They produce identical rankings but dot product is more efficient.
Example configurations:
Pinecone:
import pinecone
# Use 'dotproduct' metric for L2 normalized embeddings
pinecone.create_index(
name="ds1-index",
dimension=512,
metric="dotproduct" # Use dot product for L2 normalised vectors
)
Weaviate:
{
"class": "Document",
"vectorizer": "none", # We're providing embeddings
"vectorIndexConfig": {
"distance": "dot" # Use dot product
}
}
FAISS:
import faiss
# Use inner product (dot product) index
dimension = 512
index = faiss.IndexFlatIP(dimension) # IP = Inner Product
# Add normalised vectors directly
index.add(np.array(documents_embeddings, dtype="float32"))  # FAISS expects a float32 numpy array
Qdrant:
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams
client = QdrantClient("localhost", port=6333)
client.create_collection(
collection_name="ds1_documents",
vectors_config=VectorParams(
size=512,
distance=Distance.DOT # Use dot product
)
)
pgvector (PostgreSQL):
-- Create table with vector column
CREATE TABLE documents (
id SERIAL PRIMARY KEY,
content TEXT,
embedding vector(512)
);
-- Create index using inner product (dot product)
CREATE INDEX ON documents USING ivfflat (embedding vector_ip_ops);
-- Query using the inner product operator
-- Note: <#> returns the *negative* inner product, so negate it to report a positive similarity
SELECT content, -(embedding <#> query_embedding) AS similarity
FROM documents
ORDER BY embedding <#> query_embedding
LIMIT 10;
Why this matters:
- Using cosine similarity when embeddings are already L2 normalised wastes computation
- Dot product gives identical rankings while skipping the magnitude calculations, so it is faster in practice
- Some databases (like FAISS) are specifically optimised for inner product search (see the sketch after this list)
- Wrong metric configuration won't break functionality but reduces efficiency
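To make the metric configuration concrete, here is a minimal sketch of indexing and querying the example documents with FAISS's exact inner-product index. It assumes the faiss-cpu package is installed and reuses documents, documents_embeddings and get_ds1_embeddings from the earlier sections; note that FAISS expects float32 arrays:
import faiss
import numpy as np

# Build an exact inner-product index over the L2-normalised DS1 embeddings
doc_matrix = np.array(documents_embeddings, dtype="float32")
index = faiss.IndexFlatIP(doc_matrix.shape[1])
index.add(doc_matrix)

# Embed a query with DS1 and retrieve the top 3 documents
query_vec = np.array(get_ds1_embeddings(["How do qubits work?"]), dtype="float32")
scores, indices = index.search(query_vec, 3)
for score, idx in zip(scores[0], indices[0]):
    print(f"{score:.4f}  {documents[idx]}")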
A Minimalist RAG Chatbot
Now we'll build a complete RAG system that combines DS1's retrieval capabilities with a large language model's generation abilities. This hybrid approach produces responses that are both contextually accurate (thanks to retrieved documents) and naturally written (thanks to the LLM).
The process: retrieve relevant context from your knowledge base, then ask the LLM to answer the question using that specific information. This grounds the AI's responses in your data rather than relying on potentially outdated or generic training knowledge.
Retrieve Relevant Context
# Use the k-NN search to find top 3 relevant documents
query = "What makes quantum computers different from regular computers?"
query_embedding = get_ds1_embeddings([query])[0]
top_k_embeddings, top_k_indices = k_nearest_neighbors(
query_embedding,
documents_embeddings,
k=3
)
# Get the most relevant document
retrieved_doc = documents[top_k_indices[0]]
print(f"Retrieved document: {retrieved_doc}")
Generate Response with Claude
With our retrieved context in hand, we'll construct a prompt that gives Claude both the user's question and the relevant background information:
import anthropic
# Initialize Anthropic client
client = anthropic.Anthropic(api_key="YOUR_ANTHROPIC_API_KEY")
# Create a prompt with the retrieved context
prompt = f"Based on the information: '{retrieved_doc}', generate a response to: {query}"
# Generate response
message = client.messages.create(
model="claude-sonnet-4-5-20250929",
max_tokens=1024,
messages=[
{"role": "user", "content": prompt}
]
)
print("RAG Response:")
print(message.content[0].text)
Output with DS1 retrieval:
Quantum computers differ from regular computers in their fundamental computing units. While classical
computers use bits that can be either 0 or 1, quantum computers use quantum bits (qubits) that can
exist in multiple states simultaneously through a property called superposition. This allows quantum
computers to potentially solve complex problems exponentially faster than classical computers by
processing multiple possibilities at once.
Output without using DS1 retrieved documents:
Quantum computers differ from classical computers in several ways, primarily in how they process
information. While I can provide some general information, for the most accurate and specific details
about quantum computing technology and its advantages, I'd recommend consulting recent technical
resources or academic papers on the subject.
Notice how the RAG approach produces a more specific, confident answer using your knowledge base, while the non-RAG response is more general and cautious.
Using OpenAI GPT-4o
DS1 works equally well with other language models. Here's the same example using GPT-4o:
from openai import OpenAI
# Initialize OpenAI client
client = OpenAI(api_key="YOUR_OPENAI_API_KEY")
# Create prompt with retrieved context
prompt = f"Based on the information: '{retrieved_doc}', generate a response to: {query}"
# Generate response
response = client.chat.completions.create(
model="gpt-4o",
messages=[
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": prompt}
]
)
print("RAG Response:")
print(response.choices[0].message.content)
Complete Example
Here's everything we've covered assembled into a single, runnable script. This demonstrates the full RAG pipeline from document embedding through answer generation:
import boto3
import json
import numpy as np
import anthropic
# Configuration
SAGEMAKER_ENDPOINT = 'your-ds1-endpoint-name'
AWS_REGION = 'eu-west-2'
ANTHROPIC_API_KEY = 'your-anthropic-api-key'
# Initialize clients
profile_name = "your-profile-name"
boto_session = boto3.Session(profile_name=profile_name)
sagemaker_runtime = boto_session.client("sagemaker-runtime", region_name=AWS_REGION)
anthropic_client = anthropic.Anthropic(api_key=ANTHROPIC_API_KEY)
# Sample documents
documents = [
"The Amazon rainforest produces about 20% of the world's oxygen and is home to over 10% of all species on Earth, playing a crucial role in regulating global climate.",
"Quantum computers use quantum bits or qubits that can exist in multiple states simultaneously, potentially solving complex problems exponentially faster than classical computers.",
"The Mediterranean diet emphasizes fish, olive oil, vegetables, and whole grains, and has been linked to reduced risk of heart disease and improved cognitive function.",
"The Apollo 11 mission successfully landed humans on the Moon on July 20, 1969, with Neil Armstrong becoming the first person to walk on the lunar surface.",
"Machine learning models learn patterns from data through training algorithms, enabling applications like image recognition, natural language processing, and predictive analytics.",
"The Great Barrier Reef, spanning over 2,300 kilometers off Australia's coast, is the world's largest coral reef system and faces threats from climate change and ocean acidification."
]
def get_ds1_embeddings(texts):
"""Get embeddings from DS1 SageMaker endpoint"""
payload = {"inputs": texts}
response = sagemaker_runtime.invoke_endpoint(
EndpointName=SAGEMAKER_ENDPOINT,
ContentType='application/json',
Body=json.dumps(payload)
)
result = json.loads(response['Body'].read().decode())
embeddings = result
return embeddings
def k_nearest_neighbors(query_embedding, documents_embeddings, k=3):
"""Find k most similar documents using dot product"""
query_embedding = np.array(query_embedding)
documents_embeddings = np.array(documents_embeddings)
similarities = np.dot(documents_embeddings, query_embedding)
top_k_indices = np.argsort(similarities)[::-1][:k]
return top_k_indices
def rag_chatbot(query, documents, documents_embeddings):
"""Complete RAG pipeline"""
# 1. Embed the query
query_embedding = get_ds1_embeddings([query])[0]
# 2. Find most relevant documents
top_k_indices = k_nearest_neighbors(query_embedding, documents_embeddings, k=3)
retrieved_doc = documents[top_k_indices[0]]
# 3. Create prompt with context
prompt = f"Based on the information: '{retrieved_doc}', generate a response to: {query}"
# 4. Generate response with LLM
message = anthropic_client.messages.create(
model="claude-sonnet-4-5-20250929",
max_tokens=1024,
messages=[{"role": "user", "content": prompt}]
)
return message.content[0].text, retrieved_doc
# Main execution
if __name__ == "__main__":
# Step 1: Embed all documents
print("Embedding documents...")
documents_embeddings = get_ds1_embeddings(documents)
print(f"Created {len(documents_embeddings)} embeddings\n")
# Step 2: Query the chatbot
query = "What makes quantum computers different from regular computers?"
print(f"Query: {query}\n")
# Step 3: Get RAG response
response, context = rag_chatbot(query, documents, documents_embeddings)
print(f"Retrieved Context: {context}\n")
print(f"RAG Response: {response}")
Error Handling and Best Practices
Production systems need robust error handling. SageMaker endpoints can experience transient failures, rate limits, or timeouts. Here's how to handle these gracefully:
Handle SageMaker Endpoint Errors
def get_ds1_embeddings_safe(texts, max_retries=3):
"""
Get embeddings with error handling and retries
"""
import time
for attempt in range(max_retries):
try:
payload = {"inputs": texts}
response = sagemaker_runtime.invoke_endpoint(
EndpointName=ENDPOINT_NAME,
ContentType='application/json',
Body=json.dumps(payload)
)
result = json.loads(response['Body'].read().decode())
embeddings = result
return embeddings
except Exception as e:
if attempt < max_retries - 1:
wait_time = 2 ** attempt # Exponential backoff
print(f"Error: {e}. Retrying in {wait_time} seconds...")
time.sleep(wait_time)
else:
print(f"Failed after {max_retries} attempts: {e}")
raise
Best Practices
- Optimize Batch Size: Test different batch sizes (16-32 documents) to find the sweet spot between throughput and latency for your endpoint
- Cache Document Embeddings: Compute document embeddings once and store them. Only generate new embeddings when documents change (see the sketch after this list)
- Implement Robust Error Handling: Network issues happen. Use retry logic with exponential backoff to handle transient failures gracefully
- Monitor Endpoint Performance: Track SageMaker metrics like invocation count, latency, and error rates to identify bottlenecks
- Right-size Your Instance: Use SageMaker Inference Recommender to select the most cost-effective instance type for your workload
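As an example of the caching recommendation above, here is a minimal sketch that stores document embeddings on disk with numpy so they are only recomputed when the cache is missing. The cache filename is arbitrary, and a production version would also invalidate the cache when the documents change (for example, by hashing them):
import os
import numpy as np

EMBEDDINGS_CACHE = "documents_embeddings.npy"  # hypothetical cache location

def load_or_create_embeddings(documents):
    """Reuse cached document embeddings if available; otherwise embed with DS1 and cache them."""
    if os.path.exists(EMBEDDINGS_CACHE):
        return np.load(EMBEDDINGS_CACHE)
    embeddings = np.array(get_ds1_embeddings(documents))
    np.save(EMBEDDINGS_CACHE, embeddings)
    return embeddings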
Next Steps
Ready to take your RAG system further? Here are some directions to explore:
- Production-Scale Retrieval: Move beyond in-memory search by integrating vector databases like Pinecone, Weaviate, or FAISS for sub-millisecond searches across millions of documents
- Hybrid Search Strategies: Combine semantic search (using DS1) with traditional keyword search (BM25) to capture both conceptual and exact matches (a rough sketch follows this list)
- Multimodal Applications: Extend your RAG system to handle images, PDFs, and other content types alongside text
- Quality Metrics: Implement evaluation frameworks to measure retrieval precision, answer accuracy, and end-to-end system performance
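As a starting point for the hybrid search idea above, here is a rough sketch that blends DS1 similarity scores with BM25 keyword scores. It assumes the rank_bm25 package plus the get_ds1_embeddings function and documents_embeddings from earlier sections; the alpha weighting and whitespace tokenisation are simplifications you would tune for real data:
from rank_bm25 import BM25Okapi
import numpy as np

def hybrid_search(query, documents, documents_embeddings, alpha=0.5, k=3):
    """Blend DS1 semantic scores with BM25 keyword scores; alpha weights the semantic side."""
    # Semantic scores: dot product against the L2-normalised DS1 embeddings
    query_embedding = np.array(get_ds1_embeddings([query])[0])
    semantic_scores = np.dot(np.array(documents_embeddings), query_embedding)
    # Keyword scores: BM25 over whitespace-tokenised documents
    bm25 = BM25Okapi([doc.lower().split() for doc in documents])
    keyword_scores = np.array(bm25.get_scores(query.lower().split()))
    # Min-max normalise both score sets so they are on comparable scales before blending
    def normalise(x):
        return (x - x.min()) / (x.max() - x.min() + 1e-9)
    combined = alpha * normalise(semantic_scores) + (1 - alpha) * normalise(keyword_scores)
    return np.argsort(combined)[::-1][:k]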
Additional Resources
For deeper dives into the technologies used in this guide:
- AWS SageMaker: Complete documentation on deployment, monitoring, and optimization at https://docs.aws.amazon.com/sagemaker/
- Boto3 Reference: API documentation for SageMaker Runtime at https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker-runtime.html
- RAG Research: Academic papers and industry blog posts exploring retrieval-augmented generation patterns and best practices
Questions about DS1? Contact our technical support team 🤓