Generative AI & API Integration

OpenAI, Azure, Bedrock, Google, RAG & Caching

Python Tutorial Series

Introduction to Generative AI

What is Generative AI?

  • AI models that create human-like content
    • Text, code, images, audio
    • Trained on massive datasets
    • Use prompt-based interfaces
  • Key providers in 2024:
    • OpenAI (GPT-4, ChatGPT)
    • Azure OpenAI (Enterprise features)
    • AWS Bedrock (Multiple models)
    • Google AI (Gemini, Vertex AI)

Why Python for AI Integration?

# Simple, readable API interactions
from openai import OpenAI

client = OpenAI(api_key="your-key")
response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Hello AI!"}]
)
print(response.choices[0].message.content)

Python advantages:
  • Rich ecosystem of AI libraries
  • Easy API integration
  • Excellent data handling capabilities
  • Strong community support

OpenAI API Integration

Basic Setup & Authentication

import openai
from openai import OpenAI
import os

# Secure API key management
client = OpenAI(
    api_key=os.getenv("OPENAI_API_KEY")
)

def generate_text(prompt: str, model: str = "gpt-3.5-turbo") -> str:
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=1000,
        temperature=0.7  # Controls randomness (0.0 = deterministic, higher = more varied)
    )
    return response.choices[0].message.content

Key parameters:
  • model: choose gpt-3.5-turbo (faster, cheaper) or gpt-4 (higher quality)
  • temperature: lower values give consistent output, higher values give more creative output
  • max_tokens: limits response length
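
A quick sketch of the same prompt at the two ends of the temperature scale; the model name and prompt are only illustrative:

# Low temperature: near-deterministic, good for facts and code
factual = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "List the SOLID principles."}],
    temperature=0.0,
    max_tokens=200
)

# High temperature: wording varies more between runs
creative = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "List the SOLID principles."}],
    temperature=1.0,
    max_tokens=200
)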

Streaming for Real-time Responses

def stream_response(prompt: str):
    """Stream AI responses word by word"""
    stream = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        stream=True
    )
    
    for chunk in stream:
        if chunk.choices[0].delta.content:
            print(chunk.choices[0].delta.content, end="")

When to use streaming:
  • Long responses (stories, explanations)
  • Interactive chat applications
  • Better perceived responsiveness
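
A small variant of the function above, assuming the same client, that prints chunks as they arrive but also collects them so the full text is still available afterwards:

def stream_and_collect(prompt: str) -> str:
    """Stream the response and return the complete text at the end"""
    stream = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        stream=True
    )
    parts = []
    for chunk in stream:
        delta = chunk.choices[0].delta.content
        if delta:
            print(delta, end="", flush=True)  # show progress as it arrives
            parts.append(delta)
    return "".join(parts)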

Advanced Features

# System messages for behavior control
messages = [
    {"role": "system", "content": "You are a helpful Python tutor"},
    {"role": "user", "content": "Explain loops"}
]

# Function calling (tools) - lets the model request structured calls to your code
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get weather for a location",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {"type": "string"}
            },
            "required": ["location"]
        }
    }
}]

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=messages,
    tools=tools
)
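
If the model decides to use the tool, the reply carries a structured call instead of plain text. A minimal handling sketch; get_weather itself is a hypothetical local function you would implement:

import json

message = response.choices[0].message
if message.tool_calls:
    call = message.tool_calls[0]
    args = json.loads(call.function.arguments)   # e.g. {"location": "Berlin"}
    weather = get_weather(**args)                # hypothetical local implementation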

Azure OpenAI Service

Enterprise-Grade AI

Why Azure OpenAI?
  • Data residency & compliance
  • Private endpoints
  • Enterprise security
  • Same models, better governance

from openai import AzureOpenAI

azure_client = AzureOpenAI(
    api_key=os.getenv("AZURE_OPENAI_API_KEY"),
    api_version="2024-02-01",
    azure_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT")
)

Key Differences from OpenAI

# OpenAI: Use model names
response = openai_client.chat.completions.create(
    model="gpt-3.5-turbo",  # Model name
    messages=[...]
)

# Azure: Use deployment names
response = azure_client.chat.completions.create(
    model="my-gpt-35-deployment",  # Your deployment name
    messages=[...]
)

Azure setup steps:
  1. Create an Azure OpenAI resource
  2. Deploy models to endpoints
  3. Use deployment names, not model names

AWS Bedrock Integration

Multi-Model Platform

Access multiple AI providers:
  • Anthropic Claude (reasoning)
  • AI21 Jurassic (creativity)
  • Cohere Command (enterprise)
  • Meta Llama (open source)

import boto3
import json

bedrock = boto3.client(
    'bedrock-runtime',
    region_name='us-east-1'
)

Model-Specific Formats

def bedrock_generate(prompt: str, model_id: str):
    # Claude (legacy text-completion) format
    if "anthropic.claude" in model_id:
        body = json.dumps({
            "prompt": f"\n\nHuman: {prompt}\n\nAssistant:",
            "max_tokens_to_sample": 1000
        })
    
    # AI21 Jurassic-2 format
    elif "ai21.j2" in model_id:
        body = json.dumps({
            "prompt": prompt,
            "maxTokens": 1000
        })
    
    else:
        raise ValueError(f"Unsupported model: {model_id}")
    
    response = bedrock.invoke_model(
        modelId=model_id,
        body=body,
        contentType="application/json"
    )
    return json.loads(response['body'].read())

Each model has different:
  • Input/output formats
  • Strengths and use cases
  • Pricing models
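
A hedged usage sketch of the helper above; the model IDs are examples, and the response field names shown (completion for older Claude text models, completions for Jurassic-2) reflect those legacy formats and may differ for newer model versions:

# Claude text-completion response exposes a "completion" field
result = bedrock_generate("Summarize zero-downtime deployments.", "anthropic.claude-v2")
print(result.get("completion", ""))

# Jurassic-2 responses nest the text under "completions"
result = bedrock_generate("Write a product tagline.", "ai21.j2-mid-v1")
print(result["completions"][0]["data"]["text"])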

Google AI Platform

Two Options: AI Studio vs Vertex AI

Google AI Studio (Gemini API):
  • Quick experimentation
  • Simple API access
  • Free tier available

Vertex AI:
  • Production workloads
  • Enterprise features
  • Advanced ML ops

Gemini API Integration

import google.generativeai as genai

genai.configure(api_key=os.getenv("GOOGLE_AI_API_KEY"))

def google_generate(prompt: str):
    model = genai.GenerativeModel("gemini-pro")
    response = model.generate_content(prompt)
    return response.text

# Multimodal capabilities
model = genai.GenerativeModel("gemini-pro-vision")
response = model.generate_content([
    "What's in this image?",
    image_data
])

Gemini advantages:
  • Large context window (up to ~1M tokens on Gemini 1.5 models)
  • Multimodal (text + images)
  • Fast inference speeds

RAG: Retrieval Augmented Generation

The Problem with Basic AI

# Basic AI - Limited knowledge
response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "What's our company policy?"}]
)
# The AI doesn't know your specific company policies!

Limitations:
  • Training data cutoff dates
  • No access to private/recent data
  • Generic responses only

RAG Solution Architecture

graph TD
    A[User Query] --> B[Document Retrieval]
    B --> C[Relevant Docs Found]
    C --> D[Enhanced Prompt]
    D --> E[AI Generation]
    E --> F[Contextual Response]

RAG = Retrieval + Augmentation + Generation

  1. Retrieve relevant documents
  2. Augment prompt with context
  3. Generate informed response

Simple RAG Implementation

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

class SimpleRAG:
    def __init__(self):
        self.documents = []
        self.vectorizer = TfidfVectorizer()
        
    def add_documents(self, docs):
        self.documents = docs
        self.vectors = self.vectorizer.fit_transform(docs)
    
    def retrieve(self, query, top_k=3):
        query_vec = self.vectorizer.transform([query])
        similarities = cosine_similarity(query_vec, self.vectors)
        top_indices = similarities.argsort()[0][-top_k:][::-1]
        return [self.documents[i] for i in top_indices]
    
    def generate_response(self, query):
        context = "\n".join(self.retrieve(query))
        prompt = f"Context: {context}\n\nQuestion: {query}\nAnswer:"
        return generate_text(prompt)
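
A quick usage sketch; the documents and the question are made up for illustration:

rag = SimpleRAG()
rag.add_documents([
    "Employees receive 25 vacation days per year.",
    "Remote work is allowed up to 3 days per week.",
    "Expense reports are due by the 5th of each month."
])
print(rag.generate_response("How many vacation days do I get?"))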

Advanced RAG with Vector Databases

import chromadb
from sentence_transformers import SentenceTransformer

class AdvancedRAG:
    def __init__(self):
        self.client = chromadb.Client()
        self.collection = self.client.create_collection("docs")
        self.encoder = SentenceTransformer('all-MiniLM-L6-v2')
    
    def add_documents(self, documents):
        embeddings = self.encoder.encode(documents)
        self.collection.add(
            embeddings=embeddings.tolist(),
            documents=documents,
            ids=[f"doc_{i}" for i in range(len(documents))]
        )
    
    def semantic_search(self, query, n_results=3):
        query_embedding = self.encoder.encode([query])
        results = self.collection.query(
            query_embeddings=query_embedding.tolist(),
            n_results=n_results
        )
        return results['documents'][0]

Vector database benefits:
  • Better semantic understanding
  • Faster similarity search
  • Metadata filtering capabilities (sketched below)
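
A hedged sketch of the metadata filtering mentioned above, assuming Chroma's default embedding function; the collection name, documents, and metadata fields are made up for illustration:

import chromadb

# Attach metadata at insert time, then filter with `where` at query time
meta_collection = chromadb.Client().create_collection("docs_with_meta")
meta_collection.add(
    documents=["Q3 revenue grew 12%", "Onboarding checklist for new hires"],
    metadatas=[{"dept": "finance"}, {"dept": "hr"}],
    ids=["doc_0", "doc_1"]
)

results = meta_collection.query(
    query_texts=["revenue growth"],   # Chroma embeds these with its default model
    n_results=1,
    where={"dept": "finance"}         # only search finance documents
)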

Caching Strategies

Why Cache AI Responses?

Problems without caching:
  • Repeated API calls for the same questions
  • High latency (network round-trips)
  • Expensive API costs
  • Poor user experience

Benefits of caching:
  • 90%+ faster repeated responses
  • Significant cost savings
  • Better reliability

In-Memory Cache

import time
from typing import Dict, Optional

class ResponseCache:
    def __init__(self, ttl_seconds=3600):
        self.cache: Dict[str, dict] = {}
        self.ttl = ttl_seconds
    
    def get(self, key: str) -> Optional[str]:
        if key in self.cache:
            entry = self.cache[key]
            if time.time() - entry['timestamp'] < self.ttl:
                return entry['response']
            del self.cache[key]  # Remove expired
        return None
    
    def set(self, key: str, response: str):
        self.cache[key] = {
            'response': response,
            'timestamp': time.time()
        }

cache = ResponseCache(ttl_seconds=1800)  # 30 minutes
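
A thin wrapper that ties the cache to the generate_text() helper from earlier; keying on model plus prompt is one reasonable choice:

def cached_generate(prompt: str, model: str = "gpt-3.5-turbo") -> str:
    key = f"{model}:{prompt}"
    cached = cache.get(key)
    if cached:
        return cached                      # cache hit: no API call
    
    response = generate_text(prompt, model)
    cache.set(key, response)               # store for the next caller
    return response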

Persistent Cache with SQLite

import sqlite3
import hashlib
from datetime import datetime

class PersistentCache:
    def __init__(self, db_path="ai_cache.db"):
        self.db_path = db_path
        self.init_db()
    
    def init_db(self):
        conn = sqlite3.connect(self.db_path)
        cursor = conn.cursor()
        cursor.execute("""
            CREATE TABLE IF NOT EXISTS responses (
                prompt_hash TEXT PRIMARY KEY,
                response TEXT,
                created_at TIMESTAMP,
                model TEXT
            )
        """)
        conn.commit()
        conn.close()
    
    def get(self, prompt: str, model: str):
        prompt_hash = hashlib.sha256(f"{model}:{prompt}".encode()).hexdigest()
        conn = sqlite3.connect(self.db_path)
        cursor = conn.cursor()
        cursor.execute(
            "SELECT response FROM responses WHERE prompt_hash = ?",
            (prompt_hash,)
        )
        result = cursor.fetchone()
        conn.close()
        return result[0] if result else None
    
    def set(self, prompt: str, response: str, model: str):
        prompt_hash = hashlib.sha256(f"{model}:{prompt}".encode()).hexdigest()
        conn = sqlite3.connect(self.db_path)
        cursor = conn.cursor()
        cursor.execute(
            "INSERT OR REPLACE INTO responses VALUES (?, ?, ?, ?)",
            (prompt_hash, response, datetime.now().isoformat(), model)
        )
        conn.commit()
        conn.close()

Production Best Practices

Error Handling & Retries

import time
import random

def robust_ai_call(prompt, max_retries=3):
    for attempt in range(max_retries):
        try:
            return generate_text(prompt)
        except Exception as e:
            if attempt == max_retries - 1:
                raise e
            
            # Exponential backoff with jitter
            delay = (2 ** attempt) + random.uniform(0, 1)
            time.sleep(delay)

Error handling strategies:
  • Exponential backoff for rate limits
  • Circuit breakers for service failures (sketched below)
  • Graceful degradation with fallbacks
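
A minimal circuit-breaker sketch, assuming a simple failure count plus cool-down period; the thresholds are illustrative:

import time

class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_seconds=60):
        self.failure_threshold = failure_threshold
        self.reset_seconds = reset_seconds
        self.failures = 0
        self.opened_at = None
    
    def call(self, func, *args, **kwargs):
        # While open, fail fast instead of hammering a broken service
        if self.opened_at and time.time() - self.opened_at < self.reset_seconds:
            raise RuntimeError("Circuit open - skipping call")
        try:
            result = func(*args, **kwargs)
            self.failures = 0           # success resets the breaker
            self.opened_at = None
            return result
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.time()   # open the circuit
            raise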

Rate Limiting

from collections import deque
import time

class RateLimiter:
    def __init__(self, max_calls=60, window_seconds=60):
        self.max_calls = max_calls
        self.window = window_seconds
        self.calls = deque()
    
    def acquire(self):
        now = time.time()
        # Remove old calls
        while self.calls and self.calls[0] <= now - self.window:
            self.calls.popleft()
        
        if len(self.calls) < self.max_calls:
            self.calls.append(now)
            return True
        return False
    
    def wait_if_needed(self):
        while not self.acquire():
            time.sleep(0.1)
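
Usage is a one-liner in front of any API call; the 60-calls-per-minute budget mirrors the defaults above:

limiter = RateLimiter(max_calls=60, window_seconds=60)

def limited_generate(prompt: str) -> str:
    limiter.wait_if_needed()     # blocks until a slot is free
    return generate_text(prompt)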

Multi-Provider Fallback

class MultiProviderAI:
    def __init__(self):
        # Each *_generate method is assumed to be a thin wrapper around
        # the corresponding provider call shown earlier in this tutorial
        self.providers = [
            ("openai", self.openai_generate),
            ("azure", self.azure_generate),
            ("bedrock", self.bedrock_generate),
        ]
    
    def generate_with_fallback(self, prompt):
        for name, provider in self.providers:
            try:
                result = provider(prompt)
                if result:
                    return result
            except Exception as e:
                print(f"{name} failed: {e}")
                continue
        
        raise Exception("All providers failed")

Real-World Applications

Document Q&A System

class DocumentQASystem:
    def __init__(self):
        self.rag = AdvancedRAG()
        self.cache = PersistentCache()
        self.rate_limiter = RateLimiter()
    
    def load_documents(self, file_path):
        with open(file_path, 'r') as f:
            content = f.read()
        
        # Split into chunks
        chunks = content.split('\n\n')
        self.rag.add_documents(chunks)
    
    def ask_question(self, question):
        # Check cache first
        cached = self.cache.get(question, "gpt-3.5-turbo")
        if cached:
            return cached
        
        # Rate limiting
        self.rate_limiter.wait_if_needed()
        
        # Retrieve context from the vector store, then generate
        context = "\n".join(self.rag.semantic_search(question))
        prompt = f"Context: {context}\n\nQuestion: {question}\nAnswer:"
        response = generate_text(prompt)
        
        # Cache result
        self.cache.set(question, response, "gpt-3.5-turbo")
        
        return response
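
Putting it together; the handbook file and the question are hypothetical:

qa = DocumentQASystem()
qa.load_documents("employee_handbook.txt")    # hypothetical plain-text file
print(qa.ask_question("What is the remote work policy?"))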

Code Assistant

class CodeAssistant:
    def __init__(self):
        self.system_prompt = """
        You are an expert Python programmer.
        Provide clear, working code examples.
        Explain complex concepts simply.
        """
    
    def get_code_help(self, question):
        messages = [
            {"role": "system", "content": self.system_prompt},
            {"role": "user", "content": question}
        ]
        
        response = client.chat.completions.create(
            model="gpt-4",
            messages=messages,
            temperature=0.3  # Lower for more consistent code
        )
        
        return response.choices[0].message.content

Summary & Next Steps

Key Takeaways

Multiple AI Providers:
  • OpenAI: Best general performance
  • Azure: Enterprise features
  • Bedrock: Multi-model access
  • Google: Multimodal capabilities

Essential Techniques:
  • RAG for domain knowledge
  • Caching for performance
  • Error handling for reliability
  • Rate limiting for stability

Production Checklist

  • Security: API keys in environment variables ✅
  • Caching: Implement persistent cache ✅
  • Monitoring: Log usage and errors ✅
  • Rate Limiting: Respect API limits ✅
  • Fallbacks: Multiple provider support ✅
  • Testing: Comprehensive error scenarios ✅

Next Steps

  1. Experiment with different models and providers
  2. Implement RAG for your specific domain
  3. Set up monitoring and alerting
  4. Optimize costs with intelligent caching
  5. Consider fine-tuning for specialized tasks

Remember: Start simple, then add complexity as needed!

Questions & Discussion

Common Questions

Q: Which AI provider should I choose?
A: It depends on your needs:
  • OpenAI: Best for general use
  • Azure: Enterprise requirements
  • Bedrock: When you want multiple models
  • Google: Multimodal applications

Q: How much does caching help?
A: Typically an 80-95% cost reduction for repeated queries.

Q: Is RAG always necessary?
A: Use RAG when you need:
  • Current information
  • Private/proprietary data
  • Domain-specific knowledge

Thank You!

Resources:
  • OpenAI API Documentation
  • Azure OpenAI Service Guide
  • AWS Bedrock Developer Guide
  • Google AI Studio Documentation

Happy AI coding! 🤖✨