The "Hello World" of AI is a Parlor Trick. So I Built a Real One.
A 0-to-1 deep dive into 'whatsgood', my live, stateful AI news analyst.
View Live Website

The Premise: The Parlor Trick Problem
Picture this: You've built your first RAG (Retrieval-Augmented Generation) app. It takes a user's query—say, "AI and healthcare"—and finds 10 "similar" articles from a vector database. You pat yourself on the back. It’s the "Hello World" of modern AI.
The problem? It’s just a parlor trick. And let's be honest, it's a boring one.
It's a "dumb" search engine disguised as "AI." It has no memory; it will happily recommend the exact same article you disliked 10 seconds ago. And it has no concept of time; it will proudly serve you a "breaking news" story from six months ago with a 0.99 similarity score.
To a user, this isn't just stale; it's wrong. And as the builder, it just felt lazy.
This realization was the starting point for whatsgood: a 0-to-1 build of a stateful, real-time, AI-powered news analyst. I wanted to build something with a memory, something that learns from every swipe, and something that reasons like an analyst, not a search index.
Me being me, I thought, "How hard could it be to build a real one?" (Spoiler: It was hard.)
So, I'm pulling back the curtain. This post is the story of that journey—a deep dive into the system architecture, the product decisions, and the "war stories" from the trenches of building a production-ready AI system from scratch.
The Core Architectural Choice: Monolith Trap vs. Microservices
Let's talk about the 0-to-1 trap. That first instinct to build one giant main.py file. It's fast, it's simple, and it's a trap I was determined to avoid.
What happens when your (hourly) data scraper crashes? It takes down your (24/7) recommendation API. What happens when you need to update a single line of text in the user-service? You have to re-deploy the entire 2GB AI model. It's inefficient, fragile, and unscalable.
So, we tore the monolith apart before it was even built and designed a distributed, microservice architecture from day one.
A high-level system architecture diagram. User (Netlify) -> AWS API Gateway. The Gateway routes to: user-service (Lambda), interaction-service (Lambda), and recommendation-service (ALB -> ECS on Fargate). A separate EventBridge trigger points to the ingestion-service (Lambda), which writes to Pinecone & Supabase. The ECS service reads from this.
This design immediately slammed me into a new, more interesting problem: the "Always On" vs. "On-Demand" Cost Dilemma.
- Our recommendation-service (the AI "brain") needs to be "always on." It has to keep models (Bi-Encoder, Cross-Encoder) loaded in memory, ready to serve a recommendation in < 1 second.
- Our ingestion-service (the data scraper) needs to be "off" 99% of the time. It runs for 5 minutes every 2 days.
- Our user-service (for login/persona) is light and stateless.
Running all of these on the same platform is either slow (if using Lambda for the AI, with its 10-second INIT timeout) or wildly expensive (if using an "always on" ECS container for the scraper).
The fix was simple: we didn't choose one platform; we chose the right platform for each job.
- The AI Core (Stateful, Heavy): Deployed to AWS ECS on Fargate. It's an "always on" container, models in memory, sitting behind an Application Load Balancer for high availability.
- The Other Services (Stateless, Light): The user-service, interaction-service, and ingestion-service were all deployed as AWS Lambda functions. They scale to zero, cost nothing when idle, and are perfect for event-driven tasks.
This is core MLOps: picking the right layer of abstraction for each workload to get the best performance at the lowest cost.
Building an AI With a Memory (And an Opinion)
This is the fun part. This is where we actually fight the "dumb search" problem. The recommendation-service on ECS isn't just a vector search. It's a five-stage pipeline that runs in real-time, on demand.
1. The System Had Amnesia
Our first system had the memory of a goldfish. A user could "like" an amazing article, refresh, and get the exact same recommendations. It was stateless, which is a nice word for 'dumb'.
The Fix: The "Living" Persona.
We killed the idea of a "static" persona. The user's persona is now a living vector, created on the fly. When a user requests recommendations, we call get_dynamic_query_vector.
This function fetches the user's base persona, fetches their 10 most recent likes/dislikes, pulls all the corresponding vectors, and then performs this vector arithmetic:
# From: services/recommendation-service/main.py
async def get_dynamic_query_vector(user_id: str) -> np.ndarray:
    # ... (Code to fetch base_vector and interaction data) ...

    # 4. Vector Arithmetic
    dynamic_vector = base_vector
    weight = 0.2  # How much each interaction "pulls" the vector

    for item in interaction_response.data:
        article_id = item['article_id']
        interaction_type = item['interaction_type']
        if article_id in fetched_vectors:
            article_vector = np.array(fetched_vectors[article_id])
            if interaction_type == 'like':
                # Pull the vector TOWARDS the liked item
                dynamic_vector = dynamic_vector + weight * (article_vector - dynamic_vector)
            elif interaction_type == 'dislike':
                # Push the vector AWAY from the disliked item
                dynamic_vector = dynamic_vector - weight * (article_vector - dynamic_vector)

    # Return the new, normalized, "living" vector
    return dynamic_vector / np.linalg.norm(dynamic_vector)
The system now reacts in real-time. Liking an article about "MLOps" will immediately surface more articles on deployment and CI/CD.
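To see the arithmetic in isolation, here is a toy sketch (not the production code — the 2-D vectors and the `nudge` helper are made up for illustration; the real persona vectors live in a much higher-dimensional embedding space):

```python
import numpy as np

def nudge(persona: np.ndarray, article: np.ndarray,
          weight: float = 0.2, like: bool = True) -> np.ndarray:
    """Move the persona vector toward (like) or away from (dislike) an article."""
    step = weight * (article - persona)
    persona = persona + step if like else persona - step
    # Re-normalize so the persona stays a unit vector
    return persona / np.linalg.norm(persona)

persona = np.array([1.0, 0.0])   # toy persona embedding
article = np.array([0.0, 1.0])   # toy article embedding, orthogonal to the persona

after_like = nudge(persona, article, like=True)
# after_like now has a positive component along the article's direction,
# so similar articles will score higher on the next retrieval.
```

The same step with `like=False` pushes the persona away, which is why a dislike immediately suppresses similar articles.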
2. Your AI is Recommending Last Week's "Breaking" News
A vector search will happily return a 6-month-old article with a 0.99 score. It's similar, but it's not relevant. And for a news app, 'not relevant' means 'useless'.
The Fix: Multi-Stage Re-Ranking.
Simple retrieval isn't enough. We retrieve 50+ candidates from Pinecone, then run them through a second, more computationally expensive re-ranking step. The final_score is a weighted, tunable blend of three different scores.
This is the exact code from the get_recommendations endpoint:
# From: services/recommendation-service/main.py
# ... (Inside Stage 4: Re-rank) ...
ranked_list = []
for i, article in enumerate(articles_response.data):
    article_id = article['id']

    # 1. Pinecone's "fast" vector score
    p_score = pinecone_scores.get(article_id, 0.0)

    # 2. The "slow & accurate" Cross-Encoder score
    c_score = cross_scores[i]
    c_score_normalized = 1 / (1 + np.exp(-c_score))  # Normalize to 0-1

    # 3. The freshness score (0.0 to 1.0)
    f_score = get_freshness_score(article['published_date'])

    # The final, weighted formula
    final_score = (
        (0.60 * c_score_normalized) +  # 60% weight to accuracy
        (0.30 * p_score) +             # 30% weight to vector similarity
        (0.10 * f_score)               # 10% weight to freshness
    )
    article['final_score'] = final_score
    ranked_list.append(article)

# Sort by the new score
ranked_list.sort(key=lambda x: x['final_score'], reverse=True)
final_top_5_articles = ranked_list[:5]
This lets us abstract "relevance" into a simple, tunable equation. Suddenly, we're not just searching; we're curating.
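The `get_freshness_score` function isn't shown in this post, but one plausible shape for it is an exponential decay over article age (a sketch only — the 48-hour half-life and the ISO-8601 timestamp assumption are mine, not from the repo):

```python
import math
from datetime import datetime, timezone

def get_freshness_score(published_date: str, half_life_hours: float = 48.0) -> float:
    """Exponential decay: 1.0 at publish time, 0.5 after one half-life.

    Assumes ISO-8601 timestamps; the 48-hour half-life is illustrative.
    """
    published = datetime.fromisoformat(published_date.replace("Z", "+00:00"))
    age_hours = (datetime.now(timezone.utc) - published).total_seconds() / 3600.0
    return math.exp(-math.log(2) * max(age_hours, 0.0) / half_life_hours)
```

A decay curve like this is easy to tune: shrink the half-life for a "breaking news" feel, stretch it for evergreen content, and the 10% freshness weight in the final formula does the rest.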
3. From "Dumb List" to "Smart Conversation"
Even with all this, at the end of the day, we were still just showing a list of 5 articles. We're right back to the parlor trick!
The Fix: The "Crisp Curator" Prompt.
This was the final piece. We retrieve the Top 5 re-ranked articles, but then we feed the full text of all five to Groq (Llama 3.1 8B).
We don't ask it to "summarize." We ask it to act as an analyst. This is the exact prompt:
# From: services/recommendation-service/main.py
system_prompt = "You are an expert news editor and analyst. Your job is to read a full article and write a single, sharp, insightful summary (the 'hook') that gets straight to the core of the story. You write directly to a smart reader, without fluff or robotic phrases."
human_prompt = f"""
My user's persona is: "{user_persona_text}"
Based *only* on the user's persona and the 5 articles provided below, do the following:
1. Read the **Full Text Snippet** for each article to understand its core argument, not just its topic.
2. Write a single, concise "summary" sentence (15-30 words). This sentence must be a powerful hook that captures the **most important takeaway, key statistic, or surprising insight** from the text.
3. **DO NOT** just describe what the article is "about" (e.g., "This article is about...").
4. **DO NOT** mention the user or their persona in the summary. Your tone should be smart, neutral, and authoritative.
5. Write a "reason" sentence (1-2 sentences). This is your internal note explaining the *specific link* between the article's core takeaway and the user's persona.
6. Return your response as a single, valid JSON array.
7. The JSON array must contain 5 objects. Each object must have *only* these keys: "id", "title", "summary", "reason".
8. Use the exact "id" and "title" provided for each article in the context.
**Examples of desired crisp summary style:**
* *Article about Google/Wiz:* "Google is reportedly in talks to acquire cybersecurity startup Wiz for $23 billion, marking its largest acquisition ever."
* *Article about crypto volume:* "Global crypto trading volume is projected to hit $108 trillion in 2024, a 90% increase from 2022 levels, with Europe leading the transactions."
**Here is the context for the 5 articles you need to process:**{context_block}
"""
And that was the final piece. We abstracted the "dumb list" into a "smart conversation." The LLM doesn't just summarize; it reasons and explains why each article matters, generating the summary and reason fields for the UI.
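Demanding "a single, valid JSON array" in the prompt is necessary but not sufficient — models occasionally wrap JSON in markdown fences or drop a key. A defensive parse step catches that before anything reaches the UI (a hypothetical helper of my own, `parse_curator_response`, not code from the repo):

```python
import json

REQUIRED_KEYS = {"id", "title", "summary", "reason"}

def parse_curator_response(raw: str) -> list[dict]:
    """Parse and validate the LLM's JSON array, tolerating a markdown fence."""
    text = raw.strip()
    if text.startswith("```"):
        # Strip the ``` fence (and optional "json" label) some models add.
        text = text.strip("`")
        if text.startswith("json"):
            text = text[4:]
    items = json.loads(text)
    if not isinstance(items, list) or len(items) != 5:
        raise ValueError("Expected a JSON array of exactly 5 objects")
    for item in items:
        if set(item) != REQUIRED_KEYS:
            raise ValueError(f"Bad keys in item: {sorted(item)}")
    return items
```

On a `ValueError`, the service can retry the LLM call or fall back to the raw re-ranked list, so a malformed response never breaks the feed.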
The MLOps War Stories (aka The Hard Problems)
A clean architecture diagram is a beautiful lie. It hides the brutal, hair-pulling, "why-is-this-not-working-at-2-AM" moments. The real, senior-level engineering is in the "backbone"—the MLOps foundation that makes this whole thing run.
Here are the war stories.
War Story #1: The 10-Second "Init" Timeout Wall
The Challenge: Our ingestion-service Lambda kept failing. And failing. And failing. I'm digging into CloudWatch logs and I see the dreaded message:
INIT_REPORT Init Duration: 9999.49 ms Status: timeout
The Reason: AWS Lambda has a 10-second hard limit to "thaw" a container on a cold start. Our Docker image, with the sentence-transformers model baked in, was over 1GB. Trying to "unzip" and load a 1GB container in 10 seconds is physically impossible.
My Solution: The Lazy-Loading Handler. Instead of loading the model when the container boots, we load it inside the handler function on the first invocation. We also moved the model out of the Docker image and onto an EFS mount that model_path resolves to. This way, we trade a 10-second hard failure for a ~30-second, one-time-only cold start. A trade I will make any day of the week.
# From: services/ingestion-service/handler.py
import os
from supabase import create_client, Client
from pinecone import Pinecone
from sentence_transformers import SentenceTransformer
# ... other imports

# --- MODIFICATION ---
# Clients are initialized to None. The 'init' phase is now < 1ms.
supabase: Client = None
pinecone_index = None
bi_encoder = None
model_path = os.path.join(os.path.dirname(__file__), 'model')
# --- END MODIFICATION ---

def ingest(event, context):
    """
    This is the main function that AWS Lambda will call.
    """
    # --- MODIFICATION ---
    # Use 'global' to access and initialize our clients ONCE.
    global supabase, pinecone_index, bi_encoder

    # This 'if' block only runs on a COLD start
    if supabase is None:
        print("Handler: Cold start. Initializing Supabase client...")
        url: str = os.environ.get("SUPABASE_URL")
        key: str = os.environ.get("SUPABASE_ANON_KEY")
        supabase = create_client(url, key)

    if pinecone_index is None:
        print("Handler: Cold start. Initializing Pinecone connection...")
        pc = Pinecone(api_key=os.environ.get("PINECONE_API_KEY"))
        pinecone_index = pc.Index("whats-good-v2")

    if bi_encoder is None:
        print("Handler: Cold start. Loading bi-encoder model...")
        # This is the slow step (~30 seconds)
        bi_encoder = SentenceTransformer(model_path)
        print("Handler: Bi-encoder model loaded.")
    else:
        print("Handler: Warm start. All clients and model already loaded.")
    # --- END MODIFICATION ---

    print("Starting ingestion cycle...")
    # ... (rest of the scraping logic) ...
War Story #2: The "It Works On My (M3) Mac" Nightmare
The Challenge: This one was a killer. I mean, this one really hurt. Everything worked on my Mac M3, where Docker builds arm64 images natively. The GitHub Actions runner (an x86_64 machine) built the image just fine, and the deploy pipeline was green. But the arm64 Lambda function would crash instantly. Just... poof. Gone.
The Reason: On my Mac, everything was native arm64. The x86 GitHub runner, left to its defaults, wasn't producing a true arm64 image for the Lambda's target architecture. The resulting image was subtly broken, and the pipeline never flagged it.
My Solution: We had to stop building for whatever machine happened to run the build and start cross-compiling for the target machine. We used QEMU (an emulator) and Docker Buildx inside the CI/CD pipeline to force the x86 runner to build a true arm64 image.
# From: .github/workflows/deploy-ingestion-service.yml
name: Deploy Ingestion Service to Lambda

on:
  push:
    branches: [ main ]
    paths:
      - 'services/ingestion-service/**' # Only deploy this service

env:
  AWS_REGION: us-east-1
  ECR_REPOSITORY: whatsgood/ingestion-service
  LAMBDA_FUNCTION_NAME: whatsgood-ingestion-service

permissions:
  id-token: write # Required for OIDC
  contents: read

jobs:
  deploy:
    name: Build, Push, and Deploy
    runs-on: ubuntu-latest
    steps:
      - name: Checkout code
        uses: actions/checkout@v4

      # --- THE FIX STARTS HERE ---
      - name: Set up QEMU
        uses: docker/setup-qemu-action@v3

      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v3
      # --- END QEMU/BUILDX SETUP ---

      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::[YOUR_ACCOUNT_ID]:role/github-actions-whatsgood-deploy-role
          aws-region: ${{ env.AWS_REGION }}

      - name: Login to Amazon ECR
        id: login-ecr
        uses: aws-actions/amazon-ecr-login@v2

      - name: Build and push Docker image
        uses: docker/build-push-action@v5
        with:
          context: ./services/ingestion-service
          push: true
          tags: ${{ steps.login-ecr.outputs.registry }}/${{ env.ECR_REPOSITORY }}:${{ github.sha }}
          # --- THIS IS THE MAGIC LINE ---
          platforms: linux/arm64 # Force the build for ARM64
          provenance: false

      - name: Update Lambda function
        run: |
          aws lambda update-function-code \
            --function-name ${{ env.LAMBDA_FUNCTION_NAME }} \
            --image-uri ${{ steps.login-ecr.outputs.registry }}/${{ env.ECR_REPOSITORY }}:${{ github.sha }}
War Story #3: The API's Identity Crisis
The Challenge: My first instinct was to use FastAPI for all my Python services, even in Lambda, with the mangum adapter translating AWS events into HTTP requests. But for the simple services, it was just dead weight. It added 100 lines of boilerplate, new dependencies, and a new failure point for literally no reason. Why run a whole web server just to parse some JSON and call a database?
My Solution: The Ultimate Abstraction is Removing the Abstraction. I threw out FastAPI and mangum for the simple services and rewrote them as 100% pure, 30-line Python handlers. The interaction-service is the perfect example. It has zero dependencies outside the standard library and the supabase client.
This is the entire file. It's clean, fast, cheap, and hard to break.
# From: services/interaction-service/main.py
import os
import json
from supabase import create_client, Client

# --- Clients are loaded ONCE during the cold start ---
url: str = os.environ.get("SUPABASE_URL")
key: str = os.environ.get("SUPABASE_ANON_KEY")
supabase: Client = create_client(url, key)

def handler(event, context):
    """
    This is the handler function that API Gateway will call.
    """
    try:
        # 1. Get the request data from the 'body'
        body = json.loads(event.get('body', '{}'))

        # 2. The original Supabase logic
        response = supabase.table('user_interactions').insert(body).execute()
        if not response.data:
            raise Exception("Failed to insert interaction.")

        # 3. Return a dictionary in the shape API Gateway understands
        return {
            'statusCode': 200,
            'headers': {
                'Access-Control-Allow-Origin': '*',  # Required for CORS
                'Access-Control-Allow-Headers': 'Content-Type',
                'Access-Control-Allow-Methods': 'POST,OPTIONS'
            },
            'body': json.dumps({
                "status": "ok",
                "service": "interaction-service",
                "logged_interaction": response.data[0]
            })
        }
    except Exception as e:
        return {
            'statusCode': 500,
            'headers': {'Access-Control-Allow-Origin': '*'},  # CORS
            'body': json.dumps({"message": str(e)})
        }
This, to me, is mature engineering. It's boring, and that's what makes it brilliant. It just works.
Final Reflection
So, yeah. The frontend "Tinder for News" UI? That was the easy part. Building whatsgood was an exercise in end-to-end systems design. The real work was building the... well, the system.
I solved the 10-second INIT timeout. I solved the M3 Mac cross-compilation nightmare. I solved the API's identity crisis by choosing simplicity over dogma. And I built a fully keyless, secure CI/CD pipeline using OIDC, with no long-lived AWS credentials stored anywhere in the repo.
The final system isn't just a "project." It's a scalable, secure, and fully automated platform. It's an AI analyst that learns from every interaction, and most importantly, it's a foundation I can build on.
That's the difference between a parlor trick and a product. And I know which one I'd rather build.