type
Post
status
Published
date
Mar 10, 2026
slug
cloud-native-rag-chatbot
summary
How we turned a fragile single‑machine RAG into a session‑aware, PostgreSQL‑backed chatbot running on Cloud Run with Vertex AI doing the heavy lifting.
tags
LLM
GCP
Cloud
category
Sharing
icon
password
 
This article walks through how we took a single‑machine RAG prototype and turned it into a cloud‑hosted chatbot you can depend on: session‑aware, backed by PostgreSQL, and running on Google Cloud primitives instead of shell scripts.
 

From desktop toy to something you can rely on

The original setup was small and comfortable: a RAG chatbot wired to a personal note vault and a simple web page.
Locally, the loop looked like this:
  • A browser pointed at a lightweight frontend served from one machine.
  • Behind it, a Dockerised FastAPI backend handled vector search and an LLM agent.
  • During setup, we walked the vault once, embedded the content, and served answers from that in‑container vector index; nothing lived beyond that box and the vault itself.
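Concretely, that local loop fits in a few dozen lines. Here is a deliberately toy sketch, where a bag‑of‑words "embedder" stands in for a real embedding model and the vault is a dict; the point is that everything, index included, lives in process memory:

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy bag-of-words "embedding"; the real setup calls an embedding model.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# One-time ingestion: walk the vault, embed every note, keep it all in memory.
vault = {
    "german.md": "german course schedule and costs",
    "gcp.md": "cloud run deployment notes",
}
index = [(name, embed(body)) for name, body in vault.items()]

def retrieve(question: str, k: int = 1) -> list[str]:
    q = embed(question)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [name for name, _ in ranked[:k]]
```

Nothing here survives a process restart, which is the whole problem.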
For experiments, this was fine. For actual day‑to‑day usage, it broke in predictable ways:
  • It only existed when that one machine was awake and online.
  • Conversation history and agent state lived purely in memory.
  • There was no clean boundary between “chatbot logic” and “telemetry and plumbing.”
Before touching any cloud service, we had to pin down what “production‑grade” meant for a personal RAG chatbot. Here we took it to mean three things: it survives restarts and deployments, it persists conversations and knowledge in a real database instead of process memory, and it has enough security and observability that failures show up somewhere other than a local terminal.
Everything else in this article is downstream of that definition.
 

The cloud‑native RAG chatbot at 10,000 feet

Once we stopped treating it as “a script that happens to talk to an LLM” and started treating it as a service, the architecture became much easier to reason about.
At a high level, the cloud‑hosted chatbot now looks like this:
Architecture diagram: a FastAPI app on Cloud Run, backed by Cloud SQL (PostgreSQL with pgvector) and Vertex AI.
The core pieces are:
  • FastAPI app on Cloud Run. The chatbot is now an HTTP service with a /webhook endpoint instead of an ad‑hoc process on a laptop.
  • Cloud SQL PostgreSQL with pgvector. One managed database stores note chunks, embeddings, conversation history, and LangGraph checkpoints.
  • Vertex AI. Gemini models handle both text generation and embeddings, so we do not manage models directly.
 
Compared to the local version:
  • The application is stateless. Agent state and sessions are externalized to PostgreSQL instead of living in memory.
  • Persistence is explicit. Notes, sessions, and checkpoints sit in tables that Cloud SQL backs up and replicates.
  • The LLM stack is managed. We call Vertex AI endpoints instead of running or wiring third‑party models ourselves.
The goal is the same as the local bot—“let me chat with my notes”—but the cloud version behaves differently enough that it is effectively a new chatbot that reuses the same knowledge base.
 

What changed in the application layer

Most of the work was not in picking services, but in reshaping the application so it could live comfortably in a serverless environment.
 

A real HTTP boundary instead of a private process

The first step was extracting the bot into a web application.
Incoming requests now hit a FastAPI route instead of a bespoke event loop. The handler offloads longer‑running work to background tasks so Cloud Run is free to scale. FastAPI and Pydantic give us a typed API surface and generated docs, which makes it easier to debug formats and add health checks than when everything lived inside one long‑running script.
 

LangGraph for stateful conversations

The local prototype handled state implicitly: a Python process and a few in‑memory variables. That falls apart as soon as requests can land on different instances.
We introduced LangGraph as the backbone of the agent. A typed AgentState now holds message history and tool outputs; the flow (summarisation, retrieval, LLM call, tools) is declared as a graph of nodes and conditional edges; and a PostgreSQL‑backed checkpointer means each conversation thread can be resumed even if Cloud Run cycles instances. In practice, the chatbot can remember previous turns, generate and store titles for history views, and safely run tools like “create a new note” without losing context mid‑request.
 

Session‑aware retrieval instead of stateless Q&A

Once state lives in PostgreSQL, retrieval stops being “embedding of the last message only.”
Each user is associated with a current thread in a session_owners table, and that thread ID drives which LangGraph checkpoints and messages we load. The chatbot can choose retrieval strategies that take the whole session into account instead of treating every question as independent, so it behaves more like an assistant that can pick up where you left off rather than a one‑shot Q&A endpoint.
 

How Google Cloud glues the pieces together

With the application shaped for serverless, the rest of the architecture is about picking the right primitives and letting them do their jobs.
The FastAPI app runs on Cloud Run, which fits a chat workload well: it scales down to zero when nobody is talking to the bot, spins up new instances as webhook traffic increases, and treats the whole system as a container image with Python, LangGraph, and database clients baked in. Each instance connects to Cloud SQL over the Cloud SQL connector using a dedicated service account, so from the app’s point of view it is “just PostgreSQL,” and from the operator’s point of view there is no public database endpoint to manage.
Behind that, the chatbot talks to a single Cloud SQL PostgreSQL instance that holds note chunks, embeddings (via the pgvector extension), session metadata, and LangGraph checkpoints. Choosing one managed relational database for both operational data and retrieval has real trade‑offs—which is why it gets its own article next—but at this stage the important part is that persistence and retrieval both sit behind one strongly consistent endpoint.
For the model layer, the chatbot calls Vertex AI: Gemini handles text generation using prompts that include retrieved notes and session context, and a Vertex AI embedding model turns text into vectors that land in pgvector columns. That keeps the LLM stack deliberately boring: no separate model‑hosting layer, one set of credentials, and a single place in GCP to investigate when things go wrong. A small scheduled ingestion job runs on Cloud Build, embedding and ingesting only new notes on a daily cadence so the knowledge base stays fresh without continuously reprocessing the entire vault.
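The article only states that the job embeds and ingests new notes on a daily cadence; one common way to detect "new" is content hashing, sketched here (the helper names and the idea of storing a hash alongside each note's embeddings are assumptions):

```python
import hashlib

def content_hash(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def select_new_notes(vault: dict[str, str], ingested: dict[str, str]) -> list[str]:
    """Return note names whose content differs from what was last ingested.

    `ingested` maps note name -> hash stored alongside the embeddings, so
    the daily job re-embeds only notes that are new or changed instead of
    reprocessing the entire vault.
    """
    return [
        name
        for name, body in vault.items()
        if ingested.get(name) != content_hash(body)
    ]
```

The selected notes are the only ones that hit the Vertex AI embedding endpoint, which keeps the daily job cheap and fast.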
 

What this looks like from the chat window

To make this a bit less abstract, here is what a real session with the bot looks like when it reaches back into the note vault and then writes something new to Git.
In this example, I wanted to revisit my German‑learning plan. The underlying notes already lived in the vault as scattered entries about courses, time commitments, and costs.
Pic 1. Telegram conversation with retrieval and note creation.
Behind that last line, the agent has already retrieved the relevant chunks from PostgreSQL, generated a cleaned‑up summary, and called a “write note” tool that renders Markdown and pushes it to the Git repo backing the Obsidian vault. A few seconds later, GitHub shows a commit adding German Learning Considerations.md with exactly the checklist the bot just walked through in chat.
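The rendering half of such a tool can be a pure function. This is one plausible shape, not the actual tool: the function name and checklist format are illustrative, and the Git commit/push step is out of scope here:

```python
def render_note(title: str, checklist: list[str]) -> str:
    # Render the agent's structured output as an Obsidian-friendly Markdown
    # checklist; committing the file to the vault's Git repo happens elsewhere.
    lines = [f"# {title}", ""]
    lines += [f"- [ ] {item}" for item in checklist]
    return "\n".join(lines) + "\n"
```

Keeping rendering pure makes the tool easy to test in isolation, with the Git side effects pushed to a thin wrapper around it.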
 
Pic 2. Bot-generated commit in Obsidian Git Repo (minor bug on the logging timestamp haha)
 
This is the main loop the system is built for: use RAG to ground the conversation in existing notes, then let the agent turn the result back into first‑class knowledge by writing structured Markdown straight into version control.
 

A quick look ahead at PostgreSQL details

On top of the local prototype, the cloud‑hosted RAG chatbot adds session‑aware retrieval: instead of treating every message as a fresh request, it can remember and resume conversations tied to a specific thread, with PostgreSQL backing both the thread metadata and the chunks we retrieve against. It also uses a hybrid search strategy to keep answers timely: it first runs vector similarity search over the embeddings, then reranks those candidates by recency, so that among similarly relevant notes the newer ones win.
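One way such a recency rerank can work is to blend an exponential decay into the similarity score; the weights, half‑life, and field names below are illustrative knobs, not the article's exact formula:

```python
from datetime import datetime, timedelta

def rerank_by_recency(
    candidates: list[dict],
    now: datetime,
    half_life_days: float = 30.0,
) -> list[dict]:
    """Hypothetical hybrid rerank: each candidate carries `similarity`
    (e.g. cosine similarity from the vector search) and `updated_at`;
    a recency decay lets newer notes win among similarly relevant ones."""
    def score(c: dict) -> float:
        age_days = (now - c["updated_at"]).total_seconds() / 86400
        recency = 0.5 ** (age_days / half_life_days)  # halves every half-life
        return 0.7 * c["similarity"] + 0.3 * recency

    return sorted(candidates, key=score, reverse=True)
```

With this shape, a slightly less similar note from last week outranks a slightly more similar note from last year, which is usually what you want from a personal knowledge base.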
Designing that schema and retrieval logic—and working out how far you can push “one PostgreSQL for everything” before it hurts—is its own story. The next article in this series digs into that decision: how the tables are structured, how hybrid search works in practice, and when you should reach for a dedicated vector database instead.