AI Research RAG Assistant: Multi-Turn Q&A & Hybrid Retrieval | Nie Er

This project is an internal enterprise research assistant, not a public SaaS product. It lets analysts and advisory teams ask natural-language questions, retrieves evidence from internal research reports, financial news, and domain metadata, then produces answers that can be traced back to source material.

I am describing it here as an engineering project rather than a client case study. The interesting part is the delivery path: turning messy research content, financial tagging, conversational follow-ups, and citation requirements into a maintainable RAG system.

Problem

Investment research questions are rarely clean search queries. A user may ask, “How does it look recently?”, then follow up with “What if we use a Fed perspective?” or “Any opposing views?” A plain vector search over the latest message can easily lose the subject, miss the intended time window, or combine unrelated snippets into a convincing but weak answer.

Trust is the harder constraint. In this domain, the model should not fill gaps from memory or state unsupported facts with confidence. The answer must be grounded in retrieved material, and the user must be able to inspect the evidence behind it. When the evidence is insufficient, the system should say so.

Stack

Backend: Python services for orchestration, retrieval, context handling, and answer generation
Frontend: TypeScript interface for chat, streaming output, and evidence inspection
Retrieval: BM25, vector similarity, tag filters, and hybrid ranking
Generation: a two-step LLM flow for query understanding and answer generation
Transport: SSE streaming for lower perceived latency
Data: anonymized internal reports, news, metadata, tags, and citation snippets

Architecture

The system follows a four-stage flow: understand the question, retrieve evidence, generate the answer, and expose the sources.

The first stage rewrites the user message with recent conversation context. It resolves references, extracts entities and keywords, and turns vague time language into retrieval parameters. “Recent” is not treated as a fixed number of days; the interpretation depends on the question type and surrounding context.
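As a rough illustration of how vague time language might become retrieval parameters, here is a minimal sketch. The `RetrievalPlan` fields, the question-type names, and the lookback values are all hypothetical; the write-up does not state the actual mapping. The LLM step that produces the standalone question and question type is assumed to have already run.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta
from typing import Optional

# Hypothetical lookback windows per question type (assumed values, not from the project).
LOOKBACK_DAYS = {"market_move": 7, "policy": 90, "company_fundamentals": 180}
DEFAULT_LOOKBACK_DAYS = 30


@dataclass
class RetrievalPlan:
    """Structured output of the query-understanding stage (illustrative shape)."""
    standalone_question: str
    entities: list[str] = field(default_factory=list)
    keywords: list[str] = field(default_factory=list)
    date_from: Optional[date] = None
    date_to: Optional[date] = None


def resolve_recent(question_type: str, today: date) -> tuple[date, date]:
    """Map 'recent' to a date window that depends on the question type."""
    days = LOOKBACK_DAYS.get(question_type, DEFAULT_LOOKBACK_DAYS)
    return today - timedelta(days=days), today


def build_plan(standalone_question: str, question_type: str,
               entities: list[str], keywords: list[str],
               mentions_recent: bool, today: date) -> RetrievalPlan:
    """Assemble the retrieval plan; only set a window when the user implied one."""
    plan = RetrievalPlan(standalone_question, entities, keywords)
    if mentions_recent:
        plan.date_from, plan.date_to = resolve_recent(question_type, today)
    return plan
```

With these assumed values, a market-move question asked on a given day would search the previous seven days, while a policy question would look back ninety.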

The second stage runs the retrievers in parallel and adjusts the ranking strategy to the question. Questions with clear entities or domain tags favor exact filters first. Broader questions combine BM25 and vector recall. For time-sensitive questions, more recent material is weighted higher. This works better than relying only on dense retrieval because financial research already has a useful tagging system, and users often need to know why a result was retrieved.
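One common way to combine a BM25 ranking with a vector ranking is reciprocal rank fusion (RRF), optionally scaled by a recency decay. The sketch below uses that technique as a stand-in; the write-up does not say which fusion method the project uses, and the function names, the constant `k=60`, and the 30-day half-life are assumptions for illustration.

```python
from datetime import date


def rrf_fuse(rankings: list[list[str]], k: int = 60) -> dict[str, float]:
    """Reciprocal rank fusion: sum 1 / (k + rank) over every ranking a doc appears in."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return scores


def recency_weight(published: date, today: date, half_life_days: float) -> float:
    """Exponential decay: a document loses half its weight every half_life_days."""
    age_days = max((today - published).days, 0)
    return 0.5 ** (age_days / half_life_days)


def hybrid_rank(bm25_ids: list[str], vector_ids: list[str],
                published: dict[str, date], today: date,
                time_sensitive: bool, half_life_days: float = 30.0) -> list[str]:
    """Fuse both rankings; apply the recency decay only for time-sensitive questions."""
    fused = rrf_fuse([bm25_ids, vector_ids])
    if time_sensitive:
        fused = {
            doc_id: score * recency_weight(published[doc_id], today, half_life_days)
            for doc_id, score in fused.items()
        }
    # Sort by descending score; break ties by id so the output is deterministic.
    return sorted(fused, key=lambda doc_id: (-fused[doc_id], doc_id))
```

Rank-based fusion sidesteps the problem that BM25 scores and cosine similarities live on different scales, which is one reason it is a popular default for hybrid search.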

The third stage generates the answer from the retrieved snippets, the rewritten question, and limited conversation context. In the fourth stage, the frontend streams the response over SSE and keeps the supporting evidence available for review.
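On the backend side, streaming over SSE comes down to emitting text frames in the `event:` / `data:` format and ending each frame with a blank line. The sketch below shows that framing only; the event names (`token`, `citations`, `done`) and payload shapes are hypothetical, and the web framework that would send these frames with a `text/event-stream` content type is left out.

```python
import json
from typing import Iterable, Iterator


def sse_event(event: str, payload: dict) -> str:
    """Format one SSE frame. JSON encoding keeps the data on a single line."""
    return f"event: {event}\ndata: {json.dumps(payload, ensure_ascii=False)}\n\n"


def stream_answer(tokens: Iterable[str], citations: list[dict]) -> Iterator[str]:
    """Yield answer tokens as they arrive, then the citations, then an end marker."""
    for token in tokens:
        yield sse_event("token", {"text": token})
    yield sse_event("citations", {"items": citations})
    yield sse_event("done", {})
```

Sending citations as a separate event after the tokens lets the interface render the text immediately and attach the evidence panel once the answer is complete.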

Streaming answer in the research assistant

My Role

I worked on the engineering flow that moved the assistant from a demo into something usable in a real business environment:

Designed the query rewriting, entity extraction, time extraction, and retrieval-parameter generation flow
Combined BM25, vector search, tag filters, and time weighting into a hybrid retrieval strategy
Constrained answer generation to retrieved evidence, with graceful fallback when the material was insufficient
Designed the multi-turn context handling used for reference resolution and topic continuity
Connected streaming frontend output with citation and evidence display
Debugged representative failure cases, including wrong subject resolution, weak time-window handling, missing sources, and mixed viewpoints

Challenges

The most underestimated part was query understanding. Research users do not write search syntax; they continue a conversation. The system first turns the latest message into a standalone research question and retrieval plan, instead of sending the raw message directly to the retriever.

Retrieval also needed domain judgment. Financial content often has mature labels and metadata. Pure vector retrieval can return text that is semantically close but wrong in business context. The project uses tags and structured filters first, then adds keyword and vector recall where they help coverage.
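The "filters first, then broaden" idea can be sketched as a small candidate-selection step. The `Doc` shape, the `min_hits` threshold, and the strategy labels are assumptions made for illustration; in particular, falling back to broad recall when too few tagged documents match is one plausible reading of the approach, not a confirmed detail.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Doc:
    doc_id: str
    tags: frozenset[str]
    text: str


def select_candidates(docs: list[Doc], required_tags: set[str],
                      min_hits: int = 3) -> tuple[list[Doc], str]:
    """Prefer documents carrying every required tag; broaden if too few match.

    Returns the candidates plus a label naming the strategy, so the caller
    can show the user why a result was retrieved.
    """
    if required_tags:
        tagged = [d for d in docs if required_tags <= d.tags]
        if len(tagged) >= min_hits:
            return tagged, "tag_filter"
    # No tags extracted, or the filter was too narrow: hand the full set
    # to keyword and vector recall instead.
    return list(docs), "broad_recall"
```

Returning the strategy label alongside the candidates keeps the retrieval decision inspectable, which matters when users ask why a particular report surfaced.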

The final challenge was answer discipline. Strong models are good at producing fluent answers even when the evidence is thin. The system handles this with prompt constraints, limited evidence input, and visible citations so that “can we answer this?” becomes part of the application behavior, not just a hope placed on the model.
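One way to make "can we answer this?" an application decision is to gate generation on the retrieved evidence before the model is called. The following is a minimal sketch under stated assumptions: relevance scores are taken to be normalized to the 0 to 1 range, the thresholds and prompt wording are hypothetical, and `generate` is a stand-in for whatever LLM call the backend makes.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Snippet:
    source_id: str
    text: str
    score: float  # assumed to be a relevance score in [0, 1]


INSUFFICIENT = "The retrieved material does not contain enough evidence to answer this question."


def build_prompt(question: str, snippets: list[Snippet], min_score: float = 0.3,
                 min_snippets: int = 2, max_snippets: int = 6) -> Optional[str]:
    """Return a grounded prompt, or None when the evidence is too thin."""
    usable = sorted(
        (s for s in snippets if s.score >= min_score),
        key=lambda s: s.score,
        reverse=True,
    )[:max_snippets]
    if len(usable) < min_snippets:
        return None
    evidence = "\n".join(
        f"[{i}] ({s.source_id}) {s.text}" for i, s in enumerate(usable, start=1)
    )
    return (
        "Answer using only the numbered evidence below. Cite evidence as [n] "
        "after each claim. If the evidence does not support an answer, say so.\n\n"
        f"Evidence:\n{evidence}\n\nQuestion: {question}"
    )


def answer(question: str, snippets: list[Snippet],
           generate: Callable[[str], str]) -> str:
    """Skip the model entirely when there is not enough evidence to ground an answer."""
    prompt = build_prompt(question, snippets)
    if prompt is None:
        return INSUFFICIENT
    return generate(prompt)
```

Because the numbered evidence in the prompt maps one-to-one to source identifiers, the `[n]` markers in the model output can be resolved back to the snippets shown in the evidence view.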

Evidence view for a generated answer

Delivery

The project was delivered as an internal enterprise application, including backend Q&A services, retrieval integration, frontend chat UI, data ingestion scripts, and basic deployment configuration. It draws on tens of thousands of internal reports and tens of millions of news items. Answers begin streaming in about 3–5 seconds and complete with citations in roughly a dozen seconds, and a conversation can run for 10+ follow-up turns. Because the system depends on non-public research reports, news feeds, business tags, and internal deployment details, there is no public repository or live demo.

The screenshots and architecture notes here are anonymized. Customer identity, exact data scale, model details, internal organization, and deployment specifics have been generalized to avoid revealing the original environment.
