Render.com RAG Service | James Murray

James Murray built a Flask-based retrieval-augmented generation (RAG) service hosted on Render.com. PHP applications reach it through a PHP proxy, and the service scales automatically with load while answering AI-powered queries in real time.

The service autoscales from 0 to 32 instances based on CPU utilization, with cold starts under 2 s. Connection pooling and Redis caching bring p95 latency to 300 ms.
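The page does not say how the scaling decision is computed. As a rough illustration only, CPU-target autoscaling is commonly expressed with the proportional rule used by the Kubernetes Horizontal Pod Autoscaler: desired = ceil(current × observed CPU / target CPU), clamped to the allowed range. The sketch below assumes that rule, the 0 to 32 bounds from the text, and a hypothetical scale-from-zero rule (start one instance when requests are waiting). It is not Render's actual algorithm.

```python
import math


def desired_instances(current: int, avg_cpu: float, target_cpu: float,
                      min_instances: int = 0, max_instances: int = 32,
                      pending_requests: int = 0) -> int:
    """Proportional CPU-target scaling, clamped to [min_instances, max_instances].

    Hypothetical helper: illustrates the general technique, not Render internals.
    """
    if target_cpu <= 0:
        raise ValueError("target_cpu must be positive")
    if current == 0:
        # No running instances means no CPU signal; wake one instance on traffic.
        wanted = 1 if pending_requests > 0 else 0
    else:
        # Scale the instance count by how far observed CPU is from the target.
        wanted = math.ceil(current * avg_cpu / target_cpu)
    return max(min_instances, min(max_instances, wanted))
```

For example, 4 instances averaging 90% CPU against a 60% target would scale to 6, while a demand for 40 instances would be capped at 32.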

Key Features

  • Autoscaling RAG Endpoints: Render.com's native scaling with zero configuration.
  • Flask Backend: Lightweight and async-ready, served by Gunicorn.
  • PHP Proxy Integration: Authentication and rate limiting for PHP apps, handled in the proxy.
  • Redis Caching: Hot queries are cached with a 5-minute TTL.
  • Observability: OpenTelemetry and Render logs.
  • Graceful Degradation: Falls back to keyword search when the LLM fails.
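The caching and graceful-degradation features can be sketched together as one request path. This is a minimal sketch under stated assumptions: an in-memory TTLCache stands in for Redis (with redis-py the equivalent write would be `set(key, value, ex=300)`), and `llm_answer`, `keyword_search`, and `answer` are hypothetical names. The naive term-overlap ranking is a stand-in for whatever keyword search the service actually uses.

```python
import time
from typing import Callable, Dict, List, Optional, Tuple

CACHE_TTL_SECONDS = 300  # 5-minute TTL for hot queries, as stated above


class TTLCache:
    """In-memory stand-in for Redis GET / SET-with-expiry (hypothetical helper)."""

    def __init__(self, ttl: float, clock: Callable[[], float] = time.monotonic):
        self.ttl = ttl
        self.clock = clock
        self._store: Dict[str, Tuple[float, str]] = {}

    def get(self, key: str) -> Optional[str]:
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self.clock() >= expires_at:
            del self._store[key]  # expired: evict lazily on read
            return None
        return value

    def set(self, key: str, value: str) -> None:
        self._store[key] = (self.clock() + self.ttl, value)


def keyword_search(query: str, documents: List[str], k: int = 3) -> List[str]:
    """Rank documents by how many query terms they contain (naive fallback)."""
    terms = set(query.lower().split())
    scored = [(sum(t in doc.lower().split() for t in terms), doc) for doc in documents]
    scored = [(score, doc) for score, doc in scored if score > 0]
    scored.sort(key=lambda pair: pair[0], reverse=True)  # stable: ties keep input order
    return [doc for _, doc in scored[:k]]


def answer(query: str, cache: TTLCache, llm_answer: Callable[[str], str],
           documents: List[str]) -> Tuple[str, str]:
    """Return (text, source), where source is 'cache', 'llm', or 'keyword_fallback'."""
    key = " ".join(query.lower().split())  # normalize so trivially different queries share a key
    cached = cache.get(key)
    if cached is not None:
        return cached, "cache"
    try:
        result = llm_answer(query)
    except Exception:
        # Graceful degradation: serve keyword matches instead of an error.
        return "\n".join(keyword_search(query, documents)), "keyword_fallback"
    cache.set(key, result)
    return result, "llm"
```

Fallback answers are deliberately not cached in this sketch, so a transient LLM outage does not pin degraded results for the full five minutes; whether the real service makes the same choice is not stated.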

System Design & Architecture

The service adjusts its instance count to incoming traffic. Flask handles the backend logic, and a PHP proxy connects it to the PHP-based web interface, applying authentication and rate limiting before requests reach the backend.
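The page does not describe which rate-limiting algorithm the PHP proxy uses. A token bucket per API key is one common choice; the sketch below shows that logic in Python for consistency with the other examples, and a PHP version would mirror it. `ProxyGate`, the API-key check, and the capacity and refill values are hypothetical.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, Iterable


@dataclass
class TokenBucket:
    capacity: float           # maximum burst size
    refill_per_second: float  # sustained request rate
    tokens: float
    last: float

    def allow(self, now: float) -> bool:
        # Refill in proportion to elapsed time, capped at capacity.
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_second)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


class ProxyGate:
    """Hypothetical auth + rate-limit check run before forwarding to Flask."""

    def __init__(self, api_keys: Iterable[str], capacity: float = 5,
                 refill_per_second: float = 1.0,
                 clock: Callable[[], float] = time.monotonic):
        self.api_keys = set(api_keys)
        self.capacity = capacity
        self.refill = refill_per_second
        self.clock = clock
        self.buckets: Dict[str, TokenBucket] = {}

    def check(self, api_key: str) -> int:
        """Return the HTTP status the proxy would use: 200 (forward), 401, or 429."""
        if api_key not in self.api_keys:
            return 401
        now = self.clock()
        bucket = self.buckets.get(api_key)
        if bucket is None:
            # New clients start with a full bucket.
            bucket = TokenBucket(self.capacity, self.refill, float(self.capacity), now)
            self.buckets[api_key] = bucket
        return 200 if bucket.allow(now) else 429
```

With more than one proxy process, the buckets would need shared storage (Redis is a natural fit here) so limits hold across processes; the in-memory dict only works for a single process.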

Technical Stack

  • Backend: Flask + Gunicorn
  • Vector DB: Weaviate
  • Cache: Redis
  • Hosting: Render.com
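A Gunicorn configuration file is plain Python, so the Flask + Gunicorn layer on Render could be configured along these lines. The values are assumptions, not the service's real settings: the file binds to the PORT environment variable that Render provides to web services, and starts from Gunicorn's suggested (2 × cores) + 1 worker count unless WEB_CONCURRENCY overrides it.

```python
# gunicorn.conf.py: hypothetical Gunicorn settings for a Render web service.
import multiprocessing
import os

# Render routes traffic to the port in PORT; 10000 is only a local fallback.
bind = f"0.0.0.0:{os.environ.get('PORT', '10000')}"

# Gunicorn's docs suggest (2 x CPU cores) + 1 workers as a starting point.
workers = int(os.environ.get("WEB_CONCURRENCY", multiprocessing.cpu_count() * 2 + 1))

# Threaded workers let each process keep serving while other requests wait on I/O
# (vector search, Redis, LLM calls).
worker_class = "gthread"
threads = 4

# LLM calls can be slow; allow up to 60 s before a worker is killed and restarted.
timeout = 60

# Hold idle keep-alive connections briefly behind the load balancer.
keepalive = 5
```

It would be started with `gunicorn -c gunicorn.conf.py app:app`, where `app:app` is a hypothetical module and Flask object name.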
