
Project Synopsis

Problem Statement: Reducing the load on MedLib’s Healthcare AI Assistant by avoiding redundant inference calls via exact and semantic caching.


The project involves implementing a caching mechanism for MedLib's Healthcare AI Assistant. This assistant is designed to provide healthcare providers with instant access to concise, curated, and credible medical information, complete with references. The AI assistant is powered by a large language model (LLM) adapted for the medical domain and includes multimodal capabilities (text and image inputs). The caching mechanism aims to reduce costs and improve system performance by minimizing redundant LLM inference calls.

Our system stores and reuses previously generated responses. It also implements a mechanism to recognize and cache responses to semantically similar queries, reducing redundant inference even when the phrasing differs.
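As a rough illustration, a minimal in-memory sketch of the exact-match layer might look like the following. The class and method names here are assumptions for illustration only, not the final design; the production cache would likely use a persistent store rather than a Python dictionary.

```python
import hashlib


class ExactCache:
    """Exact-match cache keyed by a normalized form of the query text."""

    def __init__(self):
        self._store = {}  # maps query hash -> cached LLM response

    def _key(self, query: str) -> str:
        # Normalize whitespace and case so trivially identical queries collide.
        normalized = " ".join(query.lower().split())
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

    def get(self, query: str):
        # Returns the cached response, or None on a miss.
        return self._store.get(self._key(query))

    def put(self, query: str, response: str) -> None:
        self._store[self._key(query)] = response
```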

The project will be delivered as an API that lets the user, in this case MedLib, send queries to our caching system to check whether they match any cached phrases. If a match is found, the API returns the cached response instead of calling the user's LLM.
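A hedged sketch of what that API could look like, written with FastAPI and a placeholder in-memory store; the endpoint paths, field names, and store are assumptions for illustration, not a fixed contract.

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

# Placeholder store; the real system would back this with the exact and
# semantic cache layers described in this synopsis.
_cache: dict[str, str] = {}


class LookupRequest(BaseModel):
    query: str


class LookupResponse(BaseModel):
    cache_hit: bool
    response: str | None = None


class StoreRequest(BaseModel):
    query: str
    response: str


@app.post("/cache/lookup", response_model=LookupResponse)
def lookup(req: LookupRequest) -> LookupResponse:
    cached = _cache.get(req.query)
    if cached is not None:
        # Hit: return the stored response instead of calling the LLM.
        return LookupResponse(cache_hit=True, response=cached)
    # Miss: the caller (MedLib) invokes its own LLM, then stores the result.
    return LookupResponse(cache_hit=False)


@app.post("/cache/store")
def store(req: StoreRequest) -> dict:
    _cache[req.query] = req.response
    return {"stored": True}
```

Returning an explicit cache-hit flag keeps the API decoupled from the user's LLM: on a miss, MedLib calls its own model and pushes the result back via the store endpoint.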

The system will be backed by a vector database that stores embeddings of cached queries. The semantic cache compares an incoming query's embedding against these stored embeddings and uses the similarity score to decide whether to return a cached response or call the LLM. This improves performance and reduces costs for users who want to avoid redundant LLM calls.
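A minimal sketch of the semantic lookup, using brute-force cosine similarity in NumPy. In the real system this search would be served by the vector database; the caller-supplied embedding function and the 0.90 threshold are assumptions for illustration.

```python
import numpy as np


class SemanticCache:
    """Semantic cache: reuse a response when a stored query is similar enough."""

    def __init__(self, embed_fn, threshold: float = 0.90):
        self.embed_fn = embed_fn      # caller-supplied text -> vector function
        self.threshold = threshold    # minimum cosine similarity for a cache hit
        self.embeddings: list[np.ndarray] = []
        self.responses: list[str] = []

    def get(self, query: str):
        if not self.embeddings:
            return None
        q = np.asarray(self.embed_fn(query), dtype=float)
        q /= np.linalg.norm(q)
        # Normalize stored embeddings so the dot product is cosine similarity.
        matrix = np.vstack([e / np.linalg.norm(e) for e in self.embeddings])
        scores = matrix @ q
        best = int(np.argmax(scores))
        if scores[best] >= self.threshold:
            return self.responses[best]   # similar enough: reuse cached response
        return None                       # below threshold: caller invokes the LLM

    def put(self, query: str, response: str) -> None:
        self.embeddings.append(np.asarray(self.embed_fn(query), dtype=float))
        self.responses.append(response)
```

The similarity threshold is the key tuning knob: set it too low and the cache may return answers to questions that only superficially resemble the original, set it too high and few redundant calls are avoided.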

Figure: Example flow of the application.