Introduction
Software engineers often need to find relevant code snippets in large, proprietary codebases, but traditional search tools handle natural language queries poorly. Large Language Models (LLMs) like ChatGPT understand natural language well, but they are often ineffective on private or domain-specific codebases that were never part of their training data, and training a custom LLM is prohibitively expensive for most organizations.
Key Problems
- You can’t search your codebase using plain English.
- Proprietary or internal code is not covered by public LLMs.
- Training your own model is too costly and inefficient.
- Existing solutions don’t bridge the gap between natural language and code search effectively.
Our project enables natural language search for code snippets in domain-specific codebases without training a custom LLM. Instead, we combine vector similarity search with LLM-based re-ranking to retrieve the code most relevant to a user's query.
How It Works
- We start with a dataset of code snippets paired with descriptive comments.
- These comments are embedded using a pretrained model (like UniXcoder).
- When a user enters a natural language query, we embed the query too.
- We then compute the dot-product similarity between the query embedding and every comment embedding.
- The top-scoring candidates are re-ranked by an LLM, and the code snippets paired with the highest-ranked comments are returned.
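The retrieval step above can be sketched in a few lines of plain Python. This is a minimal illustration, not the project's actual code: the function names `dot` and `top_k`, the sample snippets, and the 3-dimensional vectors are all hypothetical stand-ins. Real embeddings would come from the embedding model and have far more dimensions.

```python
from typing import List, Sequence, Tuple


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product of two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    return sum(x * y for x, y in zip(a, b))


def top_k(
    query_vec: Sequence[float],
    comment_vecs: Sequence[Sequence[float]],
    k: int = 3,
) -> List[Tuple[int, float]]:
    """Return (index, score) pairs for the k highest-scoring comments."""
    scores = [(i, dot(query_vec, vec)) for i, vec in enumerate(comment_vecs)]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:k]


# Hypothetical dataset: each code snippet is paired with a descriptive comment.
snippets = [
    {"comment": "read a file into a string", "code": "open(path).read()"},
    {"comment": "sort a list of items", "code": "sorted(items)"},
    {"comment": "parse a JSON document", "code": "json.loads(text)"},
]

# Toy vectors standing in for the comment embeddings (assumed, not real output).
comment_vecs = [
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.1],
    [0.1, 0.0, 0.9],
]

# Toy vector standing in for the embedded query "how do I read a file?".
query_vec = [0.8, 0.2, 0.1]

results = top_k(query_vec, comment_vecs, k=2)
for index, score in results:
    print(f"{score:.2f}  {snippets[index]['code']}")
```

Note that a raw dot product favors vectors with larger magnitude. If the embeddings are normalized to unit length first, the dot product is equivalent to cosine similarity.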
Key Components
- Python backend
- UniXcoder for embeddings
- DeepSeek LLM, served through Ollama, for re-ranking
- JSON data pipeline
- Flask-based web UI for user interaction
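As a rough illustration of how the re-ranking component could plug into the rest of the pipeline, the sketch below builds a ranking prompt, passes it to a text-generation callable, and parses the reply. Everything here is an assumption: the prompt wording, the function names (`build_prompt`, `parse_ranking`, `rerank`), and the `generate` callable are hypothetical. In practice `generate` would wrap a request to the LLM; a fake is used here so the example runs offline.

```python
import re
from typing import Callable, List


def build_prompt(query: str, candidates: List[str]) -> str:
    """Build a ranking prompt (the wording is an assumption, not the project's)."""
    lines = [
        "Rank the code snippets below by relevance to the query.",
        "Answer with snippet numbers only, most relevant first, comma-separated.",
        f"Query: {query}",
    ]
    for number, code in enumerate(candidates, start=1):
        lines.append(f"[{number}] {code}")
    return "\n".join(lines)


def parse_ranking(reply: str, n_candidates: int) -> List[int]:
    """Convert the model's reply into 0-based indices, ignoring invalid numbers."""
    order: List[int] = []
    for token in re.findall(r"\d+", reply):
        index = int(token) - 1
        if 0 <= index < n_candidates and index not in order:
            order.append(index)
    # Keep any candidates the model omitted, in their original retrieval order.
    order.extend(i for i in range(n_candidates) if i not in order)
    return order


def rerank(
    query: str, candidates: List[str], generate: Callable[[str], str]
) -> List[str]:
    """Re-order candidates using whatever text-generation callable is supplied."""
    reply = generate(build_prompt(query, candidates))
    return [candidates[i] for i in parse_ranking(reply, len(candidates))]


def fake_generate(prompt: str) -> str:
    """Stand-in for a real LLM call; always prefers snippet 2, then snippet 1."""
    return "2, 1"


candidates = ["sorted(items)", "open(path).read()", "json.loads(text)"]
reranked = rerank("read a file", candidates, fake_generate)
print(reranked)
```

Taking the model call as a parameter keeps the ranking logic testable without a running model. Falling back to the retrieval order for omitted candidates means a malformed reply degrades to plain similarity search rather than failing.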