What is the problem being addressed?
Large Language Models (LLMs) have shown impressive capabilities on general-purpose question answering. However, they often struggle in highly specialized domains such as High-Performance Computing (HPC) code optimization. The main reason for this performance gap is the lack of high-quality, labeled datasets in these niche areas. While languages like Java and Python are well covered, newer or less common frameworks like JAX lack comprehensive datasets. This shortfall limits the effectiveness of LLMs on domain-specific tasks that require expert-level understanding, leading to suboptimal or even incorrect answers.
What is your project idea and how will it work (what are its components etc)?
Our project proposes a framework for improving LLM performance in domain-specific applications. The work has two primary objectives:
- Creating a protocol for collecting and validating high-quality datasets for underrepresented domains such as HPC code optimization with JAX (see the validation sketch after this list).
- Designing and implementing a question-answering system powered by open-source LLMs that are fine-tuned using our curated datasets.
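To make the first objective concrete, below is a minimal sketch of what one validation step in the collection protocol might look like. The record fields (`question`, `answer`, `source`) and the specific checks are illustrative assumptions, not a finalized schema.

```python
from dataclasses import dataclass

@dataclass
class QARecord:
    """One candidate Q&A pair in the HPC/JAX dataset (hypothetical schema)."""
    question: str
    answer: str
    source: str  # e.g. a URL or document ID, kept for provenance

def validate(record: QARecord) -> list[str]:
    """Return a list of validation errors; an empty list means the record passes."""
    errors = []
    if len(record.question.strip()) < 10:
        errors.append("question too short")
    if len(record.answer.strip()) < 10:
        errors.append("answer too short")
    if not record.source:
        errors.append("missing provenance")
    # Domain-specific check (assumed): JAX questions should get JAX-grounded answers.
    if "jax" in record.question.lower() and "jax" not in record.answer.lower():
        errors.append("answer does not reference JAX")
    return errors

record = QARecord(
    question="How do I JIT-compile a function in JAX?",
    answer="Wrap it with jax.jit, e.g. fast_f = jax.jit(f).",
    source="https://example.com/jax-docs",
)
assert validate(record) == []
```

In practice this gate would sit between ingestion and training, so that only records passing every check reach the fine-tuning pipeline.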
The project will consist of several key components:
- Backend infrastructure for data ingestion, processing, and model training/validation.
- Data analysis tools to ensure quality control and consistency in the dataset.
- Machine learning pipeline for fine-tuning existing LLMs on the newly collected data (a minimal sketch follows this list).
- Evaluation framework to measure how accurately the fine-tuned models answer domain-specific questions (also sketched below).
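To make the fine-tuning component concrete, here is a minimal sketch built on the open-source Hugging Face `transformers` Trainer. The model choice (`gpt2`), the hyperparameters, and the toy dataset are placeholders; the actual pipeline may use a different base model or parameter-efficient methods such as LoRA.

```python
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

# Placeholder model: any open-source causal LM could be substituted here.
model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 ships without a pad token
model = AutoModelForCausalLM.from_pretrained(model_name)

# Toy stand-in for the curated dataset produced by the collection protocol.
qa_dataset = Dataset.from_list([
    {"question": "How do I JIT-compile a function in JAX?",
     "answer": "Wrap it with jax.jit, e.g. fast_f = jax.jit(f)."},
])

def tokenize(example):
    # Train on the concatenated question/answer as one causal-LM sequence.
    text = example["question"] + "\n" + example["answer"]
    return tokenizer(text, truncation=True, max_length=512)

train_data = qa_dataset.map(tokenize, remove_columns=["question", "answer"])

args = TrainingArguments(
    output_dir="checkpoints",
    num_train_epochs=3,              # illustrative hyperparameters only
    per_device_train_batch_size=4,
    learning_rate=2e-5,
)
collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)
Trainer(model=model, args=args, train_dataset=train_data,
        data_collator=collator).train()
```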
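The evaluation framework could likewise start from a simple automatic metric over a held-out test set. The keyword-coverage score below is a stand-in for whatever domain-specific metrics we ultimately adopt, and `generate` abstracts over however the fine-tuned model is served.

```python
def keyword_score(prediction: str, reference_keywords: list[str]) -> float:
    """Fraction of expected keywords present in the model's answer (assumed metric)."""
    hits = sum(kw.lower() in prediction.lower() for kw in reference_keywords)
    return hits / len(reference_keywords)

def evaluate(generate, test_set) -> float:
    """Average score over the test set; `generate` maps a question to an answer."""
    scores = [keyword_score(generate(ex["question"]), ex["keywords"])
              for ex in test_set]
    return sum(scores) / len(scores)

test_set = [
    {"question": "How do I JIT-compile a function in JAX?",
     "keywords": ["jax.jit"]},
]
print(evaluate(lambda q: "Use jax.jit to compile the function.", test_set))
```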
We will implement the framework in Python, with the option to explore HPC-oriented technologies such as JAX or OpenMP.
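For a sense of the target domain, the following is a standard JAX optimization of the kind the dataset would cover (ordinary `jax.jit` usage, not project code):

```python
import jax
import jax.numpy as jnp

def step(x):
    # Element-wise pipeline; without JIT, each op dispatches separately.
    return jnp.tanh(x) * jnp.exp(-x ** 2)

fast_step = jax.jit(step)  # XLA fuses the ops into a single compiled kernel

x = jnp.linspace(-1.0, 1.0, 1_000_000)
fast_step(x).block_until_ready()  # first call compiles; later calls are fast
```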