Project Synopsis

Problem 

High performance computing (HPC) clusters are necessary to run a variety of jobs that demand large amounts of computing power, time, or both. These clusters require expensive, powerful hardware to meet the needs of their users, so maintaining the cluster and ensuring it performs well is extremely important. Our team's task is to help develop an HPC health monitoring system for Lockheed Martin that allows administrators to track overall system health. The goal of the project is to create a fully fleshed out monitoring system with real-time hardware data updates, data visualization, and abnormality detection for potentially malicious programs running on the cluster.

Implementation

Due to security risks, neither Lockheed Martin nor Iowa State University allowed our team to work with an actual HPC cluster, so we had to create our own. We set up our environment using Slurm (a job scheduler), Docker, and Docker Compose. This allows us to freely add nodes and adjust their settings, but it also limits us in a number of ways. The most noticeable limitation of using Docker is the hardware statistics: containers report the host device's hardware statistics, making sensor data inconsistent across multiple devices.
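As a rough sketch of how such an environment can be wired together, a minimal docker-compose file might look like the following. The image name, service names, and hostnames here are illustrative assumptions, not our exact configuration:

```yaml
# Hypothetical docker-compose sketch of a tiny Slurm cluster.
# "slurm-cluster:latest" and the service names are placeholders.
services:
  slurmctld:           # Slurm controller (scheduler) node
    image: slurm-cluster:latest
    hostname: slurmctld
  c1:                  # first compute node
    image: slurm-cluster:latest
    hostname: c1
  c2:                  # second compute node; more can be added freely
    image: slurm-cluster:latest
    hostname: c2
```

Adding another node is then just a matter of appending another service entry and re-running `docker-compose up -d`, which is what makes this setup convenient for experimentation.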

Using the Docker environment, we successfully built scrapers that target central processing unit (CPU), memory, storage, job, and network data across all nodes. This data is stored in SQLite tables that an administrator can flush at any time. The data is presented through a simple, effective HTML page. These pages allow an admin to see all data, jobs, and abnormalities across the entire HPC cluster, or to view each node individually and inspect its specific data.
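To illustrate the scrape-and-store flow described above, here is a minimal standard-library sketch of a per-node metrics collector. The database filename, table name, and column choices are assumptions for the example, not our actual schema, and only a few of the metric categories (CPU and storage) are shown:

```python
# Minimal sketch of a node metrics scraper backed by SQLite.
# Table/column names are illustrative, not the project's real schema.
import os
import shutil
import sqlite3
import time


def collect_metrics():
    """Gather a small sample of node statistics using only the stdlib."""
    total, used, _free = shutil.disk_usage("/")
    load1, _load5, _load15 = os.getloadavg()  # Unix-only load averages
    return {
        "timestamp": time.time(),
        "cpu_count": os.cpu_count(),
        "load_1min": load1,
        "disk_total_bytes": total,
        "disk_used_bytes": used,
    }


def store_metrics(db_path, metrics):
    """Append one metrics sample to a SQLite table, creating it if needed."""
    con = sqlite3.connect(db_path)
    con.execute(
        """CREATE TABLE IF NOT EXISTS node_metrics (
               timestamp REAL, cpu_count INTEGER, load_1min REAL,
               disk_total_bytes INTEGER, disk_used_bytes INTEGER)"""
    )
    con.execute(
        "INSERT INTO node_metrics VALUES (?, ?, ?, ?, ?)",
        (metrics["timestamp"], metrics["cpu_count"], metrics["load_1min"],
         metrics["disk_total_bytes"], metrics["disk_used_bytes"]),
    )
    con.commit()
    con.close()


def flush_metrics(db_path):
    """Clear all stored samples, mirroring the admin 'flush' operation."""
    con = sqlite3.connect(db_path)
    con.execute("DELETE FROM node_metrics")
    con.commit()
    con.close()


if __name__ == "__main__":
    store_metrics("metrics.db", collect_metrics())
```

In the real system a scraper like this would run periodically on each node, and the HTML pages would read from the resulting tables.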

[Figure: network graph]