Justin Miller

Resume

PROFESSIONAL SUMMARY

Senior HPC Engineer with 15+ years of experience, uniquely positioned to bridge vendor and end-user perspectives. Expertise in architecting automated testing frameworks that identify critical integration issues and performance bottlenecks. Specialized in GPU-accelerated computing, high-speed interconnects, and distributed storage systems. Proven ability to optimize system designs and validate complex environments using Python automation and real-world workloads. Passionate about advancing HPC/AI integration through cutting-edge hardware and software solutions.

EDUCATION

PROFESSIONAL EXPERIENCE

Hewlett Packard Enterprise (formerly Cray Inc.)

Senior Systems Integration Engineer (Apr 2019 – Present)

(Incorporates duties previously held as Senior System Test Engineer from Apr 2019 – Mar 2020)

  • Testing Framework & Automation
    • Develop and maintain Python-based integration testing framework, enabling end-to-end system validation from bare metal to application layer.
    • Utilize and add to Cray’s extensive database of test cases to test system-level functionality.
    • Develop automation workflows and comprehensive test suites to validate functionality and identify performance regressions.
    • Execute large-scale stress and endurance testing to identify potential failure points before deployment.
    • Validate GPU-accelerated workloads across NVIDIA and AMD environments, ensuring optimal performance.
  • HPC System Management & Software
    • Test and validate both Cray System Management (CSM) software and HPE Performance Cluster Manager (HPCM) for HPC/AI clusters.
    • Install and administer Shasta (now CSM), Cray/HPE’s Kubernetes-based platform that enables converged HPC/ML/AI workflows.
    • Work with Slurm and PBS workload managers to validate job scheduling and resource allocation.
    • Validate and test Singularity container workflows using Slurm in the HPE Cray EX runtime environment.
  • Programming Environment & Applications
    • Test and validate the HPE Cray Programming Environment (CPE) across multiple architectures including ARM, x86-64 (Intel/AMD), and GPU accelerators (NVIDIA/AMD).
    • Build, deploy, and validate diverse HPC/AI applications and benchmarks with focus on performance and scalability.
  • Networking & Interconnects
    • Develop and execute tests for HPE Slingshot that verify health and configuration of the high-speed network fabric.
    • Test and validate both Slingshot 10 (Mellanox ConnectX HCA with HPE switches) and Slingshot 11 (HPE Cassini NIC with HPE switches).
    • Work with libfabric across multiple providers including Mellanox and HPE.
  • Collaboration & Customer Focus
    • Apply unique end-user perspective to ensure solutions meet customer requirements for reliability, usability, scale, and performance.
    • Collaborate closely with Support teams to triage, root-cause, and analyze defects, and to capture reproducers for complex customer issues.
    • Work with sales support and benchmarking engineers to run reproductions of customer workflows and develop tests for specific metrics.
    • Provide technical, experiential, and architectural feedback to product owners and R&D teams.

Storage Quality Engineer (Jan 2015 – Mar 2019)
  • Worked alongside development teams to create and execute tests used to verify functionality of the Lustre client and server for the Cray Linux Environment and ClusterStor storage appliance.
  • Designed and executed tests verifying features and functionality of the Linux kernel module-based Lustre client and server (ext4 and ZFS backing storage).
  • Demonstrated advanced skills in the configuration and administration of Lustre, using ext4 and ZFS as backing filesystems, and of LNet networking.
  • Architected, deployed, and operated CI pipeline for Lustre using Jenkins and OpenStack.
  • Analyzed failure patterns and improved test coverage to enhance system reliability.
  • Tested Cray-developed, Docker container-based monitoring products for Lustre.
  • Participated in change control and release management processes for Lustre client and server.

Indiana University – Research Technologies, High Performance File Systems

Senior Systems Analyst Programmer (HPC Storage Systems Administrator) (Nov 2007 – Dec 2014)
  • Assisted in the operation, maintenance, and administration of two production Lustre file systems for academic research computing using DDN hardware and Mellanox InfiniBand networks.
  • Developed and utilized software for file system analysis, monitoring, and system automation.
  • Deployed, administered, and monitored Linux systems for performance, security, stability, and errors.
  • Performed extensive hardware validation and acceptance testing for new HPC cluster deployments.
  • Administered and provided user support for the OpenSFS Test Cluster, used for testing OpenSFS-funded Lustre development.
  • Designed and implemented a Vagrant/VirtualBox VM testing and development environment, and used it to evaluate Lustre releases and associated tools.
  • Collaborated directly with research teams to implement scientific computing solutions.
  • Provided direct support to all users through a ticketing system and offered consultations for extended or complex projects.
  • Created and presented user documentation, including usage guidance and best practices for the Lustre file system.
  • Tested, diagnosed, and improved Ethernet and InfiniBand networking performance and reliability.
  • Contributed to the evaluation, assessment, and procurement of software and hardware; evaluated early builds of Chroma/Intel Manager for Lustre.
  • Worked with vendors to report and diagnose issues, including the replacement and repair of hardware.
  • Executed critical data management and provided technology support for scientists in remote field deployments to Antarctica and Greenland in association with NASA and NSF (Operation IceBridge).

Purdue University

Student Systems Administrator (2001 – 2006)
  • Provided system administration and hardware support for the Rosen Center for Advanced Computing (RCAC).
  • Provided technical support to students, faculty, and staff of the Earth and Atmospheric Sciences Dept.

PROFESSIONAL ACTIVITIES

Conferences and Professional Associations:

Special Projects

PUBLICATIONS AND PRESENTATIONS

Testing Lustre: The Basics (Video Recording)
Presentation at Lustre Administrator and Developer Workshop (LAD), Paris, France, September 2015

Enabling Lustre WAN for production use on the TeraGrid: a lightweight UID mapping scheme.
Joshua Walgenbach, Stephen C. Simms, Kit Westneat, and Justin P. Miller. 2010. In Proceedings of the 2010 TeraGrid Conference (TG '10). ACM, New York, NY, USA, Article 19, 6 pages. DOI: 10.1145/1838574.1838593