publications | Jeffrey Young

For a complete list of publications, please see my Google Scholar profile.

2025

A Blueprint for Q-CS1, an Introductory Quantum Programming Course

Austin J. Adams, Rodrigo Borela, Jeffrey S. Young, and 1 more author

In Proceedings of the 56th ACM Technical Symposium on Computer Science Education V. 2, Pittsburgh, PA, USA, 2025

Despite the need to build a quantum workforce, current courses that introduce quantum programming are rooted in quantum notation that students may find intimidating. We propose Q-CS1, a quantum equivalent of CS1 that begins with hands-on quantum programming. Q-CS1 is enabled by the Qwerty quantum programming language, which allows for reasoning about qubit behavior without physics notation or quantum circuits. An outline of Q-CS1 is provided along with plans for assessing its effectiveness.
ASDF: A Compiler for Qwerty, a Basis-Oriented Quantum Programming Language

Austin J. Adams, Sharjeel Khan, Arjun S. Bhamra, and 6 more authors

2025

2024

Asynchronous Distributed-Memory Parallel Algorithms for Influence Maximization

Shubhendra Pal Singhal, Souvadra Hati, Jeffrey Young, and 3 more authors

In SC24: International Conference for High Performance Computing, Networking, Storage and Analysis, 2024

A Workflow for the Synthesis of Irregular Memory Access Microbenchmarks

Kevin Sheridan, Jered Dominguez-Trujillo, Galen Shipman, and 5 more authors

In Proceedings of the International Symposium on Memory Systems, 2024

Codesign of hardware technologies and applications for sparse memory access dominated workloads can be challenging due to the complexity of the codes, restrictions on access to the codes, or both. To address this challenge we have developed a novel methodology and set of tools, GS Patterns, that analyze and then synthesize memory access patterns from applications of arbitrary complexity. The results of this are patterns which only contain the normalized sampled memory access addresses as an array of indirection indices organized as either gather (read) or scatter (write) operations. These patterns can then be used to generate memory traffic suitable for hardware optimization and design. In this paper we present GS Patterns including a detailed description of the workflow and algorithms underlying it. The results of analysis and synthesis of access patterns in both proxy and real-world applications using GS Patterns are presented followed by evaluation of performance of these patterns on latest generation hardware technologies including AMD EPYC 9654P, Intel Xeon Max, NVIDIA Grace, and NVIDIA Hopper H100 and H200. Results of this evaluation clearly demonstrate performance differences across different hardware technologies that are not captured by and in many cases are contrary to the performance behavior of simpler memory microbenchmarks.
Understanding Performance Implications of LLM Inference on CPUs

Seonjin Na, Geonhwa Jeong, Byung Hoon Ahn, and 3 more authors

In 2024 IEEE International Symposium on Workload Characterization (IISWC), 2024

CuPBoP: Making CUDA a Portable Language

Ruobing Han, Jun Chen, Bhanu Garg, and 5 more authors

ACM Trans. Des. Autom. Electron. Syst., Jun 2024

CUDA is designed specifically for NVIDIA GPUs and is not compatible with non-NVIDIA devices. Enabling CUDA execution on alternative backends could greatly benefit the hardware community by fostering a more diverse software ecosystem. To address the need for portability, our objective is to develop a framework that meets key requirements, such as extensive coverage, comprehensive end-to-end support, superior performance, and hardware scalability. Existing solutions that translate CUDA source code into other high-level languages, however, fall short of these goals. In contrast to these source-to-source approaches, we present a novel framework, CuPBoP, which treats CUDA as a portable language in its own right. Compared to two commercial source-to-source solutions, CuPBoP offers a broader coverage and superior performance for the CUDA-to-CPU migration. Additionally, we evaluate the performance of CuPBoP against manually optimized CPU programs, highlighting the differences between CPU programs derived from CUDA and those that are manually optimized. Furthermore, we demonstrate the hardware scalability of CuPBoP by showcasing its successful migration of CUDA to AMD GPUs. To promote further research in this field, we have released CuPBoP as an open-source resource.
Qwerty: A Basis-Oriented Quantum Programming Language

Austin J. Adams, Sharjeel Khan, Jeffrey S. Young, and 1 more author

Jun 2024
The Framework Makes the Mission - An Analytical Comparison of Two Popular NASA Open Source Flight Software Framework Offerings

Sterling L. Peet, Scott M. Gilliland, and Jeffrey S. Young

Jun 2024
Multifidelity Memory System Simulation in SST

Patrick Lavin, Jeffrey Young, and Richard Vuduc

In Proceedings of the International Symposium on Memory Systems, Alexandria, VA, USA, Jun 2024

As computer systems grow larger and more complex, it takes more time to simulate their behavior in detail. Researchers interested in simulating large-scale systems must choose between less-accurate high-level models or simulating smaller portions of their benchmark suite, both of which are highly manual, offline approaches that require time-consuming analysis by experts. Multifidelity simulation aims to lessen this burden by automatically adapting the fidelity of a simulation to the complexity of the behavior occurring at any given point in time. We show how a multifidelity memory system model can be used to accelerate single node simulation by up to 2x with 1-5% mean absolute percent error in the simulated instructions per cycle across benchmark suites.

2023

HIPLZ: Enabling performance portability for exascale systems

Jisheng Zhao, Colleen Bertoni, Jeffrey Young, and 3 more authors

Concurrency and Computation: Practice and Experience, Jun 2023

While heterogeneous computing has emerged as a dominant trend in current and future High-Performance Computing (HPC) systems, it is also widely recognized that this shift has led to increased software complexity due to a proliferation of programming systems for different heterogeneous processors. One such example is the Heterogeneous-Compute Interface for Portability from AMD (HIP), which is composed of a C Runtime API and C++ Kernel Language. Many HPC applications will likely use HIP on future exascale systems (e.g., Frontier and El Capitan), but HIP currently only targets AMD and NVIDIA processors. This limitation creates challenges for users who would also like to run their applications on exascale systems based on other architectures (e.g., Aurora, which is based on Intel hardware) that are currently not targeted by HIP. In this paper, we introduce the design and implementation of HIPLZ, a compiler and runtime system that uses the Intel Level Zero API to support HIP on Intel GPU architectures. We discuss the design of HIPLZ, derived from HIPCL (an implementation of HIP on top of OpenCL), and portability issues that occur from using the Level Zero runtime as a backend. We evaluate our implementation by running several performance benchmarks and mini-apps written in HIP on Intel architectures using HIPLZ. Our results show that this approach provides competitive performance relative to Intel’s OpenCL implementations on Intel Gen9 and UHD Graphics 770 GPUs, while providing good coverage of features needed by HPC applications. Overall, this approach is a promising demonstration of enabling performance portability for exascale systems.
Towards Safe HPC: Productivity and Performance via Rust Interfaces for a Distributed C++ Actors Library (Work in Progress)

John Parrish, Nicole Wren, Tsz Hang Kiang, and 3 more authors

In Proceedings of the 20th ACM SIGPLAN International Conference on Managed Programming Languages and Runtimes, Cascais, Portugal, Jun 2023

In this work-in-progress research paper, we make the case for using Rust to develop applications in the High Performance Computing (HPC) domain which is critically dependent on native C/C++ libraries. This work explores one example of Safe HPC via the design of a Rust interface to an existing distributed C++ Actors library. This existing library has been shown to deliver high performance to C++ developers of irregular Partitioned Global Address Space (PGAS) applications. Our key contribution is a proof-of-concept framework to express parallel programs safely in Rust (and potentially other languages/systems), along with a corresponding study of the problems solved by our runtime, the implementation challenges faced, and user productivity. We also conducted an early evaluation of our approach by converting C++ actor implementations of four applications taken from the Bale kernels to Rust Actors using our framework. Our results show that the productivity benefits of our approach are significant since our Rust-based approach helped catch bugs statically during application development, without degrading performance relative to the original C++ actor versions.
EZ: An efficient, charge conserving current deposition algorithm for electromagnetic particle-in-cell simulations

Klaus Steiniger, Rene Widera, Sergei Bastrakov, and 13 more authors

Computer Physics Communications, Oct 2023

We present EZ, a novel current deposition algorithm for particle-in-cell (PIC) simulations. EZ calculates the current density on the electromagnetic grid due to macro-particle motion within a time step by solving the continuity equation of electrodynamics. Being a charge conserving hybridization of Esirkepov’s method and ZigZag, we refer to it as “EZ” as shorthand for “Esirkepov meets ZigZag”. Simulations of a warm, relativistic plasma with PIConGPU show that EZ achieves the same level of charge conservation as the commonly used method by Esirkepov, yet reaches higher performance for macro-particle assignment-functions up to third-order. In addition to a detailed description of the functioning of EZ, reasons for the expected and observed performance increase are given, and guidelines for its implementation aiming at highest performance on GPUs are provided.
Hardware-Agnostic Interactive Exascale In Situ Visualization of Particle-In-Cell Simulations

Felix Meyer, Benjamin Hernandez, Richard Pausch, and 15 more authors

In Proceedings of the Platform for Advanced Scientific Computing Conference, Davos, Switzerland, Oct 2023

The volume of data generated by exascale simulations requires scalable tools for analysis and visualization. Due to the relatively low I/O bandwidth of modern HPC systems, it is crucial to work as close as possible with simulated data via in situ approaches. In situ visualization provides insights into simulation data and, with the help of additional interactive analysis tools, can support the scientific discovery process at an early stage. Such in situ visualization tools need to be hardware-independent given the ever-increasing hardware diversity of modern supercomputers. We present a new in situ 3D vector field visualization algorithm for particle-in-cell (PIC) simulations and performance evaluation of the solution developed at large-scale. We create a solution in a hardware-agnostic approach to support high throughput and interactive in situ processing on leadership class computing systems. To that end, we demonstrate performance portability on Summit’s and the Frontier’s pre-exascale testbed at the Oak Ridge Leadership Computing Facility.
Unified Co-Simulation Framework for Autonomous UAVs

Sri Ranganathan Palaniappan, Varun Pateel, Sam Jijina, and 2 more authors

In Practice and Experience in Advanced Research Computing 2023: Computing for the Common Good, Portland, OR, USA, Oct 2023

Autonomous drones (UAVs) have rapidly grown in popularity due to their form factor, agility, and ability to operate in harsh or hostile environments. Drone systems come in various form factors and configurations and operate under tight physical parameters. Further, it has been a significant challenge for architects and researchers to develop optimal drone designs as open-source simulation frameworks either lack the necessary capabilities to simulate a full drone flight stack or they are extremely tedious to set up with little or no maintenance or support. In this paper, we develop and present UniUAVSim, our fully open-source co-simulation framework capable of running software-in-the-loop (SITL) and hardware-in-the-loop (HITL) simulations concurrently. The paper also provides insights into the abstraction of a drone flight stack and details how these abstractions aid in creating a simulation framework which can accurately provide an optimal drone design given physical parameters and constraints. The framework was validated with real-world hardware and is available to the research community to aid in future architecture research for autonomous systems.
Observed Memory Bandwidth and Power Usage on FPGA Platforms with OneAPI and Vitis HLS: A Comparison with GPUs

Christopher M. Siefert, Stephen L. Olivier, Gwendolyn R. Voskuilen, and 1 more author

In High Performance Computing, Oct 2023

The two largest barriers to adoption of FPGA platforms for HPC applications are the difficulty of programming FPGAs and the performance gap when compared to GPUs. To address the first barrier, new ecosystems like Intel oneAPI and Xilinx Vitis HLS aim to improve programmability for FPGA platforms. From a performance aspect, FPGAs trade off lower compute frequencies for more customized hardware acceleration and power efficiency when compared to GPUs. The performance for memory-bound applications on recent GPU platforms like NVIDIA’s H100 and AMD’s MI210 has also improved due to the inclusion of high-bandwidth memories (HBM), and newer FPGA platforms are also starting to include HBM in addition to traditional DRAM.
Future Computing with the Rogues Gallery

Aaron Jezghani, Jeffrey Young, Will Powell, and 2 more authors

In 2023 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), Oct 2023

Enabling Multi-threading in Heterogeneous Quantum-Classical Programming Models

Akihiro Hayashi, Austin Adams, Jeffrey Young, and 4 more authors

In 2023 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), Oct 2023

Application Experiences on a GPU-Accelerated Arm-based HPC Testbed

Wael Elwasif, William Godoy, Nick Hagerty, and 31 more authors

In Proceedings of the HPC Asia 2023 Workshops, Raffles Blvd, Singapore, Oct 2023

This paper assesses and reports the experience of ten teams working to port, validate, and benchmark several High Performance Computing applications on a novel GPU-accelerated Arm testbed system. The testbed consists of eight NVIDIA Arm HPC Developer Kit systems, each one equipped with a server-class Arm CPU from Ampere Computing and two data center GPUs from NVIDIA Corp. The systems are connected together using InfiniBand interconnect. The selected applications and mini-apps are written using several programming languages and use multiple accelerator-based programming models for GPUs such as CUDA, OpenACC, and OpenMP offloading. Working on application porting requires a robust and easy-to-access programming environment, including a variety of compilers and optimized scientific libraries. The goal of this work is to evaluate platform readiness and assess the effort required from developers to deploy well-established scientific workloads on current and future generation Arm-based GPU-accelerated HPC systems. The reported case studies demonstrate that the current level of maturity and diversity of software and tools is already adequate for large-scale production deployments.

2022

“Smarter” NICs for faster molecular dynamics: a case study

S. Karamati, C. Hughes, K. Hemmert, and 6 more authors

In 2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS), Jun 2022

2021

Online Model Swapping for Architectural Simulation

Patrick Lavin, Jeffrey Young, Richard Vuduc, and 1 more author

In Proceedings of the 18th ACM International Conference on Computing Frontiers, Virtual Event, Italy, Jun 2021

Performance Analysis of PIConGPU: Particle-in-Cell on GPUs using NVIDIA’s NSight Systems and NSight Compute

Matthew Leinhauser, Jeffrey Young, Sergei Bastrakov, and 3 more authors

Jun 2021

2020

Co-designing OpenMP Features Using OMPT and Simulation Tools

Matthew Baker, Oscar Hernandez, and Jeffrey Young

In OpenMP: Portable Multi-Level Parallelism on Modern Systems, Jun 2020
Evaluating Gather and Scatter Performance on CPUs and GPUs

Patrick Lavin, Jeffrey Young, Richard Vuduc, and 3 more authors

In The International Symposium on Memory Systems, Washington, DC, USA, Jun 2020

Programming Strategies for Irregular Algorithms on the Emu Chick

Eric R. Hein, Srinivas Eswar, Abdurrahman Yaşar, and 7 more authors

ACM Trans. Parallel Comput., Oct 2020

Spatter GitHub

Patrick Lavin, Jeffrey Young, and Richard Vuduc

Oct 2020

2019

Experimental Insights from the Rogues Gallery

Jeffrey S Young, Jason Riedy, Thomas M Conte, and 3 more authors

In 2019 IEEE International Conference on Rebooting Computing (ICRC), Oct 2019
Linear Algebra-Based Triangle Counting via Fine-Grained Tasking on Heterogeneous Environments: (Update on Static Graph Challenge)

A. Yaşar, S. Rajamanickam, J. Berry, and 3 more authors

In 2019 IEEE High Performance Extreme Computing Conference (HPEC), Oct 2019
Performance Impact of Memory Channels on Sparse and Irregular Algorithms

Oded Green, James Fox, Jeff Young, and 2 more authors

In 2019 IEEE/ACM 9th Workshop on Irregular Applications: Architectures and Algorithms (IA3), Oct 2019

A microbenchmark characterization of the Emu chick

Jeffrey S. Young, Eric Hein, Srinivas Eswar, and 5 more authors

Parallel Computing, Oct 2019

Wrangling Rogues: A Case Study on Managing Experimental Post-Moore Architectures

Will Powell, Jason Riedy, Jeffrey S. Young, and 1 more author

In Proceedings of the Practice and Experience in Advanced Research Computing on Rise of the Machines (Learning), Chicago, IL, USA, Oct 2019

Spatter: A Customizable Scatter/Gather Benchmark

Patrick Lavin, Jason Riedy, Richard Vuduc, and 1 more author

Oct 2019

http://spatter.io/
Rogues Gallery Public Gitlab Page

online, Oct 2019

https://crnch-rg.gitlab.io/rg/
Programming Novel Architectures in the Post-Moore Era with the Rogues Gallery

E. Jason Riedy and Jeffrey S. Young

In Practice and Experience in Advanced Research Computing (PEARC), Jul 2019

https://crnch-rg.gitlab.io/pearc-2019/
Programming Novel Architectures in the Post-Moore Era with The Rogues Gallery

E. Jason Riedy and Jeffrey S. Young

In 24th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), Apr 2019

https://crnch-rg.gitlab.io/asplos-2019/

2018

Performance Implications of NoCs on 3D-Stacked Memories: Insights from the Hybrid Memory Cube

R. Hadidi, B. Asgari, J. Young, and 4 more authors

In 2018 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), Apr 2018
An Energy-Efficient Single-Source Shortest Path Algorithm

Sara Karamati, Jeffrey Young, and Richard Vuduc

In 2018 IEEE International Parallel and Distributed Processing Symposium (IPDPS), Apr 2018

An Initial Characterization of the Emu Chick

Eric Hein, Tom Conte, Jeffrey Young, and 5 more authors

In 2018 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), Apr 2018

2018 Neuromorphic Workshop

Jennifer Hasler and Jeffrey Young

Apr 2018

http://crnch.gatech.edu/neuro-workshop18

2017

Evaluating Hybrid Memory Cube Infrastructure to Support High-Performance Sparse Algorithms

Kartikay Garg and Jeffrey Young

In Proceedings of the International Symposium on Memory Systems, Alexandria, Virginia, Apr 2017

2016

Optimizing communication for a 2D-partitioned scalable BFS

Jeffrey Young, Julian Romera, Matthias Hauck, and 1 more author

In 2016 IEEE High Performance Extreme Computing Conference (HPEC), Apr 2016

GPUShare: Fair-Sharing Middleware for GPU Clouds

Anshuman Goswami, Jeffrey Young, Karsten Schwan, and 4 more authors

In 2016 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), Apr 2016

Landrush: Rethinking In-Situ Analysis for GPGPU Workflows

Anshuman Goswami, Yuan Tian, Karsten Schwan, and 5 more authors

In 2016 16th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid), Apr 2016

GraphIn: An Online High Performance Incremental Graph Processing Framework

Dipanjan Sengupta, Narayanan Sundaram, Xia Zhu, and 4 more authors

In Proceedings of the 22nd International Conference on Euro-Par 2016: Parallel Processing - Volume 9833, Apr 2016

2015

Examining Recent Many-core Architectures and Programming Models Using SHOC

M. Graham Lopez, Jeffrey Young, Jeremy S. Meredith, and 3 more authors

In Proceedings of the 6th International Workshop on Performance Modeling, Benchmarking, and Simulation of High Performance Computing Systems, Austin, Texas, Apr 2015
