

## Efficiently Using Architectures in the Era of Customization and Specialization and Quantum

Prof. Martin Schulz, Chair for Computer Architecture and Parallel Systems, TU Munich, Department of Computer Engineering

#### Trends in HPC: The HPC World is Changing





#### Trends in HPC: Power and Energy as Hard Constraints

**Power Limits**

- Costs are limiting
- Power density is pushing limits
- Societal pressure



#### Trends in HPC: Cost of Data Movement is Becoming a Limiter

[Figure: energy in pJ per 64-bit operation, logarithmic axis with ticks at 10, 100, 1,000, and 10,000]



#### Trends in HPC: Workloads are Becoming More Complex and Diverse



#### Trends in HPC: Cambrian Explosion of Architectures



Efficiently Using Architectures in the Era of Customization and Specialization, Martin Schulz, April 2024, Salishan, OR

Hybrid Workflows

#### A Brave New World: You Cannot Run Away From New Hardware

**Customization**

- Special purpose architectures
- Hardware compression
- New instructions/HW blocks
- Memory centric accelerators



- Enabling easy access to market
- Reduced engineering cost
- Decreased design times
- Heterogeneous integration



#### **Better GPUs**

Open source **GPU** architecture



- Building on RISC-V vectors
- CuPBoP portability layer
- Complex ecosystem needed

#### **Common theme:**

- New hardware, new ISAs, new system designs
- Option for deep integration relying on common memory
- New programming approaches, techniques and models
- New chances for power and energy steering

#### A Brave New World: Large Scale Accelerators

#### **AI Accelerators**

- Physically separated systems
- Tied into highspeed networks
- Data optimized environments
- Separate programming environments
- Focus on lower precision
- Themselves often clustered









#### Quantum Systems

- Radically new technology
- High potential, high risk, low readiness (so far)
- Not a standalone technology: needs "classic" compute





#### A Brave New World: Quantum Computing as a Laboratory Experiment

Research on Optic Tables @ MPQ in Garching, Germany


Source: MPQ


#### A Brave New World: Quantum Integration Centre (QIC) at LRZ





#### HPCQC Integration: Hardware Integration Reducing the Gap between Host and Accelerator





#### Integrated Systems: Large Scale Accelerators


#### Quantum Accelerators

- Heterogeneity to the extreme
- New paradigm with new comp. models
- Many common challenges compared to existing HPC accelerators



#### Making use of this new architectural world: Where is the Software?

#### **Accelerated Systems**

- Used to be disruptive
- New ways to program
- Abstraction layers
- Today: integrated software and management stack






#### **Integrated Systems**

- New opportunities
  - Quick load shifts
  - Power shifts
- Need for node sharing
- Impact on Scheduling





#### Making use of this new architectural world: Topology Detection with sys-sage



Need to capture topology

- Dynamic
- With attributes
- Expandable

New project: sys-sage

- Capture topology
- Dual representation
  - Component tree
  - Data-path graph
- Express dynamic behavior
- Capture system changes
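The dual representation listed above can be pictured with a small data structure. The following is a minimal sketch in C and not the sys-sage API: every type and function name (`topology_t`, `topo_add_component`, `topo_set_datapath`, and so on) is hypothetical, and the only attribute modeled is an assumed bandwidth value. It shows a component tree for containment, a separate data-path graph for connectivity, and an attribute update that stands in for dynamic behavior.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_COMPONENTS 16
#define MAX_DATAPATHS  16

typedef enum { COMP_NODE, COMP_CPU, COMP_GPU, COMP_MEMORY } comp_type_t;

/* Component tree: every component stores the index of its parent. */
typedef struct {
    comp_type_t type;
    int parent;                 /* -1 marks the root */
} component_t;

/* Data-path graph: an edge between any two components, with an attribute. */
typedef struct {
    int src, dst;
    double bandwidth_gbs;       /* hypothetical attribute; may change at runtime */
} datapath_t;

typedef struct {
    component_t comps[MAX_COMPONENTS];
    size_t n_comps;
    datapath_t paths[MAX_DATAPATHS];
    size_t n_paths;
} topology_t;

/* Add a component below `parent`; returns its index, or -1 on error. */
int topo_add_component(topology_t *t, comp_type_t type, int parent)
{
    assert(t != NULL);
    if (t->n_comps == MAX_COMPONENTS) return -1;
    if (parent < -1 || parent >= (int)t->n_comps) return -1;
    t->comps[t->n_comps].type = type;
    t->comps[t->n_comps].parent = parent;
    return (int)t->n_comps++;
}

/* Find the data path between two components (direction is ignored). */
static datapath_t *topo_find_path(topology_t *t, int a, int b)
{
    for (size_t i = 0; i < t->n_paths; i++) {
        datapath_t *p = &t->paths[i];
        if ((p->src == a && p->dst == b) || (p->src == b && p->dst == a))
            return p;
    }
    return NULL;
}

/* Record a data path or update its attribute; returns 0, or -1 on error. */
int topo_set_datapath(topology_t *t, int src, int dst, double bandwidth_gbs)
{
    assert(t != NULL);
    if (src < 0 || dst < 0 || src >= (int)t->n_comps || dst >= (int)t->n_comps)
        return -1;
    datapath_t *p = topo_find_path(t, src, dst);
    if (p == NULL) {
        if (t->n_paths == MAX_DATAPATHS) return -1;
        p = &t->paths[t->n_paths++];
        p->src = src;
        p->dst = dst;
    }
    p->bandwidth_gbs = bandwidth_gbs;
    return 0;
}

/* Tree query: number of edges between a component and the root. */
int topo_depth(const topology_t *t, int comp)
{
    int depth = 0;
    while (t->comps[comp].parent != -1) {
        comp = t->comps[comp].parent;
        depth++;
    }
    return depth;
}

/* Graph query: bandwidth between two components, or -1.0 if unconnected. */
double topo_bandwidth(topology_t *t, int a, int b)
{
    datapath_t *p = topo_find_path(t, a, b);
    return p ? p->bandwidth_gbs : -1.0;
}
```

Keeping the two structures separate lets containment questions ("which node owns this GPU?") and connectivity questions ("how fast is the CPU to GPU link right now?") be answered independently, and a changed link attribute touches only the graph.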



A Unified Representation of Dynamic Topologies & Attributes on HPC Systems, Stepan Vanecek, Martin Schulz, to appear in ICS 2024


#### Making use of this new architectural world: Where is the Software?




#### **Quantum Paradigm**

- Again disruptive
- For now: circuit-based
- New software stack needed







#### The Munich Quantum Software Stack (MQSS): Front-End / Languages













Quantum Systems

#### QC and HPCQC System Software: OpenMP Quantum Tasks Integration

- Lower learning curve for HPC users
- Benefits from compiler-level information rather than library-level information
- Possibility to offload classical tasks to "nearby compute"

Quantum Task Offloading with the OpenMP API, Joseph KL Lee, Oliver T Brown, Mark Bull, Martin Ruefenacht, Johannes Doerfert, Michael Klemm, Martin Schulz, Posters at SC23

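```
#include <omp.h>
#include <stdio.h>

/* Sample a two-qubit Bell state. The omp_q_* calls are the quantum
   extensions proposed in the poster, not part of standard OpenMP. */
void bell_0() {
    int states = 4;
    int shots = 1000;
    int results[states];
    for (int i = 0; i < states; i++) results[i] = 0;  /* clear the histogram */
    #pragma omp target loop
    for (int shot = 0; shot < shots; shot++) {
        omp_q_reg result = omp_create_q_reg(2);  /* two-qubit register */
        omp_q_h(result, 0);                      /* Hadamard on qubit 0 */
        omp_q_cx(result, 0, 1);                  /* CNOT: control 0, target 1 */
        int idx = omp_q_m(result);               /* measure the register */
        results[idx] += 1;
    }
    for (int state_idx = 0; state_idx < states; state_idx++) {
        printf("|%d>: %d\n", state_idx, results[state_idx]);
    }
}
```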


#### The Munich Quantum Software Stack (MQSS): HPC Access












Enabling Domain User Communities to Compute on Quantum Devices

#### The Munich Quantum Software Stack (MQSS): Scheduling & Resource Management





#### The Munich Quantum Software Stack (MQSS): Quantum Compiler





Quantum Compiler based on QIR/LLVM

Comprehensive Toolkits, Optimizers, Verifiers, Simulators








#### The Munich Quantum Software Stack (MQSS): The QDMI Backend






#### The Munich Quantum Software Stack (MQSS): Feedback from the Target System






#### Making use of this new architectural world: Where is the Software?




#### **Quantum Paradigm**

- Again disruptive
- For now: circuit-based
- New software stack needed
- Key: HPC integration
- Impact on scheduling







#### Consequences on the software side: How to Get to One Software Stack?

#### Programming

- Multi-accelerator programming is hard and manual
- Abstractions can help
- Single source option / for quantum?

#### System software is equally critical

- Scheduling and resource management
- Efficient and uniform device usage



# We must revisit old HPC dogmas!




#### Consequences on the software side: Breaking HPC Dogmas



One Node = One Job

## **On-Node Co-Scheduling**

Effective use of complementary accelerators

#### On-Node Co-Scheduling: Mitigation of Inefficient GPU Utilization



#### **Issues with GPU utilization**

- Not all workloads use the entire GPU
- Multiple processes per node

#### **Co-scheduling as an option**

- Multiple applications share the node
- ... share the GPU

#### **Example: NVIDIA features**

- MIG (Multi-Instance GPU)
- MPS (Multi-Process Service)



#### On-Node Co-Scheduling: Benefit of Flexible Partitioning by MPS








#### **Stage 1: Profiling**

- Classify jobs into categories:
  - Compute-Intensive
  - Memory-Intensive
  - UnScalable
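To make the profiling stage concrete, here is a minimal sketch in C of how a job might be sorted into the three categories above. The profile fields, the scaling test, and the threshold `SCALING_THRESHOLD` are illustrative assumptions, not the criteria used in the cited work.

```c
#include <assert.h>
#include <stddef.h>

/* The three categories named on the slide. */
typedef enum {
    JOB_COMPUTE_INTENSIVE,
    JOB_MEMORY_INTENSIVE,
    JOB_UNSCALABLE
} job_class_t;

/* Hypothetical profile of one job; all fields are assumptions. */
typedef struct {
    double compute_utilization;   /* fraction of peak compute, 0..1 */
    double membw_utilization;     /* fraction of peak memory bandwidth, 0..1 */
    double speedup_full_vs_half;  /* runtime on half a GPU / runtime on full GPU */
} job_profile_t;

/* Assumed cutoff: below this speedup, extra GPU resources are mostly wasted. */
#define SCALING_THRESHOLD 1.2

job_class_t classify_job(const job_profile_t *p)
{
    assert(p != NULL);
    assert(p->compute_utilization >= 0.0 && p->compute_utilization <= 1.0);
    assert(p->membw_utilization >= 0.0 && p->membw_utilization <= 1.0);

    /* A job that barely benefits from twice the resources is unscalable. */
    if (p->speedup_full_vs_half < SCALING_THRESHOLD)
        return JOB_UNSCALABLE;

    /* Otherwise label the job by the resource it stresses more. */
    if (p->membw_utilization > p->compute_utilization)
        return JOB_MEMORY_INTENSIVE;
    return JOB_COMPUTE_INTENSIVE;
}
```

Under these assumptions, a compute-intensive and a memory-intensive job would be natural candidates for sharing one GPU, since they stress different resources.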






#### **Stage 2: Offline Training**

- Sweep over partitioning options
  - Explore parameter space
  - Train model
  - Store for online use
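The sweep over partitioning options can be illustrated with a brute-force search. This is a sketch under stated assumptions, written in C: two jobs share one GPU, the compute share of the first is varied in 10% steps, and a deliberately simple, hypothetical throughput model (linear scaling up to a saturation point) stands in for real measurements. The cited work trains a reinforcement-learning model from such data; the code below only shows the sweep itself.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical job model: throughput grows linearly with the GPU share
   until `saturation_pct` percent of the GPU, then stays flat. */
typedef struct {
    int saturation_pct;   /* 1..100 */
} job_model_t;

/* Throughput relative to running alone on the full GPU (0..1). */
static double relative_throughput(const job_model_t *j, int share_pct)
{
    if (share_pct >= j->saturation_pct)
        return 1.0;
    return (double)share_pct / (double)j->saturation_pct;
}

/* Sweep the share of job A from 10% to 90%; job B receives the rest.
   Returns the share of A with the highest combined relative throughput
   (the first one found on ties) and stores that throughput in
   *best_total if the pointer is non-NULL. */
int best_partition_percent(const job_model_t *a, const job_model_t *b,
                           double *best_total)
{
    assert(a != NULL && b != NULL);
    assert(a->saturation_pct > 0 && b->saturation_pct > 0);

    int best_pct = 10;
    double best_sum = -1.0;
    for (int pct = 10; pct <= 90; pct += 10) {
        double sum = relative_throughput(a, pct)
                   + relative_throughput(b, 100 - pct);
        if (sum > best_sum) {
            best_sum = sum;
            best_pct = pct;
        }
    }
    if (best_total != NULL)
        *best_total = best_sum;
    return best_pct;
}
```

With real jobs, each point of the sweep would be a measured run rather than a model evaluation, which is why the exploration happens offline and only the trained model is queried online.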






#### Stage 3: Query job assignments






#### On-Node Co-Scheduling: Scheduling

[Figure: scheduling workflow in three parts. Offline profiling: jobs J<sub>1</sub> ... J<sub>W</sub> from the job queue (parameters W, Cmax) are profiled and their job profiles are stored in a repository. Offline training: an RL agent trains a DQN model in an RL environment and stores the model coefficients. Online optimization: for each job set JS<sub>i</sub> from the job queue, the RL agent (DQN) observes the GPU state s<sub>t</sub>, chooses an action a<sub>t</sub> (a partitioning R<sub>i</sub>), and receives a reward r<sub>t</sub> from the reward function.]

#### On-Node Co-Scheduling: Experimental Results (Throughput)





Hierarchical Resource Partitioning on Modern GPUs: A Reinforcement Learning Approach, Urvij Saroliya, Eishi Arima, Dai Liu, Martin Schulz, IEEE Cluster 2023






#### Enabling Malleable Programs: Malleable Programming Models

Applications need to support malleability

- Overdecomposition/Virtualization
- Via dedicated programming abstractions (e.g., tasking)
- Explicit APIs embedded within the programming model



EU Grant #955606 BMBF #16HPC014

Most used HPC programming model/abstraction: MPI, the Message Passing Interface

- Static view on resources (MPI\_COMM\_WORLD)
- Moldability present in standard, but not in practice
- Dynamic adaptation impractical and rarely used

How to change MPI to support malleability

- Maintain basic "look & feel" of MPI
- MPI Process Sets and MPI Sessions



#### Enabling Malleable Programs: Building on top of MPI Sessions

Option 1: MPI Sessions form "MPI Bubbles"

- "All resources that are derived from a set of resources across a set of MPI processes"
- Implicitly derived from MPI application using sessions
- Within an MPI bubble, normal MPI
- Can invalidate a bubble and create a new one while maintaining state

Option 2: Process sets can change

- Ability for the runtime to "tell" something to the application
- Enable process sets to grow or shrink
- Names are local to MPI Sessions
- Agreement protocol/versioning to agree on new set
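The versioning idea in Option 2 can be pictured with a toy model. The sketch below is plain C that runs in a single process; it uses no MPI calls and is not a proposal for the MPI interface. All names (`pset_state_t`, `pset_propose_resize`, `pset_acknowledge`) are hypothetical. It assumes that the runtime proposes a new size under a new version number and that the change takes effect only after every member of the current set has acknowledged that version.

```c
#include <assert.h>
#include <stddef.h>

/* A process set reduced to what the sketch needs: a version and a size. */
typedef struct {
    int version;
    int size;
} pset_t;

typedef struct {
    pset_t current;    /* the set the application is using */
    pset_t proposed;   /* the set the runtime would like to switch to */
    int pending;       /* 1 while a proposal awaits agreement */
    int acks;          /* acknowledgements received so far */
} pset_state_t;

void pset_init(pset_state_t *s, int size)
{
    assert(s != NULL && size > 0);
    s->current.version = 0;
    s->current.size = size;
    s->proposed = s->current;
    s->pending = 0;
    s->acks = 0;
}

/* Runtime side: propose growing or shrinking the set.
   Returns the proposed version, or -1 if the size is invalid or an
   earlier proposal is still pending. */
int pset_propose_resize(pset_state_t *s, int new_size)
{
    if (new_size <= 0 || s->pending)
        return -1;
    s->proposed.version = s->current.version + 1;
    s->proposed.size = new_size;
    s->pending = 1;
    s->acks = 0;
    return s->proposed.version;
}

/* Application side: one member of the current set agrees to `version`.
   Returns 1 once all current members have agreed and the new set is in
   effect, 0 while agreement is incomplete, and -1 for a stale or unknown
   version. The toy does not track which member sent each acknowledgement. */
int pset_acknowledge(pset_state_t *s, int version)
{
    if (!s->pending || version != s->proposed.version)
        return -1;
    if (++s->acks < s->current.size)
        return 0;
    s->current = s->proposed;
    s->pending = 0;
    s->acks = 0;
    return 1;
}
```

The version number is what lets a process detect that it is looking at an outdated view of the set: an acknowledgement for any version other than the pending one is rejected rather than silently applied.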






#### Active Power Steering: The PowerStack Initiative

Active international initiative to identify

- ... Common terminology
- ... Compatible components
- ... Site-specific policies

Starting Point: June 2018 Seminar at the **TUM Science & Study Center** Raitenhaslach

Multiple meetings since then

Sessions at major conferences












#### Active Power Steering: REGALE Architecture

Three levels of management

- System-level
- Job-level
- Node-level
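One way to picture the three levels is as the same budgeting step applied repeatedly: the system splits its power cap among jobs, and each job splits its share among its nodes. The C sketch below is an illustration under assumed rules and does not describe the REGALE implementation. The function name `split_power_budget` and the proportional-scaling policy are assumptions.

```c
#include <assert.h>
#include <stddef.h>

/* Split `budget_w` watts among `n` consumers.
   If the total demand fits into the budget, every consumer gets what it
   asked for; otherwise all demands are scaled down by the same factor so
   that the grants add up to the budget. */
void split_power_budget(double budget_w, const double *demand_w, size_t n,
                        double *grant_w)
{
    assert(budget_w >= 0.0 && demand_w != NULL && grant_w != NULL);

    double total = 0.0;
    for (size_t i = 0; i < n; i++)
        total += demand_w[i];

    /* total > budget_w >= 0 guarantees a non-zero divisor. */
    double scale = (total > budget_w) ? budget_w / total : 1.0;

    for (size_t i = 0; i < n; i++)
        grant_w[i] = demand_w[i] * scale;
}
```

A usage pattern would be to call the function once with the system cap and the per-job demands, then once per job with that job's grant and its per-node demands. Re-running the split whenever demands change is what turns a static worst-case allocation into dynamic power shifting.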

Integration of existing software

- Elastic MPI
- Standardization (PMIx)
- Workflow systems



EU Grant #956560 BMBF #16HPC039K REGALE









#### Conclusions: We Will See Specialized/Customized Hardware, But How Can We Use It Efficiently?

#### **Heterogeneous Architectures**

- More specialized "crazy" architectures
- Integrated processors with accelerators
- Tight integration of large-scale accelerators

#### **Software Stacks**

- Programming models aside ...
- Flexible and dynamic scheduling will be key
- Importance of (fine grained) workflows
- Impact across the entire stack

#### **Breaking HPC dogmas**

- Single node scheduling  $\rightarrow$  Co-Scheduling
- Rigid resource allocation  $\rightarrow$  Dynamic Resources
- Worst case power  $\rightarrow$  Dynamic Power Shifting



#### Acknowledgements: It takes a team, or rather many teams!

CAPS Team @ TUM





QCT Team @ LRZ






Bayerisches Staatsministerium für Wissenschaft und Kunst

Federal Ministry of Education and Research

Federal Ministry for Economic Affairs and Climate Action








