
What is a Supercomputer?

A supercomputer is a computing system with exceptionally high computational power, far surpassing the capabilities of a standard personal computer. In the past, supercomputers were predominantly monolithic machines, whereas today they are most commonly found in the form of clusters of computing nodes. Each node contains multi-core processors (commonly referred to as CPUs) and often also graphics processing units (GPUs) to accelerate computations. All nodes are interconnected through a very high-speed network (e.g., Infiniband), which ensures low latency and high bandwidth for node-to-node communication.

It is important to note that this exceptional computational power comes from a very large number of processing units running many tasks in parallel, not from processor cores that are significantly faster than those found in a personal computer. Therefore, a problem benefits from a supercomputer only if it can be split into a large number of smaller tasks that can then be executed in parallel.
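
For a concrete (and deliberately simplified) illustration, consider a parameter sweep in which the same program must be run for many different input values. Each run is independent of the others, so the work splits naturally into many small tasks that a supercomputer can execute side by side. The sketch below assumes a hypothetical simulate program and is purely illustrative:

```bash
#!/bin/bash
# Hypothetical parameter sweep: every value of "alpha" is an independent task.
# On a single machine these runs execute one after another; on a supercomputer
# each run can be packaged as its own job and executed in parallel.
for alpha in $(seq 0.1 0.1 10.0); do
    ./simulate --alpha "$alpha" --out "result_${alpha}.dat"
done
```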

Physical Implementation of a Supercomputer

Supercomputers are physically hosted in data centers very similar to those used by cloud service providers. The fundamental building blocks are compute nodes, each containing one or more processors (CPUs) with multiple cores that share access to the node's common memory. All nodes are connected by a high-speed network (e.g., Infiniband), which guarantees low latency and high throughput. In addition to compute nodes, there are also dedicated storage nodes, which hold most of the data. In essence, each node resembles a powerful workstation, except that it typically has more processor cores and far more main memory. The figure below depicts a schematic layout of a modern data center:

[Figure: schematic layout of a supercomputing data center]

This figure illustrates the schematic layout of a supercomputing data center composed of multiple nodes, each containing several processors and shared memory. All nodes are connected via a high-speed, low-latency network, which also links them to a shared disk subsystem or storage nodes.

Key Characteristics of Supercomputers

  • Parallel Data Processing: Supercomputers can process large amounts of data simultaneously, significantly reducing the time required for complex computations.
  • High-Performance Resources: Nodes may feature 32, 64, or even more CPU cores, hundreds of gigabytes of system memory (RAM), and often one or more graphics processing units (GPUs).
  • Advanced Network Infrastructure: Instead of standard 1 or 10 Gbit/s Ethernet, they use specialized high-speed interconnects, such as Infiniband or Omni-Path, which provide very low latency.
  • Job Scheduling Software (e.g., Slurm): Users submit jobs to a queue, and the system executes them once the necessary resources (CPU, GPUs, memory) become available.

Supercomputer Architecture

Nodes

A supercomputer consists of numerous interconnected nodes. Each node typically includes the following components (commands for inspecting them on a running node are sketched after the list):

  • One or more multi-core CPUs (e.g., AMD or Intel).
  • Large amounts of working memory (RAM), often exceeding 256 GB or even 512 GB.
  • One or more graphics processing units (GPUs) to accelerate tasks related to machine learning or complex numerical simulations.
  • Local storage or solid-state drives (SSD).
  • A network interface (often Infiniband) offering high bandwidth and low latency for communication with other nodes.
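
On a Linux-based node, these resources can be inspected with standard command-line tools. The commands below are a minimal sketch; nvidia-smi is available only on GPU nodes with NVIDIA drivers installed, and the local scratch mount point varies between systems:

```bash
lscpu          # sockets, cores, and threads of the node's CPUs
free -h        # total and available system memory (RAM)
nvidia-smi     # GPU models and GPU memory (GPU nodes only)
df -h /tmp     # capacity of local scratch storage (mount point varies)
```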

Types of Nodes

  • Management Node (admin node): Used for system administration; typical user processes are not run here.
  • Login Node: The only node directly accessible from the internet. Users connect to it via SSH to prepare code and jobs, which are then submitted to the queue (see the connection example after this list). This node should not be used for resource-intensive tasks.
  • Compute Nodes: Dedicated to executing user jobs. They have no direct internet access; they are reached through the login node.
  • Storage Nodes: Designed for high-speed data access. Users generally do not interact with these nodes directly; they are accessible through a shared or parallel file system.
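
As an illustration, connecting to the login node and copying data to it might look like the commands below; the hostname, username, and file names are placeholders for whatever your computing center assigns:

```bash
# Connect to the login node over SSH (hostname and username are placeholders).
ssh jane@login.hpc.example.org

# Copy input data from the local machine to a project directory on the cluster.
scp input_data.tar.gz jane@login.hpc.example.org:~/project/
```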

Network Infrastructure

  • High-Speed Internal Network (e.g., Infiniband, Omni-Path): Provides high data rates and low latency between nodes and storage systems (see the commands after this list for one way to check which interconnect a node uses).
  • Front-End (Ethernet) Network: Handles external connectivity, typically restricted to the login node.
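
If you want to check which interconnect a node actually provides, the commands below are one possible starting point; they assume a Linux node with the InfiniBand diagnostic tools (infiniband-diags) installed:

```bash
ibstat          # state, rate, and ports of InfiniBand adapters (if present)
ip link show    # all network interfaces, including the Ethernet front-end
```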

Typical Steps in Using a Supercomputer

  1. Obtain a User Account: This is requested from the relevant institution (university, research institute, etc.).
  2. Connect to the Login Node: Done via SSH (e.g., using PuTTY or OpenSSH).
  3. Upload and Prepare Data: The user transfers code, datasets, etc., to the login node and places them in a home or project directory.
  4. Job Preparation: A script (e.g., a Slurm script) is written (see the example after this list), specifying:
      • The number of required CPU cores.
      • The number and type of GPUs.
      • The amount of memory.
      • The required runtime.
      • The modules and libraries to be loaded (e.g., Python, PyTorch, TensorFlow).
  5. Submitting the Job to the Queue: Using commands such as sbatch or srun, the job is handed over to the Slurm scheduler.
  6. Monitoring the Job: With commands like squeue (view the queue) and sinfo (cluster status), the user can track job progress and start times.
  7. Retrieving the Results: After execution, results are saved to output files, which can be downloaded back to a local machine.
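
The steps above can be made concrete with a small example. A minimal Slurm batch script might look like the sketch below; the resource values, partition and module names, and the train.py program are placeholders that must be adapted to the specific system and software environment:

```bash
#!/bin/bash
#SBATCH --job-name=my-job          # name shown in the queue
#SBATCH --partition=gpu            # partition/queue names differ between systems
#SBATCH --ntasks=1                 # one task (process)
#SBATCH --cpus-per-task=8          # CPU cores for that task
#SBATCH --gres=gpu:1               # request one GPU
#SBATCH --mem=32G                  # system memory for the job
#SBATCH --time=02:00:00            # maximum runtime (hh:mm:ss)
#SBATCH --output=result_%j.out     # output file (%j expands to the job ID)

module load Python                 # module names also differ between systems
srun python train.py               # train.py stands in for the user's program
```

Such a script would typically be submitted and monitored from the login node:

```bash
sbatch job.sh        # hand the script over to the Slurm scheduler
squeue -u $USER      # list your jobs and their current state
sinfo                # overview of partitions and node availability
```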

Difference Between a Regular Computer and a Supercomputer

| Feature | Personal Computer | Supercomputer |
| --- | --- | --- |
| Hardware | 1 CPU with 2–8 cores, up to ~64 GB RAM, 1 GPU | Multiple CPUs per node with many cores (e.g., 32, 64, or 128), 256+ GB RAM, multiple GPUs |
| Network Connection | Standard Ethernet (1–10 Gbit/s) | Specialized high-speed network (e.g., Infiniband) with low latency |
| Usage Model | Interactive work, locally running applications | Jobs are submitted to a queue and executed on compute nodes |
| Operating Systems | Windows, Linux, macOS | Mainly Linux, customized for HPC |
| Job Scheduling | Programs are run manually | A scheduler (e.g., Slurm) assigns resources and manages queues |
| Typical Usage | Everyday tasks, office work, gaming, smaller projects | Scientific computing, simulations, large-scale data analysis, AI/ML |

The most significant difference lies in parallelism: on a supercomputer, you can simultaneously use hundreds or even thousands of CPU cores and graphics processing units (GPUs), drastically reducing the time to complete resource-intensive tasks.
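For a rough sense of scale: a perfectly parallel workload that needs 1,000 core-hours would occupy a single core for about six weeks, but could in principle finish in roughly two hours on 512 cores; real speedups are smaller because most problems contain serial parts and communication overhead.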


Difference Between a Supercomputer and Cloud Service Providers

Supercomputers and cloud services (AWS, Google Cloud, Azure, etc.) represent two different approaches to delivering high-performance computing resources. The key difference is in resource allocation, affecting the efficiency, availability, and adaptability of these systems. The diagram below shows how computing resources in a data center can be organized when used for cloud services.

[Figure: cloud-style resource allocation across the nodes of a data center]

In this figure, multiple users (A, B, and C) are simultaneously using smaller portions of individual nodes under a cloud model. Data center operators achieve this through virtualization, using technologies such as VMware or Xen. Users pay based on resource consumption and usage time. This model offers remarkable elasticity, allowing users to quickly acquire additional CPU cores or memory.

The following three figures illustrate how resource allocation (for users A, B, and C) might work if the data center is employed as a supercomputer.

Initially, User A uses a large portion of the capacity. [Figure: User A occupying most of the nodes]

Then, User B uses a large portion of the capacity. [Figure: User B occupying most of the nodes]

Finally, User C uses the entire supercomputer, but has to wait until Users A and B have finished their jobs. [Figure: User C occupying all of the nodes]

The table below summarizes the differences between cloud service providers and a supercomputing cluster.

| Feature | Cloud Service Provider | Supercomputer |
| --- | --- | --- |
| Access and Payment | Pay-as-you-go; quick to start | Often subsidized or free for academic users; requires an approval process |
| Resource Management | Automatic scaling via a virtualized environment | Resources allocated through job queues, physically limited by available nodes |
| Network Infrastructure | High-speed but general-purpose network | Specialized (Infiniband) with low latency and high throughput |
| User Interface | Web portals, APIs, orchestration tools | SSH and HPC environment (Slurm, modules), often more manual setup |
| Typical Usage | Various services (hosting, databases, AI applications) | Scientific simulations, massive data analysis, advanced machine learning |
| Costs for Intensive Workloads | Can become very expensive for prolonged, resource-heavy usage | Costs are often covered by national or EU research funding |

Choosing between cloud services and a supercomputer depends on availability and specific requirements. Supercomputers are frequently provided for research purposes, funded by national or EU resources, and offer specialized high-speed interconnects as well as an environment tailored to demanding computational challenges.


Practical Uses of Supercomputers

  • Large-Scale Data Analysis: For instance, in genomics, climate modeling, or physics simulations.
  • Machine Learning and Deep Learning: GPUs accelerate training of neural networks in computer vision or natural language processing tasks.
  • Engineering and Physics Simulations: Used in aerospace, construction, chemistry, computational fluid dynamics (CFD), and other scientific fields.
  • Development of New Materials: Molecular dynamics, chemical simulations, and interactions of large molecules require extensive CPU and/or GPU resources.

Summary

A supercomputer is a high-performance cluster of computing nodes, optimized for massive parallel execution. Users connect to a login node, prepare their jobs, and submit them to a queue. The jobs then run on compute nodes according to available resources. Compared to personal computers, a supercomputer significantly reduces the time needed for the most demanding scientific, research, and engineering tasks. Its main difference from cloud providers lies in how resources are allocated within the system.