Myrmics Overview

Myrmics is a parallel, task-based runtime system developed by the Computer Architecture & VLSI Systems Laboratory (CARV) of the Institute of Computer Science (ICS), Foundation for Research & Technology - Hellas (FORTH). Myrmics is designed to run on the 520-core FPGA Prototype, also developed by FORTH-ICS/CARV.

Motivation

The end of the processor performance race at the start of the current century signaled the beginning of the multicore era. To harness the benefits of multiple CPU cores for a single application, programmers must now use parallel programming models. Semiconductor trends hint that processors within the next decade will integrate hundreds of cores on a single chip; the architecture will be heterogeneous, with a few strong (and power-hungry) cores and many weak (and power-efficient) ones; caches will be only partially coherent, or not coherent at all. As the new manycore era approaches, finding a productive and efficient programming model that can scale on such architectures is a major challenge.

Dependency-aware, task-based programming models have gained a significant following. The programmer provides a serial program, split into small functions (tasks) that run to completion, along with information about the data on which each task will operate. A runtime system analyzes the dependencies among tasks, then schedules and executes them in parallel. Despite their increasing popularity, these programming models are not ready to scale to emerging manycore architectures, as they primarily target today’s homogeneous, cache-coherent multicores. Their runtime implementations appear to be neither scalable nor suitable for heterogeneous, less coherent architectures.

Our laboratory actively researches multicore hardware architecture, runtime systems and compilers. More information about our research is available on the laboratory's web pages. In our work over the last few years for the EU-funded projects ENCORE and TEXT, we considered the parallel programming challenges that lie ahead in the coming decade. We are interested in two major problems. First, how should a parallel runtime system be designed so that it can scale well on a manycore processor ten years from now? And second, how can we implement and evaluate such runtime system designs, given that such manycore processors are not yet available?

To address the first problem, we enhance an existing task-based model to support nested parallelism and pointer-based, irregular data structures. We then design and implement Myrmics, a runtime system that realizes this programming model. Myrmics is specifically designed to run on future heterogeneous, non-coherent processors and to scale well using novel, distributed algorithms and policies for hierarchical memory management, dependency analysis and task scheduling. Our experimental results show that Myrmics scales well compared to reference, hand-tuned MPI baselines, while the overhead of automatic parallelization remains modest (10-30%). The results also indicate that many of our proposed algorithms and policies are promising.

To address the second problem, we created a heterogeneous 520-core FPGA prototype modeled faithfully after current predictions for manycore processors. The FPGA prototype is based on Formic, a new printed circuit board that we designed specifically for scalable systems. We used the FPGA prototype to evaluate the Myrmics runtime system, as it allowed us to run code significantly faster than software simulators (we estimate a 50,000 times speedup).

Myrmics Innovations

Existing task-based programming models mainly target cache-coherent, shared-memory workstations with a few tens of CPU cores, or CPU/GPU combinations. The related literature rarely discusses implementation details of the runtime systems. To the best of our knowledge, existing runtime systems for task-based programming models exhibit at least some of the following weaknesses:

  • They assume that only a single master task can spawn new tasks, or that a single CPU node must handle all task generation. Beyond a certain core count, the single master task and/or CPU node becomes a bottleneck.
  • They feature parallel scheduling and dependency analysis, but as they mostly target cache-coherent, shared-memory architectures, they implement algorithms that scale poorly (e.g., using centralized data structures and locking).
  • They are evaluated on systems with a few tens of CPU cores (usually up to 32 or 64).
  • They do not map well to emerging heterogeneous architectures, in which it would be more advantageous if runtime code ran only on the few strong cores and application code only on the many weaker ones.

We argue that how these existing systems will behave on single-chip, manycore, heterogeneous, partly or fully non-coherent processors remains largely unexplored. In emerging manycores, the latencies and CPU configurations will be vastly different from those of both today’s cache-coherent, shared-memory multicores and supercomputer clusters.

We specifically designed the Myrmics runtime system with the primary goal of scaling well on emerging manycore processors. We avoid the problems of existing runtime systems by implementing scalable, hierarchical algorithms for memory management, task dependency analysis and scheduling. We employ scalable, software-based coherency protocols with explicit data transfers to maintain a global address space (much like the one in PGAS languages), even though the underlying architecture is non-coherent. We implement and evaluate Myrmics directly on the 520-core FPGA prototype (a bare-metal software design), which lets us disregard any operating system interference and focus on optimizing the runtime system. To do so, we also develop low-level layers for the FPGA prototype and its peripherals, as well as a limited-functionality but resilient CompactFlash filesystem for data storage.