UKMAC 2016

UK Many-Core Developer Conference 2016

Tuesday 10th May 2016

School of Informatics at the University of Edinburgh

The UK Many-Core Computing Conference is an informal day of talks spanning the whole landscape of accelerated, heterogeneous and many-core computing. Topics of interest include high-performance computing, embedded and mobile systems, computational science, finance, computer vision and beyond. The goal of the event is to develop and bring together the UK community of developers, both industrial and academic.

The 2016 event will be held on Tuesday the 10th May 2016 at the Informatics Forum in Edinburgh. It is the 7th event in the series. Previous meetings have taken place at the University of Cambridge (2010 and 2014), University of Oxford (2009 and 2013), Imperial College (2011) and the University of Bristol (2012). These meetings regularly attract 100 participants and have proved to be invaluable opportunities to meet colleagues and swap stories of many-core successes and challenges.


9.30 - 10.30 Andrew Richards, CEO of Codeplay Software
Keynote: The Challenges of Delivering Massive Parallelism to Real-World Software

The theoretical potential of massive parallelism is huge, but the reality of delivering this capability in real-world commercial devices is a tough challenge. The biggest markets for massive parallelism now seem to be machine learning and machine vision from tiny cameras to self-driving cars and massive data centres. How can we deliver research into real products? How do we standardize the platforms? How do we deliver the tools to make this practical? How do we enable innovation?

Slides of the talk
10.30 - 11.00 Alastair Donaldson, Imperial College London
The Hitchhiker's Guide to Cross-Platform OpenCL Application Development

One of the benefits of programming in OpenCL is platform portability. That is, an OpenCL program that follows the OpenCL specification should, in principle, execute reliably on any platform that supports OpenCL. To assess the current state of OpenCL portability, we provide an experience report examining two sets of open source benchmarks that we attempted to execute across a variety of GPU platforms, via OpenCL. The talk will focus on the portability issues we encountered, where applications would execute successfully on one platform but fail on another. We classify issues into three groups: (1) framework bugs, where the vendor-provided OpenCL framework fails; (2) specification limitations, where the OpenCL specification is unclear and where different GPU platforms exhibit different behaviours; and (3) programming bugs, where non-portability arises due to the program exercising behaviours that are incorrect or undefined according to the OpenCL specification. The issues we encountered provide exciting motivation for future testing and verification efforts to improve the state of OpenCL portability.

Slides of the talk
11.00 - 11.30 Break with Tea and Coffee
11.30 - 12.00 Bill Langdon, University College London
Using evolutionary computing to optimise BarraCUDA

BarraCUDA is a bioinformatics tool that looks up short, noisy DNA sequences in a reference genome; next-generation sequencing tools produce these sequences by the billion. It is a port of the BWA algorithm by six co-authors who included both experts on CUDA and experts in bioinformatics, especially BWA and the BWT compression algorithm. BarraCUDA is open source CUDA code and is available from SourceForge. The existing code was improved by a combination of manual changes and automatic genetic evolution. The genetically improved code has been incorporated and has been available from SourceForge for a year.

As with all implementations of BWT, speed depends upon the length of the DNA reads. For shorter reads a single lowly GT 730 (£50) can be faster than BWA. The GI version of BarraCUDA is up to three times faster than the earlier version of BarraCUDA. The new version has been adopted by Lab7 and IBM (including for Power8).

Slides of the talk
12.00 - 12.30 Gheorghe-Teodor Bercea, Imperial College London
Performance Analysis of OpenMP on a GPU using a CORAL Proxy Application

OpenMP allows programmers to succinctly express parallelism in an abstract declarative way. Starting with the 4.0 release, OpenMP also supports automatic offloading of code regions to be executed on accelerator architectures such as Graphics Processing Units (GPUs). OpenMP has the advantage of operating at compiler level and can deliver efficient low-level architecture-specific optimizations which would otherwise have to be hand-tuned into the application layer.
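For readers unfamiliar with the offloading constructs mentioned above, the following is a minimal sketch of an OpenMP 4.0 target region in C++. It is not taken from the talk or from LULESH; the function name `saxpy` and the choice of kernel are illustrative assumptions. A compiler without offloading support ignores the directive or falls back to the host, and the result is unchanged.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Illustrative kernel (an assumption, not from the talk): y[i] = a * x[i] + y[i].
void saxpy(float a, const std::vector<float>& x, std::vector<float>& y) {
    assert(x.size() == y.size());
    const std::size_t n = x.size();
    const float* xp = x.data();
    float* yp = y.data();

    // map(to: ...) copies x to the device before the loop;
    // map(tofrom: ...) copies y to the device and back afterwards.
    #pragma omp target teams distribute parallel for \
        map(to: xp[0:n]) map(tofrom: yp[0:n])
    for (std::size_t i = 0; i < n; ++i) {
        yp[i] = a * xp[i] + yp[i];
    }
}
```

The single combined directive is what "declarative" means here: the loop body is ordinary C++, and the mapping clauses are the only statement of how data moves between host and device.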

In this paper we analyze the performance of our implementation of the OpenMP 4.0 offloading constructs on an NVIDIA GPU. For the performance analysis we use LULESH, a complex proxy application provided by the Department of Energy as part of the set of CORAL benchmarks. We compare the performance of an OpenMP 4.0 version of LULESH obtained from a pre-existing OpenMP implementation with a functionally equivalent CUDA implementation. Based on the analysis of the performance characteristics of our application we present an extension to the compiler code synthesis process for combined OpenMP 4.0 offloading directives. The results obtained using our OpenMP compilation toolchain show performance within 10% of native CUDA C/C++ for application kernels that have low register counts. Some further work is needed to tackle register pressure in the most complex kernels. We make the following contributions:

  • We report on the end-to-end porting of the LULESH CORAL proxy application on NVIDIA GPUs using OpenMP 4.0 directives.
  • We describe the generic implementation mechanisms that affect performance of OpenMP for LULESH and we relate them to performance metrics. Consequently we introduce optimizations to the compiler code generation and we relate the performance improvement to metrics and features of the generated code.
  • We show how the new memory related constructs in OpenMP 4.5 improve the overall performance of LULESH by optimizing the memory traffic between host and device.

This talk will present work done in collaboration with:
Carlo Bertolli, Samuel F. Antao, Arpith C. Jacob, Alexandre E. Eichenberger, Tong Chen, Zehra Sura, Hyojin Sung, Georgios Rokos, David Applehans, Kevin O’Brien
IBM T.J. Watson Research Center, Yorktown Heights, New York, USA.

12.30 - 12.45 Juan José Fumero, University of Edinburgh
Marawacc: A Framework for Heterogeneous Computing in Java

GPUs (Graphics Processing Units) and other accelerators are nowadays commonly found in desktop machines, mobile devices and even data centres. While these highly parallel processors offer high raw performance, they also dramatically increase program complexity, requiring extra effort from programmers. This results in difficult-to-maintain and non-portable code due to the low-level nature of the languages used to program these devices.

This talk presents Marawacc, a framework for heterogeneous computing in Java. Marawacc comprises a high-level API for simplifying programming, a just-in-time compiler for OpenCL and a runtime for efficient data management. Our goal is to revitalise the old Java slogan, "Write once, run anywhere", in the context of modern heterogeneous systems. Applications written with our high-level framework are transparently accelerated on parallel devices such as GPUs using our runtime OpenCL code generator.

In order to ensure the highest level of performance, we present data management optimizations. Usually, data has to be translated (marshalled) between the Java representation and the representation GPUs use. This paper shows how marshalling affects runtime and presents a novel technique in Java to avoid this cost by implementing our own customised array data structure. Our design hides low-level data management from the user, making our approach applicable even for inexperienced Java programmers.
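The marshalling cost described above can be illustrated with a small sketch. It is written in C++ rather than Java and is not Marawacc's implementation: a vector of pairs stands in for the host-side object representation, and the names `FlatPairs`, `marshal` and `PairArray` are hypothetical. The first function pays a full conversion pass before any kernel could run; the container stores data in the flat layout from the start, so its buffers could be handed to a device without that pass.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Flat layout of the kind a device buffer typically expects (hypothetical type).
struct FlatPairs {
    std::vector<float> first;
    std::vector<float> second;
};

// Marshalling step: visits every element and copies it into flat arrays.
FlatPairs marshal(const std::vector<std::pair<float, float>>& objects) {
    FlatPairs flat;
    flat.first.reserve(objects.size());
    flat.second.reserve(objects.size());
    for (const auto& p : objects) {
        flat.first.push_back(p.first);
        flat.second.push_back(p.second);
    }
    return flat;
}

// Alternative: an array type that keeps the flat layout internally,
// so no conversion pass is needed before handing buffers to a device.
class PairArray {
public:
    explicit PairArray(std::size_t n) : first_(n), second_(n) {}

    void set(std::size_t i, float a, float b) {
        assert(i < first_.size());
        first_[i] = a;
        second_[i] = b;
    }
    std::pair<float, float> get(std::size_t i) const {
        assert(i < first_.size());
        return {first_[i], second_[i]};
    }
    const float* first_buffer() const { return first_.data(); }
    const float* second_buffer() const { return second_.data(); }
    std::size_t size() const { return first_.size(); }

private:
    std::vector<float> first_;
    std::vector<float> second_;
};
```

The user still reads and writes whole pairs through `set` and `get`, which is the sense in which such a design can hide the data layout from the programmer.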

We evaluated Marawacc using applications from different domains, including mathematical finance and machine learning. We compared our framework with native OpenCL C++, JOCL and Aparapi on two different GPUs from AMD and Nvidia. Our framework is, on average, less than 25% slower compared to OpenCL C++. We achieve speedups of up to 645× over sequential Java code when using a GPU.

Slides of the talk
12.45 - 13.00 Rob Stewart, Heriot-Watt University
High level FPGA reconfiguration for remote image processing with a DSL and program transformations
Slides of the talk
13.00 - 14.00 Lunch
14.00 - 14.30 Jeremy Singer, University of Glasgow
Java on NUMA - could be better

Java workloads are increasingly running server-side. Typical servers are manycore NUMA platforms. We show that throughput-oriented Java apps suffer performance degradation due to NUMA effects. We quantify these performance problems for various apps, including big data analytics. If time is $$$ or Joules, then we need to increase performance urgently! We present two optimizations, which we have implemented in OpenJDK, to improve application throughput on manycore NUMA platforms. The optimizations involve (i) refining the garbage collector to improve memory locality and (ii) tuning the number of runtime threads adaptively to avoid congestion. Our results show significant improvements in application execution time.

14.30 - 15.00 Konstantina Mitropoulou, University of Cambridge
Lynx: Using OS and Hardware Support for Fast Fine-Grained Inter-Core Communication

Designing high-performance software queues for fast inter-core communication is challenging, but critical for maximizing software parallelism. State-of-the-art single-producer, single-consumer (SP/SC) queues for streaming applications contain multiple sections, requiring the producer and consumer to operate independently on different sections from each other. While these queues perform well for coarse-grained data transfers, they perform poorly in the fine-grained case.

This paper proposes Lynx, a completely novel SP/SC queue, specifically tuned for fine-grained communication. Lynx is built from the ground up, reducing the generated code on the critical path to just two operations per enqueue and dequeue. To achieve this it relies on existing commodity processor hardware and operating system exception handling support to deal with infrequent queue maintenance operations. Lynx outperforms the state of the art by up to 1.57× in total 64-bit throughput, reaching a peak throughput of 15.7 GB/s on a common desktop system. Actual applications using Lynx get a performance improvement of up to 1.4×.
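As background, the sketch below shows a conventional single-producer, single-consumer ring buffer in C++. It is explicitly not Lynx's design, which the abstract does not spell out; it is a textbook baseline, and the class name `SpscQueue` and its interface are assumptions made for illustration. The full and empty checks in every call are an example of the per-operation bookkeeping that a conventional queue keeps on its critical path.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>

// Baseline SP/SC ring buffer (illustrative; holds at most Capacity - 1 items).
template <std::size_t Capacity>
class SpscQueue {
    static_assert(Capacity >= 2, "one slot is always left empty");

public:
    // Call from the producer thread only. Returns false if the queue is full.
    bool try_enqueue(std::uint64_t value) {
        const std::size_t tail = tail_.load(std::memory_order_relaxed);
        const std::size_t next = (tail + 1) % Capacity;
        if (next == head_.load(std::memory_order_acquire)) {
            return false;  // full
        }
        slots_[tail] = value;
        // Release: the slot write becomes visible before the new tail does.
        tail_.store(next, std::memory_order_release);
        return true;
    }

    // Call from the consumer thread only. Returns false if the queue is empty.
    bool try_dequeue(std::uint64_t& out) {
        const std::size_t head = head_.load(std::memory_order_relaxed);
        if (head == tail_.load(std::memory_order_acquire)) {
            return false;  // empty
        }
        out = slots_[head];
        // Release: the slot read completes before the slot is handed back.
        head_.store((head + 1) % Capacity, std::memory_order_release);
        return true;
    }

private:
    std::uint64_t slots_[Capacity] = {};
    std::atomic<std::size_t> head_{0};  // next slot to read
    std::atomic<std::size_t> tail_{0};  // next slot to write
};
```

Each enqueue here costs two atomic loads, a comparison, a data store and an atomic store, which gives a sense of how aggressive a budget of two operations per enqueue or dequeue is.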

Slides of the talk
15.00 - 15.30 Mozhgan Kabiri Chimeh, University of Glasgow
Architecture without explicit locks for logic simulations on SIMD machines

We propose an architecture without an explicit lock for logic simulation on SIMD multi-core machines. We evaluate its performance on the Intel Xeon Phi and two other machines. This software/hardware combination is compared with reported performances of logic simulation on GPU and supercomputer platforms. Comparisons are also given between the Xeon Phi and other architectures executing stereo vision algorithms. Whilst the Xeon Phi shows a clear advantage for logic simulation, its performance gains for stereo vision are less apparent.

Slides of the talk
15.30 - 16.00 Break with Tea and Coffee
16.00 - 16.30 Alistair Hart, Cray
Porting the parallel Nek5000 application to GPU accelerators with OpenMP4.5

Many-core accelerators offer potentially large performance gains over traditional CPUs, coupled with increased energy efficiency. It is, however, difficult to port applications to take advantage of offload accelerators, particularly if architectural portability is a concern. Directive-based programming models avoid the need for rewriting existing applications in bespoke, low-level languages and allow the developer to focus on the more fundamental issue of managing data locality. Historically, such models have successfully addressed application portability between CPU and particular accelerator architectures. No single model has, however, enjoyed support from multiple compilers across a range of different accelerator targets.

The addition of "device constructs" in the OpenMP4.0 standard promises to address this, and the recent OpenMP4.5 standard (released in November 2015) significantly adds the functionality required to execute parallel applications efficiently on hybrid supercomputers.

In this talk, I describe some first experiences in using OpenMP4.5 to port Nek5000, a Computational Fluid Dynamics simulation code for incompressible fluids based on the semi-spectral finite element method. The code is approximately 70,000 lines long, with simulation elements distributed across multiple nodes using MPI. I will discuss the challenges in porting large applications to accelerators, describing the porting process as well as the algorithmic challenges (and how to identify and overcome these). In particular, the advantages of two OpenMP4.5 features (asynchronously-launched accelerator tasks and direct MPI transfer of accelerator-resident data) will be discussed. This work takes advantage of a pre-release version of the Cray Compilation Environment compiler that supports OpenMP4.5, with the code executing on multiple Nvidia GPUs on a Cray XC30 system.
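As a rough illustration of the first of these features, the sketch below uses OpenMP 4.5 directives in C++ to keep an array resident on the device and launch two kernels as asynchronous target tasks. It is not taken from Nek5000 or from the talk; the function `scale_twice` and its kernel are hypothetical, and direct MPI transfer of device-resident data is not shown. Without OpenMP support the pragmas are ignored and the loops run serially with the same result.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical example: multiplies every element by `factor` twice.
void scale_twice(std::vector<double>& field, double factor) {
    assert(!field.empty());
    double* f = field.data();
    const std::size_t n = field.size();

    // OpenMP 4.5 unstructured mapping: copy the array to the device once.
    #pragma omp target enter data map(to: f[0:n])

    // nowait turns each target region into a deferred task; the depend
    // clauses order the second kernel after the first. The data is already
    // present on the device, so these map clauses trigger no transfer.
    #pragma omp target teams distribute parallel for nowait \
        depend(inout: f[0:n]) map(tofrom: f[0:n])
    for (std::size_t i = 0; i < n; ++i) {
        f[i] *= factor;
    }

    #pragma omp target teams distribute parallel for nowait \
        depend(inout: f[0:n]) map(tofrom: f[0:n])
    for (std::size_t i = 0; i < n; ++i) {
        f[i] *= factor;
    }

    // Host work could overlap with the kernels here; wait before copying back.
    #pragma omp taskwait

    // Copy the result back and release the device copy.
    #pragma omp target exit data map(from: f[0:n])
}
```

The point of the pattern is that the host thread is free between launching the kernels and the `taskwait`, and that the array crosses the host-device link only twice however many kernels touch it.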

Some contrasts will be drawn with the comparable OpenACC programming model, with advice to developers wishing to use both programming models in an application or to migrate from one to the other. I will also discuss the hardware trends to provide some perspective on the future of offload accelerator programming.

Slides of the talk
16.30 - 17.00 Ralph Potter, Codeplay
A C++ Programming Model for Heterogeneous System Architecture

Heterogeneous System Architecture provides hardware specifications, a runtime API and a virtual instruction set to enable heterogeneous processors to interoperate through a unified virtual address space.

However, it does not define a high-level kernel language such as OpenCL C or CUDA. This lack of a high-level kernel language presents a barrier to the exploitation of the capabilities of these heterogeneous systems, motivating our own proposal.

We describe the language extensions, compiler and runtime implementation used to enable the offloading of parallel C++ code to heterogeneous accelerators.

Through the use of automatic call-graph duplication in the compiler, we enable code reuse and sharing of data structures between host processor and accelerator devices with fewer annotations than CUDA or C++AMP. The unified virtual address space is utilized to enable sharing data between devices via pointers, rather than requiring copies. This enables low-latency cross-device communication. Through the use of address space inference, we show how existing standard C++ code can be utilized unmodified on accelerators.

We demonstrate comparable performance to OpenCL across a range of benchmarks.

17.00 Closing


Registration fees are:

  • £30 for regular attendees
  • £10 for student attendees

Registration is now available through the Edinburgh University ePay system

Call for Presentations

We would be grateful to receive offers of 30 minute long presentations on all topics of many-core computing including software, hardware and applications from high-performance computing, embedded and mobile systems, computational science, finance, computer vision and beyond.

To offer a presentation please send an abstract describing the presentation to:
The deadline for submitting abstracts is Friday the 1st April.

Local Organisation

Steering Committee