Last Friday, I attended this year’s edition of the SiCS Multicore Day. It was smaller in scale than last year, being only a single day rather than two days. The program was very high quality nevertheless, with keynote talks from Hazim Shafi of Microsoft, Richard Kaufmann of HP, and Anders Landin of Sun. Additionally, there was a mid-day three-track session with research and industry talks from the Swedish multicore community.
I think that for next year, the organizers need to find keynote speakers who are not from the general-purpose computing side of the multicore world. The Microsoft talk this year was a step in that direction, as it came from multicore programming rather than multicore hardware. Richard and Anders gave very interesting and good talks, no doubt about it. But it would have been nice to have someone from ARM or Freescale or Tensilica or TI or ST or Ericsson or Cisco talking about the kinds of embedded multicore hardware that are being developed and used today. For example, the “next new thing” touted by the keynotes this year was GPGPU. Interesting for HPC and desktops, certainly, but pretty irrelevant for most of the people that I know: GPUs are huge, expensive, and power hungry.
GPGPU was one part of the theme this year. It is definitely catching on as the way to do number crunching in the desktop, server, and HPC world. It is not a universal panacea for every kind of parallelism, however, as Hazim and I noted in the panel discussion that ended the day. There are applications (such as parallel Simics…) that scale well on general-purpose cores but that will never ever work on GPUs. In general, the class of problems that works well on GPUs is pretty much limited to massively data-parallel problems like image and video manipulation.
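To make “data-parallel” concrete, here is a minimal sketch in plain C++ (the image layout and the operation are made up for illustration) of the kind of per-pixel loop that suits a GPU: every iteration is independent of every other, which is exactly the structure GPUs exploit and that most control-plane code lacks.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Brighten a grayscale image stored as one byte per pixel. Each pixel is
// computed independently of all the others, so the loop can be spread over
// thousands of GPU lanes (or, on a CPU, over SIMD lanes and cores).
void brighten(std::vector<std::uint8_t>& pixels, int delta) {
    for (std::size_t i = 0; i < pixels.size(); ++i) {
        int v = static_cast<int>(pixels[i]) + delta;
        if (v < 0) v = 0;          // clamp to the valid 0..255 range
        if (v > 255) v = 255;
        pixels[i] = static_cast<std::uint8_t>(v);
    }
}
```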
In the eternal homogeneous vs heterogeneous debate (follow the tags in my blog for more posts on this topic), GPGPU was grudgingly accepted as a good candidate for something that will not be homogenized with the main processors. Additionally, Richard Kaufmann gave some hints that Intel or AMD are coming out with new chips with more accelerators on board… I guess it will be security, as is already done by Sun and IBM. When I brought up the topic of more accelerators like pattern matching, compression, and the other things we see in chips from Freescale, Cavium, and others, the response was essentially “that can only be economical for very high-volume applications”.
It is striking how the GPGPU idea is bringing the classic telecommunications split between the DSP data plane and the CPU control plane into the desktop and server space. Without any recognition being paid to, or any experience being reused from, the 40 years that this has been done in telecoms and consumer electronics… as Jack Ganssle often says, us embedded folks get no respect.
In terms of programming, this year was all about general programming languages. Hazim from Microsoft talked about (and demoed) the quite pervasive addition of parallelism to both native C/C++ and managed .NET code in Visual Studio 2010. Microsoft is dead serious about parallel programming, and is bringing out a whole set of different libraries and support structures to allow easier expression of parallel code. In the “LINQ” data query language subset of C#, for example, you can add some simple modifiers to “foreach” statements to make them parallel. Having a language that is your own and which you can extend at will certainly pays off in terms of innovation here. C++ moves far more slowly than C#; that is becoming clearer and clearer. C# and its cousins in the .NET system seem to be sneaking in lots of powerful language design ideas from places like Python, as well as results from Microsoft’s powerful group of language researchers.
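On the native C/C++ side that Hazim mentioned, Visual Studio 2010 ships the Parallel Patterns Library; here is a minimal sketch of what expressing a parallel loop with it looks like (the data and the computation inside the lambda are made up for illustration).

```cpp
#include <ppl.h>     // Parallel Patterns Library, shipped with Visual Studio 2010
#include <cmath>
#include <vector>

void scale_all(std::vector<double>& data) {
    // The lambda is applied to the elements in parallel; the Concurrency
    // runtime decides how to split the range across worker threads.
    Concurrency::parallel_for_each(data.begin(), data.end(), [](double& x) {
        x = std::sqrt(x) * 2.0;   // stand-in computation
    });
}
```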
When I tried to bring up the idea of using domain-specific languages to program parallel applications, Hazim had the wonderful comment that “that might be applicable in certain domains…” — yes, that is the idea. By being narrow in terms of target domains, you gain expressive power and semantic insight that help move programming from “how” towards “what”. But it sounds like domain-specific is a foul word inside of Microsoft — when the audience asked whether LINQ was not exactly a domain-specific language for data access, Hazim was at pains to point out that it is Turing-complete and that someone had managed to write a ray tracer in it… Interesting. This feels more political than market-based. I guess Microsoft…
Richard Kaufmann had some interesting notes on throughput vs TTC (time-to-completion) jobs in servers. In the “cloud computing” era, throughput is much easier to scale: just add more servers. Classic HPC is more oriented towards TTC, as you do want your results within a reasonable time. Quite often, you can move most work into a throughput-oriented style by simply running lots of jobs in parallel rather than pushing a series of jobs through sequentially. Note, however, that the entire fields of real-time control, real-time communication, and so on do not work like this. But that is not the market that HP is building servers for, or that Intel and AMD are servicing.
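As a minimal sketch of what the throughput-oriented style amounts to (the job itself is a hypothetical stand-in): run independent jobs side by side and collect the results, rather than pushing them through one after another.

```cpp
#include <future>
#include <vector>

// Stand-in for one independent job; purely hypothetical workload.
static int run_job(int job_id) { return job_id * job_id; }

std::vector<int> run_batch(int n_jobs) {
    // Each job stands alone, so adding workers (cores, servers) raises
    // throughput, while the time-to-completion of any single job is no
    // better than when it runs by itself.
    std::vector<std::future<int>> pending;
    for (int id = 0; id < n_jobs; ++id)
        pending.push_back(std::async(std::launch::async, run_job, id));

    std::vector<int> results;
    for (auto& f : pending)
        results.push_back(f.get());
    return results;
}
```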
Outside the keynotes, Per Holmberg of Ericsson gave an interesting presentation on the adoption of multicore in the control plane of the Ericsson CPP platform. The core of his talk was the observation that in these kinds of systems, multicore is not such a big revolution.
They have been distributed since the beginning, so scaling by adding more processors (with local memories) is easy, and multicore is only a packaging change from that. Also, most performance-intense operations are already offloaded onto DSP groups, network processors, ASICs, or FPGAs, so there is not much parallelism left for the control plane to exploit. Essentially, only functions that unexpectedly become performance bottlenecks due to changes in traffic patterns are likely candidates for parallelization. Interesting point, and it might be why EETimes noted that multicore is slow to catch on in communications (even if the article is a bit flawed).
Patrik Nyblom from Ericsson held a talk about how the Erlang runtime engine was parallelized. From a practical perspective, the most interesting aspect was that this made applications parallel without changing a single line of code in the applications. Of course, applications had to be threaded to start with, but that is the most natural way in Erlang. He mentioned systems containing up to a quarter of a million threads — hard to do that in anything except Erlang.
He described how they had evolved from a simple implementation that worked well on synthetic benchmarks to a truly industrial-strength implementation. The difference was quite radical, as real codes feature more complex communication patterns and make heavy use of device drivers and network stacks. This process forced the use of more and finer-grained locks, and a rethinking of the balance between shared and separate heaps for threads.
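A minimal sketch of what “more and finer locks” means in general (the data structure and shard count here are made up, not the Erlang VM’s internals): split one heavily contended lock into many smaller ones, so unrelated updates stop fighting over the same mutex.

```cpp
#include <array>
#include <mutex>
#include <unordered_map>

// One mutex per shard instead of one mutex guarding the whole table;
// updates to keys that land in different shards no longer contend.
class ShardedCounters {
    static const unsigned kShards = 16;
    struct Shard {
        std::mutex lock;
        std::unordered_map<int, long> counts;
    };
    std::array<Shard, kShards> shards;

public:
    void increment(int key) {
        Shard& s = shards[static_cast<unsigned>(key) % kShards];
        std::lock_guard<std::mutex> guard(s.lock);  // only this shard is held
        ++s.counts[key];
    }
};
```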
They also had the opportunity to test their solution on a Tilera 64-core machine. This mercilessly exposed any scalability limitations in their system, and confirmed the conventional wisdom that scaling to ten or more cores is quite different from scaling from 1 to 8… The two key lessons they learned were that no shared lock goes unpunished, and that data has to be distributed as well as code. Very interesting to hear this story from real software developers solving real problems.
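And a minimal sketch of the second lesson, distributing data as well as code (again a generic illustration, not their implementation): give each thread its own private slice of the data, so the hot path never touches a shared lock and the per-thread results are combined only once at the end.

```cpp
#include <cstddef>
#include <thread>
#include <vector>

// Each thread accumulates into its own slot in 'partials'; no shared lock
// is taken in the hot loop, and the partial results are merged at the end.
long parallel_sum(const std::vector<int>& data, unsigned n_threads) {
    std::vector<long> partials(n_threads, 0);
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < n_threads; ++t) {
        workers.emplace_back([&, t] {
            for (std::size_t i = t; i < data.size(); i += n_threads)
                partials[t] += data[i];   // private to this thread
        });
    }
    for (std::thread& w : workers) w.join();

    long total = 0;
    for (std::size_t t = 0; t < partials.size(); ++t) total += partials[t];
    return total;
}
```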
The next multicore event taking place around here is the Second Swedish Workshop on Multicore Computing (MCC 2009), in Uppsala, November 26-27.
Update: note that the presentations from the event are available via http://www.multicore.se/.