The SICS Multicore Day August 31 was a really great event! We had some fantastic speakers presenting the latest industry research view on multicores and how to program them. Marc Tremblay did the first presentation in Europe of Sun’s upcoming Rock processor. Tim Mattson from Intel tried hard to provoke the crowd, and Vijay Saraswat of IBM presented their X10 language. Erik Hagersten from Uppsala University provided a short scene-setting talk about how multicore is becoming the norm.
The Rock is a very interesting piece of work. It tries to be both a throughput-oriented design like the Niagara/UltraSPARC T machines and a single-thread high-performance design, even though on balance it is skewed more towards the throughput side. What is very cool is how they use additional threads to boost the performance of a main thread using “scout threads” (a concept I saw presented back at ISCA 2004). This makes it possible to use threads either to boost single-thread performance OR to do throughput, creating a more flexible design than is usually the case. It is also the first commercial implementation of transactional memory. And 16-way. And due for next year.
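The appeal of transactional memory is that the programmer marks a block atomic and the hardware executes it speculatively, retrying on conflict, instead of the programmer juggling fine-grained locks. A toy software analogue of that optimistic execute-validate-commit cycle, sketched in Python (the `TVar`/`atomic` names are my own illustration, not Rock's interface):

```python
import threading

class TVar:
    """A transactional variable: a value plus a version counter."""
    def __init__(self, value):
        self.value = value
        self.version = 0

_commit_lock = threading.Lock()  # single commit point for this toy STM

def atomic(txn, tvars):
    """Run txn optimistically; retry if any TVar changed underneath us.

    txn gets a snapshot dict of values and returns a dict of new values.
    Hardware TM does a similar speculate-validate-commit dance in the caches.
    """
    while True:
        with _commit_lock:  # take a consistent snapshot of values + versions
            snapshot = {n: (v.value, v.version) for n, v in tvars.items()}
        updates = txn({n: val for n, (val, _) in snapshot.items()})
        with _commit_lock:
            # Validate: if someone else committed in between, abort and retry.
            if all(tvars[n].version == ver for n, (_, ver) in snapshot.items()):
                for n, new_val in updates.items():
                    tvars[n].value = new_val
                    tvars[n].version += 1
                return

# Eight threads each move 1 unit from a to b, 100 times, with no
# per-account locks; the transactions still never lose an update.
a, b = TVar(100), TVar(0)
accounts = {"a": a, "b": b}
transfer = lambda vals: {"a": vals["a"] - 1, "b": vals["b"] + 1}
threads = [threading.Thread(
    target=lambda: [atomic(transfer, accounts) for _ in range(100)])
    for _ in range(8)]
for t in threads: t.start()
for t in threads: t.join()
print(a.value, b.value)  # the total of 100 is conserved
```

The point is the shape of the code: the transfer logic says nothing about locks or ordering, and the conflict handling lives entirely in the `atomic` machinery, which is exactly the part hardware TM promises to do for you.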
So far, Rock seems like a very successful and visionary project that is trying, in yet another way, to gain momentum through pure hardware innovation. Just as with the UltraSPARC T line, Sun is trying to out-invent IBM and Intel/AMD, who seem to be progressing mostly by piling on more of the same old features. I really hope this play goes well: if we were down to just IBM/PPC & System z and Intel-AMD/x86-64 on the server and desktop side, the world would be too boring.
The Intel and IBM talks on programming were both grounded in the idea that for people to accept a new programming language or API, it has to be an evolution of what programmers already know. That pretty much ties us down to C/C++/Java/C# with extensions and modified semantics.
X10 is basically Java with some nicely considered features to support local and global memories, so that programs can scale to BlueGene-style massively clustered machines. Tim, for his part, basically told everyone to stop inventing new languages and instead focus on improving existing frameworks like MPI and OpenMP in collaboration with industry. Tim is a great presenter with a very funny style, and he tried hard to get the audience to react. In this crowd, most people agreed, except the Erlang people, who feel that they have a better solution to multithreading and multicore than any patched-up language in the C-Java family. I must agree with them, and I feel that Erlang today is mature enough to serve that purpose.
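The Erlang argument rests on share-nothing processes that own their state and interact only through messages, so there is simply no shared memory to lock. A rough analogue of that model, sketched in Python with one mailbox queue per "process" (the `spawn`/`counter` names are my own illustration, not Erlang's API):

```python
import queue
import threading

def spawn(fn, *args):
    """Start a 'process' with its own mailbox, Erlang-style."""
    mailbox = queue.Queue()
    threading.Thread(target=fn, args=(mailbox, *args), daemon=True).start()
    return mailbox

def counter(mailbox, count=0):
    """Owns its count; other processes can only send it messages."""
    while True:
        msg, reply_to = mailbox.get()
        if msg == "incr":
            count += 1
        elif msg == "get":
            reply_to.put(count)
        elif msg == "stop":
            return

# No locks anywhere: the counter state is touched by exactly one thread,
# and the mailbox serializes all requests in arrival order.
me = queue.Queue()
c = spawn(counter)
for _ in range(5):
    c.put(("incr", None))
c.put(("get", me))
result = me.get()
print(result)  # 5
c.put(("stop", None))
```

In real Erlang the processes are far cheaper than OS threads and the receive construct pattern-matches on messages, but the design point is the same: concurrency bugs become protocol bugs, which are much easier to test for.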
The panel session at the end was very entertaining; some of us (including myself and Joe Armstrong) tried to put tough questions to the keynote speakers (and to Ulf Wiger of Ericsson). It was quite lively, and a rare chance to interact directly with industry heavyweights who otherwise tend to sit on the other side of the Atlantic.
I think the prize for coolest tech of the day goes to QuviQ, a spin-off from Chalmers building automated testing tools that work really well for parallel and distributed systems. Their method of shrinking a failed test case down to a minimal trace is really interesting, and it finds bugs that no human tester would ever find.
I also presented a talk on “Debugging Multicore Software using Virtual Hardware” in the breakout sessions. I guess our Tools track was the least visited of the three, but the audience asked good questions, and there were some nice discussions afterwards.
However, to summarize the day, I am a bit disappointed that more is not being done on the hardware side to help people debug their multicore and multiprocessor parallel programs. Transactional memory is all nice and dandy and can help simplify low-level locking in threaded programs, but I would like to see much more in terms of smart tracing, hardware breakpoints and triggers, massive synchronized stops, and similar features, along with instructions that make expressing parallelism simpler. Here, the embedded folks doing things like ARM CoreSight seem to have been much more successful than the server-class designers at Sun, Intel, and IBM. But even ARM does not spend more than 10-15% of the chip area on debug support.
I think it would be interesting to see what would happen if you could spend 25-30% of the chip on some seriously powerful debug features: full support for remote control of all cores at the same time, lots of bandwidth for debug data and commands, fat traces of all traffic on and off the chip, and performance and event counters everywhere. The peak performance of such a chip would likely be lower than that of a competing chip not spending as much area on debug support, but it would make achieving high utilization much easier, and that might actually make the debug-intense chip more economical. It would be interesting to try. But I guess nobody would dare to buy such a design.