For a research project I am designing a massively parallel computing system. Not 100, not 1,000, but on the order of 5,000 - 10,000 microcontrollers/microprocessors working in parallel. I do not need lots of interfaces (who wants 5,000 USB ports anyway?) - just computing power on each of the nodes.
We are currently evaluating which microcontroller/microprocessor to use. We have defined our top priorities (in ascending order):
-) price (as we need 5,000-10,000, it should be <= 2…3 US$)
-) 32-bit, maybe (just maybe) 16-bit
-) high clock rate, >= 70 MHz, the more the better
-) internal SRAM >= 8 KB, better 16 KB or 32 KB or …
-) >= 2 serial ports, better 4
-) >= 1 SPI port
I might need to trade some of these features against others, obviously!
I guess I need a “small but fast” microcontroller without anything fancy!
Any ideas / suggestions? Currently we are looking at the NXP LPC2103, which is available for about 2.8 US$ at 5,000 pcs. But it's short on memory and we'd prefer a higher clock rate. We are not committed to ARM, so any other suggestions are welcome as well!
Where would I search for such quantities? I guess Digikey is not the first choice
You may consider the new ST family with the Cortex-M3 core, for example the STM32F103: 3 UARTs, 2 SPI ports, USB, 72 MHz, 20 KB RAM. It is much more efficient at interrupt handling than the ARM7.
Another option is to use softcores in a large FPGA. It would save a lot of wiring and the inter-processor comms could be nice and fast. Expansion would be easy, just add more FPGAs. Designing your own core would mean that it could be optimised for such a highly-parallel system.
How are you going to program the thing? I designed systems based on the Inmos transputer many years ago; Inmos designed the Occam language specifically for parallel MIMD machines.
I have actually been thinking recently that it would be a fun and interesting project to try to implement a small Connection Machine-style SIMD machine with single-bit processors in an FPGA. It would also be a neat project for learning Verilog, since designing a single-bit processor seems relatively simple (it's essentially just a bit-serial ALU with some extra registers).
When I was designing the dsPIC cluster, it didn't take long to realize that, even looking at raw processing power alone (setting aside the problem of designing a fast communications network between the processors/microcontrollers), the performance-to-cost ratio isn't very good for general computational problems requiring a lot of math. For instance:
A dsPIC at 30 MIPS requires about 100 clock cycles to emulate one floating point operation, bringing its performance to about 0.3 MFLOPS. A Cell processor (in a PS3) using vectorized code has 6 cores that can each perform 4 single-precision floating point operations per clock cycle, for (in the ideal case) about 24 FLOPs per clock cycle, or (running at 2.4 GHz) 57,600 MFLOPS (this is not counting the onboard dual-core PowerPC processor). To achieve the same level of performance on (for example) a dsPIC cluster would take about 192,000 dsPICs, assuming you could solve the problems of communication speed. At $3/dsPIC that's a very large sum - over half a million dollars. For an ARM cluster (I don't know a great deal about ARM processors yet), even assuming the processors ran at 80 MIPS and could execute 1 FLOP per clock cycle, making them 80 MFLOPS each, you would still need 720 of them to achieve the same processing performance (ignoring the communications problems, again) as a single Cell processor. At say $3 each, that would be about $2,160 just for processors.
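If you want to play with the numbers yourself, the arithmetic above boils down to this (all the constants are my assumptions from the previous paragraph, not datasheet values):

```c
/* Back-of-the-envelope check of the figures above. */
#include <stdio.h>

int main(void) {
    double dspic_mflops = 30.0 / 100.0;   /* 30 MIPS, ~100 cycles per emulated FLOP -> 0.3 MFLOPS */
    double cell_mflops  = 6 * 4 * 2400.0; /* 6 cores x 4 SP FLOPs/cycle x 2400 MHz -> 57,600 MFLOPS */
    double arm_mflops   = 80.0;           /* assumed 80 MIPS ARM at 1 FLOP/cycle */

    printf("dsPICs needed: %.0f (~$%.0f at $3 each)\n",
           cell_mflops / dspic_mflops, 3.0 * cell_mflops / dspic_mflops);
    printf("ARMs needed:   %.0f (~$%.0f at $3 each)\n",
           cell_mflops / arm_mflops, 3.0 * cell_mflops / arm_mflops);
    return 0;
}
```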
I am very, very much interested in building clusters of microcontrollers or FPGAs for fun, for learning, and just to watch them work once they're built, but unless you have a specific type of computing application that is particularly well suited to such a cluster, it likely won't be quicker than a modern PC. On the other hand, you would get to build it, and that would be very fun!
Thanks for the initial replies so far! Indeed, we have a very special project in mind in which we simulate a large network that only requires nearest-neighbor connections, hence a regular grid is sufficient. Looking at the cost of at least 5,000 x 3 US$ (better 10,000 x 3 US$) plus some support electronics, I guess this will be more than a "toy" project.
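For concreteness, the nearest-neighbor bookkeeping on such a grid is trivial - something like this, where the dimensions and names are just for illustration (no wraparound, edge nodes simply have fewer links):

```c
/* Sketch: mapping a node's (x, y) position to its four grid neighbors. */
#define WIDTH   8
#define HEIGHT  8
#define NO_LINK (-1)

typedef struct { int north, south, east, west; } links_t;

static links_t neighbors(int x, int y) {
    links_t l;
    l.north = (y > 0)          ? (y - 1) * WIDTH + x : NO_LINK;
    l.south = (y < HEIGHT - 1) ? (y + 1) * WIDTH + x : NO_LINK;
    l.west  = (x > 0)          ? y * WIDTH + (x - 1) : NO_LINK;
    l.east  = (x < WIDTH - 1)  ? y * WIDTH + (x + 1) : NO_LINK;
    return l;
}
```

So for an 8x8 prototype, node (0,0) talks only to (1,0) and (0,1), and each node needs at most four links.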
I like the idea of using FPGAs; however, we need to have a small prototype (maybe 4x4 or 8x8 nodes) working quickly, ideally within this month. I have never used FPGAs, so I think that is not an option for me - maybe later.
The STM32F103 looks great! The version with 10 KB SRAM is roughly as expensive as the LPC2103; the one with 20 KB SRAM is about 4 US$ @ 5,000 (according to Digikey).
I am happily looking forward to further suggestions!
Just curious… what would be the point of doing this?
No, seriously… I am working in academia and we are interested in biologically plausible information processing (how brains make sense of sensory perceptions). We have several simple distributed algorithms (see e.g. "Boltzmann machine" on Wikipedia) that we run on single computers - and even if we run them on a network in our institute, we might have 20 or 50 computers, and it still takes a long time to compute.
With this project we want to show that the developed algorithms are fully parallelizable - no common memory, hardly any waiting - such that we can distribute them over a network of cheap microcontrollers and get a response (an equilibrium state in the network) significantly faster than on any current computer.
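To give an idea of what each node would compute, here is a rough sketch of one stochastic unit update (a Gibbs step, as in a Boltzmann machine). All names and the PRNG scheme are just illustrative, not our actual code - each node would hold only its own state, its weights, and the last states heard from its neighbors over the serial links:

```c
#include <math.h>
#include <stdint.h>

#define N_NEIGHBORS 4

/* Per-node xorshift32 PRNG - seed with the node's own (nonzero) ID,
   so no shared randomness source is needed. */
static uint32_t rng_state = 1;
static uint32_t xorshift32(void) {
    rng_state ^= rng_state << 13;
    rng_state ^= rng_state >> 17;
    rng_state ^= rng_state << 5;
    return rng_state;
}

/* One Gibbs step for a binary unit: fire with probability
   sigma(sum_j w[j]*s[j] / T). Neighbor states s[] are refreshed
   from the serial links between calls. */
static int update_unit(const float w[N_NEIGHBORS],
                       const int s[N_NEIGHBORS], float temperature) {
    float net = 0.0f;
    for (int j = 0; j < N_NEIGHBORS; j++)
        net += w[j] * (float)s[j];
    float p = 1.0f / (1.0f + expf(-net / temperature));
    return ((xorshift32() >> 8) / 16777216.0f) < p; /* uniform in [0,1) */
}
```

Nothing in that loop needs shared memory or synchronization - each node only exchanges single state bits with its four neighbors.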
If we succeed with such a test system, we might receive funding for a large network with significant computing power, e.g. 16,000 pcs of the PXA320 at 800 MHz - or at least the LPC3180 at 200 MHz.
Certainly for single quantities, em.avnet.com has better pricing for LPC chips. The saving doesn't seem to be much for the LPC2103, but it's a couple of bucks for the LPC2378.
Sorry, no big news so far. We have been busy looking into SIMD setups (e.g. by nVidia and AMD) - and as you (or someone else here) pointed out, they seem to outperform a grid of microcontrollers substantially, both in FLOPS and in cost. Well, we'll see. We'll discuss the advantages and disadvantages of both approaches shortly and take a go/no-go decision afterwards.
The microcontroller grid has a few advantages:
-) operates asynchronously
-) can generate "randomness" much more easily than SIMD (e.g. a per-node PRNG, as in the sketch above)
-) is “infinitely” extensible, given money, space, and power
-) serves as a multi-purpose (MIMD) machine, whereas SIMD cores typically all run the same code
But the price…
If we continue, we'll probably build a small system of LPC2103s (just because I already know these microcontrollers well) and see how well it works. By "small" I mean on the order of 8x8 or at most 16x16 initially.
It looks as though they do currently have a very limited number of USB demo modules that they will give to people/businesses with good applications for their technology.