How FPGAs work, and why you'll buy one

Update (June 21): this article has been published at embeddedrelated.com, where I hope to publish a follow-up soon.

Today, pretty much everyone has a CPU, a DSP and a GPU, buried somewhere in their PC, phone, car, etc. Most don't know or care that they bought any of these, but they did.

Will everyone, at some future point, also buy an FPGA? The market size of FPGAs today is about 1% of the annual global semiconductor sales (~$3B vs ~$300B). Will FPGAs eventually become a must-have, or will their volume remain relatively low?

We'll try to answer this question below. In order to see how popular FPGAs could become, we'll need to discuss what FPGAs are. FPGAs are a programmable platform, but one designed by EEs for EEs rather than for programmers. So for many programmers, FPGAs are exciting yet mysterious; I hope our discussion will help demystify them.

We'll start with a common explanation of FPGAs' relatively low popularity. We'll see why that explanation is wrong – and why, if we take a closer look, we actually come to expect FPGAs to blow the competition out of the water!

This will conclude today's installment, "Why you'll buy an FPGA". A sequel is in the making, titled "Why you won't buy an FPGA". There, we'll see some of the major obstacles standing between FPGAs and world domination.

The oft-repeated wrong answer

…to the question of "why aren't FPGAs more popular?" is, "FPGA is a poor man's alternative to making chips. You can implement any circuit design in an FPGA, but less efficiently than you could in an ASIC or a custom design. So it's great for prototyping, and for low-volume products where you can't afford to make your own chips. But it makes no sense for the highest-volume devices – which happen to add up to 99% of sales, leaving 1% to FPGAs."

This is wrong because programmability is a feature, not just a tax on efficiency.

Of course a Verilog program doing convolution on an FPGA would run faster if you made a chip that runs just that program. But you typically don't want to do this, even for the highest-volume products, any more than you want to convert your C programs running on CPUs into dedicated hardware! Because you want to change your code, run other programs, etc. etc.

When programmability is required – which is extremely often – then the right thing to compare FPGAs to is another programmable platform: a DSP, a GPU, etc. And, just like FPGAs, all of these necessarily introduce some overhead for programmability. So we can no longer assume, a priori, that any one option is more efficient than another – as we did when comparing FPGAs to single-purpose ASICs.

We need benchmarks – and FPGAs' performance appears very competitive in some benchmarks. Here's what BDTI's report from 2007 says:

…we estimated that high-end FPGAs implementing demanding DSP applications … consume on the order of 10 watts, while high-end DSPs consume roughly 2-3 watts. Our benchmark results have shown that high-end FPGAs can support roughly 10 to 100 times more channels on this benchmark than high-end DSPs…

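So for that benchmark, FPGAs offer 10x-100x the runtime performance, and 2x-30x the energy efficiency of DSPs (10 watts vs 2-3 watts is roughly 3.3x-5x the power, so 10x/5x = 2x at worst and 100x/3.3x = 30x at best) – quite impressive!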

But wait – how are they so efficient?

FPGAs are no longer FPGAs

Aren't FPGAs Field-Programmable Gate Arrays?

Programmable gate arrays can't multiply as efficiently as dedicated multipliers, can they? A dedicated multiplier is a bunch of gates connected with wires – the specific gates that you need for multiplying, connected specifically to the right other gates as required for multiplication.

A programmable gate array is when your gates are generic. They index into a truth table (called a look-up table or LUT) with their inputs, and fetch the answer. With a 2-input LUT, you get an OR gate or an AND gate or whatever, depending on the truth table you programmed. With 3-input LUTs, you can have a single gate computing, say, (a&b)|c, but the principle is the same:

This absolutely must be bigger and slower than just an OR gate or an AND gate!
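
To make the truth-table idea concrete, here is a minimal behavioral sketch of a 3-input LUT in Verilog. The module name lut3 and the TRUTH_TABLE parameter are made up for illustration; a real FPGA loads these bits from its configuration bitstream rather than from a compile-time parameter. The default value 8'hF8 encodes (a&b)|c, assuming the inputs are concatenated as {c,b,a}.

```verilog
// Hypothetical behavioral model of a 3-input LUT (not vendor code).
module lut3 #(parameter [7:0] TRUTH_TABLE = 8'hF8) ( // 8'hF8 is (a&b)|c
  input  a, b, c,
  output y
);
  // The 8 configuration bits, one per input combination.
  wire [7:0] tt = TRUTH_TABLE;

  // The inputs form an index; the output is the stored bit at that index.
  assign y = tt[{c, b, a}];
endmodule
```

With the same indexing, setting TRUTH_TABLE to 8'h80 would give a 3-input AND and 8'hFE a 3-input OR: the hardware stays the same and only the stored bits change.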

Likewise, wires go through programmable switch boxes, which connect wires as instructed by programmable bits:

There are several switch box topologies determining which wires can be connected to which. But whatever the topology, this must be bigger and slower than wires going directly to the right gates.
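
A programmable connection can be modeled, very roughly, as a multiplexer whose select lines are configuration bits. The sketch below assumes a hypothetical 4-way routing point named route_mux4. Real switch boxes are built differently (and are often bidirectional), so treat this only as an illustration of why a configurable route costs more than a plain wire.

```verilog
// Hypothetical model of one programmable routing point.
// SEL stands in for configuration bits that pick which incoming wire
// drives the outgoing wire.
module route_mux4 #(parameter [1:0] SEL = 2'd0) (
  input  [3:0] wires_in,
  output       wire_out
);
  // A fixed wire would be "assign wire_out = wires_in[0];" with no mux at all.
  assign wire_out = wires_in[SEL];
endmodule
```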

All this is indeed true, and a "bare" FPGA having nothing but programmable gates and routers cannot compete with a DSP. However, today's FPGAs come with DSP slices – specialized hardware blocks placed amidst the gates and routers, which do things like multiply-accumulate in "hard", dedicated gates.
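
As a rough illustration of what a DSP slice does, here is a minimal multiply-accumulate sketch. The module name mac, its parameters and the clear input are assumptions made for this example. Synthesis tools commonly map this pattern onto a "hard" DSP slice when one is available, though whether they do depends on the tool and the device.

```verilog
// Hypothetical multiply-accumulate unit: acc <= acc + a*b every cycle.
module mac #(parameter W = 8, parameter ACC_W = 24) (
  input                  clk,
  input                  clear, // 1: reset the accumulator to zero
  input      [W-1:0]     a, b,  // unsigned operands
  output reg [ACC_W-1:0] acc
);
  always @(posedge clk) begin
    if (clear)
      acc <= 0;
    else
      acc <= acc + a * b; // product is evaluated at the full ACC_W width
  end
endmodule
```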

So that's how FPGAs compete with DSPs – they have DSP hardware in them! Cheating, isn't it?

Well, yes and no.

It's "cheating" in the sense that FPGAs aren't really FPGAs any more – instead, they're arrays of programmable gates plus all that other stuff. A "true FPGA" would look like this:

Instead, a high-end modern FPGA looks like this:

To be competitive in DSP applications, FPGAs need DSP slices – ALUs doing things like multiply-accumulates.

To be competitive in applications needing a CPU – which is most of them – today's FPGAs have more than just specialized ALUs. They have full-blown ARM cores implemented using "hard", non-programmable gates!

So you've been "cheated" if you thought of FPGAs as "clean slates" suitable for any design. In reality, FPGAs have specialized hardware to make them competitive in specific areas.

And you can sometimes guess where they're less competitive by observing which specializations they lack. For instance, there are no "GPU slices", and indeed I don't believe FPGAs can compete with GPUs in their own domain as they compete with DSPs. (Why not simply add GPU slices then? Here the plot thickens, as we'll see in the follow-up article.)

But of course having DSP slices is more than just "cheating" – because look at just how many DSP slices FPGAs have. The cheapest FPGAs can do hundreds of multiply-accumulates simultaneously! (My drawing above has the wrong scale – imagine hundreds of small DSP slices near a couple of much larger CPUs.)

And hundreds of MACs is a big deal, because while anyone can cram a load of multipliers into a chip, the hard part is to connect it all together, letting a meaningful program actually use these multipliers in parallel.

For instance, TI's C64 DSPs can do 8 MACs per cycle – but only if it's a dot product. TI's C66 DSPs can do 32 MACs/cycle – but only if you're multiplying complex numbers. You only get the highest throughput for very specific data flows.

To the extent that the FPGA architecture lets you actually use an order of magnitude more resources at a time, and do that in more real-life examples, it is a rather unique achievement. And this is how they actually beat dedicated DSPs with their DSP slices, not just reach the same performance.

FPGA as a programmable accelerator architecture

So what makes FPGAs such an efficient architecture? There's no simple answer, but here are some things that FPGAs can use to their advantage:

  • No need for full-blown ALUs for simple operations: a 2-bit adder doesn't need to be mapped to a large, "hard" DSP slice – it can fit comfortably in a small piece of "soft" logic. With most processors, you'd "burn" a full-blown ALU to do the simplest thing.
  • No need for a full cycle for simple operations: on FPGAs, you don't have to sacrifice a full cycle to do a simple operation, like an OR, which has a delay much shorter than a full cycle. Instead, you can feed OR's output immediately to the next operation, say, AND, without going through registers. You can chain quite a few of these, as long as their delays add up to less than a cycle. With most processors, you'd end up "burning" a full cycle on each of these operations.
  • Distributed operand routing: most processors have their ALUs communicate through register files. With all the ALUs connected to all the registers, there's a bottleneck – this interconnect grows as the product of the number of ALUs and registers, so you can't have too many of either. FPGAs spread ALUs and registers throughout the chip, and you can connect them in ways not creating such bottlenecks – say, as a long chain, as a tree, and in many other ways. Of course you can also route everything through a bottleneck, and then your design will run at a low frequency – but you don't have to. CPUs and DSPs run at a high frequency – because the number of ALUs and registers was limited to make that frequency possible. But in FPGAs you can get both high frequencies and a lot of resources used in parallel.
  • Distributed command dispatching: a 2-issue or a 6-issue processor is common, but 100-issue processors are virtually unheard of. Partly it's because of the above-mentioned operand routing, and partly it's because of command dispatching – you'd have to fetch all those commands from memory, another bottleneck. In FPGAs, you can implement command-generating logic in simple state machines residing near your ALUs – and in the simplest case, commands are constants kept in registers residing near ALUs. This lets you easily issue 100 parallel instructions.
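
The first two points in the list above can be sketched in a few lines of Verilog. The module chain_demo and its signal names are invented for this example. It shows a 2-bit adder that is free to live in "soft" logic, and an OR feeding an AND with no register in between, so both operations complete within a single clock cycle.

```verilog
// Hypothetical example: small operations and sub-cycle chaining.
module chain_demo (
  input            clk,
  input      [1:0] x, y,    // operands of a tiny 2-bit adder
  input            p, q, r,
  output reg [2:0] sum,     // 3 bits, so the carry is kept
  output reg       flag
);
  // Combinational chain: the OR output goes straight into the AND.
  wire p_or_q  = p | q;
  wire chained = p_or_q & r;

  always @(posedge clk) begin
    sum  <= x + y;   // a 2-bit add, no full-blown ALU needed
    flag <= chained; // both gates evaluated within one cycle
  end
endmodule
```

How long such a chain can get before it limits the clock frequency depends on the device and on what the synthesis tool reports.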

This "distributed" business is easier to appreciate by looking at an example. Here's a schematic implementation of a 1D convolution on an FPGA – you convolve a long vector v with an N-coefficient filter f, computing, at every i, f0*v[i] + f1*v[i-1] + f2*v[i-2] + … + fN-1*v[i-(N-1)]:

In this drawing, N=8, but it scales easily to arbitrary N, producing results at a slightly larger latency – the summation tree depth being log2(N).

The orange boxes are registers; commands like + and * are stored in registers, as are inputs and outputs. (I'm feeding the output of * to + directly without going through a register to save screen space.) Every clock cycle, inputs are fed to ALUs, and the outputs become the new register values.

Orange boxes (registers) spread amongst green boxes (ALUs) illustrate "distributed operand and command routing". If you wonder what it all looks like in code, Verilog source code corresponding to this drawing appears near the end of the article.

And here's a linear pipeline without a summation tree:

This is a little trickier, at least to me (I had a bug in my first drawing, hopefully it's fixed). The idea is, every pair of ALUs computes a product of fk with v[i-k], adds it to the partial sum accumulated thus far, and sends the updated partial sum downstream to the next pair of ALUs.

The trick is this. The elements of v are also moving downstream, together with the sums. But after v[i] got multiplied by f0, you don't want to multiply it by f1 in the next cycle. Instead, you want to multiply v[i-1] by f1 – that's the product that we need for the convolution at index i. And then you do want to multiply v[i] by f1 one cycle later – for the convolution at index i+1. I hope that my sampling of v[i] to an intermediate register, which delays its downstream motion, does the trick.
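
Here is one possible way to write such a linear pipeline in Verilog. It is a sketch and not necessarily identical to the drawing: it assumes the elements of v travel through two registers per stage while partial sums travel through one, which is one way to get the delay described above. The names conv8_linear, vd and s are made up. As in the tree version near the end of the article, the coefficients f are left uninitialized here.

```verilog
module conv8_linear(clk, in_v, out_conv);
  input clk;
  input [7:0] in_v;        //1 8-bit vector element per cycle
  output [18:0] out_conv;  //1 19-bit result per cycle

  reg [7:0] f[0:7];    //8 8-bit coefficients (not initialized in this sketch)
  reg [7:0] vd[0:13];  //delay line: in_v delayed by 1..14 cycles
  reg [18:0] s[0:7];   //partial sums, one per stage

  integer i;

  always @(posedge clk) begin
    //v moves downstream one register per cycle...
    vd[0] <= in_v;
    for(i=1; i<14; i=i+1)
      vd[i] <= vd[i-1];
    //...but stage i taps it 2*i cycles late, while sums advance 1 stage per cycle
    s[0] <= f[0] * in_v;
    for(i=1; i<8; i=i+1)
      s[i] <= s[i-1] + f[i] * vd[2*i-1];
  end

  assign out_conv = s[7];
endmodule
```

With this arrangement, stage k multiplies fk by the input from 2k cycles ago and adds the sum that stage k-1 produced one cycle earlier, so the terms f0*v[j], f1*v[j-1], … for the same output index j meet as they move down the chain.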

So these two examples show how FPGA programming is different from programming most kinds of processors – and how it can be more efficient. More efficient, because you can use a lot of ALUs simultaneously with little overhead spent on dispatching commands and moving inputs and outputs between ALUs. An argument can be made that:

  • FPGAs are more flexible than SIMD/SIMT. You can give different instructions to different ALUs, and you can route operands from different places. Contrast this with SIMD instructions like add_16_bytes, with byte i always coming from offset i inside a wide register.
  • FPGAs scale better than VLIW/superscalar. More instructions can be issued simultaneously, because there's no routing bottleneck near the register file, and no instruction memory bandwidth bottleneck.
  • FPGAs are more efficient than multiple cores. Multiple cores are flexible and can scale well. But you pay much more overhead per ALU. Each core would come with its own register files and memories, and then there are communication overheads.

This gives us a new perspective on LUTs and switch boxes. Yes, they can be an inefficient, cheaper-to-manufacture alternative to dedicated gates and wires. But they are also a mechanism for utilizing the "hard" components spread in between them – sometimes better than any other mechanism.

And this is how FPGAs beating DSPs with the help of DSP slices isn't "cheating". (In fact, mature DSPs "cheat" much more by having ugly, specialized instructions – far more specialized than FPGAs' multiply-accumulate, with dot product instructions being among the least ugly. And the reason they need such instructions is they don't have the flexibility of FPGAs, so what FPGAs effectively do in software, they must do in hardware in order to optimize very specific data flows.)

I/O applications

But wait – there's more! In addition to being a hardware prototyping platform and an accelerator architecture, FPGAs are also uniquely suited for software-defined I/O.

"Software-defined I/O" is the opposite of "hardware-defined I/O" – the common state of things, where you have, for instance, an Ethernet controller implementing some share of TCP or UDP in hardware. Software-defined I/O is when you have some programmable hardware instead of dedicated hardware, and you implement the protocols in software.

What makes FPGAs good at software-defined I/O?

  • Timing control: Verilog and other hardware description languages give you more precise control over timing than perhaps any other language. If you program it to take 4 cycles, it takes 4 cycles – no cache misses or interrupts or whatever will get in your way unexpectedly. And you can do a whole lot in these 4 cycles – FPGAs are good at issuing plenty of instructions in parallel as we've seen. This means you don't have to account for runtime variability by buffering incoming data, etc. – you know that every 4 cycles, you get a new byte/pixel/etc., and in 4 cycles, you're done with it. This is particularly valuable where "deep" buffering is unacceptable because the latency it introduces is intolerable – say, in a DRAM controller. You can also do things like generating a clock signal at a desired frequency, or deal with incoming clock signal at a different frequency than yours.
  • Fine-grained resource allocation: you "burn" a share of FPGA resources to handle some peripheral device – and that's what you've spent. With other processor cores, you'll burn an entire core – "this DSP handles WiFi" – even if the core is idle much of the time. (The FPGA resources are also burnt that way – but you'll often spend fewer resources than a full processor core takes.) Alternatively, you can time-share that DSP core – but it's often gnarly. Many kinds of cores expose a lot of resources that must be manually context-switched at an intolerably high latency. Core asymmetry gets in the way of thread migration. And with two I/O tasks, often neither can tolerate being suspended for a long while, so you'll definitely burn two cores. (One solution is hardware multithreading.)
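
A small sketch of the timing control described above: a counter that raises a one-cycle pulse exactly every N clock cycles, which is the kind of building block used to derive slower, precisely timed signals. The module name tick_every and its ports are assumptions made for this example.

```verilog
// Hypothetical example: a pulse exactly once every N cycles of clk.
module tick_every #(parameter N = 4) (
  input      clk,
  input      rst,  // synchronous reset
  output reg tick  // 1 for one cycle out of every N
);
  reg [31:0] count;

  always @(posedge clk) begin
    if (rst) begin
      count <= 0;
      tick  <= 0;
    end else if (count == N-1) begin
      count <= 0;
      tick  <= 1;  // fires on a fixed schedule, nothing can delay it
    end else begin
      count <= count + 1;
      tick  <= 0;
    end
  end
endmodule
```

It also occupies only a counter and a comparator, which illustrates the "fine-grained resource allocation" point: the cost is a handful of registers, not a core.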

The upshot is that relatively few processors other than FPGAs are suitable for software-defined I/O. The heavily multi-threaded XMOS is claimed to be one exception. Then there are communication processors such as the hardware-threaded Qualcomm Hexagon DSP and the CEVA-XC DSPs. But these are fairly specialized; you couldn't use them to implement a memory controller or an LVDS-to-parallel video bridge, both of which you could do with an FPGA.

And of course, FPGAs' I/O capabilities can be combined with computation acceleration – get pixels and enhance the image color on the fly, get IP packets with stock info and decide which stocks to trade on the fly.

Programmable, efficient, and versatile, FPGAs are starting to sound like a great delivery platform.

Summary

There are several points that I tried to make. Some are well-known truisms, and others are my own way of looking at things, which others might find debatable or at least unusually put.

  • While FPGAs are a great small-scale circuit delivery platform, they can also be a large-scale software delivery platform. You can think of FPGAs as "inefficiently simulating circuits". But in other contexts, you can also think of them as "efficiently executing programs"!
  • Instead of fixed-function gates and wires connecting specific gates to each other, FPGAs use programmable gates – configured by setting a truth table of choice – and programmable switch boxes, where incoming wires are connected to some of the other wires based on configuration bits. By itself, it's very inefficient compared to a "direct" implementation of a circuit.
  • Then how can FPGAs beat, not just CPUs, but specialized accelerators like DSPs at their own game? The trick is, they're no longer FPGAs – gate arrays. Instead, they're also arrays of RAMs and DSP slices. And then they have full-blown CPUs, Ethernet controllers, etc. implemented in fixed-function hardware, just like any other chip.
  • In such modern FPGAs, the sea of LUTs and switch boxes can be used not instead of fixed-function circuits, but as a force multiplier letting you make full use of your fixed-function circuits. LUTs and switch boxes give two things no other processor architecture has. First, the ability to use less than a full-blown ALU for simple things – and less than a full clock cycle. Second, distributed routing of commands and operands – arguably more flexible than SIMD, more scalable than superscalar execution, and more efficient than multiple instruction streams.
  • FPGAs are the ultimate platform for software-defined I/O because of their timing control (if I said 4 cycles, it takes 4 cycles) and fine-grained resource allocation (spend so many registers and ALUs per asynchronous task instead of dedicating a full core or having to time-share it).

With all these advantages, why just 1% of the global semiconductor sales? One reasonable answer is that it took FPGAs a long time to evolve into their current state. Things FPGAs have today that they didn't have in the past include:

  • Fixed-function hardware essential for performance – this gradually progressed from RAM to DSP slices to complete CPUs.
  • Quick runtime reconfiguration, so that you can run convolution and then replace it with FFT – which you can't, and shouldn't be able to do, if you're thinking of FPGA as simulating one circuit.
  • Practically usable C-to-Verilog compilers, which let programmers (at least reasonably hardcore ones) who aren't circuit designers approach FPGA programming easily enough.

All of these things cater to programmers as much as or more than they cater to circuit designers. This shows that FPGAs are gunning for a position in the large-scale software delivery market, outside their traditional small-scale circuit implementation niche. (Marketing material by FPGA vendors confirms their intentions more directly.)

So from this angle, FPGAs evolved from a circuit implementation platform into a software delivery platform. Being a strong programmable architecture, they're expected to rise greatly in popularity, and, like other programmable architectures, end up everywhere.

Unanswered questions

As a teaser for the sequel, I'll conclude with some questions which our discussion left unanswered.

Why do FPGAs have DSP slices and full-blown "hard" CPUs? Why not the other way around – full-blown DSP cores, and some sort of smaller "CPU slices"? Where are the GPU slices? And if rationing individual gates, flip-flops and picoseconds instead of full ALUs, registers and clock cycles is so great, why doesn't everyone else do it? Why do they all break up resources into those larger chunks and only give software control over that?

Stay tuned for the sequel – "How FPGAs work, and why you won't buy one".

P.S. Programmable – how?

So how do you program the programmable gate array? Talk is cheap, and so are Microsoft Paint drawings. Show me the code!

The native programming interface is a hardware description language like Verilog. Here's an implementation of the tree-like convolution pipeline in Verilog – first the drawing and then the code:

module conv8(clk, in_v, out_conv);
  //inputs & outputs:
  input clk; //clock
  input [7:0] in_v; //1 8-bit vector element
  output reg [18:0] out_conv; //1 19-bit result

  //internal state:
  reg [7:0] f[0:7]; //8 8-bit coefficients
  reg [7:0] v[0:7]; //8 8-bit vector elements
  reg [15:0] prod[0:7]; //8 16-bit products
  reg [16:0] sum0[0:3]; //4 17-bit level 0 sums
  reg [17:0] sum1[0:1]; //2 18-bit level 1 sums

  integer i; //index for loops unrolled at compile time

  always @(posedge clk) begin //when clk goes from 0 to 1
    v[0] <= in_v;
    for(i=1; i<8; i=i+1)
      v[i] <= v[i-1];
    for(i=0; i<8; i=i+1)
      prod[i] <= f[i] * v[i];
    for(i=0; i<4; i=i+1)
      sum0[i] <= prod[i*2] + prod[i*2+1];
    for(i=0; i<2; i=i+1)
      sum1[i] <= sum0[i*2] + sum0[i*2+1];
    out_conv <= sum1[0] + sum1[1];
  end
endmodule

This example shows how "distributed routing" actually looks in code – and the fine-grained control over resources, defining things like 17-bit registers.

And it's fairly readable, isn't it? Definitely prettier than a SIMD program spelled with intrinsics – and more portable (you can target FPGAs by different vendors as well as an ASIC implementation using the same source code; it's not trivial, but not hopeless, unlike with SIMD intrinsics, and probably not harder than writing actually portable OpenCL kernels).

Incidentally, Verilog is perhaps the quintessential object-oriented language – everything is an object, as in a physical object: a register, a wire, a gate, or a collection of simpler objects. A module is like a class, except you can't create objects (called instantiations) dynamically – all objects are known at compile time and mapped to physical resources.

Verilog insists on encapsulation as strictly as it possibly could: there's simply no way to set an object's internal state. Because how could you affect that state, physically, without a wire going in? Actually, there is a way to do that – the usual instance.member syntax; hardware hackers call this "an antenna", because it's "wireless" communication with the object's innards. But it doesn't synthesize – that is, you can do it in a simulation, but not in an actual circuit.

Which means that our example module is busted, because we can't initialize the filter coefficients, f. In simulations, we can use antennas. But on an FPGA, we'd need to add, say, an init_f input, and then when it's set to 1, we could read the coefficients from the same port we normally use to read v's elements. (BTW, not that it adds much efficiency here, but the "if" test below is an example of an operation taking less than a cycle.)

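//init_f must also be declared among conv8's ports, e.g. "input init_f;"
always @(posedge clk) begin
  if(init_f) begin //while init_f is 1, in_v carries coefficients
    f[0] <= in_v; //shift in one coefficient per cycle...
    for(i=1; i<8; i=i+1)
      f[i] <= f[i-1]; //...so the first one sent ends up in f[7]
  end
end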

A triumph of encapsulation, it's also a bit of a pity, because there are now actual wires and some control logic sitting near our coefficient registers, enlarging the circuit, only to be used upon initialization. We're used to class constructors "burning" a few memory bits; who cares – the bits are quickly swapped out from the instruction cache, so you haven't wasted resources of your computational core. But Verilog module initialization "burns" LUTs and wires, and it's not nearly as easy to reuse them for something else. We'll elaborate on this point in the upcoming sequel.

Not only is Verilog object-oriented, but it's also the quintessential language for event-driven programming: things are either entirely static (these here bits go into this OR gate), or triggered by events (changes of signals, very commonly a clock signal which oscillates between 0 and 1 at some frequency). "always @(event-list)" is how you say what events should cause your statements to execute.

Finally, Verilog is a parallel language. The "static" processes, like bits going into OR gates, as well as "event-driven processes", like statements executing when the clock goes from 0 to 1, all happen in parallel. Within a list of statements, "A <= B; C <= A;" are non-blocking assignments. They happen in parallel, so that A is assigned the value of B, and C is simultaneously assigned the (old) value of A.
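
A classic way to see the parallel semantics of non-blocking assignments is a register swap that needs no temporary. This is a minimal sketch; the module name swap_demo and its ports are invented for the example.

```verilog
// Hypothetical example: swapping two registers with non-blocking assignments.
module swap_demo (
  input            clk,
  input            load,        // 1: load new values, 0: swap
  input      [7:0] a_in, b_in,
  output reg [7:0] a, b
);
  always @(posedge clk) begin
    if (load) begin
      a <= a_in;
      b <= b_in;
    end else begin
      // Both right-hand sides are sampled before either register changes,
      // so a gets the old b and b gets the old a.
      a <= b;
      b <= a;
    end
  end
endmodule
```

With blocking assignments ("a = b; b = a;") the second statement would see the already updated a, and both registers would end up holding the old value of b.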

So, for example, prod[i]<=f[i]*v[i] sets the new value of prod, and in parallel, sums are computed from the old values of prod, making it a pipeline and not a serial computation. (Alternatively, we could use blocking assignments, "=" instead of "<=", to do it all serially. But then it would take more time to execute our series of statements, lowering our frequency, as clk couldn't switch from 0 to 1 again until the whole serial thing completes. Synthesis tools tell you the maximal frequency of your design when they're done compiling it.)

On top of its object-oriented, event-based, parallel core, Verilog delivers a ton of sweet, sweet syntactic sugar. You can write + and * instead of having to instantiate modules with "adder myadd(a,b)" or "multiplier mymul(a,b)" – though + and * are ultimately compiled down to module instances (on FPGAs, these are often DSP slice instances). You can use if statements and array indexing operators instead of instantiating multiplexors. And you can write loops to be unrolled by the compiler, generate instantiations using loop syntax, parameterize your designs so that constants can be configured by whoever instantiates them, etc. etc.
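
Parameters and loop-generated instantiations can be sketched as follows. The modules dff and delay_line and their parameter names are hypothetical. The generate loop creates DEPTH register instances chained one after another, and whoever instantiates delay_line can pick WIDTH and DEPTH.

```verilog
// Hypothetical WIDTH-bit register, used as the building block below.
module dff #(parameter WIDTH = 8) (
  input                  clk,
  input      [WIDTH-1:0] d,
  output reg [WIDTH-1:0] q
);
  always @(posedge clk)
    q <= d;
endmodule

// Hypothetical parameterized delay line: q is d delayed by DEPTH cycles.
module delay_line #(parameter WIDTH = 8, parameter DEPTH = 4) (
  input              clk,
  input  [WIDTH-1:0] d,
  output [WIDTH-1:0] q
);
  wire [WIDTH-1:0] stage [0:DEPTH]; // wires between consecutive registers

  assign stage[0] = d;

  genvar k;
  generate
    for (k = 0; k < DEPTH; k = k + 1) begin : g_stage
      // One dff instance per loop iteration, created at compile time.
      dff #(.WIDTH(WIDTH)) r (.clk(clk), .d(stage[k]), .q(stage[k+1]));
    end
  endgenerate

  assign q = stage[DEPTH];
endmodule
```

An instantiation such as "delay_line #(.WIDTH(16), .DEPTH(3)) dl(.clk(clk), .d(x), .q(y));" would then produce three 16-bit registers in a row.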

If all this doesn't excite you and you'd rather program in C, you can, sort of. There have been loads of "high-level synthesis tools" – basically C to Verilog compilers – and their quality has increased over the years.

You'd be using a weird C dialect – no function pointers or recursion, extensions to specify the exact number of bits in your integers, etc. You'd have to use various #pragmas to guide the compilation process. And you'd have things like array[index++] not actually working with a memory array – and index++ not actually doing anything – because you're getting values, not from memory, but from a FIFO, or directly from the output of another module (just like in_v in our Verilog code doesn't have to come from memory, and out_conv doesn't have to go to memory.)

But you can use C, sort of – or Verilog, for real. Either way, you can write fairly readable FPGA programs.

49 comments ↓

#1 Adam Menges on 06.17.13 at 1:09 pm

Wow, long read. Well worth it.

#2 Jay K. on 06.17.13 at 1:21 pm

For someone wanting to learn FPGA, what would you recommend?

#3 Bob McFPGA on 06.17.13 at 1:31 pm

Nice. I've worked in R&D for one of the two big FPGA vendors for almost twenty years and this is one of the few articles for an uninitiated audience I've seen that's worth a damn. I'll be interested in part two. Hopefully (for my sake) it will be much less persuasive.

#4 Yossi Kreinin on 06.17.13 at 1:45 pm

@Bob McFPGA, Adam Menges: thanks!

@Jay K: I suggest you ask Bob McFPGA, as he's really much more knowledgeable than me :) In my case I never studied this stuff in a structured way. What I'd do is I'd learn Verilog, then when things work for me in say iverilog and look nice in gtkwave and I'm comfortable with the waveform viewer, I'd lay my hands on an FPGA and see if I can make it do what I want. But there might be a better way perhaps. Another approach I'd take in some cases is ask my employer to pay $1K or so for a training by Xilinx or Altera – again I'd first play with Verilog on my own and be comfortable with it to be sure that I'll be ahead of the class and not behind it. $1K trainings paid off nicely for me when I was generally on top of things but didn't know the particulars. (Of course I won't list this type of thing on my resume – it's not like academic courses, it's a joke – no exam, random curriculum, etc. but it can be effective when learning platforms and platform details while understanding things in general.)

#5 Jeff Stewart on 06.17.13 at 2:11 pm

I have been developing with FPGAs for a long time and I love the flexibility of them. I would recommend the tools and dev kits from Digilent if you want get up and running with a powerful setup. I like the Spartan 6 from Xilinx. http://www.digilentinc.com/Products/Detail.cfm?NavPath=2,400,836&Prod=ATLYS Full disclosure, I worked for Digilent when I was in college.

#6 Anthony on 06.17.13 at 2:14 pm

FWIW, there is this awesome project: https://github.com/milkymist/migen
that allows you to program in Python.

#7 Zvi on 06.17.13 at 2:39 pm

The generalization of the FPGA approach for CPU design is something called NISC – No Instruction Set Computer

https://en.wikipedia.org/wiki/No_instruction_set_computing

#8 Efren on 06.17.13 at 2:53 pm

Excellent article. One doubt on the consumption numbers, are they swapped for DSP?

#9 Anonymous on 06.17.13 at 3:20 pm

I just wanted to point out that your image for the "Modern FPGA" is not all that correct. Multiple FPGA companies do have dedicated Intellectual Property (IP) Blocks like USB and ethernet, but generally special, they don't exist. However, the block is fairly accurate to the Zynq Processor (www.xilinx.com/zynq).

Overall good article.

#10 Steve on 06.17.13 at 3:24 pm

Xilinx has a program called "Vivado HLS" or "Vivado High Level Synthesis" It will take C/C++/SystemC code and convert it to HDL (Verilog/VHDL) for you. So your "Practically C-to-HDL" should be changed to, "C-to-HDL Compilers"

#11 Scott M on 06.17.13 at 3:56 pm

I've been hoping for quite some time that someone would start to integrate FPGAs into the standard PC setup, and it would catch on. I thought maybe someone would make a clever Linux distro that made use of an FPGA to make super efficient IO and hopefully provide a platform for software to define chunks of (optional) Verilog to throw on the FPGA to make the software run better… Photoshop or video effect software seemed like an obvious place where gains could be found. I'm going to keep my fingers crossed, but not holding my breath after 10 years of waiting.

#12 Gerard Braad on 06.17.13 at 4:14 pm

Actually, USRobotics used FPGA's to allow your modem to be updated and remain usable when standards evolved/changed.

Besides Milkymist, also OpenRISC is an interesting project to look at.

#13 csirac2 on 06.17.13 at 6:51 pm

My graduate project at university was an FPGA image processor on Virtex2 which aimed to accelerate an edge/vectorization algorithm. Comparing the partial (but equivalent) parts of the algorithm did indeed result in a huge speedup – well over 100x vs a fast Pentium 4 IIRC – just using HandelC (a C-like HDL) over USB was huge fun.

Fast-forward many years to now, where I've spent some time in scientific and technical computing – nobody cares about FPGAs for difficult CPU-bound tasks. That's all "solved" in GPUs now. People seem to care even less about FPGAs for advanced/high-performance computing than 8 years ago.

Which is a shame – I loved working with the Virtex2, I bet the modern chips are a dream by comparison.

#14 iOne on 06.17.13 at 10:59 pm

Yes, FPGAs are powerful. However, tools suck. Really, they suck a lot. I have 5+ years of FPGA development and debugging is a complete nightmare. Plug&Play is simply a utopia and compatibility is nonexistent. I hope Vivado HLS or other C-to-HDL tools could simplify the process, but I'm very skeptical. Despite this, I love what you can do with these pieces of hardware…

#15 bayesiansandwich on 06.17.13 at 11:21 pm

FPGAs are nice if you like to wait 5 minutes for it to compile, only to find out there's a bug. 'Somewhere'.

#16 paul on 06.18.13 at 12:03 am

Very nice post. I'd be interested to know the percentage of die area used by the different stuff on the FPGA: DSP slices, ram blocks, hard cpu's, etc. Any idea? And do the notions in the post apply to lower cost FPGA's? I've been thinking of getting an entry level development kit, e.g. with Spartan-6, which don't have nearly as many resources as the big stuff. Thanks.

#17 Lelala on 06.18.13 at 12:28 am

Wow, huge explanation – that one should go to my prof for the freshmen at university!
Thanks

#18 Philip K on 06.18.13 at 2:08 am

Not sure whether this will be of interest to anyone. There was a Kickstarter project earlier this year to produce an FPGA development board aimed at the hobbyist.
http://www.kickstarter.com/projects/1106670630/mojo-digital-design-for-the-hobbyist

The Kickstarter is long finished, but I notice that he's taking orders through his website. At $75 a board, it's not exactly speculatively have-a-play cheap, but it doesn't feel like it's exactly silly money either.

His website is: http://embeddedmicro.com/

#19 c31ine on 06.18.13 at 6:34 am

Great post. I have been an FPGA designer for 10 years and I have dealt with many designs, including signal processing and even running a LEON processor as an IP in an FPGA. The choice between FPGA and DSP, and now between FPGA and processors, has always been there, hidden in the back of our minds or in plain sight.
I find your presentation very accurate. Looking forward to part 2!

About FPGA design from C or other non-HDL languages: I would be very careful, or you'll end up designing FPGAs like I design my C programs. I am perfectly able to write software for a simple application to communicate with my FPGAs, but I will never pretend to be a software designer and write effective code for a real-time application, for instance.
Non-HDL languages are great to play around with, but if you want to go further you'll have to understand accurately what's going on, and as far as I know, only HDL can give you enough control over what you really do.

@Jay K: Xilinx sponsors a community called "All programmable Planet" that I found very instructive, for beginners as for more experienced designers. They have dedicated posts for newbies, search for "Ask Max" and "Ask Adam VHDL".
Otherwise my big reference is this:
http://vhdl.org/comp.lang.vhdl/FAQ1.html
But it is in a lot more "raw" style.
And if you give it a try and have specific problems you'll find useful help there:
https://groups.google.com/forum/?fromgroups=#!forum/comp.lang.vhdl

@iOne and @bayesiansandwich: that's what simulation is for. A standard FPGA development process always includes simulation before generating any binary. It seems to me that with simulation, FPGA debugging possibilities are much greater than for processors, as you can create test cases that are very difficult to recreate in the real world (and so cover corner cases very easily).

#20 Thomas Fitzpatrick on 06.18.13 at 7:46 am

For anyone trying to learn FPGAs – check out fpgalink – it is the easiest workflow for programming FPGAs and supports most of the popular dev boards out of the box. It has command-line utilities for turning VHDL to Verilog – avoiding the ISE/Altera tools nightmare – and another for programming your FPGA over USB – JTAG is not required.

Myhdl is used for writing python that compiles to vhdl or verilog if you want to avoid C

http://www.makestuff.eu/wordpress/software/fpgalink/
http://www.myhdl.org/doku.php

#21 gus3 on 06.18.13 at 9:54 am

Wouldn't 10W for FPGA, vs. 2-3W for dedicated DSP, make the FPGA 3-5 times more power hungry than DSP? That is to say, 1/3 to 1/5 as efficient on the chip level as a DSP?

Of course, if you take a look at the watts per channel, then FPGA comes out on top: worst case, 1/2 the power per channel of the DSP; best case, FPGA uses 1/30 the power per channel.

#22 Nikolaos Kavvadias on 06.18.13 at 12:36 pm

Hi yosefk,

great article and blog overall.

It is true that C-to-HDL tools do not have intuitive GUIs (as someone put it, "they suck"). I personally hate "bloated goats", massive GUIs and libraries counting the hundreds of MBs and GBs, especially when this is not necessary. I also dislike Eclipse, which cannot be tailored and fitted to specific needs without much of the bloat.

My attempt at both a new HLS engine (called HercuLeS) and a light HLS GUI, incorporating simulation and synthesis, can be seen here: http://www.nkavvadias.com

HercuLeS engine: http://www.nkavvadias.com/hercules/

HercuLeS GUI tech demo (testing version, will be updated in a few days, a couple of issues reported :)

http://www.nkavvadias.com/temp

This is a commercial project, and we are currently investigating a three-stage commercialization model, with "Freemium", "Basic" and "Advanced/Full" versions.

The tech demo is free, no strings attached. The GUI has been written in Tcl/Tk. HercuLeS engine ports to Linux 32-bit and Windows XP/MinGW are available. A Windows 7 port is in the works.

#23 David on 06.18.13 at 12:44 pm

Did you ever look at this:

http://www.c-to-verilog.com/

(sadly somewhat unmaintained)

#24 Nikolaos Kavvadias on 06.18.13 at 1:03 pm

@David: yes, I'm aware of the C-to-Verilog site. What's behind the scenes is a version of SystemRacer, an HLS engine implemented as an LLVM Verilog backend.

I think that it has some frontend problems (the supported C subset is quite limited). I'm not sure if this is maintained; there have been no major updates for some time. There is also no testbench generation, no script file generation, no tool integration, etc. It seems more like a proof-of-concept of SystemRacer (?)

#25 Jim on 06.18.13 at 1:18 pm

[QUOTE]

…we estimated that high-end FPGAs implementing demanding DSP applications … consume on the order of 10 watts, while high-end DSPs consume roughly 2-3 watts. Our benchmark results have shown that high-end FPGAs can support roughly 10 to 100 times more channels on this benchmark than high-end DSPs…

So for that benchmark, FPGAs offer 10x-100x the runtime performance, and 2x-30x the energy efficiency of DSPs – quite impressive!

[/QUOTE]

Re-read the first sentence. It says FPGAs consume more power than DSPs.

#26 Nikolaos Kavvadias on 06.18.13 at 1:26 pm

@Jim: I think the proposition is valid, energy-wise (Energy = Power x Time).

FPGA energy = between 10W * (1/100)T and 10W * (1/10)T
DSP energy = between 2W * T and 3W * T

So a 2x-30x (hard) estimate is reasonable. Energy consumption is important to battery-operated apparatus. On the contrary, power consumption is the metric of importance if you operate under constant power.
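
The arithmetic behind that range can be checked with a few lines of Python. This is only a sketch under the assumptions stated in the quote (FPGA at roughly 10 W, DSP at 2-3 W, FPGA handling 10x-100x the channels); the function name and its parameters are made up for illustration.

```python
def efficiency_gain_range(fpga_watts, dsp_watts, speedup):
    """Return (low, high) for DSP energy / FPGA energy on the same workload.

    dsp_watts and speedup are (min, max) tuples. All figures are the rough
    estimates from the quoted benchmark, not measurements.
    """
    dsp_lo, dsp_hi = dsp_watts
    s_lo, s_hi = speedup
    # Energy = power * time. If the DSP needs time T for the workload,
    # the FPGA needs T / speedup, so
    #   DSP energy / FPGA energy = (dsp * T) / (fpga * T / speedup)
    #                            = dsp * speedup / fpga
    low = dsp_lo * s_lo / fpga_watts    # weakest DSP draw, smallest speedup
    high = dsp_hi * s_hi / fpga_watts   # highest DSP draw, largest speedup
    return low, high


low, high = efficiency_gain_range(10.0, (2.0, 3.0), (10.0, 100.0))
print(f"FPGA energy advantage: {low:.0f}x to {high:.0f}x")
```

With these inputs the low end comes from a 2 W DSP and a 10x speedup, and the high end from a 3 W DSP and a 100x speedup.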

#27 Yossi Kreinin on 06.18.13 at 1:35 pm

Thanks for all the comments. @paul: to tell the sizes you'd need to tear it down chipworks-style :) They obviously don't publish the numbers… How enlightening it would be I don't know, because the mix of resources is reflecting someone's estimation of some averaged market needs, and it can change over time.

#28 Roy Bunce on 06.19.13 at 12:04 am

Excellent post. FPGA designs have been around for a while, but it's only recently that they have become available at a reasonable price.

For an easy introduction into FPGA design, there is an Arduino compatible development board currently on kickstarter (short URL http://kck.st/164ObLg ) . As supplied it is programmed with a 58 bit I/O expander, but is reprogrammable with Verilog or VHDL code using the free Diamond software package from Lattice. It can be used stand-alone or controlled by any host processor with an I2C port. The associated web site will include tutorials and example code to guide the new FPGA programmer through the design process.

#29 Jochen Hebeler on 06.19.13 at 4:36 am

FPGAs are cool devices with great potential, but you have to acknowledge that they are complicated to program. A simple program for generating a PWM on an AVR is just a few lines of C. Generating a PWM on an FPGA is more complicated, and you have to check your module for nearly every possible state and input combination. The development cycle of an FPGA design takes much more time than the average on a common MCU. Programming sequential operations on an FPGA is a real pain in the arse and requires a lot of evaluation. Therefore the usage of these devices is limited to those areas where the qualities of FPGAs are needed and the extra money for development can be justified, for example in high-end test gear.
Also, you will come across a lot of problems concerning timing issues and temperature-related mess.
All in all, modern FPGAs are very potent and useful, but they are complicated and need a lot of testing.

#30 Matt on 06.19.13 at 5:31 am

Great article but, what is a FPGA?

#31 Dan Sutton on 06.19.13 at 8:17 am

This is a great article — I love the whole concept of FPGAs – back in the '80s, I did a good amount of work with PALs and GPLs; the FPGA, of course, is the logical extension of that technology… this article makes me want to get back into it again. Such fun.

#32 Mark on 06.19.13 at 9:51 am

I haven't figured out why, but the world of HDL programming (Verilog, VHDL, etc.) and the world of procedural programming (C/C++/C#/Java, etc.) are so different, and the expectations of the tools are so different. The tools in the procedural world are so much nicer than the crappy tools that exist for FPGA development. I can crank out a user interface in C# in a few minutes that will do sophisticated image transforms and host a TCP server. In the same amount of time, I might be able to write a Verilog program that multiplies 2 numbers together and blinks an LED if the value is zero. On a CPU, I can scale up everything to almost infinity and will just have to deal with everything running slower. On an HDL project, if I scale things up, then all of a sudden things start subtly breaking (i.e. the multiplication results start getting intermittent bit errors), until you get to the point where it doesn't even compile.

You can use "soft cores" as part of your HDL project (like picoblaze or microblaze) but these don't run as well as a hard microcontroller core.

#33 Sicaine on 06.19.13 at 12:09 pm

thank your for that great article :)

#34 tz on 06.20.13 at 2:49 am

Very good intro, but where is a simple, inexpensive board with dev environment and "blink a LED" example. Something like the Arduino or raspberry pi.

I can handle the logic, but a tube of fine-pitched parts isn't something I can work with, nor several hundred dollars for a set of development boards – most of which are already pre-wired and not for breadboards.

I think the pieces are there. But there isn't a FPGA one yet.

#35 trombone on 06.20.13 at 5:18 am

@iOne: I found your comment rather funny, and completely wrong! I spent 8 years working with FPGA development tools, and then the last 2.5 years working with digital ASIC implementation tools. The FPGA tools are an ABSOLUTE DREAM compared to the ASIC tools. This, I think, is due to the number of users.

I expect the "debugging is a complete nightmare" part is because you don't simulate enough, or haven't properly constrained the design prior to the implementation.

#36 rob on 06.21.13 at 10:30 pm

@ty: http://papilio.cc/

#37 rob on 06.21.13 at 11:35 pm

(Sorry, that was supposed to be @tz.) Great article.

#38 Johan Ouwerkerk on 07.03.13 at 6:20 am

FPGA's are quite nice, but the first thing to ditch is the concept of programming them as such. Problem is: if you structure your hardware like you do your software you'll run out of "space" pretty quickly. That's the subtle issue hinted at with the talk about a "constructor" after the article summary.

Secondly the routing looks like magic, until it isn't. Lots of seemingly inexplicable issues may crop up, and there are lots of subtle issues in that if things can't be statically verified (despite being "correct") code will not synthesize properly. An example: suppose you have a stream of data input wherein you may have two types of "objects" which need to be processed differently but yield the same output types; those outputs need to be sorted in some order and some total needs to be computed (map/reduce flow in software/math). You know in advance the total number of objects, the total number of objects of each type, etc.

In that example you can't simply (straightforwardly) "put" objects of type 1 in a "vector/array" of type 1 of length Y, and objects of type 2 in a "vector/array" of type 2 of length Z when the total number of objects is X, where Z != X != Y. That's because the synthesizer cannot verify in advance that the partitioning of X objects into two arrays of length Z/Y is actually correct (in a naive/software implementation it sees X assignments but it cannot know that only Z assignments go to the vector of type 2, and Y go to the vector of type 1).

#39 Oskari Teeri on 07.06.13 at 7:10 pm

Best FPGA introduction I've read so far.

#40 Mark on 08.08.13 at 7:58 am

The v2 Mojo FPGA hobby boards are on sale this month (August 2013) –since they came out with the v3 they are blowing out the v2. I bought one and started working through the VHDL examples in "HamsterNZ's" IntroToSpartanFPGABook.pdf eBook course. Obviously I had to use the mojo.ucf constraints file, but it works great so far.

#41 Lev Serebryakov on 09.21.13 at 9:32 am

FPGAs have one big problem for now: they are mostly vendor locked-in.

For example, if I target an "ordinary" CPU with my program, which should be reconfigurable but very efficient (like calculating user-configured formulas in some expression language), I could take LLVM or ORC, or write several codegens myself for several architectures (x86, x86-64 and ARM cover most consumer and commodity hardware; add MIPS and a huge slab of embedded hardware is covered too). Maybe these will not be state-of-the-art "compilers", but the result will be much faster than interpretation. Instruction sets are known, tools (like assemblers) are available, and now, with the rise of LLVM, rather high-level tools are available too.

But what should I do if I want to target a hypothetical FPGA companion to the CPU? Both Xilinx and Altera keep all formats and details closed. I could not redistribute their synthesis tools with my software. Even worse, I could not get any free tools for my project! Free synthesis tools are limited to low-end devices…

GPU is somewhere in-between, IMHO.

IMHO, to see FPGAs in "commodity" hardware, available to application programmers as reconfigurable coprocessors, FPGA vendors need to change their mind about synthesis tools and bitstream formats. Without such changes, it could not become reality.

#42 Yossi Kreinin on 09.21.13 at 9:40 am

It's a valid point, and I plan to expand on this in the follow-up; for now, one possible answer is, you can distribute VHDL or Verilog code – as many, many IP vendors do and make handsome profits, and as everyone does with JavaScript, lacking any binary instruction encoding that will run in all the web browsers out there. And, certainly the vast majority of software targeting CPUs is written in portable HLLs without knowing a thing about the instruction set, not to mention its binary encoding.

#43 Lev Serebryakov on 09.21.13 at 10:23 am

What will an end user do with my Verilog code? I don't mean a hardware/software designer who buys IP, the way programmers (not end users!) now buy software libraries, but a user of my software package, which I want to take advantage of the configurability of an FPGA.

I am talking more about the case where the user configures a filter in a high-level domain-specific language (maybe even in some graphical tool) and my software runs it on the CPU, GPU or FPGA, whichever is best on the user's hardware.
I could write such a system for the CPU for sure. I could write such a system for the GPU, as every GPU installation contains a compiler in its drivers (free of charge and transparent to the user!). FPGA? As long as FPGA vendors think of synthesis tools as "expensive developer tools" and not "free drivers", it is not possible, as it forces the user to BUY Quartus or ISE Design Suite, or what-is-the-name-of-Lattice-design-software.

Yes, JavaScript is distributed in source form, but there are already several free JavaScript VMs on "each" device. That is not true of FPGA synthesis tools, even if the user buys an FPGA card, because the synthesis tool is a separate product which costs a lot of money.

#44 Lev Serebryakov on 09.21.13 at 10:27 am

It looks like the situation with DSP cores on ARM SoCs now: you could buy a Snapdragon SoC and build your device around it, but you essentially could not use the DSP companion for anything but video decoding, because the DSP is undocumented and closed, and the only thing you could get for it is some binary blob ("firmware") which implements H.264 en/decoding, that's all. You want to use this DSP and mobile GPU for something else? Sorry, you don't have documentation for it, even if you are ready to sign an NDA, not to mention open-source projects…
FPGAs look even more closed than that.

#45 Lev Serebryakov on 09.21.13 at 10:54 am

Oh! I came up with a good example:

SDR. Software Defined Radio. Not exactly an end-user kind of thing, but it allows geeks to experiment with cool stuff without soldering high-frequency analog circuits (so the level of geekiness needed to enter is greatly reduced, the same way Arduino reduces this level for embedded and robotics). You could buy one board and get full-spectrum coverage.
Many of these boards contain an FPGA, which is cool, too.
But if you want to change something not in the last stage (which runs on the PC) but in the FPGA, you need to use the full-featured vendor design suites, and if you are lucky, the limits of the free versions of these suites are enough for you. Otherwise you need to buy tools which cost 10x-100x as much as the board itself.
There is NO good way for board designers to provide something like the Arduino IDE for such an SDR device, because vendor synthesis tools are so restricted, closed, etc.

And I don't think FPGAs will become widespread as companion devices until the tools are free, redistributable, and simple to embed and use, as C/C++ compilers for common architectures are now.

#46 uriW on 01.08.14 at 12:51 pm

Thanks Yossi.
It gives me the appetite to leave the corrupted world and start thinking again, and actually be DOING things.

Interesting as hell.
If only there was the ability to take very high level code and to compile it to something like 20% optimized …

#47 Yossi Kreinin on 01.08.14 at 10:58 pm

Resist the temptation to leave the world of easy money for the hell of hopeless fucking with barely working machinery!

As to FPGAs – I need to write the follow-up sometime. I think hacking on FPGAs can be fun but the full picture is not half as rosy as I painted it here IMO…

#48 AlbertR on 01.14.14 at 6:48 pm

@Scott M: There is a company that does what you suggest – allow the programmer/user to download code to one or more FPGAs to accelerate certain computational kernels. And this can be done in the midst of your conventional CPU code execution. The company is Convey Computer: http://www.conveycomputer.com/. They call these kernels Personalities and you can trade out personalities on the FPGAs in fairly quick succession. However, these are rather expensive computers; definitely not at the price point that you are suggesting.

#49 abir on 09.05.14 at 5:12 pm

Great article Yosef! But your verilog code for the convolution is a bit wrong. There is a race in your always block between the first 2 statements ("v[0] <= in_v" and the for loop) as a result of which v[0] might be overwritten before it is shifted right. Just put a "#0" in front of the "v[0] <= in_v" and it will be scheduled at the end of the current time step.
