Mary Sheeran - Hardware Design and Functional Programming: Still Interesting After All These Years
Thank you for the opportunity come and rant, unconstrained by science! (laughter)
What I plan to do: show you some stuff I think is interesting, in the hope of luring you into FP and hardware design. I consulted some oracles (who work at the coal-face of hardware design); this is a mixture of things I like and they told me about.
Let's start about hardware description languages. How old are they? When did it start? I found a paper from 1968 which starts nicely: it's states "specifying, documenting and controlling the design of digital systems are problems of increasing severity." That's a nice introduction! Then it goes ahead and says, "Unfortunately, their contribution is mostly oriented tward the machine that they were develping at the time and is not generally useful." Papers were snarkier in the 60s? (Another on Algol60: "it is less than satisfying"). The paper being snarkily referred to here is actually the one that introduced microprogramming. Paper 1953, presented 1951. That's cool, but I should add... my slides are full of links. Eventually you will be able to click and look at these things.
So... that's one possible oldest HDL. But there's also Reed from MIT, who wrote "Symbolic synthesis of digital computers." Unfortunately, my university doesn't think I want to read papers that are older than 1980, so I couldn't get it. But I could get the abstract: it talks about how a boolean machine is an "automatic operational filing system": information is contained in sets of elementary boxes or files, each contianing one of the symbols 0 and 1. You can think back to this time where people couldn't even agree what to call things! But it has a lnguage which describes circuits.
But I finally got back to 1937: Shannon's MSC thesis, it has the idea of using boolean algebra to reason about switching circuits. If I had known that someone had written a masters thesis and started a whole field, I would have given up! It's online at MIT; and it's fascinating. IF THERE IS ONLY ONE THING YOU LOOK AT, IT'S THIS ONE.
If we jump forward in time, for actually recognizable hardware languages... APL has a serious past as a hardware language. An entire formal description of System 360 was made at APL. He likes the idea of using the APL for systems design... they take this seriously at IBM. They hooked up APL to their synthesis language; it was their main hardware description language; in 1970, they pubbed a paper where the compared the results of a hand design of IBM 1800, versus a generated circuit from their hardware synthesis. They discovered the generated circuits were 2.6x better; but they decided that if they had more time and resources, they could get it to be x1.3 larger. In this paper, they said, "This is a good thing! APL is good for hardware generation." There was a plethora of hardware description languages.
But then, what hapened?! I became an undergrade in the 70s, and there wasn't hide or hare of an HDL: they had all gone away. There was no way of describing adders except with pictures. I did learn about formal methods... I met Henderson who advised me for my masters. He was doing art; but he was also interetsed in "Introduction to VLSI systems". It opened up circuit design to NORMAL PEOPLE (like computer scientists.) (laughter) It has more than ___ citations, it started Intel and MIPs. I looked on Amazon, and I discovered you can buy it for 77 cents... and I think somebody should do that right now!
So Peter Henderson was having his mead and con.... phase; for my master's thesis I generated the layout of a circuit, which we then implemented in the MEad and Conway way. It was really stressful; we were only thinking about the layout... and then we had denotational semantics and functional programming, and there was the answer! Also 1978 when Backus introduced an alternative form of combining forms; these were exactly what I needed. So I wrote papers where I advocated reasoning about regular circuits: Reed and Conway advocated... using the combining forms. We had to draw pictures by hand back then! This is one of those algebraic laws Backus talked about. On the left is a reduce of G; on the right is a reduce of F composed with G, and the algebraic law says if you can push the .... (etc etc) and the Fs will appear on the carry. This law helps you reason about things like pipelining. Strange as it may seem, we actually got users. Most of the work was done by .... and a design team of Plessey, making a motion detection array, used this to think about the flow of data in the circuit. Tihs was great because previously, the way to design the circuit was to draw it on large pieces of graph paper in a room.... you could see the joy in the faces having a simulator. They wrote a very nice paper about their array: "Using MuFP the array processing element was described in just one line of code..." It goes on to talk about the importance of having a language to play with your circuit. I din't do this: it was done by G. Jones...
This was a success! But then bad things happened. Our partner Plessey was bought by GEC, and they closed down the design team, and every thing stopped. And if you thought we could convince GEC to use MuFP... no, that didn't happen.
So let's talk about the reality today of hardware descriptions.
I took, a couple of years ago, a picture of the Wikipedia page for HDL for Swedish. I had to use the Swedish one, because English is different. It says, "There are two pages: Verilog, VHDL". This is correct. The reality is that hardware design at the low level is completly dominated by 25-year old and not very nice PLs. So maybe you'd say, "We can have coffee, we failed" but I don't want to say that; I want to say something different. So I want to say, despite this reality, I'm still interested in hardware description. Even if we can't persuade Intel to change how they do the lowest level of hardware design...and I don't think we can do it, I still think it's interesting to think about it!
First, let's think about circuits which operate on arrays. I draw them as boxes. I'm reading them in column-major. I'm going to think about combinators for plugging together circuits on these arrays. Here is "interleave" (ilv), which applies a function on every second element of the array. "Two" applies a function on the first half and the second half. And it turns out two and ilv commute with each other. Once you have things like ilv, you can describe well known networks, like a butterfly network. It's made out of ilv: it's made my interleaving two half size network: I call it evens, applies f to adjacent elements of the array. If we draw a picture of the butterfly network: the interleaves causes these rifflings... it's a bit hard to see what's happening... so what I'll do is take the wires and pull it, so the little wires are stretched, and we end up with a different rep. Each vertical line corresponds to the two input two output function. And this is a butterfly network. And this the shape used in making FFT... but it is also Batcher's bitonic merge. Batcher in 1969 wrote a great paper explaining how to do sorting in a hardware network. He introduced a bitonic sequence: one that starts off increasing, and then is decreasing, or some cyclic rotation. His eureka was, if you take such a bitonic sequence, and push it through the first column of the two sorters, you end up with two smaller bitonic sequences: all the elementson the top one are greater than the bottom one. So if you keep going, you get a sorted sequence. So how do we make a bitonic sequence? We can use recursion! So we can take two half size sorters... reverse one of them, and then put that into a butterfly! (Diagram of the half size sorters.) For sixteen inputs, this uses 80 two-sorters.
Satnam convinced me to be interested in median networks... it doesn't need the values to be sorted but you just want the middle thing. Median filtering. It's an interesting function in relaity. I wrote ap aper in 98, where I got down to 98 comparators. That was the piece of code I was willing to show. I did get down to 96, but it was too embarassing to show in the paper. On the web, there's a famous piece of C code with a 99 compare and swaps, and the folklore was that you couldn't do better. I wrote to Paeth but he didn't respond.
It looks very unsymmetrical to me! I thought it would be symmetric about the middle line but it wasn't. What I've done is I've done the usual running of a parallel sorter, but I've also done computation about what i know about wires, so I only include the wires that I actually need. This synthesis hackery allows me to beat numbers. This kind of hackery is interesting, not only if you want to make circuits, but pieces of sequential C code that do such comparisons.
Now you might be thinking, "Search?" I have played with search. My second most proud moment, was finding this sentence in a paper from 2013: "Recently, a sequence of 2^n input prefix circuits of depth n and complexity L(2^n) (at least for n <= 25) was discovered by Sheeran, VIA COMPUTER PROGRAMMING." (laughter) This sentence was written by a matehmatician. Two days after it went up on JFP, he said, "You found prefix networks that match my lower bound! Unfortunately all my papers are in Russian!" But he has written a lower bound... 3.5 * 2^n... and so on. In my JFP paper, I found the prefix networks that exactly matched. I did not find them by search: I invented them, but search was an aid to the invention. I searched with lazy dynamic programming; I pored over the graphics, and then I had a Eureka moment and invented them. So having a language to describe things and play with them is important
I told him we should write a paper together, and he said "Nooo, computer scientists will not understand!" It is a heroic proof... I can't say that I really grasp it.
(Another network.) The whole key is: with a notation, you can play; and if you can play, you can invent!
Search is a fantastic help in discovering new things. Some more examples: SPIRAL, an approach in which you have a small DSL for expressing networks and the algebra of networks and they search for good implementations by applying algebraic laws. It's the kind of search which relies on the algebra; they generate circuits, low-level code. They have a great website that should be guiding all of us about how to present research on a website. And now it turns out... I didn't know this, but there are a bunch of people using search for sorting networks. These people (Valsalam and Miikkulainen), they are trying to push the limits of small sorting networks. Up for 16 inputs, they couldn't make improvements, but in their SAT based approach, reduced the best-known 17 input from 73 to 71. That as a result by Van Voorhis can reduce best known for others. This is heroic search... I think some of these numbers are up for grabs. I think 71 is quite high. Maybe I can get it down. If you get it down from 71 to 70, you can get your name on a book by Knuth. That was my favorite career moment, though for another reason.
Also, you can use search to prove optimality by searching through all possible 24 comparator nine-input sorting networks, to show you can't do better than 25. At the top of the line, it shows what we know, and known lower bounds. So for 9 and 10 inputs, we've made the lower bound match. That's where we are! These are very small numbers... but also very hard searchers. Only as far as 10 do we know what the smallest number of comparators there are.
And why are people are are not actually interested in hardware design playing with sorting networks? Well we have another reason for needing circuits: we can use these circuits for doing verification. This papers from MiniSAT is for translating pseudo-boolean constraints to SAT... and they use Batcher's sorting network! So they are interested in designing better sorting networks, median networks, etc.... so there's a reason you'd be interested in this.
By the way, I want to produce some of these numbers, and I want to understand these structures in the large, to replace the structures from 60s and 70s. We haven't made progress designing them since hten. So this is my take on FP and cirucit like things. But you might be thinking, ugh, that's not real circuits! So let's talk about FP and hardware in the real world...?
In 1994, Intel released a faulty Pentium 4, and then they screwed up on dealing with the flack that came out with. Afterwards, they did reduce a keyring of these fualty things, ahnded it out to each employee, on the back from andy Grove it said, "Great companies are improved by crises. good companies survive them." The effect of this Pentium bug, was that half my friends of formal verification were working for Intel. Many of them still do. I consulted Carl Sieger about the current status of FP and FV in Intel. He gave me some slides...
In Intel, they use a system called Forte to do formal verification of "computational structures"; these might be called the data path parts of processors; the algorithmic parts; the floating point units. This Forte system has thousands of users, sittin gin inside Intel, doing a system based on a lazy FP called fl. It has built-in BDDs, decision procedures, and a HW symbolic simulator (a ckind of model checking). FL is used in this context in many ways... ways that are familiar (design language, high-level spec, implementation language...) This is a success; a quiet success, that's going inside Intel. Carl gave m examples of fl in various contexts. He also told me about two tools that are used inside Intel: IDV, which is 280k fl plus tcl/tk; this is the nearest tool to vision I had as a doctoral student of using algebraic transformations on circuits. It allows you to start on a high level, start applying transformations, and then eventually deal with physical layout. I tried to persuade them to give it to me for teaching, but it didn't get out; and Intel stopped using it. in order to move desings from one process to another, they did the opposite of refinement: up to a spec, and then back down to another. And then we talked about STEP on it. Some papers describing them. These are huge success stories...
There is another x86 provider called Centaur; they also do heroic formal verification. In hardware, the price of getting it wrong is very high. They took a 47 million dollar loss. He said, "We can afford one more bug, but two will kill the company." So formal verification in that context is very important. The work at Centaur is verification: it starts with a spec of what an instruction should do, and compares the two to make sure instructions are implemented correctly. It's all based on ACL2. They're increasing the areas of trust: the parts of the chip which are verified. He has also spectactular results for formalizing entire x86 instruction sets, building verified compiler.... but it's not about design; it's about post hoc verification.
My next Oracle was Nikhil; CTO of Bluespec. I said to him, I have to give a keynote, what should I say about Bluespec. I wouldn't have said "still" interesting, I would say "even more" interesting. Over the last two years, the interest in Bluespec, has greatly increased. Two reasons in the upsurge in another way to do hardware: the rise of FPGAs. FPGAs are being designed by people without a hardware background; and the other reason is malware and hacking scares... in the FPGA world, there is an interest in end-to-end because fears of bad things.
BSV is a combination of structural hardware description language for us who did Lava things. It's like guarded rewrite rules: Lennart is one of the developers. And Nikhil gave me some slides... the slides also contain butterflies. One of the things he argued for in the slides he gave me was with Bluespec it is possible to explore very many implementations of te same function; e.g. inverse FFT. There's many choices: a combinational circuit and add pipelining, or you can feed data back around. And the combination of a library for doing structural descriptions with the rules gives you something extremely poweful. Adding rules to structural hardware description gives you a very attractive approach.
Just like we saw from IBM 1800, here's an example where they compared BSV generated with .... now the njumbers look a lot better. For hand coded, the BSV generated an implementation which is one third the size! for the larger size, it's 0.81x. These are spectacular results. Not just competitive, but far superior. Is this real? Nikhil's answer is... BSV often beats handcoded RTL code. Why? Algorithmically superior designs, because it's easy to make changes, make major architectural changes. You have a much better chance of finding a good implementation.
Now, they also have...another tool in their arsenal. This is someting new: Bluecheck. It's related to Quickcheck, which some of you might have heard of. It hasn't appeared yet, but it's a generic synthesiable test bench; let's FPGA designers provide an executable specification, and for free get an entire test bench included shrinking. Apparently this is blowing minds! The point is that the shrinking an diterative deepening, the approach for generating inputs, CAN HAPPEN on the fpga. You change from doing 350 test/s to 100000 test/s. They've linked it up to a software tool for studying. There in a project making memory system for multicore. They worry about sequential consistency; this quickcheck thing... they end up with much shorter failing test cases than they do in any other way. WE've all heard this before: shrinking is a huge win. Fantastic work! bluespec is making a lot of use of this.
And around Bluespec, there's interesting work on formal verification. I picked on two papers by Adam Chlipala: Formal Verification of Hardware Synthesis; not verifying resulting circuits of fixed size, but the whole generation of parametric circuits. That paper also has a bitonic sorter in it! There's a theme arising here. There's a follow-on paper from CAV this year which uses Coq to do the first machine verification of sequential consistency for a multicore hadware design that includes caches and spec processors. bluespec is not only winning, it's a basis for really interesting work on FV.
What about Lava and some other things? Chisel is a last... implemented in Scala. With lots of users in an arch design group. You should look at it.
And there's Cryptol, for crypto algorithm with aa route to FPGA. That book from 2010 has a lot of interesting papers. I like this paper, it says how they get to FPGA; and the notion of undelay (antidelay); the use of thinking about circuits this way.
And there are opportunities in OpenSPL. There's a company Maxselor, a gigantic FPGA, and when you buy it you also get a programmer, because it's just so hard to program. They're trying to pretend it's "spatial programming" but it's really just hardware. Open spatial language: they want input, we should provide it!
Final oracle is Andreas Olofsson; Adepteva; I could #include a rant from Satnam, or the #include of what machines will look like in the future. These slides, he told his whole story of developing Parallela board, whch contains a multicore chip, plus two ARM cores. Credit card sized, $99; it's very low power, very interesting, and very difficult to program. And he wants help. His story was how he went from Kickstarter, being invested in by Erikkson, at the end, "it's the software stupid!!!" He can do the hardware design... but we need to help him on how to program these things! Programming these is not only going to be difficult because it's not ones... but many.
so the division between hardware and software: those days are gone. This is what achines look like now: FPGAs, multicores, CPUs, GPUs. If I asked Carl Sieger: is there any hope for us to influence what happened? He said: FPGAs are moving into the processors. There's an announcement today of a XEON chip of an FPGA in th eprocessor. So we can't make this disinction. We're going to have a large number of tese things to program. It needs to deal with heterogeneity/massive parallelism. Lots of relevant work.
BUT STILL. I kept... I was positive until the very end. STILL, I lack a high level language that allows me to think about playing with time and space the way hardware designers do. The research I started in PhD: I have not succeeded in. There's still a lot of work to do. Work by ... I need help. I'm hoping to lure somebody into this. How do we think about the tradeoffs of space and time designers do? I've been involved in setting up workshops to think about this. It used to think that this conference was FP and comparch... but we've gone away from it. Maybe we should come back.
Programming future machines will be more like hardwar design than is comfortable. So not only FP + HW is interesting, th eideas might be important for software.
Q: (Ragde) My grad work, which was done about the same time, was lower bounds for || computation. The big result was AKS log depth sorting result. Have you been thinking about that result?
A: To get the constants down? (Not just get it down, but log depth networks which seem to be hard.) I have not really thought about it.
Q: (Bodik) On the note of how abstractions ar echanging; substrate changing; the sorting network you mentioned is to count the comparators, minimize that. But if you want to implement in software, FPGU, the wires are not free; you need to implement them as indexing to arrays. There, minimizing comparators may lead to complicated index. Are there thoughts on a different cost model for networks.
A: I haven't played with this for sorting, but I have for parallel prefix, because when i DP search, I sometimes didn't minimize comparators, but length of wires. You might also minimize the size of largest fnaout. I'm lucky I have a VLSI group who helped with deciding what are reasonable cost models. I think it's possible to do work in that direction, we should do more.
Q: (Svenningson) We often sort to search with binary search. On modern hardware, that has caches, binary search is not the most efficient way... often it's better to have a Btree or other order. What is known about circuits that produce this kind of order?
A: I don't know! Sounds very interesting.
Q: (Gershom) As I understand parallel prefix, it's lovely derivation... you jump ahead and you have syntehsize a bunch of stuff, think hard, test it, get better ones. There's a very big gap there. Is there a point where program derivation fall off?
A: For prefix netowrks; we know very well what the look like, we know how they decompose, but what we don't know what the size of the components should be. We understand very well... it's not a good example of where program derivation would be of value. Sorting would be a better example.