Neko
As part of demonstrating the basic soundness and correctness of MIAOW's design, it was implemented onto a Virtex7 FPGA using the VC707 evaluation board. This variation is dubbed Neko and the following is a guide to bootstrapping and using it.
Due to MIAOW's sheer size, there are several limitations imposed upon Neko. The biggest is that instead of the four SIMD and SIMF units available in a standard compute unit, Neko is restricted to one of each. Even in this configuration, Neko accounts for well over 60% of the available logic and attempts to include one more SIMD/F unit resulted in the FPGA router giving up due to the density of the design.
The other major limitation is that currently Neko does not possess a functioning interface to memory. Initial attempts to create one were stymied by an architectural decision made early in MIAOW's development. To read more about it, please refer to the current limitations section of the Architecture Overview page.
Before attempting to synthesize Neko, one should have a passing familiarity with the Xilinx toolchain; otherwise, one is likely to suffer some significant frustrations.
The current iteration of Neko requires Vivado 2015.1 at a minimum. It is also recommended that users remain on Windows 7 or 8/8.1 at this time.
One final caveat is the reminder that the FPGA project is effectively an export of the repository source. If you make changes or fixes, make sure that they are also applied to the original repository source, otherwise they will get lost the next time you do a new export.
Neko is composed of a Microblaze acting as the scheduling unit and memory controller and a MIAOW compute unit connected via an AXI-lite interface. Control of the compute unit is via a series of command registers that are documented below.
The registers as exposed by the AXI-lite interface are memory mapped based on how the FPGA is configured by the user. The following represent offsets from the start of whatever address range was reserved for the compute unit. At present the register addresses are somewhat haphazardly assigned. In the future a more coherent schema will likely be put together.
The control interface is used to fill in the configuration values needed by the compute unit to actually execute a program. It is also how one starts execution of a kernel and determines when execution has been completed. The offsets are based on the region offsets specified above.
Register | Address Offset | Purpose |
---|---|---|
Execute | 0x00000000 | Write to begin kernel execution; read to check for kernel completion |
Wave ID | 0x00000004 | Wave ID |
Base VGPR | 0x00000008 | Starting address inside vector register file |
Base SGPR | 0x0000000C | Starting address inside scalar register file |
Base LDS | 0x00000010 | Starting address for memory operations |
Wave Count | 0x00000014 | TODO: DOCUMENT |
PC Start | 0x00000018 | Starting value of program counter |
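As a rough illustration of how the MicroBlaze side might drive these registers, the following is a minimal sketch in C. The offsets come from the table above; everything else is an assumption made for the example: the `neko_launch_cfg` struct and the function names are hypothetical, the base pointer depends on the address range you reserved for the compute unit, and the value written to Execute is arbitrary since only the act of writing is documented.

```c
#include <stdint.h>

/* Register offsets copied from the control interface table. */
#define NEKO_REG_EXECUTE    0x00u
#define NEKO_REG_WAVE_ID    0x04u
#define NEKO_REG_BASE_VGPR  0x08u
#define NEKO_REG_BASE_SGPR  0x0Cu
#define NEKO_REG_BASE_LDS   0x10u
#define NEKO_REG_WAVE_COUNT 0x14u
#define NEKO_REG_PC_START   0x18u

/* Hypothetical container for the launch settings; not part of MIAOW's source. */
typedef struct {
    uint32_t wave_id;
    uint32_t base_vgpr;
    uint32_t base_sgpr;
    uint32_t base_lds;
    uint32_t wave_count;
    uint32_t pc_start;
} neko_launch_cfg;

/* Write one 32-bit register at a byte offset from the compute unit's base. */
static inline void neko_write(volatile uint32_t *base, uint32_t offset, uint32_t value)
{
    base[offset / 4u] = value;
}

/* Read one 32-bit register at a byte offset from the compute unit's base. */
static inline uint32_t neko_read(volatile uint32_t *base, uint32_t offset)
{
    return base[offset / 4u];
}

/* Fill in the configuration registers, then write Execute to start the kernel. */
static inline void neko_launch(volatile uint32_t *base, const neko_launch_cfg *cfg)
{
    neko_write(base, NEKO_REG_WAVE_ID, cfg->wave_id);
    neko_write(base, NEKO_REG_BASE_VGPR, cfg->base_vgpr);
    neko_write(base, NEKO_REG_BASE_SGPR, cfg->base_sgpr);
    neko_write(base, NEKO_REG_BASE_LDS, cfg->base_lds);
    neko_write(base, NEKO_REG_WAVE_COUNT, cfg->wave_count);
    neko_write(base, NEKO_REG_PC_START, cfg->pc_start);
    /* Assumption: the written value is irrelevant; 1 is used as a placeholder. */
    neko_write(base, NEKO_REG_EXECUTE, 1u);
}

/* Return the raw value read back from the Execute register. */
static inline uint32_t neko_execute_status(volatile uint32_t *base)
{
    return neko_read(base, NEKO_REG_EXECUTE);
}
```

How the value read back from Execute encodes completion is not specified in the table, so the sketch returns it raw and leaves the interpretation to the caller.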
To load a program, first set the address and then write the instruction. Writing a value to the instruction value register will automatically see it written to the address previously specified.
Register | Address Offset | Purpose |
---|---|---|
Instruction Address | 0x0000001C | Address to load instruction to |
Instruction Value | 0x0000001C | Instruction to load |
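The address-then-value sequence can be sketched in C roughly as follows. This is only an illustration under several assumptions: the function names are hypothetical, the offset of the Instruction Value register is passed in by the caller because the table above lists the same offset for both registers, and the starting address and step are parameters because the unit of the instruction address is not documented here.

```c
#include <stddef.h>
#include <stdint.h>

/* Offset of the Instruction Address register, from the table. */
#define NEKO_REG_INSTR_ADDR 0x1Cu

/* Write one 32-bit register at a byte offset from the compute unit's base. */
static inline void neko_instr_reg_write(volatile uint32_t *base, uint32_t offset,
                                        uint32_t value)
{
    base[offset / 4u] = value;
}

/*
 * Load `count` instruction words into instruction memory.
 *
 * value_offset: byte offset of the Instruction Value register (caller supplied,
 *               see the note in the text about the table).
 * start_addr:   first instruction address to load to (assumption).
 * addr_step:    increment between consecutive instruction addresses (assumption).
 */
static inline void neko_load_program(volatile uint32_t *base, uint32_t value_offset,
                                     const uint32_t *instrs, size_t count,
                                     uint32_t start_addr, uint32_t addr_step)
{
    for (size_t i = 0; i < count; ++i) {
        /* First set the target address... */
        neko_instr_reg_write(base, NEKO_REG_INSTR_ADDR,
                             start_addr + (uint32_t)i * addr_step);
        /* ...then write the instruction, which stores it at that address. */
        neko_instr_reg_write(base, value_offset, instrs[i]);
    }
}
```

Keeping the address write and the value write paired inside one loop iteration mirrors the rule that a value write always lands at the most recently specified address.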
For testing purposes, the GPR interface can be used to fill in the scalar register file. It preloads four 32-bit words at a time and writes them out to a specified base address. The same set of control registers can also be used to read back four 32-bit words. Note that the addresses used must be aligned. To read content from the register file, simply read from the data register addresses.
Register | Address Offset | Purpose |
---|---|---|
command | 0x00000028 | Writing to this address will trigger a write to the SGPR with the current address and data values |
quadBaseAddr | 0x0000002C | SGPR base address to use for next operation |
dataReg0 | 0x00000030 | First data register |
dataReg1 | 0x00000034 | Second data register |
dataReg2 | 0x00000038 | Third data register |
dataReg3 | 0x0000003C | Fourth data register |
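A quad write and read-back through these registers might look like the following C sketch. The offsets are taken from the table above; the function names are hypothetical, the value written to the command register is arbitrary, and the alignment check assumes that "aligned" means a multiple of four registers, which is not confirmed here.

```c
#include <stdint.h>

/* Register offsets copied from the GPR interface table. */
#define NEKO_REG_GPR_COMMAND   0x28u
#define NEKO_REG_GPR_QUAD_BASE 0x2Cu
#define NEKO_REG_GPR_DATA0     0x30u /* dataReg1..3 follow at +4, +8, +12 */

static inline void neko_gpr_reg_write(volatile uint32_t *base, uint32_t offset,
                                      uint32_t value)
{
    base[offset / 4u] = value;
}

static inline uint32_t neko_gpr_reg_read(volatile uint32_t *base, uint32_t offset)
{
    return base[offset / 4u];
}

/*
 * Write four 32-bit words into the SGPR starting at quad_base.
 * Returns 0 on success, -1 if quad_base fails the assumed alignment check.
 */
static inline int neko_sgpr_write_quad(volatile uint32_t *base, uint32_t quad_base,
                                       const uint32_t words[4])
{
    if (quad_base % 4u != 0u) {
        return -1; /* Assumption: addresses must be a multiple of four. */
    }
    neko_gpr_reg_write(base, NEKO_REG_GPR_QUAD_BASE, quad_base);
    for (uint32_t i = 0; i < 4u; ++i) {
        neko_gpr_reg_write(base, NEKO_REG_GPR_DATA0 + 4u * i, words[i]);
    }
    /* Any write to the command register triggers the SGPR write. */
    neko_gpr_reg_write(base, NEKO_REG_GPR_COMMAND, 1u);
    return 0;
}

/*
 * Read four 32-bit words from the SGPR starting at quad_base.
 * Returns 0 on success, -1 if quad_base fails the assumed alignment check.
 */
static inline int neko_sgpr_read_quad(volatile uint32_t *base, uint32_t quad_base,
                                      uint32_t words[4])
{
    if (quad_base % 4u != 0u) {
        return -1;
    }
    neko_gpr_reg_write(base, NEKO_REG_GPR_QUAD_BASE, quad_base);
    for (uint32_t i = 0; i < 4u; ++i) {
        words[i] = neko_gpr_reg_read(base, NEKO_REG_GPR_DATA0 + 4u * i);
    }
    return 0;
}
```

Setting the base address before touching the data registers keeps the write and read paths symmetric, so a test can write a quad and immediately read it back from the same base.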
MIAOW has a series of hooks built into the compute unit for generating kernel execution traces. These hooks are exposed as performance counters in the Neko implementation. The performance counters have limited storage and for that reason the granularity at which they capture values can be controlled. The performance counters also only run when a kernel is being executed.
NOTE: The above is half lies, the intent is to implement granularity support but right now it only works on a per cycle basis for recording.
NOTE: Performance counters were removed during the Vivado migration and the code that adds them back in has not yet been committed.
NOTE: This interface is still undergoing development and will grow over time.
Register | Address Offset | Purpose |
---|---|---|
Cycle Counter | 0x000000C0 | Cycles the current kernel has used to execute |
PC Value | 0x000000C4 | Current PC value |
Because of the way Xilinx's toolchain works, a distinct build process was added to the project's Makefile to create a copy of the Verilog files that is ready to use with it. Among the differences, additional includes and defines are injected at the top of the copies. The command itself is more or less self-explanatory. Be sure to run it in the src/verilog/rtl directory.
make fpga
This will create a directory called fpga_core whose contents will ultimately go into the source folder of the Vivado project directory.
The following is a guide to creating a Vivado project suitable for embedding MIAOW's compute unit into. Note that a few steps require manually editing files or copying content from the template files provided in the scripts/xilinx folder. Screenshots have been attached to help walk through the process.
- Create a new Vivado project. Note for our purposes we are assuming this project is being built for the VC707 evaluation board. This is the only Xilinx board we support, primarily because it is the only development board we have access to that is large enough to support Neko. If one wishes to donate a VC709 or VCU108 board, please do get in touch with us at miaowgpu@cs.wisc.edu and we can talk.
The following images indicate the relevant options to select in the Vivado project creator.
NOTE: Images need to be replaced.
There are a couple of things we need to modify from the base block design.
- Note that we run the system at 50 MHz. This is because attempts to synthesize MIAOW at 100 MHz result in timing violations.
- At the same time, in the Board tab, change the Board Interface for CLK_IN1 to Custom instead of sys diff clock.
- The MicroBlaze template project as created by Vivado does not include a MIG for controlling the DDR3 memory on the VC707 board. This needs to be added manually. Do so with the IP adder but do not run block or connection automation just yet. According to the documentation in chapter 4 of Xilinx's UG898, the external clock must be connected to the MIG. Disconnect the input clock signal from the clock generator and connect it instead to the MIG as shown in the screenshot below. Then run block automation.
- Note that out of sheer stubbornness Vivado might attempt to add another input clock and connect it to the MIG. If it does so, remove the new input clock, stick with the old sys_diff_clock signal, and reconnect it to the MIG.
- Now that the MIG configuration is complete you should see a ui_clk port on its block. Connect that to the clk_in1 port of the clocking wizard like so. After this, run the connection automation. Once done, you will have the base MicroBlaze project that MIAOW will plug into.
- As a final note, the peripheral_aresetn and clk_out1 signals must be attached to external output ports. You can do this by right clicking on their sources in the processor reset system and clocking wizard blocks respectively and choosing "Create Port" in the context menu that pops up. These signals are absolutely necessary as they drive the main reset and clock signals used throughout the compute unit.
- The MIAOW compute unit is incorporated into the Vivado block design as a separate IP. For our purposes it is recommended that one view the following tutorial from Xilinx to get a basic idea of how to create AXI peripherals.
- For our purposes, we named the AXI peripheral axi_slave_v1_0 and recommend for simplicity that you do so as well. Only a single AXI slave channel is required. Creation of the peripheral is more or less straightforward; the only real complication is setting up the appropriate ports. This peripheral serves only as a bridge. The actual compute unit is kept entirely separate. We provide our instantiation of the AXI peripheral with all of the necessary ports and internal registers in the scripts/xilinx folder as axi_slave_v1_0.v and axi_slave_v1_0_S00_AXI.v. The latter has the actual memory mapped registers. Assuming you used the same name as us, the simplest way of creating the custom IP core is to overwrite the files generated by Vivado and repackage the core.
- Once the IP is packaged, you can return to your original project. Add the path to the IP repo in the IP settings in the block manager and then add in the AXI peripheral like so.
- Run connection automation to get the AXI connections set up. All of the non-connected ports must be made external so that they can be hooked up to the compute unit. The method to instantiate the compute unit correctly is demonstrated in scripts/xilinx/base_microblaze_design_wrapper.vhd. Note that for this to work you need to create an HDL wrapper for the block design, as shown in the following image.
- Once all this is done, you can take the contents of the src/verilog/rtl/fpga_core build folder and drop them somewhere into the Vivado project and import them. With everything loaded you should see something similar to the following.
- Before you attempt to synthesize you will need to add one last IP to the project, a block RAM. Under Project Manager open up the IP catalog.
- Again our recommendation is that you adhere to the naming convention established in MIAOW's source code to make life easier. We named the component block_ram. Make sure your selections match the following screenshots.
- Synthesize, implement, and generate a bitstream, wait for about two hours give or take depending on your computer, and assuming no bugs, you will have a working bitfile that you can program to the FPGA. Next is software support.
The following guide assumes a basic understanding of how to do things like exporting a Xilinx SDK project from Vivado and generating BSPs. Once this is done, create a hello world software project and replace the helloworld.c file with the main.c file located in src/xilinx_sdk. This template serves as the FPGA side of the software that controls the compute unit and dynamically loads kernels to it. Before building, make sure to regenerate the linker script. Past experience has demonstrated that you need to do this manually, otherwise the final bitfile will not be loadable onto the FPGA.
The other half of the code is in a separate repository located here. This is a Windows application that communicates over serial to send both instructions and data to the FPGA in order to bootstrap a kernel. To run a kernel, replace the contents of the instr_mem and data_mem arrays with the instructions and data you desire, and change the COM_PORT define to whatever serial port Windows assigned the evaluation board.
And that's it. If you have further questions on how to bootstrap Neko, please send them to miaowgpu@cs.wisc.edu and we'll get back to you as quickly as we can manage.
- Implement a proper memory interface for the compute unit to talk to the DDR3.
- Implement support for PCIe.
- UG470 7 Series Config
- UG472 7 Series Clocking
- UG743 7 Series Memory Resources
- UG898 Vivado Embedded Design
- UG1037 Vivado AXI Reference Guide