zwabbit edited this page Nov 23, 2015 · 20 revisions

Background

As part of demonstrating the basic soundness and correctness of MIAOW's design, it was implemented on a Virtex-7 FPGA using the VC707 evaluation board. This variation is dubbed Neko, and the following is a guide to bootstrapping and using it.

Due to MIAOW's sheer size, several limitations are imposed upon Neko. The biggest is that instead of the four SIMD and SIMF units available in a standard compute unit, Neko is restricted to one of each. Even in this configuration, Neko accounts for well over 60% of the available logic, and attempts to include one more SIMD/F unit resulted in the FPGA router giving up due to the density of the design.

The other major limitation is that currently Neko does not possess a functioning interface to memory. Initial attempts to create one were stymied by an architectural decision made early in MIAOW's development. To read more about it, please refer to the current limitations section of the Architecture Overview page.

Before attempting to synthesize Neko, one should have a passing familiarity with the Xilinx toolchain; otherwise one is likely to suffer some significant frustrations.

The current iteration of Neko requires Vivado 2015.1 at a minimum. It is also recommended that users remain on Windows 7 or 8/8.1 at this time.

One final caveat: the FPGA project is effectively an export of the repository source. If you make changes or fixes, make sure that they are also applied to the original repository source; otherwise they will be lost the next time you do a new export.

Architecture

Neko is composed of a MicroBlaze, which acts as the scheduling unit and memory controller, and a MIAOW compute unit. The two are connected via an AXI-lite interface. The compute unit is controlled through a series of command registers that are documented below.

Command Registers

The registers as exposed by the AXI-lite interface are memory mapped based on how the FPGA is configured by the user. The following represent offsets from the start of whatever address range was reserved for the compute unit. At present the register addresses are somewhat haphazardly assigned. In the future a more coherent schema will likely be put together.

Control Interface

The control interface is used to fill in the configuration values needed by the compute unit to actually execute a program. It is also how one starts execution of a kernel and determines when execution has completed. The offsets are relative to the start of the address range reserved for the compute unit, as described above.

Register | Address Offset | Purpose
--- | --- | ---
Execute | 0x00000000 | Write to begin kernel execution, read to check for kernel completion
Wave ID | 0x00000004 | Wave ID
Base VGPR | 0x00000008 | Starting address inside vector register file
Base SGPR | 0x0000000C | Starting address inside scalar register file
Base LDS | 0x00000010 | Starting address for memory operations
Wave Count | 0x00000014 | TODO: DOCUMENT
PC Start | 0x00000018 | Starting value of program counter
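
As a rough illustration of how the control registers above might be driven from the MicroBlaze side, the following C sketch fills in the configuration registers and then writes Execute. The struct, the function names, and the value written to Execute are assumptions made for this example and are not taken from the MIAOW sources. The `base` pointer stands for whatever address range you reserved for the compute unit.

```c
#include <stdint.h>

/* Byte offsets from the Control Interface table, relative to the start of
 * the address range reserved for the compute unit. */
enum {
    NEKO_REG_EXECUTE    = 0x00,
    NEKO_REG_WAVE_ID    = 0x04,
    NEKO_REG_BASE_VGPR  = 0x08,
    NEKO_REG_BASE_SGPR  = 0x0C,
    NEKO_REG_BASE_LDS   = 0x10,
    NEKO_REG_WAVE_COUNT = 0x14,
    NEKO_REG_PC_START   = 0x18
};

/* Hypothetical launch parameters; the struct and its field names are
 * illustrative and do not come from the MIAOW sources. */
typedef struct {
    uint32_t wave_id;
    uint32_t base_vgpr;
    uint32_t base_sgpr;
    uint32_t base_lds;
    uint32_t wave_count;
    uint32_t pc_start;
} neko_launch_cfg;

/* Write one 32-bit register at the given byte offset. */
static void neko_write(volatile uint32_t *base, uint32_t offset, uint32_t value)
{
    base[offset / 4u] = value;
}

/* Fill in the configuration registers, then write Execute to start the kernel.
 * Assumption: the value written to Execute does not matter; 1 is arbitrary. */
static void neko_launch(volatile uint32_t *base, const neko_launch_cfg *cfg)
{
    neko_write(base, NEKO_REG_WAVE_ID, cfg->wave_id);
    neko_write(base, NEKO_REG_BASE_VGPR, cfg->base_vgpr);
    neko_write(base, NEKO_REG_BASE_SGPR, cfg->base_sgpr);
    neko_write(base, NEKO_REG_BASE_LDS, cfg->base_lds);
    neko_write(base, NEKO_REG_WAVE_COUNT, cfg->wave_count);
    neko_write(base, NEKO_REG_PC_START, cfg->pc_start);
    neko_write(base, NEKO_REG_EXECUTE, 1u);
}

/* Return the raw value read back from Execute. How that value encodes
 * "kernel finished" is not documented here, so no interpretation is made. */
static uint32_t neko_execute_status(volatile uint32_t *base)
{
    return base[NEKO_REG_EXECUTE / 4u];
}
```

On real hardware `base` would point at the memory-mapped peripheral. For experimentation off the board, a plain `uint32_t` array can stand in for the register block.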

Program Interface

To load a program, first set the address and then write the instruction. Writing a value to the Instruction Value register automatically stores it at the address previously specified.

Register | Address Offset | Purpose
--- | --- | ---
Instruction Address | 0x0000001C | Address to load instruction to
Instruction Value | 0x0000001C | Instruction to load
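
The address-then-value sequence can be sketched in C as a simple loop. Note that the table lists the same offset for both registers. This sketch assumes the Instruction Value register actually sits at 0x20, which is a guess that should be checked against axi_slave_v1_0_S00_AXI.v. The function name and the `addr_stride` parameter are likewise hypothetical, since the addressing granularity of the instruction memory is not stated here.

```c
#include <stddef.h>
#include <stdint.h>

#define NEKO_REG_INSTR_ADDR  0x1Cu
/* ASSUMPTION: the table gives 0x1C for both registers; 0x20 is a guess
 * for the value register and must be verified against the RTL. */
#define NEKO_REG_INSTR_VALUE 0x20u

/* Load `count` instruction words starting at `start_addr`.
 * `addr_stride` is the address increment per instruction; whether the
 * instruction memory is byte or word addressed is an open assumption,
 * so the caller chooses it. */
static void neko_load_program(volatile uint32_t *base,
                              uint32_t start_addr,
                              uint32_t addr_stride,
                              const uint32_t *instr,
                              size_t count)
{
    for (size_t i = 0; i < count; i++) {
        /* Set the target address first ... */
        base[NEKO_REG_INSTR_ADDR / 4u] = start_addr + (uint32_t)i * addr_stride;
        /* ... then write the instruction, which triggers the store. */
        base[NEKO_REG_INSTR_VALUE / 4u] = instr[i];
    }
}
```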

GPR Interface

For testing purposes, the GPR interface provides the ability to fill in the scalar register file. The interface lets you preload four 32-bit words at a time and write them out to a specific base address. Via the same set of control registers you can also read out four 32-bit words. Note that the addresses used must be aligned. To read content from the register file, simply read from the data register addresses.

Register | Address Offset | Purpose
--- | --- | ---
command | 0x00000028 | Writing to this address will trigger a write to the SGPR with the current address and data values
quadBaseAddr | 0x0000002C | SGPR base address to use for next operation
dataReg0 | 0x00000030 | First data register
dataReg1 | 0x00000034 | Second data register
dataReg2 | 0x00000038 | Third data register
dataReg3 | 0x0000003C | Fourth data register
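
A minimal C sketch of the quad write and read sequence described above follows. The function names are hypothetical. The alignment check assumes that "aligned" means a multiple of four registers, and the value written to the command register is arbitrary; neither detail is confirmed by the text.

```c
#include <stdint.h>

/* Byte offsets from the GPR Interface table. */
enum {
    NEKO_GPR_COMMAND   = 0x28,
    NEKO_GPR_QUAD_BASE = 0x2C,
    NEKO_GPR_DATA0     = 0x30  /* dataReg0..dataReg3 are contiguous */
};

/* Write four 32-bit words into the SGPR file at `sgpr_addr`.
 * Returns 0 on success, -1 if the address is rejected.
 * ASSUMPTION: "aligned" means a multiple of four registers. */
static int neko_sgpr_write_quad(volatile uint32_t *base,
                                uint32_t sgpr_addr,
                                const uint32_t data[4])
{
    if (sgpr_addr % 4u != 0u)
        return -1;
    base[NEKO_GPR_QUAD_BASE / 4u] = sgpr_addr;
    for (uint32_t i = 0; i < 4u; i++)
        base[NEKO_GPR_DATA0 / 4u + i] = data[i];
    /* Any write to command triggers the store; the value 1 is arbitrary. */
    base[NEKO_GPR_COMMAND / 4u] = 1u;
    return 0;
}

/* Read four 32-bit words from the SGPR file at `sgpr_addr` by setting the
 * base address and then reading the four data registers. */
static int neko_sgpr_read_quad(volatile uint32_t *base,
                               uint32_t sgpr_addr,
                               uint32_t out[4])
{
    if (sgpr_addr % 4u != 0u)
        return -1;
    base[NEKO_GPR_QUAD_BASE / 4u] = sgpr_addr;
    for (uint32_t i = 0; i < 4u; i++)
        out[i] = base[NEKO_GPR_DATA0 / 4u + i];
    return 0;
}
```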

PC Interface

MIAOW has a series of hooks built into the compute unit for generating kernel execution traces. These hooks are exposed as performance counters in the Neko implementation. The performance counters have limited storage and for that reason the granularity at which they capture values can be controlled. The performance counters also only run when a kernel is being executed.

NOTE: The above is only half true. The intent is to implement granularity support, but right now recording only works on a per-cycle basis.

NOTE: Performance counters were removed during the Vivado migration and the code that adds them back in has not yet been committed.

NOTE: This interface is still undergoing development and will grow over time.

Register | Address Offset | Purpose
--- | --- | ---
Cycle Counter | 0x000000C0 | Cycles the current kernel has used to execute
PC Value | 0x000000C4 | Current PC value
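
Once the performance counters are available again, the cycle counter could be read and converted to wall-clock time along the lines of the C sketch below. It assumes the counter ticks at the 50 MHz system clock described in the Block Design section. The function names are hypothetical.

```c
#include <stdint.h>

/* Byte offsets from the PC Interface table. */
enum {
    NEKO_PC_CYCLE_COUNTER = 0xC0,
    NEKO_PC_VALUE         = 0xC4
};

/* ASSUMPTION: the counter runs at the 50 MHz system clock. */
#define NEKO_CLOCK_HZ 50000000u

/* Read the number of cycles the current kernel has used. */
static uint32_t neko_read_cycles(volatile uint32_t *base)
{
    return base[NEKO_PC_CYCLE_COUNTER / 4u];
}

/* Convert a cycle count to nanoseconds at NEKO_CLOCK_HZ.
 * The multiplication is done in 64 bits to avoid overflow. */
static uint64_t neko_cycles_to_ns(uint32_t cycles)
{
    return (uint64_t)cycles * 1000000000u / NEKO_CLOCK_HZ;
}
```

At 50 MHz each cycle lasts 20 ns, so a kernel that reports 50 cycles ran for 1000 ns.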

Sources

Because of the way Xilinx's toolchain works, a distinct build process was added to the project's Makefile to create a copy of the Verilog files that is ready to use with it. Among other differences, additional includes and defines are injected at the top of the copies. The command itself is more or less self-explanatory. Be sure to run it in the src/verilog/rtl directory.

make fpga

This will create a directory called fpga_core whose contents will ultimately go into the source folder of the Vivado project directory.

Vivado

The following is a guide to creating a Vivado project suitable for embedding MIAOW's compute unit into. Note that a few steps require manually editing files or copying content from the template files provided in the scripts/xilinx folder. Screenshots have been attached to help walk through the process.

Project Creation

  • Create a new Vivado project. Note for our purposes we are assuming this project is being built for the VC707 evaluation board. This is the only Xilinx board we support, primarily because it is the only development board we have access to that is large enough to support Neko. If one wishes to donate a VC709 or VCU108 board, please do get in touch with us at miaowgpu@cs.wisc.edu and we can talk.

The following images indicate the relevant options to select in the Vivado project creator. NOTE: Images need to be replaced.

Block Design

There are a couple of things we need to modify from the base block design.

  • Note that we run the system at 50 MHz. This is because attempts to synthesize MIAOW at 100 MHz result in timing violations.

  • At the same time, in the Board tab, change the Board Interface for CLK_IN1 to Custom instead of sys diff clock.

  • The MicroBlaze template project as created by Vivado does not include a MIG for controlling the DDR3 memory on the VC707 board. This needs to be added manually. Add it through the IP catalog, but do not run block or connection automation just yet. According to chapter 4 of Xilinx's UG898, the external clock must be connected to the MIG. Disconnect the input clock signal from the clock generator and connect it instead to the MIG as shown in the screenshot below. Then run block automation.

  • Note that out of sheer stubbornness Vivado might attempt to add another input clock and connect it to the MIG. If it does so, remove the new input clock and stick with the old sys_diff_clock signal and reconnect it to the MIG.

  • Now that the MIG configuration is complete you should see a ui_clk port on its block. Connect that to the clk_in1 port of the clocking wizard like so. After this run the connection automation. Once done, you will have the base MicroBlaze project that MIAOW will plug into.

  • As a final note, the peripheral_aresetn and clk_out1 signals must be attached to external output ports. You can do this by right clicking on their sources in the processor reset system and clocking wizard blocks respectively and choosing "Create Port" in the context menu that pops up. These signals are absolutely necessary as they drive the main reset and clock signals used throughout the compute unit.

AXI Peripheral

  • The MIAOW compute unit is incorporated into the Vivado block design as a separate IP. For our purposes it is recommended that one view the following tutorial from Xilinx to get a basic idea of how to create AXI peripherals.

  • We named the AXI peripheral axi_slave_v1_0 and recommend that, for simplicity, you do so as well. Only a single AXI slave channel is required. Creation of the peripheral is more or less straightforward; the only real complication is setting up the appropriate ports. This peripheral serves only as a bridge. The actual compute unit is kept entirely separate. We provide our instantiation of the AXI peripheral with all of the necessary ports and internal registers in the scripts/xilinx folder as axi_slave_v1_0.v and axi_slave_v1_0_S00_AXI.v. The latter has the actual memory mapped registers. Assuming you used the same name as us, the simplest way of creating the custom IP core is to overwrite the files generated by Vivado and repackage the core.

  • Once the IP is packaged, you can return to your original project. Add the path to the IP repo in the IP settings in the block manager and then add in the AXI peripheral like so.

  • Run connection automation to get the AXI connections set up. All of the non-connected ports must be made external so that they can be hooked up to the compute unit. The method to instantiate the compute unit correctly is demonstrated in scripts/xilinx/base_microblaze_design_wrapper.vhd. Note that for this to work you need to create an HDL wrapper for the block design as shown in the following image.

MIAOW Source Code

  • Once all this is done, you can take the contents of the src/verilog/rtl/fpga_core build folder and drop them somewhere into the Vivado project and import them. With everything loaded you should see something similar to the following.

  • Before you attempt to synthesize you will need to add one last IP to the project, a block RAM. Under Project Manager open up the IP catalog.

  • Again our recommendation is that you adhere to the naming convention established in MIAOW's source code to make life easier. We named the component block_ram. Make sure your selections match the following screenshots.

  • Synthesize, implement, and generate a bitstream. This takes about two hours, give or take, depending on your computer. Assuming no bugs, you will then have a working bitfile that you can program onto the FPGA. Next is software support.

Xilinx SDK

The following guide assumes a basic understanding of how to do things like exporting a Xilinx SDK project from Vivado and generating BSPs. Once this is done, create a hello world software project and replace the helloworld.c file with the main.c file located in src/xilinx_sdk. This template serves as the FPGA side of the software that controls the compute unit and dynamically loads kernels to it. Before building, make sure to regenerate the linker script. Past experience has shown that you need to do this manually; otherwise the final bitfile will not be loadable onto the FPGA.

PC Code

The other half of the code is in a separate repository located here. This is a Windows application that communicates over serial to send both instructions and data to the FPGA in order to bootstrap a kernel. The only thing one needs to do to run a kernel is to replace the contents of the instr_mem and data_mem arrays with the respective instructions and data you desire. And of course change the COM_PORT define to whatever serial port Windows assigned the evaluation board.

And that's it. If you have further questions on how to bootstrap Neko, please send them to miaowgpu@cs.wisc.edu and we'll get back to you as quickly as we can manage.

Future Work

  • Implement a proper memory interface for the compute unit to talk to the DDR3.
  • Implement support for PCIe.

Xilinx Datasheets Used

  • UG470 7 Series Config
  • UG472 7 Series Clocking
  • UG473 7 Series Memory Resources
  • UG898 Vivado Embedded Design
  • UG1037 Vivado AXI Reference Guide