INTRODUCTION

The increased integration of integrated circuits brings new challenges and chip area has become an important issue ^{1}. An alternative to decrease chip area, of binary digital circuits design, is to increase the digital representation to the Multiple-Valued Logic (MVL) base L, with domain D: {0, 1, 2. . ., L-1} in order to increase the entropy for each connection. MVL was introduced first by Lukasiewicz and Post for ternary logic ^{2}^{-}^{4}. The main advantage of the MVL digital processing is the reduced chip area, due to the fact that the interconnections contribute to about 70% of the Integrated Circuit (IC) area ^{13}. The IC area decreases by means of fewer interconnections, less quantity of IC pins and pads. Additionally, the processing performance might be faster for the MVL circuits in comparison to the binary circuits for the same resolution ^{4}^{-}^{5}. To design an actual quaternary CPU (eCPU), a quaternary algebraic system is needed. There are many proposed MVL algebras ^{6}^{-}^{8} and a suitable algebra needs to define a universal set of operators which must permit: 1) the development of minimization tools; 2) the operators' implementation with the IC gates; 3) the low complexity learning curve to facilitate designer's work. A suitable quaternary algebra that copes with all above-mentioned characteristics has been presented in ^{(}^{9} and it is comprised of Extended products, Successor and Maximum operators. Therefore, the aim of this paper is to design sixteen instructions, four stages pipelined quaternary processor architecture, that allow to demonstrate correct functionality with respect to the specifications in which the design criteria is the simplicity of implementation, not taking into consideration any possible optimization, nor specific criteria to define each stage pipeline functionality.

The methodology utilized here to the eCPU design is performed by extending the binary logic CPU architecture which facilitates the quaternary design. The eCPU follows the Harvard architectural model, a stack segment and handles hazards with hybrid techniques. Most data hazards are managed with static techniques through the utilization of a Java compiler which translates eCPU assembly to quaternary machine code, in which RAW and WAW data hazards are handled along with dynamic technique as forwarding between Instruction Execution and Instruction Decode. The control hazards are not fully handled, however, an interrupt mechanism is built into the eCPU for future implementation of complete control hazards. The required control signals are implemented within each pipeline stages. The description of the processor is made using the hardware description language VHDL and Quartus II development environment version 13.1 of Altera®. To validate the design, simulations are performed in Modelsim® version 10.1. Tests programs verify each instruction execution in the four stages pipeline and the whole eCPU execution is illustrated by means of calculating the ten first terms of the Fibonacci sequence in quaternary logic: {000, 001, 001, 002, 003, 011, 020, 031, 111, 202...}.

The rest of this paper is organized as follows: The second section presents the related work; The third section addresses the basic concepts; The fourth section describes the eCPU architecture; The fifth section presents the simulation utilizing the chosen technology; The sixth section discusses the limitations and the advantages of the design; The seventh section summarizes the concluding remarks and future work.

MVL RELATED WORK

There exist in the literature circuits to processing digital signals and different architectures that use MVL. In ^{(}^{4} the situation and the future of multiple-valued logic are analyzed and ^{10}^{-}^{12} discuss a ternary calculating machine, ternary computer, and the ternary optical computer architecture, respectively. There are two works exploiting a pipelined quaternary processor for image processing: 1) in ^{13} the authors claim that using the MVL logic in conventional digital systems has potential advantages of which one of the most important is the reduction of interconnections; 2) in ^{(}^{14} the authors discuss an NMOS pipelined image processor using quaternary logic for applications in medical image processing and robotics; Adders are a fundamental circuit for the MVL CPU design and are reviewed in ^{15} by proposing quaternary adders in voltage mode and in ^{16} demonstrating the design of High Performance Quaternary Adders. In ^{17} a high-speed, low-power and compact processing element (PE) using quaternary differential logic is proposed for a multi-core single-instruction-multiple-data (SIMD) processor implemented by using the multiple-valued current mode differential logic circuitry. In ^{18} the authors discuss content addressable memory in order to improve network quality of service (QoS) proposing a ternary/quaternary CAM architecture using DRAM technology processing.

Of particular interests in this work is the universal set of operators for the quaternary logic utilized for the design of MVL CPU architectures in [9], which takes advantage of the binary world knowledge, including minimization tools, synthesis methodology and implemented in a CMOS integrated circuit, demonstrating the feasibility of the hardware implementation ^{19}.

MVL BASIC CONCEPTS

Basic concepts are related to the universal set of quaternary gates proposed in [19] that permit to design any quaternary circuit taking advantage of the binary world knowledge. To design sequential circuits a memory element is also needed, as the D flip-flop proposed later. The architectural concepts needed to design the quaternary eCPU are similar to the binary world, being necessary only the utilization of the quaternary instead of the binary algebra, therefore decreasing the learning curve of the eCPU project.

To properly perform arithmetic operations, it was necessary to define the equivalent to the bit in the binary world that is the qbit or quaternary digit {0, 1, 2, 3}, and the operations are implemented based on the MVL algebra acting on the qbit, as presented next.

MVL universal set of quaternary operators

The chosen universal set of quaternary gates consists of the following operators: Extended products eAND_{1}, eAND_{2}, eAND_{3}, Successor, and Maximum with gates representation, as shown in Figure 1, and replicated briefly next, for clarity purposes ^{(}^{9}.

Extended MVL product term, shown in Table 1 is a binary operator, denoted by the symbol *^{i}, eANDi (a_{1}, a_{2}) = a_{1}*^{i}a_{2} = i, iff a_{1} = a_{2} = i; otherwise a_{1}*^{i}a_{2} = 0. For instance, if a_{1} = 2, a_{2} = 0, then a_{1}*^{1}a_{2} = 0; if a_{1} = 1, a_{2} = 1, then a_{1}*^{1}a_{2} = 1.

Successor (a_{1}) is a unary operator, denoted by the symbol a_{1} ^{1}, a_{1} ^{1}= (a_{1}+1) mod 4, where in this case + stands for the arithmetic addition and mod stands for the modulo operator. The a_{1} ^{1} is the next element in the ordered cyclic domain D: (0, 1, 2, 3) ^{10} when the Successor is applied once to (a_{1}). The a_{1} ^{2} is the second next element in the ordered cyclic, etc., for instance, if (a_{1} = 2) a_{1} ^{1} = 3, a_{1} ^{2} = 0, a_{1} ^{3} = 1.

Maximum (a_{1}, a_{2}) is a binary operator, denoted by the symbol +, defined, as usual, being the larger of the two values, for instance: if a_{1} = 2, a_{2} = 0, then a_{1} + a_{2} = 2. From now on + always stands for Maximum (not arithmetic addition) operator, unless stated otherwise.

By utilizing these operators any quaternary circuit can be designed based on the concept of MVL functions: F (a_{1}, a_{2}..., a_{n},), where n is the number of quaternary variables. In the chosen quaternary algebra, functions are defined as composed of subfunctions: F_{1} (a_{1}, a_{2}..., a_{n}), F_{2} (a_{1}, a_{2}..., a_{n}), F_{3} (a_{1}, a_{2}..., a_{n}), i.e., given a quaternary function each Fi (a_{1}, a_{2}., a_{n}), i = 1,2,3 copes with the minterms equal to i and takes two possible values i or 0. An MVL minterm is a product term (eANDi) that has all the variables presented in any form of the Successor operator, analogously to the binary minterm.

The sum of extended products form (SOPE) is a sum (maximum operator) of MVL subfunctions. This process yields the function in equation (1) in the sum of extended product form (SOPE) ^{9} for the example shown in Table 1, analogously to the sum of product form (SOP) in binary circuits' synthesis.

Analogously to the binary base, there is another universal set of MVL operator comprised of Extended OR operator, Successor, and Minimum. Extended OR operator is a binary operator for i = 0,1,2, denoted by the symbol +^{i}, eORi (a_{1}, a_{2}) = a_{1} + ^{i}a_{2} = i, iff a_{1} = a_{2} = i; otherwise a_{1} + ^{i}a_{2} = 3. For instance, if a_{1} = 2, a_{2} = 3, then a_{1} + ^{1}a_{2} = 3; if a_{1}=1, a_{2} = 1, then a_{1} + ^{1}a_{2} = 1.

The Minimum is a binary operator, as usual, being the minimum between the two operands.

This is the duality by complementation, named the product of extended sums form (POES), which is a product (minimum operator) of MVL subfunctions, based on maxterms. An MVL maxterm is a sum term (eORi (a_{1}, a_{2})), that has all variables presented in any form of the Successor operator. For a detailed explanation, please refer to ^{(}^{9}.

To define the MVL delay latch: D-latch, there are many alternatives, but in the same way as in the binary methodology, in the MVL D-latch independently of the current state Q, the next state Q* is set to the input data D after a delay. The State equation Q* for the quaternary D-latch is shown in equation (2). For the interested reader, definitions of types of quaternary latches are found in ^{(9, 20)}. With the two sets of universal operators along with the memory element, it is possible to design the proposed eCPU as presented next.

The design criteria for the eCPU architecture to prove concepts are based on the simplicity of the implementation. The four stages pipeline eCPU executes sixteen instructions chosen to demonstrate the feasibility of the architecture, letting write any program in the eCPU assembly language working in the domain of natural numbers for arithmetic operations. The eCPU memory consists of Harvard architecture, a stack segment, and the Register File. In the Harvard memory, the instruction and data modules are addressed by qbytes, defined as a set of four qbits, and each unit has a capacity of two hundred fifty-six words. The Register File contains a bank of sixteen registers for general use.

Table 2 shows the three types of implemented instructions its syntax and semantics: 1) type R: ADD, SUBT, MULT, DIV, PUSH, POP, GT, LS, MOV; 2) type I: ADDI (immediate addition), LOAD, STORE, NOP; and 3) type J: JUMP. The instruction's format has eight qbits size. Table 3 shows the OpCode and encoding, where tt, ss, dd are the source registers, destination, and CCCC immediate value, respectively.

The instruction set type R presents the first two qbits as OpCode, the next four qbits as source registers address and the last two, the destination register. Type R has one or three registers as operands. The instruction set type I presents the first two qbits as OpCode, the next two qbits as the destination or source register address and the last four an immediate value (immediate) or an address (LOAD and STORE). Type I has varying numbers of operands, from zero (NOP) to two, using the immediate value.

In the instruction set, type J is the conditional branch which presents the first two qbits as OpCode, the next two qbits is an address to check the condition and the last qbits is the address to jump.

The eCPU modules are detailed as follows:

1) Instruction Fetch Unit: in Figure 2 three functionalities are implemented in this module: a) the usual Instruction Fetch handled via the PC block that chooses between the next instruction address (PC+1) or the jump instruction address (PCin); b) the managing of the stack segment via the Stack Pointer Block which inputs to the unified two segments memory, one for the instructions and one for the stack; and c) the managing of the interrupts via the Interrupt Unit block. Table 4 defines its input and output signals. Note that the unified memory (Instruction and Stack) is a two-port memory placed in this module for ease of implementation. PC points to the address to the next instruction in the instruction memory to fetch the instruction. With the PCSrc signals (source PC) and PCin (PC input), both generated in the Instruction Execution Unit, the processor determines whether follows its normal cycle, that is PC+1, or if a jump instruction occurs. The Stack Pointer Block handles the POP and PUSH instructions execution in the stack that is the basic mechanism to cope with the control hazards.

2) Instruction Decode: in Figure 3, two functio nalities are implemented in this module: a) the usual decoding of the instruction and data fetch from the register file REG FILE block; b) the managing of the forwarding implementation via FORWARDING CONTROL block. Table 5 defines its input and output signals.

3) Instruction Execution: in Figure 4 three functionalities are implemented in this module: a) the execution of the instructions via the ALU; b) the managing of the forwarding implementation via the MUX blocks; and c) managing of the interrupt implementation via the INTERRUPT TRIGGER block.

Table 6 defines its input and output signals. Within this stage, the ALU (arithmetic and logic unit) is placed. In order to implement quaternary logic operations qbit and qbit vector types are defined which correspond to the quaternary digit and quaternary digit vector types. The ALU executes operations between them. In the R2 register input, the first possible choice named MemAddress comes from the quaternary MUX with two inputs (the other two are not used). The control signal in this MUX, named (3) in the datapath, refers to MemAddress ^{(}^{3} qbit of the register. Due to the fact that the ALU executes operations with 8 qbits registers, the extend 3 block extends from 4 qbits to 8 qbits filling with the value 3 the additional 4 qbits and the extend 0 block extends from 4 qbits to 8 qbits filling with the value 0 the additional 4 qbits.

Figure 5 shows the interrupt module which is placed at this stage and it is responsible for identifying arithmetic (i.e. overflow) and stack exceptions and their management. For the context switch, a routine is implemented that records the first ten (0.9) registers. Table 7 defines its input and output signals.

1) Write Back: in Figure 6 two functionalities are implemented in this module: 1) to save the data into the destination register in the file register; 2) to direct the data to Read or Write into the data memory. Table 8 defines its input and output signals. Note carefully, that in the implementation, Write Back has direct connections to the instruction fetch module running on the same clock cycle due to ease of implementation.

The Hazards of the processor are resolved as follows:

1) Structural Hazards: to cope with memory contention the Harvard architecture along with the file register are implemented. In the Harvard architecture, the instruction memory (placed in IF block) and data memory (placed in WB block) are separate units.

The access to the file register permits read and write in the same cycle by taking advantage of the different phases of the clock.

2) Data Hazards: to cope with RAW (Read After Write), WAW (Write After Write) data Hazards, an static technique provided by the compiler along with a forwarding mechanism (dynamic technique) are implemented. The forwarding mechanism takes place between ID and IE stages.

3) Control Hazards: to cope with changes in the PC from the PC+1 normal execution, a signal called flush was implemented to implement the flush operation in the pipeline for the sake of simplicity. If this signal is set to logic level three, then all units will be reset (flush), as well as the backup registers. The fact that this signal is generated at the execution stage ensures that only the correct instructions are executed, that is, the units are reset before the Jump instruction and all previous instructions to the Jump instruction are fully executed. There are not a complete treatment of control hazards, in this eCPU only the mechanism to handle these hazards are implemented to demonstrate correct functionality and feasibility of the implementation. Even, if there is not any interrupt vector neither interrupt handlers, this mechanism will allow to implement those operations which will be considered for future work. Figure 7 shows the eCPU datapath with the four-stage pipeline comprised of IF, ID, IE, WB functional units.

SIMULATION RESULTS

The simulations are performed in Modelsim® as follows: 1) To check each instruction, programs are simulated to verify the Instruction Fetch, Decode, Execution and Write Back units; 2) To check the whole project each instruction, in the instruction set, is simulated; and 3) To check a whole program, the Fibonacci sequence is simulated to verify the microprocessor as a whole, in which the first ten terms of the Fibonacci sequence are calculated. Table 9 shows the quaternary assembly program and the machine code generated by the compiler. The comments (i.e. - RAW, - WAR ) demonstrate some data hazards that happen in the code.

Figure 8 shows in the registers (left column) /u2/fr/ tmp_rf{0}, {1}, {2}, {3}, {4}, {5}, {6}, {7}, {8}, {9}, from top to bottom, the ten first terms of the Fibonacci sequence: {0, 1, 1, 2, 3, 5, 8, 13, 21, 34} in decimal that corresponds to {000, 001, 001, 002, 003, 011, 020, 031, 111, 202} in quaternary logic (last column).

In order to check the interrupt mechanism, the results are popped from the stack generating an exception in the stack (two many pops) and triggering the context switch and pushed the results previously saved in the registers.

To illustrate final hardware results, some Quartus statistics are provided as shown in Figure 9.

SIMULATION DISCUSSION

The simulation results show that the quaternary processor is executing instructions in accordance with the design specifications. The Jump instructions are the most critical since generates control Hazards. The method for solving this problem used in this project causes a delay in the pipeline, but it is defined this way to simplify implementation, as the eCPU optimization in this proposal is not considered. To properly perform arithmetic operations, it is necessary to implement a package called qbit.vhd file. In this package is defined the qbit, the qbit vector types and the way the operations are implemented. The method of implementation of the arithmetic operations is simplified because the objective is to deal with aspects of MVL eCPU architecture and not with the implementation of the quaternary arithmetic operations which are left for future work. The main difficulty of this work for the design and simulation of the current quaternary eCPU is the lack of an environment that synthesizes the VHDL in quaternary logic, therefore, to demonstrate the feasibility of the hardware, the actual implementation of the quaternary ports should be used. Note that, ^{2} conceptually, the binary architecture is similar to the MVL eCPU, with the difference that by increasing the base of representation, the instructions, data, and ^{3} control signals are represented in a more compact way (quaternary qbits), which simplifies the design, because it reduces the number of signals involved ^{4} in the implementation and reduces chip area as the interconnections are also decreased.

CONCLUSION AND FUTURE WORK

In this work, a MVL sixteen instructions set four stages pipeline eCPU Harvard architecture working properly with respect to specification has been shown. The architectural design criterion is being assumed to be the simplicity of implementation, as the main purpose is to demonstrate correct functionality and implementation feasibility. The correct simulated behavior demonstrates the correctness with respect to the specification and the quaternary hardware feasibility must be complemented with an actual implementation utilizing the quaternary gates. To facilitate user programming a two phase’s MVL compiler written in Java has been coded. The MVL architecture resembles, clearly, the binary CPU implementation decreasing the learning curve to the design of MVL CPUs. Future work is related to the implementation of the architecture developed here in terms of the actual hardware implementation of the chosen quaternary gates, by developing some synthesis tools; the complete control hazards treatment; and to develop quaternary circuits’ minimization tools for the chosen MVL algebra and circuits to perform quaternary arithmetic.