The TD4 4-bit CPU: A Comprehensive Architectural and Programming Tutorial

I. Introduction to Minimalist CPU Design: The TD4 Mandate

The TD4 (Tiny Digital 4-bit CPU) is a teaching machine designed to expose the operation of a Central Processing Unit (CPU) using straightforward, discrete logic components. This machine, largely based on the designs of Kaoru Tonami in his book How to Build a CPU, serves as a pedagogical tool for students and hobbyists seeking a deep understanding of hardware fundamentals.

A. Origins, Educational Context, and Core Philosophy

The primary objective of the TD4 architecture is transparency and simplicity. By constructing the CPU entirely from standard 74-series Transistor-Transistor Logic (TTL) chips—the original TD4 utilizes only 12 core ICs—the complexity inherent in modern microprocessors is stripped away. This design methodology forces users to grasp the register-transfer level (RTL) operations that define computational processes. The physical kits often feature an extensive array of Light Emitting Diodes (LEDs), sometimes totaling nearly 200, used to visually represent the real-time state of the program counter, registers A and B, input/output ports, and control signals. This "blinkenlight" feature makes the traditionally abstract fetch-decode-execute cycle a tangible, observable sequence of events, fulfilling its core educational mandate.
The TD4 ecosystem has generated several variants, including the physical TTL kit (often referred to as the Wuxx/Watanabe version) and software emulators like Tiny4CPU (T4C). While the core principles remain constant, the T4C emulator introduces minor structural differences: it names the general-purpose registers X and Y instead of A and B, and it abstracts away some of the strict hardware limitations. However, the fundamental machine architecture is defined by the rigid constraints imposed by the minimal TTL implementation.

B. Architectural Constraints and Data Limitations

The TD4 is defined by its extreme constraints, which drive every aspect of its design, particularly the Instruction Set Architecture (ISA). The CPU operates on a 4-bit word size, meaning the native data unit, the nibble, restricts all registers and immediate values to numerical ranges of 0 through 15.

Program Memory and Instruction Encoding

The most stringent limitation is the instruction memory. The original TD4 design utilizes a bank of 16 Dual In-line Package (DIP) switch arrays, each of which holds one 8-bit instruction. This hardwired Read-Only Memory (ROM) defines the maximum program size: 16 total instruction slots. This minuscule address space is a direct consequence of the 4-bit Program Counter (PC) register.
Each instruction is fixed-length, spanning 8 bits, and is consistently partitioned into two functional fields:

OpCode (High Nibble, Bits 5-8): These 4 bits select the operation to be performed, such as ADD, MOV, JMP, or IN/OUT.
Immediate Value (Imm) (Low Nibble, Bits 1-4): These 4 bits provide a constant data value (or memory address, in the case of jumps) that is always used as an operand during execution.

This rigid format necessitates careful programming; for instructions that logically do not require an immediate value (like register-to-register moves), the programmer must explicitly set the immediate field to zero.
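The field split described above can be sketched in a few lines of Python. The helper names `decode` and `encode` are illustrative assumptions and do not come from any TD4 toolchain; the sketch only shows how one byte separates into the two nibbles.

```python
def decode(instruction: int) -> tuple:
    """Split an 8-bit TD4 instruction into (opcode, immediate)."""
    if not 0 <= instruction <= 0xFF:
        raise ValueError("a TD4 instruction is exactly 8 bits")
    opcode = (instruction >> 4) & 0xF  # high nibble, bits 5-8
    imm = instruction & 0xF            # low nibble, bits 1-4
    return opcode, imm


def encode(opcode: int, imm: int = 0) -> int:
    """Pack a 4-bit opcode and a 4-bit immediate into one byte.

    The immediate defaults to zero, which is what instructions without
    a meaningful immediate must carry.
    """
    if not (0 <= opcode <= 0xF and 0 <= imm <= 0xF):
        raise ValueError("opcode and immediate must each fit in 4 bits")
    return (opcode << 4) | imm


print(decode(0b0011_0101))              # (3, 5)
print(format(encode(0b0001), "08b"))    # 00010000
```

Defaulting the immediate to zero in `encode` is a design choice of this sketch: it makes the safe encoding the default for instructions that do not use the field.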
The core architectural specifications are summarized below:
Table I: Core TD4 Architectural Specifications

Feature	Specification
Architecture	4-bit (Nibble-based)
Program Memory	16 Locations (16 bytes total)
Program Storage	DIP Switches (Hardwired ROM)
Data Registers	2 (Register A, Register B)
System Clock	Single-Cycle Design (CPI=1)
Logic Family	74-series TTL

II. The TD4 Hardware Architecture: Datapath and Control Unit

A CPU is conceptually separated into the datapath, which handles data storage and manipulation, and the control unit, which dictates the sequencing and timing of operations. The TD4’s design exemplifies how these two units are highly specialized for minimalism.

A. Functional Blocks and The Datapath Structure

The TD4 datapath is fundamentally organized around a single internal bus structure, connecting the state elements (registers) to the functional unit (the Adder/ALU) via dedicated multiplexing logic. This configuration determines the limited set of operations the CPU can perform in a single clock cycle.
The hardware implementation relies on specific TTL components:

Registers (A, B, PC): Implemented using integrated circuits that function as 4-bit latches or counters, such as the 74LS161.
Arithmetic Logic Unit (ALU): Consists of a dedicated 4-bit Adder chip, typically the 74LS283, which is the sole computational unit.
Control Unit and Decoder: This combinational logic (made up of various logic gates and decoder chips like 74LS138s) translates the 4-bit OpCode into all necessary control signals that orchestrate data flow throughout the datapath.

B. Detailed Analysis of Registers and State Elements

The TD4 uses a handful of key registers to manage program state and flow.

General Purpose Registers and System State

Registers A and B are the two general-purpose 4-bit registers, serving as primary sources and destinations for all data manipulation instructions. Their implementation leverages readily available counter chips. While these chips possess counting functionality, Registers A, B, and the Output Register (OUT) utilize them exclusively as parallel-load latches. This reuse of a single type of IC (the counter chip) across multiple system functions reduces the number of distinct parts to source, which is a common strategy in minimalist TTL designs. Data is loaded into these registers on the rising clock edge while the corresponding active-low signal (not_LOADA or not_LOADB) is asserted (held low); the parallel load of the 74LS161 is synchronous.
The Program Counter (PC) is also a 4-bit register, but it is distinct in its utilization. It holds the address (0 to 15) of the instruction currently being fetched from the ROM. Unlike Registers A and B, the PC utilizes the inherent counter functionality of its underlying TTL chip. After an instruction is fetched, the PC automatically increments (PC \leftarrow PC + 1) to point to the next sequential instruction.
The Carry Flag (C) is a crucial single-bit state element, typically implemented as a flip-flop, directly connected to the Carry-Out pin of the 4-bit Adder. This flag captures overflow conditions resulting from addition operations and is the only mechanism available for conditional branching via the JNC (Jump if No Carry) instruction. Finally, the Output Register (OUT) is a 4-bit latch that holds the data currently being displayed on the 4 output LEDs.

C. The Arithmetic Logic Unit (ALU) and Unified Execution

The design of the TD4’s computational core revolves around simplifying the ALU to its absolute minimum: a dedicated 4-bit adder. This strategic decision eliminates the need for complex control logic required to switch between operations like subtraction, AND, or OR. Instead, the TD4 relies on data routing and the identity property of addition to perform all necessary functions, including data movement.
The Adder chip (IC8) accepts two primary 4-bit inputs, Input_A and Input_B, and outputs the 4-bit sum along with the Carry flag.

Input_A: This input is fixed and always receives the Immediate Value (Imm), which is derived directly from the low nibble (bits 1-4) of the currently fetched instruction.
Input_B: This input is determined by the Data Selector, a multiplexing unit composed of chips like IC6 and IC7, which selects one of four possible operands.

The selection logic is controlled by two signals, SEL_A and SEL_B, generated by the command decoder.
Table III: Datapath Source Operand Selection for ALU Input

SEL_A	SEL_B	Selected Operand (Input_B)	Origin / Use Case
0	0	Register A Contents	Primary ALU input
0	1	Register B Contents	Secondary ALU input
1	0	Input Port Data (IN)	External 4-bit input
1	1	Hardwired Zero (0000)	Supplies 0 so that the adder passes Imm through unchanged (0 + Imm)
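
Table III can be read as a four-way multiplexer. The following is a minimal sketch of that selection; the function name `data_selector` and its argument names are assumptions made for illustration, and the bit assignments are copied from the table as printed.

```python
def data_selector(sel_a: int, sel_b: int,
                  reg_a: int, reg_b: int, in_port: int) -> int:
    """Return the 4-bit operand routed to the adder's Input_B."""
    sources = {
        (0, 0): reg_a,    # Register A contents
        (0, 1): reg_b,    # Register B contents
        (1, 0): in_port,  # external 4-bit input port
        (1, 1): 0,        # hardwired zero
    }
    return sources[(sel_a, sel_b)] & 0xF


# With A = 3, B = 9 and the input port reading 12:
for sel in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(sel, data_selector(*sel, reg_a=3, reg_b=9, in_port=12))
```

Only one of the four sources reaches the adder in any cycle, which is the root of the register-to-register addition restriction discussed later.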

The Adder Trick and Design Implication

The ability of the TD4 to perform data transfer (MOV) and input operations (IN) using only an adder is a demonstration of clever architectural simplification. The fundamental principle employed is the addition identity property, X + 0 = X.
To execute an instruction like MOV A, Imm, the control unit asserts SEL_A=1 and SEL_B=1, causing the Data Selector to select the Hardwired Zero signal. The adder then calculates 0 + Imm, and the result is clocked into Register A (A \leftarrow 0 + Imm). Similarly, to perform a register-to-register move like MOV A, B, the control unit selects Register B, and the programmer must ensure the instruction's immediate value is set to zero. The operation becomes A \leftarrow B + 0. This unified execution strategy eliminates the need for complex ALU circuitry that would require separate paths for data movement or specialized functional units. However, this minimalist design imposes a critical limitation: the TD4 cannot perform direct register-to-register addition (e.g., A \leftarrow A+B). Since the Immediate Value is fixed as Input_A, and the Data Selector can only supply one register (A or B) as Input_B, the architecture is physically incapable of simultaneously feeding two general-purpose register contents into the adder.
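
The identity trick is easy to check numerically. This is a small sketch, assuming a helper named `add4` (a name chosen here, not taken from the source) that models the 4-bit adder and its carry-out.

```python
def add4(input_a: int, input_b: int) -> tuple:
    """Model the 4-bit adder: return (sum modulo 16, carry-out)."""
    total = (input_a & 0xF) + (input_b & 0xF)
    return total & 0xF, total >> 4


reg_b = 0b1001  # Register B holds 9 in this example

# MOV A, 5: the Data Selector supplies zero, so the adder passes Imm through.
print(add4(0b0101, 0))   # (5, 0)

# MOV A, B with a zero immediate: the adder passes B through.
print(add4(0, reg_b))    # (9, 0)

# MOV A, B with a stray immediate of 3: A receives B + 3, not B.
print(add4(3, reg_b))    # (12, 0)

# ADD with overflow: 15 + 1 wraps to 0 and raises the carry.
print(add4(1, 15))       # (0, 1)
```

The third call shows why the immediate field must be zero for register moves: the adder has no way to ignore it.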

III. Control Flow: Instruction Format and Single-Cycle Execution

The TD4 manages its limited program flow and operational timing through a simple, single-cycle control unit.

A. The Control Unit and Signal Generation

The Control Unit is responsible for interpreting the 8-bit instruction fetched from the ROM and generating the requisite control signals necessary to govern the datapath for that specific instruction.
The Instruction Decoding process focuses on the 4-bit OpCode (bits 5-8). This OpCode is fed into the combinational decoder logic, which translates the instruction into the active-low not_LOADx signals and the Data Selector control bits (SEL_A and SEL_B). Because of the low component count (only 12 core chips), the OpCodes had to be mapped so as to minimize the complexity of this decoding logic. The designer (credited in some sources as Iku Watanabe, apparently an alternative reading of the name given above as Kaoru Tonami) reportedly mapped 14 of the 16 possible OpCodes to maximize instruction density while maintaining decoder simplicity.
The output of the command decoder includes the four active-low load signals: not_LOAD0 (Register A, written not_LOADA elsewhere in this tutorial), not_LOAD1 (Register B, not_LOADB), not_LOAD2 (Output Register), and not_LOAD3 (Program Counter, not_LOADPC). When one of these signals is low at the clock edge, the output of the adder is latched into the corresponding register.

B. Single-Cycle Instruction Timing

The TD4 utilizes a single-cycle CPU design, a critical architectural feature where the entire process of instruction execution (Fetch, Decode, Execute, and Write Back) is completed within a single clock period. This structure yields a Cycles Per Instruction (CPI) value of CPI=1 for every operation.
The operational sequence within one clock cycle is strictly sequential:

Instruction Fetch and Decode: The PC addresses the ROM, the 8-bit instruction is retrieved, and the Control Unit immediately generates stable control signals based on the OpCode.
Execute: The control signals stabilize the Data Selector's output (Input_B) and the Adder performs the arithmetic operation (Result \leftarrow Input_B + Imm).
Write Back: On the clock transition, the calculated result is simultaneously written to the designated destination register(s) if the corresponding not_LOADx signal is asserted low.
PC Update: Concurrently, the Program Counter is updated. For sequential execution, the PC is incremented. For jump instructions, the ALU output (the jump address) is loaded into the PC, overriding the increment.
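
The four phases above can be sketched as one `step` method that advances a software model by exactly one clock. This is a hedged illustration, not the kit's schematic: the class name `TD4`, the `DECODE` table layout, and the choice to update the carry only for the opcodes that Table IV marks as setting C are all assumptions of this sketch.

```python
from dataclasses import dataclass

# opcode -> (Data Selector source, destination), transcribed from Table IV.
DECODE = {
    0b0011: ("zero", "a"),    # MOV A, Imm
    0b0111: ("zero", "b"),    # MOV B, Imm
    0b0001: ("b", "a"),       # MOV A, B
    0b0100: ("a", "b"),       # MOV B, A
    0b0000: ("a", "a"),       # ADD A, Imm
    0b0101: ("b", "b"),       # ADD B, Imm
    0b0010: ("in", "a"),      # IN A
    0b1011: ("zero", "out"),  # OUT Imm
    0b1001: ("b", "out"),     # OUT B
    0b1111: ("zero", "pc"),   # JMP Imm
    0b1110: ("zero", "jnc"),  # JNC Imm
}
# Assumption: only the opcodes Table IV annotates with "Sets C Flag".
SETS_CARRY = {0b0000, 0b0101, 0b0010}


@dataclass
class TD4:
    a: int = 0
    b: int = 0
    out: int = 0
    pc: int = 0
    carry: int = 0
    in_port: int = 0

    def step(self, rom: list) -> None:
        """Advance the model by exactly one clock cycle."""
        # 1. Fetch and decode (an unmapped opcode raises KeyError).
        instruction = rom[self.pc]
        opcode, imm = instruction >> 4, instruction & 0xF
        source, dest = DECODE[opcode]

        # 2. Execute: Result <- Input_B + Imm.
        input_b = {"a": self.a, "b": self.b,
                   "in": self.in_port, "zero": 0}[source]
        total = input_b + imm
        result, carry_out = total & 0xF, total >> 4

        # 3. and 4. Write back and PC update share the same clock edge.
        # JNC tests the carry left by the previous instruction.
        jump = dest == "pc" or (dest == "jnc" and self.carry == 0)
        if dest == "a":
            self.a = result
        elif dest == "b":
            self.b = result
        elif dest == "out":
            self.out = result
        if opcode in SETS_CARRY:
            self.carry = carry_out
        self.pc = result if jump else (self.pc + 1) & 0xF


rom = [0b0011_0111,   # 0: MOV A, 7
       0b0000_1010,   # 1: ADD A, 10  (7 + 10 = 17, so A = 1 and C = 1)
       0b1110_0000,   # 2: JNC 0      (not taken, because C = 1)
       0b1111_0011]   # 3: JMP 3      (spin in place)
cpu = TD4()
for _ in range(4):
    cpu.step(rom)
    print(cpu)
```

Every instruction costs one call to `step`, which is the software analogue of CPI=1.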

The simplicity of the single-cycle design makes it easy to implement, but it imposes a constraint on performance. The time allocated for one clock cycle must be sufficient for the critical path: the longest time required for signals to propagate from the ROM, through the decoder and multiplexers, and finally through the adder before the result is stable at the register input. Because this propagation delay spans multiple levels of TTL logic, it sets an upper bound on the clock frequency. In practice the physical TD4 kit is clocked far below that bound, using manual stepping or a clock slow enough for the LEDs to be followed by eye, which is appropriate for its didactic purpose.

IV. Instruction Set Architecture (ISA) Deep Dive and Implementation Nuance

The TD4 ISA consists of a limited set of instructions, all of which must adhere to the physical datapath constraint that every operation passes through the adder, always incorporating the immediate value.

A. The Complete Mapped OpCode List

The following table details the most common TD4 instructions, showing how each high-level instruction is physically executed at the Register Transfer Level (RTL) using the fixed calculation pattern Result \leftarrow Input_B + Imm.
Table IV: TD4 Core Instruction Set (Implementation Focus)

Mnemonic	OpCode (Bits 8-5)	Operand Type (Bits 4-1)	Data Selector Input	Execution Detail (RTL)
MOV A, Imm	0011	Immediate (Imm)	Zero (0000)	A \leftarrow 0 + Imm
MOV B, Imm	0111	Immediate (Imm)	Zero (0000)	B \leftarrow 0 + Imm
MOV A, B	0001	Requires Zero (0000)	Register B	A \leftarrow B + 0
MOV B, A	0100	Requires Zero (0000)	Register A	B \leftarrow A + 0
ADD A, Imm	0000	Immediate (Imm)	Register A	A \leftarrow A + Imm (Sets C Flag)
ADD B, Imm	0101	Immediate (Imm)	Register B	B \leftarrow B + Imm (Sets C Flag)
IN A	0010	Requires Zero (0000)	Input Port (IN)	A \leftarrow Input + 0 (Sets C Flag)
OUT Imm	1011	Immediate (Imm)	Zero (0000)	OUT \leftarrow 0 + Imm
OUT B	1001	Requires Zero (0000)	Register B	OUT \leftarrow B + 0
JMP Imm	1111	Immediate (Imm)	Zero (0000)	PC \leftarrow 0 + Imm
JNC Imm	1110	Immediate (Imm)	Zero (0000)	If C=0, PC \leftarrow 0 + Imm; else PC \leftarrow PC + 1

This table clearly illustrates the critical design compromise: for instructions that transfer data from a source (like Register B or the Input Port) to a destination, the 4-bit Immediate field must be zeroed by the programmer to avoid unintended arithmetic side effects.
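
One way to make the zeroed-immediate requirement hard to forget is to let a small assembler enforce it. The sketch below is an assumption-laden illustration: the function `assemble_line` and its accepted syntax (an optional `#` before constants) are invented here, and only the opcodes listed in Table IV are recognized.

```python
# (mnemonic, operands...) -> opcode, transcribed from Table IV.
OPCODES = {
    ("MOV", "A", "IMM"): 0b0011,
    ("MOV", "B", "IMM"): 0b0111,
    ("MOV", "A", "B"): 0b0001,
    ("MOV", "B", "A"): 0b0100,
    ("ADD", "A", "IMM"): 0b0000,
    ("ADD", "B", "IMM"): 0b0101,
    ("IN", "A"): 0b0010,
    ("OUT", "IMM"): 0b1011,
    ("OUT", "B"): 0b1001,
    ("JMP", "IMM"): 0b1111,
    ("JNC", "IMM"): 0b1110,
}


def assemble_line(line: str) -> int:
    """Translate one assembly line into an 8-bit TD4 instruction."""
    imm = 0  # stays zero for forms that take no constant
    key = []
    for token in line.replace(",", " ").upper().split():
        digits = token.lstrip("#")
        if digits.isdigit():
            imm = int(digits)
            if imm > 15:
                raise ValueError(f"immediate does not fit in 4 bits: {line!r}")
            key.append("IMM")
        else:
            key.append(token)
    try:
        opcode = OPCODES[tuple(key)]
    except KeyError:
        raise ValueError(f"not an instruction from Table IV: {line!r}") from None
    return (opcode << 4) | imm


for source in ["MOV A, #5", "MOV A, B", "OUT B", "JMP 1"]:
    print(f"{source:10} {assemble_line(source):08b}")

try:
    assemble_line("OUT A")
except ValueError as error:
    print(error)
```

Because the immediate starts at zero and is only overwritten by an explicit constant, register-source forms such as `MOV A, B` always assemble with a clean low nibble.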

B. Implementation of Control Flow: Jumps

Control flow in the TD4 is rudimentary, managed solely by two jump instructions that directly manipulate the Program Counter.

Unconditional Jump (JMP Imm)

The JMP Imm instruction is implemented using the standard ALU path. The OpCode causes the control unit to select the Hardwired Zero input for the adder's Input_B, resulting in the calculation 0 + Imm. Crucially, the control unit simultaneously asserts the not_LOADPC signal. When the clock edge arrives, the immediate value, which represents the target address, is loaded directly into the PC register, overriding the sequential increment.

Conditional Jump (JNC Imm)

The JNC Imm (Jump if No Carry) is the TD4’s only means of decision-making. This instruction relies on the state of the single Carry Flag (C), which is set if the previous arithmetic operation resulted in an overflow (a value greater than 15, or 1111_2). The address calculation itself is identical to the unconditional jump (0 + Imm). However, the control unit gates the not_LOADPC signal using the C flag.

If C=0 (No Carry), the not_LOADPC signal is allowed to assert low, and the immediate value is loaded into the PC (PC \leftarrow Imm).
If C=1 (Carry Set), the not_LOADPC signal is disabled, preventing the jump address from being loaded. The PC proceeds with its sequential increment (PC \leftarrow PC + 1).
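
The gating of the PC load signal can be written as a two-line truth function. This is a sketch under the assumption that only the two jump opcodes from Table IV ever pull the signal low; the function names are illustrative.

```python
JMP, JNC = 0b1111, 0b1110  # opcodes from Table IV


def not_load_pc(opcode: int, carry: int) -> int:
    """Active-low PC load signal: 0 means the jump target is loaded."""
    if opcode == JMP:
        return 0
    if opcode == JNC and carry == 0:
        return 0
    return 1


def next_pc(pc: int, opcode: int, imm: int, carry: int) -> int:
    """Return the PC value after the clock edge."""
    if not_load_pc(opcode, carry) == 0:
        return imm & 0xF       # PC <- 0 + Imm
    return (pc + 1) & 0xF      # sequential increment, wrapping at 16


print(next_pc(3, JNC, 9, carry=0))   # 9: no carry, jump taken
print(next_pc(3, JNC, 9, carry=1))   # 4: carry set, fall through
print(next_pc(3, JMP, 9, carry=1))   # 9: unconditional
```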

V. Practical Programming Tutorial and Assembly Examples

Programming the TD4, especially within the 16-instruction limit, is an exercise in extreme optimization, requiring the programmer to fully understand the architectural constraints.

A. Assembly Conventions and Register Mapping

While the TTL TD4 kit uses Registers A and B, software emulators such as Tiny4CPU (T4C) often use mnemonics like LDX and ADY corresponding to Registers X and Y. Regardless of the naming convention, the instruction set relies on an 8-bit, fixed-format machine code. The most vital programming convention, derived directly from the unified execution strategy, is the Zero Immediate Rule: any instruction whose source is a register or the input port rather than an immediate constant (MOV A, B, MOV B, A, IN A, and OUT B) must have the low nibble of its instruction set to 0000_2. Failure to adhere to this rule results in unwanted addition, which corrupts the destination register. Jump instructions are not covered by this rule, because their low nibble carries the target address.

B. Essential Programming Constructs

The following examples, while conceptualized in assembly, demonstrate the underlying machine code requirements necessary for execution on the physical TD4.

Example 1: Looping and Incrementation

This simple program continuously increments Register B and outputs its value to the LEDs, creating a running counter effect. The counter is kept in B because Table IV provides an OUT B form (OpCode 1001) but no OUT A form. It is adapted from the T4C example provided in the documentation.

Address (PC)	Instruction	Mnemonic	Comment (RTL)
0000	0111 0000	MOV B, #0	B \leftarrow 0 + 0. Initialize B to zero.
0001	0101 0001	ADD B, #1	B \leftarrow B + 1. Increment B.
0010	1001 0000	OUT B	OUT \leftarrow B + 0. Output B content.
0011	1111 0001	JMP 1	PC \leftarrow 0 + 1. Loop back to address 1.
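
To see the counter run without hardware, a few lines of Python suffice. This sketch models only the four opcodes the loop needs and keeps the counter in Register B, since OpCode 1001 is listed in Table IV as OUT B; the function name `run` and the decision to record every value latched into OUT are assumptions of the sketch.

```python
def run(program, cycles):
    """Run TD4 machine code and return every value latched into OUT."""
    rom = list(program) + [0x00] * (16 - len(program))  # pad unused slots
    b = pc = 0
    shown = []
    for _ in range(cycles):
        opcode, imm = rom[pc] >> 4, rom[pc] & 0xF
        next_pc = (pc + 1) & 0xF
        if opcode == 0b0111:      # MOV B, Imm
            b = imm
        elif opcode == 0b0101:    # ADD B, Imm (4-bit wraparound)
            b = (b + imm) & 0xF
        elif opcode == 0b1001:    # OUT B (Imm is expected to be zero)
            shown.append((b + imm) & 0xF)
        elif opcode == 0b1111:    # JMP Imm
            next_pc = imm
        else:
            raise ValueError(f"opcode {opcode:04b} is not modelled here")
        pc = next_pc
    return shown


counter = [
    0b0111_0000,  # 0: MOV B, 0
    0b0101_0001,  # 1: ADD B, 1
    0b1001_0000,  # 2: OUT B
    0b1111_0001,  # 3: JMP 1
]
# One setup cycle plus 17 passes through the three-instruction loop.
print(run(counter, 1 + 3 * 17))
```

The printed sequence climbs from 1 to 15 and then wraps to 0, which is the 4-bit overflow the LEDs would show on the kit.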

Example 2: Implementing Subtraction

The absence of a dedicated subtractor means that subtraction, A - B, must be achieved through two's complement addition, $A + (\bar{B} + 1)$. On the TD4, forming the two's complement $(\bar{B} + 1)$ of an arbitrary register value would require inverting logic that is not present in the base design.
If the goal is merely to subtract a small constant (e.g., A \leftarrow A - 1), the two's complement of that constant is pre-calculated by the programmer and encoded as the immediate. For instance, subtracting 1 (which is 0001_2) requires adding the two's complement of 1, which is 1111_2 (15).

Operation: A \leftarrow A - 1
Implementation: ADD A, #15 (machine code 0000 1111: OpCode 0000, Imm 1111)

This illustrates the challenge: complex operations require pre-computation and consume multiple instruction slots, a severe constraint given the 16-instruction limit. Modern T4C emulators often abstract this by defining dedicated subtraction mnemonics like SUX and SUY.
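
The pre-computation of the immediate can be sketched as follows. The helper names are assumptions made for illustration; the arithmetic is ordinary 4-bit two's complement.

```python
def twos_complement4(value: int) -> int:
    """Return the 4-bit two's complement of value (invert, add one)."""
    return (~value + 1) & 0xF


def subtract_via_add(a: int, k: int) -> tuple:
    """Compute a - k as ADD A, #(16 - k); return (result, carry-out)."""
    total = (a & 0xF) + twos_complement4(k)
    return total & 0xF, total >> 4


print(twos_complement4(1))       # 15, the immediate for "subtract 1"
print(subtract_via_add(5, 1))    # (4, 1)
print(subtract_via_add(0, 1))    # (15, 0): 0 - 1 wraps around to 15
```

For a nonzero constant the carry behaves as an inverted borrow: it is 1 when the subtraction did not go below zero and 0 when it wrapped. That makes ADD with a complemented constant usable as a comparison ahead of JNC.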

Example 3: Conditional Termination

Writing a program that executes a limited number of steps (e.g., count to 10 and stop) is challenging because the only conditional operation is JNC (Jump if No Carry).
To check if Register A has reached a target value, T, the standard procedure would be to perform A - T and check the resulting Carry/Zero flags. Since true subtraction is difficult, a workaround must be used. For example, to check if A = 10 (1010_2), one might add a constant X such that A + X causes an overflow only when A \ge 10. If A = 10, then adding X = 6 (0110_2) yields 1010 + 0110 = 10000, setting C=1.
The typical flow is:

ADD A, #6 (A now holds a modified value, and C is set only if A was at least 10)
JNC Loop (Jump back to continue counting if C=0)
JMP Halt (If C=1, jump to halt)
Caveat: Because the ADD instruction overwrote A's value, further instructions are required to subtract 6 (i.e., add 10) and restore A before looping, consuming more of the scarce ROM space.
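
The threshold trick and the restore step can be checked exhaustively over all 16 register values. This is a sketch using an assumed helper `add4`; the constants follow the worked example in the text (target 10, probe constant 6).

```python
def add4(value: int, imm: int) -> tuple:
    """Model ADD: return (sum modulo 16, carry-out)."""
    total = (value & 0xF) + (imm & 0xF)
    return total & 0xF, total >> 4


THRESHOLD = 10
PROBE = 16 - THRESHOLD  # 6: overflows exactly when A >= 10

for a in range(16):
    probed, carry = add4(a, PROBE)                      # ADD A, 6
    restored, restore_carry = add4(probed, THRESHOLD)   # ADD A, 10
    assert carry == (1 if a >= THRESHOLD else 0)
    assert restored == a                 # adding 6 then 10 is adding 16
    assert restore_carry == 1 - carry    # the restoring ADD flips the flag
    print(f"A={a:2d}  after +6: {probed:2d} C={carry}  "
          f"after +10: {restored:2d} C={restore_carry}")
```

The last assertion is worth noting: the restoring ADD itself rewrites C with the opposite outcome, so a JNC placed after the restore would branch when A is at least 10 rather than when it is below 10. One possible alternative, not taken from the source, is to test a copy instead (MOV B, A followed by ADD B, #6), which leaves A untouched at the cost of overwriting B.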

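This architectural reality demonstrates that although the TD4 supports loops and conditional branches, it is in practice a small finite-state machine rather than a general-purpose computer: with no data memory, its practical application is limited by the 16-instruction ROM, the two 4-bit registers, and the reliance on a single Carry flag for control logic.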

VI. Extended Applications and Architecture Modification

The TD4 serves as a strong foundation for exploring digital design concepts, both through simulation and physical modification. The inherent limitations of the basic design naturally lead to advanced projects focused on expansion and modernization.

A. Simulation and Modeling in the Modern Context

The minimalist and well-defined structure of the TD4 makes it an ideal target for modeling using modern computer engineering tools.

HDL Implementation and Verification

The entire TD4 design, being based on simple combinational logic and flip-flops, can be precisely translated into a Hardware Description Language (HDL) such as Verilog or VHDL. This conversion process allows researchers and students to use simulation environments (like Icarus Verilog and GTKWave) to verify the logical correctness of the design against the physical TTL schematic. Once validated, the HDL model can be synthesized and loaded onto Field-Programmable Gate Arrays (FPGAs), enabling the TD4 architecture to run on modern, high-speed hardware. This transition from discrete TTL chips to programmable fabric demonstrates how digital systems scale without changing the underlying architecture.

Visual Modeling

Tools like Logisim are commonly used to create visual, circuit-level simulations of the TD4. This visual modeling provides an interactive, digital equivalent to the "blinkenlight" experience, allowing users to observe signal flow and register contents during each clock cycle, reinforcing the concepts learned from the physical hardware.

B. Addressing Core Hardware Limitations

As soon as a programmer attempts any task beyond a simple counter, two major architectural restrictions become immediately apparent:

Lack of General Data RAM: The original TD4 contains no dedicated writable data memory. All intermediate state must be stored in the two general-purpose registers (A and B) or the output latch. This limitation prohibits data array processing, subroutine calls requiring a stack, or any form of complex memory management.
Inability to Perform Register-to-Register Arithmetic: As established, the datapath architecture, which pipes only one selected operand (Input_B) and the immediate value (Input_A) into the adder, makes operations like A \leftarrow A+B impossible. This restriction forces all arithmetic to involve an immediate constant.

C. Design Extensions and Future TD4 Variants

The TD4 is often modified by developers seeking to overcome its limitations, transitioning its structure toward that of a conventional microcomputer. The following extensions are frequently explored:

Program Counter Expansion

The 16-instruction limit is the most restrictive boundary for programming. To expand the program space, the Program Counter (PC) can be increased from 4 bits to 8 bits. This means replacing or chaining the 4-bit counter ICs used for the PC and widening the address lines to reach 2^8 = 256 memory locations. The instruction memory must change as well, moving from DIP switches to an external ROM chip or a switchable memory bank. Because the immediate field remains 4 bits wide, jump instructions would also need a wider target encoding (or some form of page selection) to reach addresses above 15.

Data RAM Integration

The addition of writable data memory is critical for practical use. Implementing this requires two key changes: introducing a dedicated Memory Address Register (MAR) and creating new instruction OpCodes (e.g., LOAD and STORE). These new instructions must utilize the immediate field not as data, but as a 4-bit memory address, controlling the read/write cycles of an external RAM chip. This also requires incorporating tristate buffers to manage the bidirectional data flow between the CPU registers and the RAM data bus.

Bus Width Expansion

While more complex, the entire datapath (registers, ALU, buses) can be upgraded from 4 bits to 8 bits. This involves replacing all 4-bit components with 8-bit equivalents (e.g., using pairs of 74LS283 adders) and redesigning the control logic to handle the wider data paths. This significantly increases the computational power, allowing numbers up to 255 to be processed, but substantially increases the chip count and design complexity.
The trajectory of modifying the TD4 demonstrates a fundamental principle in computer architecture: the limitations of the initial Finite-State Machine with Data Path (FSMD) structure necessitate complexity—such as multi-cycle execution or a larger register file—to gain practical utility. The TD4’s greatest value is thus realized when it forces the student to confront these trade-offs and engineer solutions to its minimal architecture.

Conclusions

The TD4 4-bit CPU stands as a masterpiece of constraint-driven TTL logic design, successfully achieving its core mission as an educational tool for computer architecture. Its fully exposed, observable operation demystifies the core fetch-decode-execute instruction cycle.
The entire architecture is fundamentally dictated by minimizing components, a design choice leading to a unified execution unit based solely on a 4-bit adder. This strategic decision necessitates that all data movement and arithmetic operations conform to the form Result \leftarrow Input_B + Imm. Consequently, the system operates as a single-cycle CPU (CPI=1), whose maximum frequency is determined by the propagation delay of the TTL components along the critical datapath.
The architectural compromises are pronounced: an extremely limited 16-instruction program ROM and the physical inability to perform operations requiring simultaneous access to two general-purpose registers (e.g., A+B). Programming requires meticulous adherence to the instruction format: instructions whose source is a register or the input port must carry a zero immediate value to avoid data corruption.
For advanced understanding, the TD4 provides a fertile ground for expansion. Overcoming the limits of address space, memory access, and computational complexity demands architectural evolution, such as expanding the PC to 8 bits or integrating dedicated data RAM. These modifications require a transition away from the purest minimalist design, reinforcing the real-world trade-offs between simplicity, chip count, and computational power.
