HEVC 2D-DCT architectures comparison for FPGA and ASIC implementations



Introduction
The latest video coding standard, developed by the Joint Collaborative Team on Video Coding (JCT-VC), is known as High Efficiency Video Coding (HEVC). HEVC is the successor to the previous standard, Advanced Video Coding (AVC/H.264) [1]. The architecture of HEVC is based on a block-based hybrid video coding approach [2]. HEVC uses variable transform unit (TU) sizes for the DCT, namely 4x4, 8x8, 16x16 and 32x32, and also the discrete sine transform (DST) with size 4x4. HEVC achieves roughly double the compression efficiency of the AVC/H.264 standard [1,3]. In contrast, the TU sizes for the DCT in AVC/H.264 are limited to at most 8x8, as implemented in [4][5][6][7]. The function of the DCT is to reduce redundancy by transforming data from the spatial domain into the spectral domain, and it is widely used in image and video compression technology. In this paper, the main goal is to design 2D-DCT architectures for transform sizes of 4x4, 8x8, and 16x16 and to evaluate which architectures are most suitable for FPGA and ASIC platforms.
Meher et al. [8] proposed a reusable integer DCT architecture that provides the same throughput of 32 output coefficients per cycle for all TU sizes, at the cost of a higher gate count. They also proposed folded and parallel HEVC 2D-DCT architectures for 8K and 4K video applications. Basiri et al. [9] proposed a multiplier unit with a configurable carry save adder (CSA) tree, implemented in a 32-point 1D integer DCT architecture. Their 1D-DCT architecture also uses the parallel and folded designs, and the 32x32-point parallel architecture shows a good improvement when using a 45nm CMOS TSMC library. Tikekar et al. [10] used multiplierless multiple constant multiplication (MCM) instead of regular multipliers to reduce area overhead, and applied data gating to improve power efficiency. The folded and parallel architectures and other specialized design techniques described in these papers are implemented in the present paper. Apart from that, a separable architecture has been implemented for the variable-length HEVC DCT, using registers and a transposition memory as the block structure of the 2D-DCT [11].

TELKOMNIKA Vol. 17, No. 5, October 2019: 2457-2464, ISSN: 1693-6930
There is relatively little work on DCT implementation on FPGA [12]. One such work is given in [13], which implements and explores the design space of the full HEVC DCT. The design covers all valid DCT sizes and also the 4x4 DST [14,15]. However, it is implemented at a high level and includes various architectural optimizations such as actor merging, pipelining, etc. [16].
Based on [17], the gap between FPGAs and ASICs is measured in terms of area, performance and power consumption. In terms of dynamic power consumption, FPGAs consume approximately 14 times more than ASICs. While this gap is generally well known, suitable DCT architectures for FPGA or ASIC have not been studied in detail. Most of the works in the literature target ASIC technologies, where it is shown that a parallel architecture results in the highest performance, with a tradeoff in area and power. A folded architecture, on the other hand, has been shown to reduce size and power at the expense of performance. Due to the high performance of the parallel architecture in ASIC, its energy efficiency is also generally better. However, for FPGA a different outcome is expected due to the less predictable placement and routing compared to ASIC, especially for large transform sizes. Thus the present paper provides experimental results comparing energy efficiency for small (4x4 and 8x8) and large (16x16) transforms for FPGA and ASIC implementations.
This paper is organized as follows. In section 2, the theory of the DCT is described. In section 3, the parallel and folded architectures are presented. In section 4, the ASIC and FPGA results are assessed in terms of energy per block, maximum frequency, throughput, power consumption, area and gate count. Section 5 concludes the paper.

DCT Theory
Equations (1) and (2) give the basic N-point one-dimensional (1D) DCT as defined in [18,19], relating the input data to the output data. For N=4, the 1D-DCT can be written in matrix form as given in (3), where c is the constant matrix. One property of the 2D-DCT is that it is separable: it can be computed as a column-wise 1D-DCT followed by a row-wise 1D-DCT, or vice versa [2]. Figure 1 shows an example of the row and column process of a 4x4 2D-DCT [9]. The row process comes first: the 1D-DCT is performed on each row of the input matrix, and the intermediate results are stored row by row in the transposition buffer matrix. Next, for the column process, the 1D-DCT is performed again, column by column, on the transposition buffer matrix. The results of the column process form the 2D-DCT output.
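The row and column process described above can be sketched in a few lines of Python. This is only an illustration: it assumes the common orthonormal floating-point DCT-II definition, which may differ in scaling from equations (1) and (2) of this paper, and all function names are hypothetical.

```python
import math

def dct_matrix(n):
    """Orthonormal N-point DCT-II matrix (assumed form, floating point)."""
    rows = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        rows.append([scale * math.cos((2 * i + 1) * k * math.pi / (2 * n))
                     for i in range(n)])
    return rows

def dct_1d(c, vec):
    """1D-DCT of one vector: matrix-vector product with the constant matrix."""
    return [sum(c[k][i] * vec[i] for i in range(len(vec))) for k in range(len(c))]

def transpose(m):
    return [list(col) for col in zip(*m)]

def dct_2d_row_column(block):
    """Row process, transposition buffer, then column process."""
    c = dct_matrix(len(block))
    buffer = [dct_1d(c, row) for row in block]            # row process
    cols = [dct_1d(c, col) for col in transpose(buffer)]  # column process
    return transpose(cols)                                # back to row-major order

def dct_2d_direct(block):
    """Reference result computed directly as Y = C X C^T."""
    n = len(block)
    c = dct_matrix(n)
    cx = [[sum(c[k][i] * block[i][j] for i in range(n)) for j in range(n)]
          for k in range(n)]
    return [[sum(cx[k][j] * c[l][j] for j in range(n)) for l in range(n)]
            for k in range(n)]
```

Both functions return the same coefficients, which is the separability property that the parallel and folded architectures rely on.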
The elements of the forward transform matrix are specified in the HEVC standard [2,20]. The smaller transform matrices can be derived from the 32x32 matrix, as shown in (4).
Let $c_{4\times 4}$ denote the 4x4 transform matrix. Based on [2,21], the elements of $c_{4\times 4}$ can be obtained as given in (5). By exploiting the symmetry property of the transform matrix, the number of necessary computations can be reduced by breaking (5) down into Even and Odd parts [22]. For the even matrix, the 0th and 2nd lines are selected in the horizontal and vertical directions, giving its rows and columns respectively. For the odd matrix, the rows and columns are selected from the 1st and 3rd lines in the horizontal and vertical directions. The Even and Odd parts for the 4x4 matrix used in this work can be computed in matrix form as shown in (6) and (7), and the 1D-DCT output as shown in (8).
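As an illustration of the Even and Odd split, one common way to write it for N=4 is sketched below. The coefficient values are the 4-point forward transform coefficients of the HEVC standard; the symbols $x_i$, $y_k$, $e_i$ and $o_i$ are assumptions introduced here, and the exact arrangement used in equations (5) to (8) of this work may differ while being algebraically equivalent.

```latex
\[
c_{4\times 4} =
\begin{bmatrix}
64 & 64 & 64 & 64 \\
83 & 36 & -36 & -83 \\
64 & -64 & -64 & 64 \\
36 & -83 & 83 & -36
\end{bmatrix}
\]
% Even and odd terms formed from the symmetric input pairs
\[
e_0 = x_0 + x_3,\quad e_1 = x_1 + x_2,\qquad
o_0 = x_0 - x_3,\quad o_1 = x_1 - x_2
\]
% Even outputs use only e, odd outputs use only o
\[
\begin{bmatrix} y_0 \\ y_2 \end{bmatrix}
= \begin{bmatrix} 64 & 64 \\ 64 & -64 \end{bmatrix}
\begin{bmatrix} e_0 \\ e_1 \end{bmatrix},
\qquad
\begin{bmatrix} y_1 \\ y_3 \end{bmatrix}
= \begin{bmatrix} 83 & 36 \\ 36 & -83 \end{bmatrix}
\begin{bmatrix} o_0 \\ o_1 \end{bmatrix}
\]
```

In this form the 16 constant multiplications of the direct matrix product reduce to 8, at the cost of four extra additions and subtractions on the inputs.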
By using the same method, the equations for the 8x8, 16x16 and 32x32 transform blocks can also be derived [2].
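A minimal Python sketch of how the same method extends to the 8-point case is given below. It assumes the 8-point coefficients of the HEVC standard, reuses a 4-point even/odd stage for the even outputs, and checks the result against the direct matrix product. The function names are hypothetical, and the intermediate scaling and rounding of a real HEVC transform are left out.

```python
# 8-point HEVC forward transform matrix (coefficients from the standard).
C8 = [
    [64, 64, 64, 64, 64, 64, 64, 64],
    [89, 75, 50, 18, -18, -50, -75, -89],
    [83, 36, -36, -83, -83, -36, 36, 83],
    [75, -18, -89, -50, 50, 89, 18, -75],
    [64, -64, -64, 64, 64, -64, -64, 64],
    [50, -89, 18, 75, -75, -18, 89, -50],
    [36, -83, 83, -36, -36, 83, -83, 36],
    [18, -50, 75, -89, 89, -75, 50, -18],
]

# Odd part of the 8-point transform: first four columns of rows 1, 3, 5, 7.
ODD8 = [row[:4] for row in C8[1::2]]

def dct4_even_odd(x):
    """4-point transform using the Even/Odd split (8 multiplications)."""
    e0, e1 = x[0] + x[3], x[1] + x[2]
    o0, o1 = x[0] - x[3], x[1] - x[2]
    return [64 * e0 + 64 * e1, 83 * o0 + 36 * o1,
            64 * e0 - 64 * e1, 36 * o0 - 83 * o1]

def dct8_even_odd(x):
    """8-point transform: even outputs reuse the 4-point stage."""
    even_in = [x[i] + x[7 - i] for i in range(4)]
    odd_in = [x[i] - x[7 - i] for i in range(4)]
    y = [0] * 8
    y[0::2] = dct4_even_odd(even_in)                       # outputs 0, 2, 4, 6
    y[1::2] = [sum(c * v for c, v in zip(row, odd_in))     # outputs 1, 3, 5, 7
               for row in ODD8]
    return y

def dct8_direct(x):
    """Reference: full 8x8 matrix-vector product."""
    return [sum(c * v for c, v in zip(row, x)) for row in C8]
```

The nesting is the point of the sketch: each larger transform contains the next smaller one as its even part, so only the odd part adds new multipliers.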

2D-DCT Architecture
This section describes the 2D-DCT for the 4x4 TU block size. In this work, the 4x4 1D-DCT is built by combining four 4-point DCT modules placed side by side. The structure of the 1D-DCT is depicted in Figure 2. The complete 1D-DCT for a 4x4 TU has 16 input/output signals. For the larger TU sizes of 8x8, 16x16 and 32x32, it has 64, 256 and 1024 input/output signals respectively. Note that serializer-deserializer (SERDES) modules can be used to stream the inputs and outputs. In addition, the 8x8, 16x16, and 32x32 sizes need eight 8-point, sixteen 16-point, and thirty-two 32-point DCT modules respectively to perform the complete 1D-DCT. The HEVC 2D-DCT architectures in this work are the parallel and folded 2D-DCT. Details on the architectures can be found in [8][9][10] and [23][24][25][26]. The architectures are also discussed briefly in this section.
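The signal and module counts quoted above follow directly from the block size, as the short sketch below shows. The function name is hypothetical, and the 16-bit word length is the value stated in the results section of this paper.

```python
def dct_1d_structure(n, word_bits=16):
    """Size of an NxN 1D-DCT stage built from N parallel N-point modules."""
    return {
        "point_modules": n,              # N modules, each an N-point DCT
        "io_signals": n * n,             # one signal per sample of the block
        "io_bits": n * n * word_bits,    # total width if all samples are parallel
    }

for size in (4, 8, 16, 32):
    print(size, dct_1d_structure(size))
```

The rapidly growing total I/O width is one reason a SERDES is useful for streaming the block in and out.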

Parallel Architecture
Figure 2 shows the parallel 4x4 2D-DCT architecture. It consists of two 1D-DCT modules, where the first 1D-DCT module performs the row process and the second performs the column process. In this architecture, the register-based transpose memory is not used. All the input signals are fed simultaneously into the first 1D-DCT module, which performs the addition, subtraction, and multiplication operations. The results of the first 1D-DCT module, that is the row process, are transposed directly into the inputs of the next 1D-DCT module, that is the column process. The second 1D-DCT module performs the same operations as the first one. The results of the 2D-DCT are obtained from the column process in the second 1D-DCT module.
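A behavioural sketch of this data flow in Python may help. It is not the Verilog design of this work: it assumes the 4-point HEVC coefficients, omits the intermediate scaling and rounding, and uses hypothetical function names. The transposition is modelled as fixed wiring between two separate module instances.

```python
# 4-point HEVC forward transform coefficients.
C4 = [
    [64, 64, 64, 64],
    [83, 36, -36, -83],
    [64, -64, -64, 64],
    [36, -83, 83, -36],
]

def dct_1d_module(block):
    """One 1D-DCT module: four 4-point DCTs, one per row of the 4x4 input."""
    return [[sum(c * v for c, v in zip(coef_row, row)) for coef_row in C4]
            for row in block]

def transpose_wiring(block):
    """Fixed rewiring of the 16 signals; no storage element is involved."""
    return [list(col) for col in zip(*block)]

def parallel_2d_dct(block):
    row_result = dct_1d_module(block)            # first module: row process
    col_input = transpose_wiring(row_result)     # direct transposition
    col_result = dct_1d_module(col_input)        # second module: column process
    return transpose_wiring(col_result)          # restore row-major order
```

For a block of all ones, only the DC coefficient is non-zero, which gives a quick sanity check of the model.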

Folded Architecture
The folded 4x4 2D-DCT architecture is shown in Figure 3. This architecture consists of one 1D-DCT module, a register-based transpose memory, and several multiplexers. The single 1D-DCT block is used to perform both the row and the column process. The multiplexers, driven by a control signal, select the data source in this circuit. All the input signals are fed simultaneously into the multiplexers.
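The reuse of a single 1D-DCT module can be sketched behaviourally as follows. This is an illustrative model under assumptions, not the circuit of Figure 3: the state names `ROW_PASS` and `COLUMN_PASS` are hypothetical, the 4-point HEVC coefficients are assumed, and intermediate scaling is omitted.

```python
# 4-point HEVC forward transform coefficients.
C4 = [
    [64, 64, 64, 64],
    [83, 36, -36, -83],
    [64, -64, -64, 64],
    [36, -83, 83, -36],
]

def dct_1d_module(block):
    """The single shared 1D-DCT module (four 4-point DCTs in parallel)."""
    return [[sum(c * v for c, v in zip(coef_row, row)) for coef_row in C4]
            for row in block]

def transpose(block):
    return [list(col) for col in zip(*block)]

def folded_2d_dct(block):
    """Two passes through one module; returns (result, module uses)."""
    transpose_mem = None   # register-based transpose memory
    module_uses = 0
    result = None
    for state in ("ROW_PASS", "COLUMN_PASS"):   # hypothetical control states
        select_memory = (state == "COLUMN_PASS")
        # Multiplexer: external input on the row pass, memory on the column pass.
        mux_out = transpose_mem if select_memory else block
        stage_out = dct_1d_module(mux_out)
        module_uses += 1
        if state == "ROW_PASS":
            transpose_mem = transpose(stage_out)   # store transposed row results
        else:
            result = transpose(stage_out)          # final 2D-DCT coefficients
    return result, module_uses
```

The model produces the same coefficients as a two-module design but needs two passes through the module per block, which reflects the area versus throughput tradeoff between the folded and parallel architectures.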

Results and Analysis
The 2D-DCT architectures for the 4x4, 8x8 and 16x16 TU sizes have been designed in Verilog HDL and implemented in the Silterra 180nm technology process for ASIC and on Xilinx Kintex Ultrascale for FPGA. The simulation results obtained from Xilinx Vivado for FPGA are compared with those obtained from Synopsys Design Compiler to ensure that the results are correct. SERDES modules have been used to serialize and deserialize the parallel inputs and outputs. The word length of each input/output pixel of the 1D-DCT and 2D-DCT is 16 bits. This section discusses energy per block and other performance measures of the ASIC and FPGA designs for both the parallel and folded architectures.

Energy per Block
Figure 5 and Figure 6 plot the energy per block for FPGA and ASIC respectively. Energy is calculated using the formula E=Pt, where P is the total power and t is the time to process a single block. The energy per block increases with TU size. In FPGA, there is minimal difference between the architectures for small and medium sized blocks. For the large 16x16 block, however, the parallel architecture consumes 150.22nJ, while the folded architecture consumes 98.84nJ. This results in roughly 34% less energy for the folded architecture compared to the parallel architecture. The main reason for this is the significantly greater amount of wiring in the parallel architecture, which is generally known to have a negative effect on performance and power in FPGAs [17].
As shown in Figure 6 for the ASIC implementation, the parallel architecture yields lower energy than the folded architecture for all TU sizes. As with FPGAs, the results are almost the same for small and medium sized transforms. However, for the large 16x16 size, the parallel architecture consumes 2.61nJ while the folded architecture consumes 3.72nJ. This results in roughly 30% less energy for the parallel architecture compared to the folded architecture. Another interesting observation is the energy comparison between ASIC and FPGA. For the 16x16 parallel architecture, implementation on Silterra 180nm results in roughly 58x less energy than implementation on Xilinx Kintex Ultrascale (using 14nm technology), that is 2.61nJ versus 150.22nJ. This is mainly attributed to the intrinsically high power of FPGAs compared to ASICs.
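The percentages quoted above can be reproduced from the reported 16x16 energy values with a few lines of Python. The function and variable names are hypothetical; the energy figures are the ones stated in the text, and the helper for E=Pt simply assumes power in mW and time in microseconds so that the product is in nJ.

```python
def energy_nj(power_mw, time_us):
    """E = P * t; mW times microseconds gives nJ."""
    return power_mw * time_us

def saving_percent(higher, lower):
    """Relative energy saving of the lower value with respect to the higher."""
    return 100.0 * (higher - lower) / higher

# Reported 16x16 energy per block, in nJ.
fpga_parallel, fpga_folded = 150.22, 98.84
asic_parallel, asic_folded = 2.61, 3.72

fpga_saving = saving_percent(fpga_parallel, fpga_folded)   # folded vs parallel
asic_saving = saving_percent(asic_folded, asic_parallel)   # parallel vs folded
fpga_to_asic = fpga_parallel / asic_parallel               # parallel designs

print(f"FPGA folded saving:   {fpga_saving:.1f}%")
print(f"ASIC parallel saving: {asic_saving:.1f}%")
print(f"FPGA/ASIC energy ratio (parallel): {fpga_to_asic:.1f}x")
```

The savings come out at about 34% and 30%, and the FPGA-to-ASIC ratio for the parallel 16x16 design is about 58.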

Other performance on ASIC and FPGA
Table 1 shows the other performance results, compared in terms of maximum frequency, throughput, power, area and gate count. For the ASIC implementation, the clock period in this work is set at 20ns. The highest maximum frequency is obtained with the folded architecture, at twice that of the parallel architecture, because the critical path is roughly twice as long in the parallel architecture. In terms of throughput, however, the parallel architecture remains higher than the folded architecture. Note that the clock cycle latency is almost the same for the two architectures, since a SERDES is used on the inputs. In terms of power for ASIC, the folded architecture consumes 1.4 times more power than the parallel architecture for the 4x4, 8x8 and 16x16 sizes. In addition, the total core area of the parallel architecture is about twice that of the folded architecture, because more registers are used in the parallel architecture. The total gate count for the 16x16 block size is about 7 times that of the 8x8 transform size for both architectures, because many more blocks (adders, multipliers, and shifters) are used in the design.
For FPGAs, an interesting observation from the results is that the maximum frequency is higher for the parallel architecture. Power consumption is also higher for the parallel architecture; as mentioned, this is possibly due to its significantly greater amount of wiring. The parallel architecture also has higher throughput, and for the 16x16 block size the FPGA throughput is roughly twice that of the ASIC. The resources used by these designs are also reported in Table 1, including look-up tables (LUT), flip-flops (FF) and DSP blocks. In terms of area, an ASIC circuit is permanently drawn into silicon, whereas an FPGA circuit is formed by connecting a number of configurable blocks, so it is difficult to compare the area of the two in more detail. Because of the higher power of the parallel architecture, the folded architecture generally results in better energy efficiency in FPGAs.

Conclusion
In this paper, a comparison study has been performed of the 2D HEVC DCT for ASIC and FPGA implementations. The aim is to determine suitable architectures for these implementation platforms. Furthermore, the overall energy efficiency of FPGAs and ASICs has also been compared. The study includes the design and implementation of two commonly used 2D-DCT architectures, the parallel and the folded. Three DCT sizes have been designed and compared: 4x4, 8x8, and 16x16. Results show that the parallel architecture is most suitable for ASIC due to more predictable instance placement and routing, while the significantly greater amount of wiring in the parallel architecture results in relatively poor performance in FPGAs. Results also show that using the Silterra 180nm technology achieves roughly 58x less energy compared to using the Xilinx Kintex Ultrascale at 14nm technology. Future work is to complete the HEVC DCT for size 32x32 and the 4x4 DST.

Figure 1. Example of the row and column process of 4x4 2D-DCT

Figure 4. State diagram of the control signal for the folded architecture

Table 1. Results of the two architectures and sizes in ASIC and FPGA