Ticket Name: RTOS/TDA2: Using vlib for resizing an image efficiently
Query Text:
Part Number: TDA2 Tool/software: TI-RTOS Hello, I am using the existing VLIB function for resizing an image (i.e. ti_components/algorithms/vlib_c66x_3_3_0_3/packages/ti/vlib/src/VLIB_image_rescale). The library works fine, meaning it gives me a resized image. Currently it delivers around 48 output fps with the DSP at 87% utilization. I want it to reach 60 fps or more. So far, as part of optimization, I have done the following: 1. Transferred chunks of data from DDR to the internal memory of the DSP using EDMA, so that processing happens in SRAM only. 2. Arranged data transfer and processing in ping-pong fashion. -------------- I have the following questions: 1. At what rate can VLIB_image_rescale() execute? 2. What else do I need to do to get at least 60 fps? Regards, Kajal
Responses:
What is the input and output resolution and image format? Is it possible you are hitting the limit of the DDR bandwidth? Is there anything else being transferred in the system that would contribute to DDR being maxed out in the use case?
Hello Jesse, Input resolution: 1920x1080 Output resolution: 960x540 (resized) Data format: SYSTEM_DF_YUV420SP_UV Nothing is transferred in the system except the output frames (count is 15). How will I know whether DDR bandwidth is the limit? Neither at compile time nor at run time do I get any kind of error or warning. Regards, Kajal
Any update? regards, Kajal
Kajal, Based on the information you have given, the DDR requirement for this transfer is about 222 Mbytes/sec, which is well within the DDR throughput limit, so that should not be the problem. (By the way, if you hit the DDR limit, there will be no way for the compiler or run-time to know. It is up to the software architect developing the application to understand the hardware throughput limits when constructing the use case. You simply add up all the reads and writes of data from/to DDR and make sure that they are within the limits described in the system documentation.) Some more questions: 1. Which RAM are you transferring to, L1 or L2? 2. What are the dimensions of the input blocks you are transferring to the SRAM? If they are too small, too much time may be spent on DMA reconfiguration. 3. Are you also writing output to SRAM and transferring it to DDR using EDMA, or just writing the output of vlib to DDR directly through cache? 4. You said the color format is SYSTEM_DF_YUV420SP_UV. This format is not fully supported by the VLIB_image_rescale function. The rescale function can work on just the luma channel. What is the value of color_format you are using on the luma and chroma planes? 5. Since you are doing a half scale in each direction, have you considered the VXLIB_halfScaleGaussian_5x5_i8u_o8u function from VXLIB? It does a gaussian blur followed by a half scale in each direction. It may be faster since it assumes a half scale and not an arbitrary one. VXLIB is another library, which is the kernel library for OpenVX. Jesse
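The 222 Mbytes/sec figure is not broken down in the thread. The sketch below shows one way to add up the reads and writes as described above. It assumes (these are assumptions, not statements from the thread) that both luma and chroma planes of a YUV420SP frame are read and written, that the target rate is 60 fps, and that 1 Mbyte means 2^20 bytes. Under those assumptions the total comes to roughly 222 Mbytes/sec. The function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes in one YUV420SP frame: a full-size luma plane plus an
   interleaved chroma plane that is half the size of luma. */
uint64_t yuv420sp_frame_bytes(uint32_t width, uint32_t height)
{
    assert(width % 2u == 0u && height % 2u == 0u);
    return (uint64_t)width * height * 3u / 2u;
}

/* DDR traffic in bytes per second when one input frame is read and one
   output frame is written per frame period (assumed traffic model). */
uint64_t ddr_bytes_per_second(uint32_t in_w, uint32_t in_h,
                              uint32_t out_w, uint32_t out_h,
                              uint32_t fps)
{
    uint64_t per_frame = yuv420sp_frame_bytes(in_w, in_h)
                       + yuv420sp_frame_bytes(out_w, out_h);
    return per_frame * fps;
}
```

Calling ddr_bytes_per_second(1920, 1080, 960, 540, 60) and dividing by 1024 * 1024 gives the figure to compare against the DDR throughput limit in the system documentation. Any other DDR clients in the use case would have to be added to the same sum.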
Hello Jesse, Please find my answers: 1. Which RAM are you transferring to, L1 or L2? >> I am transferring chunks of data from DDR to L2 SRAM. 2. What are the dimensions of the input blocks you are transferring to the SRAM? If they are too small, too much time may be spent on DMA reconfiguration. >> 12 lines of 1920 width, i.e. a 1920x12 block is transferred into L2SRAM. Transferring a smaller block of size 1920x4 gives a difference of 1 ms compared to the 1920x12 block, so ultimately there is not much difference in fps. 3. Are you also writing output to SRAM and transferring to DDR using EDMA, or just writing output of vlib to DDR directly through cache? >> The processed output is written to L2SRAM only, and then that buffer is transferred back to DDR using EDMA. 4. You said the color format is SYSTEM_DF_YUV420SP_UV. This format is not fully supported by the VLIB_image_rescale function. The rescale function can work on just the luma channel. What is the value of color_format you are using on luma and chroma planes? >> I mean that SYSTEM_DF_YUV420SP_UV is the data format of the incoming frame. Only the luma plane is given to the VLIB_image_rescale function for processing, i.e. color_format = 3. Regards, Kajal.
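The block flow described in these answers (1920x12 input blocks fetched into L2SRAM, processed there, and written back to DDR in ping-pong fashion) can be sketched roughly as follows. This is only an illustration of the buffer rotation, not the poster's code: dma_copy is a hypothetical synchronous stand-in built on memcpy, whereas a real EDMA transfer is asynchronous and needs a wait before the buffer is used, and half_scale_nearest is a placeholder kernel, not VLIB_image_rescale or any VXLIB function.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define BLOCK_LINES 12   /* input lines per block, as in the thread */
#define MAX_WIDTH   1920

/* Stand-ins for the ping and pong buffers that would live in L2SRAM. */
static uint8_t in_buf[2][MAX_WIDTH * BLOCK_LINES];
static uint8_t out_buf[2][(MAX_WIDTH / 2) * (BLOCK_LINES / 2)];

/* Hypothetical stand-in for an EDMA transfer (synchronous here). */
static void dma_copy(void *dst, const void *src, size_t n)
{
    memcpy(dst, src, n);
}

/* Placeholder kernel: keep every second pixel of every second line. */
static void half_scale_nearest(const uint8_t *in, uint8_t *out,
                               int width, int lines)
{
    for (int y = 0; y < lines / 2; y++)
        for (int x = 0; x < width / 2; x++)
            out[y * (width / 2) + x] = in[(2 * y) * width + 2 * x];
}

void resize_frame_pingpong(const uint8_t *ddr_in, uint8_t *ddr_out,
                           int width, int height)
{
    assert(width > 0 && width <= MAX_WIDTH && width % 2 == 0);
    assert(height > 0 && height % BLOCK_LINES == 0);

    int num_blocks = height / BLOCK_LINES;
    size_t in_bytes = (size_t)width * BLOCK_LINES;
    size_t out_bytes = (size_t)(width / 2) * (BLOCK_LINES / 2);

    dma_copy(in_buf[0], ddr_in, in_bytes);            /* prime "ping" */
    for (int b = 0; b < num_blocks; b++) {
        int cur = b & 1;
        if (b + 1 < num_blocks)                       /* fetch next block */
            dma_copy(in_buf[cur ^ 1],
                     ddr_in + (size_t)(b + 1) * in_bytes, in_bytes);
        half_scale_nearest(in_buf[cur], out_buf[cur], width, BLOCK_LINES);
        dma_copy(ddr_out + (size_t)b * out_bytes, out_buf[cur], out_bytes);
    }
}
```

With a real asynchronous DMA, the fetch of block b + 1 overlaps the processing of block b, which is where the ping-pong arrangement saves time. A filtering kernel would additionally need overlapping input lines between blocks, which this placeholder does not.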
Kajal, When I looked closer at this code for the LUMA format, the optimized code is the same as the natural C, which operates at 17 cycles per input pixel. That is very poor performance. I suggest you use VXLIB. I personally wrote the optimizations for VXLIB, and it is much better. For example, the halfScaleGaussian function that I suggested operates at roughly 4.6 cycles per pixel when operating from L2SRAM. If you don't want half scale, then there are also scaleImage functions which give different scale ratios. Please let me know if you can try this instead of VLIB image_rescale. Jesse
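As a rough way to compare the two per-pixel figures, the sketch below converts a cycles-per-pixel cost into cycles per frame and into the DSP clock needed to sustain a target frame rate. It assumes the cost applies to every pixel of a 1920x1080 luma plane and ignores DMA waits, cache effects and function call overhead. The thread does not state the DSP clock, nor whether the 4.6 figure counts input or output pixels, so treat the results as an order-of-magnitude estimate. The function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Cycles per frame. The per-pixel cost is passed in tenths of a cycle
   (46 means 4.6 cycles per pixel) so the arithmetic stays in integers. */
uint64_t cycles_per_frame(uint32_t width, uint32_t height,
                          uint32_t tenths_of_cycle_per_pixel)
{
    assert(width > 0u && height > 0u);
    return (uint64_t)width * height * tenths_of_cycle_per_pixel / 10u;
}

/* Minimum clock in Hz needed to process fps frames per second. */
uint64_t required_clock_hz(uint32_t width, uint32_t height,
                           uint32_t tenths_of_cycle_per_pixel,
                           uint32_t fps)
{
    return cycles_per_frame(width, height, tenths_of_cycle_per_pixel) * fps;
}
```

For a 1920x1080 plane at 60 fps, required_clock_hz gives about 2.1 GHz at 17 cycles per pixel and about 572 MHz at 4.6 cycles per pixel, which illustrates why the per-pixel cost of the kernel matters more than the DMA arrangement here.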
Hello Jesse, I am trying to use VXLIB (VXLIB_halfScaleGaussian_5x5_i8u_o8u) instead of VLIB, but I'm not getting a resized frame. I have set the params below; please have a look and let me know if I'm missing anything. srcParam.data_type = VXLIB_UINT8; srcParam.dim_x = pInputChInfo->width; //1920 srcParam.dim_y = pInputChInfo->height; //1080 srcParam.stride_y = pInputChInfo->height; //1080 dstParam.data_type = VXLIB_UINT8; dstParam.dim_x = pOutputChInfo->width; //960 dstParam.dim_y = (pOutputChInfo->height); //540 dstParam.stride_y = pOutputChInfo->height; //540 Checking the params with VXLIB_halfScaleGaussian_5x5_i8u_o8u_checkParams() returns the VXLIB_ERR_INVALID_DIMENSION status. I then changed srcParam.dim_y to (1080 + 4), but I still get the same status and no resized frame. Are all the params correct? Regards, Kajal
Hi Jesse, I'm able to make progress. The issue was with stride_y of both the src and dst params. The comment given in the VXLIB_bufParams.h file for the VXLIB_bufParams2D_t structure says /*!< \brief Stride in Y dimension in bytes. */ That is why I was giving an incorrect stride_y value. ---------------------------- Anyway, the params are now: srcParam.data_type = VXLIB_UINT8; srcParam.dim_x = 1920; srcParam.dim_y = 1080; srcParam.stride_y = 1920; dstParam.data_type = VXLIB_UINT8; dstParam.dim_x = 960; dstParam.dim_y = 540; dstParam.stride_y = 960; Are the above params okay? I ask because the last two rows and columns may contain some garbage data. Kindly suggest what to do about that. Regards, Kajal.
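The stride fix above comes down to stride_y being the number of bytes from the start of one row to the start of the next, not the image height. A minimal sketch of that relationship for 8-bit data follows. The buf_params_2d type is a hypothetical stand-in that mirrors only the field names used in this thread; real code should use VXLIB_bufParams2D_t from VXLIB_bufParams.h.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in, NOT the real VXLIB structure. */
typedef struct {
    uint32_t dim_x;     /* width in elements */
    uint32_t dim_y;     /* height in rows */
    int32_t  stride_y;  /* bytes from one row to the next */
} buf_params_2d;

/* Describe an 8-bit plane. row_pitch_bytes equals the width for a
   tightly packed plane and is larger if rows are padded. */
buf_params_2d make_u8_params(uint32_t width, uint32_t height,
                             uint32_t row_pitch_bytes)
{
    buf_params_2d p;
    assert(row_pitch_bytes >= width);
    p.dim_x = width;
    p.dim_y = height;
    p.stride_y = (int32_t)row_pitch_bytes;
    return p;
}

/* Byte offset of pixel (x, y) inside an 8-bit plane. */
uint32_t pixel_offset(const buf_params_2d *p, uint32_t x, uint32_t y)
{
    assert(x < p->dim_x && y < p->dim_y);
    return y * (uint32_t)p->stride_y + x;
}
```

With a 1920x1080 plane and stride_y = 1920, the first pixel of the second row sits at byte offset 1920. With stride_y mistakenly set to 1080, the same call would land in the middle of the first row.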
From the API header:
 * @par Assumptions:
 * - I/O buffer pointers are assumed to be not aliased.
 * - Input width should be >= (Output width + 2) * 2
 * - Input height should be == (Output height + 2) * 2
 * - When breaking input image processing into blocks, be sure to fetch enough overlap pixels from the input
 *   for interior edges for the rescale, or else the function may put a false border within the block edge
 *   of the output image. For each dimension, the required fetch amount should be:
 *   - input block width to fetch = (output block width + 2) * 2
 *   - input block height to fetch = (output block height + 2) * 2
 *
 *   And the amount of left/top overlap to refetch should be:
 *   - left edge overlap = 2
 *   - top edge overlap = 2
Since this function uses prefiltering (gaussian), it does not replicate the border or assume some border value. All output pixels are produced from true input pixels, which means that the output is smaller. This gives the best performance if it is acceptable to have 1 line of border on the output which is not written to (assuming that you center the output). So, assuming you have the full buffer you asked for in the output, you can set the output params to: dstParam.data_type = VXLIB_UINT8; dstParam.dim_x = 960; dstParam.dim_y = 540-2; dstParam.stride_y = 960; If you give the pointer to the top of the full buffer, then the last 2 lines will not be written to. Alternatively, you can give the pointer to the beginning of the second line to have centered output (the top line does not get written to, and the last line does not get written to). Also, 2 border pixels per row will be garbage. If you want to avoid the reduction in size of the image, you can use VXLIB_scaleImageNearest_i8u_o8u (which doesn't do any prefiltering and may lead to aliasing artifacts). Alternatively, you can use VXLIB_scaleImageBilinear_br_i8u_o8u, which uses interpolation as part of an arbitrary scale, and replicates the border. This additional functionality costs extra cycles per pixel: it is about 9-10 cycles per pixel. So if that is still within your budget, it offers the most flexibility. Finally, another option is to use VXLIB_halfScaleGaussian_5x5_br_i8u_o8u_o8u. It does a half scale with handling of the border. However, it gives a second "full sized" output for a special OpenVX case. You can try to change the code to remove the writes to the dstFull buffer to see if the performance improves, but this may require some more work and testing. Hope this helps. Jesse
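The size rules quoted from the API header can be turned into small helpers, shown here as a sketch with hypothetical function names. They apply the "(output + 2) * 2" rule in one dimension at a time, and compute the byte offset that skips a number of lines at the top of an 8-bit output buffer, as in the centered-output suggestion above.

```c
#include <assert.h>
#include <stdint.h>

/* Input pixels (or lines) to fetch for a given output length,
   following the rule quoted from the API header. */
uint32_t input_fetch_for_output(uint32_t out_len)
{
    return (out_len + 2u) * 2u;
}

/* Largest output length that a given input length can produce
   under the same rule. */
uint32_t max_output_for_input(uint32_t in_len)
{
    assert(in_len >= 4u);
    return in_len / 2u - 2u;
}

/* Byte offset that skips `rows` lines at the top of an 8-bit buffer
   whose row pitch is stride_y bytes. */
uint32_t skip_rows_offset(uint32_t stride_y, uint32_t rows)
{
    return stride_y * rows;
}
```

For the 1920x1080 input in this thread, max_output_for_input gives 958 and 538. A 960-wide output would need input_fetch_for_output(960) = 1924 input pixels, more than the 1920 available, which may be one reason the parameter check rejected the dimensions. skip_rows_offset(960, 1) gives the offset of the second line for a centered output.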
Hello Jesse, Adding 2 in both dim_x and dim_y of dstParams solved my problem. For now I'm closing this thread; I will come back to you if I face any issues while doing the same thing in slices with EDMA. Regards, Kajal.