Is memcpy too slow? Master this technique to multiply memory-copy throughput


Cover from: Chestnut is too lazy

memcpy is a standard C/C++ function. Its prototype is void *memcpy(void *dest, const void *src, size_t n), and it copies n bytes from the memory pointed to by src to the memory pointed to by dest.

NEON is a 128-bit SIMD (Single Instruction, Multiple Data) extension for ARM Cortex-A series processors. A single NEON instruction operates on several values at once: a 64-bit D register holds eight 8-bit, four 16-bit, two 32-bit, or one 64-bit value, and a 128-bit Q register holds twice as many. This is the property that can be used to speed up memory copies.

In most cases memcpy is fast enough, but when copying a large block of memory becomes a bottleneck, NEON is worth considering. For example, when I used glMapBufferRange to map a PBO from GPU memory into CPU memory, the copy was expensive: copying 921600 bytes took 30 ms. With NEON, the same copy dropped to 4 ms, roughly 7.5 times faster.

NEON instructions on ARM improve data-parallel processing in general, not only memory copies. Google's open-source libyuv also uses NEON instructions to process data in parallel.
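Before replacing memcpy, it helps to confirm that the copy really is the bottleneck. The following is a minimal sketch of a timing harness, not the measurement code behind the numbers above. The names copy_fn and time_copy_ms are hypothetical, and clock() reports CPU time, which only approximates wall-clock time for a single-threaded copy.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Signature shared by memcpy and any drop-in replacement. */
typedef void *(*copy_fn)(void *dst, const void *src, size_t n);

/* Hypothetical helper: average CPU time in milliseconds that `copy`
 * needs to move `n` bytes, measured over `rounds` runs.
 * Returns a negative value if the buffers cannot be allocated. */
static double time_copy_ms(copy_fn copy, size_t n, int rounds)
{
    assert(copy != NULL && n > 0 && rounds > 0);

    unsigned char *src = malloc(n);
    unsigned char *dst = malloc(n);
    if (src == NULL || dst == NULL) {
        free(src);
        free(dst);
        return -1.0;
    }
    memset(src, 0xA5, n);   /* touch the pages so page faults are not timed */
    memset(dst, 0, n);

    clock_t start = clock();
    for (int i = 0; i < rounds; i++) {
        copy(dst, src, n);
    }
    clock_t end = clock();

    /* Sanity check: the copy must actually have happened. */
    assert(memcmp(dst, src, n) == 0);

    free(src);
    free(dst);
    return 1000.0 * (double)(end - start) / CLOCKS_PER_SEC / rounds;
}
```

A call such as time_copy_ms(memcpy, 921600, 100) gives a baseline for ordinary heap memory. Memory returned by glMapBufferRange may behave differently (for example, it may be uncached), so the real buffer should be measured as well.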

Using NEON instructions

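#include <string.h>   /* memcpy, used for the tail */

/* __ARM__ is not predefined by GCC or Clang (they predefine __arm__ on
 * 32-bit ARM); it is assumed here that the build defines it. */
#ifdef __ARM__
static void neon_memcpy(volatile unsigned char *dst, volatile unsigned char *src, int sz)
{
    if (sz <= 0)
        return;
    int blocks = sz & -64;    /* bytes covered by whole 64-byte blocks */
    int tail = sz - blocks;   /* remaining 0..63 bytes */
    if (blocks > 0) {
        asm volatile (
            "1:\n"                          /* local label, safe if inlined more than once */
            " VLDM %[src]!, {d0-d7}\n"      /* load 64 bytes, advance src */
            " VSTM %[dst]!, {d0-d7}\n"      /* store 64 bytes, advance dst */
            " SUBS %[sz], %[sz], #0x40\n"   /* 64 fewer bytes to go */
            " BGT 1b\n"                     /* loop while bytes remain */
            : [dst]"+r"(dst), [src]"+r"(src), [sz]"+r"(blocks)
            :
            : "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "cc", "memory");
    }
    if (tail > 0)   /* copy the tail exactly instead of rounding sz up past the buffer */
        memcpy((void *)dst, (const void *)src, (size_t)tail);
}
#endif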
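The same 64-byte block copy can also be written with NEON intrinsics from arm_neon.h instead of inline assembly, which leaves register allocation to the compiler. This is a sketch under assumptions, not the author's code: the name neon_memcpy_intrinsics is hypothetical, and the function falls back to plain memcpy when the compiler does not define __ARM_NEON. It copies any tail shorter than 64 bytes with memcpy, so it never touches bytes beyond n.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define USE_NEON_INTRINSICS 1
#endif

/* Hypothetical name. Copies n bytes from src to dst; the buffers
 * must not overlap. */
static void neon_memcpy_intrinsics(unsigned char *dst, const unsigned char *src, size_t n)
{
    assert(dst != NULL && src != NULL);
#ifdef USE_NEON_INTRINSICS
    /* Main loop: four 128-bit Q registers = 64 bytes per iteration,
     * the same block size as the d0-d7 assembly loop. */
    while (n >= 64) {
        uint8x16_t a = vld1q_u8(src);
        uint8x16_t b = vld1q_u8(src + 16);
        uint8x16_t c = vld1q_u8(src + 32);
        uint8x16_t d = vld1q_u8(src + 48);
        vst1q_u8(dst, a);
        vst1q_u8(dst + 16, b);
        vst1q_u8(dst + 32, c);
        vst1q_u8(dst + 48, d);
        src += 64;
        dst += 64;
        n -= 64;
    }
#endif
    /* Tail, or the whole buffer when NEON is unavailable at compile time. */
    if (n > 0) {
        memcpy(dst, src, n);
    }
}
```

Whether the intrinsics version matches the hand-written assembly depends on the compiler and the CPU, so it should be measured on the target device.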

Not every ARMv7 CPU supports NEON, so the cpufeatures library is used to check for NEON support at run time, falling back to memcpy when it is missing. The intended usage is as follows.

/* Requires <cpu-features.h> (NDK cpufeatures) and <string.h>. */
#ifdef __ARM__
    if (android_getCpuFamily() == ANDROID_CPU_FAMILY_ARM &&
        (android_getCpuFeatures() & ANDROID_CPU_ARM_FEATURE_NEON) != 0) {
        /* NEON is available at run time */
        neon_memcpy(destBuffer, src, length);
    } else {
        memcpy(destBuffer, src, length);
    }
#else
    /* Not an ARM build: plain memcpy */
    memcpy(destBuffer, src, length);
#endif

Enabling NEON in Android.mk

# ARM NEON
ifeq ($(TARGET_ARCH_ABI),armeabi-v7a)
LOCAL_CFLAGS := -D__cpusplus -g -mfloat-abi=softfp -mfpu=neon -march=armv7-a -mtune=cortex-a8 -DHAVE_NEON=1
endif

# NEON on armeabi-v7a; NEON-to-SSE on x86
ifeq ($(TARGET_ARCH_ABI),$(filter $(TARGET_ARCH_ABI), armeabi-v7a x86))
LOCAL_ARM_NEON := true
endif

LOCAL_STATIC_LIBRARIES := cpufeatures
include $(BUILD_SHARED_LIBRARY)
$(call import-module,android/cpufeatures)

Enabling NEON in CMake

# cpufeatures
include_directories(${ANDROID_NDK}/sources/android/cpufeatures)

if (${ANDROID_ABI} STREQUAL "armeabi-v7a")
    set_property(SOURCE ${SOURCES} APPEND_STRING PROPERTY COMPILE_FLAGS " -mfpu=neon")
    add_definitions("-DHAVE_NEON=1")
elseif (${ANDROID_ABI} STREQUAL "x86")
    set_property(SOURCE ${SOURCES} APPEND_STRING PROPERTY COMPILE_FLAGS
            " -mssse3  -Wno-unknown-attributes \
                   -Wno-deprecated-declarations \
                   -Wno-constant-conversion \
                   -Wno-static-in-inline")
    add_definitions(-DHAVE_NEON_X86=1 -DHAVE_NEON=1)
endif ()

add_library(
        yourLibrary
        SHARED
        ${ANDROID_NDK}/sources/android/cpufeatures/cpu-features.c)

SIMD is not exclusive to the ARM architecture: x86 supports it too, through SSE, and Android also provides NEON_2_SSE.h for x86. x86 cannot execute NEON instructions directly; instead, this header maps the NEON intrinsics to SSE instructions, so the same NEON API can be used on both architectures. Interested readers may want to study it.
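As an illustration of sharing one NEON-style code path across both architectures, here is a minimal sketch. It assumes that NEON_2_SSE.h is on the include path for x86 builds and that HAVE_NEON_X86 is defined by the build, as in the CMake snippet above. The function add_saturate_u8 and the macro HAVE_NEON_API are hypothetical names, and a scalar loop handles the tail as well as builds without either header.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>          /* native NEON on ARM */
#define HAVE_NEON_API 1
#elif defined(HAVE_NEON_X86)
#include "NEON_2_SSE.h"        /* same intrinsics, implemented with SSE */
#define HAVE_NEON_API 1
#endif

/* Hypothetical example: out[i] = min(255, a[i] + b[i]) for n bytes. */
static void add_saturate_u8(uint8_t *out, const uint8_t *a, const uint8_t *b, size_t n)
{
    assert(out != NULL && a != NULL && b != NULL);
    size_t i = 0;
#ifdef HAVE_NEON_API
    /* 16 bytes per iteration: one saturating add on a 128-bit register. */
    for (; i + 16 <= n; i += 16) {
        uint8x16_t va = vld1q_u8(a + i);
        uint8x16_t vb = vld1q_u8(b + i);
        vst1q_u8(out + i, vqaddq_u8(va, vb));
    }
#endif
    /* Scalar tail, and the full loop when no NEON API is available. */
    for (; i < n; i++) {
        unsigned sum = (unsigned)a[i] + b[i];
        out[i] = (uint8_t)(sum > 255 ? 255 : sum);
    }
}
```

The vector loop is identical source code on ARM and x86; only the included header differs.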