Fast C++ function "is_utf8": checks if the input is valid UTF-8. Made of a single source file. Optimized for ARM NEON, x64 SSE, AVX2 and AVX-512.
Most strings online are in Unicode using the UTF-8 encoding. Validating strings quickly before accepting them is important.
This is a simple single-source-file library that validates UTF-8 strings at high speed using SIMD instructions. It works on all platforms (ARM, x64).
Build and link is_utf8.cpp with your project. Code usage:
#include "is_utf8.h"
char * mystring = ...
bool is_it_valid = is_utf8(mystring, thestringlength);
It should be able to validate strings using less than 1 cycle per input byte.
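For reference, the property being checked can be written as a plain scalar loop. The following is a minimal sketch and not the library's SIMD implementation; the name is_utf8_scalar is hypothetical and is not part of is_utf8.h. It rejects stray continuation bytes, truncated sequences, overlong encodings, surrogates (U+D800 to U+DFFF) and code points above U+10FFFF.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical scalar reference validator (illustration only, not the
// library's code). Returns true if [src, src + len) is valid UTF-8.
bool is_utf8_scalar(const char *src, std::size_t len) {
  const unsigned char *data = reinterpret_cast<const unsigned char *>(src);
  std::size_t pos = 0;
  while (pos < len) {
    unsigned char lead = data[pos];
    if (lead < 0x80) { // ASCII: one byte, always valid
      pos++;
      continue;
    }
    std::size_t n;          // total bytes in this sequence
    std::uint32_t cp;       // decoded code point
    std::uint32_t smallest; // smallest code point allowed for this length
    if ((lead & 0xE0) == 0xC0) {
      n = 2; cp = lead & 0x1F; smallest = 0x80;
    } else if ((lead & 0xF0) == 0xE0) {
      n = 3; cp = lead & 0x0F; smallest = 0x800;
    } else if ((lead & 0xF8) == 0xF0) {
      n = 4; cp = lead & 0x07; smallest = 0x10000;
    } else {
      return false; // stray continuation byte or invalid lead byte
    }
    if (len - pos < n) {
      return false; // sequence is cut off by the end of the input
    }
    for (std::size_t i = 1; i < n; i++) {
      unsigned char c = data[pos + i];
      if ((c & 0xC0) != 0x80) {
        return false; // expected a continuation byte (10xxxxxx)
      }
      cp = (cp << 6) | (c & 0x3F);
    }
    if (cp < smallest) return false;                 // overlong encoding
    if (cp > 0x10FFFF) return false;                 // beyond Unicode range
    if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // surrogate
    pos += n;
  }
  return true;
}
```

A loop like this processes one byte at a time with data-dependent branches, so it is useful mainly as a correctness reference when comparing against a SIMD validator on the same inputs.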
cmake -B build
cmake --build build
cd build
ctest .
Visual Studio users must specify whether they want to build the Release or Debug version.
To run benchmarks, build and execute the bench command.
cmake -B build
cmake --build build
./build/benchmarks/bench
Instructions are similar for Visual Studio users.
This C++ library is part of the JavaScript package utf-8-validate. The utf-8-validate package is routinely downloaded more than a million times per week.
If you are using Node.js (19.4.0 or later), you already have access to this function as buffer.isUtf8(input).
If you want a wide range of fast Unicode functions for production use, you can rely on the simdutf library. It is as simple as the following:
#include <cstdlib>  // EXIT_SUCCESS, EXIT_FAILURE
#include <iostream> // std::cout, std::cerr

#include "simdutf.cpp"
#include "simdutf.h"

int main(int argc, char *argv[]) {
  const char *source = "1234";
  // 4 == strlen(source)
  bool validutf8 = simdutf::validate_utf8(source, 4);
  if (validutf8) {
    std::cout << "valid UTF-8" << std::endl;
  } else {
    std::cerr << "invalid UTF-8" << std::endl;
    return EXIT_FAILURE;
  }
  return EXIT_SUCCESS;
}
See https://github.com/simdutf/
This library is distributed under the terms of any of the following licenses, at your option: