Switch to PCRE2 for regular expressions#3432
Conversation
55d7ade to
8f73733
Compare
This comment was marked as outdated.
This comment was marked as outdated.
475cbff to
ab80536
Compare
|
@wader maybe just remove Oniguruma and switch to PCRE2? But if you want to have a choice for a while, that's fine too. |
|
@nicowilliams remove is fine for me 👍 does the changes look ok otherwise? i will try to get some time to fixup the last things |
| uses: actions/checkout@v6 | ||
| with: | ||
| submodules: true | ||
| submodules: recursive |
There was a problem hiding this comment.
Need as pcre2 has a sumodule, sljit, not sure used but at least make dist will need it as pcre2 dist depends on the sljit files
| make clean # if upgrading from a version previously built from source | ||
| git submodule update --init --recursive # if building from git to get pcre2 | ||
| autoreconf -i # if building from git | ||
| ./configure # build with builtin PCRE2 |
There was a problem hiding this comment.
--with-pcre2=builtin is default so remove it
|
Now switch PR to replace oniguruma instead. See TODO things left and of course CI for macos (submodule gets dirty) and windows (i think line ending or msys2 does not play will with pcre2 tests) has issues |
|
for some reason on macos autoreconf decices to regenerate some files making the repo dirty failing the CI workflow diff test hmm seems to be because pcre2 adds auto* generated files to tags https://github.com/PCRE2Project/pcre2/blob/release/pcre2-10.47/Makefile.in |
c7fe29e to
7731e97
Compare
|
Played around with using pcre2 JIT compiler but didn't notice much difference for common jq use cases and not much for big text input either, maybe i was testing wrongly? Anyways code is here https://github.com/wader/jq/tree/pcre2-jit and remember to build jq with |
Work on oniguruma (https://github.com/kkos/oniguruma) has sadly been discontinued so we need to find another regex implementation. PCRE2 seem like a good candidate. vendor/pcre2 commit is at tag pcre2-10.47 --with-pcre2 works like this: (similar to --with-oniguruma) builtin - build vendored pcre2 no - no regexp support yes - use pcre2-config to find installed pcre2 (default) * - use value as install prefix to find installecd pcre2 if fail to find installed pcre2 then use vendored Known TODOs: - compile-ios.sh - Needs to be a 1.x release as docs relay on that - libjq usage? dont builtin and require libpcre2? (seems to be what debian does) - Update docs - Breaking changes - Drop "l" modifier? - Could possibly reimplement - Empty pattern and multi-byte code points behave differntly. I think pcre2 is more correct? - onig: jq -n '["🚀" | match(""; "g")] | length' -> 5 (per byte it seems) - pcre2: jq -n '["🚀" | match(""; "g")] | length' -> 2 (per code point) Good references: - https://github.com/PCRE2Project/pcre2/blob/master/src/pcre2demo.c Pre pcre2_next_match usage: - https://github.com/PCRE2Project/pcre2/blob/eb3bd3cf1418cb1a0eabf984b0b1e80b6bdd9314~1/src/pcre2demo.c (pre pcre2_next_match usage) - pcre2test cli tools is a good Playaround for pcre2 Notes: ".+?\\b" test disbled in pcre2.test seems to be behaves differently depending on pcre2 version. I suspect this fix in 10.43: PCRE2Project/pcre2@0a55280 I noticed a clang -fsanitize=memory use-of-uninitialized-valu issue but it seems to go awa with pcre2 master Related to jqlang#3313
| // use a lower regex parse depth limit than the default (4096) to protect | ||
| // from stack-overflows | ||
| // https://github.com/jqlang/jq/security/advisories/GHSA-f946-j5j2-4w5m | ||
| onig_set_parse_depth_limit(1024); |
There was a problem hiding this comment.
Just noticed that pcre2 has pcre2_set_parens_nest_limit but unsure about pcre2's stack usage during parsing. Also default limit is 250 so maybe not a problem https://github.com/PCRE2Project/pcre2/blob/ac0eb7122a0ac04d6717585f132d82aec3adc8d3/CMakeLists.txt#L344
Work on oniguruma (https://github.com/kkos/oniguruma) has sadly been discontinued
so we need to find another regex implementation. PCRE2 seem like a good candidate.
vendor/pcre2 commit is at tag pcre2-10.47
Known TODOs:
Good references:
Pre pcre2_next_match usage:
Notes:
".+?\b" test in onig.test seems to be behaves differently depending on pcre2 version.
I suspect this fix in 10.43:
PCRE2Project/pcre2@0a55280
I noticed a clang -fsanitize=memory use-of-uninitialized-valu issue but it seems to go awa with pcre2 master
Related to #3313