Switch to PCRE2 for regular expressions#3432

Open
wader wants to merge 1 commit into jqlang:master from wader:pcre2

@wader
Member

@wader wader commented Nov 5, 2025

Work on oniguruma (https://github.com/kkos/oniguruma) has sadly been discontinued
so we need to find another regex implementation. PCRE2 seems like a good candidate.

vendor/pcre2 commit is at tag pcre2-10.47

Known TODOs:

  • compile-ios.sh
  • Needs to be a 1.x release as docs rely on that
  • libjq usage? Don't build in pcre2 and require libpcre2 instead? (seems to be what Debian does)
  • Update docs
    • Breaking changes
      • Drop "l" modifier?
        • Could possibly reimplement
      • Empty pattern and multi-byte code points behave differently. I think pcre2 is more correct?
        • onig: jq -n '["🚀" | match(""; "g")] | length' -> 5 (per byte it seems)
        • pcre2: jq -n '["🚀" | match(""; "g")] | length' -> 2 (per code point)
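The two counts above are consistent with an empty pattern matching once per byte offset versus once per code point boundary. "🚀" is U+1F680, which is 4 bytes in UTF-8, so there are 5 byte positions (including the end) but only 2 code point positions. The sketch below only reproduces that arithmetic. It is an assumption about why the numbers differ, not code from either regex library, and the function names are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Positions where an empty pattern can match if the matcher advances
 * one byte at a time: every byte offset, plus the end of the subject. */
static size_t empty_match_positions_per_byte(const char *s) {
    assert(s != NULL);
    return strlen(s) + 1;
}

/* Same, but advancing one UTF-8 code point at a time: only offsets that
 * start a code point (not a 10xxxxxx continuation byte), plus the end.
 * Assumes s is valid UTF-8. */
static size_t empty_match_positions_per_code_point(const char *s) {
    assert(s != NULL);
    size_t count = 1; /* the match at the end of the subject */
    for (const unsigned char *p = (const unsigned char *)s; *p != '\0'; p++) {
        if ((*p & 0xC0) != 0x80) {
            count++;
        }
    }
    return count;
}
```

For the subject `"\xF0\x9F\x9A\x80"` (the UTF-8 bytes of 🚀) the first function returns 5 and the second returns 2, matching the onig and pcre2 results listed above.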

Good references:

Notes:
".+?\b" test in onig.test seems to behave differently depending on pcre2 version.
I suspect this fix in 10.43:
PCRE2Project/pcre2@0a55280

I noticed a clang -fsanitize=memory use-of-uninitialized-value issue but it seems to go away with pcre2 master

Related to #3313

@wader wader marked this pull request as draft November 5, 2025 23:49
@wader wader force-pushed the pcre2 branch 13 times, most recently from 55d7ade to 8f73733 Compare November 7, 2025 16:19
@wader

This comment was marked as outdated.

@wader wader force-pushed the pcre2 branch 15 times, most recently from 475cbff to ab80536 Compare November 8, 2025 11:22
@nicowilliams
Contributor

@wader maybe just remove Oniguruma and switch to PCRE2? But if you want to have a choice for a while, that's fine too.

@wader
Member Author

wader commented Jan 12, 2026

@nicowilliams removing is fine for me 👍 do the changes look ok otherwise? i will try to get some time to fix up the last things

@wader wader changed the title from "Add PCRE2 support" to "Switch to PCRE2 for regular expressions" Jan 21, 2026
Comment thread .github/workflows/ci.yml
uses: actions/checkout@v6
with:
- submodules: true
+ submodules: recursive
Member Author

Needed as pcre2 has a submodule, sljit. Not sure it is used, but at least make dist will need it, as the pcre2 dist depends on the sljit files.

Comment thread README.md Outdated
make clean # if upgrading from a version previously built from source
git submodule update --init --recursive # if building from git to get pcre2
autoreconf -i # if building from git
./configure # build with builtin PCRE2
Member Author

--with-pcre2=builtin is default so remove it

@wader
Member Author

wader commented Jan 21, 2026

Now switched the PR to replace oniguruma instead. See the TODO list for things left. And of course CI has issues on macos (submodule gets dirty) and windows (i think line endings or msys2 do not play well with the pcre2 tests).

@wader
Member Author

wader commented Jan 21, 2026

for some reason autoreconf on macos decides to regenerate some files, making the repo dirty and failing the CI workflow diff test

...
2026-01-21T21:18:10.7757010Z autoreconf: configure.ac: adding subdirectory vendor/pcre2 to autoreconf
2026-01-21T21:18:10.7783150Z autoreconf: Entering directory 'vendor/pcre2'
2026-01-21T21:18:10.7791930Z autoreconf: configure.ac: not using Gettext
2026-01-21T21:18:11.7871500Z autoreconf: running: aclocal -I m4
2026-01-21T21:18:12.9170970Z autoreconf: configure.ac: tracing
2026-01-21T21:18:13.3068070Z autoreconf: running: glibtoolize --copy
2026-01-21T21:18:14.8455980Z glibtoolize: putting auxiliary files in '.'.
2026-01-21T21:18:14.8456290Z glibtoolize: copying file './ltmain.sh'
2026-01-21T21:18:14.9095160Z glibtoolize: putting macros in AC_CONFIG_MACRO_DIRS, 'm4'.
2026-01-21T21:18:14.9248080Z glibtoolize: copying file 'm4/libtool.m4'
2026-01-21T21:18:14.9923450Z glibtoolize: copying file 'm4/ltoptions.m4'
2026-01-21T21:18:15.1428890Z glibtoolize: copying file 'm4/ltversion.m4'
2026-01-21T21:18:15.3492360Z autoreconf: configure.ac: not using Intltool
2026-01-21T21:18:15.3493350Z autoreconf: configure.ac: not using Gtkdoc
2026-01-21T21:18:15.3494090Z autoreconf: running: aclocal -I m4
2026-01-21T21:18:16.2805980Z autoreconf: running: /opt/homebrew/Cellar/autoconf/2.72/bin/autoconf
2026-01-21T21:18:16.7517550Z autoreconf: running: /opt/homebrew/Cellar/autoconf/2.72/bin/autoheader
2026-01-21T21:18:17.1253510Z autoreconf: running: automake --add-missing --copy --no-force
2026-01-21T21:18:17.5641110Z autoreconf: Leaving directory 'vendor/pcre2'
...

hmm seems to be because pcre2 adds auto* generated files to tags https://github.com/PCRE2Project/pcre2/blob/release/pcre2-10.47/Makefile.in

@wader wader force-pushed the pcre2 branch 9 times, most recently from c7fe29e to 7731e97 Compare January 22, 2026 23:57
@wader wader requested review from itchyny and nicowilliams January 23, 2026 00:08
@wader
Member Author

wader commented Jan 25, 2026

Played around with using pcre2 JIT compiler but didn't notice much difference for common jq use cases and not much for big text input either, maybe i was testing wrongly?

Anyways code is here https://github.com/wader/jq/tree/pcre2-jit and remember to build jq with --enable-jit to pass --enable-jit to pcre2 configure.

Work on oniguruma (https://github.com/kkos/oniguruma) has sadly been discontinued
so we need to find another regex implementation. PCRE2 seems like a good candidate.

vendor/pcre2 commit is at tag pcre2-10.47

--with-pcre2 works like this: (similar to --with-oniguruma)
  builtin - build vendored pcre2
  no - no regexp support
  yes - use pcre2-config to find installed pcre2 (default)
  * - use value as install prefix to find installed pcre2
  if an installed pcre2 cannot be found, fall back to the vendored one
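
The option handling described above can be read as a small decision table. The sketch below models it as a plain C function so the fallback rule is explicit. It is not the actual configure logic (that lives in autoconf macros), and the enum, function name, and `installed_found` flag are hypothetical names introduced only for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical result type: where regex support comes from. */
typedef enum { SRC_NONE, SRC_VENDORED, SRC_INSTALLED } regex_source;

/* option: the value given to --with-pcre2 ("builtin", "no", "yes",
 *         or anything else, which is treated as an install prefix).
 * installed_found: whether an installed pcre2 was located, either via
 *         pcre2-config ("yes") or under the given prefix. */
static regex_source resolve_pcre2_source(const char *option, bool installed_found) {
    assert(option != NULL);
    if (strcmp(option, "no") == 0) {
        return SRC_NONE;      /* no regexp support */
    }
    if (strcmp(option, "builtin") == 0) {
        return SRC_VENDORED;  /* always build vendor/pcre2 */
    }
    /* "yes" or a prefix: prefer the installed library, else fall back. */
    return installed_found ? SRC_INSTALLED : SRC_VENDORED;
}
```

Under this reading, `resolve_pcre2_source("yes", false)` and `resolve_pcre2_source("/some/prefix", false)` both end up with the vendored copy, which is the fallback the last line of the list describes.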

Known TODOs:
- compile-ios.sh
- Needs to be a 1.x release as docs rely on that
- libjq usage? Don't build in pcre2 and require libpcre2 instead? (seems to be what Debian does)
- Update docs
  - Breaking changes
    - Drop "l" modifier?
      - Could possibly reimplement
    - Empty pattern and multi-byte code points behave differently. I think pcre2 is more correct?
      - onig: jq -n '["🚀" | match(""; "g")] | length' -> 5 (per byte it seems)
      - pcre2: jq -n '["🚀" | match(""; "g")] | length' -> 2 (per code point)

Good references:
- https://github.com/PCRE2Project/pcre2/blob/master/src/pcre2demo.c
- https://github.com/PCRE2Project/pcre2/blob/eb3bd3cf1418cb1a0eabf984b0b1e80b6bdd9314~1/src/pcre2demo.c (pre pcre2_next_match usage)
- The pcre2test CLI tool is a good playground for pcre2

Notes:
".+?\\b" test disabled in pcre2.test seems to behave differently depending on pcre2 version.
I suspect this fix in 10.43:
PCRE2Project/pcre2@0a55280

I noticed a clang -fsanitize=memory use-of-uninitialized-value issue but it seems to go away with pcre2 master

Related to jqlang#3313
Comment thread src/main.c
// use a lower regex parse depth limit than the default (4096) to protect
// from stack-overflows
// https://github.com/jqlang/jq/security/advisories/GHSA-f946-j5j2-4w5m
onig_set_parse_depth_limit(1024);
Member Author

Just noticed that pcre2 has pcre2_set_parens_nest_limit but unsure about pcre2's stack usage during parsing. Also default limit is 250 so maybe not a problem https://github.com/PCRE2Project/pcre2/blob/ac0eb7122a0ac04d6717585f132d82aec3adc8d3/CMakeLists.txt#L344
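
To make concrete what a parentheses nesting limit guards against, here is a toy depth check in plain C. It is not how PCRE2 parses patterns and it does not call the PCRE2 API. In real code the limit would be set with `pcre2_set_parens_nest_limit` on a compile context, as mentioned above. The function name is hypothetical. The scanner skips backslash-escaped characters but deliberately ignores character classes and `\Q...\E`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Returns true if the pattern never nests '(' deeper than limit.
 * Toy scanner: handles backslash escapes only, so a '(' inside a
 * character class such as [(] is (incorrectly) counted as a group. */
static bool within_parens_nest_limit(const char *pattern, size_t limit) {
    assert(pattern != NULL);
    size_t depth = 0;
    for (const char *p = pattern; *p != '\0'; p++) {
        if (*p == '\\' && p[1] != '\0') {
            p++; /* skip the escaped character */
            continue;
        }
        if (*p == '(') {
            if (++depth > limit) {
                return false; /* reject before any deep recursion happens */
            }
        } else if (*p == ')' && depth > 0) {
            depth--;
        }
    }
    return true;
}
```

With a limit of 2, `"((a))"` is accepted and `"(((a)))"` is rejected. A pattern of many sequential groups such as `"(a)(b)(c)"` never exceeds depth 1, so the limit only constrains nesting, not the total number of groups.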
