diff options
| author | Craig Jennings <c@cjennings.net> | 2024-04-07 13:41:34 -0500 |
|---|---|---|
| committer | Craig Jennings <c@cjennings.net> | 2024-04-07 13:41:34 -0500 |
| commit | 754bbf7a25a8dda49b5d08ef0d0443bbf5af0e36 (patch) | |
| tree | f1190704f78f04a2b0b4c977d20fe96a828377f1 /devdocs/elisp/regexp-problems.html | |
new repository
Diffstat (limited to 'devdocs/elisp/regexp-problems.html')
| -rw-r--r-- | devdocs/elisp/regexp-problems.html | 12 |
1 files changed, 12 insertions, 0 deletions
diff --git a/devdocs/elisp/regexp-problems.html b/devdocs/elisp/regexp-problems.html new file mode 100644 index 00000000..b82d4182 --- /dev/null +++ b/devdocs/elisp/regexp-problems.html @@ -0,0 +1,12 @@ + <h4 class="subsection">Problems with Regular Expressions</h4> <p>The Emacs regexp implementation, like many of its kind, is generally robust but occasionally causes trouble in either of two ways: matching may run out of internal stack space and signal an error, and it can take a long time to complete. The advice below will make these symptoms less likely and help alleviate problems that do arise. </p> <ul> <li> Anchor regexps at the beginning of a line, string or buffer using zero-width assertions (‘<samp>^</samp>’ and <code>\`</code>). This takes advantage of fast paths in the implementation and can avoid futile matching attempts. Other zero-width assertions may also bring benefits by causing a match to fail early. </li> +<li> Avoid or-patterns in favor of character alternatives: write ‘<samp>[ab]</samp>’ instead of ‘<samp>a\|b</samp>’. Recall that ‘<samp>\s-</samp>’ and ‘<samp>\sw</samp>’ are equivalent to ‘<samp>[[:space:]]</samp>’ and ‘<samp>[[:word:]]</samp>’, respectively. </li> +<li> Since the last branch of an or-pattern does not add a backtrack point on the stack, consider putting the most likely matched pattern last. For example, ‘<samp>^\(?:a\|.b\)*c</samp>’ will run out of stack if trying to match a very long string of ‘<samp>a</samp>’s, but the equivalent ‘<samp>^\(?:.b\|a\)*c</samp>’ will not. <p>(It is a trade-off: successfully matched or-patterns run faster with the most frequently matched pattern first.) </p> </li> +<li> Try to ensure that any part of the text can only match in a single way. For example, ‘<samp>a*a*</samp>’ will match the same set of strings as ‘<samp>a*</samp>’, but the former can do so in many ways and will therefore cause slow backtracking if the match fails later on. Make or-pattern branches mutually exclusive if possible, so that matching will not go far into more than one branch before failing. <p>Be especially careful with nested repetitions: they can easily result in very slow matching in the presence of ambiguities. For example, ‘<samp>\(?:a*b*\)+c</samp>’ will take a long time attempting to match even a moderately long string of ‘<samp>a</samp>’s before failing. The equivalent ‘<samp>\(?:a\|b\)*c</samp>’ is much faster, and ‘<samp>[ab]*c</samp>’ better still. </p> </li> +<li> Don’t use capturing groups unless they are really needed; that is, use ‘<samp>\(?:…\)</samp>’ instead of ‘<samp>\(…\)</samp>’ for bracketing purposes. </li> +<li> Consider using <code>rx</code> (see <a href="rx-notation">Rx Notation</a>); it can optimize some or-patterns automatically and will never introduce capturing groups unless explicitly requested. </li> +</ul> <p>If you run into regexp stack overflow despite following the above advice, don’t be afraid of performing the matching in multiple function calls, each using a simpler regexp where backtracking can more easily be contained. </p><div class="_attribution"> + <p class="_attribution-p"> + Copyright © 1990-1996, 1998-2022 Free Software Foundation, Inc. <br>Licensed under the GNU GPL license.<br> + <a href="https://www.gnu.org/software/emacs/manual/html_node/elisp/Regexp-Problems.html" class="_attribution-link">https://www.gnu.org/software/emacs/manual/html_node/elisp/Regexp-Problems.html</a> + </p> +</div> |
