summaryrefslogtreecommitdiff
path: root/devdocs/elisp/regexp-problems.html
blob: b82d41828a00c545e6b8dce8a8bf7c26dd7d7133 (plain)
1
2
3
4
5
6
7
8
9
10
11
12
 <h4 class="subsection">Problems with Regular Expressions</h4>    <p>The Emacs regexp implementation, like many of its kind, is generally robust but occasionally causes trouble in either of two ways: matching may run out of internal stack space and signal an error, and it can take a long time to complete. The advice below will make these symptoms less likely and help alleviate problems that do arise. </p> <ul> <li> Anchor regexps at the beginning of a line, string or buffer using zero-width assertions (‘<samp>^</samp>’ and <code>\`</code>). This takes advantage of fast paths in the implementation and can avoid futile matching attempts. Other zero-width assertions may also bring benefits by causing a match to fail early. </li>
<li> Avoid or-patterns in favor of character alternatives: write ‘<samp>[ab]</samp>’ instead of ‘<samp>a\|b</samp>’. Recall that ‘<samp>\s-</samp>’ and ‘<samp>\sw</samp>’ are equivalent to ‘<samp>[[:space:]]</samp>’ and ‘<samp>[[:word:]]</samp>’, respectively. </li>
<li> Since the last branch of an or-pattern does not add a backtrack point on the stack, consider putting the most likely matched pattern last. For example, ‘<samp>^\(?:a\|.b\)*c</samp>’ will run out of stack if trying to match a very long string of ‘<samp>a</samp>’s, but the equivalent ‘<samp>^\(?:.b\|a\)*c</samp>’ will not. <p>(It is a trade-off: successfully matched or-patterns run faster with the most frequently matched pattern first.) </p> </li>
<li> Try to ensure that any part of the text can only match in a single way. For example, ‘<samp>a*a*</samp>’ will match the same set of strings as ‘<samp>a*</samp>’, but the former can do so in many ways and will therefore cause slow backtracking if the match fails later on. Make or-pattern branches mutually exclusive if possible, so that matching will not go far into more than one branch before failing. <p>Be especially careful with nested repetitions: they can easily result in very slow matching in the presence of ambiguities. For example, ‘<samp>\(?:a*b*\)+c</samp>’ will take a long time attempting to match even a moderately long string of ‘<samp>a</samp>’s before failing. The equivalent ‘<samp>\(?:a\|b\)*c</samp>’ is much faster, and ‘<samp>[ab]*c</samp>’ better still. </p> </li>
<li> Don’t use capturing groups unless they are really needed; that is, use ‘<samp>\(?:…\)</samp>’ instead of ‘<samp>\(…\)</samp>’ for bracketing purposes. </li>
<li> Consider using <code>rx</code> (see <a href="rx-notation">Rx Notation</a>); it can optimize some or-patterns automatically and will never introduce capturing groups unless explicitly requested. </li>
</ul> <p>If you run into regexp stack overflow despite following the above advice, don’t be afraid of performing the matching in multiple function calls, each using a simpler regexp where backtracking can more easily be contained. </p><div class="_attribution">
  <p class="_attribution-p">
    Copyright &copy; 1990-1996, 1998-2022 Free Software Foundation, Inc. <br>Licensed under the GNU GPL license.<br>
    <a href="https://www.gnu.org/software/emacs/manual/html_node/elisp/Regexp-Problems.html" class="_attribution-link">https://www.gnu.org/software/emacs/manual/html_node/elisp/Regexp-Problems.html</a>
  </p>
</div>