Why does this regular expression kill the Java regex engine?

I have this naive regex "<([\s]|[^<])+?>" (excluding the quotation marks). It seems so straightforward but it is indeed evil when it works against the below HTML text. It sends the Java regular expression engine to an infinite loop.

I have another regex ("<.+?>"), which does somewhat the same thing, but it doesn't kill anything. Do you know why this happens?

<script language="JavaScript" type="text/javascript">
        var numDivs, layerName;
        layerName = "lnavLayer";
        catLinkName = "category";
        numDivs = 2;
        function toggleLayer(layerID){
            if (!(navigator.appName == "Netscape" && navigator.appVersion.substr(0, 1) < 5)){
                thisLayer = document.getElementById(layerName + layerID);
                categoryLink = document.getElementById(catLinkName + layerID);
                closeThem();
                if (thisLayer.className == 'subnavDefault'){
                    thisLayer.className = 'subnavToggled';
                    categoryLink.className = 'leftnavLinkSelectedSection';
                }
            }
        }
        function closeThem(){
            for(x = 0; x < numDivs; x++){
                theLayer = document.getElementById(layerName + (x
+ 1));
                thecategoryLink = document.getElementById(catLinkName + (x + 1));
                theLayer.className = 'subnavDefault';
                thecategoryLink.className = 'leftnavLink';
            }
        } var flag = 0; var lastClicked = 0
    //-->
    </script>

it even keeps looping with an online Java regex tool (such as www.fileformat.info/tool/regex.htm) or a utility like RegexBuddy.


Asked by: Charlie816 | Posted: 21-01-2022






Answer 1

The reason the Java regex engine crashes is that this part of your regex causes a stack overflow (indeed!):

[\s]|[^<]

What happens here is that every character matched by \s can also be matched by [^<]. That means there are two ways to match each whitespace character. If we represent the two character classes with A and B:

A|B

Then a string of three spaces could be matched as AAA, AAB, ABA, ABB, BAA, BAB, BBA, or BBB. In other words the complexity of this part of the regex is 2^N. This will kill any regex engine that doesn't have any safeguards against what I call catastrophic backtracking.

When using alternation (vertical bar) in a regex, always make sure the alternatives are mutually exclusive. That is, at most one of the alternatives may be allowed to match any given bit of text.

Answered by: Carlos385 | Posted: 22-02-2022



Answer 2

The regex ([\s]|[^<]) in plain terms means any single character that IS white-space or IS NOT a < character, which is redundant because white-space characters are NOT a < character. It appears to me that what you really mean is:

`"<([^<])+?>"`

I am not sure if this will solve the infinite loop, but I thought I'd point this out.

Answered by: Miller739 | Posted: 22-02-2022



Answer 3

Another problem (in addition to what Jan said) is that you're matching one character at a time inside the parentheses, equivalent to this simplified example:

(.)+

Each time this part of the regex is executed, the regex engine has to save the start and end positions of whatever was matched by the subexpression inside the parens, in case it needs to backtrack. This would be true even if it were a non-capturing group, i.e.,

(?:.)+

...but because it's a capturing group, even more information has to be saved. Going through all that for one character at a time gets really expensive. It's almost never correct to match a single character inside a parenthesized group with a * or + quantifier on the group. Also, you should use capturing groups only when you need to capture something; otherwise, use the non-capturing variety.

Answered by: Patrick843 | Posted: 22-02-2022



Similar questions

java - Regular expression to parse option string

I'm using the Java matcher to try and match the following: @tag TYPE_WITH_POSSIBLE_SUBTYPE -PARNAME1=PARVALUE1 -PARNAME2=PARVALUE2: MESSAGE The TYPE_WITH_POSSIBLE_SUBTYPE consists of letters with periods. Every parameter has to consist of letters, and every value has to consist of numerics/letters. There can be 0 or more parameters. Immediately after the last parameter value comes...


regex - Regular Expression (Java)

How do I match the following string? http://localhost:8080/MenuTest/index.action The regex should return true if it contains "MenuTest" in the above pattern. Cheers


regex - Why doesn't this Java regular expression work?

I need to create a regular expression that allows a string to contain any number of: alphanumeric characters spaces ( ) &amp; . No other characters are permitted. I used RegexBuddy to construct the following regex, which works correctly when I test it within RegexBuddy:


java - co. corp. inc. regular expression

This is my first time working with regular expressions and I've been trying to get a regular expression working that would match the following: apple apple inc. apple co. apple corp. but would not match: inc. apple co. apple apple co. inc. apple corp. inc. apple inc. corp. and so on... This is...


regex - Java Regular Expression Problem

I have a group of lines like the following: tb-set-node-recipe $vpn1 W2K3_SP2_VPN_SRV tb-set-node-os $vpn2 I_W2K3_SP2_VPN_SRV tb-set-node-os $xpcli1 I_XP_SP3_VPN_CLI tb-set-node-os $xpcli2 I_XP_SP2_VPN_CLI tb-set-node-os $xpcli3 I_XP_SP1_VPN_CLI tb-set-node-recipe $ftp1 FC8_KS_FTP_SRV tb-set-node-os $smb1 XP_SP3-STD tb-set-node-recipe $web1 FC8_KS_WEB_SRV I am...


regex - What Java regular expression do I need to match this text?

I'm trying to match the following using a regular expression in Java - I have some data separated by the two characters 'ZZ'. Each record starts with 'ZZ' and finishes with 'ZZ' - I want to match a record with no ending 'ZZ' for example, I want to match the trailing 'ZZanychars' below (Note: the *'s are not included in the string - they're just marking the bit I want to match). ZZanycharsZZZZanycharsZZZZan...


regex - Regular Expression problem in Java

I am trying to create a regular expression for the replaceAll method in Java. The test string is abXYabcXYZ and the pattern is abc. I want to replace any symbol except the pattern with +. For example the string abXYabcXYZ and pattern [^(abc)] should return ++++abc+++, but in my case it returns ab++abc+++.


java - Regular expression, replace all commas between double quotes

I have this string: 1001,"Fitzsimmons, Des Marteau, Beale and Nunn",109,"George","COD","Standard",,109,8/14/1998 8:50:02 What regular expression would I use to replace the commas in the "Fitzsimmons, Des Marteau, Beale and Nunn" with a pipe | so it is: "Fitzsimmons| Des Marteau| Beale and Nunn" Should have clarified, I am doing a ...


java - Regular expression with an = and a ;

I'm trying to use a regular expression to find all substrings that start with an equals sign (=) and ends with a semicolon (;) with any number of characters in between. It should be something like this =*; For some reason, the equals is not registering. Is there some sort of escape character that will make the regex notice my equals sign? I'm working in Java if that ha...


java - Regular Expression to Match " | "

I am trying to use Java's useDelimiter method on it's Scanner class to do some simple parsing. Basically each line is a record delimited by &quot; | &quot;, so for example: 2 | John Doe 3 | Jane Doe 4 | Jackie Chan The...


java - Escape path separator in a regular expression

I need to write a regular expression that finds javascript files that match &lt;anypath&gt;&lt;slash&gt;js&lt;slash&gt;&lt;anything&gt;.js For example, it should work for both : c:\mysite\js\common.js (Windows) /var/www/mysite/js/common.js (UNIX) The problem is that the file separator in Windows is not being properly escaped : pattern...


regex - Negating literal strings in a Java regular expression

So regular expressions seem to match on the longest possible match. For instance: public static void main(String[] args) { String s = "ClarkRalphKentGuyGreenGardnerClarkSupermanKent"; Pattern p = Pattern.compile("Clark.*Kent", Pattern.CASE_INSENSITIVE); Matcher myMatcher = p.matcher(s); int i = 1; while (myMatcher.find()) { System.out.println(i++ + ". " + myMatcher.group()); ...


java - How would you use a regular expression to ignore strings that contain a specific substring?

How would I go about using a negative lookbehind(or any other method) regular expression to ignore strings that contains a specific substring? I've read two previous stackoverflow questions: java-regexp-for-file-filtering


java - Regular expression to parse option string

I'm using the Java matcher to try and match the following: @tag TYPE_WITH_POSSIBLE_SUBTYPE -PARNAME1=PARVALUE1 -PARNAME2=PARVALUE2: MESSAGE The TYPE_WITH_POSSIBLE_SUBTYPE consists of letters with periods. Every parameter has to consist of letters, and every value has to consist of numerics/letters. There can be 0 or more parameters. Immediately after the last parameter value comes...


java - el expression in jsp:invoke

I'm trying to use the following snippet inside my tag file: &lt;%@ attribute name="content" fragment="true"%&gt; ... &lt;c:set var="part" value="content"/&gt; &lt;jsp:invoke fragment="${part}" var="partValue"/&gt; ... and compiler gives me the following error: Syntax error, insert ") Statement" to complete IfStatement so as I understand it's not permitted ...


java - String delimited regular expression

I'm trying to build a bbcode parser, but I'm having quite some problems figuring out how to avoid matching too widely. For example I want to implement a [list] to conversion like this: \[list\](.*)\[/list\] would be replaced by this: &lt;ul&gt;$1&lt;/ul&gt; This works fine, except if I have two lists where the regular expression matches the beginning tag ...


regex - Using Java to find substring of a bigger string using Regular Expression

If I have a string like this: FOO[BAR] I need a generic way to get the "BAR" string out of the string so that no matter what string is between the square brackets it would be able to get the string. e.g. FOO[DOG] = DOG FOO[CAT] = CAT


java - Why does this code cause an "illegal start of expression" exception?

These are my questions: I'm getting a couple of errors on the line "public static boolean validNumCheck(String num){" - "illegal start of expression", "';' expected", and "')' expected". How can I give the user 3 tries in total for each number? I believe right now the programme asks the user for 3 numbers and gives them 3 tries in total to get the numbers correct (My explanations suck... re...


regex - How do I avoid the implicit "^" and "$" in Java regular expression matching?

I've been struggling with doing some relatively straightforward regular expression matching in Java 1.4.2. I'm much more comfortable with the Perl way of doing things. Here's what's going on: I am attempting to match /^&lt;foo&gt;/ from "&lt;foo&gt;&lt;bar&gt;" I try: Pattern myPattern= Pattern.compile("^&lt;foo&gt;"); Matcher myMatcher= myPattern.matcher("&lt;foo&gt;&lt;bar&gt;"); System....


java - Regular expression to allow a set of characters and disallow others

I want to restrict the users from entering the below special characters in a field: œçşÇŞ ğĞščřŠŘŇĚŽĎŤČňěž ůŮ İťı —¿„”*@ Newline Carriage return A few more will be added to this list but I will have the complete restricted list eventually. But he can enter certain foreign characters like äöüÄÖÜÿï etc in addition to alphanumeric chars, usual...






Still can't find your answer? Check out these amazing Java communities for help...



Java Reddit Community | Java Help Reddit Community | Dev.to Java Community | Java Discord | Java Programmers (Facebook) | Java developers (Facebook)



top