I found a “regular expression” which matches all strings consists of non-prime number of 1s. ^1?$|^(111*)\1+$

This is quite surprising because it can be proven that non-prime number of 1s is not a regular language. That is to say, no regular expression can match that language.

However, this regular expression works. Following python code will print out all prime numbers within [0,99]

import re filter
(lambda n:not re.match(r"^1?$|^(111*)\\1+$","1"*n),range(100))

So why it works? Let’s look at the regular expression: ^1?$ matches “” and “1”, which is trivial. In ^(111*)\1+$, (111) matches 2 or more 1s, so (111)\1+ matches n*k 1s, k>=2, n>=2, which is exactly all composite numbers.

Then what’s wrong with the theory? Actually, nothing is wrong. The “regular expression” in most programming language is much more expressive than the formal regular expression defined in computational theory. In this case, we do not allow “\1” in formal regular language. Actually, this “\1” may potentially require unbounded amount of memory, because the program need to remember the matched string in “\1”, which can be arbitrarily long. In formal regular language, the memory cost should be bounded when regular expression is given: bounded by the size of corresponding DFA. Thus, “\1” kind of stuff gives regular expression much more power.

Next question is: is “regular expression” in programming language equivalent to Turing Machine?