Regular expressions let us match a specific number of repetitions of a group.

In [1]:
import re

?

The ? says to match the preceding group zero or one time, i.e. it is an optional group: it can appear once or not appear at all.

In [2]:
batRegex = re.compile(r'Bat(wo)?man')   # The full pattern would be r'Batman|Batwoman'; the optional group makes it a bit shorter.
In [3]:
mo = batRegex.search('The adventures of Batman')
In [4]:
mo.group()
Out[4]:
'Batman'
In [5]:
mo = batRegex.search('The adventures of Batwoman')
In [6]:
mo.group()
Out[6]:
'Batwoman'
In [7]:
mo = batRegex.search('The adventures of Batwowowowowoman')
In [8]:
mo is None
Out[8]:
True
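
As a small sketch of what "optional" means for the match object, the snippet below asks for the text captured by the (wo) group itself. The names bat_regex, without_group and with_group are illustrative and not part of the notebook above. When the optional group does not take part in the match, group(1) returns None.

```python
import re

bat_regex = re.compile(r'Bat(wo)?man')

# The optional group is absent here, so group(1) is None.
without_group = bat_regex.search('The adventures of Batman').group(1)

# The optional group is present here, so group(1) holds the text it matched.
with_group = bat_regex.search('The adventures of Batwoman').group(1)

print(without_group)  # None
print(with_group)     # wo
```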

In the phone number example, suppose the message may or may not contain the area code. In that case we can make the area code an optional group to find the phone number:

In [9]:
phoneRegex = re.compile(r'(\d\d\d-)?\d\d\d-\d\d\d\d')
In [10]:
mo = phoneRegex.search('This is my number 333-2323')
In [11]:
mo.group()
Out[11]:
'333-2323'
In [12]:
mo = phoneRegex.search('This is my number 444-333-2323')
In [13]:
mo.group()
Out[13]:
'444-333-2323'

To match a literal question mark, escape it with a backslash in front of it, like \?
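
A minimal sketch of the difference between the two uses, assuming a made-up sentence and the illustrative name question_regex: the bare ? after (wo) still marks the group as optional, while \? stands for the question mark character in the text.

```python
import re

# (wo)? is the optional group; \? is a literal question mark.
question_regex = re.compile(r'Is it Bat(wo)?man\?')

first = question_regex.search('Is it Batman? Nobody knows.').group()
second = question_regex.search('Is it Batwoman? Nobody knows.').group()

# Without a question mark in the text there is nothing for \? to match.
no_mark = question_regex.search('Is it Batman')

print(first)    # Is it Batman?
print(second)   # Is it Batwoman?
print(no_mark)  # None
```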

*

Asterisk Character - Zero or More

In [14]:
batRegex = re.compile(r'Bat(wo)*man')
In [15]:
batRegex.search('The adventures of Batman')
Out[15]:
<_sre.SRE_Match object; span=(18, 24), match='Batman'>
In [16]:
batRegex.search('The adventures of Batwoman')
Out[16]:
<_sre.SRE_Match object; span=(18, 26), match='Batwoman'>
In [17]:
batRegex.search('The adventures of Batwowowowowowowowowowowowowoman')
Out[17]:
<_sre.SRE_Match object; span=(18, 50), match='Batwowowowowowowowowowowowowoman'>

The * can be escaped by putting the escape character (\) before it, like \*

+

One or More

In [18]:
batRegex = re.compile(r'Bat(wo)+man') # The wo group must appear one or more times, i.e. at least once.
In [19]:
batRegex.search('The adventures of Batman') is None
Out[19]:
True
In [20]:
batRegex.search('The adventures of Batwoman')
Out[20]:
<_sre.SRE_Match object; span=(18, 26), match='Batwoman'>

The + can be escaped with the escape character \, like \+

Let's take an example that needs the escape character:

In [21]:
regex = re.compile(r'(\+\*\?)+')    # Matches the literal text +*? repeated one or more times
In [22]:
regex.search('I learned about +*? regex syntax')
Out[22]:
<_sre.SRE_Match object; span=(16, 19), match='+*?'>
In [23]:
regex.search('I learned about +*?+*?+*?+*? regex syntax')
Out[23]:
<_sre.SRE_Match object; span=(16, 28), match='+*?+*?+*?+*?'>

If we want cleaner output, we can store the match object and call its group() method.

In [24]:
mo = regex.search('I learned about +*?+*?+*?+*? regex syntax')
In [25]:
mo.group()
Out[25]:
'+*?+*?+*?+*?'

{x}

What if we want to match a specific number of repetitions? For example, HaHaHa contains Ha exactly three times.

In [26]:
haRegex = re.compile(r'(Ha){3}')
In [27]:
haRegex.search('The joke was so funny he said HaHaHa')
Out[27]:
<_sre.SRE_Match object; span=(30, 36), match='HaHaHa'>

Above we can see that the string has matched the regular expression.

In [28]:
phoneRegex = re.compile(r'((\d\d\d-)?\d\d\d-\d\d\d\d(,)?){3}')

The code above creates a regex object that finds three phone numbers in a row, each with or without an area code and each optionally followed by a comma.

In [29]:
phoneRegex.search('The numbers are 333-3333,333-333-3333,111-3333')
Out[29]:
<_sre.SRE_Match object; span=(16, 46), match='333-3333,333-333-3333,111-3333'>

Without {3}, we would have to write the group out three times, like the following:

In [30]:
phoneRegex = re.compile(r'((\d\d\d-)?\d\d\d-\d\d\d\d(,)?)((\d\d\d-)?\d\d\d-\d\d\d\d(,)?)((\d\d\d-)?\d\d\d-\d\d\d\d(,)?)')

This quickly gets confusing.
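
As a quick check that the two forms behave the same, the sketch below builds both from one shared piece of pattern text. The names one_number, short_form and long_form are illustrative only; the long form is produced with Python string repetition instead of being typed out by hand.

```python
import re

# One phone number: optional area code, seven digits, optional trailing comma.
one_number = r'((\d\d\d-)?\d\d\d-\d\d\d\d(,)?)'

short_form = re.compile(one_number + '{3}')  # the group followed by {3}
long_form = re.compile(one_number * 3)       # the group written out three times

text = 'The numbers are 333-3333,333-333-3333,111-3333'
short_match = short_form.search(text).group()
long_match = long_form.search(text).group()

print(short_match == long_match)  # True
```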

{x,y}

Matching a range of possible repetitions: at least x, at most y

In [31]:
haRegex = re.compile(r'(Ha){2,5}')
In [32]:
haRegex.search('everyone on hearing the joke said HaHa')
Out[32]:
<_sre.SRE_Match object; span=(34, 38), match='HaHa'>
In [33]:
haRegex.search('everyone on hearing the joke said HaHaHaHaHaHa')
Out[33]:
<_sre.SRE_Match object; span=(34, 44), match='HaHaHaHaHa'>

The search above matched only the first five Ha, because the pattern allows at most five.

{,y} means zero to y repetitions, and {x,} means x or more repetitions (the unbounded case).
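
A short sketch of the two open-ended forms, reusing the Ha example. The names at_most_three and at_least_two are illustrative. Because {,3} allows zero repetitions, it can also match an empty string.

```python
import re

at_most_three = re.compile(r'(Ha){,3}')  # zero to three repetitions
at_least_two = re.compile(r'(Ha){2,}')   # two or more repetitions, no upper limit

capped = at_most_three.search('HaHaHaHaHa').group()
uncapped = at_least_two.search('HaHaHaHaHa').group()

# Zero repetitions are allowed, so the match is an empty string at position 0.
empty = at_most_three.search('no laughter here').group()

# A single Ha is below the minimum of two, so there is no match.
too_few = at_least_two.search('Ha')

print(capped)      # HaHaHa
print(uncapped)    # HaHaHaHaHa
print(repr(empty)) # ''
print(too_few)     # None
```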

Greedy/Non-Greedy

By default, regular expressions in Python do greedy matches.

This means that they try to match the longest possible string that fits the pattern.

In [34]:
digitRegex = re.compile(r'(\d){3,5}')
In [35]:
digitRegex.search('123456789')
Out[35]:
<_sre.SRE_Match object; span=(0, 5), match='12345'>

If we want a non-greedy match, we put a ? after the closing curly brace.

In [36]:
digitRegex = re.compile(r'(\d){3,5}?')
In [37]:
digitRegex.search('123456789')
Out[37]:
<_sre.SRE_Match object; span=(0, 3), match='123'>
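
The same comparison can be sketched on the Ha example. The variable names are illustrative. The last line goes one step beyond the notebook: Python's re module also accepts a ? after +, which makes that quantifier non-greedy in the same way.

```python
import re

text = 'HaHaHaHaHa'

greedy_match = re.search(r'(Ha){2,5}', text).group()   # takes as many as allowed
lazy_match = re.search(r'(Ha){2,5}?', text).group()    # stops at the minimum

# +? is the non-greedy version of +, so it matches a single digit here.
lazy_plus = re.search(r'\d+?', '123456789').group()

print(greedy_match)  # HaHaHaHaHa
print(lazy_match)    # HaHa
print(lazy_plus)     # 1
```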



-The ? says the group matches zero or one time.

-The * says the group matches zero or more times.

-The + says the group matches one or more times.

-Curly braces with one number match a specific number of times.

-Curly braces with two numbers match a minimum and a maximum number of times.

-Leaving out the first number in the curly braces means there is no minimum; leaving out the second means there is no maximum.

-Greedy matching matches the longest string possible; non-greedy matching matches the shortest string possible.

-Putting a question mark after the curly braces makes it do a non-greedy match.
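
The points above can be checked side by side with a small sketch. The patterns dictionary, the words list and the results name are illustrative and not part of the notebook; each pattern is the Bat(wo)man example with a different quantifier.

```python
import re

# Quantifier label -> pattern using that quantifier on the (wo) group.
patterns = {
    '?': r'Bat(wo)?man',
    '*': r'Bat(wo)*man',
    '+': r'Bat(wo)+man',
    '{2}': r'Bat(wo){2}man',
}
words = ['Batman', 'Batwoman', 'Batwowoman']

# For each quantifier, record which of the words the pattern finds.
results = {}
for label, pattern in patterns.items():
    regex = re.compile(pattern)
    results[label] = [regex.search(word) is not None for word in words]
    print(label, results[label])

# ?   [True, True, False]
# *   [True, True, True]
# +   [False, True, True]
# {2} [False, False, True]
```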