In [1]:
import re
In [2]:
phoneRegex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')
In [3]:
message = 'Remember this is a message with the first number as 4444-444-4444, 343-333-3333 and the final number is 333-222-3333'
In [4]:
phoneRegex.search(message)
Out[4]:
<_sre.SRE_Match object; span=(53, 65), match='444-444-4444'>
In [5]:
phoneRegex.findall(message)
Out[5]:
['444-444-4444', '343-333-3333', '333-222-3333']

search() returns a Match object for the first match, or None if there is no match.
findall() returns a list of all matches.

If the regular expression string (here r'\d\d\d-\d\d\d-\d\d\d\d') has no more than one group, the findall() method returns a list of strings. This is the behavior for regex objects that have zero or one groups in them. With exactly one group, each string is the text matched by that group.
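As a minimal sketch of the difference between the two methods, the following reuses the same pattern on a shorter string. The sample text and the variable names are made up for illustration and are not part of the notebook above.

```python
import re

phone_regex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')
text = 'Call 415-555-1234 or 212-555-9876.'  # hypothetical sample text

# search() stops at the first match and returns a Match object (or None).
first = phone_regex.search(text)
first_number = first.group() if first else None

# findall() scans the whole string and returns every match as a string.
all_numbers = phone_regex.findall(text)

# With no match, search() returns None and findall() returns an empty list.
no_match = phone_regex.search('no numbers here')
empty = phone_regex.findall('no numbers here')

print(first_number)
print(all_numbers)
print(no_match, empty)
```

Because search() can return None, checking the result before calling group() avoids an AttributeError on input that has no match.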

Let's do an example with a regex object that has two or more groups in it.

In [6]:
phoneRegex = re.compile(r'(\d\d\d)-(\d\d\d-\d\d\d\d)')
In [7]:
phoneRegex.findall(message)
Out[7]:
[('444', '444-4444'), ('343', '333-3333'), ('333', '222-3333')]

Instead of returning a list of strings, findall() now returns a list of tuples, with one element per group.

Notice that the - between the two groups does not appear in the tuples above, because it is not inside any group. If the - is needed, put it inside one of the groups and it will appear in the corresponding element of each tuple.
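A small sketch of that idea follows, with the hyphen moved into the first group. The shortened sample text and the names `pairs` and `full_numbers` are assumptions made for this illustration.

```python
import re

message = 'first 444-444-4444, then 343-333-3333'  # hypothetical sample text

# The hyphen after the area code now sits inside the first group,
# so it shows up in the first element of each tuple.
phone_regex = re.compile(r'(\d\d\d-)(\d\d\d-\d\d\d\d)')
pairs = phone_regex.findall(message)
print(pairs)

# Joining the elements of each tuple rebuilds the full number.
full_numbers = [''.join(pair) for pair in pairs]
print(full_numbers)
```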

If the entire number is also needed as one string, the whole pattern can be wrapped in one outer group:

In [8]:
phoneRegex = re.compile(r'((\d\d\d)-(\d\d\d-\d\d\d\d))')
In [9]:
phoneRegex.findall(message)
Out[9]:
[('444-444-4444', '444', '444-4444'),
 ('343-333-3333', '343', '333-3333'),
 ('333-222-3333', '333', '222-3333')]

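The order of the groups is determined by the position of their opening parentheses, counted from left to right. That is why the outer group, whose parenthesis opens first, comes first in each tuple.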
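The same numbering can be seen on a single Match object. This is only a sketch: the sample sentence and the variable names are hypothetical.

```python
import re

phone_regex = re.compile(r'((\d\d\d)-(\d\d\d-\d\d\d\d))')
match = phone_regex.search('My number is 415-555-1234.')  # hypothetical text

# Group 0 is always the entire match; groups 1..n follow the order
# of their opening parentheses, left to right.
whole = match.group(0)
outer = match.group(1)       # first "(" : the whole number
area_code = match.group(2)   # second "(" : the first three digits
local_part = match.group(3)  # third "(" : the remaining digits

# groups() returns groups 1..n as a tuple, like one findall() element.
all_groups = match.groups()
print(all_groups)
```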

Character Classes

\d is a shorthand character class that represents any single digit from 0 to 9.

In [10]:
digitRegex = re.compile(r'(0|1|2|3|4|5|6|7|8|9)')
In [11]:
digitRegex = re.compile(r'\d')

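The two lines of code above match the same digits, 0 to 9; \d is just the shorter way to write it. (In Python 3, \d also matches digits from other scripts unless the re.ASCII flag is used.)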
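A quick way to check the equivalence on ordinary ASCII text is to run both patterns over the same string. The sketch below also adds the bracket form [0-9] for comparison; the sample string and the variable names are made up for this example.

```python
import re

sample = 'Room 42, floor 7'  # hypothetical sample text

alternation = re.compile(r'(0|1|2|3|4|5|6|7|8|9)')
shorthand = re.compile(r'\d')
bracket_range = re.compile(r'[0-9]')

# The alternation has exactly one group, so findall() still returns
# a plain list of strings, just like the other two patterns.
results = [p.findall(sample) for p in (alternation, shorthand, bracket_range)]
for result in results:
    print(result)
```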

Standard Character Class    Represents
\d                          Any numeric digit from 0 to 9.
\D                          Any character that is not a numeric digit from 0 to 9.
\w                          Any letter, numeric digit, or the underscore character. (Think of this as matching "word" characters.)
\W                          Any character that is not a letter, numeric digit, or the underscore character.
\s                          Any space, tab, or newline character. (Think of this as matching "space" characters.)
\S                          Any character that is not a space, tab, or newline.
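To see each class in the table at work, the sketch below runs all six against one short string. The sample string and the `results` dictionary are assumptions chosen for this illustration.

```python
import re

# Hypothetical sample: a letter, a digit, an underscore, a space, punctuation.
sample = 'a1_ !'

shorthands = (r'\d', r'\D', r'\w', r'\W', r'\s', r'\S')

# Map each shorthand class to the characters it matches in the sample.
results = {s: re.findall(s, sample) for s in shorthands}

for shorthand, matched in results.items():
    print(shorthand, matched)
```

Each uppercase class returns exactly the characters that its lowercase counterpart leaves out.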


In [12]:
lyrics = 'There are 12 soldiers, 500000 peoples, 44 apples, 22 videogames, 27 trees, and 3232343435 books'
In [13]:
regex = re.compile(r'\d+\s\w+')
In [14]:
regex.search(lyrics)
Out[14]:
<_sre.SRE_Match object; span=(10, 21), match='12 soldiers'>
In [15]:
regex.findall(lyrics)
Out[15]:
['12 soldiers',
 '500000 peoples',
 '44 apples',
 '22 videogames',
 '27 trees',
 '3232343435 books']

Making new character classes

In [16]:
vowelRegex = re.compile(r'[aeiou]')   # same as r'(a|e|i|o|u)'

The square-bracket form is more useful because it also accepts ranges, as in the following:

In [17]:
small_letterRegex = re.compile(r'[a-z]')
In [18]:
all_letters_from_a_to_f_Regex = re.compile(r'[a-fA-F]')
In [19]:
vowelRegex = re.compile(r'[aeiouAEIOU]')
In [20]:
vowelRegex.findall('The title for Avengers four is Avengers:0.5')
Out[20]:
['e', 'i', 'e', 'o', 'A', 'e', 'e', 'o', 'u', 'i', 'A', 'e', 'e']
In [21]:
vowelRegex = re.compile(r'[aeiouAEIOU]{2}')
In [22]:
vowelRegex.findall('The title for Avengers four is Avengers:0.5')
Out[22]:
['ou']
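Ranges and the {2} repetition can be combined in the same way. The sketch below is a hypothetical example, not from the cells above: it matches pairs of hexadecimal digits, and then shows that findall() returns non-overlapping matches when a vowel run is longer than two.

```python
import re

# Two ranges plus digits in one class: any hexadecimal digit, twice in a row.
hex_pair_regex = re.compile(r'[0-9a-fA-F]{2}')
hex_pairs = hex_pair_regex.findall('ff 0A zz 7e')  # hypothetical input
print(hex_pairs)

# findall() does not reuse characters: once a pair is consumed,
# the scan continues right after it.
vowel_pair_regex = re.compile(r'[aeiouAEIOU]{2}')
vowel_pairs = vowel_pair_regex.findall('queueing')
print(vowel_pairs)
```

In 'queueing' the vowel run is 'ueuei'. It yields 'ue' twice and the trailing 'i' is left over, because a single vowel cannot satisfy {2}.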


Negative Character Class

If the caret symbol ^ is added at the start of a character class, the class is negated:

In [23]:
vowelRegex = re.compile(r'[^aeiouAEIOU]')

The pattern above matches every character that is not in the character class.
In this case that is every character that is not a vowel, which includes the consonants.

In [24]:
vowelRegex.findall('The title for Avengers four is Avengers:0.5')
Out[24]:
['T',
 'h',
 ' ',
 't',
 't',
 'l',
 ' ',
 'f',
 'r',
 ' ',
 'v',
 'n',
 'g',
 'r',
 's',
 ' ',
 'f',
 'r',
 ' ',
 's',
 ' ',
 'v',
 'n',
 'g',
 'r',
 's',
 ':',
 '0',
 '.',
 '5']

Notice that the result contains not only letters but also spaces, digits, and punctuation, i.e. any character that is not one of aeiou or AEIOU.
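If only the consonant letters are wanted, one possible approach is to exclude more than the vowels. The sketch below relies on the fact that shorthand classes such as \W and \d can be used inside square brackets; the variable names are hypothetical, and this is one option rather than the only way to do it.

```python
import re

title = 'The title for Avengers four is Avengers:0.5'

# Everything that is not a vowel: consonants, spaces, digits, punctuation.
non_vowels = re.findall(r'[^aeiouAEIOU]', title)

# Also excluding non-word characters (\W), digits (\d), and the underscore
# leaves only letters that are not vowels.
consonants = re.findall(r'[^aeiouAEIOU\W\d_]', title)

print(len(non_vowels))
print(''.join(consonants))
```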



- The regex method findall() is passed a string and returns all matches in it, not just the first match.

- If the regex has 0 or 1 groups, findall() returns a list of strings.

- If the regex has 2 or more groups, findall() returns a list of tuples of strings.

- \d is a shorthand character class that matches digits. \w matches word characters, and \s matches whitespace characters.

- The uppercase shorthand classes \D, \W, and \S match characters that are NOT digits, word characters, and whitespace characters, respectively.

- You can make your own character classes with square brackets: [aeiou]

- A ^ caret at the start makes it a negative character class, matching anything not in the brackets: [^aeiou]