Python Regex

Applications of Regular expressions

azam sayeed
4 min read · May 26, 2020

Extracting specific text, such as timestamps, from logs like those generated by the Java Log4j framework.

Basic client-side validation of input fields on websites, such as email address formats and password requirements.

Filtering a Pandas DataFrame to remove invalid phone numbers based on country code, number of digits, etc.

Updating a student's address or subject code by selecting the group of student records that match a required regular expression and updating them together, rather than changing records manually one at a time.
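As a rough illustration of the first application, the sketch below pulls timestamps out of log text. The log lines and the date, time, and milliseconds layout are assumptions made for this example; real Log4j output depends on the configured pattern layout.

```python
import re

# Hypothetical log lines; the timestamp layout is an assumption for this sketch
log_text = """2020-05-26 10:15:32,123 INFO  [main] App - Service started
2020-05-26 10:15:33,007 ERROR [worker-1] App - Connection refused
"""

# date, space, time, comma, milliseconds
timestamp_pattern = re.compile(r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}")

timestamps = timestamp_pattern.findall(log_text)
print(timestamps)
```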

A regular expression (regex or RE) is a special text string that describes a search pattern.

Basic Example: Extract names and ages from a string where each name is capitalized and each age is a two-digit number

import re

String1 = '''
Azam is 24 and Pri is 24
Sam is 20 and Zak is 23
'''
# Ages are runs of 1 to 3 digits; names are a capital letter followed by lowercase letters
ages = re.findall(r'\d{1,3}', String1)
names = re.findall(r'[A-Z][a-z]+', String1)
Dict = {}
x = 0
for eachname in names:
    # Pair each name with the age found at the same position
    Dict[eachname] = ages[x]
    x = x + 1

print(Dict)
o/p:
{'Azam': '24', 'Pri': '24', 'Sam': '20', 'Zak': '23'}
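Pairing two separate lists by position works only while every name has exactly one age. One possible alternative, sketched here on the assumption that each name is followed by the word "is" and then the age, is to capture both values with a single pattern:

```python
import re

text = """
Azam is 24 and Pri is 24
Sam is 20 and Zak is 23
"""

# The pattern has two groups, so findall returns (name, age) tuples
pairs = re.findall(r"([A-Z][a-z]+) is (\d{1,3})", text)
name_to_age = dict(pairs)
print(name_to_age)
```

Because the name and the age come from the same match, a stray number or capitalized word elsewhere in the text cannot shift the pairing.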

Both the string and the regular expression have their own cursor: the regex engine advances through the pattern and the string together while it tries to match.

Regular Expression Operations

1. Find a specific word in a string
import re

if re.search("help", "God will help us go through 2020"):
    print("There is Help")

allHelp = re.findall("help", "we need to help the daily wagers who are helpless")
for i in allHelp:
    print(i)
o/p:
There is Help
help
help

# re.match only matches at the beginning of the string
pattern = r"Cookie"
sequence = "Cookie"
if re.match(pattern, sequence):
    print("Match!")
else:
    print("Not a match!")
o/p:
Match!
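The three lookup functions differ in where they are allowed to match. The following minimal sketch contrasts them; the "Cookies" string is an assumption added for illustration.

```python
import re

sentence = "God will help us go through 2020"

at_start = re.match("help", sentence)      # None: "help" is not at index 0
anywhere = re.search("help", sentence)     # finds "help" inside the sentence
whole = re.fullmatch("Cookie", "Cookies")  # None: the trailing "s" is not covered

print(at_start, anywhere.span(), whole)
```

In short, re.search scans the whole string, re.match tests only the beginning, and re.fullmatch requires the pattern to cover the entire string.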

2. Generate an iterator: get the starting index (inclusive, counting from 0) and ending index (exclusive) of each match as a tuple

import re

Str = "we need to help the daily wagers who are helpless"
for i in re.finditer("help", Str):
    locTuple = i.span()  # (start, end) of this match
    print(locTuple)
o/p:
(11, 15)
(41, 45)
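Each item produced by re.finditer is a match object, so the positions can also be read individually. A small sketch using the same sentence:

```python
import re

text = "we need to help the daily wagers who are helpless"

for m in re.finditer("help", text):
    # start() and end() are the two halves of span(); group() is the matched text
    print(m.start(), m.end(), m.group(), text[m.start():m.end()])

positions = [(m.start(), m.end()) for m in re.finditer("help", text)]
```

Slicing the original string with start() and end() gives back the same text as group(), which shows that the end index is exclusive.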

3. Match any one of several alternatives: match words that follow a particular pattern

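import re

Str = "improve, approve, redrove, commove"
# | separates alternatives; p+ is one or more 'p'; . is any single character
str1 = re.findall(r"imp+.ove|app+.ove", Str)
# ^ anchors the match to the start of the string
str2 = re.findall(r"^imp+.ove", Str)
print(str1)
print(str2)
o/p:
['improve', 'approve']
['improve']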
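When the alternatives differ by a single letter, a character set in square brackets is usually shorter than spelling out each alternative. The sketch below uses a made-up word list for the first case and the sentence above for the second.

```python
import re

words = "bat cat sat mat"

# [bcs] matches exactly one character: b, c, or s
matches = re.findall(r"[bcs]at", words)
print(matches)

# A non-capturing group with | does the same for multi-letter alternatives
prefixed = re.findall(r"(?:im|ap)prove", "improve, approve, redrove, commove")
print(prefixed)
```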

4. Replacing a String using re.compile

import re

food = "coke burger bat cat sat"
regex = re.compile("[b]at")
food = regex.sub("diseases", food)
print(food)
o/p:
coke burger diseases cat sat

# In a raw string pattern, \\ matches one literal backslash
str3 = 'here is \\notation'
print(re.search(r'\\notation', str3))
o/p:
<re.Match object; span=(8, 17), match='\\notation'>
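Replacement can also be limited or counted. As a brief sketch, re.sub takes an optional count argument, and re.subn reports how many replacements were made; the replacement text "X" is only a placeholder chosen for this example.

```python
import re

food = "coke burger bat cat sat"

# Replace only the first two matches
limited = re.sub(r"[bcs]at", "X", food, count=2)

# subn returns the new string together with the number of replacements made
replaced, how_many = re.subn(r"[bcs]at", "X", food)

print(limited)
print(replaced, how_many)
```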

5. Match a Single Character

import re

str3 = '''keep the hopes
high destiny
can change'''
regex = re.compile("\n")
randstr = regex.sub(" ", str3)
print(str3)
print(randstr)
# other backslash escapes for special characters in a Python string
# \b: backspace (inside a regex pattern, \b means a word boundary instead)
# \f: form feed
# \r: carriage return
# \t: tab
# \v: vertical tab
o/p:
keep the hopes
high destiny
can change
keep the hopes high destiny can change
randStr = "12345"
# \d: any digit, \D: anything other than a digit
# raw strings (r"...") keep the backslash from being read as a Python string escape
print("Matches:", len(re.findall(r"\d", randStr)))
print("Matches:", len(re.findall(r"\D", randStr)))
print("Matches:", len(re.findall(r"\d{5}", randStr)))
o/p:
Matches: 5
Matches: 0
Matches: 1
num = "123 1234 12345 123456 1234567"
print("Matches:", len(re.findall(r"\d{5,7}", num)))
# it finds the runs of 5 to 7 digits: 12345, 123456 and 1234567
o/p:
Matches: 3
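One caveat with counted repetition is that a longer run of digits still contains a shorter run that matches. A minimal sketch, using an 8-digit string chosen for illustration, shows how word boundaries (\b) prevent this:

```python
import re

digits = "12345678"

# Without boundaries, the first 7 digits of the 8-digit run still match
unbounded = re.findall(r"\d{5,7}", digits)

# \b requires a word boundary on both sides, so the 8-digit run is rejected
bounded = re.findall(r"\b\d{5,7}\b", digits)

print(unbounded, bounded)
```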

Review of Wild Card Characters

+ - Matches one or more repetitions of the element to its left.

* - Matches zero or more repetitions of the element to its left.

? - Matches zero or one occurrence of the element to its left.

{x} - Repeat exactly x times.

{x,} - Repeat at least x times.

{x,y} - Repeat at least x times but no more than y times.

re.search(r'\d{9,10}', '0987684321').group()
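The quantifiers above are easiest to compare side by side on one short string. In this sketch the sample text "a ab abb" is an arbitrary choice, and each pattern repeats only the letter b:

```python
import re

sample = "a ab abb"

one_or_more = re.findall(r"ab+", sample)      # "a" followed by at least one "b"
zero_or_more = re.findall(r"ab*", sample)     # "a" followed by any number of "b"
zero_or_one = re.findall(r"ab?", sample)      # "a" followed by at most one "b"
exactly_two = re.findall(r"ab{2}", sample)    # "a" followed by exactly two "b"
at_least_one = re.findall(r"ab{1,}", sample)  # same meaning as "ab+"

for result in (one_or_more, zero_or_more, zero_or_one, exactly_two, at_least_one):
    print(result)
```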

. - A period. Matches any single character except newline character.

\w - Lowercase w. Matches any single letter, digit or underscore.

\W - Uppercase w. Matches any character not part of \w (lowercase w).

\s - Lowercase s. Matches a single whitespace character such as space, newline, tab, or carriage return.

\S - Uppercase s. Matches any character not part of \s (lowercase s).

\t - Lowercase t. Matches tab.

\n - Lowercase n. Matches newline.

\r - Lowercase r. Matches carriage return.

\d - Lowercase d. Matches decimal digit 0-9.

^ - Caret. Matches a pattern at the start of the string.
$ - Matches a pattern at the end of string.

[abc] - Matches a or b or c.

[a-zA-Z0-9] - Matches any single character in the ranges a to z, A to Z, or 0 to 9.

[^abc] - Characters that are not in a set can be matched by complementing it: if the first character of the set is ^, any character that is not in the set is matched.

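\ - Backslash. If the character following the backslash forms a recognized escape sequence (such as \d or \n), that sequence takes its special meaning; before a metacharacter such as . or *, the backslash makes it match literally.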
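A few of these character classes and anchors can be tried on one sample string. The string below is invented for this sketch and contains a tab character so that \s has more than plain spaces to match.

```python
import re

sample = "Order 66 shipped\ton 2020-05-26"

numbers = re.findall(r"\d+", sample)                  # runs of digits
words = re.findall(r"\w+", sample)                    # runs of letters, digits, underscore
fields = re.split(r"\s+", sample)                     # split on spaces and the tab
starts_with_order = bool(re.search(r"^Order", sample))
ends_with_26 = bool(re.search(r"26$", sample))
others = re.findall(r"[^a-zA-Z0-9\s]", sample)        # complemented set
literal_dots = re.findall(r"\.", "v1.2.3")            # escaped period matches only "."

print(numbers, words, fields, starts_with_order, ends_with_26, others, literal_dots)
```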

Real-world Examples

Phone Number Verifications, FullName Validation, and Email Address

All phone numbers must have 3 digits, a ‘-’ sign, 3 more digits, another ‘-’ sign, and 4 final digits

import re

# \w same as [a-zA-Z0-9_] (for ASCII text)
# \W same as [^a-zA-Z0-9_]
# \d: any digit, \D: anything other than a digit
phn = "412-555-1212"
if re.search(r"\d{3}-\d{3}-\d{4}", phn):
    print("valid phone format")
# \s same as [ \f\n\r\t\v]
# \S same as [^ \f\n\r\t\v]
if re.search(r"\w{2,20}\s\w{2,20}", "azam sayeed"):
    print("fullname is valid")
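Note that re.search succeeds if the pattern appears anywhere in the input, so a string with extra digits around a valid number would still pass. For strict validation, one option is re.fullmatch, sketched below; the helper name is_valid_phone and the second test number are hypothetical and used only for illustration.

```python
import re

PHONE = re.compile(r"\d{3}-\d{3}-\d{4}")

def is_valid_phone(text):
    """Return True only if the whole string has the form ddd-ddd-dddd."""
    return PHONE.fullmatch(text) is not None

print(is_valid_phone("412-555-1212"))
print(is_valid_phone("9412-555-12129"))

# search still finds a valid-looking number inside the longer string
found_inside = PHONE.search("9412-555-12129") is not None
print(found_inside)
```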

An email address should have 1–20 letters (lowercase or uppercase), digits, or the characters ._%+-, followed by an @ symbol, then 2–20 letters, digits, or the characters .-, then a period and 2–3 letters

if re.search(r"\w{1,20}[.%+-]*@\w{1,20}[-]*\.[a-zA-Z]{2,3}", "azam_1@gmail.com"):
    print("email is valid")
str3 = "azam@gmail.com akr@gmail.com lol.c"
# the period before the final letters is escaped so it matches a literal "."
t = re.findall(r"\w{1,20}[.%+-]*@\w{1,20}[-]*\.[a-zA-Z]{2,3}", str3)
print(t)
o/p:
email is valid
['azam@gmail.com', 'akr@gmail.com']

Web scraping with RE

Scraping useful information from webpages using Regular Expression

Example of Extracting Phone Numbers from websites

import urllib.request
import re

url = "https://www.summet.com/dmsi/html/codesamples/addresses.html"
response = urllib.request.urlopen(url)
html = response.read()
htmlStr = html.decode()
# example of a matching number: (257) 563-7401
pdata = re.findall(r"\(\d{3}\)\s\d{3}-\d{4}", htmlStr)
print(pdata)
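To try the same pattern without a network connection, it can be run against an HTML fragment held in a string. The fragment below is hypothetical and only imitates the kind of markup a downloaded page might contain.

```python
import re

# Hypothetical HTML fragment standing in for a downloaded page
html_str = """
<ul>
  <li>Alice Example, (257) 563-7401</li>
  <li>Bob Example, (372) 587-2335</li>
  <li>Fax: 257-563-7401</li>
</ul>
"""

# Parentheses are escaped because ( and ) are metacharacters
phone_pattern = re.compile(r"\(\d{3}\)\s\d{3}-\d{4}")
phones = phone_pattern.findall(html_str)
print(phones)
```

The fax line is skipped because its area code is not wrapped in parentheses, which shows how strictly the pattern follows the expected format.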
