Damn Regex isn’t hard I just can’t remember it.

by Simon Parker, February 11, 2018

I have learned and forgotten regex about 30 times now. The following are my notes while I learn it again, so I can refer back to them.

Basic Regex Matching

To match a string you can just type in the actual string you want to match, so to match cat you can just type in cat as the regex and it will match. Matching is case-sensitive by default, so cat will not match Cat.

\d matches any digit from 0 to 9. The preceding backslash is the escape symbol in regex.

The . is a wildcard and matches any character, including whitespace (but by default not a newline). So to match three chars and a 4th char which is a full stop, ie “htb.”, you can use ...\. escaping the last one. This matches any three characters and then a literal .
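
To try the basics out, here is a minimal sketch using Python's re module. Python is an assumption here, as these notes don't depend on any particular engine, and the sample strings are made up.

```python
import re

# A literal pattern matches itself; matching is case-sensitive by default.
lower_hit = re.search(r"cat", "my cat")
upper_hit = re.search(r"cat", "my Cat")
print(lower_hit is not None)  # True
print(upper_hit is not None)  # False

# \d matches a single digit, 0 through 9.
digits = re.findall(r"\d", "a1b20")
print(digits)  # ['1', '2', '0']

# Three wildcards followed by an escaped, literal full stop.
dot_match = re.search(r"...\.", "htb.")
print(dot_match.group())  # htb.
```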

Inside square brackets you can match specific characters. So to match can but not fan or dan you could write [c]an, which means only c is acceptable in the first position. Similarly, if you want to match all of these but not pan you could write [cfd]an. The square bracket only ever matches one character; it just defines which ones are acceptable.

Adding the hat ^ at the start of the square brackets means match any character except these ones, so [^cfd]an would only match pan above.

Use ranges inside of square brackets: [0-6] matches the characters 0 to 6 and nothing else. [^b-x] will match any character other than the lowercase letters b to x.

\w is a special character which is the equivalent of [A-Za-z0-9_], a class often seen for matching English language characters. It still matches one character only: any upper or lower case letter, digit, or the underscore, but no other special characters.
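
The bracket rules above can be checked with a short Python sketch (again assuming the re module; fullmatch is used so the whole word has to fit the pattern).

```python
import re

words = ["can", "fan", "dan", "pan"]

# [cfd]an accepts c, f or d in the first position.
allowed = [w for w in words if re.fullmatch(r"[cfd]an", w)]
print(allowed)  # ['can', 'fan', 'dan']

# [^cfd]an accepts anything except c, f or d in the first position.
excluded = [w for w in words if re.fullmatch(r"[^cfd]an", w)]
print(excluded)  # ['pan']

# A range inside the brackets: only the digits 0 to 6.
low_digits = re.findall(r"[0-6]", "0123456789")
print(low_digits)  # ['0', '1', '2', '3', '4', '5', '6']

# \w covers letters, digits and the underscore, one character at a time.
word_chars = re.findall(r"\w", "a_B 9!")
print(word_chars)  # ['a', '_', 'B', '9']
```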

Matching Multiple Characters

use curly braces to match multiple characters so a{3} will match a 3 times.

Most regex engines also allow ranges in here, for example a{1,3} will match a between one and three times

other examples

[wxy]{5} matches five characters which must be w x or y

.{2,6} matches between 2 and 6 of any character.

[https]{4,5} would be a crap way to match http or https but also matches hhhhh or htpsp

better way would be https? as that makes the last s optional so matches both
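
A quick Python sketch of the curly brace counts, and of why [https]{4,5} is the sloppy option compared to https?. The candidate strings are just the ones from the notes above.

```python
import re

# a{1,3} accepts one, two or three a's, but not four.
ranged = [s for s in ["a", "aa", "aaa", "aaaa"] if re.fullmatch(r"a{1,3}", s)]
print(ranged)  # ['a', 'aa', 'aaa']

candidates = ["http", "https", "hhhhh", "htpsp"]

# The character class only checks each letter, not their order.
sloppy_hits = [s for s in candidates if re.fullmatch(r"[https]{4,5}", s)]
print(sloppy_hits)  # ['http', 'https', 'hhhhh', 'htpsp']

# https? spells out the word and makes only the final s optional.
better_hits = [s for s in candidates if re.fullmatch(r"https?", s)]
print(better_hits)  # ['http', 'https']
```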

 

Optional Characters

using a ? means the preceding character is optional

ab?c matches abc and ac as the b is optional.

To match a literal question mark in a string you can escape it with \?
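
A small Python check of the optional character and the escaped question mark (sample strings are made up for illustration).

```python
import re

# ab?c: the b may appear once or not at all, but not twice.
optional_b = [s for s in ["abc", "ac", "abbc"] if re.fullmatch(r"ab?c", s)]
print(optional_b)  # ['abc', 'ac']

# \? matches a literal question mark.
question_marks = re.findall(r"\?", "what? why?")
print(question_marks)  # ['?', '?']
```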

Whitespace

\s matches whitespace of all types, including tabs, spaces, new lines and carriage returns

Example to match

  1.  abc
  2.     abc
  3.          abc

\d\.\s+abc matches any digit \d, then a dot, which needs to be escaped \. then one or more whitespace characters \s+ and finally the abc string
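
The same example as a Python sketch. The fourth line, with no whitespace after the dot, is an extra made-up case to show that \s+ needs at least one whitespace character.

```python
import re

lines = ["1.  abc", "2.     abc", "3.          abc", "4.abc"]
pattern = re.compile(r"\d\.\s+abc")

# Keep only the lines that fit the pattern from start to end.
matched = [line for line in lines if pattern.fullmatch(line)]
print(matched)  # ['1.  abc', '2.     abc', '3.          abc']
```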

Start and end of a line

Being very specific ensures no unwanted matches

^ indicates the start of a line and $ indicates the end of a line

Example match 1 but not 2 below

  1. mission successful
  2. mission unsuccessful

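^\w+\ssuccessful matches a single word at the start of the line, then one whitespace character \s, then successful straight afterwards. Line 2 fails because the un sits between the whitespace and successful.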

the ^ matches the start of the new line.
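
Here is a rough Python sketch of the anchors. The trailing $ is an addition to the pattern above, and re.MULTILINE is a Python-specific flag that makes ^ and $ apply to every line rather than only the whole string.

```python
import re

text = "mission successful\nmission unsuccessful"

# With re.MULTILINE, ^ and $ anchor at the start and end of each line.
hits = re.findall(r"^\w+\ssuccessful$", text, re.MULTILINE)
print(hits)  # ['mission successful']
```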

Capturing Groups

using () creates a capturing group which can be referenced afterwards

Basically creates a variable from the match

example to match anything that starts with IMG then a filename made of digits finishing with .png would be

^(IMG\d+\.png)$

This indicates it must be the start of a line, then match IMG followed by one or more digits, then an escaped . and png, and then the end of the line

This would match IMG1.png or IMG12345675443232.png and it will extract the full filename to be used afterwards

to match a filename for example

^(file\w+)\.pdf$ would match anything which starts with file, then one or more word characters, and ends with .pdf, capturing the filename without the extension
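
A Python sketch of pulling the captured text back out with group(1). The filenames are hypothetical, and the $ on the pdf pattern is an assumption added so that the name really has to end in .pdf.

```python
import re

img_pattern = re.compile(r"^(IMG\d+\.png)$")
img_match = img_pattern.match("IMG12345.png")
print(img_match.group(1))  # IMG12345.png

# At least one digit is required, so this one is rejected.
no_digits = img_pattern.match("IMG.png")
print(no_digits)  # None

# Here the group stops before the extension, so only the name is captured.
pdf_pattern = re.compile(r"^(file\w+)\.pdf$")
pdf_match = pdf_pattern.match("file_record_2018.pdf")
print(pdf_match.group(1))  # file_record_2018
```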

 

Capturing a month and year example

Jan 1987

May 1969

Aug 2011

Match one or more word characters, then a whitespace, then one or more digits

Use capturing on both groups so you get the full date and just the year

(\w+\s(\d+))
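
In Python the nested groups come back in the order their opening parentheses appear, so the full date is group 1 and the year is group 2. A minimal sketch:

```python
import re

pattern = re.compile(r"(\w+\s(\d+))")
dates = ["Jan 1987", "May 1969", "Aug 2011"]

# groups() returns (full date, year) for each match.
results = [pattern.search(d).groups() for d in dates]
print(results)
# [('Jan 1987', '1987'), ('May 1969', '1969'), ('Aug 2011', '2011')]
```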

Non-capturing groups are indicated with a ?: at the start

so (https?|ftp)://www\.(\w+)\.com would match the protocol and the domain

Adding a ?: to the start, (?:https?|ftp), still matches but is non-capturing, as I don't want the actual info. It only has parentheses because it contains an or statement.

[^/\r\n]

This matches any single character that is not a forward slash or a line break

So in the case of a full URL: (https?|ftp)://([^/\r\n]+)(/[^\r\n]*)? captures the protocol, the host, and optionally the path
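
To see the difference the ?: makes, here is a Python sketch comparing the captured groups. The URLs are made-up examples.

```python
import re

capturing = re.compile(r"(https?|ftp)://www\.(\w+)\.com")
non_capturing = re.compile(r"(?:https?|ftp)://www\.(\w+)\.com")
url = "https://www.example.com"

with_protocol = capturing.search(url).groups()
print(with_protocol)  # ('https', 'example')

# Same match, but the protocol group is no longer captured.
domain_only = non_capturing.search(url).groups()
print(domain_only)  # ('example',)

# Protocol, host, and an optional path.
full_url = re.compile(r"(https?|ftp)://([^/\r\n]+)(/[^\r\n]*)?")
with_path = full_url.match("ftp://files.example.org/pub/readme.txt").groups()
print(with_path)  # ('ftp', 'files.example.org', '/pub/readme.txt')

# When the optional path group does not take part, it comes back as None.
without_path = full_url.match("http://example.com").groups()
print(without_path)  # ('http', 'example.com', None)
```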

Be Specific

To match one of several alternatives you can use a pipe as an OR statement, enclosed in parentheses

i love (cats|dogs)
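
A tiny Python sketch of the OR group. The third sentence is a made-up near miss.

```python
import re

pattern = re.compile(r"i love (cats|dogs)")
sentences = ["i love cats", "i love dogs", "i love logs"]

loved = []
for sentence in sentences:
    match = pattern.fullmatch(sentence)
    if match:
        # group(1) holds whichever alternative matched.
        loved.append(match.group(1))

print(loved)  # ['cats', 'dogs']
```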

Using a * or a + to match multiple characters

this always follows a character or group so not on its own.

\d* would match any number of digits or \d+ would match at least one digit

a+ would match one or more a’s

[abc]+ would match one or more of a b or c characters

.* would match Zero or more of any character. Not sure yet why this would be a thing. I guess for characters that could appear rather than those that definitely do appear.

To match aaaabcc, aabbbbc and aacc you could use a+b*c+ as there is at least 1 a, so use a +, and the same with c, but in one case there are 0 b’s so you need a star to match 0 or more. [abc]+c works too, but it is looser and would also accept something like cbac.

\d+ matches 1 or more digits
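
The star and plus examples as a Python sketch. The strings "a" and "cbac" are extra made-up cases to show where the patterns differ.

```python
import re

samples = ["aaaabcc", "aabbbbc", "aacc", "a"]

# a+b*c+ needs at least one a and one c; the b's are optional.
strict_hits = [s for s in samples if re.fullmatch(r"a+b*c+", s)]
print(strict_hits)  # ['aaaabcc', 'aabbbbc', 'aacc']

# [abc]+c ignores order, so it also accepts a jumbled string.
loose_accepts_jumble = re.fullmatch(r"[abc]+c", "cbac") is not None
strict_accepts_jumble = re.fullmatch(r"a+b*c+", "cbac") is not None
print(loose_accepts_jumble, strict_accepts_jumble)  # True False

# \d* is happy with an empty string, \d+ is not.
star_on_empty = re.fullmatch(r"\d*", "") is not None
plus_on_empty = re.fullmatch(r"\d+", "") is not None
print(star_on_empty, plus_on_empty)  # True False

numbers = re.findall(r"\d+", "room 12, floor 3")
print(numbers)  # ['12', '3']
```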

Characters which match multiple things

. matches anything including whitespace (except a newline by default)

\w matches A-Z, a-z, 0-9 and the underscore

\d matches any digit 0-9

Meta Characters

\d matches digits

\s matches whitespace

\w matches any English language letter (upper and lower case), digit or underscore

Upper case versions mean the opposite so

\D means anything except for digits

\W means any character that is not a letter, digit or underscore

\S means any non whitespace character

\b matches the boundary between a word and a non word character

\w+\b for example captures the rest of a word, up to the next non word character (whitespace, punctuation or the end of the string)
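
A Python sketch of the meta characters and their upper case opposites, using made-up sample text.

```python
import re

text = "cat 42"
digits = re.findall(r"\d", text)
non_digits = re.findall(r"\D", text)
non_space_runs = re.findall(r"\S+", text)
print(digits)          # ['4', '2']
print(non_digits)      # ['c', 'a', 't', ' ']
print(non_space_runs)  # ['cat', '42']

# The underscore counts as a word character, so \W skips it.
non_word = re.findall(r"\W", "a_b-c d")
print(non_word)  # ['-', ' ']

# \b sits between a word character and a non word character.
words = re.findall(r"\b\w+\b", "stop, look and listen!")
print(words)  # ['stop', 'look', 'and', 'listen']

# The word ends at the comma, not at the next whitespace.
first_word = re.search(r"\w+\b", "hello, world").group()
print(first_word)  # hello
```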

Specific Examples

US phone numbers

To grab the area code from the phone numbers, we can simply capture the first three digits, using the expression (\d{3}).

However, to match the full phone number as well, we can use the expression 1?[\s-]?\(?(\d{3})\)?[\s-]?\d{3}[\s-]?\d{4}. This breaks down into the country code ‘1?’, the captured area code ‘\(?(\d{3})\)?’, and the rest of the digits ‘\d{3}’ and ‘\d{4}’ respectively. We use ‘[\s-]?’ to catch the space or dashes between each component.
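
Here is the full phone expression tried out in Python. The phone numbers are hypothetical samples picked to cover the dash, bracket, space and no-separator formats.

```python
import re

phone = re.compile(r"1?[\s-]?\(?(\d{3})\)?[\s-]?\d{3}[\s-]?\d{4}")
numbers = ["415-555-1234", "(416) 555-3456", "1 202 555 4567", "4035555678"]

# group(1) is the captured area code.
area_codes = [phone.fullmatch(n).group(1) for n in numbers]
print(area_codes)  # ['415', '416', '202', '403']
```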

Matching HTML

  1. Don't do it. Use a proper parsing library, as HTML is not consistent enough
  2. <(\w+) captures the tag name after an opening <
  3. >([\w\s]*)<  captures the text content between tags. Useful for the link text of a tags?
  4. href='([\w://.]*)' finds the link target
  5. ='([\w://.]*)' finds any attribute value
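
As a rough Python sketch of the list above: the regexes work on a simple, made-up snippet, and the same link can be pulled out with html.parser from the standard library, which is one example of a proper parsing library. The LinkCollector class is a hypothetical helper written for this illustration.

```python
import re
from html.parser import HTMLParser

snippet = "<a href='https://example.com/page'>Example link</a>"

tag_names = re.findall(r"<(\w+)", snippet)
tag_text = re.findall(r">([\w\s]*)<", snippet)
hrefs = re.findall(r"href='([\w://.]*)'", snippet)
print(tag_names)  # ['a']
print(tag_text)   # ['Example link']
print(hrefs)      # ['https://example.com/page']


class LinkCollector(HTMLParser):
    """Collects the href of every a tag it sees."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag.
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)


collector = LinkCollector()
collector.feed(snippet)
print(collector.links)  # ['https://example.com/page']
```

The regex versions break as soon as the markup uses double quotes, hyphens in the URL or nested tags, which is the reason for point 1.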