Regular Expressions and Patterns
Regular expressions are powerful tools for matching patterns in text.
So how are regular expressions created in JavaScript? There are two ways:
1. Using literal syntax:
var RegularExpression = /pattern/
2. Using the RegExp() constructor, which takes the pattern as a string. This is useful when you need to construct the regular expression dynamically because the pattern is not known ahead of time:
var RegularExpression = new RegExp("pattern");
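As a rough illustration of the two forms, the sketch below builds the same pattern both ways. The wholeWord helper is hypothetical (it is not part of this article) and assumes the word passed in contains no regular expression metacharacters.

```javascript
// Literal form: the pattern is fixed when the script is written.
const literal = /^\d{3}$/;

// Constructor form: the pattern is a string, so the backslash is doubled.
const built = new RegExp("^\\d{3}$");

// Hypothetical helper: builds a case-insensitive whole-word matcher for a
// word that is only known at run time. Assumes `word` has no metacharacters.
function wholeWord(word) {
  return new RegExp("\\b" + word + "\\b", "i");
}

console.log(literal.source === built.source);       // true
console.log(built.test("123"));                     // true
console.log(wholeWord("cat").test("The Cat sat"));  // true
console.log(wholeWord("cat").test("concatenate"));  // false
```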
A pattern passed to RegExp() should be enclosed in quotes, with any backslashes escaped so that special sequences retain their meaning (i.e., "\d" must be written as "\\d").
Example (check input for 5 digit number)
Let’s deconstruct the regular expression used, which checks that a string
contains a valid 5-digit number, and ONLY a 5-digit number:
var re5digit=/^\d{5}$/;
- ^ indicates the beginning of the string. Using a ^ metacharacter requires that the match start at the beginning.
- \d indicates a digit character and the {5} following it means that there must be 5 consecutive digit characters.
- $ indicates the end of the string. Using a $ metacharacter requires that the match end at the end of the string.
Translated to English, this pattern states: "Starting at the beginning
of the string there must be nothing other than 5 digits. There must also be
nothing following those 5 digits."
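A quick way to see what the two anchors contribute is to run the pattern against a few inputs. The sample strings below are made up for illustration.

```javascript
const re5digit = /^\d{5}$/;

console.log(re5digit.test("90210"));    // true
console.log(re5digit.test("9021"));     // false: only 4 digits
console.log(re5digit.test("902101"));   // false: 6 digits
console.log(re5digit.test(" 90210"));   // false: leading space

// Without ^ and $, the pattern only has to find 5 digits somewhere.
const unanchored = /\d{5}/;
console.log(unanchored.test("id 902101x")); // true
```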
Pattern flags (switches)
Flag | Description | Example
i | Ignore the case of characters. | /The/i matches "the", "The" and "tHe"
g | Global search for all occurrences of a pattern. | /ain/g matches both "ain"s in "No pain no gain", instead of just the first.
gi | Global search, ignore case. | /it/gi matches all "it"s in "It is our IT department"
m | Multiline mode. Causes ^ to match the beginning of a line as well as the beginning of the string, and $ to match the end of a line as well as the end of the string. JavaScript 1.5+ only. | /hip$/m matches "hip" in "hip" as well as in "hip\nhop"
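The effect of each flag is easiest to see by running the same pattern with and without it. This is a small sketch that reuses the sample strings from the table above.

```javascript
const text = "No pain no gain";

// Without g, match() stops at the first occurrence.
const first = text.match(/ain/);
console.log(first[0], first.index);   // ain 4

// With g, match() returns every occurrence.
const all = text.match(/ain/g);
console.log(all);                     // ["ain", "ain"]

// g and i combined: all occurrences, regardless of case.
const its = "It is our IT department".match(/it/gi);
console.log(its);                     // ["It", "IT"]

// m lets $ match at the end of a line, not only the end of the string.
console.log(/hip$/.test("hip\nhop"));   // false
console.log(/hip$/m.test("hip\nhop"));  // true
```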
Position Matching
Symbol | Description | Example
^ | Only matches the beginning of a string. | /^The/ matches "The" in "The night" but not in "In The Night"
$ | Only matches the end of a string. | /and$/ matches "and" in "Land" but not in "landing"
\b | Matches any word boundary (the test characters must sit at the beginning or end of a word within the string). | /ly\b/ matches "ly" in "This is really cool."
\B | Matches any non-word boundary. | /\Bor/ matches "or" in "normal" but not in "origami."
(?=pattern) | A positive lookahead. Requires that the following pattern is within the input at that position. The pattern is not included as part of the actual match. | /\d+(?= chapters)/ matches any digits when they are followed by the word "chapters", such as 2 in "2 chapters", though not in "I have 2 kids." (Note that a lookahead placed before \d+, as in /(?=Chapter)\d+/, can never match, because the same position cannot start with both "Chapter" and a digit.)
(?!pattern) | A negative lookahead. Requires that the following pattern is not within the input at that position. The pattern is not included as part of the actual match. | /JavaScript(?! Kit)/ matches any occurrence of the word "JavaScript" except when it's inside the phrase "JavaScript Kit"
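The position symbols can be checked with a few one-liners. This is only a sketch: the lookahead pattern /\d+(?= chapters)/ and the sample sentences are illustrative choices.

```javascript
// ^ anchors the match to the start of the string.
console.log(/^The/.test("The night"));      // true
console.log(/^The/.test("In The Night"));   // false

// \b requires a word boundary, \B requires the absence of one.
const ly = "This is really cool.".match(/ly\b/);
console.log(ly[0], ly.index);               // ly 12
console.log(/\Bor/.test("normal"));         // true
console.log(/\Bor/.test("origami"));        // false

// Positive lookahead: the digits match, " chapters" is only checked.
const count = "We read 2 chapters in 3 days".match(/\d+(?= chapters)/);
console.log(count[0]);                      // 2

// Negative lookahead: skip "JavaScript" when " Kit" follows.
const renamed = "JavaScript Kit and JavaScript".replace(/JavaScript(?! Kit)/g, "JS");
console.log(renamed);                       // JavaScript Kit and JS
```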
Literals
Symbol | Description
Alphanumeric | All alphabetical and numerical characters match themselves literally. So /2 days/ will match "2 days" inside a string.
\0 | Matches the NUL character.
\n | Matches a newline character.
\f | Matches a form feed character.
\r | Matches a carriage return character.
\t | Matches a tab character.
\v | Matches a vertical tab character.
[\b] | Matches a backspace.
\xxx | Matches the ASCII character expressed by the octal number xxx. "\50" matches the left parenthesis character "(".
\xdd | Matches the ASCII character expressed by the hex number dd. "\x28" matches the left parenthesis character "(".
\uxxxx | Matches the Unicode character expressed by the hex number xxxx. "\u00A3" matches "£".
Character Classes
Symbol | Description | Example
[xyz] | Match any one character enclosed in the character set. You may use a hyphen to denote a range. For example, /[a-z]/ matches any lowercase letter in the alphabet, /[0-9]/ any single digit. | /[AN]BC/ matches "ABC" and "NBC" but not "BBC" since the leading “B” is not in the set.
[^xyz] | Match any one character not enclosed in the character set. The caret indicates that none of the characters in the set may match. NOTE: the caret used within a character class is not to be confused with the caret that denotes the beginning of a string. Negation is only performed within the square brackets. | /[^AN]BC/ matches "BBC" but not "ABC" or "NBC".
. | (Dot). Match any character except newline or another Unicode line terminator. | /b.t/ matches "bat", "bit", "bet" and so on.
\w | Match any single alphanumeric character, including the underscore. Equivalent to [a-zA-Z0-9_]. | /\w+/ matches "200" in "200%"
\W | Match any single non-word character. Equivalent to [^a-zA-Z0-9_]. | /\W/ matches "%" in "200%"
\d | Match any single digit. Equivalent to [0-9]. |
\D | Match any single non-digit. Equivalent to [^0-9]. | /\D/ matches "N" in "No 342222"
\s | Match any single whitespace character. Roughly equivalent to [ \t\r\n\v\f] (other Unicode whitespace characters also match). |
\S | Match any single non-whitespace character. Roughly equivalent to [^ \t\r\n\v\f]. |
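The character classes above can be tried out directly. This is a minimal sketch, and the sample strings are only for illustration.

```javascript
// A set matches exactly one character from the list; ^ inside negates it.
console.log(/[AN]BC/.test("NBC"));    // true
console.log(/[AN]BC/.test("BBC"));    // false
console.log(/[^AN]BC/.test("BBC"));   // true

// \w and \d match a single character; add + to take a whole run.
const word = "200%".match(/\w+/);
const nonWord = "200%".match(/\W/);
const digits = "No 342222".match(/\d+/);
console.log(word[0], nonWord[0], digits[0]);  // 200 % 342222

// \s covers spaces, tabs and line breaks alike.
const parts = "a\tb c".split(/\s/);
console.log(parts);                   // ["a", "b", "c"]

// The dot does not match a newline.
console.log(/b.t/.test("bat"));       // true
console.log(/b.t/.test("b\nt"));      // false
```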
Repetition
Symbol | Description | Example
{x} | Match exactly x occurrences of a regular expression. | /\d{5}/ matches 5 digits.
{x,} | Match x or more occurrences of a regular expression. | /\s{2,}/ matches at least 2 whitespace characters.
{x,y} | Match x to y occurrences of a regular expression. | /\d{2,4}/ matches at least 2 but no more than 4 digits.
? | Match zero or one occurrence. Equivalent to {0,1}. "?" can also follow one of the quantifiers *, +, ?, or {} to make the latter non-greedy, so that it matches the minimum number of times instead of the default maximum. For example, using the string "He counted 12345", the expression /\d+/ matches "12345", while /\d+?/ matches just "1", the minimum match. | /a\s?b/ matches "ab" or "a b". /\d{2,4}?/ matches "12" in the string "12345" instead of "1234" due to the "?" at the end of the quantifier.
* | Match zero or more occurrences. Equivalent to {0,}. | /we*/ matches "w" in "why" and "wee" in "between", but nothing in "bad"
+ | Match one or more occurrences. Equivalent to {1,}. | /fe+d/ matches both "fed" and "feed"
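The difference between greedy and non-greedy quantifiers is easier to see side by side. The sketch below reuses the strings from the table; the HTML snippet is an extra illustrative input.

```javascript
const counted = "He counted 12345";

// Greedy: take as many digits as possible.
const greedy = counted.match(/\d+/)[0];
// Non-greedy: take as few as the pattern allows.
const lazy = counted.match(/\d+?/)[0];
console.log(greedy, lazy);            // 12345 1

// The same applies to bounded quantifiers.
const upTo4 = "12345".match(/\d{2,4}/)[0];
const atLeast2 = "12345".match(/\d{2,4}?/)[0];
console.log(upTo4, atLeast2);         // 1234 12

// * allows zero occurrences, so "w" alone is a match.
console.log("why".match(/we*/)[0]);      // w
console.log("between".match(/we*/)[0]);  // wee
console.log("bad".match(/we*/));         // null

// A typical use of the non-greedy form: stop at the first closing tag.
const html = "<b>one</b><b>two</b>";
const greedyTag = html.match(/<b>.*<\/b>/)[0];
const lazyTag = html.match(/<b>.*?<\/b>/)[0];
console.log(greedyTag);               // <b>one</b><b>two</b>
console.log(lazyTag);                 // <b>one</b>
```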
Alternation & Grouping
Symbol | Description | Example
( ) | Groups characters together to create a clause. May be nested. | /(abc)+(def)/ matches one or more occurrences of "abc" followed by one occurrence of "def".
( ) | Apart from grouping characters (see above), parentheses also serve to capture the desired subpattern within a pattern. The values of the subpatterns can then be retrieved using RegExp.$1, RegExp.$2, etc. after the pattern itself is matched or compared. For example, the following matches "2 chapters" in "We read 2 chapters in 3 days", and furthermore isolates the value "2":
var mystring="We read 2 chapters in 3 days"
var needle=/(\d+) chapters/
mystring.match(needle) //matches "2 chapters"
alert(RegExp.$1) //alerts captured subpattern, or "2"
The subpattern can also be back referenced later within the main pattern. See "Back References" below. | The following finds the text "John Doe" and swaps their positions, so it becomes "Doe John":
"John Doe".replace(/(John) (Doe)/, "$2 $1")
(?:x) | Matches x but does not capture it. In other words, no numbered references are created for the items within the parentheses. | /(?:.d){2}/ matches but doesn't capture "cdad".
x(?=y) | Positive lookahead: Matches x only if it's followed by y. Note that y is not included as part of the match, acting only as a required condition. | /George(?= Bush)/ matches "George" in "George Bush" but not in "George Michael" or "George Orwell". /Java(?=Script|Hut)/ matches "Java" in "JavaScript" or "JavaHut" but not in "JavaLand".
x(?!y) | Negative lookahead: Matches x only if it's NOT followed by y. Note that y is not included as part of the match, acting only as a required condition. | /^\d+(?! years)/ matches "5" in "5 days" or "5 oranges", but not in "5 years".
"|" (pipe) | Alternation combines clauses into one regular expression and then matches any of the individual clauses. Similar to an "OR" statement. | /forever|young/ matches "forever" or "young". /(ab)|(cd)|(ef)/ matches and remembers "ab" or "cd" or "ef".
Back references
Symbol | Description
( )\n | "\n" (where n is a number from 1 to 9), when placed in a regular expression pattern after a parenthesized subpattern, back references that subpattern, so the text the subpattern matched is remembered and used as part of the matching. A subpattern is created by surrounding it with parentheses within the pattern. Think of "\n" as a dynamic variable that is replaced with the value of the subpattern it references. For example:
/(hubba)\1/
is equivalent to the pattern /hubbahubba/, as "\1" is replaced with the value of the first subpattern within the pattern, or (hubba), to form the final pattern.
Let's say you want to match any word that occurs twice in a row, such as "hubba hubba." The expression to use would be:
/(\w+)\s+\1/
"\1" is replaced with the value of the first subpattern's match to essentially mean "match any word, followed by whitespace, followed by the same word again".
If there is more than one set of parentheses in the pattern, you would use \2 or \3 to match the desired subpattern, based on the order of the left parenthesis for that subpattern. In the example:
/(a (b (c)))/
"\1" references (a (b (c))), "\2" references (b (c)), and "\3" references (c).
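Capturing groups and back references can be inspected through the array that match() returns, which avoids relying on the RegExp.$1 properties. The following is a sketch only; the \b anchors around the doubled-word pattern are an addition made here so that partial words are not accepted.

```javascript
// Capturing: index 0 is the whole match, index 1 the first group.
const chapters = "We read 2 chapters in 3 days".match(/(\d+) chapters/);
console.log(chapters[0]);   // 2 chapters
console.log(chapters[1]);   // 2

// Non-capturing group: groups for repetition, creates no numbered reference.
const pairs = "cdad".match(/(?:.d){2}/);
console.log(pairs[0], pairs.length);  // cdad 1

// Alternation with the g flag returns every alternative that occurs.
const words = "forever young".match(/forever|young/g);
console.log(words);         // ["forever", "young"]

// Back reference: \1 must repeat exactly what group 1 matched.
const doubled = /\b(\w+)\s+\1\b/;
console.log(doubled.test("hubba hubba"));  // true
console.log(doubled.test("hubba bubba"));  // false

// Groups are numbered by the position of their left parenthesis.
const nested = "a b c".match(/(a (b (c)))/);
console.log(nested[1] + " | " + nested[2] + " | " + nested[3]);  // a b c | b c | c
```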
Regular Expression methods
Method | Description | Example
String.match(regular expression) | Executes a search for a match within a string based on a regular expression. It returns an array of information or null if no match is found. Note: Also updates the $1…$9 properties in the RegExp object. | var oldstring="Peter has 8 dollars and Jane has 15"
newstring=oldstring.match(/\d+/g) //returns the array ["8","15"]
RegExp.exec(string) | Similar to String.match() above in that it returns an array of information or null if no match is found. Unlike String.match(), however, it is called on the regular expression, and the parameter entered should be a string, not a regular expression pattern. | var match = /s(amp)le/i.exec("Sample text") //returns ["Sample","amp"]
String.replace(regular expression, replacement text) | Searches for the portion of the string that matches the regular expression and replaces it with the replacement text. For the "replacement text" parameter, you can use the keywords $1 to $99 to insert values from subpatterns defined within the main pattern. The following finds the text "John Doe" and swaps their positions, so it becomes "Doe John":
var newname="John Doe".replace(/(John) (Doe)/, "$2 $1")
The following characters carry special meaning inside "replacement text": $$ inserts a literal "$", $& inserts the matched substring, $` inserts the portion of the string before the match, $' inserts the portion after the match, and $1 to $99 insert the corresponding captured subpattern. | var oldstring="(304)434-5454"
newstring=oldstring.replace(/[\(\)-]/g, "") //returns "3044345454" (removes "(", ")", and "-")
String.split(string literal or regular expression) | Breaks up a string into an array of substrings based on a regular expression or fixed string. | var oldstring="1,2, 3, 4, 5"
newstring=oldstring.split(/\s*,\s*/) //returns the array ["1","2","3","4","5"]
String.search(regular expression) | Tests for a match in a string. It returns the index of the match, or -1 if not found. Does NOT support global searches (i.e., the "g" flag is ignored). | "Amy and George".search(/george/i) //returns 8
RegExp.test(string) | Tests if the given string matches the RegExp, and returns true if matching, false if not. | var pattern=/george/i
pattern.test("Amy and George") //returns true
var string1="Peter has 8 dollars and Jane has 15"
var parsestring1=string1.match(/\d+/g); //returns the array ["8","15"]
var string2="(304)434-5454"
var parsestring2=string2.replace(/[\(\)-]/g, ""); //returns "3044345454" (removes "(", ")", and "-")
var string3="1,2, 3, 4, 5"
var parsestring3=string3.split(/\s*,\s*/); //returns the array ["1","2","3","4","5"]
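One behaviour the method summaries do not show is that exec() with the "g" flag resumes from where the previous match ended, which lets you collect each match together with its position. A small sketch, reusing the sample sentence from above:

```javascript
const sentence = "Peter has 8 dollars and Jane has 15";
const digits = /\d+/g;

// Each call to exec() continues after the previous match (tracked in
// digits.lastIndex) and returns null once no further match is found.
const found = [];
let hit;
while ((hit = digits.exec(sentence)) !== null) {
  found.push(hit[0] + "@" + hit.index);
}
console.log(found);                         // ["8@10", "15@33"]

// search() returns only the index of the first match.
const janeAt = sentence.search(/jane/i);
console.log(janeAt);                        // 24

// split() accepts a regular expression as the separator.
const pieces = sentence.split(/\s+has\s+/);
console.log(pieces);                        // ["Peter", "8 dollars and Jane", "15"]
```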
Delving deeper, you can use the replace() method to modify, and not simply replace, a substring. This is accomplished by using the $1…$9 properties of the RegExp object. These properties are populated with the portions of the searched string that matched the parts of the search pattern contained within parentheses. The following example illustrates how to use the replace() method to swap the order of first and last names and insert a comma and a space between them:
<SCRIPT language="JavaScript1.2">
var objRegExp = /(\w+)\s(\w+)/;
var strFullName = "Jane Doe";
var strReverseName = strFullName.replace(objRegExp, "$2, $1");
alert(strReverseName) //alerts "Doe, Jane"
</SCRIPT>
The output of this code will be “Doe, Jane”. How this works is
that the pattern in the first parentheses matches “Jane” and this string is
placed in the RegExp.$1 property. The \s (space) character match is not saved
to the RegExp object because it is not in parentheses. The pattern in the
second set of parentheses matches “Doe” and is saved to the RegExp.$2 property.
The String replace() method takes the Regular Expression object as its first
argument and the replacement text as the second argument. The $2 and $1 in the
replacement text are substitution variables that will substitute the contents
of RegExp.$2 and RegExp.$1 in the result string.
You can also use replace() method to strip unwanted characters from a string
before testing the string for validity or before saving the string to a
database. It can be used to add formatting characters for the display of a
string as well. Here is a simple example that uses test() to see if a regular expression matches against a certain string:
var pattern=/php/i
pattern.test("PHP is your friend"); //returns true
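To make the strip, validate and format idea concrete, here is a minimal sketch. The normalizePhone function and the 10-digit phone format are assumptions made for illustration, not something defined in this article.

```javascript
// Hypothetical helper: strips punctuation, validates, then reformats.
// Assumes a 10-digit phone number; returns null for anything else.
function normalizePhone(input) {
  // 1. Strip everything that is not a digit.
  const digitsOnly = input.replace(/\D/g, "");
  // 2. Test the cleaned string for validity.
  if (!/^\d{10}$/.test(digitsOnly)) {
    return null;
  }
  // 3. Add formatting characters for display using captured groups.
  return digitsOnly.replace(/(\d{3})(\d{3})(\d{4})/, "($1) $2-$3");
}

console.log(normalizePhone("(304)434-5454"));  // (304) 434-5454
console.log(normalizePhone("304.434.5454"));   // (304) 434-5454
console.log(normalizePhone("434-5454"));       // null
```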
Sample Usage
Now that you’ve been introduced to regular expressions and
patterns, let’s look at a few examples of common validation and formatting
functions.
Valid Number
A valid number value should contain only an optional minus sign,
followed by digits, followed by an optional dot (.) to signal decimals, and if
it's present, additional digits. A regular expression to do that would look
like this:
var anum=/(^-?\d+$)|(^-?\d+\.\d+$)/
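The same rule can be written more compactly by making the decimal part an optional group. This is a sketch of one equivalent form, assuming at most one leading minus sign is allowed, as the description above states.

```javascript
// Optional minus, digits, then an optional "." followed by more digits.
const anum = /^-?\d+(\.\d+)?$/;

console.log(anum.test("42"));      // true
console.log(anum.test("-3.14"));   // true
console.log(anum.test("--5"));     // false: more than one minus sign
console.log(anum.test("5."));      // false: dot without decimals
console.log(anum.test(".5"));      // false: no digits before the dot
console.log(anum.test("1e3"));     // false: exponent notation not covered
```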
Valid Date Format
A valid short date should consist of a 2-digit month, date
separator, 2-digit day, date separator, and a 4-digit year (e.g. 02/02/2000).
It would be nice to allow the user to use any valid date separator character
that your backend database supported such as slashes, dashes and periods. You
want to be sure the user enters the same date separator character for all
occurrences. The following function returns true or false depending on whether
the user input matches this date format:
function checkdateformat(userinput){
var dateformat = /^\d{1,2}(\-|\/|\.)\d{1,2}\1\d{4}$/
return dateformat.test(userinput) //returns true or false depending on userinput
}
This example uses back referencing to ensure that the second
date separator matches the first one.
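A few sample calls show the back reference at work. The sketch below restates the function with a character class for the separator, which should behave the same way; the inputs are made up. Note that the pattern checks only the format, not whether the date exists.

```javascript
function checkDateFormat(userInput) {
  // ([-\/.]) captures the first separator; \1 demands the same one again.
  const dateFormat = /^\d{1,2}([-\/.])\d{1,2}\1\d{4}$/;
  return dateFormat.test(userInput);
}

console.log(checkDateFormat("02/02/2000"));  // true
console.log(checkDateFormat("2-2-2000"));    // true
console.log(checkDateFormat("02/02-2000"));  // false: mixed separators
console.log(checkDateFormat("02.02.00"));    // false: 2-digit year
console.log(checkDateFormat("99/99/2000"));  // true: format only, not a real date
```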
Replace HTML tags (brackets) with entities instead
User input often must be parsed for security or to ensure
it doesn't mess up the formatting of the page. The most common task is to
take any HTML tags (brackets) entered by the user and replace them with
their entity equivalents. The following function does just that,
replacing "<" and ">" with "&lt;" and "&gt;", respectively:
function htmltoentity(userinput){
var formatted=userinput.replace(/(<)|(>)/g, function(thematch){if (thematch=="<") return "&lt;"; else return "&gt;"})
return formatted //return the converted string to the caller
}
The first parameter of replace() searches for a match for either "<" or ">". The second parameter demonstrates something new and interesting- you can actually use a function instead of a plain replacement text as the parameter. When a function is used, the parameter of it (in this case, "thematch") contains the matched substring and returns what you wish it to be replaced with. Since we're looking to replace both "<" and ">", this function will help us return two different replacement strings accordingly.
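When more than two characters need replacing, the callback can look up the replacement in a table instead of branching. The sketch below is one possible variation; the escapeHtml name and the extra handling of "&" are additions for illustration.

```javascript
// Hypothetical variant: maps each special character to its entity.
function escapeHtml(userInput) {
  const entities = { "&": "&amp;", "<": "&lt;", ">": "&gt;" };
  // The callback receives the matched character and returns its entity.
  return userInput.replace(/[&<>]/g, function (theMatch) {
    return entities[theMatch];
  });
}

const escaped = escapeHtml("<b>Tom & Jerry</b>");
console.log(escaped);  // &lt;b&gt;Tom &amp; Jerry&lt;/b&gt;
```

Handling "&" in the same pass keeps existing ampersands from being mistaken for the start of an entity.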