Regular Expressions in C# (including a new comprehensive email pattern)
Of course C# supports regular expressions. I happen to have learned regular expressions in my dealings with FreeBSD, shell scripting, PHP, and other open source work. So naturally I want to add this as a skill as I develop in C#.
What is a Regular Expression?
This is a method in code or script to describe the format or pattern of a string. For example, look at an email address:
someuser@somedomain.tld
It is important to understand that we are not trying to compare the email string against another string, we are trying to compare the string against a pattern.
Verifying that the email is in the correct format using only String functions would take many function calls running one after another. With a regular expression, the format can be checked in a single function call.
A regular expression is a small language, almost like a scripting language in itself, for defining character patterns.
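To make the difference between comparing against a string and comparing against a pattern concrete, here is a minimal sketch. The class name `PatternVersusString` and the deliberately loose `LoosePattern` are illustrative assumptions, not the pattern developed later in this post.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PatternVersusString
{
    // Illustrative, deliberately loose pattern (an assumption for this sketch):
    // some text, an @, some text, a period, some text.
    public const string LoosePattern = @"^[^@\s]+@[^@\s]+\.[^@\s]+$";

    public static bool LooksLikeEmail(string input)
    {
        return Regex.IsMatch(input, LoosePattern);
    }

    public static void Main()
    {
        string first = "someuser@somedomain.tld";
        string second = "otheruser@otherdomain.tld";

        // String comparison: only succeeds when both values are identical.
        Console.WriteLine(first == second);

        // Pattern comparison: both values have the same shape.
        Console.WriteLine(LooksLikeEmail(first));
        Console.WriteLine(LooksLikeEmail(second));

        // This value does not have the shape at all.
        Console.WriteLine(LooksLikeEmail("not an email"));
    }
}
```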
Most characters represent themselves. However, some characters don’t represent themselves without escaping them with a backslash because they represent something else. Here is a table of those characters.
Expression | Meaning |
---|---|
* | Zero or more of the previous character or character group. |
+ | One or more of the previous character or character group. |
^ | Beginning of line or string. |
$ | End of line or string. |
? | Zero or one of the previous character or character group. |
. | Any single character (except a newline, by default). |
[ … ] | This forms a character class expression |
( … ) | This forms a group of items |
You should look up more regular expression rules. I don’t explain them all here. This is just to give you an idea.
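As a rough illustration of how these special characters behave in .NET, the sketch below runs a few tiny patterns through `Regex.IsMatch`. The class name `MetacharacterDemo` and the sample strings are made up for this example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class MetacharacterDemo
{
    public static bool Matches(string input, string pattern)
    {
        return Regex.IsMatch(input, pattern);
    }

    public static void Main()
    {
        // * allows zero or more of the previous character.
        Console.WriteLine(Matches("ac", "^ab*c$"));
        // + requires at least one of the previous character.
        Console.WriteLine(Matches("ac", "^ab+c$"));
        // ? allows zero or one of the previous character.
        Console.WriteLine(Matches("abc", "^ab?c$"));
        Console.WriteLine(Matches("abbc", "^ab?c$"));
        // . stands for any single character.
        Console.WriteLine(Matches("axc", "^a.c$"));
        // A backslash makes the period represent itself.
        Console.WriteLine(Matches("a.c", @"^a\.c$"));
        Console.WriteLine(Matches("axc", @"^a\.c$"));
        // [ ... ] is a character class, ( ... ) is a group.
        Console.WriteLine(Matches("b7", "^[a-c][0-9]$"));
        Console.WriteLine(Matches("abab", "^(ab)+$"));
    }
}
```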
Example 1 – Parameter=Value
Here is a quick example of a regular expression that matches String=String. At first you might think this is easy and you can use this expression:
.*=.*
While that might work, it is very open. And it allows for zero characters before and after the equals, which should not be allowed.
This next pattern is at least correct but still very open.
.+=.+
What if the first value is limited to only alphanumeric characters?
[a-zA-Z0-9]+=.+
What if the second value has to be a valid Windows file path or URL? And we will make sure we cover start to finish as well.
^[0-9a-zA-Z]+=[^<>|?*"]+$
See how the more restrictions you put in place, the more complex the expression gets?
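To see how each added restriction changes the result, here is a small sketch that runs the loosest, the non-empty, and the strictest pattern from this example against a few inputs. The class name `ParameterValueDemo`, the constant names, and the sample inputs are assumptions made for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ParameterValueDemo
{
    public const string Open = ".*=.*";
    public const string NonEmpty = ".+=.+";
    public const string Strict = "^[0-9a-zA-Z]+=[^<>|?*\"]+$";

    public static bool Check(string pattern, string input)
    {
        return Regex.IsMatch(input, pattern);
    }

    public static void Main()
    {
        string[] inputs = { "=", "name=value", "bad name=value", @"abcd=C:\temp\file.txt", "abcd=what?" };
        foreach (string input in inputs)
        {
            // Print how each pattern judges the same input.
            Console.WriteLine("{0,-24} open={1} nonEmpty={2} strict={3}",
                input, Check(Open, input), Check(NonEmpty, input), Check(Strict, input));
        }
    }
}
```

The lone `=` only passes the open pattern, and `bad name=value` passes everything except the strict pattern because of the space in the parameter name.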
Example 2 – The email address
The pattern of an email is as follows: (Reference: wikipedia)
See updates here: C# – Email Regular Expression
- It will always have a single @ sign
- 1 to 64 characters before the @ sign, called the local-part. It can contain the characters a-z, A-Z, 0-9, ! # $ % & ' * + - / = ? ^ _ ` { | } ~, and . if it is not the first or last character of the local-part.
- Some characters after the @ sign, called the domain, with the following pattern:
- It will always have a period ".".
- One or more characters before the period.
- Two to four characters after the period.
So a simple pattern for an email address should be something like one of these:
- This one just makes sure there are characters before and after the @
.+@.+
- This one makes sure there are characters before and after the @ as well as a character before and after the . in the domain.
.+@.+\..+
- This one makes sure that there is only one @ symbol.
[^@]+@[^@]+\.
These are all quick and easy examples. They will not work in every instance but are usually accurate enough for casual programs.
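The sketch below tries simple patterns like these against three sample addresses. The class name `SimpleEmailPatterns` and the samples are made up for illustration, and `SingleAt` is an assumption: an anchored variant of the single-@ idea written for this sketch, since an unanchored pattern can still find a match inside a longer string.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SimpleEmailPatterns
{
    public const string AnythingAroundAt = @".+@.+";
    public const string DotInDomain = @".+@.+\..+";
    // Assumption: anchored single-@ variant written for this sketch.
    public const string SingleAt = @"^[^@]+@[^@]+\.[^@]+$";

    public static bool Check(string pattern, string email)
    {
        return Regex.IsMatch(email, pattern);
    }

    public static void Main()
    {
        string[] samples = { "joe@home", "joe@home.com", "joe@bob@home.com" };
        foreach (string email in samples)
        {
            // Show which of the three simple patterns accept each sample.
            Console.WriteLine("{0,-18} {1,-6} {2,-6} {3}",
                email,
                Check(AnythingAroundAt, email),
                Check(DotInDomain, email),
                Check(SingleAt, email));
        }
    }
}
```

Only the anchored single-@ variant rejects `joe@bob@home.com`, because `.` in the other two patterns happily matches the extra @.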
But a comprehensive example is much more complex.
- I wrote one myself that is the shortest and gets the best results of any I have found:
^[\w!#$%&'*+\-/=?\^_`{|}~]+(\.[\w!#$%&'*+\-/=?\^_`{|}~]+)*@((([\-\w]+\.)+[a-zA-Z]{2,4})|(([0-9]{1,3}\.){3}[0-9]{1,3}))$
- Here is another complex one I found: [reference]
^(([^<>()[\]\\.,;:\s@\""]+(\.[^<>()[\]\\.,;:\s@\""]+)*)|(\"".+\""))@((\[[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\])|(([a-zA-Z\-0-9]+\.)+[a-zA-Z]{2,}))$
So let me explain the first one that I wrote as it passes my unit tests below:
Expression | Meaning |
---|---|
^ | The start of the string |
[\w!#$%&'*+\-/=?\^_`{|}~]+ | At least one valid local-part character not including a period. |
(\.[\w!#$%&'*+\-/=?\^_`{|}~]+)* | Any number (including zero) of a group that starts with a single period and has at least one valid local-part character after the period. |
@ | The @ character |
( | Start group 1 |
( | Start group 2 |
([\-\w]+\.)+ | At least one group of at least one valid word character or hyphen followed by a period |
[a-zA-Z]{2,4} | Any two to four valid top level domain characters. |
) | End group 2 |
\| | An OR statement: either group 2 or group 3 must match |
( | Start group 3 |
([0-9]{1,3}\.){3}[0-9]{1,3} | A regular expression for an IP Address. |
) | End group 3 |
) | End group 1 |
$ | The end of the string |
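One way to keep a pattern like this readable in code is to assemble it from named pieces that mirror the breakdown above. This is only a sketch: the class name `EmailPatternParts` and the constant names are hypothetical, but the assembled string is the same pattern shown earlier.

```csharp
using System;
using System.Text.RegularExpressions;

public static class EmailPatternParts
{
    // Characters allowed in the local-part, not including a period.
    public const string LocalChars = @"[\w!#$%&'*+\-/=?\^_`{|}~]";

    // One run of local characters, then any number of ".run" groups.
    public const string LocalPart = LocalChars + "+" + @"(\." + LocalChars + "+)*";

    // One or more "label." groups followed by a two to four letter TLD.
    public const string HostName = @"(([\-\w]+\.)+[a-zA-Z]{2,4})";

    // Four groups of one to three digits separated by periods.
    public const string IpAddress = @"(([0-9]{1,3}\.){3}[0-9]{1,3})";

    // Start, local-part, @, host name OR IP address, end.
    public const string Pattern = "^" + LocalPart + "@" + "(" + HostName + "|" + IpAddress + ")" + "$";

    public static void Main()
    {
        Console.WriteLine(Pattern);
        Console.WriteLine(Regex.IsMatch("joe.bob@home.com", Pattern));
        Console.WriteLine(Regex.IsMatch("a@192.168.0.1", Pattern));
        Console.WriteLine(Regex.IsMatch("john..doe@bob.com", Pattern));
    }
}
```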
Code for both examples
Here is code for both examples. My email regular expression is enabled and the one I found online is commented out. To see how they work differently, just comment out mine, and uncomment the one I found online.
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

namespace RegularExpressionsTest
{
    class Program
    {
        static void Main(string[] args)
        {
            // Example 1 - Parameter=value
            // Match any character before and after the =
            // String thePattern = @"^.+=.+$";

            // Match only uppercase and lowercase letters and numbers before
            // the = as a parameter name, and after the = match any
            // character that is allowed in a file's full path.
            //
            // ^[0-9a-zA-Z]+ This is any number of characters, upper or lower
            //               case or 0 through 9, at the string's beginning.
            //
            // =             Matches the = character exactly
            //
            // [^<>|?*\"]+$  This is any character except < > | ? * "
            //               as they are not valid in a file path or URL
            String theNameEqualsValue = @"abcd=http://";
            String theParameterEqualsValuePattern = "^[0-9a-zA-Z]+=[^<>|?*\"]+$";
            bool isParameterEqualsValueMatch = Regex.IsMatch(theNameEqualsValue, theParameterEqualsValuePattern);
            Log(isParameterEqualsValueMatch);

            // Example 2 - Email address formats
            String theEmailPattern = @"^[\w!#$%&'*+\-/=?\^_`{|}~]+(\.[\w!#$%&'*+\-/=?\^_`{|}~]+)*"
                                   + "@"
                                   + @"((([\-\w]+\.)+[a-zA-Z]{2,4})|(([0-9]{1,3}\.){3}[0-9]{1,3}))$";

            // The string pattern from here does not work in all instances.
            // http://www.cambiaresearch.com/c4/bf974b23-484b-41c3-b331-0bd8121d5177/Parsing-Email-Addresses-with-Regular-Expressions.aspx
            //String theEmailPattern = @"^(([^<>()[\]\\.,;:\s@\""]+(\.[^<>()[\]\\.,;:\s@\""]+)*)|(\"".+\""))"
            //                       + "@"
            //                       + @"((\[[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\])"
            //                       + "|"
            //                       + @"(([a-zA-Z\-0-9]+\.)+[a-zA-Z]{2,}))$";

            Console.WriteLine("Bad emails");
            foreach (String email in GetBadEmails())
            {
                Log(Regex.IsMatch(email, theEmailPattern));
            }

            Console.WriteLine("Good emails");
            foreach (String email in GetGoodEmails())
            {
                Log(Regex.IsMatch(email, theEmailPattern));
            }
        }

        private static void Log(bool inValue)
        {
            if (inValue)
            {
                Console.WriteLine("It matches the pattern");
            }
            else
            {
                Console.WriteLine("It doesn't match the pattern");
            }
        }

        private static List<String> GetBadEmails()
        {
            List<String> emails = new List<String>();
            emails.Add("joe"); // should fail
            emails.Add("joe@home"); // should fail
            emails.Add("a@b.c"); // should fail because .c is only one character but must be 2-4 characters
            emails.Add("joe-bob[at]home.com"); // should fail because [at] is not valid
            emails.Add("joe@his.home.place"); // should fail because place is 5 characters but must be 2-4 characters
            emails.Add("joe.@bob.com"); // should fail because there is a dot at the end of the local-part
            emails.Add(".joe@bob.com"); // should fail because there is a dot at the beginning of the local-part
            emails.Add("john..doe@bob.com"); // should fail because there are two dots in the local-part
            emails.Add("john.doe@bob..com"); // should fail because there are two dots in the domain
            emails.Add("joe<>bob@bob.come"); // should fail because <> are not valid
            emails.Add("joe@his.home.com."); // should fail because it can't end with a period
            emails.Add("a@10.1.100.1a"); // should fail because of the extra character
            return emails;
        }

        private static List<String> GetGoodEmails()
        {
            List<String> emails = new List<String>();
            emails.Add("joe@home.org");
            emails.Add("joe@joebob.name");
            emails.Add("joe&bob@bob.com");
            emails.Add("~joe@bob.com");
            emails.Add("joe$@bob.com");
            emails.Add("joe+bob@bob.com");
            emails.Add("o'reilly@there.com");
            emails.Add("joe@home.com");
            emails.Add("joe.bob@home.com");
            emails.Add("joe@his.home.com");
            emails.Add("a@abc.org");
            emails.Add("a@192.168.0.1");
            emails.Add("a@10.1.100.1");
            return emails;
        }
    }
}
The second regex works like a charm. Thanks!
The e-mail pattern gives negative result for ivanov@gk-pik.ru although it is a valid e-mail address.
Sorry for the previous comment. The pattern works. I had extra white spaces at the end of the email address and this is why it was giving negative result.
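The whitespace problem described in this comment can be sketched as follows. The class name `TrimBeforeValidating` and the `IsValid` helper are hypothetical. The pattern is the one from the post, and trimming before matching is just one possible way to handle stray spaces.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TrimBeforeValidating
{
    public const string EmailPattern =
        @"^[\w!#$%&'*+\-/=?\^_`{|}~]+(\.[\w!#$%&'*+\-/=?\^_`{|}~]+)*@((([\-\w]+\.)+[a-zA-Z]{2,4})|(([0-9]{1,3}\.){3}[0-9]{1,3}))$";

    // Hypothetical helper: remove leading and trailing whitespace, then match.
    public static bool IsValid(string email)
    {
        return Regex.IsMatch(email.Trim(), EmailPattern);
    }

    public static void Main()
    {
        string withSpaces = "ivanov@gk-pik.ru  ";

        // The raw value fails because the spaces sit between the TLD and the end of the string.
        Console.WriteLine(Regex.IsMatch(withSpaces, EmailPattern));

        // The trimmed value passes.
        Console.WriteLine(IsValid(withSpaces));
    }
}
```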
I am going to have to watch for the new Top Level Domains (TLDs) as they may have TLDs with more than four characters coming in 2013.
http://newgtlds.icann.org/en/program-status/application-results/strings-1200utc-13jun12-en
I am also using MSDN RegEx, it's working fine checking single-character domain.
Are you replying to Paul, because if so, he commented on the MSDN site and they updated it shortly after his comment to be more accurate.
How come email like 'aпп@dgh.com' is passing validation?
I don't understand, that email should work? What are you asking?
But it contains characters from the Cyrillic alphabet (п). Shouldn't the regex reject such email addresses as invalid, since an email address should only contain the characters a-z?
I tried to use your regex and validate this email address, and it passes, though I expected it would fail. Not sure if this is filter problem or regex implementation problem (tried on Windows Phone 7).
Ahh...I didn't notice the characters were Cyrillic.
RFC 6530 says Unicode characters are allowed now.
http://tools.ietf.org/html/rfc6530
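The reason the Cyrillic address passes is that `\w` in .NET matches Unicode letters by default, not just a-z. If ASCII-only addresses are wanted, one possible approach (a sketch, not something the post itself does) is to swap `\w` for an explicit ASCII range. The class name `AsciiOnlyEmail` and the `AsciiPattern` field are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AsciiOnlyEmail
{
    public const string EmailPattern =
        @"^[\w!#$%&'*+\-/=?\^_`{|}~]+(\.[\w!#$%&'*+\-/=?\^_`{|}~]+)*@((([\-\w]+\.)+[a-zA-Z]{2,4})|(([0-9]{1,3}\.){3}[0-9]{1,3}))$";

    // Assumption for this sketch: every \w sits inside a character class,
    // so replacing it with an explicit ASCII range keeps the pattern valid.
    public static readonly string AsciiPattern = EmailPattern.Replace(@"\w", "a-zA-Z0-9_");

    public static void Main()
    {
        // "a" followed by two Cyrillic "pe" letters (U+043F).
        string cyrillic = "a\u043F\u043F@dgh.com";

        // Default \w accepts Unicode letters, so this matches.
        Console.WriteLine(Regex.IsMatch(cyrillic, EmailPattern));

        // The ASCII-only variant rejects it.
        Console.WriteLine(Regex.IsMatch(cyrillic, AsciiPattern));

        // Plain ASCII addresses still pass.
        Console.WriteLine(Regex.IsMatch("joe@home.com", AsciiPattern));
    }
}
```

.NET also offers `RegexOptions.ECMAScript`, which narrows `\w` to `[a-zA-Z_0-9]`. Whether either restriction is desirable depends on whether internationalized addresses should be accepted.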
Domain names do not start with a hyphen, so your regex may not reject that?
Also \w contains _ so you don't need to mention underscore again. Also this allows domain names to contain _ in your logic. ( if that is what you wanted )
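To illustrate the point about hyphens and underscores in the domain, here is a rough sketch of a stricter domain label. The `Label` rule (letters and digits, with hyphens only in the middle) and the class name `StricterDomain` are assumptions made for this example, not part of the original pattern.

```csharp
using System;
using System.Text.RegularExpressions;

public static class StricterDomain
{
    public const string Original =
        @"^[\w!#$%&'*+\-/=?\^_`{|}~]+(\.[\w!#$%&'*+\-/=?\^_`{|}~]+)*@((([\-\w]+\.)+[a-zA-Z]{2,4})|(([0-9]{1,3}\.){3}[0-9]{1,3}))$";

    public const string LocalPart =
        @"[\w!#$%&'*+\-/=?\^_`{|}~]+(\.[\w!#$%&'*+\-/=?\^_`{|}~]+)*";

    // Assumed label rule: starts and ends with a letter or digit,
    // hyphens only in between, no underscore.
    public const string Label = @"[a-zA-Z0-9]([a-zA-Z0-9\-]*[a-zA-Z0-9])?";

    public const string Strict =
        "^" + LocalPart + "@(((" + Label + @"\.)+[a-zA-Z]{2,4})|(([0-9]{1,3}\.){3}[0-9]{1,3}))$";

    public static void Main()
    {
        string[] samples = { "joe@-bob.com", "joe@bob_smith.com", "ivanov@gk-pik.ru", "a@10.1.100.1" };
        foreach (string email in samples)
        {
            // Compare the original pattern with the stricter sketch.
            Console.WriteLine("{0,-20} original={1} strict={2}",
                email, Regex.IsMatch(email, Original), Regex.IsMatch(email, Strict));
        }
    }
}
```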
Great Post. But I think there's one mistake in the validation of the top level domain portion of the email. As written, it limits the TLD to 1-3 characters. But, there are TLD's that are more than 3 characters (museum and info just to name two, for a full list see http://data.iana.org/TLD/tlds-alpha-by-domain.txt). As of now, I don't think there are any 1 character TLD's, but I'm not sure if this is limited by specification, or just custom. A safer test might be {2,}.
Sorry, let me revise that slightly: it's currently {2,4}, not {1,3} as I stated, but the overall comment still stands. I'd still recommend {2,} for the TLD check.
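The effect of the suggested `{2,}` change can be sketched like this. The class name `LongerTlds` and the `Relaxed` field are hypothetical, and `joe@example.museum` is just a sample address.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LongerTlds
{
    public const string Original =
        @"^[\w!#$%&'*+\-/=?\^_`{|}~]+(\.[\w!#$%&'*+\-/=?\^_`{|}~]+)*@((([\-\w]+\.)+[a-zA-Z]{2,4})|(([0-9]{1,3}\.){3}[0-9]{1,3}))$";

    // Same pattern with the TLD length relaxed from {2,4} to {2,}.
    public static readonly string Relaxed = Original.Replace("{2,4}", "{2,}");

    public static void Main()
    {
        // A six-letter TLD fails the original and passes the relaxed pattern.
        Console.WriteLine(Regex.IsMatch("joe@example.museum", Original));
        Console.WriteLine(Regex.IsMatch("joe@example.museum", Relaxed));

        // A one-letter TLD is still rejected by both.
        Console.WriteLine(Regex.IsMatch("a@b.c", Relaxed));
    }
}
```

Note that with the relaxed pattern, `joe@his.home.place` from the bad email list above would start to pass, so that test case would need to move.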
You can make your regular expression even shorter by using zero-width negative assertions to ensure that the email address does not begin with period, and that the local part does not end with a period:
^(?!\.)[\.\w!#$%&'*+\-/=?\^_`{|}~]{1,64}(?<!\.)@((([\-\w]+\.)+[a-zA-Z]{2,4})|(([0-9]{1,3}\.){3}[0-9]{1,3}))$
Note that this will also allow two or more consecutive periods in the local part, which your regular expression does not. I'm not sure if consecutive periods in the local part are valid, and I'm too lazy to look it up right now 😉
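A quick sketch comparing the two patterns on the period cases mentioned here. The class name `LookaroundComparison` is made up for this example, and both patterns are copied from this page.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LookaroundComparison
{
    public const string Original =
        @"^[\w!#$%&'*+\-/=?\^_`{|}~]+(\.[\w!#$%&'*+\-/=?\^_`{|}~]+)*@((([\-\w]+\.)+[a-zA-Z]{2,4})|(([0-9]{1,3}\.){3}[0-9]{1,3}))$";

    // Variant using a negative lookahead and a negative lookbehind.
    public const string Lookaround =
        @"^(?!\.)[\.\w!#$%&'*+\-/=?\^_`{|}~]{1,64}(?<!\.)@((([\-\w]+\.)+[a-zA-Z]{2,4})|(([0-9]{1,3}\.){3}[0-9]{1,3}))$";

    public static void Main()
    {
        string[] samples = { ".joe@bob.com", "joe.@bob.com", "john..doe@bob.com", "joe.bob@home.com" };
        foreach (string email in samples)
        {
            // Both reject leading and trailing periods; only the original
            // rejects consecutive periods in the local-part.
            Console.WriteLine("{0,-20} original={1} lookaround={2}",
                email, Regex.IsMatch(email, Original), Regex.IsMatch(email, Lookaround));
        }
    }
}
```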
Consecutive periods are not allowed. If starting with a period and ending with a period in the localpart are bad, I should definitely add these to the test.
Oops...they are already there.
Thanks for posting this. I was using the one published by MSDN: (http://msdn.microsoft.com/en-us/library/01escwtf.aspx) which would not allow a single-character subdomain such as foo@a.com. Yours appears to work correctly.
Glad this worked for you. I tried to write the best regular expression possible.