I need one or more regular expressions to match some invalid URLs of a website: those that have uppercase letters before OR after a certain pattern.
These are the structure rules for matching the invalid URLs:
- a defined website
- zero or more uppercase letters (zero is allowed only if there are uppercase letters after the pattern)
- a pattern
- zero or more uppercase letters (zero is allowed only if there are uppercase letters before the pattern)
To be explicit with examples:
http://website/uppeRcase/pattern/upperCase // match it, uppercase before and after pattern
http://otherweb/WhatevercAse/pattern/whatevercase // do not match, no website
http://website/lowercase/pattern/lowercase // do not match, no uppercase before or after pattern
http://website/lowercase/pattern/uppercasE // match it, uppercase after pattern
http://website/Uppercase/pattern/lowercase // match it, uppercase before pattern
http://website/WhatevercAse/asdasd/whatEveRcase // do not match it, no pattern
Thanks in advance for your help!
I’d advise against doing the two things you are describing with a regular expression in one step. Use a URL parsing library to extract the path and hostname components separately. You want to do this for a couple of reasons. There can be some surprising stuff in the host portion of the URL that can throw you off; for instance, the hostname of
otherweb, and should be excluded, even though it begins with
should be excluded, even though the URL has the pattern surrounded by uppercase path components, because the matching region is not part of the path.
is actually the same resource as your first example, but contains escapes that might prevent a regex from noticing it.
Once you’ve extracted and converted the escape sequences of just the path component, though, a regex is probably a great tool to use.
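As a minimal sketch of that two-step approach in Python: the standard library's `urllib.parse.urlsplit` separates the hostname from the path, `unquote` decodes percent-escapes, and only then is a regex applied to the path. The host `website` and the segment `pattern` are placeholders from the question, `is_invalid` is a name I made up, and the tricky URLs mentioned afterwards are my own illustrations, so adapt all of this to your real site.

```python
import re
from urllib.parse import urlsplit, unquote

SITE = "website"  # placeholder host from the question's examples

# An uppercase letter somewhere before "/pattern/" or somewhere after it.
# This is applied to the decoded path only, never to the whole URL.
PATH_RE = re.compile(r"[A-Z].*/pattern/|/pattern/.*[A-Z]")


def is_invalid(url: str) -> bool:
    """Return True if the URL is on SITE and has uppercase around the pattern."""
    parts = urlsplit(url)
    # hostname ignores any userinfo ("user@") and port in the authority part
    if parts.hostname != SITE:
        return False
    # Decode escapes such as %45 ("E") before looking for uppercase letters
    path = unquote(parts.path)
    return PATH_RE.search(path) is not None
```

With this, `http://website/lowercase/pattern/uppercas%45` is reported as invalid because `%45` decodes to `E`, while `http://website@otherweb/Upper/pattern/Upper` (hostname is `otherweb`) and `http://website/lower?next=/Upper/pattern/Upper` (the pattern sits in the query, not the path) are both left alone.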
To match uppercase letters you simply need the character class [A-Z]. Then build the rest of your rules around that. Without knowing exactly what you mean by “website” and “pattern”, it is difficult to give better guidance.
This expression will match if uppercase characters appear both between “website” and “pattern” and after “pattern”
This expression will match if uppercase characters appear on either side
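As a rough sketch of the two cases just described (not necessarily the expressions originally posted), here is one way to write them in Python. It assumes the literal host `website`, the literal path segment `pattern`, and plain `http`, all taken from the question's examples; the names `BOTH` and `EITHER` are mine. Both run against the raw URL string, so they do not decode percent-escapes or handle ports.

```python
import re

# Placeholder prefix: literal host "website", which must be followed by "/"
SITE = r"^http://website(?=/)"

# Uppercase before AND after "/pattern/"; [^?#]* keeps the match inside the path
BOTH = re.compile(SITE + r"[^?#]*[A-Z][^?#]*/pattern/[^?#]*[A-Z]")

# Uppercase before OR after "/pattern/" (this also covers the "both" case)
EITHER = re.compile(
    SITE + r"(?:[^?#]*[A-Z][^?#]*/pattern/|[^?#]*/pattern/[^?#]*[A-Z])"
)

for url in (
    "http://website/uppeRcase/pattern/upperCase",
    "http://website/lowercase/pattern/uppercasE",
    "http://website/lowercase/pattern/lowercase",
):
    print(url, bool(BOTH.search(url)), bool(EITHER.search(url)))
```

For the question as asked, `EITHER` is the one that reproduces the match / do-not-match list; `BOTH` is stricter and only accepts the first example.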
To @TokenMacGuy’s point, regex parsing of URLs can be very tricky. If you want to break the URL into parts and then validate each part, you can start with this expression, which should match and group most* URLs.
*it worked in all my tests, but I can’t claim I was exhaustive.
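One well-known expression of this kind, which may or may not be the one meant above, is the component-splitting regex given in RFC 3986, Appendix B. A small Python sketch using it follows; the function name `split_uri` and the sample URL are my own, and for real code a URL parsing library is still the safer choice.

```python
import re

# Regular expression from RFC 3986, Appendix B. It splits a URI reference
# into scheme, authority, path, query and fragment; it does not validate them.
URI_RE = re.compile(
    r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?"
)


def split_uri(uri: str) -> dict:
    """Return the five URI components; absent components come back as None."""
    m = URI_RE.match(uri)  # every part is optional, so this always matches
    return {
        "scheme": m.group(2),
        "authority": m.group(4),
        "path": m.group(5),
        "query": m.group(7),
        "fragment": m.group(9),
    }


parts = split_uri("http://website/Uppercase/pattern/lowercase?x=1#top")
print(parts["authority"], parts["path"])
```

Once the URL is split this way, the uppercase check only needs to look at the `path` entry, and the website check only at `authority`.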