Regexp to parse URL
This is a good solution, taken from the very popular Python library urllib3:
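import re

_URI_RE = re.compile(
    r"^(?:([a-zA-Z][a-zA-Z0-9+.-]*):)?"  # group 1: scheme
    r"(?://([^\\/?#]*))?"                # group 2: authority
    r"([^?#]*)"                          # group 3: path
    r"(?:\?([^#]*))?"                    # group 4: query
    r"(?:#(.*))?$",                      # group 5: fragment
    re.UNICODE | re.DOTALL,
)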
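As a minimal sketch of how this pattern can be used on its own, the snippet below applies it to a sample URL and unpacks the five capture groups. The helper name split_url and the sample URLs are my own, for illustration only; they are not part of urllib3.

```python
import re

# Pattern copied from the snippet above (originally from urllib3).
_URI_RE = re.compile(
    r"^(?:([a-zA-Z][a-zA-Z0-9+.-]*):)?"
    r"(?://([^\\/?#]*))?"
    r"([^?#]*)"
    r"(?:\?([^#]*))?"
    r"(?:#(.*))?$",
    re.UNICODE | re.DOTALL,
)


def split_url(url):
    """Hypothetical helper: return (scheme, authority, path, query, fragment)."""
    # Every part of the pattern is optional, so match() succeeds on any string;
    # components that are absent come back as None.
    return _URI_RE.match(url).groups()


print(split_url("https://user@example.com:8080/a/b?x=1#top"))
# ('https', 'user@example.com:8080', '/a/b', 'x=1', 'top')
print(split_url("/only/path"))
# (None, None, '/only/path', None, None)
```

Note that the authority comes back as one string (user info, host and port together), which is why a second regexp is needed to get at the hostname.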
To determine the hostname, we need to match the authority (the second capture group of _URI_RE) against another regexp:
_SUBAUTHORITY_PAT = ("^(?:(.*)@)?(%s|%s|%s)(?::([0-9]{0,5}))?$") % (
    _REG_NAME_PAT,
    _IPV4_PAT,
    _IPV6_ADDRZ_PAT,
)
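Here is a rough sketch of splitting an authority into user info, host and port with that pattern. Two things in it are assumptions of mine: split_authority is a hypothetical helper, and _SIMPLE_IPV6_ADDRZ_PAT is a deliberately simplified stand-in for urllib3's real IPv6 pattern (it accepts anything that looks like hex digits, colons and dots inside brackets, so it is much looser than the real one).

```python
import re

# These two are copied from the urllib3 snippets in this answer.
_REG_NAME_PAT = r"(?:[^\[\]%:/?#]|%[a-fA-F0-9]{2})*"
_IPV4_PAT = r"(?:[0-9]{1,3}\.){3}[0-9]{1,3}"

# ASSUMPTION: simplified stand-in, NOT urllib3's actual IPv6 pattern.
_SIMPLE_IPV6_ADDRZ_PAT = r"\[[0-9a-fA-F:.]+\]"

_SUBAUTHORITY_RE = re.compile(
    "^(?:(.*)@)?(%s|%s|%s)(?::([0-9]{0,5}))?$"
    % (_REG_NAME_PAT, _IPV4_PAT, _SIMPLE_IPV6_ADDRZ_PAT)
)


def split_authority(authority):
    """Hypothetical helper: return (userinfo, host, port) or None if no match."""
    match = _SUBAUTHORITY_RE.match(authority)
    if match is None:
        return None
    userinfo, host, port = match.groups()
    # The port group may be missing (None) or empty ("host:"); treat both as no port.
    return userinfo, host, (int(port) if port else None)


print(split_authority("user@example.com:8080"))  # ('user', 'example.com', 8080)
print(split_authority("[::1]:443"))              # (None, '[::1]', 443)
print(split_authority("host:123456"))            # None (port has more than 5 digits)
```

The host is always the second group, whichever of the three alternatives matched. Because the reg-name alternative comes first and allows digits and dots, a dotted IPv4 address is normally captured by it rather than by the IPv4 alternative.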
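The regexps that match a DNS hostname, an IPv4 address or an IPv6 address are below. They have to be defined before _SUBAUTHORITY_PAT, and _IPV6_PAT and _ZONE_ID_PAT are defined elsewhere in urllib3 and are not reproduced here: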
_REG_NAME_PAT = r"(?:[^\[\]%:/?#]|%[a-fA-F0-9]{2})*"
_IPV4_PAT = r"(?:[0-9]{1,3}\.){3}[0-9]{1,3}"
_IPV4_RE = re.compile("^" + _IPV4_PAT + "$")
_IPV6_ADDRZ_PAT = r"\[" + _IPV6_PAT + r"(?:" + _ZONE_ID_PAT + r")?\]"
_IPV6_ADDRZ_RE = re.compile("^" + _IPV6_ADDRZ_PAT + "$")
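One thing worth knowing about _IPV4_PAT is that it only checks the shape of the address (four groups of one to three digits), not the value of each octet. The sketch below shows this; looks_like_ipv4 and is_valid_ipv4 are hypothetical helpers of mine, and the extra range check is my addition, not something taken from urllib3.

```python
import re

# Copied from the snippet above (originally from urllib3).
_IPV4_PAT = r"(?:[0-9]{1,3}\.){3}[0-9]{1,3}"
_IPV4_RE = re.compile("^" + _IPV4_PAT + "$")


def looks_like_ipv4(host):
    """Hypothetical helper: True if host has the dotted-quad shape."""
    return _IPV4_RE.match(host) is not None


def is_valid_ipv4(host):
    """Hypothetical helper: shape check plus a 0-255 range check per octet."""
    if not looks_like_ipv4(host):
        return False
    return all(int(octet) <= 255 for octet in host.split("."))


print(looks_like_ipv4("192.168.0.1"))  # True
print(looks_like_ipv4("999.1.1.1"))    # True (shape only, octets are not range-checked)
print(is_valid_ipv4("999.1.1.1"))      # False
print(looks_like_ipv4("example.com"))  # False
```

If you need strict validation rather than classification, the standard library's ipaddress module is probably a better fit than a regexp.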