Getting parts of a URL (Regex)

132

Given the URL (single line):
http://test.example.com/dir/subdir/file.html

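How can I extract the following parts using regular expressions:

The subdomain (test)
The domain (example.com)
The path without the file (/dir/subdir/)
The file (file.html)
The path with the file (/dir/subdir/file.html)
The URL without the path (http://test.example.com)
(add any other that you think would be useful)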

The regex should work correctly even if I enter the following URL:

http://example.example.com/example/example/example.html

regex language-agnostic url

— pek

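This is not a direct answer, but most web libraries have a function that accomplishes this task. The function is often called something similar to CrackUrl. If such a function exists, use it; it is almost guaranteed to be more reliable and more efficient than any hand-crafted code.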
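
As one concrete instance of such a library function, here is a minimal sketch using Python's standard urllib.parse.urlsplit on the URL from the question. The variable names are illustrative, and splitting the path on its last slash is an assumption about what counts as "the file".

```python
from urllib.parse import urlsplit

url = "http://test.example.com/dir/subdir/file.html"
parts = urlsplit(url)

host = parts.hostname   # host name without any port
path = parts.path       # full path, including the file

# Assumption: the "file" is whatever follows the last slash of the path.
directory, _, filename = path.rpartition("/")
directory += "/"

# URL without the path: scheme plus authority.
base = f"{parts.scheme}://{parts.netloc}"

print(host, directory, filename, base)
```

Splitting the host into subdomain and domain is not covered by urlsplit; that part still needs separate handling.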

— Konrad Rudolph

7

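Please explain to us why this needs to be done with a regex. If it's homework, then say so, because that's your constraint. Otherwise, there are better language-specific solutions than using a regex.

— Andy Lester

1

The links to the first and last samples are broken.

— the Tin Man

Here you can find how to extract the scheme, domain, TLD, port and query path: stackoverflow.com/questions/9760588/...

— Paolo Rovelli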

151

A single regex to parse and break up a full URL, including query parameters and anchors, e.g.

https://www.google.com/dir/1/2/search.html?arg=0-a&arg1=1-b&arg3-c#hash

^((http[s]?|ftp):\/)?\/?([^:\/\s]+)((\/\w+)*\/)([\w\-\.]+[^#?\s]+)(.*)?(#[\w\-]+)?$

RegEx positions:

url: RegExp['$&'],
protocol: RegExp.$2,
host: RegExp.$3,
path: RegExp.$4,
file: RegExp.$6,
query: RegExp.$7,
hash: RegExp.$8

You could then further parse the host ('.' delimited) quite easily.

What I would do is use something like this:

/*
    ^(.*:)//([A-Za-z0-9\-\.]+)(:[0-9]+)?(.*)$
*/
proto $1
host $2
port $3
the-rest $4

Then further parse "the rest" to be as specific as possible. Doing it all in one regex is, well, a bit crazy.
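
A rough sketch of that two-step approach in Python, reusing the simpler pattern from the comment block above. The function name split_url and the keys of the returned dictionary are made up for illustration; only the regex comes from the answer.

```python
import re

# The simple pattern from the answer: protocol, host, optional port, the rest.
SIMPLE_URL = re.compile(r"^(.*:)//([A-Za-z0-9\-\.]+)(:[0-9]+)?(.*)$")

def split_url(url):
    """Return the four coarse parts plus the dot-separated host labels."""
    m = SIMPLE_URL.match(url)
    if m is None:
        return None
    proto, host, port, rest = m.groups()
    return {
        "proto": proto,
        "host": host,
        "port": port,              # includes the leading colon, or None
        "rest": rest,              # parse this further as needed
        "host_labels": host.split("."),
    }

result = split_url("http://test.example.com:8080/dir/subdir/file.html")
print(result)
```

Keeping "the rest" as one string leaves room to apply a second, more specific pattern to the path, query and fragment.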

— hometoast

4

Link: codesnippets.joyent.com/posts/show/523 (as of Oct '10)

— W3Max

19

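The problem is this part: (.*)? Since the Kleene star already accepts 0 or more, the ? part (0 or 1) is confusing it. I fixed it by changing (.*)? to (.+)?. You could also just remove the ?

— rossipedia, 2010

3

Hi Dve, I've improved it a little more to extract example.com from URLs like http://www.example.com:8080/.... Here it is: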

^((http[s]?|ftp):\/\/)?\/?([^\/\.]+\.)*?([^\/\.]+\.[^:\/\s\.]{2,3}(\.[^:\/\s\.]{2,3})?(:\d+)?)($|\/)([^#?\s]+)?(.*?)?(#[\w\-]+)?$

— mnacos

4

And as proof that no regex is perfect, here is an immediate correction:

^((http[s]?|ftp):\/\/)?\/?([^\/\.]+\.)*?([^\/\.]+\.[^:\/\s\.]{2,3}(\.[^:\/\s\.]{2,3})?)(:\d+)?($|\/)([^#?\s]+)?(.*?)?(#[\w\-]+)?$

— mnacos, 2012

2

I modified this regex to identify all parts of a URL (improved version), code in Python:

^((?P<scheme>[^:/?#]+):(?=//))?(//)?(((?P<login>[^:]+)(?::(?P<password>[^@]+)?)?@)?(?P<host>[^@/?#:]*)(?::(?P<port>\d+)?)?)?(?P<path>[^?#]*)(\?(?P<query>[^#]*))?(#(?P<fragment>.*))?

You can see this code in action at pythex.org
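
For illustration, a small sketch of how that pattern could be used with Python's re module. The pattern is copied from the comment above and only split across lines; the sample URL and the name URL_RE are assumptions added here.

```python
import re

# Same pattern as in the comment, split over several raw strings for readability.
URL_RE = re.compile(
    r"^((?P<scheme>[^:/?#]+):(?=//))?(//)?"
    r"(((?P<login>[^:]+)(?::(?P<password>[^@]+)?)?@)?"
    r"(?P<host>[^@/?#:]*)(?::(?P<port>\d+)?)?)?"
    r"(?P<path>[^?#]*)(\?(?P<query>[^#]*))?(#(?P<fragment>.*))?"
)

m = URL_RE.match("http://user:pw@www.example.com:8080/a/b.html?x=1#top")
parts = m.groupdict()   # named groups as a dict

for name, value in parts.items():
    print(f"{name}: {value}")
```

Named groups make the result self-describing, which avoids having to remember group numbers.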

— arannasousa

81

I realize I'm late to the party, but there is a simple way to let the browser parse a URL for you without a regex:

var a = document.createElement('a');
a.href = 'http://www.example.com:123/foo/bar.html?fox=trot#foo';

['href','protocol','host','hostname','port','pathname','search','hash'].forEach(function(k) {
    console.log(k+':', a[k]);
});

/*//Output:
href: http://www.example.com:123/foo/bar.html?fox=trot#foo
protocol: http:
host: www.example.com:123
hostname: www.example.com
port: 123
pathname: /foo/bar.html
search: ?fox=trot
hash: #foo
*/

— Rob

9

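Given that the original question was tagged "language-agnostic", what language is this?

— MarkHu, 2014

Note that this solution requires a protocol prefix, such as http://, to be present for the protocol, host and hostname properties to come out correctly. Otherwise, the beginning of the URL up to the first slash goes into the protocol property.

— Oleksii Aza, 2014

I believe this, though simple, is much slower than regex parsing.

— demisx

Is it supported by all browsers?

— Sean

1

If we are going this way, you can also do var url = new URL(someUrl)

— gman, 2015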

67

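I'm a few years late to the party, but I'm surprised no one has mentioned that the Uniform Resource Identifier specification has a section on parsing URIs with a regular expression. The regular expression, written by Berners-Lee et al., is: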

^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?
 12            3  4          5       6  7        8 9
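The numbers in the second line above are only to assist readability; they indicate the reference points for each subexpression (i.e., each paired parenthesis). We refer to the value matched for subexpression <n> as $<n>. For example, matching the above expression to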

http://www.ics.uci.edu/pub/ietf/uri/#Related

results in the following subexpression matches:
$1 = http:
$2 = http
$3 = //www.ics.uci.edu
$4 = www.ics.uci.edu
$5 = /pub/ietf/uri/
$6 = <undefined>
$7 = <undefined>
$8 = #Related
$9 = Related

For what it's worth, I found that I had to escape the forward slashes in JavaScript:

^(([^:\/?#]+):)?(\/\/([^\/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?
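
As a quick check outside JavaScript, the same RFC 3986 expression can be tried in Python, where the forward slashes need no escaping. This is only a sketch; the names RFC3986 and the unpacked variables are chosen here for illustration.

```python
import re

# The regular expression from RFC 3986, unchanged.
RFC3986 = re.compile(r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?")

m = RFC3986.match("http://www.ics.uci.edu/pub/ietf/uri/#Related")

# Groups 2, 4, 5, 7 and 9 are the components without their delimiters.
scheme, authority, path, query, fragment = m.group(2, 4, 5, 7, 9)

print(scheme, authority, path, query, fragment)
```

Groups that did not take part in the match (the query here) come back as None, which corresponds to <undefined> in the listing above.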

— w

4

Great answer! You can't go wrong picking something from an RFC.

— frankster

1

This does not parse the query parameters.

— Rémy DAVID

2

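This is the best one as far as I can tell. Specifically, it solves two problems I have encountered with the others: 1: It deals correctly with other protocols, such as ftp:// and mailto://. 2: It deals correctly with username and password. These optional fields are separated by a colon, just like hostname and port, and that trips up most other regexes I have seen. @RémyDAVID The query string is not parsed by the browser's location object either. If you need to parse the query string, have a look at my tiny library for that: uqs.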
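
For readers not working in a browser, here is a hedged sketch of the same idea in Python: the URL is split first and the query string is parsed as a separate step. It uses the standard urllib.parse module rather than the uqs library mentioned above, and the sample URL is invented.

```python
from urllib.parse import urlsplit, parse_qs

url = "http://www.example.com:123/foo/bar.html?fox=trot&fox=walk&x=1#foo"

# Step 1: isolate the raw query string (without the leading "?").
query = urlsplit(url).query

# Step 2: parse it; repeated keys are collected into lists.
params = parse_qs(query)

print(query)
print(params)
```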

— Stijn de Witt

2

This answer deserves more upvotes because it covers pretty much all protocols.

— 天镇林

1

It breaks when the protocol is an implied HTTP with a username/password (an esoteric and technically invalid syntax, I admit), e.g. user:pass@example.com. RFC 3986 says:

A path segment that contains a colon character (e.g., "this:that") cannot be used as the first segment of a relative-path reference, as it would be mistaken for a scheme name. Such a segment must be preceded by a dot-segment (e.g., "./this:that") to make a relative-path reference.
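
The effect described in this comment can be reproduced with a short Python sketch. The regex is the RFC 3986 expression; the variable names are illustrative.

```python
import re

RFC3986 = re.compile(r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?")

# No scheme and no "//": the text before the first colon is taken as the scheme.
m = RFC3986.match("user:pass@example.com")
scheme, authority, path = m.group(2), m.group(4), m.group(5)

print(scheme, authority, path)
```

The user name ends up in the scheme group and everything after the colon in the path, which is exactly the ambiguity the RFC text warns about.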

— Matt Chambers

33

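I found that the highest-voted answer (hometoast's answer) doesn't work perfectly for me. Two problems:

It cannot handle port numbers.
The hash part is broken.

The following is a modified version: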

^((http[s]?|ftp):\/)?\/?([^:\/\s]+)(:([^\/]*))?((\/\w+)*\/)([\w\-\.]+[^#?\s]+)(\?([^#]*))?(#(.*))?$

The positions of the parts are as follows:

int SCHEMA = 2, DOMAIN = 3, PORT = 5, PATH = 6, FILE = 8, QUERYSTRING = 9, HASH = 12

Edit posted by an anonymous user:

function getFileName(path) {
    return path.match(/^((http[s]?|ftp):\/)?\/?([^:\/\s]+)(:([^\/]*))?((\/[\w\/-]+)*\/)([\w\-\.]+[^#?\s]+)(\?([^#]*))?(#(.*))?$/i)[8];
}

— mingfai

1

Beware that it doesn't work if the URL has no path after the domain, e.g. http://www.example.com, or if the path is a single character, like http://www.example.com/a.
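
A small Python sketch that exercises the modified pattern on these cases. The helper name file_part is hypothetical; the regex is the modified version from the answer, and group 8 is its FILE position.

```python
import re

# The modified pattern from the answer above, unchanged.
MODIFIED = re.compile(
    r"^((http[s]?|ftp):\/)?\/?([^:\/\s]+)(:([^\/]*))?"
    r"((\/\w+)*\/)([\w\-\.]+[^#?\s]+)(\?([^#]*))?(#(.*))?$"
)

def file_part(url):
    """Return the FILE group (8), or None when the pattern does not match."""
    m = MODIFIED.match(url)
    return m.group(8) if m else None

print(file_part("http://www.example.com/dir/file.html"))  # a URL with a path and file
print(file_part("http://www.example.com"))                # no path at all
print(file_part("http://www.example.com/a"))              # single-character path
```

The pattern requires a slash after the host and at least two characters in the file part, so the last two calls find no match.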

— Fernando Correia

11

I needed a regex to match all URLs and came up with this one:

/(?:([^\:]*)\:\/\/)?(?:([^\:\@]*)(?:\:([^\@]*))?\@)?(?:([^\/\:]*)\.(?=[^\.\/\:]*\.[^\.\/\:]*))?([^\.\/\:]*)(?:\.([^\/\.\:]*))?(?:\:([0-9]*))?(\/[^\?#]*(?=.*?\/)\/)?([^\?#]*)?(?:\?([^#]*))?(?:#(.*))?/

It matches all URLs, with any protocol, even URLs like

ftp://user:pass@www.cs.server.com:8080/dir1/dir2/file.php?param1=value1#hashtag

The result (in JavaScript) looks like this:

["ftp", "user", "pass", "www.cs", "server", "com", "8080", "/dir1/dir2/", "file.php", "param1=value1", "hashtag"]

A URL like

mailto://admin@www.cs.server.com

would look like this:

["mailto", "admin", undefined, "www.cs", "server", "com", undefined, undefined, undefined, undefined, undefined]

— baadf00d

3

If you want to match the whole domain / IP address (not separated by dots), use this one:

/(?:([^\:]*)\:\/\/)?(?:([^\:\@]*)(?:\:([^\@]*))?\@)?(?:([^\/\:]*))?(?:\:([0-9]*))?\/(\/[^\?#]*(?=.*?\/)\/)?([^\?#]*)?(?:\?([^#]*))?(?:#(.*))?/

— lepe, 2016

11

I was trying to solve this in JavaScript, where it should be handled by:

var url = new URL('http://a:b@example.com:890/path/wah@t/foo.js?foo=bar&bingobang=&king=kong@kong.com#foobar/bing/bo@ng?bang');

since (in Chrome, at least) it parses to:

{
  "hash": "#foobar/bing/bo@ng?bang",
  "search": "?foo=bar&bingobang=&king=kong@kong.com",
  "pathname": "/path/wah@t/foo.js",
  "port": "890",
  "hostname": "example.com",
  "host": "example.com:890",
  "password": "b",
  "username": "a",
  "protocol": "http:",
  "origin": "http://example.com:890",
  "href": "http://a:b@example.com:890/path/wah@t/foo.js?foo=bar&bingobang=&king=kong@kong.com#foobar/bing/bo@ng?bang"
}

However, this isn't cross-browser (https://developer.mozilla.org/en-US/docs/Web/API/URL), so I cobbled this together to pull out the same parts as above:

^(?:(?:(([^:\/#\?]+:)?(?:(?:\/\/)(?:(?:(?:([^:@\/#\?]+)(?:\:([^:@\/#\?]*))?)@)?(([^:\/#\?\]\[]+|\[[^\/\]@#?]+\])(?:\:([0-9]+))?))?)?)?((?:\/?(?:[^\/\?#]+\/+)*)(?:[^\?#]*)))?(\?[^#]+)?)(#.*)?

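Credit for this regex goes to https://gist.github.com/rpflorence, who posted this jsperf http://jsperf.com/url-parsing (originally found here: https://gist.github.com/jlong/2428561#comment-310066) and came up with the regex this was originally based on.

The parts are in this order: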

var keys = [
    "href",                    // http://user:pass@host.com:81/directory/file.ext?query=1#anchor
    "origin",                  // http://user:pass@host.com:81
    "protocol",                // http:
    "username",                // user
    "password",                // pass
    "host",                    // host.com:81
    "hostname",                // host.com
    "port",                    // 81
    "pathname",                // /directory/file.ext
    "search",                  // ?query=1
    "hash"                     // #anchor
];

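There is also a small library which wraps it and provides query parameters:

https://github.com/sadams/lite-url (also available on bower)

If you have an improvement, please create a pull request with more tests and I will accept and merge it with thanks.

— Sam Adams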

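This is great, but it could really do with a version that pulls out subdomains instead of the duplicated host and hostname. So if I had http://test1.dev.mydomain.com/, for example, it would pull out test1.dev.

— Lankymart, 2014

This works very well. I've been looking for a way to extract unusual authentication parameters from URLs, and this does it beautifully.

— Aaron M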

6

Proposing a much more readable solution (in Python, but it applies to any regex):

import re

def url_path_to_dict(path):
    # One raw-string fragment per URL component, each in a named group.
    pattern = (r'^'
               r'((?P<schema>.+?)://)?'
               r'((?P<user>.+?)(:(?P<password>.*?))?@)?'
               r'(?P<host>.*?)'
               r'(:(?P<port>\d+?))?'
               r'(?P<path>/.*?)?'
               r'(?P<query>[?].*?)?'
               r'$'
               )
    regex = re.compile(pattern)
    m = regex.match(path)
    d = m.groupdict() if m is not None else None

    return d

def main():
    print(url_path_to_dict('http://example.example.com/example/example/example.html'))

if __name__ == '__main__':
    main()

Prints:

{
'host': 'example.example.com', 
'user': None, 
'path': '/example/example/example.html', 
'query': None, 
'password': None, 
'port': None, 
'schema': 'http'
}

— okigan

5

Subdomains and domains are difficult because the subdomain can have several parts, as can the top-level domain: http://sub1.sub2.domain.co.uk/

 the path without the file : http://[^/]+/((?:[^/]+/)*(?:[^/]+$)?)  
 the file : http://[^/]+/(?:[^/]+/)*((?:[^/.]+\.)+[^/.]+)$  
 the path with the file : http://[^/]+/(.*)  
 the URL without the path : (http://[^/]+/)

(Markdown isn't very friendly to regexes)

— tgmdbm

2

Very useful. I added an extra (http(s?)://[^/]+/) to also grab https.

— Mojowen, 2013

5

This improved version should work as reliably as a parser.

   // Applies to URI, not just URL or URN:
   //    http://en.wikipedia.org/wiki/Uniform_Resource_Identifier#Relationship_to_URL_and_URN
   //
   // http://labs.apache.org/webarch/uri/rfc/rfc3986.html#regexp
   //
   // (?:([^:/?#]+):)?(?://([^/?#]*))?([^?#]*)(?:\?([^#]*))?(?:#(.*))?
   //
   // http://en.wikipedia.org/wiki/URI_scheme#Generic_syntax
   //
   // $@ matches the entire uri
   // $1 matches scheme (ftp, http, mailto, mshelp, ymsgr, etc)
   // $2 matches authority (host, user:pwd@host, etc)
   // $3 matches path
   // $4 matches query (http GET REST api, etc)
   // $5 matches fragment (html anchor, etc)
   //
   // Match specific schemes, non-optional authority, disallow white-space so can delimit in text, and allow 'www.' w/o scheme
   // Note the schemes must match ^[^\s|:/?#]+(?:\|[^\s|:/?#]+)*$
   //
   // (?:()(www\.[^\s/?#]+\.[^\s/?#]+)|(schemes)://([^\s/?#]*))([^\s?#]*)(?:\?([^\s#]*))?(#(\S*))?
   //
   // Validate the authority with an orthogonal RegExp, so the RegExp above won’t fail to match any valid urls.
   function uriRegExp( flags, schemes/* = null*/, noSubMatches/* = false*/ )
   {
      if( !schemes )
         schemes = '[^\\s:\/?#]+'
      else if( !RegExp( /^[^\s|:\/?#]+(?:\|[^\s|:\/?#]+)*$/ ).test( schemes ) )
         throw TypeError( 'expected URI schemes' )
      return noSubMatches ? new RegExp( '(?:www\\.[^\\s/?#]+\\.[^\\s/?#]+|' + schemes + '://[^\\s/?#]*)[^\\s?#]*(?:\\?[^\\s#]*)?(?:#\\S*)?', flags ) :
         new RegExp( '(?:()(www\\.[^\\s/?#]+\\.[^\\s/?#]+)|(' + schemes + ')://([^\\s/?#]*))([^\\s?#]*)(?:\\?([^\\s#]*))?(?:#(\\S*))?', flags )
   }

   // http://en.wikipedia.org/wiki/URI_scheme#Official_IANA-registered_schemes
   function uriSchemesRegExp()
   {
      return 'about|callto|ftp|gtalk|http|https|irc|ircs|javascript|mailto|mshelp|sftp|ssh|steam|tel|view-source|ymsgr'
   }

— Shelby Moore

5

Try the following:

^((ht|f)tp(s?)\:\/\/|~/|/)?([\w]+:\w+@)?([a-zA-Z]{1}([\w\-]+\.)+([\w]{2,5}))(:[\d]{1,5})?((/?\w+/)+|/?)(\w+\.[\w]{3,4})?((\?\w+=\w+)?(&\w+=\w+)*)?

It supports HTTP / FTP, subdomains, folders, files, etc.

I found it with a quick Google search:

http://geekswithblogs.net/casualjim/archive/2005/12/01/61722.aspx

— Mark Ingram

4

/^((?P<scheme>https?|ftp):\/)?\/?((?P<username>.*?)(:(?P<password>.*?)|)@)?(?P<hostname>[^:\/\s]+)(?P<port>:([^\/]*))?(?P<path>(\/\w+)*\/)(?P<filename>[-\w.]+[^#?\s]*)?(?P<query>\?([^#]*))?(?P<fragment>#(.*))?$/

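From my answer on a similar question. It works better than some of the others mentioned because they had some bugs (such as not supporting username/password, not supporting single-character filenames, or fragment identifiers being broken).

— strager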

2

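You can get the scheme (http/https), host, port, path and query by using the Uri object in .NET. The only difficult task is breaking the host into subdomain, domain name and TLD.

There is no standard for doing so, and you can't simply use string parsing or a regex to produce the correct result. At first I used a regex, but not all URLs had their subdomain parsed correctly. The practical way is to use a list of TLDs. Once the TLD of a URL is determined, the part to its left is the domain and the remainder is the subdomain.

However, the list needs to be maintained, since new TLDs can appear. The one I currently know of is publicsuffix.org, which maintains the latest list, and you can use the domainname-parser tool from Google Code to parse the public suffix list and get the subdomain, domain and TLD easily through the DomainName object: domainName.SubDomain, domainName.Domain and domainName.TLD.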
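
The suffix-list idea can be sketched in a few lines of Python. This is not the domainname-parser tool: SAMPLE_SUFFIXES is a hypothetical four-entry stand-in for the real list at publicsuffix.org, split_host is an invented name, and the wildcard and exception rules of the real list are ignored.

```python
# Hypothetical, tiny sample of public suffixes; a real program would load
# the full list published at publicsuffix.org.
SAMPLE_SUFFIXES = {"com", "org", "uk", "co.uk"}

def split_host(host, suffixes=SAMPLE_SUFFIXES):
    """Split a host name into subdomain, domain and TLD using a suffix set."""
    labels = host.lower().split(".")
    # Scan from the left so the longest matching suffix is found first.
    for i in range(len(labels)):
        candidate = ".".join(labels[i:])
        if candidate in suffixes:
            if i == 0:
                return None  # the host is itself a bare suffix
            return {
                "subdomain": ".".join(labels[:i - 1]),
                "domain": labels[i - 1],
                "tld": candidate,
            }
    return None  # no known suffix

print(split_host("sub1.sub2.domain.co.uk"))
print(split_host("test.example.com"))
```

Trying the longest candidate first is what keeps "co.uk" from being read as domain "co" under TLD "uk".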

This answer is also helpful: Get the subdomain from a URL

CallMeLaNN

— CallMeLaNN

2

This is complete and does not rely on any protocol.

function getServerURL(url) {
        var m = url.match("(^(?:(?:.*?)?//)?[^/?#;]*)");
        console.log(m[1]) // Remove this
        return m[1];
    }

getServerURL("http://dev.test.se")
getServerURL("http://dev.test.se/")
getServerURL("//ajax.googleapis.com/ajax/libs/jquery/1.8.3/jquery.min.js")
getServerURL("//")
getServerURL("www.dev.test.se/sdas/dsads")
getServerURL("www.dev.test.se/")
getServerURL("www.dev.test.se?abc=32")
getServerURL("www.dev.test.se#abc")
getServerURL("//dev.test.se?sads")
getServerURL("http://www.dev.test.se#321")
getServerURL("http://localhost:8080/sads")
getServerURL("https://localhost:8080?sdsa")

Prints

http://dev.test.se

http://dev.test.se

//ajax.googleapis.com

//

www.dev.test.se

www.dev.test.se

www.dev.test.se

www.dev.test.se

//dev.test.se

http://www.dev.test.se

http://localhost:8080

https://localhost:8080

— 嗯

2

None of the above worked for me. Here is what I ended up using:

/^(?:((?:https?|s?ftp):)\/\/)([^:\/\s]+)(?::(\d*))?(?:\/([^\s?#]+)?([?][^?#]*)?(#.*)?)?/
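
To show what each group of this pattern captures, here is a hedged Python sketch. The pattern is copied from the line above without the surrounding slashes; the sample URL and variable names are assumptions.

```python
import re

URL_RE = re.compile(
    r"^(?:((?:https?|s?ftp):)\/\/)([^:\/\s]+)(?::(\d*))?"
    r"(?:\/([^\s?#]+)?([?][^?#]*)?(#.*)?)?"
)

m = URL_RE.match("https://localhost:8080/sads?x=1#top")

# Six capture groups, in order of appearance in the pattern.
protocol, host, port, path, query, fragment = m.groups()

print(protocol, host, port, path, query, fragment)
```

Note that the protocol keeps its trailing colon, the path loses its leading slash, and the query and fragment keep their ? and # delimiters.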

— 骷髅

2

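I like the regex that was published in "JavaScript: The Good Parts". It is neither too short nor too complex. This page on GitHub also has the JavaScript code that uses it, but it can be adapted for any language. https://gist.github.com/voodooGQ/4057330

— Yetti99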

1

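Java offers a URL class that will do this. Query URL Objects.

On a side note, PHP offers parse_url().

— Chris Bartow

It looks like this doesn't parse out the subdomain, though?

— Chris Dutrow

The asker specifically asked for a regex. The URL class will open a connection when you create it.

— MikeNereson, 2011

"The URL class will open a connection when you create it": that's incorrect; it only does so when you call a method like connect(). But it is true that java.net.URL is somewhat heavy. For this use case, java.net.URI is better.

— jcsahnwaldt Reinstate Monica, 2012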

1

I would recommend not using a regex. An API call like WinHttpCrackUrl() is less error-prone.

http://msdn.microsoft.com/zh-CN/library/aa384092%28VS.85%29.aspx

— Jason

5

And also very platform-specific.

— 安迪尔

2

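I think the point was to use a library rather than reinvent the wheel. Ruby, Python and Perl have tools to tear apart URLs, so grab those instead of implementing a bad pattern.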

— Tin Man

1

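I tried a few of these and they didn't cover my needs, especially the highest-voted one, which didn't catch a URL without a path (http://example.com/).

The lack of group names also made it unusable (or perhaps my jinja2 skills are lacking).

So this is my version, slightly modified from the highest-voted one here: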

^((?P<protocol>http[s]?|ftp):\/)?\/?(?P<host>[^:\/\s]+)(?P<path>((\/\w+)*\/)([\w\-\.]+[^#?\s]+))*(.*)?(#[\w\-]+)?$

— Gil Zellner

0

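Using http://www.fileformat.info/tool/regex.htm, the regex works great.

But here is the deal: I want to use different regex patterns in different situations in my program.

For example, I have this URL, and I have an enumeration that lists all supported URLs in my program. Each object in the enumeration has a method getRegexPattern that returns the regex pattern, which is then used to compare with a URL. If the particular regex pattern returns true, then I know that this URL is supported by my program. So each enumeration has its own regex, depending on where it should look inside the URL.

Hometoast's suggestion is great, but in my case I think it wouldn't help (unless I copy-paste the same regex into all the enumerations).

That is why I wanted the answer to give the regex for each situation separately. Still, +1 for hometoast. ;)

— pek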

0

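I know you're claiming this is language-agnostic, but can you tell us what you're using, just so we know what regex capabilities you have?

If you have support for non-capturing groups, you can modify hometoast's expression so that the subexpressions you aren't interested in capturing are set up like this: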

(?:SOMESTUFF)

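You'd still have to copy and paste (and slightly modify) the regex into multiple places, but this makes sense: you're not just checking whether the subexpression exists, but whether it exists as part of a URL. Using the non-capturing modifier for subexpressions gives you what you need and nothing more, which, if I'm reading you correctly, is what you want.

Just as a small, small note, hometoast's expression doesn't need to put brackets around the 's' for 'https', since he only has one character in there. Quantifiers quantify the one character (or character class or subexpression) directly preceding them. So: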
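
As a sketch of what that could look like, here is a host-only variant of hometoast's simpler pattern in Python, with every subexpression except the host turned into a non-capturing group. The name HOST_ONLY and the sample URL are illustrative assumptions.

```python
import re

# Protocol, port and the rest must still be present in the right places,
# but only the host is captured.
HOST_ONLY = re.compile(r"^(?:.*:)//([A-Za-z0-9\-\.]+)(?::[0-9]+)?(?:.*)$")

m = HOST_ONLY.match("http://test.example.com:8080/dir/subdir/file.html")
captured = m.groups()   # a one-element tuple: just the host

print(captured)
```

A pattern for a different part would keep the same skeleton and move the capturing parentheses to that part.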

https?

would match "http" or "https" just fine.

— Brian Warshaw
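
A short Python check of that equivalence, comparing the bracketed and the plain form on a few sample strings chosen here:

```python
import re

bracketed = re.compile(r"http[s]?")   # single-character class, as in the original
plain = re.compile(r"https?")         # quantifier applied directly to "s"

samples = ["http", "https", "httpss", "htt"]

# fullmatch() requires the whole string to match the pattern.
results = [
    (s, bool(bracketed.fullmatch(s)), bool(plain.fullmatch(s)))
    for s in samples
]

for row in results:
    print(row)
```

Both forms accept and reject the same strings; the brackets only add noise.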

0

A regexp to get the URL path without the file:

url = 'http://domain/dir1/dir2/somefile'
url.scan(%r{^(http://[^/]+)((?:/[^/]+)+(?=/))?/?(?:[^/]+)?$}i).to_s

This can be useful for appending a relative path to this URL.

0

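The regex to do full parsing is quite horrendous. I've included named backreferences for legibility and broken each part into separate lines, but it still looks like this: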

^(?:(?P<protocol>\w+(?=:\/\/))(?::\/\/))?
(?:(?P<host>(?:(?:&(?:amp|apos|gt|lt|nbsp|quot|bull|hellip|[lr][ds]quo|[mn]dash|permil|\#[1-9][0-9]{1,3}|[A-Za-z][0-9A-Za-z]+);)|[^\/?#:]+)(?::(?P<port>[0-9]+))?)\/)?
(?:(?P<path>(?:(?:&(?:amp|apos|gt|lt|nbsp|quot|bull|hellip|[lr][ds]quo|[mn]dash|permil|\#[1-9][0-9]{1,3}|[A-Za-z][0-9A-Za-z]+);)|[^?#])+)\/)?
(?P<file>(?:(?:&(?:amp|apos|gt|lt|nbsp|quot|bull|hellip|[lr][ds]quo|[mn]dash|permil|\#[1-9][0-9]{1,3}|[A-Za-z][0-9A-Za-z]+);)|[^?#])+)
(?:\?(?P<querystring>(?:(?:&(?:amp|apos|gt|lt|nbsp|quot|bull|hellip|[lr][ds]quo|[mn]dash|permil|\#[1-9][0-9]{1,3}|[A-Za-z][0-9A-Za-z]+);)|[^#])+))?
(?:#(?P<fragment>.*))?$

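What requires it to be so verbose is that, except for the protocol or the port, any of the parts can contain HTML entities, which makes delineating the fragment quite tricky. So in the last few cases (the host, path, file, querystring and fragment) we allow either any HTML entity or any character that isn't a ? or #. The regex for an HTML entity looks like this: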

$htmlentity = "&(?:amp|apos|gt|lt|nbsp|quot|bull|hellip|[lr][ds]quo|[mn]dash|permil|\#[1-9][0-9]{1,3}|[A-Za-z][0-9A-Za-z]+);"

When that is extracted (I used a mustache syntax to represent it), it becomes a bit more legible:

^(?:(?P<protocol>(?:ht|f)tps?|\w+(?=:\/\/))(?::\/\/))?
(?:(?P<host>(?:{{htmlentity}}|[^\/?#:])+(?::(?P<port>[0-9]+))?)\/)?
(?:(?P<path>(?:{{htmlentity}}|[^?#])+)\/)?
(?P<file>(?:{{htmlentity}}|[^?#])+)
(?:\?(?P<querystring>(?:{{htmlentity}};|[^#])+))?
(?:#(?P<fragment>.*))?$

In JavaScript, of course, you can't use named backreferences, so the regex becomes

^(?:(\w+(?=:\/\/))(?::\/\/))?(?:((?:(?:&(?:amp|apos|gt|lt|nbsp|quot|bull|hellip|[lr][ds]quo|[mn]dash|permil|\#[1-9][0-9]{1,3}|[A-Za-z][0-9A-Za-z]+);)|[^\/?#:]+)(?::([0-9]+))?)\/)?(?:((?:(?:&(?:amp|apos|gt|lt|nbsp|quot|bull|hellip|[lr][ds]quo|[mn]dash|permil|\#[1-9][0-9]{1,3}|[A-Za-z][0-9A-Za-z]+);)|[^?#])+)\/)?((?:(?:&(?:amp|apos|gt|lt|nbsp|quot|bull|hellip|[lr][ds]quo|[mn]dash|permil|\#[1-9][0-9]{1,3}|[A-Za-z][0-9A-Za-z]+);)|[^?#])+)(?:\?((?:(?:&(?:amp|apos|gt|lt|nbsp|quot|bull|hellip|[lr][ds]quo|[mn]dash|permil|\#[1-9][0-9]{1,3}|[A-Za-z][0-9A-Za-z]+);)|[^#])+))?(?:#(.*))?$

and in each match, the protocol is \1, the host is \2, the port is \3, the path is \4, the file is \5, the querystring is \6, and the fragment is \7.

— Steve K

0

//USING REGEX
/**
 * Parse URL to get information
 *
 * @param   url     the URL string to parse
 * @return  parsed  the URL parsed or null
 */
var UrlParser = function (url) {
    "use strict";

    var regx = /^(((([^:\/#\?]+:)?(?:(\/\/)((?:(([^:@\/#\?]+)(?:\:([^:@\/#\?]+))?)@)?(([^:\/#\?\]\[]+|\[[^\/\]@#?]+\])(?:\:([0-9]+))?))?)?)?((\/?(?:[^\/\?#]+\/+)*)([^\?#]*)))?(\?[^#]+)?)(#.*)?/,
        matches = regx.exec(url),
        parser = null;

    if (null !== matches) {
        parser = {
            href              : matches[0],
            withoutHash       : matches[1],
            url               : matches[2],
            origin            : matches[3],
            protocol          : matches[4],
            protocolseparator : matches[5],
            credhost          : matches[6],
            cred              : matches[7],
            user              : matches[8],
            pass              : matches[9],
            host              : matches[10],
            hostname          : matches[11],
            port              : matches[12],
            pathname          : matches[13],
            segment1          : matches[14],
            segment2          : matches[15],
            search            : matches[16],
            hash              : matches[17]
        };
    }

    return parser;
};

var parsedURL=UrlParser(url);
console.log(parsedURL);

— 莫汉姆

0

I tried this regex for parsing the parts of a URL:

^((http[s]?|ftp):\/)?\/?([^:\/\s]+)(:([^\/]*))?((\/?(?:[^\/\?#]+\/+)*)([^\?#]*))(\?([^#]*))?(#(.*))?$

URL: https://www.google.com/my/path/sample/asd-dsa/this?key1=value1&key2=value2

Matches:

Group 1.    0-7 https:/
Group 2.    0-5 https
Group 3.    8-22    www.google.com
Group 6.    22-50   /my/path/sample/asd-dsa/this
Group 7.    22-46   /my/path/sample/asd-dsa/
Group 8.    46-50   this
Group 9.    50-74   ?key1=value1&key2=value2
Group 10.   51-74   key1=value1&key2=value2

— Bilal Demir

-1

String s = "https://www.thomas-bayer.com/axis2/services/BLZService?wsdl";

String regex = "(^http.?://)(.*?)([/\\?]{1,})(.*)";

System.out.println("1: " + s.replaceAll(regex, "$1"));
System.out.println("2: " + s.replaceAll(regex, "$2"));
System.out.println("3: " + s.replaceAll(regex, "$3"));
System.out.println("4: " + s.replaceAll(regex, "$4"));

will yield the following output:
1: https://
2: www.thomas-bayer.com
3: /
4: axis2/services/BLZService?wsdl

If you change the URL to
String s = "https://www.thomas-bayer.com?wsdl=qwerwer&ttt=888"; the output will be the following:
1: https://
2: www.thomas-bayer.com
3: ?
4: wsdl=qwerwer&ttt=888

Enjoy..
Yosi Lev

— 伊列夫

Doesn't handle ports. Isn't language-agnostic.

— Ohgodwhy