How to get a URL from a file using a shell script

10

I have a file that contains a URL. I am trying to get the URL from that file using a shell script.

In the file, the URL looks like this:

('URL', 'http://url.com');

I tried the following:

cat file.php | grep 'URL' | awk '{ print $2 }'

It gives this output:

'http://url.com');

But I only need url.com, stored in a variable in the shell script. How can I do that?

bash scripts

— Tarun

11

Something like this?

grep 'URL' file.php | rev | cut -d "'" -f 2 | rev

or

grep 'URL' file.php | cut -d "'" -f 4 | sed s/'http:\/\/'/''/g

to strip the http:// as well.
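Since the goal is to end up with the value in a shell variable, the second pipeline can be wrapped in command substitution. This is only a sketch: the temporary sample file and the variable name `url` are assumptions made for the demonstration, not part of the question.

```shell
#!/usr/bin/env bash
# Throwaway sample file; its content mirrors the line shown in the question.
sample=$(mktemp)
printf "%s\n" "define('URL', 'http://url.com');" > "$sample"

# Splitting on single quotes, field 4 is http://url.com; sed then strips the scheme.
url=$(grep 'URL' "$sample" | cut -d "'" -f 4 | sed 's|^http://||')

echo "$url"
rm -f "$sample"
```

With the sample line above, `url` should end up holding just the host name.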

— Frantique

3

Or: cat file.php | grep 'URL' | cut -d "'" -f 4.

— Eric Carvalho

I tried that with Frantique's answer; it gives http://url.com, not url.com.

— Tarun

1

@Tarun Yes, I was only pointing out that there is no need to reverse the text twice.

— Eric Carvalho

1

When you want to match / in sed, you should usually use a different delimiter, e.g. sed s@http://@@g.
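To illustrate the delimiter point, here is a small comparison of the two spellings. The variable names are made up for the example; both commands are expected to produce the same text.

```shell
line="define('URL', 'http://url.com');"

# Default delimiter: every / in the pattern has to be escaped.
escaped=$(printf '%s\n' "$line" | sed 's/http:\/\///g')

# Alternative delimiter @: the slashes can be written literally.
plain=$(printf '%s\n' "$line" | sed 's@http://@@g')

echo "$escaped"
echo "$plain"
```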

— Kevin

2

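However, this is very inefficient: solution 1 invokes 5 processes across 4 pipes, and solution 2 invokes 3 processes across 2 pipes (including 2 regular expressions). All of this can be done in the Bash shell itself, without any pipes, processes, or dependencies.

— AsymLabs, 2014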

14

You can do it all with a simple grep:

grep -oP "http://\K[^']+" file.php

From man grep:

   -P, --perl-regexp
          Interpret  PATTERN  as  a  Perl  regular  expression  (PCRE, see
          below).  This is highly experimental and grep  -P  may  warn  of
          unimplemented features.
   -o, --only-matching
          Print  only  the  matched  (non-empty) parts of a matching line,
          with each such part on a separate output line.

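The trick is the \K, which in Perl regex means "discard everything matched to the left of the \K". So the regular expression looks for strings that start with http:// (which is then discarded because of the \K), followed by as many non-' characters as possible. Combined with -o, this means that only the URL is printed.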
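Not every grep offers -P (BSD grep on macOS, for example, does not). In that case a similar effect can be approximated with sed and a capture group. This is a sketch under the same one-URL-per-line assumption; the sample file and the variable name `host` are hypothetical.

```shell
sample=$(mktemp)
printf "%s\n" "define('NAME', 'demo');" "define('URL', 'http://url.com');" > "$sample"

# -n suppresses the default output; the p flag prints only lines where the
# substitution succeeded. The group \([^']*\) captures everything after
# http:// up to the next single quote.
host=$(sed -n "s|.*http://\([^']*\)'.*|\1|p" "$sample")

echo "$host"
rm -f "$sample"
```

Lines without http:// are skipped, so no separate grep is needed to filter them.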

You can also do it directly in Perl:

perl -ne "print if s/.*http:\/\/(.+)\'.*/\$1/" file.php

— terdon

Very nice answer. +1 from me.

— souravc

Very nice, compact solution. It is my favourite too.

— AsymLabs, 2014

5

Try this:

awk -F// '{print $2}' file.php | cut -d "'" -f 1

— souravc

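No, that did not work.

— Tarun, 2014

What is the problem? Can you tell me whether this gives the correct output: echo "define('URL', 'http://url.com');" | awk -F// '{print $2}' | cut -d "'" -f 1

— souravc

The problem is that url.com could be a different URL, such as abc.com. It is dynamic, and I need to get this URL using a shell script.

— Tarun, 2014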

4

Revisiting this question and trying to use only the Bash shell, another solution is:

while read url; do url="${url##*/}" && echo "${url%%\'*}"; done < file.in > file.out

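Here file.in contains the list of "dirty" URLs and file.out will contain the list of "clean" URLs. There are no external dependencies, and no new processes or subshells need to be spawned. The original explanation and a more flexible script follow. There is also a good summary of the method here; see Example 10-10. This is pattern-based parameter substitution in Bash.

Expanding on the idea: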

src="define('URL', 'http://url.com');"
src="${src##*/}"        # remove the longest string before and including /
echo "${src%%\'*}"      # remove the longest string after and including '

Result:

url.com
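To make the difference between the shortest-match and longest-match operators concrete, here is a small sketch using the same string. The intermediate variable names are only for illustration.

```shell
src="define('URL', 'http://url.com');"

short_prefix=${src#*/}     # '#'  removes the shortest leading match of */
long_prefix=${src##*/}     # '##' removes the longest leading match of */
host=${long_prefix%%\'*}   # '%%' removes the longest trailing match of '*

echo "$short_prefix"       # still begins with the second slash
echo "$long_prefix"        # everything after the last slash
echo "$host"               # the trailing '); is gone as well
```

The single `#` stops at the first slash, which is why the double `##` is needed to get past `http://` in one step.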

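No external programs need to be called. In addition, the following bash script, get_urls.sh, lets you read either a file given directly as an argument or stdin: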

#!/usr/bin/env bash

# usage: 
#     ./get_urls.sh 'file.in'
#     grep 'URL' 'file.in' | ./get_urls.sh

# assumptions: 
#     there is not more than one url per line of text.
#     the url of interest is a simple one.

# begin get_urls.sh

# get_url 'string'
function get_url(){
  local src="$1"
  src="${src##*/}"        # remove the longest string before and including /
  echo "${src%%\'*}"      # remove the longest string after and including '
}

# read each line.
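while IFS= read -r line   # -r keeps backslashes; IFS= keeps leading whitespace
do
  get_url "$line"         # call directly; a $(...) wrapper would add a subshell
done < "${1:-/proc/${$}/fd/0}"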

# end get_urls.sh

— AsymLabs

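Nice, +1. Strictly speaking there is a subshell, though: the while loop runs in a subshell. On the bright side, this works in almost any shell except [t]csh, so it is useful for sh, bash, dash, ksh, zsh.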

Bash wins!

— Andrea Corbellini, 2014

3

If all lines contain a URL:

awk -F"'|http://" '{print $5}' file.php

If only some lines contain a URL:

awk -F"'|http://" '/^define/ {print $5}' file.php

Depending on what the other lines look like, you may need to change the ^define regex.
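As one possible variation, the line filter can key on the constant name instead of the start of the line, which also copes with indented define calls. This is a sketch; the sample file content and the pattern are assumptions for illustration.

```shell
sample=$(mktemp)
printf "%s\n" "<?php" "define('NAME', 'demo');" "  define('URL', 'http://url.com');" > "$sample"

# The dots stand in for the single quotes around URL, since a literal '
# cannot appear inside the single-quoted awk program.
url=$(awk -F"'|http://" '/define\(.URL./ {print $5}' "$sample")

echo "$url"
rm -f "$sample"
```

The field separator is unchanged, so the host is still field 5 on the matching line.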

— Florian Diesch

It only needed a cut statement added. The command I used is awk -F"'|http://" '/^define/ {print $5}' file.php | cut -d ")" -f 1

— Tarun, 2014

0

Simple:

php -r 'include("file.php"); echo URL;'

And if you need to remove the "http://", then:

php -r 'include("file.php"); echo URL;' | sed 's!^http://\(.*\)!\1!'

So:

myURL=$(php -r 'include("file.php"); echo URL;' | sed 's!^http://\(.*\)!\1!')

If you need a specific part of the URL, you need to refine your terminology. A URL is all of the following, and sometimes more:

URL := protocol://FQDN[/path][?arguments]

FQDN := [hostname.]domain.tld
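Following that breakdown, the individual parts can be pulled apart with parameter expansion alone. This is a rough sketch using a hypothetical URL; it deliberately ignores ports, credentials, and fragments, and the variable names simply mirror the grammar above.

```shell
url="http://www.example.com/path/page.php?id=1"   # hypothetical URL for illustration

protocol=${url%%://*}        # text before ://
rest=${url#*://}             # text after ://
fqdn=${rest%%/*}             # up to the first /
path_query=${rest#"$fqdn"}   # whatever follows the FQDN
path=${path_query%%\?*}      # before the first ?
case $path_query in
  *\?*) arguments=${path_query#*\?} ;;   # after the first ?
  *)    arguments= ;;                    # no query string present
esac

printf '%s\n' "$protocol" "$fqdn" "$path" "$arguments"
```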

— Sammitch

0

For me, the other grep answers returned extra string data after the link.

This worked for me to pull out only the URL:

egrep -o "(http(s)?://){1}[^'\"]+"
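Recent GNU grep releases print a deprecation warning for egrep, so the same pattern can be written with grep -E. The sketch below also strips the scheme afterwards, since the question asks for the bare host; the sample line and variable names are assumptions for illustration.

```shell
line="define('URL', 'https://url.com');"

# -E enables extended regex, -o prints only the matched part.
# The match stops at the first single or double quote.
url=$(printf '%s\n' "$line" | grep -Eo "https?://[^'\"]+")

# Drop the scheme with parameter expansion to leave only the host.
host=${url#*://}

echo "$url"
echo "$host"
```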

— username