More than 5 years have passed since last update.

A regular expression for extracting URLs (http/https) from a string in Perl


Sample code

use strict;
use warnings;

# URLs used to test extraction
my @urls = (
  'https://ja.wikipedia.org/wiki/A',
  'https://ja.wikipedia.org/wiki/%E6%84%9B',
  'https://ja.wikipedia.org/wiki/%E6%84%9B_(%E6%9B%96%E6%98%A7%E3%81%95%E5%9B%9E%E9%81%BF)',
  'https://ja.wikipedia.org/wiki/ABCDE?FGHIJ=KLMNO#PQRST',
  'https://ja.wikipedia.org/wiki/愛',
  'https://ja.wikipedia.org/wiki/愛_(曖昧さ回避)',
  'https://ja.wikipedia.org/wiki/あいうえお?かきくけこ=さしすせそ#たちつてと',
);

# Print the test URLs
print "*** Source URLs ***\n";
foreach my $url (@urls) {
  print "Source URL=[$url]\n";
}
print "\n";

# Join the test URLs into a single space-separated string
my $text = join(' ', @urls);

# Regular expressions for extracting URLs
my @patterns = (
  'https?://[-_.!~*\'()a-zA-Z0-9;/?:@&=+$,%#]+', # typical URLs (ASCII characters only)
  'https?://[^\s]+', # URLs that contain Japanese
);

# Print every URL matched by each pattern
# (the /p modifier and ${^MATCH} require Perl 5.10 or later)
foreach my $pattern (@patterns) {
  print "*** Pattern: $pattern ***\n";
  while ($text =~ /$pattern/gp) {
    print "Matched URL=[${^MATCH}]\n";
  }
  print "\n";
}

Output

*** Source URLs ***
Source URL=[https://ja.wikipedia.org/wiki/A]
Source URL=[https://ja.wikipedia.org/wiki/%E6%84%9B]
Source URL=[https://ja.wikipedia.org/wiki/%E6%84%9B_(%E6%9B%96%E6%98%A7%E3%81%95%E5%9B%9E%E9%81%BF)]
Source URL=[https://ja.wikipedia.org/wiki/ABCDE?FGHIJ=KLMNO#PQRST]
Source URL=[https://ja.wikipedia.org/wiki/愛]
Source URL=[https://ja.wikipedia.org/wiki/愛_(曖昧さ回避)]
Source URL=[https://ja.wikipedia.org/wiki/あいうえお?かきくけこ=さしすせそ#たちつてと]

*** Pattern: https?://[-_.!~*'()a-zA-Z0-9;/?:@&=+$,%#]+ ***
Matched URL=[https://ja.wikipedia.org/wiki/A]
Matched URL=[https://ja.wikipedia.org/wiki/%E6%84%9B]
Matched URL=[https://ja.wikipedia.org/wiki/%E6%84%9B_(%E6%9B%96%E6%98%A7%E3%81%95%E5%9B%9E%E9%81%BF)]
Matched URL=[https://ja.wikipedia.org/wiki/ABCDE?FGHIJ=KLMNO#PQRST]
Matched URL=[https://ja.wikipedia.org/wiki/]
Matched URL=[https://ja.wikipedia.org/wiki/]
Matched URL=[https://ja.wikipedia.org/wiki/]

*** Pattern: https?://[^\s]+ ***
Matched URL=[https://ja.wikipedia.org/wiki/A]
Matched URL=[https://ja.wikipedia.org/wiki/%E6%84%9B]
Matched URL=[https://ja.wikipedia.org/wiki/%E6%84%9B_(%E6%9B%96%E6%98%A7%E3%81%95%E5%9B%9E%E9%81%BF)]
Matched URL=[https://ja.wikipedia.org/wiki/ABCDE?FGHIJ=KLMNO#PQRST]
Matched URL=[https://ja.wikipedia.org/wiki/愛]
Matched URL=[https://ja.wikipedia.org/wiki/愛_(曖昧さ回避)]
Matched URL=[https://ja.wikipedia.org/wiki/あいうえお?かきくけこ=さしすせそ#たちつてと]

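The pattern "https?://[-_.!~*'()a-zA-Z0-9;/?:@&=+$,%#]+" matches URLs made up of ASCII alphanumerics and symbols, which includes percent-encoded Japanese. When a URL contains unencoded Japanese, the match stops at the first non-ASCII character, which is why the last three results are cut off at "https://ja.wikipedia.org/wiki/".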
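
To make the truncation visible, here is a minimal sketch that reports what the ASCII-only pattern leaves behind after each match. The helper name matches_with_leftover is hypothetical and not part of the original script; like the script above, it treats the source text as raw bytes and does not use the utf8 pragma.

```perl
use strict;
use warnings;

# Hypothetical helper: for every match of the ASCII-only pattern, also
# return the non-whitespace text that directly follows the match
# (an empty string when the whole URL was matched).
sub matches_with_leftover {
    my ($text) = @_;
    my $ascii_pattern = 'https?://[-_.!~*\'()a-zA-Z0-9;/?:@&=+$,%#]+';
    my @results;
    while ($text =~ /($ascii_pattern)/g) {
        my $url = $1;
        # Look at what comes right after the match, up to the next whitespace.
        my ($leftover) = substr($text, pos($text)) =~ /\A(\S*)/;
        push @results, [$url, $leftover];
    }
    return @results;
}

my $sample = 'https://ja.wikipedia.org/wiki/A https://ja.wikipedia.org/wiki/愛';
for my $result (matches_with_leftover($sample)) {
    my ($url, $leftover) = @$result;
    print "Matched=[$url] leftover=[$leftover]\n";
}
# Prints:
# Matched=[https://ja.wikipedia.org/wiki/A] leftover=[]
# Matched=[https://ja.wikipedia.org/wiki/] leftover=[愛]
```

A non-empty leftover means the pattern gave up before the URL actually ended.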

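The pattern "https?://[^\s]+" matches everything up to the next whitespace character, so it also matches URLs that contain unencoded Japanese. Because it stops only at whitespace, it also picks up any punctuation or brackets that directly follow a URL in running text.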
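
When URLs are embedded in prose rather than separated by spaces alone, one possible refinement is to match with the broad pattern and then trim trailing characters that are more likely sentence punctuation. The sketch below is an assumption about what such post-processing could look like, not something from the original script: the helper name extract_urls and the set of trimmed characters (ASCII punctuation and an unbalanced closing parenthesis) are illustrative choices.

```perl
use strict;
use warnings;

# Hypothetical helper: extract URLs with the broad pattern, then trim
# trailing ASCII punctuation that probably belongs to the sentence.
sub extract_urls {
    my ($text) = @_;
    my @urls;
    while ($text =~ m{(https?://[^\s]+)}g) {
        my $url = $1;
        while (1) {
            # Plain ASCII sentence punctuation at the very end.
            next if $url =~ s/[.,;:!?]\z//;
            # A closing parenthesis with no matching opener inside the URL.
            if ($url =~ /\)\z/) {
                my $open  = () = $url =~ /\(/g;
                my $close = () = $url =~ /\)/g;
                if ($close > $open) {
                    $url =~ s/\)\z//;
                    next;
                }
            }
            last;
        }
        push @urls, $url;
    }
    return @urls;
}

my $sample = 'See (https://ja.wikipedia.org/wiki/愛_(曖昧さ回避)), and https://ja.wikipedia.org/wiki/愛.';
print "$_\n" for extract_urls($sample);
# Prints:
# https://ja.wikipedia.org/wiki/愛_(曖昧さ回避)
# https://ja.wikipedia.org/wiki/愛
```

Counting parentheses keeps the balanced ")" that is part of the Wikipedia disambiguation URL while dropping the one that closes the surrounding sentence. Text written without spaces around URLs, as is common in Japanese, would still need a different stopping rule.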
