Python - Extract URL from Text

URLs are extracted from a text file using a regular expression. The expression matches any text that fits the given pattern. Only the re module is needed for this purpose.

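Example

We can take an input file containing some URLs and process it with the following program to extract them. The findall() function is used to find every instance in the text that matches the regular expression.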
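As a quick illustration of how findall() behaves, the minimal sketch below applies it to an in-memory string rather than a file. The sample sentence and the variable names are assumptions made for this illustration. It also shows why the pattern uses non-capturing groups, written (?:...): when a pattern contains a capturing group, findall() returns the group contents instead of the whole match.

```python
import re

# Hypothetical sample text, used only for this illustration.
sample = "Visit http://www.google.com and https://www.finddevguides.com today"

# Non-capturing groups (?:...): findall() returns each full match.
whole = re.findall(r'https?://(?:[-\w.]|(?:%[\da-fA-F]{2}))+', sample)
print(whole)    # ['http://www.google.com', 'https://www.finddevguides.com']

# One capturing group (...): findall() returns only that group's text.
schemes = re.findall(r'(https?)://[-\w.]+', sample)
print(schemes)  # ['http', 'https']
```

In both cases the result is a list of strings in the order the matches appear in the text.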

Input File

The input file is shown below. It contains two URLs.

Now a days you can learn almost anything by just visiting http://www.google.com. But if you are completely new to computers or internet then first you need to learn those fundamentals. Next
you can visit a good e-learning site like - https://www.finddevguides.com to learn further on a variety of subjects.

When we take the above input file and process it with the following program, we get the required output, which contains only the URLs extracted from the file.

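import re

# Raw strings keep the backslashes in the path and in the regex literal.
with open(r"path\url_example.txt") as file:
    for line in file:
        # Collect every http/https URL found on the current line.
        urls = re.findall(r'https?://(?:[-\w.]|(?:%[\da-fA-F]{2}))+', line)
        print(urls)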

When we run the above program, we get the following output:

['http://www.google.com.']
['https://www.finddevguides.com']
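
Note that the first result ends with a period. The character class [-\w.] accepts ".", so the full stop that ends the sentence after google.com is matched as part of the URL. One possible way to tidy this up, assuming a trailing period is never part of the URLs of interest, is to strip it after matching. The helper name extract_urls and the constant URL_PATTERN are hypothetical names chosen for this sketch.

```python
import re

# Same pattern as in the program above, compiled once for reuse.
URL_PATTERN = re.compile(r'https?://(?:[-\w.]|(?:%[\da-fA-F]{2}))+')

def extract_urls(text):
    """Return the URLs found in text, with any trailing periods removed."""
    return [url.rstrip('.') for url in URL_PATTERN.findall(text)]

line = "you can learn almost anything by just visiting http://www.google.com. But if"
print(extract_urls(line))  # ['http://www.google.com']
```

This keeps the regular expression unchanged and handles the sentence punctuation in a separate step, which is easier to adjust than the pattern itself.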
