URLExtract class
class urlextract.URLExtract(extract_email=False, cache_dns=True, extract_localhost=True, limit=10000, allow_mixed_case_hostname=True, **kwargs)

    Class for finding and extracting URLs from a given string.

    Examples:

        from urlextract import URLExtract

        extractor = URLExtract()
        example_text = "Let's have URL example.com example."

        urls = extractor.find_urls(example_text)
        print(urls)  # prints: ['example.com']

        # Another way is to get a generator over found URLs in text:
        for url in extractor.gen_urls(example_text):
            print(url)  # prints: example.com

        # Or if you want to just check if there is at least one URL in text:
        if extractor.has_urls(example_text):
            print("Given text contains some URL")
add_enclosure(left_char: str, right_char: str)

    Add a new enclosure pair of characters. Enclosure characters are removed when their presence is detected at the beginning and end of a found URL.

    Parameters:
        - left_char (str) – left character of the enclosure pair, e.g. “(”
        - right_char (str) – right character of the enclosure pair, e.g. “)”
allow_mixed_case_hostname

    If set to True, the hostname may contain mixed-case letters (upper-case and lower-case).

    Return type: bool
extract_email

    If set to True, e-mail addresses will be extracted from text.

    Return type: bool
extract_localhost

    If set to True, ‘localhost’ will be extracted as a URL from text.

    Return type: bool
find_urls(text: str, only_unique=False, check_dns=False, get_indices=False, with_schema_only=False) → List[Union[str, Tuple[str, Tuple[int, int]]]]

    Find all URLs in the given text.

    Parameters:
        - text (str) – text where we want to find URLs
        - only_unique (bool) – return only unique URLs
        - check_dns (bool) – filter results to valid domains
        - get_indices (bool) – whether to return beginning and ending indices as (<url>, (idx_begin, idx_end))
        - with_schema_only (bool) – get domains with schema only (e.g. https://janlipovsky.cz but not example.com)

    Returns: list of URLs found in text
    Return type: list
    Raises: URLExtractError – raised when the count of found URLs reaches the given limit; processed URLs are returned in the data argument.
gen_urls(text: str, check_dns=False, get_indices=False, with_schema_only=False) → Generator[Union[str, Tuple[str, Tuple[int, int]]], None, None]

    Creates a generator over the URLs found in the given text.

    Parameters:
        - text (str) – text where we want to find URLs
        - check_dns (bool) – filter results to valid domains
        - get_indices (bool) – whether to return beginning and ending indices as (<url>, (idx_begin, idx_end))
        - with_schema_only (bool) – get domains with schema only

    Yields: URL (or URL with indices) found in text, or an empty string if nothing was found
    Return type: str | tuple(str, tuple(int, int))
get_after_tld_chars() → List[str]

    Returns the list of characters that are allowed after a TLD.

    Returns: list of chars that are allowed after TLD
    Return type: list
get_enclosures() → Set[Tuple[str, str]]

    Returns the set of enclosure pairs that might be used to enclose a URL, for example brackets: (example.com), [example.com], {example.com}.

    Returns: set of tuples of enclosure characters
    Return type: set(tuple(str, str))
get_stop_chars_left() → Set[str]

    Returns the set of stop characters for text on the left of a TLD.

    Returns: set of stop chars
    Return type: set
get_stop_chars_left_from_scheme() → Set[str]

    Returns the set of stop characters for text on the left of a scheme.

    Returns: set of stop chars
    Return type: set
get_stop_chars_right() → Set[str]

    Returns the set of stop characters for text on the right of a TLD.

    Returns: set of stop chars
    Return type: set
-
static
get_version
() → str¶ Returns version number.
Returns: version number Return type: str
has_urls(text: str, check_dns=False, with_schema_only=False) → bool

    Checks if the text contains any valid URL. Returns True if the text contains at least one URL.

    Parameters:
        - text (str) – text where we want to find URLs
        - check_dns (bool) – filter results to valid domains
        - with_schema_only (bool) – consider domains with schema only

    Returns: True if at least one URL was found, False otherwise
    Return type: bool
ignore_list

    Set of URLs to be ignored (not returned) while extracting from text.

    Returns: set of ignored URLs
    Return type: set(str)
load_ignore_list(file_name)

    Load URLs from a file into the ignore list.

    Parameters: file_name (str) – path to file containing URLs
load_permit_list(file_name)

    Load URLs from a file into the permit list.

    Parameters: file_name (str) – path to file containing URLs
permit_list

    Set of URLs that can be processed.

    Returns: set of URLs that can be processed
    Return type: set(str)
remove_enclosure(left_char: str, right_char: str)

    Remove an enclosure pair from the set of enclosures.

    Parameters:
        - left_char (str) – left character of the enclosure pair, e.g. “(”
        - right_char (str) – right character of the enclosure pair, e.g. “)”
set_after_tld_chars(after_tld_chars: Iterable[str])

    Set the characters that are allowed after a TLD.

    Parameters: after_tld_chars (list) – list of characters
set_stop_chars_left(stop_chars: Set[str])

    Set stop characters for text on the left of a TLD. Stop characters are used when determining the end of a URL.

    Parameters: stop_chars (set) – set of characters
    Raises: TypeError
set_stop_chars_left_from_scheme(stop_chars: Set[str])

    Set stop characters for text on the left of a scheme. Stop characters are used when determining the end of a URL.

    Parameters: stop_chars (set) – set of characters
    Raises: TypeError
set_stop_chars_right(stop_chars: Set[str])

    Set stop characters for text on the right of a TLD. Stop characters are used when determining the end of a URL.

    Parameters: stop_chars (set) – set of characters
    Raises: TypeError
update() → bool

    Update the TLD list cache file.

    Returns: True if update was successful, False otherwise
    Return type: bool
update_when_older(days: int) → bool

    Update the TLD list cache file if the list is older than the number of days given in parameter days, or if it does not exist.

    Parameters: days (int) – number of days from last change
    Returns: True if update was successful, False otherwise
    Return type: bool