URLExtract class
class urlextract.URLExtract(extract_email=False, cache_dns=True, extract_localhost=True, limit=10000, allow_mixed_case_hostname=True, **kwargs)

    Class for finding and extracting URLs from a given string.

    Examples:

        from urlextract import URLExtract

        extractor = URLExtract()
        example_text = "Let's have URL example.com example."

        urls = extractor.find_urls(example_text)
        print(urls)  # prints: ['example.com']

        # Another way is to get a generator over found URLs in text:
        for url in extractor.gen_urls(example_text):
            print(url)  # prints: example.com

        # Or if you want to just check if there is at least one URL in text:
        if extractor.has_urls(example_text):
            print("Given text contains some URL")
add_enclosure(left_char: str, right_char: str)

    Add a new enclosure pair of characters. Enclosure characters are removed when their presence is detected at the beginning and end of a found URL.

    Parameters:
        - left_char (str) – left character of the enclosure pair, e.g. “(”
        - right_char (str) – right character of the enclosure pair, e.g. “)”
allow_mixed_case_hostname

    If set to True, the hostname may contain mixed-case letters (upper-case and lower-case).

    Return type: bool
extract_email

    If set to True, e-mail addresses will be extracted from text.

    Return type: bool
extract_localhost

    If set to True, ‘localhost’ will be extracted as a URL from text.

    Return type: bool
find_urls(text: str, only_unique=False, check_dns=False, get_indices=False, with_schema_only=False) → List[Union[str, Tuple[str, Tuple[int, int]]]]

    Find all URLs in the given text.

    Parameters:
        - text (str) – text where we want to find URLs
        - only_unique (bool) – return only unique URLs
        - check_dns (bool) – filter results to valid domains
        - get_indices (bool) – whether to return beginning and ending indices as (<url>, (idx_begin, idx_end))
        - with_schema_only (bool) – get domains with schema only (e.g. https://janlipovsky.cz but not example.com)

    Returns: list of URLs found in text
    Return type: list
    Raises: URLExtractError – raised when the count of found URLs reaches the given limit; processed URLs are returned in the data argument.
gen_urls(text: str, check_dns=False, get_indices=False, with_schema_only=False) → Generator[Union[str, Tuple[str, Tuple[int, int]]], None, None]

    Creates a generator over the URLs found in the given text.

    Parameters:
        - text (str) – text where we want to find URLs
        - check_dns (bool) – filter results to valid domains
        - get_indices (bool) – whether to return beginning and ending indices as (<url>, (idx_begin, idx_end))
        - with_schema_only (bool) – get domains with schema only

    Yields: URL (or URL with indices) found in text, or an empty string if nothing was found
    Return type: str | tuple(str, tuple(int, int))
get_after_tld_chars() → List[str]

    Returns the list of characters that are allowed after a TLD.

    Returns: list of chars that are allowed after TLD
    Return type: list
get_enclosures() → Set[Tuple[str, str]]

    Returns the set of enclosure pairs that might be used to enclose a URL, for example brackets: (example.com), [example.com], {example.com}.

    Returns: set of tuples of enclosure characters
    Return type: set(tuple(str, str))
get_stop_chars_left() → Set[str]

    Returns the set of stop characters for text on the left of a TLD.

    Returns: set of stop chars
    Return type: set
get_stop_chars_left_from_scheme() → Set[str]

    Returns the set of stop characters for text on the left of a scheme.

    Returns: set of stop chars
    Return type: set
get_stop_chars_right() → Set[str]

    Returns the set of stop characters for text on the right of a TLD.

    Returns: set of stop chars
    Return type: set
-
static
get_version
() → str¶ Returns version number.
Returns: version number Return type: str
has_urls(text: str, check_dns=False, with_schema_only=False) → bool

    Checks if the text contains any valid URL. Returns True if the text contains at least one URL.

    Parameters:
        - text (str) – text where we want to find URLs
        - check_dns (bool) – filter results to valid domains
        - with_schema_only (bool) – consider domains with schema only

    Returns: True if at least one URL was found, False otherwise
    Return type: bool
ignore_list

    Set of URLs to be ignored (not returned) while extracting from text.

    Returns: set of ignored URLs
    Return type: set(str)
load_ignore_list(file_name)

    Load URLs from a file into the ignore list.

    Parameters: file_name (str) – path to file containing URLs
load_permit_list(file_name)

    Load URLs from a file into the permit list.

    Parameters: file_name (str) – path to file containing URLs
permit_list

    Set of URLs that can be processed.

    Returns: set of URLs that can be processed
    Return type: set(str)
remove_enclosure(left_char: str, right_char: str)

    Remove an enclosure pair from the set of enclosures.

    Parameters:
        - left_char (str) – left character of the enclosure pair, e.g. “(”
        - right_char (str) – right character of the enclosure pair, e.g. “)”
set_after_tld_chars(after_tld_chars: Iterable[str])

    Set the characters that are allowed after a TLD.

    Parameters: after_tld_chars (list) – list of characters
set_stop_chars_left(stop_chars: Set[str])

    Set stop characters for text on the left of a TLD. Stop characters are used when determining the end of a URL.

    Parameters: stop_chars (set) – set of characters
    Raises: TypeError
set_stop_chars_left_from_scheme(stop_chars: Set[str])

    Set stop characters for text on the left of a scheme. Stop characters are used when determining the end of a URL.

    Parameters: stop_chars (set) – set of characters
    Raises: TypeError
set_stop_chars_right(stop_chars: Set[str])

    Set stop characters for text on the right of a TLD. Stop characters are used when determining the end of a URL.

    Parameters: stop_chars (set) – set of characters
    Raises: TypeError
update() → bool

    Update the TLD list cache file.

    Returns: True if update was successful, False otherwise
    Return type: bool
update_when_older(days: int) → bool

    Update the TLD list cache file if the list is older than the number of days given in parameter days, or if it does not exist.

    Parameters: days (int) – number of days from last change
    Returns: True if update was successful, False otherwise
    Return type: bool