helpers package¶
Subpackages¶
- helpers.classes package
- Submodules
- helpers.classes.arrayproperties module
- helpers.classes.config_keys module
- helpers.classes.errors module
- helpers.classes.exceptions module
- helpers.classes.filestatus module
- helpers.classes.job module
- helpers.classes.node module
- helpers.classes.nodetype module
- helpers.classes.repositorystatus module
- helpers.classes.statisticsbuilder module
- Module contents
Submodules¶
helpers.charthelper module¶
-
helpers.charthelper.create_bubblechart_graphic_iobytes(*args, **kwargs)¶
-
helpers.charthelper.create_filesize_histogram_iobytes(*args, **kwargs)¶
-
helpers.charthelper.create_property_frequency_graphic_iobytes(*args, **kwargs)¶
-
helpers.charthelper.create_tree_level_histogram_iobytes(*args, **kwargs)¶
-
helpers.charthelper.create_types_graphic_iobytes(*args, **kwargs)¶
-
helpers.charthelper.format_bytes(amount_bytes, decimals)¶ Prettifies an amount of bytes.
- Parameters
amount_bytes – The number of bytes.
- Returns
The number of bytes, but human-readable.
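As an illustration only, a prettifier of this kind might look like the sketch below. It assumes 1024-based units and that decimals sets the number of digits after the decimal point; neither detail is confirmed by the documentation above, and format_bytes_sketch is a hypothetical name, not the library function.

```python
def format_bytes_sketch(amount_bytes, decimals=2):
    """Return a human-readable string for a byte count (assumes 1024-based units)."""
    units = ["B", "KiB", "MiB", "GiB", "TiB", "PiB"]
    value = float(amount_bytes)
    unit_index = 0
    # Divide by 1024 until the value fits the current unit or the units run out.
    while value >= 1024 and unit_index < len(units) - 1:
        value /= 1024
        unit_index += 1
    return f"{value:.{decimals}f} {units[unit_index]}"
```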
-
helpers.charthelper.reset_plot(original_function)¶ Decorator for resetting the plot before creating a new one. Usage: @reset_plot
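The decorator pattern described here can be sketched without a plotting library. In the sketch below, _current_plot is a hypothetical stand-in for whatever global figure state the real module clears, and reset_plot_sketch and draw_bars are illustrative names only.

```python
import functools

# Hypothetical stand-in for a plotting library's global figure state.
_current_plot = {"artists": []}


def reset_plot_sketch(original_function):
    """Clear the shared plot state before the wrapped function draws."""
    @functools.wraps(original_function)
    def wrapper(*args, **kwargs):
        _current_plot["artists"].clear()
        return original_function(*args, **kwargs)
    return wrapper


@reset_plot_sketch
def draw_bars(values):
    # Without the decorator, values from earlier calls would accumulate here.
    _current_plot["artists"].extend(values)
    return list(_current_plot["artists"])
```

Using functools.wraps keeps the wrapped function's name and docstring intact, which matters when documentation is generated from the decorated functions.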
helpers.colors module¶
-
class
helpers.colors.TColorsBackground(value)¶ Bases:
enum.Enum
Contains background color options.
-
BLACK= '\x1b[40m'¶
-
BLUE= '\x1b[44m'¶
-
CYAN= '\x1b[46m'¶
-
GREEN= '\x1b[42m'¶
-
LIGHTGREY= '\x1b[47m'¶
-
ORANGE= '\x1b[43m'¶
-
PURPLE= '\x1b[45m'¶
-
RED= '\x1b[41m'¶
-
-
class
helpers.colors.TColorsForeground(value)¶ Bases:
enum.Enum
Contains foreground color options.
-
BLACK= '\x1b[30m'¶
-
BLUE= '\x1b[34m'¶
-
CYAN= '\x1b[36m'¶
-
DARKGREY= '\x1b[90m'¶
-
GREEN= '\x1b[32m'¶
-
LIGHTBLUE= '\x1b[94m'¶
-
LIGHTCYAN= '\x1b[96m'¶
-
LIGHTGREEN= '\x1b[92m'¶
-
LIGHTGREY= '\x1b[37m'¶
-
LIGHTRED= '\x1b[91m'¶
-
ORANGE= '\x1b[33m'¶
-
PINK= '\x1b[95m'¶
-
PURPLE= '\x1b[35m'¶
-
RED= '\x1b[31m'¶
-
YELLOW= '\x1b[93m'¶
-
-
class
helpers.colors.TStyles(value)¶ Bases:
enum.Enum
Contains style options.
-
BOLD= '\x1b[01m'¶
-
RESET= '\x1b[0m'¶
-
STRIKETHROUGH= '\x1b[09m'¶
-
UNDERLINE= '\x1b[04m'¶
-
-
helpers.colors.cprint(*args, foreground: helpers.colors.TColorsForeground = None, background: helpers.colors.TColorsBackground = None, **kwargs)¶ Like print, but with colors.
- Parameters
foreground – The foreground color.
background – The background color.
-
helpers.colors.cstring(text, foreground: helpers.colors.TColorsForeground = None, background: helpers.colors.TColorsBackground = None)¶ Colors a string.
- Parameters
text – The string.
foreground – The foreground color.
background – The background color.
- Returns
The input string, but colorized.
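A minimal sketch of how such colorizing helpers could work, using a few of the escape codes listed above. The enum subsets, the order in which the codes are concatenated, and the decision to append the reset code only when a color was applied are assumptions; the names ending in _sketch are hypothetical.

```python
from enum import Enum


class Foreground(Enum):
    RED = "\x1b[31m"
    GREEN = "\x1b[32m"


class Background(Enum):
    BLACK = "\x1b[40m"


RESET = "\x1b[0m"  # same value as TStyles.RESET listed above


def cstring_sketch(text, foreground=None, background=None):
    """Wrap text in ANSI color codes and terminate it with a reset code."""
    prefix = ""
    if foreground is not None:
        prefix += foreground.value
    if background is not None:
        prefix += background.value
    # Assumption: uncolored text is returned unchanged, without a reset code.
    return f"{prefix}{text}{RESET}" if prefix else text


def cprint_sketch(*args, foreground=None, background=None, **kwargs):
    """Like print, but colors the joined arguments first."""
    sep = kwargs.pop("sep", " ")
    text = sep.join(str(arg) for arg in args)
    print(cstring_sketch(text, foreground, background), **kwargs)
```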
helpers.commons module¶
-
helpers.commons.create_message(content)¶ Transforms a string into a sendable message.
- Parameters
content – The message string.
- Returns
The message as a byte array.
-
helpers.commons.download_file(link, storage_path, repository_uuid, ignore_ssl=False, output=True)¶ Downloads the files of a single CKAN dataset.
- This method catches and tracks several errors in order to store them in the “errors” section of the database. Possible errors are:
SSL: The server’s TLS certificate can’t be validated.
TIMEOUT: The request ends with a timeout error, probably due to a large file size or an unstable internet connection.
FILETOOLARGE: The file size exceeds the set limit.
STATUSCODE: The HTTP request returns a non-2XX status code (e.g. 401, 403, 404, 5XX, …).
CONTENTTYPE: The content type in the HTTP header of the request doesn’t match any of the given content types.
MIMETYPE: The MimeType doesn’t match any of the given MimeTypes.
PROTOCOL: Unsupported protocol (e.g. FTP).
- Parameters
link – The URL to the file to be downloaded.
storage_path – The path to the directory of the new file.
repository_uuid – The UUID of the repository holding the file.
ignore_ssl – (opt) Whether or not SSL errors shall be ignored.
- Returns
The UUID (filename) of the newly created file, otherwise False.
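To make the error labels concrete, the sketch below maps a few response properties to the labels listed above. It is not the library's implementation: classify_download_error, its parameters, and the order of the checks are assumptions, and the SSL, TIMEOUT and MIMETYPE cases are left out because they depend on the HTTP client and on file inspection.

```python
from urllib.parse import urlparse


def classify_download_error(link, status_code, content_type, size_bytes,
                            allowed_content_types, max_size_bytes):
    """Return one of the error labels listed above, or None if nothing is wrong."""
    if urlparse(link).scheme not in ("http", "https"):
        return "PROTOCOL"
    if not 200 <= status_code < 300:
        return "STATUSCODE"
    # Compare only the media type, ignoring parameters such as "; charset=utf-8".
    media_type = content_type.split(";")[0].strip().lower()
    if media_type not in allowed_content_types:
        return "CONTENTTYPE"
    if size_bytes > max_size_bytes:
        return "FILETOOLARGE"
    return None
```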
-
helpers.commons.execute_function_periodically(func, seconds)¶ Executes a function every x seconds.
- Parameters
func – The function. If you need to pass arguments, consider using the ‘lambda’ keyword.
seconds – The amount of seconds between two function calls.
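One way to run a function periodically is a background thread that waits on an event between calls. The sketch below takes that approach; whether the real helper blocks, uses a timer, or returns anything is not stated above, so execute_periodically_sketch and its returned stop event are assumptions. The usage part shows the lambda idiom mentioned in the parameter description.

```python
import threading
import time


def execute_periodically_sketch(func, seconds):
    """Call func every `seconds` seconds on a daemon thread; return a stop Event."""
    stop_event = threading.Event()

    def loop():
        # Event.wait returns False on timeout, so func runs once per interval
        # until stop_event.set() is called.
        while not stop_event.wait(seconds):
            func()

    threading.Thread(target=loop, daemon=True).start()
    return stop_event


def greet(name, sink):
    sink.append(f"hello {name}")


messages = []
# A lambda binds the arguments, since func itself is called without any.
stop = execute_periodically_sketch(lambda: greet("node-1", messages), 0.01)
time.sleep(0.2)
stop.set()
```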
-
helpers.commons.oprint(text, end='\n', file=<colorama.ansitowin32.StreamWrapper object>)¶
-
helpers.commons.recvall(socket)¶ Receive everything.
helpers.confighelper module¶
-
helpers.confighelper.load()¶
-
helpers.confighelper.print_config()¶ Loads the config and prints the parameters fetched from the config.ini file.
-
helpers.confighelper.read_config(config_name)¶ Reads the configuration from the database.
- Parameters
config_name – The name of the configuration document that shall be used.
helpers.mongohelper module¶
-
class
helpers.mongohelper.AbuseType(value)¶ Bases:
enum.Enum
An enumeration.
-
ABUSIVE_BOOLEAN= 'abusive_boolean'¶
-
ABUSIVE_NUMBER= 'abusive_number'¶
-
EMPTY_STRING= 'empty_string'¶
-
-
class
helpers.mongohelper.Client¶ Bases:
object
-
add_node(config_name, address, port, name=None)¶ Adds a new node to the configuration.
- Parameters
config_name – The name of the configuration.
address – The address of the node.
port – The port of the node.
name – (opt) A name for the node.
- Returns
The new node’s UUID, if the node has been added successfully, otherwise None.
-
add_repository(name, url)¶ Stores a new repository.
- Parameters
name – The name of the repository (e.g. European Data Portal).
url – The url to the repository (e.g. data.humdata.org).
- Returns
The (created) UUID of the repository.
-
add_repository_stats(repository_uuid, total_packages, links_scraped)¶
-
add_scraped_urls(repository_uuid, urls)¶
-
compute_global_stats(globs)¶ In: a list of (height, [nb_elements_0, …, nb_elements_h]) tuples. Out: a tuple (list of heights, list of densest levels, list of occurrences); all three lists have the same length.
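The input and output shapes above leave room for interpretation. The sketch below shows one possible reading, in which the densest level of a document is the tree level holding the most elements, and the occurrences count how many documents share the same (height, densest level) pair. This reading is an assumption, as is the name compute_global_stats_sketch.

```python
from collections import Counter


def compute_global_stats_sketch(globs):
    """Aggregate (height, per-level element counts) pairs into three parallel lists."""
    pair_counts = Counter()
    for height, level_counts in globs:
        # Index of the level holding the most elements (first one on ties).
        densest_level = max(range(len(level_counts)), key=level_counts.__getitem__)
        pair_counts[(height, densest_level)] += 1

    heights, densest_levels, occurrences = [], [], []
    for (height, densest_level), count in sorted(pair_counts.items()):
        heights.append(height)
        densest_levels.append(densest_level)
        occurrences.append(count)
    return heights, densest_levels, occurrences
```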
-
delete_repository(uuid)¶ Deletes a stored repository.
- Parameters
uuid – The internal uuid of the repository.
- Returns
True if deletion was successful, otherwise False.
-
fetch_analyzed_files_from_repository(repository_uuid)¶
-
fetch_basic_repository_information(repository_uuid)¶ Returns basic information (name, url, status, …) of a repository.
- Parameters
repository_uuid – The UUID of the repository.
- Returns
A result set.
-
get_advanced_repository_statistics(repository_uuid=None)¶
-
get_amount_of_analyzed_documents(repository_uuid=None)¶
-
get_amount_of_downloaded_documents(repository_uuid=None)¶
-
get_amount_of_errors(repository_uuid=None)¶
-
get_amount_of_scraped_links(repository_uuid=None)¶
-
get_available_configs()¶ Gets the list of all available config files.
- Returns
A list of the config names.
-
get_basic_repository_statistics(repository_uuid=None)¶
-
get_bubblechart_provenance(provenance_x, provenance_y, repository_uuid=None)¶
-
get_bubblechart_values(repository_uuid=None)¶
-
get_children_counts(repository_uuid=None)¶ Gets the number of children per tree level.
- Parameters
repository_uuid – The UUID of the repository holding the files.
- Returns
An (int, int) dictionary where the keys are the tree levels and the values are the number of nodes on that level.
-
get_config(config_name)¶ Retrieves a configuration document from the database.
- Parameters
config_name – The name of the configuration to retrieve.
- Returns
dict
-
get_document_filesizes(repository_uuid=None)¶
-
get_document_filesizes_poweroftwo(repository_uuid=None)¶
-
get_documents_with_string_abuse(abuse_type: helpers.mongohelper.AbuseType, repository_uuid=None)¶ Get all documents where abusive strings were detected.
- Parameters
abuse_type – The type of the string abuse (AbuseType).
repository_uuid – (opt) The UUID of the repository.
- Returns
A list of UUIDs representing the files that contain abusive strings.
-
get_downloaded_files(repository_uuid)¶
-
get_error_context_counts(repository_uuid=None)¶ Gets all error contexts as well as their frequency.
- Parameters
repository_uuid – (opt) The UUID of the repository.
- Returns
A generator that yields { "context": <error context>, "count": <count> } dictionaries.
-
get_error_type_counts(repository_uuid=None)¶ Gets all error types as well as their frequency.
- Parameters
repository_uuid – (opt) The UUID of the repository.
- Returns
A generator that yields { "type": <error type>, "count": <count> } dictionaries.
-
get_failed_downloads(repository_uuid=None)¶
-
get_files_downloaded_by_node(repository_uuid, node_uuid)¶ Gets the UUIDs of files downloaded by a certain node.
- Parameters
repository_uuid – The UUID of the repository holding the files.
node_uuid – The UUID of the node.
-
get_geojson_documents(repository_uuid=None)¶ Get all documents where GeoJSON was detected.
- Parameters
repository_uuid – (opt) The UUID of the repository.
- Returns
A list of UUIDs representing the files that are probably GeoJSON documents.
-
get_node(config_name, node_uuid)¶ Get the information of a single node.
- Parameters
config_name – The name of the config the node belongs to.
node_uuid – The UUID of the node.
- Returns
A dictionary which consists of the node’s attributes.
-
get_node_status(config_name, node_uuid)¶ Gets the last status of a single node.
- Parameters
config_name – The name of the configuration file the node belongs to, for example ‘config’.
node_uuid – The UUID of the node.
- Returns
The latest node status as a dictionary.
-
get_node_statuses(config_name, node_uuid)¶ Gets all statuses of a single node.
- Parameters
config_name – The name of the configuration file the node belongs to, for example ‘config’.
node_uuid – The UUID of the node.
- Returns
The node statuses as a list of dictionaries.
-
get_node_statuses_all(config_name)¶ Gets the status information of all nodes that are present in the nodeCollection in the database.
- Parameters
config_name – The name of the configuration file the nodes belong to, for example ‘config’.
- Returns
A list of node information dictionaries.
-
get_nodes(config_name)¶ Get the information of all nodes.
- Parameters
config_name – The name of the config the nodes belong to.
- Returns
A list of dictionaries consisting of the nodes’ attributes.
-
get_property_sum(property: helpers.mongohelper.Properties, repository_uuid=None)¶
-
get_repositories()¶ Returns available repositories.
-
get_repository_url(repository_uuid)¶
-
get_scraped_links(repository_uuid)¶
-
get_string_abuse_counts(abuse_type: helpers.mongohelper.AbuseType, repository_uuid=None)¶ Get the amount of abused strings.
- Parameters
abuse_type – The type of the string abuse (AbuseType).
repository_uuid – (opt) The UUID of the repository.
- Returns
The number of abused strings.
-
read_local_config()¶ Reads the LOCAL config file and updates the used values.
-
remove_node(config_name, node_uuid)¶ Deletes a node from a configuration.
- Parameters
config_name – The name of the configuration the node belongs to.
node_uuid – The UUID of the node to remove.
- Returns
(bool) Whether or not the node has been deleted successfully.
-
set_repository_status(repository_uuid, status: helpers.classes.repositorystatus.RepositoryStatus)¶
-
store_download_error(repository_uuid, error_type, error_url, context=None)¶
-
store_duration(repository_uuid, duration_type: helpers.mongohelper.DurationType, duration, node_count=0)¶ Stores the duration of an analysis process.
- Parameters
repository_uuid – The UUID of the repository.
duration_type – Either DurationType.ANALYSIS_DURATION or DurationType.DOWNLOAD_DURATION.
duration – The duration of the analysis process.
node_count – The number of used nodes.
-
store_full_document_results(repository_uuid, document_uuid, filesize, stat_builder, glob, array_properties, is_geojson, node_uuid=None)¶ Store all properties of an analysed document at once.
- Parameters
repository_uuid – The UUID of the repository holding the document.
document_uuid – The UUID of the actual document.
filesize – The filesize of the document (file).
stat_builder – A StatisticsBuilder object holding the statistics of the document.
glob – A dictionary holding information about global properties.
array_properties – An ArrayProperties object holding array properties.
is_geojson – Whether or not GeoJSON has been detected.
node_uuid – The UUID of the node that analyzed the file.
-
store_new_file_information(repository_uuid, file_uuid, file_url, file_hash)¶
-
store_node_for_download(repository_uuid, file_uuid, node_uuid)¶ Sets the UUID of the node that downloaded a certain file.
- Parameters
repository_uuid – The UUID of the repository holding the file.
file_uuid – The UUID of the downloaded file.
node_uuid – The UUID of the node that downloaded the file.
- Returns
Whether or not the operation was successful.
-
store_scraping_error(repository_uuid, error_type, context=None)¶
-
toggle_all_nodes(config_name)¶ Toggles the state of every node.
- Parameters
config_name – The name of the config the node belongs to.
-
toggle_node(config_name, node_uuid)¶ Enables or disables a node, depending on its current state.
- Parameters
config_name – The name of the config the node belongs to.
node_uuid – The UUID of the node.
- Returns
A tuple of two boolean values. The first determines the success of the execution, the second determines the new state of the node (enabled = True).
-
update_config(config_name, key: helpers.classes.config_keys.ConfigKeys, value)¶ Updates a single parameter of a configuration.
- Parameters
config_name – The name of the configuration.
key – The key that shall be updated (type: ConfigKeys).
value – The new value.
- Returns
True if the update was successful, otherwise False.
-
update_config_dict(config_name, dictionary, upsert=False)¶ Updates multiple parameters of a configuration at once. The parameters and their new values have to be part of a dictionary.
- Parameters
config_name – The name of the configuration.
dictionary – The dictionary containing the keys and values.
upsert – Whether or not a new config file shall be created if no such file exists yet.
- Returns
True if the update was successful, otherwise False.
-
update_node_status(config_name, node_uuid, node_stats)¶ Stores information of a node to the database.
- Parameters
config_name – The name of the configuration file the nodes belong to, for example ‘config’.
node_uuid – The UUID of the node.
node_stats – A dictionary containing the information. It should look like this:
{
"address": <node address>,
"port": <node port>,
"cpu_usage": <CPU usage>,
"ram_total": <total amount of RAM>,
"ram_available": <amount of RAM that’s available>,
"platform": <platform string (platform.platform())>,
"distro": <distro name (distro.linux_distribution()[0])>
}
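A dictionary of this shape could be assembled as in the sketch below. Only platform.platform() is called directly; the CPU, RAM and distro values are passed in because the way the project measures them is not shown here. build_node_stats and the sample values are hypothetical.

```python
import platform


def build_node_stats(address, port, cpu_usage, ram_total, ram_available, distro_name):
    """Assemble a node_stats dictionary with the keys described above."""
    return {
        "address": address,
        "port": port,
        "cpu_usage": cpu_usage,
        "ram_total": ram_total,
        "ram_available": ram_available,
        "platform": platform.platform(),
        "distro": distro_name,
    }


# Sample values are made up for illustration.
stats = build_node_stats("10.0.0.5", 5000, 12.5, 8 * 1024**3, 4 * 1024**3, "Ubuntu")
```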
-
-
class
helpers.mongohelper.DurationType(value)¶ Bases:
enum.Enum
An enumeration.
-
ANALYSIS_DURATION= 'analysis_durations'¶
-
DOWNLOAD_DURATION= 'download_durations'¶
-
-
class
helpers.mongohelper.Properties(value)¶ Bases:
enum.Enum
An enumeration.
-
ABUSIVE_BOOLEAN_COUNT= '$files.analysis_properties.string_abuse.abusive_boolean'¶
-
ABUSIVE_NUMBER_COUNT= '$files.analysis_properties.string_abuse.abusive_number'¶
-
ARRAY_COUNT= '$files.analysis_properties.type_counts.array'¶
-
BOOLEAN_COUNT= '$files.analysis_properties.type_counts.boolean'¶
-
EMPTY_STRING_COUNT= '$files.analysis_properties.string_abuse.empty_string'¶
-
INTEGER_COUNT= '$files.analysis_properties.type_counts.integer'¶
-
NULL_COUNT= '$files.analysis_properties.type_counts.null'¶
-
NUMBER_COUNT= '$files.analysis_properties.type_counts.number'¶
-
OBJECT_COUNT= '$files.analysis_properties.type_counts.object'¶
-
OPTIONAL_PROPERTIES= '$files.analysis_properties.optional_properties'¶
-
REQUIRED_PROPERTIES= '$files.analysis_properties.required_properties'¶
-
STRING_COUNT= '$files.analysis_properties.type_counts.string'¶
-
helpers.nodeextractor module¶
-
helpers.nodeextractor.act_end_array(pre, array_properties)¶ Function mapped to the end_array event of ijson.
-
helpers.nodeextractor.act_end_map(pre, global_properties_counter)¶ Function mapped to the end_map event of ijson.
-
helpers.nodeextractor.act_leaf(pre, typ, val, custom_obj)¶ Function mapped to the (string|boolean|number|null) events of ijson. Occurs when ijson encounters a primitive-typed value.
-
helpers.nodeextractor.act_map_key(pre, val, custom_objects, global_properties_counter=False)¶ Function mapped to the map_key event of ijson. This event occurs when ijson parses a new "prop": value pair; the function receives prop.
-
helpers.nodeextractor.act_start_array(pre, custom_objects)¶ Function mapped to the start_array event of ijson. Used to update the status of a node (current type, occurrences of the ARRAY type).
-
helpers.nodeextractor.act_start_map(pre, custom_objects, global_properties_counter=False)¶ Function mapped to the start_map event of ijson. Used to update the status of a node (current type, occurrences of the OBJECT type).
-
helpers.nodeextractor.create_object(node_name, custom_objects)¶ Alias for the creation of a custom object structure.
types: dictionary of all the node types an object gets in a JSON document (see create_object_type).
ltype: internal type used to determine which type was last given to an object (used to assign children correctly).
-
helpers.nodeextractor.create_object_type(node, node_type)¶ Create a type structure for a node.
children: objects contained within the node in a JSON document, referred to by their node name or their node type.
occ: number of times the child occurs, generally.
-
helpers.nodeextractor.infer_type(ijson_event_type)¶ Define a node type for a leaf based on the event raised by ijson
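A sketch of such a mapping is shown below. The event names on the left are the primitive-value events that ijson emits; the type labels on the right are assumed from the type_counts names in the Properties enumeration, and the real function may use different labels or a different fallback.

```python
# ijson event names for primitive values, mapped to assumed node type labels.
LEAF_EVENT_TO_TYPE = {
    "string": "string",
    "boolean": "boolean",
    "null": "null",
    "integer": "integer",
    "double": "number",
    "number": "number",
}


def infer_type_sketch(ijson_event_type):
    """Return the node type label for a leaf event, or raise for other events."""
    try:
        return LEAF_EVENT_TO_TYPE[ijson_event_type]
    except KeyError:
        raise ValueError(f"not a leaf event: {ijson_event_type!r}") from None
```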
-
helpers.nodeextractor.leaf_name(prefix)¶ Find the last part of a tree name, e.g. leaf_name("abc.def.ghi.jkl") = "jkl".
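The documented example can be reproduced with a single right-split on the dot separator. This is a sketch consistent with the example, not necessarily how the module implements it; leaf_name_sketch is a hypothetical name.

```python
def leaf_name_sketch(prefix):
    """Return the part of a dotted tree name after the last dot."""
    return prefix.rsplit(".", 1)[-1]
```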
-
helpers.nodeextractor.new_element_parsed(parsed, custom_objects, tcp, global_properties_counter, array_properties)¶ Called each time an ijson event is caught. Parses the event and calls the appropriate function.
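The dispatch step can be illustrated with a simplified handler that only counts events. The (prefix, event, value) tuples below are written out by hand in the form ijson.parse produces for the document {"a": [1, "x"]}, so the sketch runs without ijson installed. dispatch_event and the counting behaviour are assumptions and stand in for the act_* functions above.

```python
from collections import Counter

LEAF_EVENTS = {"string", "boolean", "number", "integer", "double", "null"}
STRUCTURE_EVENTS = {"start_map", "end_map", "start_array", "end_array", "map_key"}


def dispatch_event(parsed, counts):
    """Route one (prefix, event, value) tuple to the matching counter."""
    prefix, event, value = parsed
    if event in LEAF_EVENTS:
        counts["leaf"] += 1
    elif event in STRUCTURE_EVENTS:
        counts[event] += 1
    else:
        raise ValueError(f"unexpected event: {event!r}")


# Hand-written event stream for {"a": [1, "x"]}.
events = [
    ("", "start_map", None),
    ("", "map_key", "a"),
    ("a", "start_array", None),
    ("a.item", "integer", 1),
    ("a.item", "string", "x"),
    ("a", "end_array", None),
    ("", "end_map", None),
]

counts = Counter()
for parsed in events:
    dispatch_event(parsed, counts)
```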
-
helpers.nodeextractor.parent_name(prefix, custom_objects)¶ Find the parent of a node according to the tree name. The parent is the first node (leaf excluded) whose name is defined in the custom_objects array, e.g. parent_name("abc.def.ghi.item.item") = "ghi" if ghi has been defined in custom_objects before, and parent_name("") = ROOT_NAME.
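The lookup described above can be sketched as a walk from the leaf towards the root. The sketch assumes that custom_objects is keyed by node name and that ROOT_NAME is the string "root"; both are assumptions made for illustration, as is the name parent_name_sketch.

```python
ROOT_NAME = "root"  # hypothetical value; the real constant is not shown above


def parent_name_sketch(prefix, custom_objects):
    """Return the nearest ancestor of the leaf that is defined in custom_objects."""
    parts = prefix.split(".") if prefix else []
    # Drop the leaf, then walk towards the root looking for a known name.
    for name in reversed(parts[:-1]):
        if name in custom_objects:
            return name
    return ROOT_NAME
```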
-
helpers.nodeextractor.parse_document(collection_name, file_path, tcp=False, custom_objects={})¶ Public entry point that exposes the full functionality of this module to external callers.
-
helpers.nodeextractor.parsing_done(custom_objects)¶
-
helpers.nodeextractor.parsing_init(root_name, custom_objects)¶ Initialization function of the node extractor. Creates a root object so that the recursion has a base case.