helpers package

Submodules

helpers.charthelper module

helpers.charthelper.create_bubblechart_graphic_iobytes(*args, **kwargs)
helpers.charthelper.create_filesize_histogram_iobytes(*args, **kwargs)
helpers.charthelper.create_property_frequency_graphic_iobytes(*args, **kwargs)
helpers.charthelper.create_tree_level_histogram_iobytes(*args, **kwargs)
helpers.charthelper.create_types_graphic_iobytes(*args, **kwargs)
helpers.charthelper.format_bytes(amount_bytes, decimals)

Prettifies an amount of bytes.

Parameters
  • amount_bytes – The number of bytes.

  • decimals – The number of decimal places in the result.

Returns

The byte count as a human-readable string.
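
A minimal sketch of what such a formatter could look like. The unit labels, the 1024 step, and the name format_bytes_sketch are assumptions for illustration; the actual helper may choose different units or rounding.

```python
def format_bytes_sketch(amount_bytes, decimals=2):
    """Return a human-readable string for a byte count (binary units assumed)."""
    units = ["B", "KiB", "MiB", "GiB", "TiB", "PiB"]
    value = float(amount_bytes)
    unit_index = 0
    # Divide by 1024 until the value fits the current unit or units run out.
    while value >= 1024 and unit_index < len(units) - 1:
        value /= 1024
        unit_index += 1
    return f"{value:.{decimals}f} {units[unit_index]}"


print(format_bytes_sketch(1536, 1))  # 1.5 KiB
```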

helpers.charthelper.reset_plot(original_function)

Decorator for resetting the plot before creating a new one. Usage: @reset_plot
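
A minimal sketch of how a decorator of this kind could be written. The real helper presumably resets the state of a plotting library; here a plain list (_plot_state) stands in for that state, and reset_plot_sketch and draw_chart are hypothetical names.

```python
import functools

# Stand-in for real plotting state (an assumption for this sketch).
_plot_state = []


def reset_plot_sketch(original_function):
    """Clear the shared plot state before running the wrapped function."""
    @functools.wraps(original_function)
    def wrapper(*args, **kwargs):
        _plot_state.clear()  # reset before every new chart
        return original_function(*args, **kwargs)
    return wrapper


@reset_plot_sketch
def draw_chart(points):
    # Add the new points to the (freshly cleared) state and return a copy.
    _plot_state.extend(points)
    return list(_plot_state)


print(draw_chart([1, 2]))  # [1, 2]
print(draw_chart([3]))     # [3], the earlier points were cleared
```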

helpers.colors module

class helpers.colors.TColorsBackground(value)

Bases: enum.Enum

Contains background color options.

BLACK = '\x1b[40m'
BLUE = '\x1b[44m'
CYAN = '\x1b[46m'
GREEN = '\x1b[42m'
LIGHTGREY = '\x1b[47m'
ORANGE = '\x1b[43m'
PURPLE = '\x1b[45m'
RED = '\x1b[41m'
class helpers.colors.TColorsForeground(value)

Bases: enum.Enum

Contains foreground color options.

BLACK = '\x1b[30m'
BLUE = '\x1b[34m'
CYAN = '\x1b[36m'
DARKGREY = '\x1b[90m'
GREEN = '\x1b[32m'
LIGHTBLUE = '\x1b[94m'
LIGHTCYAN = '\x1b[96m'
LIGHTGREEN = '\x1b[92m'
LIGHTGREY = '\x1b[37m'
LIGHTRED = '\x1b[91m'
ORANGE = '\x1b[33m'
PINK = '\x1b[95m'
PURPLE = '\x1b[35m'
RED = '\x1b[31m'
YELLOW = '\x1b[93m'
class helpers.colors.TStyles(value)

Bases: enum.Enum

Contains style options.

BOLD = '\x1b[01m'
RESET = '\x1b[0m'
STRIKETHROUGH = '\x1b[09m'
UNDERLINE = '\x1b[04m'
helpers.colors.cprint(*args, foreground: helpers.colors.TColorsForeground = None, background: helpers.colors.TColorsBackground = None, **kwargs)

Like print, but with colors.

Parameters
  • foreground – The foreground color.

  • background – The background color.

helpers.colors.cstring(text, foreground: helpers.colors.TColorsForeground = None, background: helpers.colors.TColorsBackground = None)

Colors a string.

Parameters
  • text – The string.

  • foreground – The foreground color.

  • background – The background color.

Returns

The input string, but colorized.
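
A sketch of how colorizing with the escape codes listed above could work. The enums below copy only a few of the documented values, and cstring_sketch is a hypothetical stand-in; whether the real cstring appends a reset code is an assumption.

```python
from enum import Enum


class Foreground(Enum):  # subset of the TColorsForeground values
    RED = "\x1b[31m"
    GREEN = "\x1b[32m"


class Background(Enum):  # subset of the TColorsBackground values
    BLACK = "\x1b[40m"


RESET = "\x1b[0m"  # same value as TStyles.RESET


def cstring_sketch(text, foreground=None, background=None):
    """Wrap text in ANSI color codes and reset the style afterwards."""
    prefix = ""
    if foreground is not None:
        prefix += foreground.value
    if background is not None:
        prefix += background.value
    # Only append a reset when some color code was actually emitted.
    return f"{prefix}{text}{RESET}" if prefix else str(text)


print(cstring_sketch("ok", Foreground.GREEN))
```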

helpers.commons module

helpers.commons.create_message(content)

Transforms a string into a sendable message.

Parameters

content – The message string.

Returns

The message as a byte array.

helpers.commons.download_file(link, storage_path, repository_uuid, ignore_ssl=False, output=True)

Downloads the files of a single CKAN dataset.

This method catches and tracks several errors in order to store them to the “errors” section in the database. Possible errors are:
  • SSL: The server’s TLS certificate can’t be validated.

  • TIMEOUT: The request ends with a timeout error, probably due to a large file size or an unstable internet connection.

  • FILETOOLARGE: The file size exceeds the set limit.

  • STATUSCODE: The HTTP request returns a non-2XX status code (e.g. 401, 403, 404, 5XX, …).

  • CONTENTTYPE: The content type in the HTTP header of the request doesn’t match any of the given content types.

  • MIMETYPE: The MimeType doesn’t match any of the given MimeTypes.

  • PROTOCOL: Unsupported protocol (e.g. FTP).

Parameters
  • link – The URL to the file to be downloaded.

  • storage_path – The path to the directory of the new file.

  • repository_uuid – The UUID of the repository holding the file.

  • ignore_ssl – (opt) Whether or not SSL errors shall be ignored.

Returns

The UUID (filename) of the newly created file, or False if the download failed.
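
A sketch of how some of the error labels above could be derived from response metadata. The function name, its parameters, and the order of the checks are assumptions. SSL and TIMEOUT would normally surface as exceptions from the HTTP client, and MIMETYPE requires inspecting the file contents, so those labels are left out here.

```python
from urllib.parse import urlparse


def classify_download_error(url, status_code, content_type, size_bytes,
                            allowed_content_types, max_size_bytes):
    """Return one of the error labels, or None if no check fails."""
    if urlparse(url).scheme not in ("http", "https"):
        return "PROTOCOL"
    if not 200 <= status_code < 300:
        return "STATUSCODE"
    # Ignore parameters such as "; charset=utf-8" when comparing.
    media_type = content_type.split(";")[0].strip().lower()
    if media_type not in allowed_content_types:
        return "CONTENTTYPE"
    if size_bytes > max_size_bytes:
        return "FILETOOLARGE"
    return None


allowed = {"application/json"}
print(classify_download_error("ftp://example.org/a.json", 200,
                              "application/json", 10, allowed, 100))
```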

helpers.commons.execute_function_periodically(func, seconds)

Executes a function every x seconds.

Parameters
  • func – The function. If you need to pass arguments, consider using the ‘lambda’ keyword.

  • seconds – The amount of seconds between two function calls.
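
A sketch of one way to run a function periodically with a background thread. The returned stop event is an addition made for this sketch and is not part of the documented signature; run_periodically_sketch and record are hypothetical names.

```python
import threading


def run_periodically_sketch(func, seconds):
    """Call func every `seconds` seconds on a daemon thread.

    Returns an Event; setting it ends the loop.
    """
    stop_event = threading.Event()

    def loop():
        # wait() returns False on timeout and True once stop_event is set.
        while not stop_event.wait(seconds):
            func()

    threading.Thread(target=loop, daemon=True).start()
    return stop_event


calls = []
enough_calls = threading.Event()


def record(label):
    calls.append(label)
    if len(calls) >= 3:
        enough_calls.set()


# Arguments are passed through a lambda, as the parameter note suggests.
stop = run_periodically_sketch(lambda: record("tick"), 0.01)
enough_calls.wait(timeout=5)
stop.set()
```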

helpers.commons.oprint(text, end='\n', file=<colorama.ansitowin32.StreamWrapper object>)
helpers.commons.recvall(socket)

Receives all data from the given socket.
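
A sketch of a common "receive everything" loop. It assumes the sender closes its side of the connection when it is done; the real recvall may use a different termination rule (for example a length prefix), which this reference does not state. The name recvall_sketch and the chunk_size parameter are hypothetical.

```python
import socket


def recvall_sketch(sock, chunk_size=4096):
    """Read from sock until the peer closes its side; return all bytes."""
    chunks = []
    while True:
        chunk = sock.recv(chunk_size)
        if not chunk:  # empty bytes means the peer closed the connection
            break
        chunks.append(chunk)
    return b"".join(chunks)


# Demonstration over a local socket pair.
sender, receiver = socket.socketpair()
sender.sendall("hello node".encode("utf-8"))
sender.close()
received = recvall_sketch(receiver)
receiver.close()
print(received)
```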

helpers.confighelper module

helpers.confighelper.load()
helpers.confighelper.print_config()

Loads the config and prints parameters fetched from config.ini file.

helpers.confighelper.read_config(config_name)

Reads the configuration from the database.

Parameters

config_name – The name of the configuration document that shall be used.

helpers.mongohelper module

class helpers.mongohelper.AbuseType(value)

Bases: enum.Enum

An enumeration.

ABUSIVE_BOOLEAN = 'abusive_boolean'
ABUSIVE_NUMBER = 'abusive_number'
EMPTY_STRING = 'empty_string'
class helpers.mongohelper.Client

Bases: object

add_node(config_name, address, port, name=None)

Adds a new node to the configuration.

Parameters
  • config_name – The name of the configuration.

  • address – The address of the node.

  • port – The port of the node.

  • name – (opt) A name for the node.

Returns

The new node’s UUID, if the node has been added successfully, otherwise None.

add_repository(name, url)

Stores a new repository.

Parameters
  • name – The name of the repository (e.g. European Data Portal).

  • url – The url to the repository (e.g. data.humdata.org).

Returns

The (created) UUID of the repository.

add_repository_stats(repository_uuid, total_packages, links_scraped)
add_scraped_urls(repository_uuid, urls)
compute_global_stats(globs)

Input: a list of (height, [nb_elements_0, …, nb_elements_h]) tuples.

Output: a tuple (list of heights, list of densest levels, list of occurrences); all three lists have the same length.
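
The description leaves the exact semantics open. The sketch below shows one possible reading, assuming one output entry per input tuple, "densest level" meaning the index of the level with the most elements, and "occurrences" meaning the element count at that level. All three readings and the name global_stats_sketch are assumptions.

```python
def global_stats_sketch(globs):
    """Return (heights, densest_levels, occurrences), one entry per document."""
    heights, densest_levels, occurrences = [], [], []
    for height, level_counts in globs:
        # Index of the level holding the most elements (first one on ties).
        densest = max(range(len(level_counts)), key=level_counts.__getitem__)
        heights.append(height)
        densest_levels.append(densest)
        occurrences.append(level_counts[densest])
    return heights, densest_levels, occurrences


globs = [(2, [1, 4, 2]), (1, [1, 3])]
result = global_stats_sketch(globs)
print(result)  # ([2, 1], [1, 1], [4, 3])
```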

delete_repository(uuid)

Deletes a stored repository.

Parameters

uuid – The internal uuid of the repository.

Returns

True if deletion was successful, otherwise False.

fetch_analyzed_files_from_repository(repository_uuid)
fetch_basic_repository_information(repository_uuid)

Returns basic information (name, url, status, …) of a repository.

Parameters

repository_uuid – The UUID of the repository.

Returns

A result set.

get_advanced_repository_statistics(repository_uuid=None)
get_amount_of_analyzed_documents(repository_uuid=None)
get_amount_of_downloaded_documents(repository_uuid=None)
get_amount_of_errors(repository_uuid=None)
get_available_configs()

Gets the list of all available config files.

Returns

A list of the config names.

get_basic_repository_statistics(repository_uuid=None)
get_bubblechart_provenance(provenance_x, provenance_y, repository_uuid=None)
get_bubblechart_values(repository_uuid=None)
get_children_counts(repository_uuid=None)

Gets the number of children per tree level.

Parameters

repository_uuid – The UUID of the repository holding the files.

Returns

An (int, int) dictionary where the keys are the tree levels and the values are the number of nodes on that level.

get_config(config_name)

Retrieves a configuration document from the database.

Parameters

config_name – The name of the configuration to retrieve.

Returns

dict

get_document_filesizes(repository_uuid=None)
get_document_filesizes_poweroftwo(repository_uuid=None)
get_documents_with_string_abuse(abuse_type: helpers.mongohelper.AbuseType, repository_uuid=None)

Get all documents where abusive strings were detected.

Parameters
  • abuse_type – The type of the string abuse (AbuseType).

  • repository_uuid – (opt) The UUID of the repository.

Returns

A list of UUIDs representing the files that contain abusive strings.

get_downloaded_files(repository_uuid)
get_error_context_counts(repository_uuid=None)

Gets all error contexts as well as their frequency.

Parameters

repository_uuid – (opt) The UUID of the repository.

Returns

A generator that yields { "context": <error context>, "count": <count> } dictionaries.

get_error_type_counts(repository_uuid=None)

Gets all error types as well as their frequency.

Parameters

repository_uuid – (opt) The UUID of the repository.

Returns

A generator that yields { "type": <error type>, "count": <count> } dictionaries.

get_failed_downloads(repository_uuid=None)
get_files_downloaded_by_node(repository_uuid, node_uuid)

Gets the UUIDs of files downloaded by a certain node.

Parameters
  • repository_uuid – The UUID of the repository holding the files.

  • node_uuid – The UUID of the node.

get_geojson_documents(repository_uuid=None)

Get all documents where GeoJSON was detected.

Parameters

repository_uuid – (opt) The UUID of the repository.

Returns

A list of UUIDs representing the files that are probably GeoJSON documents.

get_node(config_name, node_uuid)

Get the information of a single node.

Parameters
  • config_name – The name of the config the node belongs to.

  • node_uuid – The UUID of the node.

Returns

A dictionary which consists of the node’s attributes.

get_node_status(config_name, node_uuid)

Gets the last status of a single node.

Parameters
  • config_name – The name of the configuration file the node belongs to, for example ‘config’.

  • node_uuid – The UUID of the node.

Returns

The latest node status as a dictionary.

get_node_statuses(config_name, node_uuid)

Gets all statuses of a single node.

Parameters
  • config_name – The name of the configuration file the node belongs to, for example ‘config’.

  • node_uuid – The UUID of the node.

Returns

The node statuses as a list of dictionaries.

get_node_statuses_all(config_name)

Gets the status information of all nodes that are present in the nodeCollection in the database.

Parameters

config_name – The name of the configuration file the nodes belong to, for example ‘config’.

Returns

A list of node information dictionaries.

get_nodes(config_name)

Get the information of all nodes.

Parameters

config_name – The name of the config the nodes belong to.

Returns

A list of dictionaries consisting of the nodes’ attributes.

get_property_sum(property: helpers.mongohelper.Properties, repository_uuid=None)
get_repositories()

Returns available repositories.

get_repository_url(repository_uuid)
get_string_abuse_counts(abuse_type: helpers.mongohelper.AbuseType, repository_uuid=None)

Get the amount of abused strings.

Parameters
  • abuse_type – The type of the string abuse (AbuseType).

  • repository_uuid – (opt) The UUID of the repository.

Returns

The number of abused strings.

read_local_config()

Reads the LOCAL config file and updates the used values.

remove_node(config_name, node_uuid)

Deletes a node from a configuration.

Parameters
  • config_name – The name of the configuration the node belongs to.

  • node_uuid – The UUID of the node to remove.

Returns

(bool) Whether or not the node has been deleted successfully.

set_repository_status(repository_uuid, status: helpers.classes.repositorystatus.RepositoryStatus)
store_download_error(repository_uuid, error_type, error_url, context=None)
store_duration(repository_uuid, duration_type: helpers.mongohelper.DurationType, duration, node_count=0)

Stores the duration of an analysis or download process.

Parameters
  • repository_uuid – The UUID of the repository.

  • duration_type – Either DurationType.ANALYSIS_DURATION or DurationType.DOWNLOAD_DURATION.

  • duration – The duration of the process.

  • node_count – The number of used nodes.

store_full_document_results(repository_uuid, document_uuid, filesize, stat_builder, glob, array_properties, is_geojson, node_uuid=None)

Store all properties of an analysed document at once.

Parameters
  • repository_uuid – The UUID of the repository holding the document.

  • document_uuid – The UUID of the actual document.

  • filesize – The filesize of the document (file).

  • stat_builder – A StatisticsBuilder object holding the statistics of the document.

  • glob – A dictionary holding information about global properties.

  • array_properties – An ArrayProperties object holding array properties.

  • is_geojson – Whether or not GeoJSON has been detected.

  • node_uuid – The UUID of the node that analyzed the file.

store_new_file_information(repository_uuid, file_uuid, file_url, file_hash)
store_node_for_download(repository_uuid, file_uuid, node_uuid)

Sets the UUID of the node that downloaded a certain file.

Parameters
  • repository_uuid – The UUID of the repository holding the file.

  • file_uuid – The UUID of the downloaded file.

  • node_uuid – The UUID of the node that downloaded the file.

Returns

Whether or not the operation was successful.

store_scraping_error(repository_uuid, error_type, context=None)
toggle_all_nodes(config_name)

Toggles the state of every node.

Parameters

config_name – The name of the config the node belongs to.

toggle_node(config_name, node_uuid)

Enables or disables a node, depending on its current state.

Parameters
  • config_name – The name of the config the node belongs to.

  • node_uuid – The UUID of the node.

Returns

A tuple of two boolean values. The first determines the success of the execution, the second determines the new state of the node (enabled = True).

update_config(config_name, key: helpers.classes.config_keys.ConfigKeys, value)

Updates a single parameter of a configuration.

Parameters
  • config_name – The name of the configuration.

  • key – The key that shall be updated (type: ConfigKeys).

  • value – The new value.

Returns

True if the update was successful, otherwise False.

update_config_dict(config_name, dictionary, upsert=False)

Updates multiple parameters of a configuration at once. The parameters and their new values have to be part of a dictionary.

Parameters
  • config_name – The name of the configuration.

  • dictionary – The dictionary containing the keys and values.

  • upsert – Whether or not a new config file shall be created if no such file exists yet.

Returns

True if the update was successful, otherwise False.

update_node_status(config_name, node_uuid, node_stats)

Stores information of a node to the database.

Parameters
  • config_name – The name of the configuration file the nodes belong to, for example ‘config’.

  • node_uuid – The UUID of the node.

  • node_stats – A dictionary containing the information. It should look like this:

{
    "address": <node address>,
    "port": <node port>,
    "cpu_usage": <CPU usage>,
    "ram_total": <total amount of RAM>,
    "ram_available": <amount of RAM that’s available>,
    "platform": <platform string (platform.platform())>,
    "distro": <distro name (distro.linux_distribution()[0])>
}
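
A sketch of assembling such a dictionary. The CPU, RAM, and distro values are passed in as plain arguments because this reference does not say how they are measured; the sample values and the name build_node_stats_sketch are placeholders.

```python
import platform


def build_node_stats_sketch(address, port, cpu_usage, ram_total,
                            ram_available, distro_name):
    """Assemble a node_stats dictionary with the documented keys."""
    return {
        "address": address,
        "port": port,
        "cpu_usage": cpu_usage,
        "ram_total": ram_total,
        "ram_available": ram_available,
        "platform": platform.platform(),  # e.g. OS name and version string
        "distro": distro_name,
    }


# Placeholder values; units are an assumption of this sketch.
node_stats = build_node_stats_sketch(
    "192.0.2.10", 5000, 12.5, 8 * 1024 ** 3, 3 * 1024 ** 3, "Ubuntu"
)
print(sorted(node_stats))
```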

class helpers.mongohelper.DurationType(value)

Bases: enum.Enum

An enumeration.

ANALYSIS_DURATION = 'analysis_durations'
DOWNLOAD_DURATION = 'download_durations'
class helpers.mongohelper.Properties(value)

Bases: enum.Enum

An enumeration.

ABUSIVE_BOOLEAN_COUNT = '$files.analysis_properties.string_abuse.abusive_boolean'
ABUSIVE_NUMBER_COUNT = '$files.analysis_properties.string_abuse.abusive_number'
ARRAY_COUNT = '$files.analysis_properties.type_counts.array'
BOOLEAN_COUNT = '$files.analysis_properties.type_counts.boolean'
EMPTY_STRING_COUNT = '$files.analysis_properties.string_abuse.empty_string'
INTEGER_COUNT = '$files.analysis_properties.type_counts.integer'
NULL_COUNT = '$files.analysis_properties.type_counts.null'
NUMBER_COUNT = '$files.analysis_properties.type_counts.number'
OBJECT_COUNT = '$files.analysis_properties.type_counts.object'
OPTIONAL_PROPERTIES = '$files.analysis_properties.optional_properties'
REQUIRED_PROPERTIES = '$files.analysis_properties.required_properties'
STRING_COUNT = '$files.analysis_properties.type_counts.string'

helpers.nodeextractor module

helpers.nodeextractor.act_end_array(pre, array_properties)

Function mapped to the end_array event of ijson

helpers.nodeextractor.act_end_map(pre, global_properties_counter)

Function mapped to the end_map event of ijson

helpers.nodeextractor.act_leaf(pre, typ, val, custom_obj)

Function mapped to the (string|boolean|number|null) event of ijson. Occurs when ijson encounters a primitive-typed value.

helpers.nodeextractor.act_map_key(pre, val, custom_objects, global_properties_counter=False)

Function mapped to the map_key event of ijson. This event occurs when ijson parses a new “prop”: value pair; the function receives prop.

helpers.nodeextractor.act_start_array(pre, custom_objects)

Function mapped to the start_array event of ijson. Used to update the status of a node (current type, occurrences of the ARRAY type).

helpers.nodeextractor.act_start_map(pre, custom_objects, global_properties_counter=False)

Function mapped to the start_map event of ijson. Used to update the status of a node (current type, occurrences of the OBJECT type).

helpers.nodeextractor.create_object(node_name, custom_objects)

Alias for the creation of a custom object structure.

  • types – Dictionary of all the node types an object gets in a JSON document; see create_object_type.

  • ltype – Internal type used to determine which type was last given to an object (used to assign children correctly).

helpers.nodeextractor.create_object_type(node, node_type)

Creates a type structure for a node.

  • children – Objects contained within the node in a JSON document, referred to by their node name or their node type.

  • occ – Number of times the child occurs, generally.

helpers.nodeextractor.infer_type(ijson_event_type)

Define a node type for a leaf based on the event raised by ijson
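
A sketch of such a mapping. The type names follow the type_counts suffixes of the Properties enum (integer, number, string, boolean, null). The exact mapping used by infer_type is not documented here, so the table below, and the handling of unknown events, are assumptions.

```python
# Leaf events that ijson can emit, mapped to JSON type names (assumed).
LEAF_EVENT_TO_TYPE = {
    "null": "null",
    "boolean": "boolean",
    "integer": "integer",
    "double": "number",
    "number": "number",
    "string": "string",
}


def infer_type_sketch(ijson_event_type):
    """Return the node type for a leaf event, or raise for other events."""
    try:
        return LEAF_EVENT_TO_TYPE[ijson_event_type]
    except KeyError:
        raise ValueError(f"not a leaf event: {ijson_event_type!r}") from None


print(infer_type_sketch("double"))  # number
```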

helpers.nodeextractor.leaf_name(prefix)

Finds the last part of a tree name, e.g. leaf_name(“abc.def.ghi.jkl”) = “jkl”.
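
A one-line sketch that reproduces the example above, assuming tree names are dot-separated; leaf_name_sketch is a hypothetical name.

```python
def leaf_name_sketch(prefix):
    """Return the part of a dot-separated tree name after the last dot."""
    # rpartition returns ("", "", prefix) when prefix contains no dot.
    return prefix.rpartition(".")[2]


print(leaf_name_sketch("abc.def.ghi.jkl"))  # jkl
```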

helpers.nodeextractor.new_element_parsed(parsed, custom_objects, tcp, global_properties_counter, array_properties)

Called each time an ijson event is caught. Parses the event and calls the appropriate function.
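
A sketch of this kind of event dispatch. It does not import ijson; the event list is hand-written in the (prefix, event, value) shape that ijson's parse yields, and the exact leaf event names can differ between ijson backends. The handler table and the name dispatch_event_sketch are hypothetical, and the real function takes further arguments.

```python
LEAF_EVENTS = {"string", "boolean", "number", "integer", "double", "null"}


def dispatch_event_sketch(parsed, handlers):
    """Route one (prefix, event, value) triple to the matching handler."""
    prefix, event, value = parsed
    if event in LEAF_EVENTS:
        return handlers["leaf"](prefix, event, value)
    return handlers[event](prefix, value)


log = []
# One logging handler per structural event; name=name binds the loop value.
handlers = {
    name: (lambda prefix, value, name=name: log.append((name, prefix)))
    for name in ("start_map", "end_map", "map_key", "start_array", "end_array")
}
handlers["leaf"] = lambda prefix, event, value: log.append(("leaf", prefix, value))

# Hand-written events for the document {"a": [1]}.
events = [
    ("", "start_map", None),
    ("", "map_key", "a"),
    ("a", "start_array", None),
    ("a.item", "number", 1),
    ("a", "end_array", None),
    ("", "end_map", None),
]
for parsed in events:
    dispatch_event_sketch(parsed, handlers)
print(log)
```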

helpers.nodeextractor.parent_name(prefix, custom_objects)

Finds the parent of a node according to the tree name. The parent is the first node (leaf excluded) whose name is defined in the custom_objects array.

  • parent_name(“abc.def.ghi.item.item”) = “ghi”, if ghi has been defined in custom_objects before.

  • parent_name(“”) = ROOT_NAME.
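
A sketch that reproduces both examples. It assumes "first" means the nearest ancestor when walking from the leaf towards the root, and that custom_objects supports membership tests by node name. The value of ROOT_NAME is not documented, so the one below is a placeholder, and parent_name_sketch is a hypothetical name.

```python
ROOT_NAME = "ROOT"  # placeholder; the real constant's value is not documented


def parent_name_sketch(prefix, custom_objects):
    """Return the nearest ancestor named in custom_objects, else ROOT_NAME."""
    ancestors = prefix.split(".")[:-1]  # drop the leaf itself
    # Walk from the innermost ancestor outwards.
    for name in reversed(ancestors):
        if name in custom_objects:
            return name
    return ROOT_NAME


print(parent_name_sketch("abc.def.ghi.item.item", {"ghi": {}}))  # ghi
print(parent_name_sketch("", {}))  # ROOT
```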

helpers.nodeextractor.parse_document(collection_name, file_path, tcp=False, custom_objects={})

Public entry point that exposes the full functionality of this module to outside callers.

helpers.nodeextractor.parsing_done(custom_objects)
helpers.nodeextractor.parsing_init(root_name, custom_objects)

Initialization function of the node extractor. Creates a root object so that the recursion has a base case.

Module contents