This notebook helps you analyze datasets and find interesting and meaningful patterns in the data. If you are only interested in an automated report outlining the most important features of your dataset, you can set the path to your data file in the dataset variable and run the notebook. Afterwards, you can export the report as HTML and read it in a web browser.
If you are interested in a more interactive analysis of your data, you can also adapt the parameters of the notebook. Each section contains several values that can be adjusted to suit your needs. These values are described in the code comments.
Finally, if you want to go beyond an automated report and answer your own questions, you can look at the final section of the notebook and use the code examples there to generate your own figures and analysis from the data model.
This report uses several statistical methods and specific phrases and concepts from the domains of statistics and machine learning. Whenever such methods are used, a small "Explanation" sign at the side of the report marks a short explanation of the methods and phrases. Clicking it will reveal the explanation.
You can toggle the global visibility of these explanations with a button at the top left corner of the report. The code can also be toggled with a button.
All graphs are interactive and will display additional content on hover. You can get the exact values of the functions by selecting the associated areas in the graph. You can also move the plots around and zoom into interesting parts.
This notebook is built on the MSPN implementation by Molina et al. and was developed during the course of a bachelor thesis under the supervision of Alejandro Molina and Kristian Kersting at TU Darmstadt. The goal of this framework is to apply sum-product networks to hybrid domains and to highlight important aspects and interesting features of a given dataset.
import pickle
import pandas as pd
import numpy as np
#from tfspn.SPN import SPN
from pprint import PrettyPrinter
from IPython.display import Image
from IPython.display import display, Markdown
from importlib import reload
import plotly.plotly as py
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
from plotly.graph_objs import *
from src.util.text_util import printmd, strip_dataset_name
import src.ba_functions as f
import src.dn_plot as p
import src.dn_text_generation as descr
import src.util.data_util as util
from src.util.spn_util import get_categoricals
from src.util.CSVUtil import learn_piecewise_from_file, load_from_csv
init_notebook_mode(connected=True)
# pp = PrettyPrinter()
# path to the dataset you want to use for training
dataset = 'example_data/iris'
# the minimum number of datapoints that are included in a child of a 
# sum node
min_instances = 50
# the parameter which governs how strict the independence test will be:
# 1 results in all features being evaluated as independent, 0 will
# result in no features being accepted as truly independent
independence_threshold = 0.1
spn, dictionary = learn_piecewise_from_file(
    data_file=dataset, 
    header=0, 
    min_instances=min_instances, 
    independence_threshold=independence_threshold, 
    )
df = pd.read_csv(dataset)
context = dictionary['context']
context.dataset = strip_dataset_name(dataset)
categoricals = get_categoricals(spn, context)
# path to the model pickle file
model_path = "deep_notebooks/models/test.pickle"
# UNCOMMENT THE FOLLOWING LINES TO LOAD A MODEL
#spn = pickle.load(open('../myokardinfarkt/spn_save.txt', 'rb'))
#df, _, dictionary = load_from_csv('../myokardinfarkt/data/cleaned_pandas.csv', header = 0)
#context = pickle.load(open('../myokardinfarkt/context_save.txt', 'rb'))
#context.feature_names = ([entry['name']
#                                  for entry in dictionary['features']])
#dictionary['context'] = context
reload(descr)
descr.introduction(context)
reload(descr)
descr.data_description(context, df)
Below, the means and standard deviations of each feature are shown. Categorical features do not have a mean and a standard deviation, since they contain no ordering. Instead, the network returns NaN.
descr.means_table(spn, context)
In the following section, the marginal distribution of each feature is shown. This is the distribution of a feature without knowing anything about the values of the other features.
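To make the notion of a marginal distribution concrete, here is a minimal sketch that sums a small joint probability table over one variable. The table, its values, and the helper `marginal` are hypothetical and only serve as an illustration; the notebook itself obtains marginals from the learned network.

```python
from collections import defaultdict

# Hypothetical joint distribution P(species, size); values are illustrative.
joint = {
    ("setosa", "small"): 0.30,
    ("setosa", "large"): 0.05,
    ("virginica", "small"): 0.10,
    ("virginica", "large"): 0.55,
}

def marginal(joint, axis):
    """Sum the joint probabilities over every variable except `axis`."""
    result = defaultdict(float)
    for outcome, prob in joint.items():
        result[outcome[axis]] += prob
    return dict(result)

# Marginal distribution of the first variable (species).
species_marginal = marginal(joint, axis=0)
print(species_marginal)
```

The resulting probabilities still sum to 1, because marginalizing only regroups the probability mass of the joint table.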
reload(descr)
descr.features_shown = 'all'
descr.show_feature_marginals(spn, dictionary)
To get a sense of how the features relate to one another, the correlation between them is analyzed in the next section. The correlation denotes how strongly two features are linked. A high correlation (close to 1 or -1) means that two features are very closely related, while a correlation close to 0 means that there is no linear interdependency between the features.
The correlation is reported in a colored matrix, where blue denotes a negative and red denotes a positive correlation.
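As a rough illustration of these three cases, the sketch below computes a Pearson correlation coefficient on toy data. It assumes the Pearson coefficient is the kind of correlation meant here; the notebook derives its values from the network rather than from raw columns, and the helper `pearson` is hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

x = [1.0, 2.0, 3.0, 4.0]
r_pos = pearson(x, [2.0, 4.0, 6.0, 8.0])    # y grows linearly with x
r_neg = pearson(x, [8.0, 6.0, 4.0, 2.0])    # y shrinks linearly with x
r_zero = pearson(x, [1.0, -1.0, -1.0, 1.0])  # symmetric, non-linear pattern

print(r_pos, r_neg, r_zero)
```

The last case shows why a correlation near 0 only rules out a linear relationship: the second variable still follows a clear pattern in the first.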
descr.correlation_threshold = 0.4
corr = descr.correlation_description(spn, dictionary)
The conditional distributions are the probabilities of the features, given a certain instance of a class. The joint probability functions of correlated variables are shown below to allow a more in-depth look into the dependency.
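A conditional distribution can be sketched by restricting a joint table to the rows that match the given class and renormalizing. The table and the helper `conditional_on_species` below are hypothetical and not part of the notebook's code.

```python
# Hypothetical joint distribution P(species, size); values are illustrative.
joint = {
    ("setosa", "small"): 0.30,
    ("setosa", "large"): 0.05,
    ("virginica", "small"): 0.10,
    ("virginica", "large"): 0.55,
}

def conditional_on_species(joint, species):
    """Return P(size | species) from a joint table keyed by (species, size)."""
    selected = {
        size: prob for (name, size), prob in joint.items() if name == species
    }
    evidence = sum(selected.values())  # this is P(species)
    return {size: prob / evidence for size, prob in selected.items()}

size_given_setosa = conditional_on_species(joint, "setosa")
print(size_given_setosa)
```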
reload(descr)
descr.correlation_threshold = 0.2
descr.feature_combinations = 'all'
descr.show_conditional = True
descr.categorical_correlations(spn, dictionary)
To give an impression of the data representation as a whole, the complete network graph is shown below. The model is a tree, with a sum node at its center. The root of the tree is shown in white, while the sum and product nodes are green and blue respectively. Finally, all leaves are represented by red nodes.
#p.plot_graph(spn=spn, fname='deep_notebooks/images/graph.png', context=context)
#display(Image(filename='deep_notebooks/images/graph.png', width=400))
The data model provides a clustering of the data points into groups in which features are independent. The groups extracted from the data are outlined below together with a short description of the data they cover. Each branch in the model represents one cluster found in the data model.
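The idea of clusters with independent features can be sketched as a weighted mixture: a sum node weights its clusters, and within each cluster a product node multiplies per-feature densities. This is a simplified illustration that assumes Gaussian leaves and made-up parameters, whereas the notebook learns piecewise leaf distributions. None of the names below belong to the library used in the notebook.

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of a normal distribution at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

# Hypothetical clusters: a mixture weight and one (mean, std) pair per feature.
clusters = [
    {"weight": 0.4, "params": [(1.5, 0.2), (0.3, 0.1)]},
    {"weight": 0.6, "params": [(5.0, 0.8), (1.8, 0.4)]},
]

def product_node(point, params):
    """Features are independent inside a cluster, so densities multiply."""
    density = 1.0
    for x, (mean, std) in zip(point, params):
        density *= gaussian_pdf(x, mean, std)
    return density

def sum_node(point, clusters):
    """Weighted sum over the clusters (the branches of the sum node)."""
    return sum(c["weight"] * product_node(point, c["params"]) for c in clusters)

point = [1.4, 0.2]
per_cluster = [c["weight"] * product_node(point, c["params"]) for c in clusters]
density = sum_node(point, clusters)
print(per_cluster, density)
```

For this point the first cluster contributes almost all of the density, which is the sense in which a data point "belongs" to a cluster.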
# possible values: 'all', 'big', int (leading to a random sample), list of nodes to be displayed
nodes = f.get_sorted_nodes(spn)
reload(descr)
descr.nodes = 'all'
descr.show_node_graphs = False
descr.node_introduction(spn, nodes, context)
As stated above, each cluster captures a subgroup of the data. To show what variables are captured by which cluster, the means and variances for each feature and subgroup are plotted below. This highlights where the node has its focus.
descr.features_shown = 'all'
descr.mean_threshold = 0.1
descr.variance_threshold = 0.1
descr.separation_threshold = 0.1
separations = descr.show_node_separation(spn, nodes, context)
An analysis of the distribution of categorical variables is given below. If a cluster or a group of clusters captures a large fraction of the total likelihood of a categorical instance, it can be interpreted as representing this instance and the associated distribution.
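One way to read "fraction of the total likelihood" is as each cluster's weighted share of the probability of a category. The sketch below assumes this reading and uses made-up weights and probabilities; it is not the notebook's implementation.

```python
# Hypothetical mixture weights of three clusters.
weights = [0.5, 0.3, 0.2]
# Hypothetical P(species = "setosa" | cluster) for the same clusters.
p_category = [0.9, 0.1, 0.05]

# Weighted contribution of each cluster to P(species = "setosa").
contributions = [w * p for w, p in zip(weights, p_category)]
total = sum(contributions)

# Share of the category's total likelihood captured by each cluster.
shares = [c / total for c in contributions]
print(shares)
```

Here the first cluster captures more than 90% of the likelihood of the category, so under this reading it could be interpreted as representing that category.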
reload(descr)
descr.categoricals = 'all'
descr.node_categorical_description(spn, dictionary)
Finally, since each node captures different interactions between the features, it is interesting to look at the correlations again, this time for the separate nodes. Shallow nodes are omitted, because the correlation of independent variables is always 0.
reload(descr)
descr.correlation_threshold = 0.1
descr.nodes = 'all'
descr.node_correlation(spn, dictionary)
reload(util)
numerical_data, categorical_data = util.get_categorical_data(spn, df, dictionary)
After the cluster description, the data model is used to predict data points. To evaluate the performance of the model, the misclassification rate is shown below.
The classified data points are used to analyze more advanced patterns within the data, by looking first at the misclassified points, and then at the classification results in total.
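The misclassification rate is the fraction of data points whose predicted label differs from the true label. A minimal sketch with made-up labels follows; the helper name `misclassification_rate` is hypothetical.

```python
def misclassification_rate(true_labels, predicted_labels):
    """Fraction of positions where prediction and truth disagree."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label sequences must have the same length")
    errors = sum(t != p for t, p in zip(true_labels, predicted_labels))
    return errors / len(true_labels)

# Illustrative labels: one of the five points is predicted incorrectly.
true_labels = [0, 1, 2, 2, 1]
predicted_labels = [0, 1, 1, 2, 1]
rate = misclassification_rate(true_labels, predicted_labels)
print(rate)
```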
descr.classify = 'all'
misclassified, data_dict = descr.classification(spn, numerical_data, dictionary)
Below, the misclassified examples are explained using the clusters they are most associated with. For each instance, the clusters that together account for 90% of the prediction are reported together with the representatives of these clusters.
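Selecting the clusters that account for 90% of a prediction can be sketched as a cumulative cutoff over the clusters' contributions, sorted from largest to smallest. The contribution values and the helper `dominant_clusters` are hypothetical, and the notebook's own selection logic may differ.

```python
def dominant_clusters(contributions, coverage=0.9):
    """Return the largest clusters whose combined share reaches `coverage`."""
    total = sum(contributions.values())
    ranked = sorted(contributions.items(), key=lambda item: item[1], reverse=True)
    selected, covered = [], 0.0
    for name, value in ranked:
        selected.append(name)
        covered += value / total
        if covered >= coverage:
            break
    return selected

# Hypothetical contributions of four clusters to one prediction.
contributions = {
    "cluster_a": 0.02,
    "cluster_b": 0.60,
    "cluster_c": 0.33,
    "cluster_d": 0.05,
}
selected = dominant_clusters(contributions)
print(selected)
```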
# IMPORTANT: Only set use_shapley to true if you have a really powerful machine
reload(descr)
reload(p)
descr.use_shapley = False
descr.shapley_sample_size = 1
descr.misclassified_explanations = 1
descr.describe_misclassified(spn, dictionary, misclassified, data_dict, numerical_data)
The following graphs highlight the relative importance of different features for a classification and show how the different classes are predicted. For continuous and discrete features, a high positive or negative importance shows that shifting this feature's value in the positive or negative direction, respectively, increases the certainty of the prediction.
For categorical values, positive and negative values highlight whether changing or keeping this categorical value increases or decreases the predictive certainty.
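For continuous features, one common way to obtain such importances is the gradient of the predicted class probability with respect to the input. The sketch below approximates this gradient with central finite differences on a hypothetical logistic model; both the model and the helper `explanation_vector` are assumptions for illustration and not the notebook's implementation.

```python
import math

def predict_proba(features):
    """Hypothetical logistic model returning P(class = 1 | features)."""
    weights = [2.0, -1.0]
    score = sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-score))

def explanation_vector(model, features, step=1e-5):
    """Approximate the gradient of model(features) by central differences."""
    gradient = []
    for i in range(len(features)):
        upper = list(features)
        lower = list(features)
        upper[i] += step
        lower[i] -= step
        gradient.append((model(upper) - model(lower)) / (2.0 * step))
    return gradient

vector = explanation_vector(predict_proba, [0.5, 1.0])
print(vector)
```

A positive entry means that increasing the feature raises the predicted probability of the class, while a negative entry means that decreasing it does.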
reload(descr)
reload(p)
descr.explanation_vector_threshold = 0.2
descr.explanation_vector_classes = [4]
descr.explanation_vectors_show = 'all'
expl_vectors = descr.explanation_vector_description(spn, dictionary, data_dict, categoricals, use_shap=True)
reload(descr)
descr.print_conclusion(spn, dictionary, corr, nodes, separations, expl_vectors)
Use the Facets Interface to visualize data on your own. You can either load the dataset itself, or show the data as predicted by the model.
# Load the dataset and convert it to JSON for sending to the visualization
import pandas as pd
df = pd.read_csv(dataset)
jsonstr = df.to_json(orient='records')
# Display the Dive visualization for this data
from IPython.display import display, HTML
HTML_TEMPLATE = """<link rel="import" href="/nbextensions/facets-dist/facets-jupyter.html">
        <facets-dive id="elem" height="600"></facets-dive>
        <script>
          var data = {jsonstr};
          document.querySelector("#elem").data = data;
        </script>"""
html = HTML_TEMPLATE.format(jsonstr=jsonstr)
display(HTML(html))
This notebook enables you to add your own analysis to the above. Maybe you are interested in drilling down into specific subclusters of the data, or you want to predict additional data points not represented in the training data.
from spn.algorithms.Inference import likelihood
# get samples to predict
data_point = numerical_data[1:2]
# get the probability from the model's joint probability function
proba = likelihood(spn, data_point)
printmd(data_point)
printmd(proba)
You can also predict the probability of several data points at the same time.
data_point = numerical_data[0:3]
proba = likelihood(spn, data_point)
printmd(data_point)
printmd(proba)