Introduction

Using this notebook

This notebook helps you analyze datasets and find interesting and meaningful patterns in the data. If you are only interested in an automated report outlining the most important features of your dataset, you can point the dataset variable at your data file and run the notebook. Afterwards, you can export the report as HTML and read it in a web browser.

If you are interested in a more interactive analysis of your data, you can also adapt the parameters of the notebook to suit your needs. Each section contains several values that can be adjusted; they are described in the code comments.

Finally, if you want to go beyond an automated report and answer your own questions, you can look at the final section of the notebook and use the code examples there to generate your own figures and analysis from the data model.

Reading this report in a webbrowser

This report uses several statistical methods as well as terms and concepts from statistics and machine learning. Whenever such a method is used, a small "Explanation" sign at the side of the report marks a short explanation of the methods and terms; clicking it reveals the explanation.

You can toggle the global visibility of these explanations with a button at the top left corner of the report. The code can also be toggled with a button.

All graphs are interactive and display additional content on hover. You can get the exact values of the functions by selecting the associated areas in the graph. You can also pan the plots and zoom into interesting parts.

Acknowledgments

This notebook is built on the MSPN implementation by Molina et al. and was developed during a bachelor's thesis under the supervision of Alejandro Molina and Kristian Kersting at TU Darmstadt. The goal of this framework is to learn sum-product networks for hybrid domains and to highlight important aspects and interesting features of a given dataset.

In [1]:
import pickle
import pandas as pd
import numpy as np

#from tfspn.SPN import SPN
from pprint import PrettyPrinter
from IPython.display import Image
from IPython.display import display, Markdown
from importlib import reload

import plotly.plotly as py
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
from plotly.graph_objs import *

from src.util.text_util import printmd, strip_dataset_name
import src.ba_functions as f
import src.dn_plot as p
import src.dn_text_generation as descr
import src.util.data_util as util
from src.util.spn_util import get_categoricals

from src.util.CSVUtil import learn_piecewise_from_file, load_from_csv

init_notebook_mode(connected=True)
# pp = PrettyPrinter()
In [2]:
# path to the dataset you want to use for training
dataset = 'example_data/iris'

# the minimum number of datapoints that are included in a child of a 
# sum node
min_instances = 50

# the parameter which governs how strict the independence test will be
# 1 results in all features being evaluated as independent, 0 will 
# result in no features being accepted as truly independent
independence_threshold = 0.1


spn, dictionary = learn_piecewise_from_file(
    data_file=dataset, 
    header=0, 
    min_instances=min_instances, 
    independence_threshold=independence_threshold, 
    )
df = pd.read_csv(dataset)
context = dictionary['context']
context.dataset = strip_dataset_name(dataset)
categoricals = get_categoricals(spn, context)
In [3]:
# path to the model pickle file
model_path = "deep_notebooks/models/test.pickle"

# UNCOMMENT THE FOLLOWING LINES TO LOAD A MODEL
#spn = pickle.load(open('../myokardinfarkt/spn_save.txt', 'rb'))
#df, _, dictionary = load_from_csv('../myokardinfarkt/data/cleaned_pandas.csv', header = 0)
#context = pickle.load(open('../myokardinfarkt/context_save.txt', 'rb'))
#context.feature_names = ([entry['name']
#                                  for entry in dictionary['features']])
#dictionary['context'] = context
In [4]:
reload(descr)
descr.introduction(context)

Exploring the iris dataset

the logo of TU Darmstadt
Report framework created @ TU Darmstadt
This report describes the dataset iris and contains general statistical information as well as an analysis of the influence different features and subgroups of the data have on each other. The first part of the report contains general statistical information about the dataset and an analysis of the variables and probability distributions.
The second part focuses on a subgroup analysis of the data. Different clusters identified by the network are analyzed and compared to give an insight into the structure of the data. Finally, the influence different variables have on the predictive capabilities of the model is analyzed.
The whole report is generated by fitting a sum-product network to the data and extracting all information from this model.

General statistical evaluation

In [5]:
reload(descr)
descr.data_description(context, df)

The dataset contains 150 entries and is comprised of 5 features, which are "sepal length", "sepal width", "petal length", "petal width", "scientific name".

"sepal length", "sepal width", "petal length", "petal width" are continuous features, while "scientific name" are categorical features. Continuous and discrete features were approximated with piecewise linear density functions, while categorical features are represented by histogramms of their probability.

Below, the means and standard deviations of each feature are shown. Categorical features do not have a mean or a standard deviation, since their values have no ordering. Instead, the network returns NaN.

In [6]:
descr.means_table(spn, context)
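
As a quick cross-check of the table above, the same statistics can be computed directly from the raw data frame with pandas. This is a small sketch using only the df loaded earlier; it is not part of the report framework.

In [ ]:
# Sketch: empirical means and standard deviations of the numerical
# columns, for comparison with the SPN estimates shown above.
numeric_columns = df.select_dtypes(include=[np.number]).columns
empirical_stats = pd.DataFrame({
    'mean': df[numeric_columns].mean(),
    'std': df[numeric_columns].std(),
})
display(empirical_stats)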

In the following section, the marginal distribution for each feature is shown. This is the distribution of a feature without conditioning on any of the other features.

In [7]:
reload(descr)
descr.features_shown = 'all'

descr.show_feature_marginals(spn, dictionary)

Correlations

To get a sense of how the features relate to one another, the correlation between them is analyzed in the next section. The correlation denotes how strongly two features are linked. A high correlation (close to 1 or -1) means that two features are very closely related, while a correlation close to 0 means that there is no linear interdependency between the features.

The correlation is reported in a colored matrix, where blue denotes a negative and red denotes a positive correlation.
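
As a point of comparison for the model-based correlations computed below, the empirical Pearson correlation of the numerical columns can be plotted directly from the raw data. This is a minimal sketch using pandas and the plotly objects already imported at the top of the notebook; it is not the method used by the report framework.

In [ ]:
# Sketch: empirical Pearson correlation matrix of the numerical columns,
# shown as a heatmap (red = positive, blue = negative correlation).
numeric_df = df.select_dtypes(include=[np.number])
empirical_corr = numeric_df.corr()
heatmap = Heatmap(
    z=empirical_corr.values,
    x=list(empirical_corr.columns),
    y=list(empirical_corr.columns),
    zmin=-1, zmax=1,
    colorscale='RdBu',
    reversescale=True,
)
iplot(Figure(data=[heatmap]))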

In [8]:
descr.correlation_threshold = 0.4

corr = descr.correlation_description(spn, dictionary)

"petal length" and "sepal length" have a strong positive relation. "petal width" and "sepal length" influence each other strongly, and "petal width" and "sepal width" have a strong positive influence on each other. "scientific name" and "sepal length" influence each other strongly. On the other hand "scientific name" and "sepal width" have a moderate positive influence on each other. "scientific name" and "petal length" have a strong positive influence on each other, and "scientific name" and "petal width" influence each other strongly.

All other features do not have more than a very weak correlation.

The conditional distributions are the probabilities of the features, given a certain instance of a class. The joint probability functions of correlated variables are shown below to allow a more in-depth look into the dependency.
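
The same idea can be illustrated on the raw data: filtering the data frame by one value of the class feature and looking at a feature's statistics within that subset gives an empirical view of the conditional distribution. A minimal sketch, assuming the column names reported above for the iris example:

In [ ]:
# Sketch: empirical conditional statistics of "petal length" given each
# value of "scientific name" (column names taken from the iris example).
for species, group in df.groupby('scientific name'):
    print(species,
          '- mean petal length:', round(group['petal length'].mean(), 2),
          '- std:', round(group['petal length'].std(), 2))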

In [9]:
reload(descr)

descr.correlation_threshold = 0.2
descr.feature_combinations = 'all'
descr.show_conditional = True

descr.categorical_correlations(spn, dictionary)

There is a strong relation for the features "petal length" and "sepal length".

The features "petal length" and "sepal width" have a moderate relationship.

There is a strong dependency for the features "petal width" and "sepal length".

The model shows a moderate relationship for "petal width" and "sepal width".

"petal width" and "petal length" have a strong dependency.

There is a strong dependency between the features "scientific name" and "sepal length".

There is a moderate relationship for the features "scientific name" and "sepal width".

The model shows a strong relationship between "scientific name" and "petal length".

The model shows a strong dependency for the features "scientific name" and "petal width".


Cluster evaluation

To give an impression of the data representation as a whole, the complete network graph is shown below. The model is a tree, with a sum node at its center. The root of the tree is shown in white, while the sum and product nodes are green and blue respectively. Finally, all leaves are represented by red nodes.

In [10]:
#p.plot_graph(spn=spn, fname='deep_notebooks/images/graph.png', context=context)
#display(Image(filename='deep_notebooks/images/graph.png', width=400))

The data model provides a clustering of the data points into groups in which features are independent. The groups extracted from the data are outlined below together with a short description of the data they cover. Each branch in the model represents one cluster found in the data model.
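
If you want to inspect these clusters programmatically, the root sum node of the model can be examined directly. The sketch below assumes an SPFlow-style structure in which a sum node exposes children and weights attributes; other SPN implementations may use different attribute names.

In [ ]:
# Sketch: list the children of the root sum node and their mixture weights.
# Assumes an SPFlow-style sum node with `children` and `weights` attributes.
for index, (weight, child) in enumerate(zip(spn.weights, spn.children)):
    print('cluster', index,
          '- covers roughly {:.2f}% of the data'.format(100 * weight),
          '- child node type:', type(child).__name__)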

Description of all clusters

In [11]:
# possible values: 'all', 'big', int (leading to a random sample), list of nodes to be displayed
nodes = f.get_sorted_nodes(spn)

reload(descr)
descr.nodes = 'all'
descr.show_node_graphs = False

descr.node_introduction(spn, nodes, context)

The SPN contains 4 clusters.

These are:

  • a deep Product Node, representing 33.33% of the data.
    • The node has 5 children and 15 descendants, resulting in a remaining depth of 3.
  • a deep Product Node, representing 29.33% of the data.
    • The node has 5 children and 15 descendants, resulting in a remaining depth of 3.
  • a deep Product Node, representing 22.67% of the data.
    • The node has 5 children and 15 descendants, resulting in a remaining depth of 3.
  • a deep Product Node, representing 14.67% of the data.
    • The node has 5 children and 15 descendants, resulting in a remaining depth of 3.

The node representatives are the most likely data points for each node. They are archetypal for what the node represents and what subgroup of the data it encapsulates.

As stated above, each cluster captures a subgroup of the data. To show what variables are captured by which cluster, the means and variances for each feature and subgroup are plotted below. This highlights where the node has its focus.

In [12]:
descr.features_shown = 'all'
descr.mean_threshold = 0.1
descr.variance_threshold = 0.1
descr.separation_threshold = 0.1

separations = descr.show_node_separation(spn, nodes, context)

The feature "sepal length" is moderately separated by the clustering. The variances of the nodes 0, 2 are significantly larger then the average node. The means of the nodes 1, 2 are significantly larger then the average node. The means of the nodes 0, 3 are significantly smaller then the average node.

The feature "sepal width" is moderately separated by the clustering. The variance of node 0 is significantly larger then the average node. The mean of node 0 is significantly larger then the average node. The means of the nodes 1, 3 are significantly smaller then the average node.

The feature "petal length" is strongly separated by the clustering. The variances of the nodes 0, 2 are significantly larger then the average node. The means of the nodes 1, 2 are significantly larger then the average node. The mean of node 0 is significantly smaller then the average node.

The feature "petal width" is strongly separated by the clustering. The variances of the nodes 0, 2 are significantly larger then the average node. The means of the nodes 1, 2 are significantly larger then the average node. The mean of node 0 is significantly smaller then the average node.

An analysis of the distribution of categorical variables is given below. If a cluster or a group of clusters captures a large fraction of the total likelihood of a categorical instance, it can be interpreted as representing this instance and the associated distribution.

In [13]:
reload(descr)

descr.categoricals = 'all'

descr.node_categorical_description(spn, dictionary)

Distribution of scientific name

90.48% of "['Iris-setosa']" is captured by node 0. The probability of "['Iris-setosa']" for this node is 95.0%.

92.0% of "['Iris-versicolor']" is captured by nodes 1 and 3. The probability of "['Iris-versicolor']" for this group of nodes is 70.61%.

The feature "scientific name" is not separated well along the primary clusters.

Correlations by cluster

Finally, since each node captures different interactions between the features, it is interesting to look at the correlations again, this time for the separate nodes. Shallow nodes are omitted, because the correlation of independent variables is always 0.

In [14]:
reload(descr)

descr.correlation_threshold = 0.1
descr.nodes = 'all'

descr.node_correlation(spn, dictionary)

Correlations for node 0

No features show more than a very weak correlation.

Correlations for node 1

No features show more than a very weak correlation.

Correlations for node 2

No features show more than a very weak correlation.

Correlations for node 3

No features show more than a very weak correlation.


Predictive data analysis

In [15]:
reload(util)
numerical_data, categorical_data = util.get_categorical_data(spn, df, dictionary)

After the cluster description, the data model is used to predict data points. To evaluate the performance of the model, the misclassification rate is shown below.

The classified data points are used to analyze more advanced patterns within the data, by looking first at the misclassified points, and then at the classification results in total.

In [16]:
descr.classify = 'all'

misclassified, data_dict = descr.classification(spn, numerical_data, dictionary)

For feature "scientific name" the SPN misclassifies 16 instances, resulting in a precision of 89.33%.

Below, the misclassified examples are explained using the clusters they are most associated with. For each instance, the clusters which make up 90% of the prediction are reported together with the representatives of these clusters.

In [17]:
# IMPORTANT: Only set use_shapley to true if you have a really powerful machine
reload(descr)
reload(p)

descr.use_shapley = False
descr.shapley_sample_size = 1
descr.misclassified_explanations = 1

descr.describe_misclassified(spn, dictionary, misclassified, data_dict, numerical_data)

Instance 119 was predicted as "['Iris-versicolor']", even though it is "['Iris-virginica']", because it was most similar to the following clusters: 0, 3, 2, 1

Information gain through features

The following graphs highlight the relative importance of different features for a classification. They show how the different classes are predicted. For continuous and discrete features, a high positive or negative importance indicates that changing this feature's value in the positive or negative direction increases the certainty of the prediction.

For categorical features, positive and negative values highlight whether changing or keeping the categorical value increases or decreases the certainty of the prediction.
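
A simple way to build intuition for this kind of importance is a manual sensitivity check: perturb one feature of a data point and observe how the joint likelihood under the model changes. The sketch below only uses SPFlow's likelihood function and is not the Shapley-based explanation-vector computation performed by the framework.

In [ ]:
# Sketch: manual sensitivity check. Perturb "sepal length" (column 0) of one
# data point and watch how the joint likelihood changes. This is not the
# explanation-vector computation used by the report framework.
from spn.algorithms.Inference import likelihood

point = numerical_data[0:1].copy().astype(float)
base = likelihood(spn, point)[0, 0]
for delta in (-0.5, 0.5):
    perturbed = point.copy()
    perturbed[:, 0] += delta
    print('shifting "sepal length" by', delta, 'changes the likelihood from',
          round(base, 4), 'to', round(likelihood(spn, perturbed)[0, 0], 4))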

In [18]:
reload(descr)
reload(p)

descr.explanation_vector_threshold = 0.2
descr.explanation_vector_classes = [4]
descr.explanation_vectors_show = 'all'

expl_vectors = descr.explanation_vector_description(spn, dictionary, data_dict, categoricals, use_shap=True)

Class "scientific name": "['Iris-setosa']"

Predictive continuous feature "sepal length": "7.9"
Predictive continuous feature "sepal width": "4.4"
Predictive continuous feature "petal length": "6.9"
Predictive continuous feature "petal width": "2.5"

Class "scientific name": "['Iris-versicolor']"

Predictive continuous feature "sepal length": "7.9"
Predictive continuous feature "sepal width": "4.4"
Predictive continuous feature "petal length": "6.9"
Predictive continuous feature "petal width": "2.5"

Class "scientific name": "['Iris-virginica']"

Predictive continuous feature "sepal length": "7.9"
Predictive continuous feature "sepal width": "4.4"
Predictive continuous feature "petal length": "6.9"
Predictive continuous feature "petal width": "2.5"

Conclusion

In [19]:
reload(descr)
descr.print_conclusion(spn, dictionary, corr, nodes, separations, expl_vectors)

This concludes the automated report on the SumNode_0 dataset.

The initial findings show that the following variables have significant connections with each other.

  • "petal length" - "sepal length"
  • "petal length" - "sepal width"
  • "petal width" - "sepal length"
  • "petal width" - "sepal width"
  • "petal width" - "petal length"
  • "scientific name" - "sepal length"
  • "scientific name" - "sepal width"
  • "scientific name" - "petal length"
  • "scientific name" - "petal width"

The initial clustering performed by the algorithm separates the following features well:

  • petal length
  • petal width

If you want to explore the dataset further, you can use the interactive notebook to add your own queries to the network.

Dive into the data

Use the Facets Interface to visualize data on your own. You can either load the dataset itself, or show the data as predicted by the model.

In [20]:
# Load the dataset and convert it to JSON for sending to the visualization
import pandas as pd
df = pd.read_csv(dataset)
jsonstr = df.to_json(orient='records')

# Display the Dive visualization for this data
from IPython.core.display import display, HTML

HTML_TEMPLATE = """<link rel="import" href="/nbextensions/facets-dist/facets-jupyter.html">
        <facets-dive id="elem" height="600"></facets-dive>
        <script>
          var data = {jsonstr};
          document.querySelector("#elem").data = data;
        </script>"""
html = HTML_TEMPLATE.format(jsonstr=jsonstr)
display(HTML(html))

Build your own queries

This notebook enables you to add your own analysis to the above. Maybe you are interested in drilling down into specific subclusters of the data, or you want to evaluate additional data points not represented in the training data.

In [21]:
from spn.algorithms.Inference import likelihood

# get a sample to evaluate
data_point = numerical_data[1:2]
# evaluate the model's joint density at this point
proba = likelihood(spn, data_point)

printmd(data_point)
printmd(proba)

[[4.9 3. 1.4 0.2 0. ]]

[[2.34341865]]

You can also evaluate the likelihood of several data points at the same time.

In [22]:
data_point = numerical_data[0:3]
proba = likelihood(spn, data_point)

printmd(data_point)
printmd(proba)

[[5.1 3.5 1.4 0.2 0. ] [4.9 3. 1.4 0.2 0. ] [4.7 3.2 1.3 0.2 0. ]]

[[5.55563308] [2.34341865] [1.17014867]]
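
Finally, the model also supports marginal queries. In SPFlow, entries set to NaN are marginalized out, so you can evaluate the likelihood of a partial data point. A short sketch, assuming this NaN-marginalization convention of SPFlow's likelihood function:

In [ ]:
# Sketch: marginal query. NaN entries are marginalized out, so this
# evaluates the likelihood of the observed features only.
partial_point = numerical_data[0:1].copy().astype(float)
partial_point[:, 2:] = np.nan   # ignore petal length, petal width and the class
printmd(partial_point)
printmd(likelihood(spn, partial_point))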