This notebook shows the use of Azure AutoML on a (fake) variant data-set of multiple patients / samples.
The fake data-set is generated within the notebook. Its features are explained below.
The code used for AutoML is mainly taken from the official tutorial on how to predict taxi fares.
The notebook was run from within an Azure ML Notebook that's connected to a workspace. See this tutorial for how to get started.
The Explainer Dashboard is used towards the end, but it needs to be installed and enabled right at the start, officially before the kernel is started. Since I was running this from within the ML Notebook, I followed the instructions and restarted the kernel after the installation. This is more straightforward if you are running your own Jupyter server, e.g. on the Azure DSVM.
See also the official docs on Model interpretability with Azure Machine Learning service and this GitHub issue.
! echo $PATH
! ps aux | grep jupyter | grep -v grep
# not sure if sudo is needed. if yes, prepend /anaconda/envs/azureml_py36/bin/
! jupyter nbextension install --py --sys-prefix azureml.contrib.explain.model.visualize
! jupyter nbextension enable --py --sys-prefix azureml.contrib.explain.model.visualize
! conda install -y nb_conda
microsoft-mli-widget is the widget we want.
! jupyter nbextension list
import pandas as pd
import numpy as np
# number of samples
num_samples = 500
# number of cases
num_cases = 200
# number of overall sites
num_sites = 100
# number of positions causal for phenotype
num_causal_sites = 2
# background mutation frequency
bg_mut_freq = 0.1
# create a random variant matrix containing: 0 ref, 1 alt het, 2 alt hom
mat = np.random.choice(size=(num_samples, num_sites),
                       a=[0, 1, 2],
                       p=[1-bg_mut_freq, bg_mut_freq/2, bg_mut_freq/2])
# assign gender randomly
gender_transl = {1: 'male', 2: 'female', 3: 'unknown'}
gender_codes = list(gender_transl.keys())
gender = np.random.choice(size=(num_samples), a=gender_codes)
# determine which gender got unlucky ('unknown' is excluded; filter on the label, not the integer key)
affected_gender = np.random.choice([k for k, v in gender_transl.items() if v != 'unknown'])
print("Affected gender = {} ({})".format(affected_gender, gender_transl[affected_gender]))
# pick causal sites
import random
causal_sites = sorted(random.sample(range(num_sites), num_causal_sites))
print(causal_sites, " zero offset")
# set phenotype status to 0 for all samples
status = np.zeros(num_samples, dtype=int)
# pick cases randomly (saved in status array). set gender to affected gender.
# make sure that at least one randomly chosen causal site is set to alt hom
for r in random.sample(range(num_samples), num_cases):
    status[r] = 1
    if not any([mat[r, c]==2 for c in causal_sites]):
        i = random.choice(causal_sites)
        mat[r, i] = 2
    gender[r] = affected_gender
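The cell above does not set a random seed, so the causal sites and the case assignment change on every run. Below is a minimal sketch of a seeded variant with a sanity check. The function `simulate_variants` and the `*_chk` names are made up for this illustration; it uses NumPy's `default_rng` instead of the `random` module and leaves gender out for brevity.

```python
import numpy as np

def simulate_variants(num_samples=500, num_cases=200, num_sites=100,
                      num_causal_sites=2, bg_mut_freq=0.1, seed=0):
    """Seeded sketch of the simulation above (genotypes and status only)."""
    rng = np.random.default_rng(seed)
    # 0 = ref, 1 = alt het, 2 = alt hom
    mat = rng.choice([0, 1, 2], size=(num_samples, num_sites),
                     p=[1 - bg_mut_freq, bg_mut_freq / 2, bg_mut_freq / 2])
    # causal sites as zero-offset column indices
    causal_sites = np.sort(rng.choice(num_sites, size=num_causal_sites, replace=False))
    status = np.zeros(num_samples, dtype=int)
    case_rows = rng.choice(num_samples, size=num_cases, replace=False)
    status[case_rows] = 1
    for r in case_rows:
        # force one causal site to alt hom if the case has none yet
        if not (mat[r, causal_sites] == 2).any():
            mat[r, rng.choice(causal_sites)] = 2
    return mat, status, causal_sites

mat_chk, status_chk, causal_chk = simulate_variants(seed=0)

# every case must carry at least one hom-alt genotype at a causal site
cases_ok = (mat_chk[status_chk == 1][:, causal_chk] == 2).any(axis=1).all()
# controls can carry one by chance
controls_hom = (mat_chk[status_chk == 0][:, causal_chk] == 2).any(axis=1).mean()
print("all cases carry a hom-alt causal genotype:", cases_ok)
print("fraction of controls that also do:", round(float(controls_hom), 3))
```

With a background hom-alt frequency of 0.05 per site and two causal sites, about 1 - 0.95^2, i.e. roughly 10%, of controls are expected to carry a hom-alt causal genotype by chance, so genotype alone cannot separate cases from controls perfectly.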
Put it all into a pandas DataFrame.
cols = ["site-{:d}".format(i+1) for i in range(num_sites)]
df = pd.DataFrame(data=mat, columns=cols)
ids = ["sample-{:d}".format(i+1) for i in range(num_samples)]
df.insert(loc=0, column='ID', value=ids)
df.insert(loc=1, column='Gender', value=gender)
df.insert(loc=2, column='Status', value=status)
df
df.shape
# save for later use
df.to_csv("sample_matrix_clean.csv", index=False)
import pandas as pd
df = pd.read_csv("sample_matrix_clean.csv")
df
# keep a copy of the "annotation" columns. ID and Status are dropped from the features below; Gender stays in as a feature
annotation = df[["ID", "Status", "Gender"]].copy()
df = df.drop(["ID", "Status"], axis=1)
# Split into training and test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df, annotation['Status'], test_size=0.2, random_state=42)
A very cool feature in AutoML is automatic preprocessing (see preprocess below), which can automatically impute missing values, encode values, add features, embed words etc. See here for more information. Since the data-set here is already clean, there is no need for this. Plus, I couldn't get the Explainer below to work when preprocessing was on...
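With automatic preprocessing switched off, the integer gender codes go into the model as plain numbers. For comparison, here is a minimal sketch of what two of those steps (encoding and imputation) could look like when done by hand with pandas. The `toy` frame and its values are made up, and this is not what AutoML does internally.

```python
import pandas as pd

# Toy frame with the same column layout as the notebook's feature table
# (values are made up for illustration).
toy = pd.DataFrame({
    "Gender": [1, 2, 3, 1],
    "site-1": [0, 2, 0, 1],
    "site-2": [0, 0, 2, 0],
})

# Gender is a category stored as an integer code; one-hot encode it so that
# a model does not read 1 < 2 < 3 as an ordering.
encoded = pd.get_dummies(toy, columns=["Gender"], prefix="Gender")

# Missing genotypes (there are none in the fake data) could be imputed as reference (0).
encoded = encoded.fillna(0)
print(encoded.columns.tolist())
```

Whether the Explainer copes better with such hand-made columns than with AutoML's own preprocessing was not tested here.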
import logging
automl_settings = {
    "iteration_timeout_minutes": 1,
    "iterations": 10,
    "primary_metric": 'accuracy',
    "preprocess": False,
    "verbosity": logging.INFO,
    "n_cross_validations": 5
}
from azureml.train.automl import AutoMLConfig
automl_config = AutoMLConfig(task='classification',
                             debug_log='automated_ml_errors.log',
                             X=X_train.values,
                             y=y_train.values.flatten(),
                             **automl_settings)
Connect to the ML workspace on Azure so that everything is logged there as well
from azureml.core import Workspace
ws = Workspace.from_config()
from azureml.core.experiment import Experiment
experiment = Experiment(ws, "vcf-classification-local")
local_run = experiment.submit(automl_config, show_output=True)
from azureml.widgets import RunDetails
RunDetails(local_run).show()
best_run, fitted_model = local_run.get_output()
print(best_run)
print(fitted_model)
y_predict = fitted_model.predict(X_test.values)
print("Sample\tPredicted\tActual")
for idx, (dfidx, dfrow) in enumerate(X_test.iterrows()):
    print("{}\t{}\t{}".format(annotation.at[dfidx, 'ID'], y_predict[idx], annotation.at[dfidx, 'Status']))
    # top 10 is enough
    if idx == 9:
        break
print("...")
# idea from https://datatofish.com/confusion-matrix-python/
# y_test shares its index and row order with X_test, so the actual labels can be taken from it directly
y_actual = y_test.values
data = {'y_Predicted': y_predict,
        'y_Actual': y_actual}
df = pd.DataFrame(data, columns=['y_Actual','y_Predicted'])
# stats
from pandas_ml import ConfusionMatrix
Confusion_Matrix = ConfusionMatrix(df['y_Actual'], df['y_Predicted'])
Confusion_Matrix.print_stats()
confusion_matrix = pd.crosstab(df['y_Actual'], df['y_Predicted'],
                               rownames=['Actual'], colnames=['Predicted'])
print(confusion_matrix)
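The basic numbers behind such a confusion matrix can also be computed without extra packages. The following is a minimal sketch using only the standard library; `binary_metrics` is a hypothetical helper and the labels in the example call are made up.

```python
from collections import Counter

def binary_metrics(y_actual, y_predicted, positive=1):
    """Return the confusion counts and basic metrics for a binary classifier."""
    counts = Counter(zip(y_actual, y_predicted))
    tp = sum(n for (a, p), n in counts.items() if a == positive and p == positive)
    fn = sum(n for (a, p), n in counts.items() if a == positive and p != positive)
    fp = sum(n for (a, p), n in counts.items() if a != positive and p == positive)
    tn = sum(n for (a, p), n in counts.items() if a != positive and p != positive)
    total = tp + tn + fp + fn
    return {
        "TP": tp, "FN": fn, "FP": fp, "TN": tn,
        "accuracy": (tp + tn) / total if total else float("nan"),
        "precision": tp / (tp + fp) if tp + fp else float("nan"),
        "recall": tp / (tp + fn) if tp + fn else float("nan"),
    }

# Made-up labels, only to show the call.
example = binary_metrics([1, 1, 0, 0, 1, 0], [1, 0, 0, 0, 1, 1])
print(example)
```

In the notebook, the call would be `binary_metrics(df['y_Actual'], df['y_Predicted'])`, with status 1 (case) as the positive class.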
# idea from https://stackoverflow.com/questions/19233771/sklearn-plot-confusion-matrix-with-labels/48018785
import seaborn as sn
import matplotlib.pyplot as plt
ax = plt.subplot()
sn.heatmap(confusion_matrix, annot=True, ax = ax); #annot=True to annotate cells
# labels, title and ticks
ax.set_xlabel('Predicted');
ax.set_ylabel('True');
ax.set_title('Confusion Matrix');
#ax.xaxis.set_ticklabels(['business', 'health']);
#ax.yaxis.set_ticklabels(['health', 'business']);
Microsoft has six guiding AI principles. One of these is transparency, which states that it must be possible to understand how AI decisions were made. This is where model interpretability comes into play. Here we will use a TabularExplainer to understand global behavior of our model.
# Note: the explainer doesn't work if preprocessing was used, because the input column names
# cannot be found in the fitted columns!?
from azureml.explain.model.tabular_explainer import TabularExplainer
# "features" and "classes" fields are optional. couldn't figure out how to use them
explainer = TabularExplainer(fitted_model, X_train)
# Now run the explainer. This takes some time...
global_explanation = explainer.explain_global(X_train)
# sorted feature importance values and feature names
sorted_global_importance_values = global_explanation.get_ranked_global_values()
sorted_global_importance_names = global_explanation.get_ranked_global_names()
## dict(zip(sorted_global_importance_names, sorted_global_importance_values))
# dictionary that holds the top K feature names and values
feature_importance = global_explanation.get_feature_importance_dict()
#for site, val in sorted(global_explanation.get_feature_importance_dict().items(), key=lambda x: x[1]):
# print(site, val)
print("Top 10: ", ", ".join(sorted_global_importance_names[:10]))
print("Real causal sites {}".format(", ".join(["site-{:d}".format(i+1) for i in causal_sites])))
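Instead of comparing the two printed lists by eye, the rank of each true causal site can be looked up directly. This is a minimal sketch; `causal_site_ranks` is a hypothetical helper, and it assumes the ranked names are the DataFrame column names (e.g. `site-8`).

```python
def causal_site_ranks(ranked_names, causal_sites):
    """Map each true causal site name to its 1-based rank in the importance list."""
    rank_of = {name: rank for rank, name in enumerate(ranked_names, start=1)}
    # causal_sites holds zero-offset indices, column names are one-based
    wanted = ["site-{:d}".format(i + 1) for i in causal_sites]
    # None means the site does not appear in the ranking at all
    return {name: rank_of.get(name) for name in wanted}

# Made-up ranking, only to show the call; in the notebook one would pass
# sorted_global_importance_names and causal_sites.
ranks = causal_site_ranks(["site-8", "Gender", "site-42", "site-3"], [7, 41])
print(ranks)
```

Low ranks for all causal sites would suggest that the explainer recovered the planted signal.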
from azureml.contrib.explain.model.visualize import ExplanationDashboard
dashboard = ExplanationDashboard(global_explanation, fitted_model, X_train)
Nothing shows? See this GitHub issue.
Find the gender bias in the Explainer Dashboard...
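One way to cross-check what the dashboard suggests is to look at the case rate per gender code in the data itself. Here is a minimal sketch on a made-up table; `toy_annotation` is hypothetical, and in the notebook the `annotation` frame already holds the same two columns.

```python
import pandas as pd

# Toy annotation table with made-up values.
toy_annotation = pd.DataFrame({
    "Gender": [1, 1, 2, 2, 3, 3, 1, 2],
    "Status": [1, 1, 0, 0, 0, 1, 1, 0],
})

# Fraction of cases (Status == 1) within each gender code.
case_rate = toy_annotation.groupby("Gender")["Status"].mean()
print(case_rate)
```

A case rate that differs strongly between gender codes would match a high importance of the Gender feature in the dashboard.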