For this PHYS291 project I hoped to explore different artificial neural network architectures and configurations, using the tools provided with ROOT's own Toolkit for Multivariate Analysis (TMVA), to classify Higgs to tau-tau events in the simulated dataset from the 2014 Higgs Boson Machine Learning Challenge.

Link to macros, models and data

The concept of artificial neural networks stems from the neurological structure of the brain, though it has evolved far from those roots. The neural network accepts a vector of values {x_0, ..., x_n}, each of which connects to a set of nodes, where the inputs are combined linearly using a set of weights. The sum of the weighted values in a given node is passed through a differentiable, bounded activation function (usual choices include the sigmoid and the hyperbolic tangent), which determines how strongly the node activates and contributes to the next layer.

In a single-layer feedforward network there is one layer of these nodes, which feed into a single output unit giving the probability for a given event to be signal or background. A multilayer neural network instead feeds the values from these nodes into another layer of nodes, which has its own set of weights for the linear combination and takes the values from the previous layer as its inputs. The number of layers gives the *depth* of the neural network, whilst the number of nodes in each layer gives the *width* (which can vary between layers).
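The forward pass described above can be sketched in a few lines of plain Python. This is a minimal illustration and not TMVA's implementation: the bias terms, the sigmoid on the output unit and all numerical weights are assumptions chosen only to make the example run.

```python
import math


def layer_forward(inputs, weights, biases, activation=math.tanh):
    """One layer: each node applies the activation to a weighted sum of the inputs."""
    return [
        activation(sum(w * x for w, x in zip(node_weights, inputs)) + b)
        for node_weights, b in zip(weights, biases)
    ]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def mlp_forward(x, hidden_layers, output_layer):
    """hidden_layers: list of (weights, biases); output_layer: (weights, bias)."""
    values = x
    for weights, biases in hidden_layers:
        values = layer_forward(values, weights, biases)
    out_weights, out_bias = output_layer
    # Assumption: a sigmoid output unit, so the score lies between 0 and 1.
    return sigmoid(sum(w * v for w, v in zip(out_weights, values)) + out_bias)


# Two inputs, depth 2, widths 3 and 2. All weights are illustrative only.
hidden = [
    ([[0.5, -0.2], [0.1, 0.4], [-0.3, 0.8]], [0.0, 0.1, -0.1]),
    ([[0.2, -0.5, 0.3], [0.7, 0.1, -0.4]], [0.05, -0.05]),
]
output = ([1.2, -0.9], 0.0)
score = mlp_forward([0.3, -1.1], hidden, output)
```

Adding an entry to `hidden` increases the depth, and adding rows to a layer's weight matrix increases its width.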

The determination of the weights is done through minimising a loss function:
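One common choice, assumed here for illustration, is the sum-of-squares error over the N training events (the per-event term L_i, the event count N and the factor of 1/2 are notational assumptions):

```latex
L = \sum_{i=1}^{N} L_i ,
\qquad
L_i = \frac{1}{2} \bigl( y(x_i) - y(C) \bigr)^{2}
```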

where y(x_i) is the predicted event type and y(C) is the true event type for a given event. The loss function is minimised through back-propagation, in which the weights are adjusted in the direction of steepest descent of the loss function:
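Written out for a single weight w_{jk} (the index labels are an assumption for illustration), the update takes the standard gradient-descent form:

```latex
w_{jk} \;\leftarrow\; w_{jk} - \eta \, \frac{\partial L}{\partial w_{jk}}
```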

where η (eta) is the learning rate of the model, a positive number set when the training is initialised.

TMVA is an extension to the ROOT framework which provides an environment with a number of easily accessible machine learning models. Of these, I chose to focus on the artificial neural networks, for which there are three different implementations. The most recent implementation available in ROOT ("kMLP") was used for this project.

The Kaggle dataset was converted from a .csv file into a ROOT file, with the "Label" and "Set" columns changed into numerical values for easier interpretation by ROOT. The training and testing sets are written to separate .root files, both of which are used to load signal and background trees into a factory object, which handles the training, testing and evaluation of methods.
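The label conversion step can be sketched as follows. This is a hedged illustration only: the letter codes ("s"/"b" for the label, "t"/"b"/"v"/"u" for the set), the numeric values they map to and the sample rows are assumptions, and writing the result to a ROOT file is left out.

```python
import csv
import io

# Hypothetical mappings: the numeric codes are illustrative choices.
LABEL_CODES = {"s": 1, "b": 0}
SET_CODES = {"t": 0, "b": 1, "v": 2, "u": 3}


def encode_rows(csv_text, label_col="Label", set_col="Set"):
    """Parse CSV text and replace the two text columns by numeric codes."""
    rows = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        encoded = {}
        for name, value in row.items():
            if name == label_col:
                encoded[name] = LABEL_CODES[value]
            elif name == set_col:
                encoded[name] = SET_CODES[value]
            else:
                encoded[name] = float(value)
        rows.append(encoded)
    return rows


# Made-up sample rows; the column "DER_mass_MMC" is used only as an example.
sample = "EventId,DER_mass_MMC,Label,Set\n100000,138.47,s,t\n100001,-999.0,b,v\n"
rows = encode_rows(sample)
training = [r for r in rows if r["Set"] == SET_CODES["t"]]
```

Once every column is numeric, splitting into training and testing sets reduces to a filter on the encoded "Set" value.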

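To explore the various neural network architectures, a number of models were trained using the macro TrainNetwork.C, changing the setup to investigate a few key hyperparameters and preprocessing techniques, namely input normalisation, the number of hidden layers, the width of the first hidden layer, restricting the input to the top 10 variables, and the split between training and testing data.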

The evaluation of the networks was based on the Receiver Operating Characteristic (ROC) curves produced by the macros in the TMVAGUI. These plot the background rejection (1 - background efficiency) versus the signal efficiency at various points when cutting on the classifier output. The integral of the ROC curve measures how well the classifier separates signal from background: a value close to 1 indicates good separation, while 0.5 corresponds to random guessing. The neuron activation function was held constant as tanh(x) across all training sessions. 30000 signal events and 30000 background events were used to train each of the models, so as to limit training time. The weight expression used was the normalised weight over each set (given as "KaggleWeight" for each event). The top 10 variables were determined from a training run with 2 hidden layers.
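To make the ROC construction concrete, here is a minimal sketch that scans a cut over the classifier output and integrates background rejection against signal efficiency with the trapezoidal rule. It is unweighted for simplicity (the training described above used event weights), the scores are made up, and it is not the TMVAGUI code.

```python
def roc_points(signal_scores, background_scores):
    """Return (signal efficiency, background rejection) pairs, one per cut value."""
    cuts = sorted(set(signal_scores) | set(background_scores))
    points = [(1.0, 0.0)]  # cut below every score: all events are kept
    for cut in cuts:
        sig_eff = sum(s > cut for s in signal_scores) / len(signal_scores)
        bkg_eff = sum(b > cut for b in background_scores) / len(background_scores)
        points.append((sig_eff, 1.0 - bkg_eff))
    # Order by rising efficiency; at equal efficiency, highest rejection first.
    return sorted(points, key=lambda p: (p[0], -p[1]))


def roc_integral(points):
    """Trapezoidal area under background rejection versus signal efficiency."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Made-up classifier outputs for a perfectly separating and a useless classifier.
perfect = roc_integral(roc_points([0.8, 0.9], [0.1, 0.2]))
useless = roc_integral(roc_points([0.1, 0.9], [0.1, 0.9]))
```

A classifier that separates the two samples completely gives an integral of 1, and one whose signal and background scores are identically distributed gives 0.5.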

*Figure 1: ROC-Curve, Normalised/Not Normalised data, 2 Hidden layers with N nodes*

*Figure 2: ROC-Curve, 1/2 Hidden layers with N nodes*

*Figure 3: ROC-Curve, 2 Hidden layers with N+10/N nodes in the first layer*

*Figure 4: ROC-Curve, 2 Hidden layers with Top 10 variables*

Using a 50% training to 50% testing allocation seemed to produce a slightly less effective classification than the more conventional 80% training to 20% testing. This might be because the larger testing set covers more edge cases for which the model was not appropriately trained, hence returning a smaller ROC integral.

One of the clearest results from this exploration is the need for normalisation, at least in the case of this dataset. This could be due to the wide variation in the ranges of the variables (some spanning [0, 3], others [-999, 999]). Normalising these allows the network to respond equally to all variables, giving a more uniform training process. Normalisation when loading the tree also seems necessary to maximise the ROC integral.
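As an illustration of what such a normalisation does, the sketch below rescales each variable linearly to [-1, 1]. The choice of a min-max mapping onto that interval is an assumption, and whether it matches TMVA's own transform in every detail is not verified here. The event values are made up.

```python
def normalise_columns(events):
    """Linearly rescale each variable (column) to the interval [-1, 1]."""
    columns = list(zip(*events))
    scaled_columns = []
    for column in columns:
        lo, hi = min(column), max(column)
        span = hi - lo
        if span == 0:
            # A constant variable carries no information; map it to 0.
            scaled_columns.append([0.0] * len(column))
        else:
            scaled_columns.append([2.0 * (v - lo) / span - 1.0 for v in column])
    return [list(row) for row in zip(*scaled_columns)]


# Three made-up events with one narrow-range and one wide-range variable.
events = [[0.5, -999.0], [3.0, 120.0], [1.75, 999.0]]
scaled = normalise_columns(events)
```

After scaling, both variables span the same interval, so neither dominates the weighted sums in the first layer purely because of its units.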

One surprising result is how unaffected the networks seem by any change in the width of the first layer. A wider first layer should have produced a visible improvement in the ROC curve, but instead it is slightly less effective than two layers with the same number of nodes. A plausible explanation could be that, given the large number of variables for each event, the additional nodes are superfluous. This carries over to the number of layers, with the curves being nearly identical for an MLP with one or two layers.

Using the top 10 variables produced varying results during training, with the integral still being slightly larger than that of the full variable set. This should not be the case, as the loss of information from the remaining variables should make the classification less accurate.

Considering the architecture graph for the top 10 variables (for which the integral of the ROC Curve is approximately the same as the full set of variables):

*Figure 5: Network Architecture Graph, Top 10 variables*

It seems to show that only one of the nodes in the last hidden layer contributes to the output. This is further exemplified by the network architecture of a more complete model:

*Figure 6: Network Architecture Graph, All variables*

Here only two nodes of the final hidden layer contribute to the final model.

Implementing convergence tests in the training seemed to shorten training time considerably, and should have been considered for more of the models. Considering the convergence graph for a model trained on all the data:

*Figure 7: Convergence graph over 1000 cycles, All data*

It is somewhat concerning that this is the case, as it implies that the model finds a minimum which is not reflective of the true classification.
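The idea behind a convergence test can be sketched as a simple stopping rule: training ends once the loss has failed to improve for a set number of consecutive checks. The function below is a stand-in that reads a precomputed loss curve instead of training a network, and the parameter names `patience` and `min_improvement` are hypothetical, not TMVA options.

```python
def train_with_convergence_test(loss_per_cycle, patience=5, min_improvement=1e-4):
    """Stop once the loss has failed to improve for `patience` consecutive checks."""
    best = float("inf")
    stalled = 0
    cycles_run = 0
    for loss in loss_per_cycle:
        cycles_run += 1
        if best - loss > min_improvement:
            best = loss
            stalled = 0
        else:
            stalled += 1
            if stalled >= patience:
                break
    return cycles_run, best


# Hypothetical loss curve: a fast initial drop, then a plateau up to 1000 cycles.
losses = [1.0 / (1 + c) for c in range(20)] + [0.05] * 980
cycles, best = train_with_convergence_test(losses)
```

With this made-up curve the loop stops after 25 of the 1000 cycles, which is the kind of saving in training time that such a test offers. It also stops at whatever minimum was reached first, whether or not that minimum is a good one.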

The use of KaggleWeight as a weight expression might have been misguided, as the original dataset provided to competitors excluded both weight and normalised weight as variables to be used during training. This might have led to certain events being deprioritised during the training, and reduction of importance for some of the variables.

Some variables in the dataset use the value -999 to signify that the variable is undefined or does not exist for a given event. This might have affected the training, as the normalised variables used in most of the training runs might give a skewed result with regard to these variables. It might also be the reason the reader does not function properly, as ROOT interprets this value as "Not a Number". The lack of a working reader is regrettable, as it leaves the trained models without a mode of application.
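One way to see the skew is to compare the range of a variable with and without the -999 placeholder. The sketch below excludes the placeholder before rescaling to [-1, 1]. Marking undefined entries as `None` is an illustrative choice and not something the training described here did, and the values are made up.

```python
MISSING = -999.0


def normalise_ignoring_missing(column, missing=MISSING):
    """Scale defined values to [-1, 1]; leave undefined entries as None."""
    defined = [v for v in column if v != missing]
    if not defined:
        return [None] * len(column)
    lo, hi = min(defined), max(defined)
    span = hi - lo
    out = []
    for v in column:
        if v == missing:
            out.append(None)
        elif span == 0:
            out.append(0.0)
        else:
            out.append(2.0 * (v - lo) / span - 1.0)
    return out


# Made-up variable with one undefined entry.
column = [-999.0, 50.0, 100.0, 150.0]
naive_span = max(column) - min(column)  # range dominated by the placeholder
scaled = normalise_ignoring_missing(column)
```

Here the physical values span only 100 units, but the placeholder stretches the apparent range to 1149, so a naive normalisation would squeeze all the defined values into a narrow band near the upper end of the interval.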

This might have been far too ambitious a project to set out on without a firmer grasp of multivariate methods and machine learning. Most of the results found seem to be in opposition to the current understanding of neural networks (both width and depth should contribute to a greater extent than they do). One somewhat plausible explanation might be that all the training runs found similar local minima in the function space, given the quick convergence and similar ROC curves and integrals. This does not explain why the last hidden layers had so few activated nodes, for which no reasonable explanation can be given.

- The *write* shell command seems to disrupt the training (other commands might do so too)
- For the plots requiring an MLP method in the GUI, "MLP" should be in the title for these plots to be available
