Let’s Do: Neural Networks



A blow-by-blow account of building a neural network in PyTorch to predict insurance claims frequency

Photo by Skye Studios on Unsplash

Background

In this seventh — yes, seventh — instalment on neural networks¹, I finally get around to building one.

About time!

Anyway, to recap what we’ve learned so far:

  1. Neural networks are made up of neurons or nodes.
  2. These are arranged into an input layer (where the data goes in), an output layer (where the answer comes out), and hidden layers (where the magic happens). Hidden layers are the layers in the network which are neither an input nor an output layer.
  3. Weights are scalar values which connect the nodes of the network together. Weights indicate the strength of the relationship between two nodes. The larger the absolute weight, the stronger the connection.
  4. Each node can have a bias term. This is a scalar which helps the model learn patterns which might not pass through the “origin”.
  5. Each node also has an activation function operating on it. This is a function which transforms the node input, weights, and biases into an output. Activation functions are key to introducing non-linearity into the model.
  6. The connected nature of the node output (i.e. output of activation functions) means that we can think of the neural network as a nested function, with parameters being the weights and biases.
  7. A cost function is something that measures how close the model predictions are to the truth. There are various cost functions in use and in most cases, the lower the cost the better the model.
  8. We identify the contribution of each weight and bias to the model cost using calculus and calculating derivatives. This is the essence of backpropagation.
  9. We can combine the cost function, backpropagation, and specialised optimisation algorithms to train the model.
  10. This “learning” procedure is effectively an iterative process to identify the set of weights and biases which minimise the cost function (subject to certain constraints).

Now, we’re going to combine what we know — and probably learn a little more along the way — to build a neural network of our own. Before we get into the technical bits, let’s talk about the framework I chose to use: PyTorch.

PyTorch

PyTorch² is a framework for building machine learning models in Python.

As it’s a framework, it provides the tools and building blocks that allow the creation of advanced machine learning models like neural networks. Whilst a lot of the functionality is pre-built, the user still needs to assemble pieces together to form a model structure and training procedure.

The three PyTorch modules

PyTorch functionality largely falls within three modules.

autograd provides the functionality for (automatic) calculation of derivatives and gradients. Useful if, like me, you haven’t had to stretch your calculus muscles in a while.

Most of the functionality needed to build neural networks can be found in nn, like Module where we define the structure of the network and how the forward pass works.

optim contains pre-built optimisation algorithms — like the popular ADAM — which facilitate model training.

PyTorch, as a newbie

PyTorch uses its own data structure called the “tensor”; these multi-dimensional structures have numpy-esque logic and syntax which makes PyTorch feel familiar.

As most of the functionality I’ve used comes pre-built, PyTorch has felt very accessible to me as a newcomer. I’ve rarely had to get too deep into the underlying mathematics to get my model up and training.

The GPU acceleration was quite straightforward. I imagine that this is largely due to using an Nvidia graphics card, but set up was quick and simple, and I was training on the GPU before I knew it.

One of the most useful aspects of PyTorch is the collection of detailed tutorials available on its website and the rich community asking (and answering!) questions posted on the forums.

PyTorch vs TensorFlow³ & Keras

Although my choice to use PyTorch wasn’t the result of an extensive research exercise, it was in part motivated by recurring forum posts which suggested that:

  1. The move from TensorFlow v1 to TensorFlow v2 significantly changed the syntax.
  2. The majority of tutorials and resources available were produced using TensorFlow v1.
  3. It might not be clear to new users which version of TensorFlow the tutorial refers to.

And while that might not be the case anymore, I embraced PyTorch (which to my knowledge has no such issues).

That being said, based on a quick look through some of the resources on the TensorFlow website there might not be too much syntactical difference between the two. Who knows, maybe one day I’ll take a proper look.

Enough fanboying, time for something tangible.

The data

We’ll be using a great data set provided by Christophe Dutang and uploaded to OpenML⁴. It’s available for use under the CC BY 4.0 licence⁵ and as we’ll soon see, we’ll be making some changes to the underlying information. Of course, we’re using this data at our own risk, outside of a commercial context.

Exploratory data analysis

The data is a collection of risk details for ~ 680,000 French motor insurance policies, along with associated third-party claims.

After a quick renaming of the columns, we have the following data set:

Image by author

Whilst I’m not 100% certain of the exact definitions, a guide⁶ detailing what looks to be another version of the data suggests that:

  • claim_nb is the number of claims during the exposure period (i.e. policy lifetime)
  • exposure is the time that a policy is exposed to risk, expressed in years. This is just a fancy actuarial way of expressing the proportion of a year for which the insurance company provides cover to a policyholder.
  • area is a density value of the city community where the car driver lives. It ranges from “A” for rural areas to “F” for urban centres.
  • region is the French policy region. It looks like some form of common or widely used classification.
  • density is a population density estimate — the number of inhabitants per km² of the city where the policyholder lives.
  • driv_age is the age of the driver (in years).
  • bonus_malus is the bonus-malus rating for the policyholder. Apparently in France, values < 100 = bonus & > 100 = malus. Bonus malus is a Latin term which translates to “good bad”. It’s essentially a system which rewards a customer for good claims behaviour and penalises them for bad claims behaviour — i.e. if you have fewer insurance claims, you can expect to pay less for your insurance cover. In the UK, this is also called a “no claims discount”.
  • veh_power is the power of the car, grouped into ordered categories.
  • veh_age is the age of vehicle, in years.
  • veh_brand is the vehicle brand, grouped into various subgroups.
  • veh_gas is the vehicle fuel type. In this case, we have no electric or hybrid vehicles of any sort.

Although quite trimmed down, this data set is fairly representative of the data sets we use to build motor insurance pricing models in the UK. Generally, we collect and use features which fall into one of four categories:

  1. Features which relate to the policy — this could be the level of cover, any promotions or discounts in force when the policy was sold, etc.
  2. Features which relate to the driver(s) of the vehicle — this could be the age of the driver, prior claim history, occupation of the main driver. Ideally, the same information would be collected for all of the drivers listed on the policy, as well as the relationship between the main driver and any other listed drivers.
  3. Features which relate to the area(s) in which the vehicle is likely to be driven — much like area and density above, these are features which relate to the physical environment. Geographic and demographic features are quite often highly predictive of insurance risk.
  4. Features which relate to the vehicle(s) being insured — think vehicle make, model and specification (and associated details) as well as other features like the vehicle age and how long the policyholder has owned the vehicle.

Anyway, enough of that — let’s get back to business.

Looking at one-way views often helps us to get a feel for trends. We’ll calculate claims frequency — the number of claims per unit exposure — viewed in aggregate.
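In pandas, the aggregation behind these views might look something like this (a sketch on my part, using the renamed columns from above; the charting itself is left out):

def one_way(df, feature):
    '''
    Claim frequency and exposure share by level of a single feature.
    '''
    g = df.groupby(feature)[['claim_nb', 'exposure']].sum()
    g['frequency'] = g['claim_nb'] / g['exposure']
    g['exposure_share'] = g['exposure'] / g['exposure'].sum()
    return g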

Firstly area, a good example of a monotonic trend:

Image by author

Here we have:

  • Yellow bars indicating the proportion of total exposure in each category of area
  • A pink line indicating the claim frequency for each category of area, expressed as a percentage.

Thinking back to the data definitions, we know that this is effectively an ordered categorical feature with “A” representing rural areas and the levels representing increasingly urban areas (“F” represents city centre).

So we see that frequency increases with area — this makes sense, as we would expect more accidents in busier areas with more traffic. We would expect the opposite to hold true for rural areas.

Aside: it’s quite common to see a higher claim frequency in more urbanised areas. However, this doesn’t mean that living in a rural area would get you cheaper insurance, as the severity of accidents in rural areas is quite often higher than those in city centres… Think of the classic minor fender benders in traffic jams versus crashing into a ditch at speed.

Let’s have a look at something that isn’t so monotonic — vehicle brand:

Image by author

Here it looks like on a one-way basis, claim frequency by vehicle brand is fairly constant at ~ 9.5% for most brands, with the exception of brands B11 and B12 which show elevated levels of frequency. B14 on the other hand, shows significantly lower claim frequency.

These are quite important observations, and we’d expect our model to pick up these trends, especially for the brands in which we see the most volume (B1, B2, and B12).

Perhaps this is also a good time to remind ourselves that while the labelling of the feature might imply some sort of ordering, this is not necessarily the case and so expecting some form of monotonic trend from it might be overreaching.

Let’s move on to preparing our data for modelling.

Data preparation

We’re working with data of various types — a mix of numeric and categorical. Since we can only feed numerical data into our neural network, we’ll need to do some form of conversion (as well as some pre-processing). We’ll also need to split the data into subsets to help us with training and validation.

To avoid any potential target leakage, we should split the data first and fit any required transformations on the training set only.

Data splitting

Quite a simple one here — we’ll use scikit-learn’s train_test_split functionality. We will do it sequentially though, so we end up with our sets as desired.

from sklearn.model_selection import train_test_split

# split out the validation set
train_x, val_x, train_y, val_y, train_evy, val_evy = train_test_split(
    df.drop(labels = ['claim_nb', 'exposure'], axis = 1),
    df['claim_nb'],
    df['exposure'],
    test_size = 0.15,
    random_state = 0,
    shuffle = True
)

# now split train and test
train_x, test_x, train_y, test_y, train_evy, test_evy = train_test_split(
    train_x,
    train_y,
    train_evy,
    test_size = 0.2,
    random_state = 0,
    shuffle = True
)

It’s always a good idea to set the random seed. Notice how we increase test_size in the second call of train_test_split: this ensures that our test set is slightly larger than our validation set (0.2 of the remaining 85% is roughly 17% of the full data, versus 15% for validation, leaving roughly 68% for training).

This results in the following data sets:

Image by author

So far, so good.

Grouping and transforming the features

It’s quite common to apply some form of grouping to input variables.

This does a few things, the most significant being that it reduces the granularity of the input data. Reduced granularity can reduce the potential predictive power of the model, as the model essentially has less information available to it.

However, reducing granularity can improve stability if the grouping is applied over low-volume-high-variability segments. So — and like most things — we need to arrive at an appropriate balance between granularity and stability.

Grouping features prior to one hot encoding can reduce the size of the resulting (encoded) data set. Hint hint.

We’ll (largely) follow the lead of this paper⁹ and apply the following bandings and transformations to the features:

  • area remains unchanged as a categorical feature, as does region.
  • density has a log transformation applied to it and remains a numeric feature.
  • driv_age is banded into the following groups and converted to a string: (0,18], (18,21], (21,25], (25,35], (35,45], (45,55], (55,70], (70, ∞].
  • bonus_malus remains unchanged.
  • veh_power is converted to a string.
  • veh_age is banded into the following groups and converted to a string: (-1,0], (0,1], (1,4], (4,10], (10, ∞].
  • veh_brand remains unchanged, as does veh_gas.

Since we’re going to be applying the same transformations three times, we’ll make our lives easier (and reduce the likelihood of manual errors) by whipping up a helper function. This looks something like:

import pandas as pd
import numpy as np

# mappers
mapper_vehpower = {j: str(int(j)) for j in df['veh_power'].unique()}
mapper_vehbrand = {j: j for j in df['veh_brand'].unique()}
mapper_vehgas = {j: j for j in df['veh_gas'].unique()}
mapper_area = {j: j for j in df['area'].unique()}
mapper_region = {j: j for j in df['region'].unique()}

# bin edges for bandings
bins_vehage = [-1, 0, 1, 4, 10, np.inf]
bins_drivage = [0, 18, 21, 25, 35, 45, 55, 70, np.inf]

# helper function
def transform_data(data):
    '''
    Function to apply transformations to input data.
    '''
    d = data.copy()

    # location features
    d['area'] = d['area'].map(mapper_area).astype(str)
    d['region'] = d['region'].map(mapper_region).astype(str)
    d['density'] = np.log(d['density'])

    # vehicle features
    d['veh_power'] = d['veh_power'].map(mapper_vehpower).astype(str)
    d['veh_age'] = pd.cut(d['veh_age'], bins = bins_vehage).astype(str)
    d['veh_brand'] = d['veh_brand'].map(mapper_vehbrand).astype(str)
    d['veh_gas'] = d['veh_gas'].map(mapper_vehgas).astype(str)

    # driver features
    d['driv_age'] = pd.cut(d['driv_age'], bins = bins_drivage).astype(str)

    return d

# apply groupings and transformations
train_x = transform_data(train_x)

Even though some features don’t strictly need transforming (like veh_gas), it can be useful to code the mapping explicitly — if nothing else, it helps you remember exactly what’s been done to each feature.

Eagle-eyed readers will notice that I’ve cheated somewhat by getting all the possible unique values that a feature may take from the combined DataFrame df, rather than the training set (like I was preaching about earlier).

I should have done it properly and dealt with any instances where there are values in the testing or validation sets which are not present in the training set. That would be much more realistic and robust.

But if we’re talking about realism, then we’re not going to be building neural networks in highly regulated and competitive environments like motor insurance if we haven’t got robust control over the data pipelines feeding the models. That means that in a real-world setting, we would know exactly all of the potential values that our features could take, and we could build appropriate transformers to handle the values.

Now, for one last thing on grouping features. Let’s take a look at (grouped) vehicle age:

Image by author

The chart above shows claim frequency by (grouped) vehicle age. We can see that claim frequency is elevated for brand new cars — i.e. where veh_age == 0.

Now if we were to step back and think of potential groupings without looking at the data, we might have been tempted to group brand new cars with fairly new cars (e.g. group 0-year-old cars with 1-year-old cars). If we did that, we’d be moderating the massive differential in claim frequency which in turn would negatively impact the model’s predictive power.

So long story short — look before you group!

One hot encoding

I hinted at it above, but I’m going to one hot encode the (grouped) categorical features.

I won’t bang on about this as it’s quite a common approach and there are tons of tutorials available for it. In my mind there are two good packages to try out — scikit-learn and Category Encoders¹⁰. I slightly prefer Category Encoders as I’ve generally found it to be more forgiving than scikit-learn’s implementation.
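For what it’s worth, a minimal sketch using Category Encoders might look like this (the column list is my assumption, based on the groupings described above):

import category_encoders as ce

cat_cols = ['area', 'region', 'driv_age', 'veh_power', 'veh_age', 'veh_brand', 'veh_gas']

encoder = ce.OneHotEncoder(cols = cat_cols, use_cat_names = True)
train_x = encoder.fit_transform(train_x)   # fit on the training set only
test_x = encoder.transform(test_x)
val_x = encoder.transform(val_x)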

The model

Now that we’ve prepared our data, we’re just about ready to start modelling.

I’ve structured my modelling workflow in two distinct stages, first defining the model class and then defining the training loop.

The model class

Remember how I mentioned that PyTorch provides the tools to build models rather than models themselves? Here’s a great example of that!

Our model is a Python class inheriting from PyTorch’s nn.Module. We have to define the structure of the network and its initialisation, as well as how information passes through the network (the “forward pass”).

Here’s how I set my model up:

class NeuralNetwork(torch.nn.Module):
    def __init__(self):
        super(NeuralNetwork, self).__init__()

        # linear
        self.F_hidden_one = torch.nn.Linear(75 + 1, 250)
        self.F_hidden_two = torch.nn.Linear(250, 250)
        self.F_output = torch.nn.Linear(250, 1)

        # add in drop out
        self.dropout_one = torch.nn.Dropout(p = 0.25)
        self.dropout_two = torch.nn.Dropout(p = 0.25)

        # initialise weights
        # He initialisation
        torch.nn.init.kaiming_uniform_(self.F_hidden_one.weight)
        torch.nn.init.kaiming_uniform_(self.F_hidden_two.weight)
        torch.nn.init.kaiming_uniform_(self.F_output.weight)

        # initialise the final bias
        torch.nn.init.constant_(self.F_output.bias, y_hat)

    def forward(self, x):
        # ELU activations
        elu = torch.nn.ELU(alpha = 1)

        # calculate F
        F = self.dropout_one(x)
        F = self.F_hidden_one(F)
        F = elu(F)
        F = self.dropout_two(F)
        F = self.F_hidden_two(F)
        F = elu(F)
        F = self.F_output(F)
        F = torch.exp(F)

        return F

Aside: there is an equivalent way of defining the network using nn.Sequential. I found this way much clearer and easier to understand than Sequential, and would recommend that you use whichever form you prefer.
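For illustration, a roughly equivalent definition using nn.Sequential might look like the sketch below; note it leaves out the custom weight and bias initialisation used in the class above.

class NeuralNetworkSequential(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Dropout(p = 0.25),
            torch.nn.Linear(75 + 1, 250),
            torch.nn.ELU(alpha = 1),
            torch.nn.Dropout(p = 0.25),
            torch.nn.Linear(250, 250),
            torch.nn.ELU(alpha = 1),
            torch.nn.Linear(250, 1),
        )

    def forward(self, x):
        # exponentiate to keep the output positive, as before
        return torch.exp(self.net(x))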

In a nutshell:

  • It’s a feed-forward neural network with 2 hidden layers.
  • There are 75 + 1 input nodes — one for each column, plus an additional column (a little more about this later).
  • There are 250 nodes in each of the hidden layers.
  • Dropout with probability 0.25 applied to each of the hidden layers (a little more about this later).
  • Kaiming (He) initialisation for the weights.
  • Output layer bias initialised with y_hat, the average claim frequency, adjusted to reflect the form of the network.
  • Exponential Linear Unit (ELU) activation used on the hidden layers.
  • Exponentiation on the final layer.

The initialisation of the output layer bias is an interesting one, which I came across almost by chance while troubleshooting something else entirely.

The idea behind that is fairly simple — to improve training time, we give the network a “general idea” of what the target output might look like, remembering to transform into the appropriate space. In this case, the “general idea” is the average claim frequency, and we’ve translated it (elsewhere) into log space — that is, y_hat = log(mean claim frequency). If you’re familiar with gradient-boosted tree models, the idea is not dissimilar to using an initial or base score.
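By way of illustration, y_hat might be computed along these lines (a sketch reusing the variable names from the splitting code; the original calculation isn’t shown above):

import numpy as np

# average training claim frequency (total claims / total exposure), in log space
y_hat = float(np.log(train_y.sum() / train_evy.sum()))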

Also worth mentioning is the exponentiation of the forward result prior to outputting a final prediction. This formulation looks very much like the classic generalised linear model with a log link, but I’ve imposed the transformation to ensure the network output is positive. Perhaps alternatives similar to ReLU might also achieve this but during experimentation, I found that taking the exponent produced better results.

The training loop

Time for us to write our training loop — the model’s got to learn, right?

To get our model trained, we’ll be combining several important concepts in neural networks.

After we’ve initialised our model, we’ll make a forward pass of the training data to get initial model predictions.

We’ll then calculate the loss against the training target and propagate it back through the network (backpropagation). The chosen optimiser will update the weights and biases and conclude a single iteration of the training loop.

Code-wise, this looks something like this:

# new model
model = NeuralNetwork().to(device)

# optimiser
optimiser = torch.optim.Adam(model.parameters(), lr = 0.001)

# training loop
for epoch in range(1, n_epoch + 1):

    # training mode
    model.train()

    # forward pass
    F = model(train_X)

    # back-prop & weight update
    optimiser.zero_grad()
    train_loss = poisson_loss(train_y, train_evy.float() * F)
    train_loss.backward()
    optimiser.step()

Some details:

  • We create a new instance of the model with model = NeuralNetwork(). We send the model off to the GPU using .to(device) (device is a variable containing the name of the GPU, if one is available).
  • We set up an ADAM optimiser with a learning rate of 0.001. Into this, we feed the weights and biases of the original untrained model (model.parameters()).
  • model.train() activates the model’s “learning” mode.
  • F is the result of a forward pass through the model using training data.
  • The training loss is calculated using a custom function poisson_loss, after reflecting the time that the policy has been in force in the prediction (i.e. scaling by the EVY). A sketch of one possible implementation follows this list.
  • The backward() method does the backpropagation.
  • The gradients from any previous iteration are cleared using optimiser.zero_grad() before the new backward pass, and the weights and biases for this round of training are updated using optimiser.step().
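The poisson_loss function itself isn’t shown above; here is a minimal sketch of one way it might be implemented (my assumption, not necessarily the exact version used):

import torch

def poisson_loss(y_true, y_pred, eps = 1e-8):
    '''
    Poisson negative log-likelihood, dropping the constant log(y!) term.
    y_pred should be a positive expected claim count (rate x exposure).
    '''
    return torch.mean(y_pred - y_true * torch.log(y_pred + eps))

PyTorch’s built-in torch.nn.PoissonNLLLoss(log_input = False) computes essentially the same quantity.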

We keep stepping through the training loop while loss is decreasing; as we’re focussing on training loss, we can open the door to overfitting and poor generalisation. We need to take some steps to avoid this.

Overfitting and early stopping

One tool at our disposal is early stopping, an approach which aims to detect overfitting and prevent a model from going too far down that path.

The idea behind early stopping is quite simple. The model learns on one set of data and is tested on “unseen” data. In our case, these are our training and testing sets respectively.

As we repeat the training rounds, we would expect the loss to decrease on both training and testing sets. This represents a model which is becoming more predictive but is still generalisable.

However it’s quite likely that if we train for long enough, our model will start to pick up noise in the training set rather than true signal. We’ll see this loss in generalisation materialise as a deterioration in the loss on the unseen data. The trick is to train the model until we achieve an optimal loss on the unseen data.

But what about “breakthroughs”? I suppose formally, this would be the optimiser escaping local minima; in plain words we can think of the optimiser as learning what doesn’t work and finding out what does.

This alludes to some sort of probing and highlights another requirement of early stopping: the need to allow the optimiser to try out different things. So, we need to introduce some form of tolerance into the mix — to allow the optimiser to sniff out the route without going too far down a rabbit hole.

Now, what if the test loss doesn’t improve and the tolerance is exceeded? Well, we revert to the “best” version of the model. To do this, we need to keep a record of what the “best” version of the model is.

We can whip something up and build it into the training loop:

# initial set up
last_loss = 2_000_000
best_loss = last_loss
stopping_rounds = 200
counter = 0
model_state = model.state_dict()
minimum_improvement = 0.001


# training loop
for epoch in range(1, n_epoch + 1):

    # training mode
    model.train()

    # forward pass
    F = model(train_X)

    # back-prop & weight update
    optimiser.zero_grad()
    train_loss = poisson_loss(train_y, train_evy.float() * F)
    train_loss.backward()
    optimiser.step()


    # test accuracy
    model.eval()
    with torch.no_grad():
        F = model(test_X)

        test_loss = poisson_loss(test_y, test_evy.float() * F).item()

    # early stopping
    current_loss = test_loss

    if current_loss >= best_loss:
        # loss does not improve - increment the counter
        counter += 1

        # if counter exceeds permissible rounds then break the loop
        if counter >= stopping_rounds:
            print(f'Early stopping after {epoch:,.0f} epochs')
            print(f'Best loss: {best_loss:,.4f}')
            print(f'Loss this epoch: {current_loss:,.4f}')
            print(f'Min. improvement required: {minimum_improvement:.2%}')
            break

    elif (
        current_loss < best_loss
    ) and (current_loss > best_loss * (1 - minimum_improvement)):
        # loss improves but not by the minimum required - increment the counter
        counter += 1

        # update best loss and model state
        best_loss = current_loss
        model_state = model.state_dict()

        # if counter exceeds permissible rounds then break the loop
        if counter >= stopping_rounds:
            print(f'Early stopping after {epoch:,.0f} epochs')
            print(f'Best loss: {best_loss:,.4f}')
            print(f'Loss this epoch: {current_loss:,.4f}')
            print(f'Min. improvement required: {minimum_improvement:.2%}')
            break

    else:
        # loss improves more than minimum required
        # reset the counter, update state, update best loss
        counter = 0
        model_state = model.state_dict()
        best_loss = current_loss
Code formatted to improve readability.

What my little hack does is quite simple.

The initial set up defines the “last loss” and “best loss”, as well as the tolerance allowed (stopping_rounds) and the minimum improvement in test loss required (minimum_improvement).

We set up three possible cases: (1) if test loss doesn’t improve, (2) if test loss improves but not by the minimum improvement required, and (3) if test loss improves by more than the minimum required.

If the test loss doesn’t improve, we add another strike to the score — i.e. we increment counter, a tally of the number of rounds for which the test loss hasn’t improved (or hasn’t improved by the minimum requirement). If the tally exceeds the given tolerance, the loop is broken and we stop training — i.e. we stop early.

In the second case, the test loss improves but not by the minimum required. In this instance, we update our record of best loss and update the model state dictionary — we want to acknowledge that this is still an improvement. However, we also increment the counter and check if it exceeds the tolerance.

In cases where the test loss improves by more than the minimum, we update the best loss, model state, and reset the counter in anticipation of further gains!

Overfitting and dropout

After a time, neural networks can start to fit to noise in training data sets, reducing the network’s ability to generalise to new and unseen input. This is obviously undesirable in our case.

This from Jason Brownlee at Machine Learning Mastery¹¹:

One approach to reduce overfitting is to fit all possible different neural networks on the same dataset and to average the predictions from each model. This is not feasible in practice, and can be approximated using a small collection of different models, called an ensemble.

Now I would imagine that to effectively ensemble networks, we would need a significant number of networks to begin with. In our example this might not be completely unrealistic, but the approach would quickly become unfeasible in larger or more complex use cases.

Enter dropout — a method which allows us to approximate the training and ensembling of many different neural networks.

Again, Jason Brownlee¹¹:

During training, some number of layer outputs are randomly ignored or “dropped out.” This has the effect of making the layer look-like and be treated-like a layer with a different number of nodes and connectivity to the prior layer. In effect, each update to a layer during training is performed with a different “view” of the configured layer…
Dropout has the effect of making the training process noisy, forcing nodes within a layer to probabilistically take on more or less responsibility for the inputs…
This conceptualization suggests that perhaps dropout breaks-up situations where network layers co-adapt to correct mistakes from prior layers, in turn making the model more robust.

So, if I rephrase and build:

  • Each time we run through a training step we randomly “turn off” some of the nodes.
  • By “turn off”, we effectively mean overwrite the inputs and outputs of the node with zero.
  • This makes it more likely that the remaining nodes pick up signal rather than correcting mistakes from previous layers.
  • Dropout is implemented per-layer and can be introduced to any layer other than the output layer.
  • The probability of any one node being zeroed — the “dropout rate” — is an additional hyperparameter set by the user.
  • Dropout is inactive during the prediction phase.

If you, like me, like to draw parallels to tree-based models, I think of dropout as being similar to column sampling, although some models have started to implement dropout into their algorithms¹³.

Aside: there’s a great animation by Ayush Thakur on Weights & Biases¹² which demonstrates dropout visually.
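A tiny sketch of the last two points (dropout active in training mode, inactive in evaluation mode):

import torch

drop = torch.nn.Dropout(p = 0.25)
x = torch.ones(8)

drop.train()
print(drop(x))   # roughly a quarter of entries zeroed, the rest scaled by 1 / (1 - 0.25)

drop.eval()
print(drop(x))   # identity: dropout is inactive at prediction time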

Training on the GPU

Just a quick one on this, as I’d rather be talking about results!

I’m currently working on my personal laptop. While it’s more of “the little laptop that could” rather than a dedicated deep-learning rig, it does have a dedicated Nvidia GPU card which I’ve been taking full — think RuntimeError: CUDA error: out of memory — advantage of.

Nevertheless, I’ve been training the model on my GPU, passing the entire training set through the model in a single batch. While this means that I can take advantage of GPU acceleration (and not worry about batch training), it does set some limitations.

One limitation is model size. Using two hidden layers with 250 nodes in each is at the very limit of what my GPU can handle, and I very quickly ran into memory issues when trying to increase the node count (recommended when using dropout). More about this later.

Overall, my experience with using the GPU has been generally smooth. The initial installation and setup was quick, and once PyTorch has detected an available GPU (device = 'cuda' if torch.cuda.is_available() else 'cpu'), moving objects to and from the GPU is easy using the .to(device) method.

It’s slightly trickier keeping track of where everything is, but you get used to it fairly quickly.

Now, on to the exciting stuff — results!

The results

It’s finally time to see what’s going on under the model’s hood!

There are a few ways to do this. We’ll start with some of the simpler visualisations and move our way through to some more sophisticated explainers.

One-way actual-vs-expected

Like their name suggests, these are visualisations which compare the actual target against the expected (or “modelled”) outcome across the levels of a single feature at a time (“one-way”). They are also sometimes referred to as “AvE” charts.

Aside: there’s a valid reason you’re getting deja vu — these effectively build on the charts we looked at earlier.

Here’s an example for area:

Image by author

A quick refresher:

  • Along the x-axis we have the levels of area
  • Yellow bars (left y-axis) show the proportion of total exposure which falls into each level of area
  • The pink and green lines (right y-axis) represent actual claim frequency and predicted claim frequency respectively

In this example, we see that our model somewhat overshoots the actuals for areas A, B, and C and undershoots the actuals for areas E and F.

If we flick through these AvE charts for some other features, we see some interesting things — here’s the model capturing non-linear trends in region quite well for large parts of the exposure:

Image by author

Likewise with driv_age:

Image by author

… although the grouping applied to driver age can clearly be seen in the average predicted values (i.e. the steps evident in the dark green line).

Overall, model fit on a one-way basis doesn’t look too bad.

Let’s move on to investigating interactions.

Multi-way plots

A neat way of visualising feature interactions is to create multi-way plots: plotting predicted claim frequency by one, two, or even three features. This gives us a way of understanding how — on average — different features interact with one another.

For example, here is a multi-way plot for the driver age and area interaction:

Image by author

The chart is simple to read — along the x-axis we have the various levels of driver age, and this is plotted against predicted claim frequency on the y-axis. Each coloured line represents a level of area.

Here we see the general trend we’ve seen in driver age before — high frequencies for lower age groups, decreasing as we move up through the age range.

We also see the familiar area trend: that area A is generally less risky than area B, area B is generally less risky than area C, and so on.

What is interesting is the relationship between the levels in these features.

Let’s first focus on areas B and C:

  • Our model thinks that drivers ≤ 22 years old, living in area C are less risky than drivers of the equivalent age who live in area B.
  • This is evident in the green line being lower than the orange line for driver ages ≤ 22.
  • However, this effect seems to reverse for drivers aged 23 and older, where the model believes that drivers living in area C are riskier than drivers of equivalent age who live in area B.
  • We see this as the green line is consistently higher than the orange line for driver ages 23+.

There’s also an interesting effect in areas D and E, where we see a very similar claim frequency prediction until driver age 21, after which there is clear and consistent discrepancy.

Sometimes rebasing the multi-way charts can make spotting interactions easier; here’s one I’ve rebased to show the ratio of predicted frequency to the group minimum:

Image by author

As much as I like these multi-way visualisations, they aren’t a silver bullet solution.

For one, they rely on the modeller knowing which interactions to interrogate. This might be fine if you’re working on a familiar task or have good domain knowledge of the problem, but it can become problematic if you’re unsure where to start looking. In this instance, using some sort of feature selection approach to narrow down “important” interactions would be useful (take a look at Friedman’s H statistic²²).

While providing more insight than the one-way views, multi-way charts do not provide the full story. For instance, we cannot say for sure that the effects described above are due entirely to the interaction of driver age and area — there may be other policyholder characteristics responsible for the trend.

Multi-way charts quickly become unwieldy and difficult to interpret when working with high-cardinality features, or when adding additional “depth” (i.e. more features) to the multi-way split. Not the end of the world, but something to bear in mind!

Let’s move on to using something a little more widespread in the data science community — SHAP plots.

SHAP

We can use the SHAP package to investigate our fancy new model, using either the more general kernel explainer or the DeepExplainer implementation.

Set up is quite simple — first create the explainer, then get the Shapley values:

# create the explainer
explainer = shap.DeepExplainer(
    model,
    torch.tensor(train_X.values).float()
)

# get the Shapley values
shap_values = explainer.shap_values(
    torch.tensor(train_X.values).float()
)

We can then get our feature importance:
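A bar-style summary plot is one way to produce a chart like the one below (a sketch on my part; for a single-output model DeepExplainer may return a one-element list, hence the unwrapping):

import shap

# unwrap if the explainer returns a list of arrays (one per model output)
vals = shap_values[0] if isinstance(shap_values, list) else shap_values

# mean |SHAP value| per (encoded) feature, as a bar chart
shap.summary_plot(vals, train_X, plot_type = 'bar')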

Image by author

The feature encoding makes it a bit difficult to understand how a feature as a whole affects predictions, but we can see some granular effects.

For instance, it looks like driving an older car increases the expected likelihood of having a third party motor insurance claim. More expectedly perhaps, higher levels of bonus malus drive higher claim frequency predictions.

SHAP dependence plots allow us to see the interaction between two features. Even better, the user can specify the feature they wish to visualise and the SHAP package can suggest an “interesting” interaction for that feature. Neat!

Here’s an example of a dependence plot for the Bonus Malus x Density interaction. Again, not the most useful but we can see how Bonus Malus affects predictions.

Image by author

Force plots allow us to see how the characteristics of each policyholder drive its prediction.

Here’s a chart for an observation chosen at random:

Image by author

… and we can see that:

  • The “base value” or starting point for every prediction is 0.1137.
  • Blue bars indicate characteristics which drive predictions down. Here, we see that driving a newer car drives down this prediction.
  • Red bars represent characteristics which drive the prediction up. For instance, being a driver aged 18–21 increases your frequency prediction.
  • The final prediction — i.e. f(x) — is 0.08, and is the result of starting from the base position and applying the upward and downward movements.

These charts are great for checking individual predictions; I’d highly recommend using them to check very large or very small predictions.

Other approaches

SHAP is great, but it’s not the only way to get at measures of feature importance.

Permutation importance²⁰ comes at the problem in a slightly different way, measuring importance via the reduction in prediction accuracy when a feature is randomly shuffled. In essence, the greater the reduction in accuracy caused by shuffling, the more “important” the feature is to the model.
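As a rough illustration, a manual version for our PyTorch model might look like this (a sketch under my own assumptions: X, y, and evy are tensors on the right device, and loss_fn is the custom Poisson loss from earlier):

import numpy as np
import torch

def permutation_importance(model, X, y, evy, loss_fn, n_repeats = 3, seed = 0):
    '''
    Increase in loss when a single (encoded) feature column is shuffled.
    Larger increases suggest the model relies more heavily on that column.
    '''
    rng = np.random.default_rng(seed)
    model.eval()

    with torch.no_grad():
        base_loss = loss_fn(y, evy.float() * model(X)).item()

        importances = []
        for j in range(X.shape[1]):
            deltas = []
            for _ in range(n_repeats):
                X_perm = X.clone()
                idx = torch.from_numpy(rng.permutation(X.shape[0]))
                X_perm[:, j] = X_perm[idx, j]
                deltas.append(loss_fn(y, evy.float() * model(X_perm)).item() - base_loss)
            importances.append(float(np.mean(deltas)))

    return importances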

I’d also like to check out Captum²¹, a model interpretability package built on PyTorch.

And finally, we could take a swing at feature importance from first principles and derive some form of weighted contribution. This is based on the rationale that network weights indicate the strength of the relationship between nodes; focussing on weights connecting the input layer to the first hidden layer could allow the derivation of a “weight-based” feature importance.
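A very crude sketch of what that might look like for the model above (it ignores biases, activations, and everything beyond the first hidden layer, so treat it as indicative at best):

import torch

# sum of absolute weights connecting each input node to the first hidden layer
with torch.no_grad():
    input_strength = model.F_hidden_one.weight.abs().sum(dim = 0)   # shape: (76,)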

Things that didn’t go too well

Unsurprisingly, things didn’t exactly go according to plan.

In fact, many things didn’t go according to plan. Here’s just a short list of the most memorable “experiments”.

Aside: I should emphasise that these are problems that I faced — there’s quite a good chance that you might not have the same experience. And of course, just because I couldn’t get something to work doesn’t mean that it won’t work!

The zero-inflated Poisson model

It’s common to model count data using a Poisson process. It might at first glance seem like a suitable approach to modelling insurance claim numbers (as they’re just counts of how many claims a policyholder is likely to incur), but it’s slightly more complex than that:

  • the vast majority of policyholders do not have a claim at all (we see < 10% of training data has an associated claim)
  • policyholders can have more than one claim

So in a way, this becomes a “nested” modelling problem where we first need to understand which policyholders are likely to have a claim at all, and then predict how many claims they are likely to have (given that they have a claim).

Enter the zero-inflated Poisson (ZIP) model — a mixture model combining two different processes:

  1. A process which generates zeros.
  2. A process which generates Poisson counts (some of which might be zero).

More mathematically now:

Image by author
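For reference, the standard ZIP probability mass function (which I believe is what the figure above shows) is:

P(N = 0) = \pi + (1 - \pi)\, e^{-\lambda}

P(N = k) = (1 - \pi)\, \frac{\lambda^{k} e^{-\lambda}}{k!}, \qquad k = 1, 2, \dots

where \pi is the probability of a structural zero and \lambda is the Poisson rate.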

In this set up, the neural network is designed to output two quantities — π and λ — and is trained such that the negative log-likelihood loss function is minimised. This is in fact the “F” part of the NeurFS model outlined in (9).
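A minimal sketch of the corresponding negative log-likelihood in PyTorch (my own formulation rather than the exact loss used in these experiments; pi and lam are the two network outputs and y the observed claim counts, all float tensors):

import torch

def zip_nll(y, pi, lam, eps = 1e-8):
    '''
    Negative log-likelihood of a zero-inflated Poisson.
    '''
    # log P(N = 0): a structural zero or a Poisson zero
    log_p_zero = torch.log(pi + (1 - pi) * torch.exp(-lam) + eps)

    # log P(N = k) for k >= 1: the Poisson component only
    log_p_count = torch.log(1 - pi + eps) - lam + y * torch.log(lam + eps) - torch.lgamma(y + 1)

    return -torch.where(y == 0, log_p_zero, log_p_count).mean()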

I tried various experiments using this approach but always found that my predictions were off by a consistent factor, and whilst tweaks to the network seemed to move predictions up and down, the predictions never quite reached a sensible level.

I also found it difficult to understand what a “good” negative log-likelihood score was. While it’s quite straightforward to understand what a good (or even perfect) score is for some loss functions (think MAE or RMSE), that’s definitely not the case with the negative log-likelihood function.

This is quite an interesting approach to modelling claims frequency, which I might return to in the future, and hope to get working!

If you’re interested in reading more about zero-inflated Poisson models, I would recommend taking a look at: (14), (15), (16).

Unstable models

… or “Aaaargh, why do things keep changing?!”.

Separate model runs produced quite significantly different outcomes in both predictions and fitted parameters.

I tried a variety of things, including setting the random seed and using different initialisations of the weights and biases, but these changes didn’t seem to help — the network was still unstable.

I then came across a discussion which suggested that larger networks are more stable than smaller networks. The underlying logic being (rephrased by me) that:

  • Smaller nets need to cram information into a smaller package…
  • Effectively meaning that node connections contain more information…
  • So the “de-activation” of certain nodes — through either dropout or initialisation — has a big impact as more “signal” gets turned off…
  • Which can cause instability in the network.

Moving from a smaller network (100 nodes in each hidden layer) to a larger network (250 nodes in each hidden layer) seemed to solve the issue.

Large models and batch training

Now that I had my fairly large — and stable — network up and running, I was keen to extract as much performance from it as possible.

I was power-hungry, wanting to expand the model outwards in all directions — more hidden layers, more nodes, more everything!

Only, my GPU said no. Very emphatically, in fact.

Which meant that I would need to use batch training if I was to achieve my grand designs.

The concept of batch training is quite simple: repeatedly train the model on subsets of the data until it has seen the training data in its entirety (i.e. one “epoch”), and repeat this for many training rounds.

There’s no such thing as a free lunch, as batch training comes with its own nuances.

One nuance is more instability in the network; the network only ever sees one subset of data at a time, and so any large variation in the target (and therefore loss) in the batch can cause disruption in the parameter updating.

In our case, the vast majority of claim counts were zero, but we did have some instances where a policyholder had 11 claims. This would create a significant training loss and might cause quite a shift when weights and biases get updated.

This phenomenon might be mitigated to an extent by using larger — potentially stratified — batches, as each batch would in theory be more representative of the target as a whole. But how large is “large”? The batch size is another tunable parameter which impacts not only training time but the way that the model learns (and therefore how well it predicts). There seems to be some evidence that using smaller batches produces better outcomes, but this is probably another factor that needs to be determined by the modeller contextually.

Batch normalisation¹⁷ is an approach often used to address instability in batch training. This technique essentially learns transformations to apply to data flowing through the network, ensuring that all data inside the network remains in a manageable range.

It’s recommended that batch normalisation and dropout not be used together, as they reduce and introduce noise respectively — like driving with one foot on the brake and the other on the accelerator.

All of this is fairly technical stuff, and we haven’t yet come to the more pragmatic element of batch training: training time. When we were training using the entire data set, backpropagation and the parameter update took place once the whole data set had flowed through the model — essentially, we do (number of rounds) updates. Batch training requires updates after each batch has passed through the model, so we effectively do (number of batches × number of rounds) updates; like it or not, this is computationally slower.
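For completeness, here is a rough sketch of what mini-batch training might look like using PyTorch’s DataLoader (variable names ending in _t are hypothetical tensor versions of the target and exposure, and poisson_loss is the custom loss from earlier):

from torch.utils.data import TensorDataset, DataLoader

# bundle features, targets, and exposure into batches
train_ds = TensorDataset(train_X, train_y_t, train_evy_t)
train_dl = DataLoader(train_ds, batch_size = 4096, shuffle = True)

for epoch in range(1, n_epoch + 1):
    model.train()
    for xb, yb, evyb in train_dl:
        xb, yb, evyb = xb.to(device), yb.to(device), evyb.to(device)

        optimiser.zero_grad()
        loss = poisson_loss(yb, evyb.float() * model(xb))
        loss.backward()
        optimiser.step()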

I wish I could report back on how much batch training improved my model’s performance. While I did get the batch training process up and running, I did also r̵u̵n̵ ̵o̵u̵t̵ ̵o̵f̵ ̵p̵a̵t̵i̵e̵n̵c̵e̵ re-prioritise and decided to revert to the smaller and more GPU-friendly model.

(Almost) beaten by a tree

I’m fairly familiar with tree-based models and so wanted to compare and contrast the performance of my neural network with that of a gradient-boosted tree model (LightGBM specifically).

I built a very simple GBM model on the same training data and ran a few tests against the neural network. The model performance turned out to be similar. Very similar in fact.

I was curious to see if this was a known and widespread occurrence, or if it was just a “me” problem. As it turns out, it doesn’t seem to be something specific to this use case!

Borisov et al¹⁸ note:

Heterogeneous tabular data are the most commonly used form of data and are essential for numerous critical and computationally demanding applications. On homogeneous data sets, deep neural networks have repeatedly shown excellent performance and have therefore been widely adopted. However, their adaptation to tabular data for inference or data generation tasks remains highly challenging.

But what about evidence? Is there empirical evidence to suggest that neural networks don’t easily outperform GBM on mixed-type tabular data?

There is, in fact¹⁹: out of 42 experiments on a mix of tasks on different data sets, LightGBM outperformed a neural network 35:7. There’s some nuance to this, but it’s still quite a trouncing.

Let’s leave it there and wrap up — you’re probably sick of me by now.

A wrap up, with just a bit more rambling

Wrap up

We’ve covered a lot of ground in this one. Let’s summarise and reflect.

We refreshed our knowledge of neural networks and had a brief look into PyTorch, the framework used to build the model.

We spent a bit of time talking about the data and the engineering applied to it. I probably spent more time talking about it than I had intended, but hopefully my ramblings gave the uninitiated some real-world insurance insight.

I went through the model structure, training loop, addressing overfitting through dropout, and addressing overfitting through early stopping. I also spent a little time on how training on the GPU worked for me.

We spent some time going through the results. We saw how simple visualisations like AvE charts can give simple insights into model fit. We also saw how the SHAP package can be used to give greater insight into the “black box” that is the neural network.

I shared some of the (many) things that didn’t quite work out for me.

We saw how the theory of the zero-inflated Poisson sounded like it had potential, but my implementation of it didn’t yield great results.

Expanding the size of the network overcame some stability issues, but increasing it too much required a move to batch training; batch training comes with its own set of complexities.

Most disappointingly, we also saw how my network was almost outperformed by a LightGBM model.

Reflections and things for next time

Building neural networks is difficult, and I’m not ashamed to admit that it took me some time to wrap my head around the complexity of these models (and I’ve only scratched the surface). That being said, there are some great resources available. I did find, however, that most resources are pitched at either the beginner or the advanced practitioner, and so there seems to be a bit of a gap in between.

My brief experience has suggested that building a good neural network is like picking a lock — for the model to work, everything has to be compatible and everything has to line up just right. Not that I know much about picking a lock.

I would do a fair few things differently next time. One of the major changes I would make is to stitch together separate pieces of code into a single pipeline, or pieces which can be easily assembled into running as a single pipeline. This would allow for more experimentation, and allow for experimentation in other parts of the pipeline — e.g. in the data preparation, where I would quite like to try different feature encodings.

I mentioned earlier that I would explain why I had 75 + 1 input layer nodes, and I’ve teased for long enough. This is simply an idea borrowed from regression: the design matrix contains a column of ones which is used to estimate the intercept. I found that including this improved model performance, even though I fed in suggestions about the “intercept” through the output layer bias term. This is something else to explore next time.

Speaking of more traditional regressions, I should probably expand on why I didn’t use exposure as a covariate, even though it’s a strong predictor of claims frequency. Most Poisson regressions use it, either by including the log(exposure) as an “offset” term, or by modelling the ratio of counts to exposure (i.e. count / exposure). The answer is that I tried different ways of using it, and they didn’t really affect the model performance very much. It is still present in the training loop — I scale predictions by exposure prior to calculating training loss — but it’s not fed directly into the model.
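For the curious, the offset formulation of the Poisson regression looks like this:

\log \mathbb{E}[N_i] = \log(\text{exposure}_i) + x_i^{\top}\beta
\quad \Longleftrightarrow \quad
\mathbb{E}[N_i] = \text{exposure}_i \cdot e^{x_i^{\top}\beta}

which is essentially what the training loop above does when it scales the network output by exposure before evaluating the loss.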

Another thing for next time — the dropout rate. I used a rate of 0.25 — i.e. the probability of a node being “zeroed” is 25%. This is significantly lower than what I’ve seen in other articles and forums. I feel like this choice was fine given that I was also using an implementation of early stopping to avoid over-fitting, but would consider raising the dropout rate if early stopping wasn’t an option.

I was quite surprised to find out about the (fairly lacklustre) performance of the network on mixed tabular data. There might be some good news on the horizon though, so I’ll definitely be keeping an eye on NODE²³ and TabNet²⁴.

And that’s it from me. I hope this has been useful. Please feel free to provide feedback in the comments — as I’m new to all of this, I’d greatly appreciate any tips and pointers that you might have.

As it’s that time of year, I’d like to wish you and your loved ones a great festive season. Consider this my shiny Christmas present to you. Or lump of coal.

References, credits and licences

  1. Let’s Learn: Neural Nets #1. A step-by-step chronicle of me learning… | by Bradley Stephen Shaw | Medium
  2. PyTorch
  3. TensorFlow
  4. OpenML
  5. Creative Commons — Attribution 4.0 International — CC BY 4.0
  6. CASdatasets-manual.pdf (uqam.ca)
  7. Target-encoding Categorical Variables | by Vinícius Trevisan | Towards Data Science
  8. What is Target Leakage in Machine Learning and How Do I Avoid It? — DataRobot AI Cloud
  9. (PDF) A Neural Frequency-Severity Model and Its Application to Insurance Claims (researchgate.net)
  10. Category Encoders — Category Encoders 2.5.1.post0 documentation (scikit-learn.org)
  11. A Gentle Introduction to Dropout for Regularizing Deep Neural Networks — MachineLearningMastery.com
  12. Implementing Dropout in PyTorch: With Example — Weights & Biases (wandb.ai)
  13. DART booster — xgboost 1.7.1 documentation and Parameters — LightGBM 3.3.3.99 documentation
  14. Zero-inflated model — Wikipedia
  15. The Zero Inflated Poisson Regression Model — Time Series Analysis, Regression and Forecasting (timeseriesreasoning.com)
  16. Lots of zeros or too many zeros?: Thinking about zero inflation in count data (rbind.io)
  17. How to use the BatchNorm layer in PyTorch? — Knowledge Transfer (androidkt.com)
  18. 2110.01889.pdf (arxiv.org)
  19. Lightgbm vs Neural Network | MLJAR
  20. 4.2. Permutation feature importance — scikit-learn 1.2.0 documentation
  21. Captum · Model Interpretability for PyTorch
  22. 8.3 Feature Interaction | Interpretable Machine Learning (christophm.github.io)
  23. https://arxiv.org/abs/1909.06312
  24. https://arxiv.org/abs/1908.07442

Let’s Do: Neural Networks was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.
