Visualizing a Perceptron
A lot of machine learning techniques can be viewed as an attempt to
represent high-dimensional data in fewer dimensions without losing any
important information. In a sense, it is lossy compression: the data is compressed
into a smaller, more manageable form before being passed to the next stage of data
processing.
processing. If our data consists of elements of
where
An extreme case is where we represent every data point with a single number,
and so our function
- its distance from the origin.
- deciding a neighbourhood around the data point and counting the number of other data points within it.
- choosing the average of the data point's elements, or the max, or the min.
A method that beautifully balances simplicity and flexibility is to choose
a single vector and project each data point onto it. In this way, each data
point is converted into a single number. If the chosen vector is denoted
as $v$, then a data point $x$ is mapped to the number $x \cdot v$,
where "$\cdot$" denotes the dot product.
Isn't it interesting that the vector $v$ lives in the very same space as the data points it is used to summarize?
For people interested in the application to regression, Andrew Ng's notes on Kernel Methods cover this idea.
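To make the projection concrete, here is a minimal sketch (the points and the vector are made up for illustration) that reduces a few 2D points to single numbers with one dot product each:

import numpy as np

# Three made-up 2D data points, one per row.
X = np.array([[1.0, 2.0],
              [3.0, 0.5],
              [-1.0, 1.5]])

# A chosen projection vector of length 1.
v = np.array([0.6, 0.8])

# Each data point is reduced to a single number: its dot product with v.
projections = X @ v
print(projections)  # [2.2 2.2 0.6]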
In 2 dimensions, which is where our visualizations will live, the projection of a point $x = (x_1, x_2)$ onto $v = (v_1, v_2)$ is the number $x_1 v_1 + x_2 v_2$.
Below is a sequence of visualizations that try to spark some intuition for how data can be transformed by projection. First, consider the following 2D data points.

Warming up: consider projecting the data onto the two existing axes, with the
first axis being the horizontal one. If $v = (1, 0)$, each data point is projected onto the horizontal axis, and its projection is simply its first coordinate.
Similarly, if $v = (0, 1)$, each data point is projected onto the vertical axis, and its projection is its second coordinate.
In the above two videos, and in all subsequent ones, the vector $v$ is drawn as a white arrow, and the line it defines, onto which the data is projected, is drawn in yellow.
Note: in the above two figures, the ticks and numbers on the two data axes have been removed to avoid confusion with those on the projection line. This is done on many of the figures below too.
The previous two videos covered one projection each. The next video sweeps
$v$ through a range of directions, showing how the projected values change as the projection line rotates.
Next, we take a moment to view the data from the perspective of the projection line.
What is left after the projection
When imagining these projections, it is useful to inhabit the perspective of the line receiving the points, as if ignorant of the original structure. From this viewpoint you can better appreciate how little information survives the projection: we forget everything about each data point except the number on the projection line where it lands. Anything we might subsequently ask about a data point must be asked in terms of this number. This list of 1D points might be the only information received by the next stage of some processing pipeline, and some goals might be impossible if the projection fails to capture the information they require.
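To make the information loss concrete, here is a small sketch (with made-up points and a made-up projection vector) in which two different data points become indistinguishable once projected:

import numpy as np

v = np.array([1.0, 0.0])  # project onto the horizontal axis

a = np.array([2.0, 5.0])
b = np.array([2.0, -3.0])

# Both points land on 2.0; whatever distinguished them is lost.
print(np.dot(a, v), np.dot(b, v))  # 2.0 2.0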
The below slightly jarring video is one visualization that might help to
shift your focus to the reference frame of the projection line. Just like the
video above, $v$ sweeps through a range of directions, but this time the camera stays fixed to the projection line, so it is the data that appears to rotate.
Having seen how projection onto a line transforms the data, the natural question is how to choose $v$.
Choosing $v$ amounts to choosing which single number will stand in for each data point.
Some values for $v$ keep the projected points well spread out, while others squash many points on top of one another.
This isn't the only useful choice for $v$, though: if the data points carry labels, we can instead choose $v$ so that the projection separates the classes as cleanly as possible.
Below is the same dataset, now with labels: each point is either red or white.

First, inspect how well the data is separated if we set $v = (1, 0)$ and project onto the horizontal axis.
Not a very useful projection; there is no obvious pattern to the class labels once the points lie on the projection line.
For comparison, below is another sweep of $v$ through a range of directions, this time with the class labels visible.
Using our eyes, we can pick out a good value for $v$: one for which the two classes land on largely separate stretches of the projection line. A choice that works well here is $v = (0.56, 0.83)$.
The intersection point is at about $2.4$ on the projection line, so we can classify a point as a circle if its projection is less than $2.4$ and as a triangle otherwise.
3 parts of a perceptron
A perceptron can be fully described by 3 things. They all appeared in the above examples, although one of them was not mentioned explicitly. They are:
- a projection vector, $v$
- a bias term, $b$
- an activation function, $f$
The classifier that appeared above with $v = (0.56, 0.83)$ and a cutoff of $2.4$ can be written in Python as:
import numpy as np

def classify(x):
    """Classify a 2D point as either 0 (circle) or 1 (triangle)."""
    v = np.array([0.56, 0.83])
    cutoff = 2.4
    x_projected = np.dot(x, v)
    return 0 if x_projected < cutoff else 1
This function can be rewritten in a more standard way by replacing the
cutoff term with a bias term, b = -cutoff.
import numpy as np

def classify(x):
    """Classify a 2D point as either 0 (circle) or 1 (triangle)."""
    v = np.array([0.56, 0.83])
    b = -2.4
    x_projected = np.dot(x, v) + b
    return 0 if x_projected < 0 else 1
The perceptron's parameters are thus $v = (0.56, 0.83)$ and $b = -2.4$.
Our activation function so far has been a step function, hidden in the final line of the code: output 0 if the projected value is below zero and 1 otherwise. A common alternative is the sigmoid function, whose output lies between 0 and 1 and can be read as a probability:

import numpy as np

def sigmoid(t):
    return 1 / (1 + np.exp(-t))

def classify(x):
    """Estimate the probability of x belonging to class 1 (triangle)."""
    v = np.array([0.56, 0.83])
    b = -2.4
    x_projected = np.dot(x, v) + b
    return sigmoid(x_projected)
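As a quick usage check of this version (the two sample points below are made up), a point deep on the circles' side of the cutoff gets a value near 0, while a point far on the triangles' side gets a value near 1:

import numpy as np

# Using the sigmoid-based classify() defined above.
print(classify(np.array([0.0, 0.0])))  # ~0.08, almost certainly a circle
print(classify(np.array([3.0, 3.0])))  # ~0.85, probably a triangle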
Sidenote: abstracting away the bias
As the bias $b$ does nothing more than shift every projected value by a constant amount, it can be folded into the projection itself.
We are free to choose both the vector $v$ and the bias $b$; together they define a single affine map from the data space to the number line.
Perceptrons are not typically thought of in this way; however, there is
something appealing in the simplicity of viewing the datum of a perceptron as a
pair: a projection direction and a choice of where zero sits along it.
Another common conceptualization is to add an extra dimension to $x$ that is always equal to 1; the bias then becomes just one more entry of $v$.
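A minimal sketch of that trick, reusing the $v$ and $b$ from the classifier above (the data point is made up): appending a constant 1 to $x$ and appending $b$ to $v$ gives exactly the same projected value.

import numpy as np

v = np.array([0.56, 0.83])
b = -2.4
x = np.array([3.0, 1.0])  # a made-up data point

# Standard form: dot product plus bias.
standard = np.dot(x, v) + b

# Augmented form: the bias becomes one more entry of the vector.
x_augmented = np.append(x, 1.0)   # [3.0, 1.0, 1.0]
v_augmented = np.append(v, b)     # [0.56, 0.83, -2.4]
augmented = np.dot(x_augmented, v_augmented)

print(standard, augmented)  # both approximately 0.11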
Visualizing the effects of $b$ and the length of $v$
In the previous videos, we kept the length of $v$ fixed at 1 and did not use a bias. The next few videos show what changes when these are varied.
Changing $b$
One way of visualizing the effect of the bias, $b$, is to slide every projected point a distance of $b$ along the projection line.
An alternative visualization is to offset the number line in the opposite direction.
Changing the length of $v$ (in other words, scaling $v$)
Up until now, all videos had the length of $v$ fixed at 1.
Below are two ways of visualizing the effect of scaling $v$ to other lengths.
An equivalent and, in my opinion, more appealing way to visualize
the effect of scaling $v$ is to leave the projected points where they are and instead rescale the ticks on the number line.
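Numerically, scaling $v$ by a constant simply scales every projected value by that same constant, which is why the two visualizations are equivalent. A tiny sketch with made-up points:

import numpy as np

X = np.array([[1.0, 2.0],
              [3.0, -0.5]])
v = np.array([0.6, 0.8])

print(X @ v)        # [2.2 1.4]
print(X @ (3 * v))  # [6.6 4.2], three times as far along the number line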
A note about the white arrow representing $v$: its drawn length matches the length of $v$, so it stretches and shrinks as $v$ is scaled.
Setting $f$ to be a sigmoid function
So far the activation function $f$ has been a step function: 0 on one side of the cutoff and 1 on the other. Replacing it with a sigmoid, $\sigma(t) = \frac{1}{1 + e^{-t}}$, makes the output vary smoothly between 0 and 1.
One way of explicitly visualizing $f$ is to draw its graph directly above the projection line.
In the below video, the sigmoid's graph is drawn along the projection line as the projected points pass through it.
I didn't manage to mark on the sigmoid graph where each of the white dots lands after passing through the sigmoid function. Hopefully, it's still clear what is happening.
An alternative way of conceptualizing $f$ is to leave the points where they are and instead relabel the number line, marking each position $t$ with the value $\sigma(t)$.
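For readers who want to draw something like this themselves, here is a minimal matplotlib sketch (not the code used to produce the videos; the projected values are made up) that plots the sigmoid over the projection line and marks where a few projected points land on it:

import numpy as np
import matplotlib.pyplot as plt

def sigmoid(t):
    return 1 / (1 + np.exp(-t))

# Positions along the projection line (after the bias has been added).
t = np.linspace(-6, 6, 200)
plt.plot(t, sigmoid(t), label='sigmoid')

# A few made-up projected points, marked where they land on the curve.
projected = np.array([-3.0, -1.0, 0.5, 2.5])
plt.scatter(projected, sigmoid(projected), color='red', zorder=3)

plt.xlabel('projected value')
plt.ylabel('activation output')
plt.legend()
plt.show()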
Perceptron by hand
If the data is in 2 or 3 dimensions, it's not too hard to calculate by hand some good parameters for a perceptron by looking at a plot of the data. Below is some fresh data for which we will craft a perceptron to act as a classifier.
Some fresh labeled data:

Step 1. Roughly guess the position of an effective dividing line.
Here is a line I chose by eyeballing the data.
Step 2. Calculate the unique line through the origin and perpendicular to the dividing line.
The dividing line looks to have a slope of about $-3$, so the line through the origin perpendicular to it has a slope of about $\frac{1}{3}$.
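A quick way to check the perpendicularity, assuming the slopes read off the plot above: a direction vector along a line of slope $-3$ is $(1, -3)$, a direction along a line of slope $\frac{1}{3}$ is $(3, 1)$, and their dot product is zero.

import numpy as np

dividing_direction = np.array([1.0, -3.0])      # a line of slope -3
perpendicular_direction = np.array([3.0, 1.0])  # a line of slope 1/3

# A dot product of zero means the two directions are perpendicular.
print(np.dot(dividing_direction, perpendicular_direction))  # 0.0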

Step 3. Choose $v$ to be a vector along this yellow line.
We don't need to worry too much about choosing a particular length for $v$; any vector pointing along the yellow line will do. What does matter is which of the two possible directions $v$ points in.
Arbitrarily, I will designate the circles to be class 0 and the
triangles to be class 1. Thus, $v$ should point towards the triangles' side of the dividing line, so that triangles receive larger projected values than circles.
The next image includes the vector I have chosen: $v = (2, \frac{2}{3})$.

The length of $v$ is not critical here; it only scales the projected values, and the bias chosen in the next step will be scaled to match.
Step 4. Choose $b$ so that the dividing line is projected to $0$.
Referring to the above figure, the dotted dividing line intersects the
yellow projection line at a point whose projection is about $-2.8$. Setting $b = 2.8$ shifts this point to $0$, so that points project to negative values on the circles' side of the dividing line and positive values on the triangles' side.

With the bias chosen, we have all the parameters required to implement the classifier.
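As a quick sanity check on the arithmetic (the crossing point used below is my own rough estimate of where the dotted and yellow lines meet, not a value taken from the figure), the chosen bias does indeed send the dividing line to zero:

import numpy as np

v = np.array([2.0, 2.0 / 3.0])
b = 2.8

# A rough estimate of where the dividing line crosses the projection line.
crossing_point = np.array([-1.26, -0.42])

print(np.dot(crossing_point, v) + b)  # approximately 0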
Step 5. Build the classifier.
The perceptron classifier is now complete. It is described by: the projection vector $v = (2, \frac{2}{3})$, the bias $b = 2.8$, and a step function as the activation.
In Python, it can be implemented as:
import numpy as np

def classify(x):
    """Classify a 2D point as either 0 (circle) or 1 (triangle)."""
    v = np.array([2, 2/3])
    b = 2.8
    x_projected = np.dot(x, v) + b
    return 0 if x_projected < 0 else 1
The following code calculates and prints the classifier's accuracy on the training data:
# X.shape = (60, 2) and y.shape = (60,)
X, y = generate_data()
num_correct = 0
for x_i, y_i in zip(X, y):
    if classify(x_i) == y_i:
        num_correct += 1
accuracy = num_correct / X.shape[0]
print(f'Accuracy: {accuracy:.3f}')
This prints out:
Accuracy: 0.950
We can compare this result with a perceptron model trained using Scikit-learn. The following code trains a logistic-regression model (a perceptron with a sigmoid activation) using Scikit-learn and prints the model parameters and training accuracy.
from sklearn import linear_model, metrics

def classify_with_sklearn(X, y):
    model = linear_model.LogisticRegression()
    model.fit(X, y)
    v = model.coef_[0]
    b = model.intercept_
    print(f'v: {v}, b: {b}')
    y_predict = model.predict(X)
    accuracy = metrics.accuracy_score(y, y_predict)
    print(f'Accuracy: {accuracy}')

classify_with_sklearn(X, y)
This prints out:
v: [1.81096933 0.64242439], b: [2.24777174]
Accuracy: 0.95
This classifier has had its parameters, $v$ and $b$, learned from the data by an optimization procedure rather than chosen by eye.
Our rough estimation for the classifier produces a very similar set of parameters to those learned by Scikit-learn, and the dividing line implied by both is almost identical.
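One way to quantify the similarity is to compare the directions of the two projection vectors and the slopes of the dividing lines they imply. A small sketch, using the hand-picked values from the steps above and the values printed by Scikit-learn:

import numpy as np

v_hand = np.array([2.0, 2.0 / 3.0])
v_learned = np.array([1.81096933, 0.64242439])

# Cosine of the angle between the two projection vectors.
cosine = np.dot(v_hand, v_learned) / (np.linalg.norm(v_hand) * np.linalg.norm(v_learned))
print(cosine)  # ~0.9997, so the two vectors point in almost the same direction

# Slope of the dividing line implied by each vector: -v[0] / v[1].
print(-v_hand[0] / v_hand[1], -v_learned[0] / v_learned[1])  # -3.0 and about -2.82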
From this exercise, it should be clear that any line we might draw
to divide the plane can be implemented as a perceptron by choosing a suitable $v$ and $b$.
A note on the magnitude of $v$ and $b$
The projection vector of a perceptron trained by gradient descent will have a length that depends on factors which are not easy to account for in our rough estimate above, and the same goes for the magnitude of the bias. The loss calculation might include a regularization term that encourages a smaller projection vector and bias, and the loss might target a probabilistic goal such as maximum likelihood estimation, in which case the outputs of the perceptron represent probabilities.
Scaling up to neural networks
The dimension reduction transformation we have been working with,
$$x \mapsto f(x \cdot v + b),$$
is the building block that is repeated to create neural networks.
A neuron
The correspondence between this transformation and a typical nodal network
diagram is shown below. The diagram represents the transformation applied by a single neuron to an input $x$.

There is an extra intermediate node in the diagram representing the value $x \cdot v + b$ before the activation function $f$ is applied.
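In code, a single neuron is exactly the transformation we have been working with, with the intermediate value made explicit. This is a sketch with made-up parameters, assuming a sigmoid activation:

import numpy as np

def sigmoid(t):
    return 1 / (1 + np.exp(-t))

def neuron(x, v, b):
    pre_activation = np.dot(x, v) + b   # the extra intermediate node in the diagram
    return sigmoid(pre_activation)

# Made-up input and parameters.
x = np.array([1.0, 2.0, 0.5])
v = np.array([0.4, -0.1, 0.3])
print(neuron(x, v, b=0.2))  # ~0.63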
A layer
Next, build a whole neural network layer. As an example, consider that we have
3 transformations, each with its own projection vector and bias: $f(v_1 \cdot x + b_1)$, $f(v_2 \cdot x + b_2)$ and $f(v_3 \cdot x + b_3)$, all applied to the same input $x$.
We can pack all of these three transformations into matrices like so:
$$\begin{bmatrix} f(v_1 \cdot x + b_1) \\ f(v_2 \cdot x + b_2) \\ f(v_3 \cdot x + b_3) \end{bmatrix} = f\!\left( \begin{bmatrix} v_1^\top \\ v_2^\top \\ v_3^\top \end{bmatrix} x + \begin{bmatrix} b_1 \\ b_2 \\ b_3 \end{bmatrix} \right)$$
Which gets condensed as:
$$f(Vx + b)$$
where $V$ is the matrix whose rows are $v_1, v_2, v_3$, the vector $b$ collects the biases $(b_1, b_2, b_3)$, and $f$ is applied element-wise.
It's more common to see this expression in the form:
$$f(Wx + b)$$
where the only change is to rename $V$ to $W$, which stands for "weights".
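Here is a small sketch of the same equivalence in code (the particular numbers are made up): three neurons computed one at a time agree with the single matrix expression $f(Wx + b)$.

import numpy as np

def sigmoid(t):
    return 1 / (1 + np.exp(-t))

x = np.array([1.0, 2.0])

# Three neurons, each with its own projection vector and bias.
v1, v2, v3 = np.array([0.5, -1.0]), np.array([2.0, 0.0]), np.array([-0.3, 0.7])
b1, b2, b3 = 0.1, -0.5, 0.0

one_at_a_time = np.array([sigmoid(np.dot(v1, x) + b1),
                          sigmoid(np.dot(v2, x) + b2),
                          sigmoid(np.dot(v3, x) + b3)])

# The same three transformations packed into a weight matrix W (one row per
# neuron) and a bias vector b, with the activation applied element-wise.
W = np.stack([v1, v2, v3])
b = np.array([b1, b2, b3])
all_at_once = sigmoid(W @ x + b)

print(np.allclose(one_at_a_time, all_at_once))  # True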
And beyond
Making layers like this and connecting them together is the fundamental idea of neural networks.