PyTorch is a library that is rapidly gaining popularity among Deep Learning researchers. If you want to get a grasp of PyTorch for AI and adjacent topics, this tutorial on its basics is for you.

There are quite a few tutorials available online, although they tend to focus on the NumPy-like features of PyTorch. Indeed, the creators of PyTorch present the library as a replacement for NumPy, designed for efficient computation on both CPU and GPU. Nevertheless, I personally find this focus distracting in Deep Learning tutorials, so I leave it out here; you can find that material in the official documentation or elsewhere.

In this tutorial we will cover the key Deep Learning features of the library that facilitate neural network design and training: the DataLoader, the model class, the loss and the optimiser.

1: Requirements

You will need the following hardware and software:

  • A GPU with CUDA support
  • Anaconda environment with Python 3
  • Jupyter Lab
  • Some familiarity with Python and machine learning

Luckily, all of the above are easily available on the TensorPad website, which offers instances with GPU for Deep Learning purposes at a rather competitive price.

If you have all the necessary prerequisites, let's dig in.


2: A quick start

Getting familiar with PyTorch takes just a few lines of code. We can squeeze the data preprocessing step into a single line, so we can focus on building the PyTorch model. We will use the MNIST dataset of handwritten digits for demonstration purposes. You can consider this the "Hello World!" of PyTorch.

** Keep in mind: **

I highly recommend that you do not run the cells mindlessly. Please read the text thoroughly first, soak up the new concepts, and only then begin experimenting. The following cell contains Jupyter Notebook magics; don't worry about them, we will cover them in future tutorials.

%reload_ext autoreload
%autoreload 2
%matplotlib inline
from utils.tutorial_1 import *

train_loader, test_loader = load_MNIST()

Once the data is gathered, the remaining steps in Deep Learning are:

  • Building a model
  • Training
  • Evaluation

The model creation implies that we have to define the neurons, their activation functions and the way they are connected in the network. It is convenient to consider a group of neurons in the network as a layer abstraction. Thus a typical neural network would consist of input, hidden and output layers.

2.1: Building a model

The model in our example is created with the nn.Sequential container, which simply stacks the layers of the neural network. If you have ever built neural networks in Keras, you may already be familiar with this approach.

We use simple linear layers with the ReLU activation function. Next we initialise a loss function that evaluates the model and an optimiser that minimises the loss. The lower the loss, the better the model should perform in terms of accuracy.

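n_features = 28*28  # each MNIST image is 28x28 pixels
n_hidden = 512

model = nn.Sequential(
                nn.Linear(n_features, n_hidden),
                nn.ReLU(),  # the ReLU activation mentioned above
                nn.Linear(n_hidden, n_hidden),
                nn.ReLU(),
                nn.Linear(n_hidden, 10),  # one output score per digit
)

criterion = nn.CrossEntropyLoss()
optimiser = torch.optim.Adam(model.parameters(),lr=0.01)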
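A quick way to check that stacked layers fit together is to push a fake batch through them and look at the output shape. The sketch below is a smaller stand-in for the model above (one hidden layer, random input instead of MNIST), so the names and sizes are illustrative only.

```python
import torch
import torch.nn as nn

n_features, n_hidden, n_classes = 28 * 28, 512, 10

# A reduced stand-in for the tutorial's model: Linear -> ReLU -> Linear.
sketch_model = nn.Sequential(
    nn.Linear(n_features, n_hidden),
    nn.ReLU(),
    nn.Linear(n_hidden, n_classes),
)

# A fake batch of 4 flattened "images" filled with random numbers.
dummy_batch = torch.randn(4, n_features)
scores = sketch_model(dummy_batch)

# One row per image, one column per digit class.
print(scores.shape)
```

If the sizes of two neighbouring layers did not match, the forward pass would raise an error here rather than in the middle of training.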

2.2: Training

The training process happens in a for loop. First we flatten the 2D arrays representing the pixels of each picture and set the gradients to zero. The gradients have to be zeroed because PyTorch accumulates them by default on subsequent backward passes. After that we make a forward pass by simply passing the data to the model, and calculate the loss. To get the new set of gradients we call backward() on the loss, and then we apply them to the model's weights with optimiser.step().

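for batch_idx, (data, target) in enumerate(train_loader):
    X_batch = data.view(data.shape[0], -1)  # flatten each 28x28 image into a vector
    optimiser.zero_grad()                   # reset gradients left from the previous batch
    y_pred = model(X_batch)                 # forward pass
    loss = criterion(y_pred, target)
    loss.backward()                         # compute the new gradients
    optimiser.step()                        # update the weights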
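The gradient accumulation mentioned above is easy to see on a single number. This is a toy sketch with a made-up scalar weight w, not part of the MNIST model.

```python
import torch

w = torch.tensor(2.0, requires_grad=True)

# The derivative of 3*w with respect to w is 3.
(3 * w).backward()
first = w.grad.item()

# A second backward pass adds to the stored gradient instead of replacing it.
(3 * w).backward()
second = w.grad.item()

# Zeroing the gradient restores the expected value on the next pass.
w.grad.zero_()
(3 * w).backward()
third = w.grad.item()

print(first, second, third)
```

Calling zero_grad() on the optimiser does the same reset for every parameter of the model at once.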

Here we create a line plot to visualise the loss function values over the training epochs.
As can be seen from the plot, the loss function quickly approaches zero, which indicates that the model learns to classify the data.



2.3: Evaluation

Let's pick a random image from the set and see if the model is able to interpret the number written in it correctly. We have designed a custom function to do that and we will use it along with the imshow() function from the matplotlib library to display the image.

For the model to make a prediction we flatten the 2D array with image's pixels and pass it to the model. After that we choose the label with the highest score using the torch.max() function.

image = drag_image(test_loader)

# Display the image
plt.imshow(image.numpy().squeeze(), cmap='gray_r')

image = image.view(1, -1)
output = model(image)
predicted = torch.max(output,1)[1]
print(f"The model has predicted the number on the picture to be {int(predicted)}.")
The model has predicted the number on the picture to be 7.
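To make the torch.max() step concrete, here is a small sketch on made-up scores for two images and three classes. With a dimension argument, torch.max() returns both the highest values and their positions, and the positions are the predicted labels.

```python
import torch

# Made-up scores: 2 samples (rows) and 3 classes (columns).
output = torch.tensor([[0.1, 2.5, -1.0],
                       [1.7, 0.3, 0.9]])

# Along dimension 1, torch.max returns a (values, indices) pair.
values, indices = torch.max(output, 1)
print(values.tolist(), indices.tolist())

# argmax gives the same indices directly.
same = torch.equal(indices, output.argmax(dim=1))
print(same)
```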


This is how Neural Networks are created and trained in a nutshell. For further explanation see the chapters below.

3: Imports

You can find the list of all imports you may need for the entire tutorial in the cell below.

In Python, the os library allows us to interact with the operating system. We will use it to list all files in the directory where the dataset will be downloaded. There are many data formats, but .csv is the most popular one for files smaller than tens of MB. pandas provides functions that load these files into the computer's memory and process them.

torch is our Deep Learning library of choice. It provides us with a range of functions to build and train our custom neural networks. To track the learning process we will use tqdm for displaying a progress bar.

The model will be evaluated with metrics and statistics from sklearn (Scikit-learn) and scipy. The former is a machine learning library built upon the latter. We will also import the necessary preprocessing functions from it.

The results will be illustrated with graphs from the matplotlib library.

Other libraries are imported to clear the Jupyter cells' output.

import os
import pandas as pd
from tqdm import tqdm_notebook as tqdm

import torch
import torch.nn as nn
import torch.optim as optim
import torchvision
import torchvision.datasets as datasets

import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import scale
from sklearn.metrics import classification_report, accuracy_score
from scipy.stats import pearsonr

from IPython import display
from IPython.display import clear_output
import warnings

4: Loading the data

Our first example in the intro section used the MNIST dataset. Pictures tend to be more intuitive and make the demonstration of a Deep Learning classifier straightforward. But the steps preceding the model training are clearer if your dataset consists of a set of numeric features. This is why we use the wine quality dataset in this part of the tutorial.

We will load the data from the UCI archive with the following code. The !! prefix passes the line to the system shell, allowing us to use the wget command, which will automatically download all the files in the dataset.

!! wget --recursive --no-parent -nH -q --cut-dirs=2 \
"""Downloading dataset""";

We can now look into the folder with the files that have just been downloaded. There you can find the two datasets, which relate to the red and white variants of the Portuguese "Vinho Verde" wine.

for file in os.listdir("wine-quality/"):
    print(f"\t {file}") # f-strings are available since Python 3.6

As we can see, our dataset chiefly consists of two files, namely winequality-red.csv and winequality-white.csv. We will load them into the machine's memory with the pandas.read_csv function. We then combine the two datasets, labelling each record as red or white.

The dataset contains 11 physicochemical properties for each wine and one sensory property, which is labelled quality in the dataframe. Let's refer to the dataset description.

Input variables (based on physicochemical tests):

  1. fixed acidity
  2. volatile acidity
  3. citric acid
  4. residual sugar
  5. chlorides
  6. free sulfur dioxide
  7. total sulfur dioxide
  8. density
  9. pH
  10. sulphates
  11. alcohol

Output variable (based on sensory data):

  12. quality (score between 0 and 10)

df_red = pd.read_csv('wine-quality/winequality-red.csv', sep=';')
df_whi = pd.read_csv('wine-quality/winequality-white.csv', sep=';')
df_red["wine type"] = 'red'
df_whi["wine type"] = 'white'
df = pd.concat([df_red, df_whi], axis=0)
df = df.reset_index(drop=True)
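The effect of concat followed by reset_index is easier to see on two tiny frames. The values below are made up for illustration and are not taken from the wine files.

```python
import pandas as pd

# Two tiny made-up frames standing in for the red and white tables.
df_a = pd.DataFrame({"alcohol": [9.4, 9.8]})
df_b = pd.DataFrame({"alcohol": [8.8, 9.5, 10.1]})
df_a["wine type"] = "red"
df_b["wine type"] = "white"

# Stacking keeps each frame's own index, so the labels repeat.
stacked = pd.concat([df_a, df_b], axis=0)
print(list(stacked.index))

# reset_index(drop=True) replaces them with one continuous range.
combined = stacked.reset_index(drop=True)
print(list(combined.index))
print(combined["wine type"].value_counts().to_dict())
```

Without the reset, selecting a row by label such as stacked.loc[0] would return one row from each original frame.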


There are two classic problems in supervised Machine Learning: classification and regression.

Classification is the problem of identifying which class a given observation belongs to. In the wine quality dataset one can predict whether a wine is red or white based on its characteristics, which is an example of classification.

Regression is the problem of estimating a numeric target value from the input data. In our dataset the prediction of the wine quality is an example of a regression problem. We will start by tackling this problem.

5: Train/Test Split

The Train/Test Split is a standard technique designed to detect overfitting when you create a Machine Learning model. Overfitting is a problem that arises when a model fits the limited set of data from the training set too well and fails to generalise to unseen data.

At this step we set aside a quarter of all data and store it separately to test our model as soon as we have finished building it.

The training part consists of all the features that are present in the dataset. The appeal of neural networks is that the model sorts out which features to use on its own, so we don't necessarily have to perform feature engineering.

data = df.loc[:, ~df.columns.isin(['quality', 'wine type'])].values
target = df['quality'].values

X_train, X_test, y_train, y_test = train_test_split(data, target,
                                                    test_size=0.25)  # hold out a quarter of the data
X_train.shape, y_train.shape
((4872, 11), (4872,))
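The split sizes can be reproduced without the wine files. The sketch below uses arrays of zeros with the same number of rows (4872 training rows plus the 1625 test rows reported in this tutorial); random_state is my own addition to make the example repeatable.

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_rows = 4872 + 1625            # training + test rows reported in this tutorial
X = np.zeros((n_rows, 11))      # placeholder features, 11 columns as in the dataset
y = np.zeros(n_rows)            # placeholder target

# test_size=0.25 sets a quarter of the rows aside for testing.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
print(X_tr.shape, X_te.shape)
```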

In the first example of this tutorial we used the nn.Sequential container to build a neural network. It is a convenient wrapper, which unfortunately hides what is going on inside. This time we will build a neural network as a class that inherits from torch's nn.Module. In this case we call the parent constructor through super() so that nn.Module can set up the machinery that tracks the layers and their parameters.

We initialise two linear layers and override the forward() method of the nn.Module class, applying the sigmoid activation function to the output of the first layer. The architecture of the network may be confusing at this point. If so, look at the picture below for the network's structure.


Since we are solving a regression problem, the last neuron has no activation function, so its output is not limited to [0,1] like the output of the sigmoid function.

After we initialise the model we also initialise the loss function and the optimiser. We choose Mean Squared Error as our loss function, which is a classic choice in regression problems. For the optimiser we opt for Adam. An introduction to optimisers and gradient descent will be given in future tutorials.

In order to perform all calculations on the GPU, we use the cuda() method, which sends the network parameters to the GPU.

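class RegNN(nn.Module):
    def __init__(self):
        super(RegNN, self).__init__()
        self.l1 = nn.Linear(11,64)  # 11 input features -> 64 hidden neurons
        self.l2 = nn.Linear(64,1)   # 64 hidden neurons -> 1 output
    def forward(self,x):
        out = torch.sigmoid(self.l1(x))
        return self.l2(out)         # no activation on the output
model = RegNN().cuda()  # send the parameters to the GPU, where the data will live
criterion = nn.MSELoss()
optimiser = torch.optim.Adam(model.parameters(),lr=0.01)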

self.l1 is the first layer in our network. It is an nn.Linear layer with $11$ inputs and $64$ outputs. The number of inputs of the first layer equals the number of features in a sample, which is also the number of neurons in the input layer. The same logic applies to self.l2: it takes the $64$ outputs of the hidden layer as inputs and produces just $1$ output.

The training step is the same as in the first example. The only difference is that we don't use the DataLoader here, so we have to transform the NumPy arrays into torch tensors. This is done simply with the from_numpy() function. Overall, we train the model for 15000 epochs.
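The layer sizes translate directly into the number of trainable parameters. The sketch below rebuilds the two layers on their own and counts them; n_params is a helper written for this illustration.

```python
import torch.nn as nn

l1 = nn.Linear(11, 64)   # same sizes as self.l1
l2 = nn.Linear(64, 1)    # same sizes as self.l2

def n_params(layer):
    """Count the trainable numbers (weights and biases) in a layer."""
    return sum(p.numel() for p in layer.parameters())

# nn.Linear stores its weight matrix as (outputs, inputs).
print(tuple(l1.weight.shape), n_params(l1))   # 11*64 weights + 64 biases
print(tuple(l2.weight.shape), n_params(l2))   # 64*1 weights + 1 bias
```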

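loss_lst = []
X_torch = torch.from_numpy(X_train.astype('float32')).cuda()
y_torch = torch.from_numpy(y_train.astype('float32')).cuda()

for epoch in tqdm(range(15000)):
    optimiser.zero_grad()              # clear gradients from the previous epoch
    y_pred = model(X_torch)            # forward pass on the whole training set
    loss = criterion(y_pred, y_torch)
    loss.backward()                    # compute the gradients
    optimiser.step()                   # update the weights
    loss_lst.append(loss.item())       # keep the loss value for plotting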
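The astype('float32') call in the conversion is there for a reason: NumPy defaults to 64-bit floats, while the layers of a freshly created torch model hold 32-bit weights. The sketch below shows the dtypes on a made-up array, and also that from_numpy() shares memory with its source.

```python
import numpy as np
import torch

features = np.array([[7.4, 0.70], [7.8, 0.88]])   # NumPy defaults to float64
t64 = torch.from_numpy(features)
t32 = torch.from_numpy(features.astype('float32'))
print(t64.dtype, t32.dtype)

# from_numpy shares memory with the source array: editing one changes the other.
features[0, 0] = 0.0
print(t64[0, 0].item())   # reflects the edit
print(t32[0, 0].item())   # astype made a copy, so this value is unchanged
```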

As in our first example in the quick start section, we may want to visualise the loss function in order to see how the model learns.



The loss function value (mean squared error in our case) at the end of training is approximately $0.8$. That means that the model's root mean squared error is $\sqrt{0.8} \approx 0.89$, so a typical prediction is off from the true value by roughly $0.89$ quality points.

tensor(0.7689, device='cuda:0')
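The arithmetic behind that estimate can be checked directly. The sketch below takes the final loss printed above, and then uses three made-up errors to show that the root mean squared error is not the same thing as the mean absolute error: it weights large misses more heavily.

```python
import math

mse = 0.7689                  # final training loss printed above
rmse = math.sqrt(mse)
print(round(rmse, 3))

# Made-up errors: two perfect predictions and one miss by 2 points.
errors = [0.0, 0.0, 2.0]
toy_mae = sum(abs(e) for e in errors) / len(errors)
toy_rmse = math.sqrt(sum(e ** 2 for e in errors) / len(errors))
print(round(toy_mae, 3), round(toy_rmse, 3))
```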

Although the loss is quite low, the model does not demonstrate the expected behaviour. To see this we will plot true and predicted values on a single graph.

true = y_torch.cpu().numpy()
predicted = y_pred.detach().cpu().numpy()
plt.plot(true, predicted, 'ro');


Contrary to our expectations, the model is not nearly as accurate as the loss suggests. This is supported by the Pearson correlation coefficient R, which we can use to evaluate the quality of prediction in regression problems.

stat = pearsonr(true, predicted.flatten())
print("The R coefficient is {:.2} p-value {:.2}".format(*stat))
The R coefficient is -0.036 p-value 0.012
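When predictions barely correlate with the targets, one thing worth ruling out is a shape mismatch between the two tensors passed to the loss. The network above returns a column of shape (N, 1), while the targets have shape (N,). nn.MSELoss broadcasts such a pair to an N×N grid (PyTorch emits a warning about it), so the loss no longer compares each prediction with its own target. Whether this contributed to the result above is something I have not verified here; the sketch only shows the effect on toy numbers.

```python
import warnings
import torch
import torch.nn as nn

y_pred = torch.tensor([[1.0], [2.0], [3.0]])   # shape (3, 1), like the network output
y_true = torch.tensor([1.0, 2.0, 3.0])         # shape (3,), like the targets

criterion = nn.MSELoss()

# The predictions are perfect, yet the broadcast loss is far from zero,
# because every prediction is compared with every target.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")            # silence the size-mismatch warning
    broadcast_loss = criterion(y_pred, y_true).item()

# Flattening the predictions aligns the shapes and gives the expected zero.
aligned_loss = criterion(y_pred.view(-1), y_true).item()
print(broadcast_loss, aligned_loss)
```

Flattening the prediction with view(-1), as the classifier further down does in its forward() method, keeps the two shapes aligned.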

You might wonder why the model performed so poorly. If you refer to the description of our target variable, you will read that the quality of wine was based on subjective perception, which is hard to predict. For this reason, let's build a classifier for a slightly more objective thing: the type of wine.

data = df.loc[:, ~df.columns.isin(['quality', 'wine type'])].values
target = df['wine type'].map({'red': 1, 'white': 0}).values

X_train, X_test, y_train, y_test = train_test_split(data, target,
                                                    test_size=0.25)  # hold out a quarter of the data

The architecture of the classifier network is almost the same as that of the regression one, except for the last layer. This time it uses the sigmoid activation function, which limits the output to [0,1]. We will use the same optimiser, but for the loss function we now choose binary cross entropy, which is more suitable for a classification problem.

class ClaNN(nn.Module):
    def __init__(self):
        super(ClaNN, self).__init__()
        self.l1 = nn.Linear(11,64)
        self.l2 = nn.Linear(64, 1)
    def forward(self,x):
        out1 = torch.sigmoid(self.l1(x))
        out2 = torch.sigmoid(self.l2(out1))
        return out2.view(-1)
model = ClaNN().cuda()
criterion = nn.BCELoss()
optimiser = torch.optim.Adam(model.parameters(),lr=0.01)
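Binary cross entropy may look opaque, so here is a small sketch that computes it by hand for two made-up predictions and compares the result with nn.BCELoss. For each sample the loss is the negative log of the probability assigned to the true class, averaged over the batch.

```python
import math
import torch
import torch.nn as nn

probs = torch.tensor([0.9, 0.2])    # predicted probability of class 1 ("red")
labels = torch.tensor([1.0, 0.0])   # true classes: red, white

bce = nn.BCELoss()(probs, labels).item()

# By hand: -log(0.9) for the red sample, -log(1 - 0.2) for the white one.
manual = -(math.log(0.9) + math.log(1 - 0.2)) / 2
print(bce, manual)
```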

The training step has no changes.

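loss_lst = []
X_torch = torch.from_numpy(X_train.astype('float32')).cuda()
y_torch = torch.from_numpy(y_train.astype('float32')).cuda()

for epoch in tqdm(range(15000)):
    optimiser.zero_grad()              # clear gradients from the previous epoch
    y_pred = model(X_torch)            # forward pass on the whole training set
    loss = criterion(y_pred, y_torch)
    loss.backward()                    # compute the gradients
    optimiser.step()                   # update the weights
    loss_lst.append(loss.item())       # keep the loss value for plotting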

The model has been trained. Let's test it.

X_test_torch = torch.from_numpy(X_test.astype('float32'))
y_test_torch = torch.from_numpy(y_test.astype('float32'))

# The model lives on the GPU, so send the inputs there and bring the predictions back
y_pred = model(X_test_torch.cuda()).cpu()
loss = criterion(y_pred, y_test_torch)

print("The test loss is {:.2}".format(loss.item()))
The test loss is 0.098
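Evaluation is a place where device mismatches tend to surface, because the model sits on the GPU while freshly created tensors sit on the CPU. One way to avoid hard-coding cuda() is sketched below. The small nn.Sequential network and the random inputs are stand-ins I made up for the trained classifier and the test set.

```python
import torch
import torch.nn as nn

# Use the GPU when one is available, otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Stand-in for the trained classifier: same layer sizes, untrained weights.
clf = nn.Sequential(nn.Linear(11, 64), nn.Sigmoid(),
                    nn.Linear(64, 1), nn.Sigmoid()).to(device)

X_demo = torch.rand(5, 11)          # 5 fake samples with 11 features each

clf.eval()
with torch.no_grad():               # no gradients are needed for evaluation
    y_demo = clf(X_demo.to(device)).view(-1).cpu()

print(y_demo.shape, y_demo.device)
```

Moving the predictions back with cpu() keeps the later NumPy and scikit-learn calls working on either kind of machine.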

The accuracy score on the test set looks excellent.

acc = accuracy_score(y_test_torch.numpy().astype('int'),
                            [1 if i>.5 else 0 for i in y_pred.detach().numpy().astype('int')])

print("The accuracy of the prediction is {:.2}".format(acc))
The accuracy of the prediction is 0.91
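One detail in the thresholding step deserves a closer look. The list comprehension above casts the probabilities to integers before comparing them with 0.5, and an integer cast truncates towards zero, so every probability below 1.0 becomes 0. The sketch below compares the two orders of operations on made-up probabilities; it suggests that thresholding first is the safer variant, and that casting first may understate how many samples the model assigns to the positive class.

```python
import numpy as np

probs = np.array([0.2, 0.7, 0.99, 1.0], dtype='float32')   # made-up sigmoid outputs

# Cast first, threshold second: only an exact 1.0 survives the cast.
cast_first = [1 if p > .5 else 0 for p in probs.astype('int')]

# Threshold first, cast second: everything above 0.5 counts as class 1.
threshold_first = (probs > 0.5).astype('int').tolist()

print(cast_first, threshold_first)
```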

Nevertheless, the accuracy score is not a perfect metric for estimating the performance of the model, especially with unbalanced classes in the dataset, which is the case here. You can see that the model fails to identify a lot of red wines correctly, as its recall for the red class shows.

print(classification_report(y_test_torch.numpy().astype('int'),
                            [1 if i>.5 else 0 for i in y_pred.detach().numpy().astype('int')],
                            target_names=['white', 'red']))
              precision    recall  f1-score   support

       white       0.89      1.00      0.94      1191
         red       1.00      0.67      0.80       434

   micro avg       0.91      0.91      0.91      1625
   macro avg       0.95      0.84      0.87      1625
weighted avg       0.92      0.91      0.91      1625
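The gap between accuracy and recall on unbalanced classes can be reproduced on ten made-up labels. In the sketch below, 0 stands for white and 1 for red, and the "model" misses one of the two red wines.

```python
from sklearn.metrics import accuracy_score, recall_score

# Made-up labels: 8 white wines (0) and 2 red wines (1).
y_true = [0] * 8 + [1] * 2
# Made-up predictions: every white is right, one of the two reds is missed.
y_hat = [0] * 8 + [1, 0]

acc = accuracy_score(y_true, y_hat)
red_recall = recall_score(y_true, y_hat, pos_label=1)
print(acc, red_recall)
```

A single mistake barely moves the accuracy, yet it halves the recall of the minority class, which is why the per-class report is worth reading.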

6: Post Scriptum

Here are the functions I used in the MNIST example for data loading and processing. The load_MNIST function downloads the MNIST dataset to your disk with torchvision.datasets, normalises the image arrays and wraps them in torch's DataLoader. The latter is a useful tool that takes care of batching your data when you train neural networks.

The drag_image function picks an image array from the DataLoader object generated in the previous step. The loss is plotted with the plot_loss function.

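def load_MNIST():
    # Reconstructed sketch: the ToTensor step, the train/download flags and the
    # DataLoader calls are assumptions based on the description above.
    transform = torchvision.transforms.Compose([
        torchvision.transforms.ToTensor(),
        torchvision.transforms.Normalize((0.5,), (0.5,))])
    mnist_trainset = datasets.MNIST(root='./data', train=True, download=True,
                                    transform=transform)
    mnist_testset = datasets.MNIST(root='./data', train=False, download=True,
                                   transform=transform)
    train_loader = torch.utils.data.DataLoader(mnist_trainset, batch_size=32)
    test_loader = torch.utils.data.DataLoader(mnist_testset, batch_size=32)
    return train_loader, test_loader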

def drag_image(test_loader):
    dataiter = iter(test_loader)
    images, labels = next(dataiter)  # take the first batch from the loader
    return images[0]

def plot_loss(loss_lst):
    plt.plot(loss_lst)  # line plot of the recorded loss values
    plt.xlabel('N_Epochs', fontsize=18)
    plt.ylabel('Loss', fontsize=18)
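To see what a DataLoader yields without downloading MNIST, here is a sketch that wraps random tensors of the same shape as MNIST images. The 100 fake samples and their labels are made up for this illustration.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# 100 fake grayscale "images" of 28x28 pixels with random digit labels.
images = torch.randn(100, 1, 28, 28)
labels = torch.randint(0, 10, (100,))

loader = DataLoader(TensorDataset(images, labels), batch_size=32)

# The loader cuts the data into batches; the last one holds the remainder.
batch_sizes = [len(batch_images) for batch_images, _ in loader]
print(batch_sizes)

# Each iteration yields a pair of tensors: the images and their labels.
first_images, first_labels = next(iter(loader))
print(first_images.shape, first_labels.shape)
```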