MNIST Handwritten Digits Recognition using scikit-learn
A simple approach to the handwritten digits recognition system using Machine Learning (SVM and KNN)
Handwriting Recognition
Recognizing handwritten text is a problem that dates back to the first automatic machines that needed to recognize individual characters in handwritten documents. Consider, for example, the ZIP codes on letters at the post office and the automation needed to read their five digits. Accurate recognition of these codes is necessary to sort mail automatically and efficiently. Another application is OCR (Optical Character Recognition) software, which must read handwritten text or pages of printed books and convert them into electronic documents in which each character is well defined.
MNIST Dataset
MNIST stands for Modified National Institute of Standards and Technology.
It is a dataset of 70,000 small square 28×28 pixel grayscale images of handwritten single digits between 0 and 9.
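To make the image layout concrete, here is a minimal sketch of how a 28×28 image maps to a flat row of 784 values, which is the form used for training later in this post. The image is synthetic (a hypothetical stand-in, not a real MNIST sample), and the sketch assumes pixel intensities are integers from 0 to 255.

```python
import numpy as np

# Hypothetical stand-in for one MNIST image: 28x28 grayscale, values 0..255.
image = np.zeros((28, 28), dtype=np.uint8)
image[4:24, 13:15] = 255  # draw a crude vertical stroke, roughly a "1"

# Flatten row by row into a 784-value feature vector.
flat = image.reshape(-1)

# Reshape back to 2-D to recover the picture for display.
restored = flat.reshape(28, 28)

print(flat.shape)
print(restored.shape)
```

Flattening loses no information: reshaping the 784 values back to 28×28 restores the original picture exactly.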
Problem Statement
The task is to classify a given image of a handwritten digit into one of 10 classes representing the integer values from 0 to 9, inclusive.
Step by Step Process for Handwritten Digits Recognition
Step 1: Import necessary libraries.
- sklearn.datasets provides loaders for many datasets used to build and test ML models
- sklearn.metrics for calculating accuracy and precision
- sklearn.neighbors for the KNN algorithm
- sklearn.svm for the SVM algorithm
- numpy for numerical calculations
- matplotlib.pyplot for plotting the digit images
from sklearn.datasets import fetch_openml
from sklearn.metrics import accuracy_score, classification_report
from sklearn.neighbors import KNeighborsClassifier
import numpy as np
import matplotlib.pyplot as plt
from sklearn import svm
Step 2: Fetching data from Sklearn datasets
# as_frame=True returns pandas objects, which the later steps rely on
mnist = fetch_openml('mnist_784', as_frame=True)
Step 3: Data understanding
After loading the dataset, you can analyze its content. First, you can read a lot of information about the dataset by printing the DESCR attribute.
print(mnist.DESCR)
This prints a textual description of the dataset, including the authors who contributed to its creation and the references.
mnist.data
mnist.target.shape
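Beyond the shape, it helps to check how many samples each digit class has. The sketch below uses a short hypothetical label array instead of the real targets, and assumes the labels arrive as strings (as OpenML targets often do), so they are converted to integers first.

```python
import numpy as np

# Hypothetical labels, mimicking the string form OpenML targets often use.
labels = np.array(["5", "0", "4", "1", "9", "2", "1", "3", "1", "4"])

# Convert to integers so they can be compared and sorted numerically.
labels_int = labels.astype(np.int64)

# Count how many samples belong to each digit class.
classes, counts = np.unique(labels_int, return_counts=True)
class_counts = dict(zip(classes.tolist(), counts.tolist()))
print(class_counts)
```

A roughly even count per class suggests that plain accuracy is a reasonable metric for the task.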
Step 4: Visualizing the handwritten digits using Matplotlib
mnist.data stores each image as a flat row of 784 pixel values. To display an image, we convert the DataFrame into a NumPy array and reshape each row into 28x28.
image = mnist.data.to_numpy()
plt.subplot(431)
plt.imshow(image[0].reshape(28,28), cmap=plt.cm.gray_r, interpolation='nearest')
plt.subplot(432)
plt.imshow(image[1].reshape(28,28), cmap=plt.cm.gray_r, interpolation='nearest')
plt.subplot(433)
plt.imshow(image[3].reshape(28,28), cmap=plt.cm.gray_r, interpolation='nearest')
plt.subplot(434)
plt.imshow(image[4].reshape(28,28), cmap=plt.cm.gray_r, interpolation='nearest')
plt.subplot(435)
plt.imshow(image[5].reshape(28,28), cmap=plt.cm.gray_r, interpolation='nearest')
plt.subplot(436)
plt.imshow(image[6].reshape(28,28), cmap=plt.cm.gray_r, interpolation='nearest')
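As an alternative to one subplot call per digit, several flat images can be tiled into a single 2-D array and drawn with one imshow call. The helper below is a hypothetical sketch (make_grid is not part of Matplotlib or scikit-learn), and it uses stand-in data rather than real MNIST rows.

```python
import numpy as np

def make_grid(flat_images, rows, cols, side=28):
    """Tile rows*cols flattened images into one 2-D mosaic array."""
    needed = rows * cols
    # Reshape each flat row into a side x side picture.
    pictures = np.asarray(flat_images)[:needed].reshape(rows, cols, side, side)
    # Reorder axes to (rows, side, cols, side), then merge into one big image.
    return pictures.transpose(0, 2, 1, 3).reshape(rows * side, cols * side)

# Hypothetical stand-in data: six flat "images", each filled with its own index.
fake_images = np.repeat(np.arange(6), 784).reshape(6, 784)
grid = make_grid(fake_images, rows=2, cols=3)
print(grid.shape)
```

With real data, the resulting array could be passed to plt.imshow with the same gray_r colormap used above.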
Step 5: Dividing the dataset into training data and test data.
# Shuffle all 70,000 rows so the split is random
index_number = np.random.permutation(70000)
x1, y1 = mnist.data.loc[index_number], mnist.target.loc[index_number]
x1.reset_index(drop=True, inplace=True)
y1.reset_index(drop=True, inplace=True)
# First 55,000 samples for training, remaining 15,000 for testing
x_train, x_test = x1[:55000], x1[55000:]
y_train, y_test = y1[:55000], y1[55000:]
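The shuffle above produces a different split on every run, so the reported accuracy can vary slightly. A minimal sketch of the same shuffle-then-cut idea with a fixed seed is shown below; shuffled_split is a hypothetical helper and the toy arrays stand in for the MNIST data.

```python
import numpy as np

def shuffled_split(features, labels, n_train, seed=0):
    """Shuffle features and labels together, then cut at n_train."""
    rng = np.random.default_rng(seed)       # seeded for repeatability
    order = rng.permutation(len(features))  # one shared random ordering
    features, labels = features[order], labels[order]
    return features[:n_train], features[n_train:], labels[:n_train], labels[n_train:]

# Hypothetical toy data: 10 samples whose label equals the feature value.
X = np.arange(10).reshape(10, 1)
y = np.arange(10)
X_tr, X_te, y_tr, y_te = shuffled_split(X, y, n_train=8, seed=42)
print(X_tr.shape, X_te.shape)
```

Using one shared ordering for both arrays keeps every sample paired with its own label after shuffling.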
Step 6: Implementing the SVM algorithm and calculating its accuracy.
svc = svm.SVC(gamma='scale', class_weight='balanced', C=100)
svc.fit(x_train, y_train)
result = svc.predict(x_test)
print('Accuracy :', accuracy_score(y_test, result))
print(classification_report(y_test, result))
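SVC uses the RBF kernel by default, and gamma='scale' sets the kernel width from the data. The sketch below illustrates both on hypothetical toy features, assuming the formula given in the scikit-learn documentation, gamma = 1 / (n_features * X.var()). The two helper functions are illustrative and not part of scikit-learn.

```python
import numpy as np

def scale_gamma(X):
    """gamma='scale' as documented by scikit-learn: 1 / (n_features * X.var())."""
    return 1.0 / (X.shape[1] * X.var())

def rbf_kernel_value(a, b, gamma):
    """RBF similarity: exp(-gamma * squared Euclidean distance)."""
    return np.exp(-gamma * np.sum((a - b) ** 2))

# Hypothetical toy features: 4 samples, 2 features each.
X = np.array([[0.0, 0.0], [0.0, 2.0], [2.0, 0.0], [2.0, 2.0]])
gamma = scale_gamma(X)

same = rbf_kernel_value(X[0], X[0], gamma)  # identical points
near = rbf_kernel_value(X[0], X[1], gamma)  # distance 2
far = rbf_kernel_value(X[0], X[3], gamma)   # distance 2 * sqrt(2)
print(gamma, same, near, far)
```

The similarity is 1 for identical points and decays toward 0 as points move apart, which is how the kernel lets the SVM draw non-linear boundaries between digit classes.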
Step 7: Implementing the KNN algorithm and calculating its accuracy.
knn = KNeighborsClassifier(n_neighbors=6, weights='distance')
knn.fit(x_train, y_train)
# Predict on data the model has not seen before
result = knn.predict(x_test)
print('Accuracy :', accuracy_score(y_test, result))
print(classification_report(y_test, result))
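To show what weights='distance' means, here is a from-scratch sketch of distance-weighted voting, where each neighbor votes with weight 1/distance. It is a simplified illustration on hypothetical 2-D toy data, not scikit-learn's actual implementation, and knn_predict is a made-up helper name.

```python
import numpy as np

def knn_predict(train_X, train_y, query, k=3):
    """Predict one label by distance-weighted voting among k nearest points."""
    # Euclidean distance from the query to every training point.
    distances = np.linalg.norm(train_X - query, axis=1)
    nearest = np.argsort(distances)[:k]

    # An exact match would give infinite weight, so return its label directly.
    if distances[nearest[0]] == 0:
        return train_y[nearest[0]]

    # Each neighbor votes with weight 1/distance; closer points count more.
    votes = {}
    for idx in nearest:
        label = train_y[idx]
        votes[label] = votes.get(label, 0.0) + 1.0 / distances[idx]
    return max(votes, key=votes.get)

# Hypothetical toy data: one point of class 0, three points of class 1.
train_X = np.array([[0.0, 0.0], [3.0, 0.0], [3.0, 1.0], [4.0, 0.0]])
train_y = np.array([0, 1, 1, 1])

close_to_zero = knn_predict(train_X, train_y, np.array([0.5, 0.0]))
close_to_one = knn_predict(train_X, train_y, np.array([3.2, 0.2]))
print(close_to_zero, close_to_one)
```

For the first query, two of the three nearest neighbors belong to class 1, yet the prediction is class 0 because the single class 0 point is much closer. A plain majority vote would have chosen class 1.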
Conclusion:
In this post, we built a simple system that recognizes handwritten digits using two machine learning algorithms, SVM and KNN. In our run, KNN reached an accuracy of 97.34% and SVM (RBF kernel) reached 98.52%. KNN is faster than SVM, although SVM is more accurate.
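To check the speed and accuracy trade-off on your own machine, a rough sketch like the following can time both models. It assumes scikit-learn's small bundled load_digits dataset (8x8 images) as a quick stand-in, so the timings and accuracies will differ from those on the full MNIST data.

```python
import time

from sklearn import svm
from sklearn.datasets import load_digits
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Assumption: load_digits is a small stand-in; timings on MNIST will differ.
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Same estimator settings as in Steps 6 and 7.
models = {
    "SVM": svm.SVC(gamma="scale", class_weight="balanced", C=100),
    "KNN": KNeighborsClassifier(n_neighbors=6, weights="distance"),
}

results = {}
for name, model in models.items():
    start = time.perf_counter()
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    elapsed = time.perf_counter() - start  # fit + predict wall-clock time
    results[name] = (accuracy_score(y_test, predictions), elapsed)

for name, (accuracy, seconds) in results.items():
    print(f"{name}: accuracy={accuracy:.4f}, time={seconds:.3f}s")
```

Timing fit and predict together matters here because KNN does almost no work during fit and most of its work during predict, while SVM is the opposite.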