Handwritten Kannada Alphabets Image Dataset

Sunny Arokia Swamy Bellary, Kusumika Dutta

Sep 1, 2020 Machine Learning, Computer Vision

Introduction

One of the Dravidian language spoken majorly by 60 million people in and around Karnataka state of India is known as Kannada. It is one among 22 scheduled languages of India. Kannada langauge is written in Kannada script which has its traces back from kadamba script (325-550 AD). There are many languages which were used centuries back and aren’t being used currently whereas Kannada is one such language which is used even today for writing official documents and are being taught at schools which means it is going to be for many years.

Kannada script has 13 vowels, 34 consonants, 2 other symbols and 10 numericals. Vinay Prabhu and team have worked on Kannada numbers by developing a novel image dataset for 10 numericals . For more info please refer the following url. Our focus is towards developing image database for alphabets.

Implementation procedure

Students of the Center for Artificial Intelligence from M S Ramaiah Institute of Technology, Bangalore along with a faculty member made students to write by hand. Students were provided a A4 paper with grids of equal spacing (5 rows, 10 columns) drawn over it. Each student wrote every alphabets 10 time and 72 students were involved. Sample student submission copy is shown in below figure. Students were asked to write in every possible direction, orientation and various strokes. The plan was to scan in same pattern and have the same resolution but because of COVID-19 before the student could submit they had to go back to their homes. So every student submission was different and we had to perform lot of preprocessing.

Preprocessing

The first stage was to transform snap each alphabet and accumulate every alphabet as a single image in different places. Various preprocessing techniques were used for different kinds of images. As each and every students submission was different, automating was difficult.

In the following images we show some of the processing performed on raw images.

Some of the preprocess techniques implemented. Left column is raw image and middle column is processed image and last column is technique used

Sample Images

Sample images from the dataset after preprocessing.

Dataset statistics

The dataset is divided into two subparts: vowels and consonants. There are total of 10 vowels and each having minimum of 100 images with deviation of 50 images. Similarly 21 consonants has minimum of 200 images with deviation of 50 images.

Syntax	Description
Header	Title
Paragraph	Text

How to obtain this dataset

This dataset can be accessed by contacting Kusumika Krori Dutta. Please send your email at kusumika@msrit.edu

kannada-dataset Dataset Artificial Intelligence Computer Vision

Sunny Arokia Swamy Bellary

Engineer/Scientist II

Kusumika Dutta

Assistant Professor

Kusumika Krori Dutta, working as Assistant professor in Electrical and Electronics Engineering Department , MSRIT, Bangalore, India. She completed B.E (Electrical Engg) , M.Sc(Engg) by research and pursuing PhD in investigation on neurological disorder. She believes in lifelong learning and is passionate to explore different socio-cultural needs and their solutions through her indagation. Her love towards research, aided her to invent a mathematical formula of pattern based multiplication and patented many products having social utility