Handwritten Kannada Alphabets Image Dataset
Introduction
One of the Dravidian language spoken majorly by 60 million people in and around Karnataka state of India is known as Kannada. It is one among 22 scheduled languages of India. Kannada langauge is written in Kannada script which has its traces back from kadamba script (325-550 AD). There are many languages which were used centuries back and aren’t being used currently whereas Kannada is one such language which is used even today for writing official documents and are being taught at schools which means it is going to be for many years.
Kannada script has 13 vowels, 34 consonants, 2 other symbols and 10 numericals. Vinay Prabhu and team have worked on Kannada numbers by developing a novel image dataset for 10 numericals . For more info please refer the following url. Our focus is towards developing image database for alphabets.
Implementation procedure
Students of the Center for Artificial Intelligence from M S Ramaiah Institute of Technology, Bangalore along with a faculty member made students to write by hand. Students were provided a A4 paper with grids of equal spacing (5 rows, 10 columns) drawn over it. Each student wrote every alphabets 10 time and 72 students were involved. Sample student submission copy is shown in below figure. Students were asked to write in every possible direction, orientation and various strokes. The plan was to scan in same pattern and have the same resolution but because of COVID-19 before the student could submit they had to go back to their homes. So every student submission was different and we had to perform lot of preprocessing.
Preprocessing
The first stage was to transform snap each alphabet and accumulate every alphabet as a single image in different places. Various preprocessing techniques were used for different kinds of images. As each and every students submission was different, automating was difficult.
In the following images we show some of the processing performed on raw images.
Sample Images
Sample images from the dataset after preprocessing.
Dataset statistics
The dataset is divided into two subparts: vowels and consonants. There are total of 10 vowels and each having minimum of 100 images with deviation of 50 images. Similarly 21 consonants has minimum of 200 images with deviation of 50 images.
Syntax | Description |
---|---|
Header | Title |
Paragraph | Text |
How to obtain this dataset
This dataset can be accessed by contacting Kusumika Krori Dutta. Please send your email at kusumika@msrit.edu