UID Innovation Lab

Release of Open Data and Algorithms

At UID Innovation Labs, our multidisciplinary team is doing a lot of R&D to clean up and de-duplicate databases, and to research biometrics: photographs, as well as field usage of fingerprints and iris scans.

We are happy to release some (masked/anonymised) data as well as algorithms into the public domain, so that other Government Departments, as well as researchers in these areas, can make use of this knowledge.

SET 1: OpenCV-based work on Facial Images and Vehicle Number Plate Images

Please see the detailed paper as well as the corresponding XML data.


The UID Innovation Center worked with more than 6.28 crore (62.8 million) resident names in English and Marathi. The English and Marathi names were cleaned by removing white space and special characters, and by rejecting records with ambiguous characters. See Code used for Cleaning.
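The cleaning step described above can be illustrated with a minimal Python sketch. This is not the released cleaning code. The function names `clean_name` and `clean_record` are hypothetical, and the rule used here for "ambiguous characters" (any character outside the expected script) is an assumption made for illustration only.

```python
import re
import unicodedata

# Assumption: a clean English name holds only ASCII letters and single spaces.
ENGLISH_OK = re.compile(r"[A-Za-z ]+")
# Assumption: a clean Marathi name holds only Devanagari (U+0900 to U+097F) and spaces.
MARATHI_OK = re.compile(r"[\u0900-\u097F ]+")


def clean_name(raw):
    """Drop special characters and collapse runs of white space."""
    text = unicodedata.normalize("NFC", raw)
    # Keep letters (category L*), combining marks (M*, needed for
    # Devanagari vowel signs) and white space; drop everything else.
    kept = "".join(
        ch for ch in text
        if ch.isspace() or unicodedata.category(ch)[0] in "LM"
    )
    return " ".join(kept.split())


def clean_record(english, marathi):
    """Return the cleaned (english, marathi) pair, or None to reject the record."""
    en = clean_name(english)
    mr = clean_name(marathi)
    # Reject empty names and names containing characters from an unexpected script.
    if not ENGLISH_OK.fullmatch(en) or not MARATHI_OK.fullmatch(mr):
        return None
    return en, mr
```

For example, `clean_name("  Ramesh   K. Patil ")` gives `"Ramesh K Patil"`, while a record whose Marathi field contains Latin letters is rejected. Deleting special characters outright, rather than replacing them with a space, is a design choice of this sketch; it can merge words that were separated only by punctuation.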

The clean database of names was then broken into words by separating first name, last name, middle name, etc. The commonly used words were then grouped and a count was generated for each. These 15.66 crore (156.6 million) pairs of English and Marathi equivalents were grouped into 32 lakh (3.2 million) unique pairs, each with a frequency count. As a further cross-check, each Marathi word was transliterated to English using the GIST tool provided by CDAC. The English transliteration was then compared with the original English word through fuzzy matching, using Approximate Matching of Indian Names (AMIN). See Code of AMIN, which is based on careful analysis of Indian names and surnames, and is a significant improvement over existing algorithms of this kind.

This validated, quite clean English-Marathi dictionary of names and surnames, containing about 3,00,000 (300,000) pairs with frequency counts and covering 96% of total words, can be useful to many programmers and researchers working in this area. The dictionary can also help in implementing auto-suggest and auto-complete solutions in various applications, as well as in transliteration from English to Marathi and vice versa. We are also releasing an Excel sheet containing the top 100 names of every decade.
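The grouping and cross-check steps described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the helper names are hypothetical, words are paired by position, the transliterator is passed in as a plain function standing in for the GIST tool, and the similarity measure is `difflib.SequenceMatcher` from the standard library, used here as a generic substitute for AMIN (it does not reproduce AMIN's rules for Indian names). The 0.8 threshold is likewise an arbitrary value chosen for the example.

```python
from collections import Counter
from difflib import SequenceMatcher


def word_pairs(english_name, marathi_name):
    """Pair words by position; skip records whose word counts differ."""
    en_words = english_name.upper().split()
    mr_words = marathi_name.split()
    if len(en_words) != len(mr_words):
        return []
    return list(zip(en_words, mr_words))


def count_pairs(records):
    """Group identical (English, Marathi) word pairs with a frequency count."""
    counts = Counter()
    for english_name, marathi_name in records:
        counts.update(word_pairs(english_name, marathi_name))
    return counts


def similarity(a, b):
    """Generic string similarity in [0, 1]; a stand-in for AMIN, not AMIN itself."""
    return SequenceMatcher(None, a.upper(), b.upper()).ratio()


def validate(counts, transliterate, threshold=0.8):
    """Keep pairs whose transliterated Marathi word is close to the English word."""
    return {
        pair: n for pair, n in counts.items()
        if similarity(pair[0], transliterate(pair[1])) >= threshold
    }


# Hypothetical sample data; the lookup table stands in for a real transliterator.
records = [
    ("Ramesh Patil", "रमेश पाटील"),
    ("Suresh Patil", "सुरेश पाटील"),
    ("Ganesh", "रमेश"),        # mismatched pair, should fail validation
    ("Ramesh", "रमेश पाटील"),  # word counts differ, skipped
]
fake_translit = {"रमेश": "Ramesh", "पाटील": "Patil", "सुरेश": "Suresh"}

counts = count_pairs(records)
validated = validate(counts, lambda word: fake_translit.get(word, ""))
```

In this sample, the pair (`PATIL`, `पाटील`) is counted twice and survives validation, while (`GANESH`, `रमेश`) is dropped because the transliteration `Ramesh` is not close enough to `Ganesh` at the chosen threshold.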