Thursday, May 19, 2005

Partner the nation on the road to IT empowerment

CDAC has advertised in national dailies inviting software developers to join them in putting together language "Tools / Technologies / Resources" in all 22 official languages of India. They invite discussions from potential partners in the following areas: (their language)
  • Fonts - True Type Fonts and Open Type Fonts
  • Keyboard Drivers
  • Font Encoding Converters
  • Storage Code Converters
  • Open Office (Word Processor, Presentation Tool, Drawing Tool, Spreadsheet)
  • Browser
  • E-Mail Client
  • OCR's - Optical Character Recognition
  • Spell Checker
  • Dictionary
  • Thesaurus
  • Keyboard Typing Tutor
  • Language Learning Tool
  • TTS - Text to Speech System
  • ASR - Automatic Speech Recognition
  • Corpora - suitable for developer community
  • Machine Assisted Translation Systems
  • Braille Utilities and others

On April 15th, in a well-advertised function in Chennai, Dayanidhi Maran launched a software pack, to be distributed free to the public, containing a host of Tamil fonts and software. The public was not allowed into the function for "safety considerations". The only participants were reporters from the media, both print and electronic. There were glowing tributes in an editorial in The Hindu.

The CD contained a motley collection of fonts, Tamilized open source software, and a few proprietary programs purchased from local developers.

To start with, fonts from three companies were bundled along with fonts that CDAC itself had produced. Where was the need to purchase fonts from local companies when the Indian Government-funded CDAC already had a sufficient number of fonts in its hands? It appears that each of the three companies was paid close to Rs. 4 lakhs for its fonts. The fonts were in multiple encodings: Unicode as well as the Tamil Nadu Government standards TAM and TAB. One of the private companies had not provided a single Unicode font. No keyboard drivers were made available to generate Tamil characters. [This apparently has been fixed subsequently, and is rumoured to have cost close to Rs. 10 lakhs, when there are free options available.] In any case, Microsoft makes Indic input methods (IMs) for a few languages available from its site for Windows XP and 2000, if one is interested in Unicode.
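Why the mix of encodings matters can be shown with a small sketch of a font encoding converter. A font-specific encoding assigns its own byte values to Tamil glyphs, so moving text to Unicode takes a lookup table and, for vowel signs drawn to the left of their consonant, a reordering step. The byte values below are invented for illustration and are not the real TAM or TAB assignments; the visual-order storage of left-side vowel signs is likewise an assumption about how such glyph encodings typically work, and the function name is hypothetical.

```python
# HYPOTHETICAL table: these byte values are invented for illustration
# and are NOT the real TAM or TAB code assignments.
TOY_LEGACY_TO_UNICODE = {
    0xA1: "\u0ba4",  # TAMIL LETTER TA
    0xA2: "\u0bae",  # TAMIL LETTER MA
    0xA3: "\u0bbf",  # TAMIL VOWEL SIGN I
    0xA4: "\u0bb4",  # TAMIL LETTER LLLA
    0xA5: "\u0bcd",  # TAMIL SIGN VIRAMA (pulli)
    0xA6: "\u0bc6",  # TAMIL VOWEL SIGN E (drawn to the left of its consonant)
}

# Vowel signs that are drawn before the consonant. Unicode stores them
# AFTER the consonant; this sketch assumes the legacy text stores them before.
PREFIX_SIGNS = {"\u0bc6", "\u0bc7", "\u0bc8"}  # E, EE, AI


def legacy_to_unicode(data: bytes, table=TOY_LEGACY_TO_UNICODE) -> str:
    """Convert toy legacy-encoded bytes to a Unicode string."""
    out = []
    pending = None  # a left-side vowel sign waiting for its consonant
    for b in data:
        # Bytes below 0x80 are treated as plain ASCII; others go through the table.
        ch = chr(b) if b < 0x80 else table[b]  # KeyError on an unmapped byte
        if ch in PREFIX_SIGNS:
            pending = ch
            continue
        out.append(ch)
        if pending is not None:
            # Simplification: attach the held sign after the next character.
            out.append(pending)
            pending = None
    if pending is not None:
        out.append(pending)  # dangling sign at end of input: keep it as-is
    return "".join(out)


word = legacy_to_unicode(bytes([0xA1, 0xA2, 0xA3, 0xA4, 0xA5]))
print(word)
```

A real converter would need the published code table for each legacy encoding and would also have to handle two-part vowel signs, which this sketch ignores.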

However, the Indian Government has still not settled on Unicode. Parliament websites and daily transcripts of the Parliament debates still use some fancy custom encoding for the Nagari script. The Tamil Nadu Government continues to use TAM and TAB. The Election Commission's rolls are in yet another font. For Tamil Nadu, the electoral rolls are in Tamil but in PDF format, with no indication of which encoding was used for the content. It is impossible to search them. CDAC would do well to first commit to Unicode and only Unicode, and start by forcing every Government website to follow Unicode strictly. Then they would not have to worry about distributing font packs. At least one Unicode font comes with the Windows operating system. CDAC, if they want, can then offer their own font pack (in Unicode). There is no need to invite private companies to offer their fonts and waste money on them.
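The searchability point is easy to demonstrate. Once text is stored as Unicode, Tamil characters have fixed code points in the Tamil block (U+0B80 to U+0BFF), and ordinary string search works without knowing anything about the font. The following is a minimal sketch; the function names and the sample line are made up for illustration.

```python
import unicodedata

# The Unicode Tamil block spans U+0B80..U+0BFF.
TAMIL_BLOCK = range(0x0B80, 0x0C00)


def contains_tamil(text: str) -> bool:
    """True if any character of text lies in the Unicode Tamil block."""
    return any(ord(ch) in TAMIL_BLOCK for ch in text)


def find_all(haystack: str, needle: str) -> list:
    """Return every index at which needle occurs in haystack.

    Both strings are normalized to NFC first so that equivalent
    code point sequences compare equal.
    """
    h = unicodedata.normalize("NFC", haystack)
    n = unicodedata.normalize("NFC", needle)
    positions = []
    start = h.find(n)
    while start != -1:
        positions.append(start)
        start = h.find(n, start + 1)
    return positions


# The word "Tamil" written in Tamil script: TA, MA, vowel sign I, LLLA, pulli.
word = "\u0ba4\u0bae\u0bbf\u0bb4\u0bcd"
line = "Electoral roll: " + word + " / " + word  # made-up sample text

print(contains_tamil(line))
print(find_all(line, word))
```

With a font-specific encoding, the same search only works if the searcher happens to know which font's byte values were used, which is exactly the problem with the electoral roll PDFs.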

Besides the fonts, the Tamil CD offered a bunch of open source software: Tamilized Firefox (worked on by open source volunteer Muguntharaj in Malaysia), Tamilized Open Office (CDAC had earlier worked on this, but it has subsequently been managed by Muguntharaj and Evolution at http://ta.openoffice.org/ ), and Tamilized Columba, a Java-based email client. The poor folks at Columba seem to believe that CDAC and Dayanidhi Maran will take Columba to 3 million users by distributing the software through various avenues. I would myself suggest taking up Thunderbird instead, to avoid having bulky Java running in the background.

Besides the open source applications, there were other software programs: a spell checker developed by AU-KBC, a dictionary from Palaniappa Brothers, a Tamil OCR from LearnFun Systems, and a few bits and pieces such as nursery rhymes.

The OCR is a fantastic piece of software, unquestionably the best product on the CD that comes to you for free. It is not as easy or intuitive as a typical English OCR, but it nevertheless works, and works well if your scan is decent. The dictionary is an awful piece of software. It doesn't even allow cut and paste. The spell checker from AU-KBC, a research center that is part of the Madras Institute of Technology, is stand-alone software and cannot be used along with any other software. LearnFun Systems, for example, produces a much better Tamil dictionary which comes as a plugin (which I use, by the way) and can be used along with Microsoft Word, Adobe PageMaker etc. Even this is not good enough, as it cannot understand Unicode at this stage.

In all, the CD from CDAC was more of a joke than a worthwhile effort. April 15, the Tamil New Year's Day, and the accompanying political mileage - the CD had on its cover pictures of Sonia Gandhi, DMK's Karunanidhi, Manmohan Singh and Dayanidhi Maran - were the only reasons why a half-baked product was released.

To this day, one month later, nobody in the public has gotten hold of the software CD. Those who requested the CD from CDAC are still waiting for the post. (I have a copy of the CD and can make further copies for anyone who is interested.)

-*-

Now, on to the discussion CDAC wants for other languages. If the aim is to gain some political mileage, it is not worth our time and effort. However, if CDAC is genuine, then what is required is a simple, easy-to-install CD for each language. There is no need at this stage for OCR, TTS, ASR and MATS! Instead, provide a good dictionary (English-Local Language-English at the least), a good spell checker and a keyboard driver (bundle the one from Microsoft by getting their permission; it is after all free), focus only on Unicode, localise Firefox, Thunderbird and Open Office, and ship a good installer for Windows 98 and Windows XP.

This is what is required.

2 comments:

  1. Badri: Could you tell me how you generate Tamil text in Unicode?

  2. I am in the field of Siddha medicine and in need of a professional Tamil OCR to reproduce endangered works of the Siddhars. It is a dedicated effort by students of the National Institute of Siddha, Chennai-47. Please reply.
