The University of Edinburgh (United Kingdom) and the Indian government’s Technology Development for Indian Languages (TDIL) programme have released a Hindi-Punjabi machine translation system developed by Mr. Ajit Singh, a faculty member from The Punjabi University. Mr. Singh is an assistant professor at the MM Modi College. He received assistance from, and collaborated with, the university’s Computer Science department. Both organizations have recently put the software to work at their sites.
Vishal Goyal, assistant professor at the department of Computer Science at the Punjabi University, commented: “This software has been made available online on the servers of the Edinburgh University and the TDIL.” He put the software’s accuracy at approximately 94%, a much higher figure than for English-based systems.
Vice-chancellor Jaspal Singh from The Punjabi University congratulated the department and praised the faculty’s efforts, which had brought international recognition to the University.
Initially, Mr. Singh installed both the Moses package and the Web server on a single machine running Linux. He then tested both Moses and the Web server on the local host and checked that the system worked on the University’s hardware.
However, only part of the system worked: for most input, the text did not seem to be translated. Mr. Singh discovered this only after installing the system on the Web server for public use. The problem was that the input text was displayed in transliterated form by the post-processing script, transliterate.pl.
The development of this Hindi-Punjabi Machine Translation System is unusual because it uses the Moses package with a pair of non-Western languages, whereas most work done with Moses takes English as either the source or the target language.
It took several months to develop a system capable of running the scripts. The developers ran into problems because translate.cgi expected many copies of daemon.pl to be running, each listening on a different port and each wrapping a separate instance of Moses. Because these scripts had been written before Moses had threads, multi-threading was not a viable option for the newly created web-based Hindi-Punjabi Machine Translation System, and the developers had to adopt a multi-process design instead.
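As a rough illustration of the multi-process design described above, the following Python sketch starts several independent daemon processes, each listening on its own port, and a front end that hands each request to the next daemon in turn. Everything in it is a stand-in rather than the project’s code: the names DAEMON_SOURCE, start_daemons, make_dispatcher and stop_daemons are hypothetical, the round-robin dispatch is an assumption, and each daemon merely tags the text where the real daemon.pl would wrap an instance of Moses.

```python
import itertools
import socket
import subprocess
import sys

# Hypothetical stand-in for daemon.pl. Each copy runs as its own process,
# owns one "engine" and listens on one port. A real daemon would pass the
# text to a Moses process instead of just tagging it.
DAEMON_SOURCE = r'''
import socket, sys
srv = socket.socket()
srv.bind(("127.0.0.1", 0))               # port 0: let the OS pick a free port
srv.listen()
print(srv.getsockname()[1], flush=True)  # report the chosen port to the parent
while True:
    conn, _ = srv.accept()
    with conn, conn.makefile("rb") as reader:
        line = reader.readline()
        if not line:
            continue
        text = line.decode("utf-8").rstrip("\n")
        reply = f"[engine {sys.argv[1]}] {text}\n"
        conn.sendall(reply.encode("utf-8"))
'''


def start_daemons(count):
    """Start `count` daemon processes; return (processes, ports)."""
    procs, ports = [], []
    for engine_id in range(count):
        proc = subprocess.Popen(
            [sys.executable, "-c", DAEMON_SOURCE, str(engine_id)],
            stdout=subprocess.PIPE,
            text=True,
        )
        # The daemon prints its port once it is ready to accept connections.
        ports.append(int(proc.stdout.readline()))
        procs.append(proc)
    return procs, ports


def make_dispatcher(ports):
    """Return a function that sends each request to the next port in turn."""
    next_port = itertools.cycle(ports)

    def translate(text):
        address = ("127.0.0.1", next(next_port))
        with socket.create_connection(address) as conn:
            conn.sendall((text + "\n").encode("utf-8"))
            with conn.makefile("r", encoding="utf-8") as reader:
                return reader.readline().rstrip("\n")

    return translate


def stop_daemons(procs):
    """Terminate every daemon process and release its pipe."""
    for proc in procs:
        proc.terminate()
        proc.wait()
        proc.stdout.close()
```

A caller would start the daemons once with start_daemons(n), build the dispatcher from the returned ports, and call stop_daemons when shutting down. Because every engine lives in its own process, no engine needs to be thread-safe, which is the property the developers needed.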
The system is based on the direct approach to machine translation. It comprises:
- Preprocessing: Text Normalization, Replacing Proper Nouns, Replacing Collocations, etc.
- Translation Engine: Inflection Analysis, Identifying Surnames, Word Sense Disambiguation, Identifying Titles, Lexicon Lookup, Transliteration; and
- Post-processing module.
The developers of the software point to this system as a success story, claiming around 94% accuracy on the basis of an intelligibility test carried out by human evaluators. Nevertheless, they are still working on fixes, improvements and higher accuracy.
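The three stages listed above can be sketched as a toy pipeline in Python. This is only an illustration under stated assumptions: the one-entry lexicon, the function names and the code-point-shift transliteration are all hypothetical, and the real system’s modules (inflection analysis, word sense disambiguation, proper-noun handling and so on) are far richer than anything shown here.

```python
import unicodedata

# Devanagari (U+0900..U+097F) and Gurmukhi (U+0A00..U+0A7F) are laid out in
# parallel in Unicode, so many letters correspond at a fixed offset.
# Caveat: not every Devanagari code point has a Gurmukhi counterpart, so this
# shift is a naive fallback and can produce unassigned code points.
DEVANAGARI_TO_GURMUKHI = 0x0A00 - 0x0900

# Tiny illustrative lexicon (an assumption, not the system's data):
# Hindi "pani" (water) -> Punjabi "pani", which is spelled with a different
# nasal consonant than a letter-by-letter transliteration would give.
LEXICON = {"पानी": "ਪਾਣੀ"}


def preprocess(text):
    """Text normalization: Unicode NFC plus whitespace collapsing."""
    return " ".join(unicodedata.normalize("NFC", text).split())


def transliterate(word):
    """Shift Devanagari characters into the Gurmukhi block; keep the rest."""
    return "".join(
        chr(ord(ch) + DEVANAGARI_TO_GURMUKHI) if "\u0900" <= ch <= "\u097f" else ch
        for ch in word
    )


def translate_word(word):
    """Lexicon lookup first, transliteration as the fallback."""
    return LEXICON.get(word) or transliterate(word)


def postprocess(words):
    """Join the translated words back into a sentence."""
    return " ".join(words)


def translate(text):
    """Run the three stages in order, word by word."""
    words = preprocess(text).split()
    return postprocess([translate_word(word) for word in words])
```

Translating word by word, with the lexicon taking priority over transliteration, is what makes the direct approach workable for a closely related pair such as Hindi and Punjabi, where word order largely carries over.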
Why is a Hindi-Punjabi Machine Translation System Important?
Punjabi is an Indo-Aryan language, and therefore a distant relative of the European languages. It is unusual within that group in being tonal: like Chinese, it uses pitch to distinguish words. It is a descendant of the Shauraseni language, which was the chief language of mediaeval northern India several centuries ago. Punjabi is spoken by circa 105 million people worldwide, making it the 10th most widely spoken language (2015 data). The language originates from the Punjab region, which is nowadays divided between the Republic of India and the Islamic Republic of Pakistan. Punjabi is the most widely spoken language in Pakistan, yet although about half of the population there speak it as their mother tongue, it lacks official status at national level. As a result of emigration from the Indian sub-continent, it is also the 4th most spoken language in England and Wales and the 3rd in Canada, after English and French.
Out of a population of 1.2 billion Indians in 2011, more than 35 million people (2.8% of the population) spoke Punjabi. In India, Punjabi is therefore a minority language with northern roots, the 11th most spoken in the country. It is the first official language of the Indian State of Punjab and one of the 22 scheduled languages. In Pakistan, the latest official figures point to around 82 million speakers in 2012.
Hindi also belongs to the Indo-Aryan group, which is part of the Indo-Iranian branch of the Indo-European language family. Hindi ranks number 4 in the world in terms of number of speakers. This figure includes not only native Hindi speakers of Hindustani, but typically also native speakers of related languages who consider themselves speakers of a Hindi dialect (the so-called “Hindi belt”). This means that Hindi is the mother tongue of at least 425 million people in India and the second language of some 120 million more, mostly in Northern and Central India. Southern India, where Tamil and other languages are spoken, has often resisted the process whereby Hindi was imposed as the national language. Most government documentation is prepared in three languages: Hindi, English and the primary official language of the local state, whenever the majority of the inhabitants of the state do not speak either English or Hindi.
The dialect upon which Standard Hindi is based is Khari boli, the dialect of Delhi and the surrounding regions of western Uttar Pradesh and southern Uttaranchal. It acquired linguistic prestige during the Mughal Empire (1600s), when it became known as Urdu or “the language of the court”.
We thus have, on the one hand, Punjabi, a historic, “very Indian” tongue once spoken by Aryan conquerors, most of whose speakers now live in neighbouring Pakistan, and, on the other, Hindi, another Indo-European language, which is the national language of the Republic of India. As a language pair, Hindi and Punjabi are closely related, which points to their common roots.
Hindi-Punjabi Writing Systems
Punjabi also faces a difficulty as to which script is used for its written form. Majhi-Standard Punjabi is the written standard for Punjabi in both parts of Punjab. However, Punjabi speakers in Pakistan and in India tend to use the scripts most familiar to them. Pakistani nationals write Punjabi using the Shahmukhī script, which was created by modifying the Persian Nastaʿlīq script. Punjabi speakers in India prefer the Gurmukhī script, though they also use the Devanagari script (used for Hindi) and even the Latin script, owing to the influence of English as a common language in India.
There are thus two main ways to write Punjabi: Gurmukhi and Shahmukhi. The word Gurmukhi translates as “from the Guru’s mouth”; Shahmukhi means “from the King’s mouth”. The two different writing systems can be seen in the map above. The Pakistani side of the Punjab province uses the Shahmukhi script, which clearly differs from Urdu. The Indian side of the Punjab (or Eastern Punjab) is divided into three states. In the state of Punjab proper, Gurmukhī is the most common script used for writing Punjabi.
Hindi and Urdu are essentially the same language, although for historical reasons they are considered separate. Hindi is written in the Devanagari script and uses more Sanskrit words. Urdu, again for historical reasons, uses the Persian script and has adopted many words of Persian origin.
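For software that has to handle these scripts, a common first step is to work out which one a piece of text is written in. The Python sketch below guesses the script from the Unicode names of the letters; the function name dominant_script is hypothetical, and the approach is a simple heuristic rather than a full script detector. Note that Shahmukhi and Urdu text is encoded with Arabic-script characters in Unicode, so both are reported as ARABIC here.

```python
import unicodedata
from collections import Counter


def dominant_script(text):
    """Guess the script of `text` from the Unicode names of its letters.

    The first word of a character's Unicode name (for example GURMUKHI,
    DEVANAGARI, ARABIC or LATIN) is used as a rough script label.
    Returns None if the text contains no letters.
    """
    counts = Counter(
        unicodedata.name(ch, "UNKNOWN").split()[0]
        for ch in text
        if ch.isalpha()  # skip digits, spaces, punctuation and vowel signs
    )
    return counts.most_common(1)[0][0] if counts else None
```

A translation front end could use such a check to reject input in the wrong script, or to route Gurmukhi and Devanagari text to different components.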
Internet and Technology
Hindi has a very strong presence on the internet. India is home to many Internet engineers, SEO specialists and website designers. Indian software designers enjoy a worldwide reputation, and many feel at home in the US, where they launch their own digital companies.
Owing to the lack of a standard encoding among Indian languages, many Western search engines cannot properly locate text written in Hindi, although Hindi is one of the 7 languages of India that can be used in URLs (web addresses).
Interestingly, Hindi has lent words to technology: “avatar” means “a spirit taking a new form”. The word has been used in films, artificial intelligence, computer science and even robotics.