LTRC: Celebrating A Legendary Research Centre

NLP may be mainstream in undergraduate colleges today, but 25 years ago it was a niche field taking its tentative steps at IIITH. As the Language Technologies Research Centre (LTRC) rings in its silver jubilee, here’s an account of its glorious journey.

No mention of LTRC’s genesis is complete without a reference to Akshar Bharati. For the uninitiated, the latter is a personification of a group that came together in the early 1980s to work on the computer processing of Indian languages, laying special emphasis on the traditional Indian theories of language. “It was set up at IIT Kanpur by the pioneers of language technology research in India, Prof. Rajeev Sangal and Vineet Chaitanya ji,” reminisces Prof. Dipti Misra, adding that one of the first books on natural language processing in India, NLP: A Paninian Perspective, was the result of this dedicated team of researchers. The Akshar Bharati team found itself making frequent trips to Hyderabad because a number of academic institutes there were engaged in linguistic research: EFLU, Osmania University, and the University of Hyderabad (UoH). Prof. Vasudeva Varma, current Head of the Language Technologies Research Centre, IIITH and former research associate of Akshar Bharati, recounts, “There was even a Govt-funded project titled ‘IITK Centre for NLP at UoH’ which was set up at the Centre for Applied Linguistics and Translation Studies. Therefore, it made perfect sense for the Kanpur-based group to relocate to Hyderabad, and specifically to UoH.”

IIITH’s First Research Center

When the concept of setting up a research-oriented institution that could attract IT companies to Hyderabad was taking shape in 1997, Prof. Sangal, who was actively involved in the founding of the International Institute of Information Technology Hyderabad (IIITH), moved to IIITH with the Akshar Bharati team. They set up not only the institute’s first research centre but also the nation’s first-of-its-kind centre dedicated to language technology research – the Language Technologies Research Centre (LTRC). The LTRC, which was founded with generous funding from Satyam Computers, had an ambitious blueprint for NLP research. Alongside Anusaaraka, one of India’s early machine translation projects, the founding team envisioned research in speech, search engines, and information retrieval, leading to the subsequent establishment of the MT-NLP lab, the speech processing lab and the information retrieval and extraction lab (IREL).

Multidisciplinary Experts

As a testimony to IIITH’s ethos, which favours research centres over traditional ‘departments’, LTRC too was set up as a research centre focusing on the broad problem area of processing natural languages in both text and speech modes. “LTRC is a great example because unlike traditional departments where faculty possess PhDs in one particular area, here we have people who are PhDs in Computer Science, Linguistics, Statistics, Sanskrit, and others. The idea was that a diverse set of people would sit together and look at the same problem from all these dimensions,” remarks Prof. Varma. Prof. Dipti Misra is one such example herself. As a linguist well-versed in the Chomskyan approach to language, her association with Akshar Bharati while at UoH and later at LTRC exposed her to Indian grammatical traditions, Sanskrit grammar in particular, which she terms ‘linguistics-rich’. “My biggest learning from being here and interacting with Sanskrit scholars like Prof. Ramakrishnamacharyalu – a collaborator from the National Sanskrit University, Tirupati – has been getting familiar with the Paninian framework.”

Educating The Ecosystem

Cognisant of the fact that multidisciplinary collaboration is far from easy, the centre resolved to primarily attract more talent from diverse disciplines and impart training to them. Thus the centre’s first academic program was introduced in the form of a PhD in Computational Linguistics in 2002. “The objective was to expose a computer science professional to basic linguistics, a Sanskrit scholar to basic programming and so on,” says Prof. Misra, adding that until then, workshops and awareness sessions on NLP or computational linguistics had been conducted by the group across the country, albeit in a non-structured way.

When Prasad Pingali – the centre’s first graduate student – learned about IIITH’s Search and Information Extraction Lab (SIEL), he had been engaged with several B2B e-commerce portal startups in the US. “I was working on product catalog and search functionality for those products. We were supposed to extract product attributes and organize them into a product taxonomy which was a manual process. I quickly realised that it was not a straightforward problem to search for products based on a free text box search using RDBMS queries,” he says. With the US recession coinciding with his awareness of having to use NLP in his line of work, Prasad found himself returning to India and registering for a PhD at LTRC’s SIEL (now Information Retrieval and Extraction Lab).

Fun Times

Equating SIEL to a startup with its accompanying teething problems, Prasad reminisces about how they grew from 3 students to about 50 researchers in 3 years. “Evangelizing search as an interesting area to pursue and attracting the cream of students was an interesting journey,” he recollects. While they had many interesting interactions with people from industry and students alike, one particular incident that stands out for him is when he and then MS by Research student Jagadeesh Jagarlamudi were helping someone out on the Nutch forum. Their expertise and the kind of work they were involved in impressed none other than Sabeer Bhatia (of Hotmail fame), who caught the next flight to Hyderabad to visit SIEL in person. Explaining this as the result of being among the handful specialising in a niche area then, Prasad muses, “Nowadays of course, these subjects have become mainstream in many colleges.”

A visual that is hard to forget for the early team of researchers at LTRC is that of the desi jugaad at work. “None of our students were hardware students. But we needed powerful servers to process the gazillion documents that were crawled through the internet. So the team went to Ameerpet and picked up some hardware to cobble together huge systems that could do this,” chuckles Prof. Varma. When industry folk were invited to visit the centre, the sight of the deconstructed systems moved one of them so much that a set of 20 decommissioned servers was shipped all the way from the Yahoo headquarters in Santa Clara, California.

Academic Impact Of LTRC

At LTRC, what began as an academic offering of a PhD in Computational Linguistics gradually expanded to include an MTech, an MS by Research and even an MPhil in Computational Linguistics. “The MPhil was unique because it was open to students from the Humanities and Social Sciences background,” says Prof. Misra. Another path-breaking idea of its time was opening up computational linguistics research at the undergraduate level. “We felt that it would be a good idea to provide a transdisciplinary training at a much younger age – the +2 level. So that’s how the CLD (dual degree of BTech in Computer Science and an MS by Research in Computational Linguistics) program was introduced in 2009,” she observes.

College and High School Outreach

Along with formal academic training, LTRC is also known nation-wide for its hugely popular ‘Summer School in NLP’ which was launched in 2004. “The goal was to invite people, again from very diverse backgrounds, give them rigorous project-oriented training for a period of 4 weeks where they not only learn about the theoretical subject but also indulge in hands-on project work,” remarks Prof. Varma. It’s interesting to think that almost every NLP researcher in the country can trace their initiation into the field to IIITH’s centre. LTRC is also the official birthplace of the annual International Conference on Natural Language Processing (ICON). In the 21st year since its inception, the conference series has emerged as a forum for promoting interaction among researchers in NLP and computational linguistics both in India and abroad.

In an example of outreach to students in their formative years of secondary and high school, LTRC has also been actively engaged in facilitating and organising the Panini Linguistics Olympiad – the nationwide selection and training program for candidates who represent India at the International Linguistics Olympiad (IOL). “The students have gone on to win many medals on the international stage,” says Prof. Varma.

Research Impact

Noting that a lot of work in NLP has been undertaken at LTRC, Prof. Varma lists machine translation; speech recognition and speech synthesis, which is understanding speech and producing it in the most natural way in Indian languages; and information retrieval, which includes information extraction and summarization. But according to him, “The most significant of them all has been the creation of high quality data sets. Because of 25 years of relentless efforts, we have people from startups to big tech companies apart from independent researchers who benefitted from it. This resource is freely available to anyone who wishes to use it. And then the other is language resource development which includes building dictionaries, treebanks and more, something that Prof. Misra’s team has been consistently working on over the years.” Prof. Misra herself adds that a major part of developing the resources begins with annotating corpora or text, which has to be done in a consistent and standardised manner. “All this work triggered an effort to develop standards for different levels of linguistic annotation for Indian languages. It culminated in developing a standard for tagging parts of speech in different Indian languages which was then published by the Bureau of Indian Standards (BIS).”

The other aspect of the centre’s linguistic efforts that is close to Prof. Varma’s heart is the creation of encyclopaedic information in Indian languages, or the Indic-Wiki project. “The Wikipedia in Indian languages is very, very small. So we came up with an entire process that includes tools and methodology to create millions of pages in various Indian languages. Initially, we began by experimenting in Telugu and since then we’ve expanded to include other languages. As part of this initiative, every year we conduct a Wikimedia Technology summit that encourages collaboration between developers, users, and researchers to make knowledge accessible in diverse languages,” he says, elaborating that it has spawned an entire ecosystem of Wikimedians.

Current Focus

“When the MT-NLP lab was set up, machine translation meant text-to-text translation; when we ventured into speech technology, it meant translating speech-to-text and text-to-speech. When these independent technologies reached a level of maturation, it was conceptualized that the researchers in India come together to work towards speech-to-speech machine translation,” explains Prof. Misra. The National Language Translation Mission project was envisioned around the same time, and LTRC joined its Bhashini project, which aims to employ voice as a medium to transcend language barriers for the average citizen wishing to access digital services. The centre leads a multi-consortium Indian language-to-Indian language machine translation project under Bhashini titled ‘Himangy’ (HIndustani Machini ANuvaad TechnoloGY). Work is underway to build bi-directional machine translation systems for English-to-Hindi and English-to-Telugu, in addition to 9 other Indian language pairs, which can be easily integrated into a Speech-to-Speech Machine Translation (SSMT) system pipeline.

Looking Ahead

“In the last 25 years, NLP has undergone three major phases. The first phase was when it was rule-based, and dealt with grammar, linguistics; the second phase was statistical and the third phase that we are currently in is all about artificial neural networks. In 2002, when LTRC was set up, the statistical phase had just begun where it married the linguistic phase of research along with statistical processing. A whole range of tools were developed for Indian languages. Now with neural networks coming in, they use large data sets as well as the tools that were developed in many important ways,” remarks Prof. Sangal, the pioneer of NLP in India.

Today, the advent of large language models (LLMs) is disrupting our daily lives and, more specifically, language technology. As new challenges crop up in terms of research and education, LTRC seeks to address the limitations of LLMs, reinvent itself and continue to contribute to research with a difference. After all, there is no better place to do this than at the trailblazing centre for language technology.