WILDRE
Workshop on Indian Language Data: Resources and Evaluation

21 May 2012, Lütfi Kirdar Istanbul Exhibition and Congress Centre, Turkey

(Organized under LREC 2012, May 21-27, 2012)

Motivation and Aim

In the past couple of decades, the Indian NLP and Speech Technology community has shown an ever-increasing interest in the development of Language Resources for Indian languages. As the community grew, increasing research in and development of Language Technology brought an acute awareness of a serious lack of appropriate resources across the languages of India. A number of initiatives have been taken to address this issue by the Government of India as well as by academia and industry. Many of these initiatives have targeted specific NLP and Speech technologies, fostering collaborations between several academic institutions across the country, with active involvement of industry partners. As expected, when a number of resources are being developed simultaneously by several research groups across many languages, the need for standards also takes on some urgency. In the past 5 years, the Govt. of India, in consultation with experts from academia and industry, has taken the lead in developing appropriate standards for NLP resources. This concentrated effort has resulted in a number of resources, standards, tools and technologies becoming available for many Indian languages in the past few years. While the activity in the Indian language community may still not be comparable to, for example, the work done on European languages, we firmly believe that the community has come of age and is at a point where sharing of ideas and experience is necessary, not only within the community but also with other communities working in similar situations, so that India can move forward in planning for future language technology resources and requirements while maintaining its linguistic diversity.

India has 4 language families, of which Indo-Aryan (76.87 % of speakers) and Dravidian (20.82 % of speakers) are the major ones. These families have contributed 22 constitutionally recognized (‘scheduled’ or ‘national’) languages, of which Hindi has ‘official’ status in addition to ‘national’ status. Besides these, India has 234 mother tongues reported by the recent census (2001), and many more (more than 1600) languages and dialects. Of the major Indian languages, Hindi is spoken in 10 (out of a total of 25) states of India, accounting for over 60 % of the total population, followed by Telugu and Bangla. There are more than 18 scripts in India which need to be standardized and supported by technology. Devanagari is the most widely used script, being used by more than 6 languages.

Indian languages are under the exclusive control of the respective states in which they are spoken. Therefore every state may decide on measures to promote its language. However, since these 22 languages are national (constituent) languages, the center (Union of India) also has a responsibility towards each of them, though it has certain additional responsibility towards Hindi, which is the national as well as official language of the Indian union. From time to time, minor/neglected languages claim constituent status. The situation becomes more complex when such a language becomes the rallying point for the demand for a new state or autonomous region.

This complex linguistic scene in India puts tremendous pressure on the Indian government not only to have comprehensive language policies, but also to create resources for the maintenance and development of its languages. In the age of information technology, there is a greater need to strike a fine balance in the allocation of resources to each language, keeping in view the political compulsions, the electoral potential of a linguistic community and other issues.

Language promotion and maintenance by the Ministry of Human Resource Development

The MHRD, through its language agency, the CIIL, and many academic institutions across the country, has set up a Linguistic Data Consortium for Indian Languages (LDC-IL). This consortium, set up on the lines of the LDC at the University of Pennsylvania (USA), will not only create and manage large Indian language databases but also provide a forum for researchers in India and other countries working on Indian languages to publish and to build products based on such databases that would not otherwise be possible.

LDC-IL is expected to:
  • Become a repository of linguistic resources in all Indian languages in the form of text, speech and lexical corpora.
  • Facilitate creation of such databases by different organizations which could contribute and enrich the main LDC-IL repository.
  • Set appropriate standards for data collection and storage of corpora for different research and development activities.
  • Support language technology development and sharing of tools for language-related data collection and management.
  • Facilitate training and manpower development in these areas through workshops, seminars etc. in technical as well as process related issues.
  • Create and maintain the LDC-IL web-based services that would be the primary gateway for accessing its resources.
  • Design or provide help in the creation of appropriate language technology based on the linguistic data for mass use.
  • Provide the necessary linkages between academic institutions, individual researchers and the masses.

The Technology Development for Indian Languages (TDIL) program of the Ministry of Communications and IT (MCIT)

The MCIT started a program called TDIL in 1991 for building technology solutions for Indian languages. The stated objectives of TDIL are:

(i) to develop information processing tools and techniques,
(ii) to facilitate human-machine interaction without language barrier,
(iii) to create and access multilingual knowledge resources and integrate them to develop innovative user products and services.

TDIL has made many basic software tools and fonts for 22 Indian languages available in the public domain. On the language resources side, TDIL is running several language corpora projects in consortium mode. Some of the significant projects are:

• Development of LRs for English to Indian Languages Machine Translation (MT) System
• Development of LRs for Indian Language to Indian Language Machine Translation System
• Development of LRs for Sanskrit-Hindi Machine Translation
• Development of LRs for Robust Document Analysis & Recognition System for Indian Languages
• Development of LRs for On-line Handwriting Recognition System
• Development of LRs for Cross-lingual Information Access
• Development of Speech Corpora/Technologies
• Parallel Language Corpora development in all 22 national languages (ILCI)

Apart from the consortium-based efforts, there have been several institution- or organization-specific efforts to develop standard resources for Indian languages. Prominent examples include the Hindi Wordnet developed at IIT Bombay and the POS-tagged corpora in Bangla, Hindi and Sanskrit developed by Microsoft Research India in collaboration with Jawaharlal Nehru University.

Given the amount of activity in the area of Language Technology Resources at the government, institution and individual researcher levels, we think a Workshop for Indian Language Resources and Evaluation is not only timely but absolutely imperative. We also feel that LREC is the best possible venue for such a workshop, as the situation in Europe is comparable to India's in terms of linguistic diversity and identity. ELRA and its associate organizations have been extremely active and successful in addressing the challenges and opportunities such a situation can bring with it. Co-locating WILDRE with LREC will give our research community the opportunity to interact with and learn from those involved in similar initiatives. LREC itself will also provide exposure to challenges and possible solutions globally, resulting, we hope, in an enriching exchange of ideas.
Thus, the main aims of WILDRE will be:

  • To map the status of Indian Language Resources
  • To investigate challenges related to creating and sharing various levels of language resources
  • To promote a dialogue between language resource developers and users
  • To provide opportunity for researchers from India to collaborate with researchers from other parts of the world
Description of Topic

WILDRE will invite technical, policy and position paper submissions on the following topics related to Indian Language Resources:

  • Text corpora
  • Speech corpora
  • Lexicons and Machine-readable dictionaries
  • Ontologies
  • Grammars
  • Annotation of corpora
  • Language resources for basic NLP, IR and Speech Technology tasks, tools and applications
  • Infrastructure for constructing and sharing language resources
  • Standards or specifications for language resources
  • Licensing and copyright issues

Both submission and review processes will be handled electronically. The review process will be blind. The workshop website will provide the submission guidelines and the link for the electronic submission.


FIRST WORKSHOP ON INDIAN LANGUAGE DATA: RESOURCES AND EVALUATION (WILDRE)   

Date: Monday, 21st May 2012     

Venue: Lütfi Kirdar Istanbul Exhibition and Congress Centre, Turkey (organized under the platform of LREC 2012, 21-27 May 2012)

Website: http://sanskrit.jnu.ac.in/conf/wildre

WILDRE, the first workshop on Indian Language Data: Resources and Evaluation, is being organized in Istanbul, Turkey on 21st May, 2012 under the LREC platform. India has huge linguistic diversity and has seen concerted efforts from the Indian government and industry towards developing language resources. The European Language Resources Association (ELRA) and its associate organizations have been very active and successful in addressing the challenges and opportunities related to language resource creation and evaluation. It is therefore a great opportunity for resource creators of Indian languages to showcase their work on this platform and also to interact with and learn from those involved in similar initiatives all over the world.
The broader objectives of WILDRE are:

  • To map the status of Indian Language Resources
  • To investigate challenges related to creating and sharing various levels of language resources
  • To promote a dialogue between language resource developers and users
  • To provide opportunity for researchers from India to collaborate with researchers from other parts of the world
DATES      

February 12, 2012 Paper submissions due     
March 18, 2012 Paper notification of acceptance     
March 30, 2012 Camera-ready papers due     
May 21, 2012 Workshop

SUBMISSIONS     

Papers must describe original, unpublished work, either completed or in progress. Each submission will be reviewed by two program committee members.

Accepted papers will be given up to 10 pages (for full papers) or 5 pages (for short papers and posters) in the workshop proceedings, and will be presented as an oral presentation or a poster.

Papers should be formatted according to the style-sheet, which will be provided on the LREC 2012 website (http://www.lrec-conf.org/lrec2012/).   

Please submit papers in PDF/doc format to: https://www.softconf.com/lrec2012/WILDRE2012/

We are seeking submissions in the following categories:

  • Full papers (10 pages)
  • Short papers (work in progress – 5 pages)
  • Posters (innovative ideas/proposals, research proposal of students)
  • Demo (of working online/standalone systems)  

Though our area of interest covers all NLP/language technology related activity for Indian languages, we would like to focus on resource creation in the following areas:

  • Text corpora
  • Speech corpora
  • Lexicons and Machine-readable dictionaries
  • Ontologies
  • Grammars
  • Annotation of corpora
  • Language resources for basic NLP, IR and Speech Technology tasks, tools and applications
  • Infrastructure for constructing and sharing language resources
  • Standards or specifications for language resources
  • Licensing and copyright issues

Both submission and review processes will be handled electronically using the START interface of the LREC website. The workshop website will provide the submission guidelines and the link for the electronic submission.

When submitting a paper through the START page, authors will be asked to provide relevant information about the resources that have been used for the work described in their paper or that are the outcome of their research. For further information on this initiative, please refer to http://www.lrec-conf.org/lrec2012/?LRE-Map-2012 . Authors will also be asked to contribute to the Language Library, the new initiative of LREC 2012.

Conference Chairs
  • Girish Nath Jha, Jawaharlal Nehru University, New Delhi
  • Kalika Bali, Microsoft Research India Lab, Bangalore
  • Sobha L, AU-KBC Research Centre, Anna University, Chennai
Program Committee
  • A. Kumaran, MSRI, Bangalore
  • A G Ramakrishnan, I.I.Sc Bangalore
  • Amba Kulkarni, University of Hyderabad
  • Chris Cieri, LDC, University of  Pennsylvania
  • Dafydd Gibbon, Universität Bielefeld, Germany
  • Dipti Misra Sharma, IIIT, Hyderabad
  • Girish Nath Jha, Jawaharlal Nehru University, New Delhi
  • Hema Murthy, IIT, Chennai
  • Joseph Mariani, LIMSI-CNRS, France
  • Kalika Bali, MSRI, Bangalore
  • Khalid Choukri, ELRA, France
  • L Ramamoorthy, LDC-IL, CIIL, Mysore
  • Monojit Choudhury, MSRI Bangalore
  • Nicoletta Calzolari, ILC-CNR, Pisa, Italy
  • Niladri Shekhar Dash, ISI Kolkata
  • Shivaji Bandhopadhyay, Jadavpur University, Kolkata
  • Shyamal Das Mondal, IIT Kharagpur
  • Sobha L, AU-KBC Research Centre, Anna University
  • Soma Paul, IIIT, Hyderabad
  • Umamaheshwar Rao, University of Hyderabad
Workshop contact:

diwakar.mishra@gmail.com
Diwakar Mishra, Special Center for Sanskrit Studies, Jawaharlal Nehru University, New Delhi

 

Details of the Organizers

Girish Nath Jha
Associate Professor in  Computational Linguistics
Special Center for Sanskrit Studies,
J.N.U., New Delhi - 110067
ph.91-11-26741308 (o) Email: girishjha@gmail.com

Mukesh and Priti Chatter Distinguished Professor of History of Science,
University of Massachusetts Dartmouth, USA
http://www.umassd.edu/indic/facultyandstaff/    

Girish Nath Jha is Associate Professor at the Special Center for Sanskrit Studies, Jawaharlal Nehru University (JNU), specializing in Computational Linguistics. He also has an honorary appointment at the Center for Indic Studies, University of Massachusetts, Dartmouth, MA, USA. Dr Jha’s research interests include Indian languages corpora and standards, Sanskrit and Hindi linguistics, Computational Lexicography, Machine Translation, Natural Language Interfaces, e-learning, web based technologies, RDBMS techniques, software design and localization. Dr Jha completed his doctoral degree in Linguistics (Computational Linguistics) at JNU and then earned another master's degree in Linguistics (Natural Language Interface) from the University of Illinois, Urbana Champaign, USA in 1999. He then worked as a software engineer in the USA before joining JNU in 2002. He has worked as a consultant for LDC (University of Pennsylvania), Microsoft Corp and Microsoft Research India, among others. Dr Jha is currently leading a consortium of Indian universities to develop parallel tagged corpora for major Indian languages.

Kalika Bali
Researcher (Multilingual Systems)
Microsoft Research Labs India
Address: “Vigyan” #9 Lavelle Road, Bangalore 560025 India
Phone: +91-80-66586218  Email: kalikab@microsoft.com

Kalika Bali is a researcher with the Multilingual Systems group at Microsoft Research Labs India (MSR-India), Bangalore. Her primary research interests are in Speech Technology and Computational Linguistics, especially for Indian languages. A linguist by training, she has taught at the University of the South Pacific as an Assoc. Prof. She has worked in the area of research and development of Language Technology at both start-ups and established companies like Nuance, Simputer, Hewlett-Packard Labs and Microsoft Research. She has been involved in the development of standards related to language technologies, and is one of the authors of UPX, an XML-based standard for online handwritten datasets. She represents Microsoft on Standards Committees related to Indian languages and has been an active participant in the formulation of LR standards for Indian languages. In her previous position at HP Labs India, she was one of a two-person team that drafted the proposal for LDC-IL in India. At MSR-India, she has led projects related to resource creation and annotation.

Sobha L.
CLRG Group
AU-KBC Research Centre
MIT campus of Anna University
Chennai-600044
Phone: +91-44-22232711 Email: sobha@au-kbc.org

Sobha Lalitha Devi is a scientist with the Information Sciences Division of AU-KBC Research Centre, Anna University, Chennai, India. Her research interests are in the fields of Discourse Analysis, Information Extraction and Retrieval, and she specializes in the area of Anaphora Resolution. She is one of the key organizers of the Discourse Anaphora and Anaphor Resolution Colloquium (DAARC). Besides the above areas, she also works on automatic detection of plagiarism and organizes tracks in plagiarism detection. In the area of information retrieval, she and her students started the Tamil search engine www.searchko.in. She is involved in two major consortium projects funded by the Department of Information Technology, Government of India, on Cross Lingual Information Access and Indian Language to Indian Language Machine Translation System (Tamil to Hindi bidirectional), and in a European Union (EU) funded project, WIQ-EI (Web Information Quality Evaluation Initiative). She has been visiting faculty at universities in the UK, Spain and Portugal. She is an Erasmus Mundus coordinator for 2010-2012 and is associated with the University of Wolverhampton.

 


 

Invited Speakers
  • Prof Pushpak Bhattacharyya, IIT Bombay, India (Keynote speaker)
  • Mrs Swaran Lata, Head, TDIL Program, Govt of India (Inaugural speaker)
  • Dr Khalid Choukri, CEO ELDA, France (Inaugural speaker)
  • Prof Nicoletta Calzolari, ILC-CNR, Pisa, Italy (Valedictory speaker)

    Title of keynote speech: Multiwords Processing in Indian Languages

    Multiwords form about 40% of the lexical items in most natural languages. Their processing is non-trivial due to non-compositionality and fixity. Indian language processing is scaling new heights these days through large-scale national efforts on parallel corpora creation, wordnet building, domain-specific machine translation, search and so on, collectively called TDIL (Technology Development for Indian Languages). In all these efforts, extracting MWs and processing them is crucial. In this presentation we describe MW phenomena in different families of Indian languages, their peculiarities and attempts at their extraction. Tools have been put in place to extract MWs through a man-machine interface. This multilingual experience is highly beneficial for Indian language technology.
    About the speaker

    Dr. Pushpak Bhattacharyya is a Professor of Computer Science and Engineering at IIT Bombay. He received his B.Tech from IIT Kharagpur, M.Tech from IIT Kanpur and PhD from IIT Bombay. He has held visiting positions at MIT, Cambridge, USA, Stanford University, USA and University Joseph Fourier, Grenoble, France. Dr. Bhattacharyya's research interests include Natural Language Processing, Machine Translation and Machine Learning. He has had more than 130 publications in top conferences and journals and has served as program chair, area chair, workshop chair and PC member of top fora like ACL, COLING, LREC, SIGIR, CIKM, NAACL, GWC and others. He has guided 7 PhDs and over 100 masters and undergraduate students in their thesis work. Dr. Bhattacharyya plays a leading role in India's large-scale projects on Machine Translation, Cross Lingual Search, and Wordnet and Dictionary Development. Dr. Bhattacharyya has received a number of prestigious awards, including the IBM Innovation Award, United Nations Research Grant, Microsoft Research Grant, IIT Bombay's Patwardhan Award for Technology Development and the Ministry of IT and Digital India Foundation's Manthan Award. Recently he was appointed Associate Editor of the prestigious journal ACM Transactions on Asian Language Information Processing and was also chosen for the Yahoo Faculty Award.


    Panel Discussion

    Topic: India and Europe - making a common cause in LTRs

    Panelists

  • Nicoletta Calzolari (Panel Coordinator)
  • Khalid Choukri
  • Joseph Mariani
  • Pushpak Bhattacharyya
  • Swaran Lata
  • Monojit Choudhury
  • Zygmunt Vetulani
  • Dafydd Gibbon

Workshop Programme

08:30-09:45 – Inaugural session
  • 08:30-08:40 – Welcome by Workshop Chairs
  • 08:40-08:55 – Inaugural Address by Mrs. Swaran Lata, Head, TDIL, Dept of IT, Govt of India
  • 08:55-09:10 – Address by Dr. Khalid Choukri, ELDA CEO
  • 09:10-09:45 – Keynote Lecture by Prof Pushpak Bhattacharyya, Dept of CSE, IIT Bombay

09:45-10:30 – Paper Session I
Chairperson: Sobha L
  • Somnath Chandra, Swaran Lata and Swati Arora, Standardization of POS Tag Set for Indian Languages based on XML Internationalization best practices guidelines
  • Ankush Gupta, A Generic and Robust Algorithm for Paragraph Alignment and its Impact on Sentence Alignment in Parallel Corpora
  • Malarkodi C.S and Sobha Lalitha Devi, A Deeper Look into Features for NE Resolution in Indian Languages

10:30-11:00 – Coffee break + Poster Session
Chairperson: Monojit Choudhury
  • Akilandeswari A, Bakiyavathi T and Sobha Lalitha Devi, ‘atu’ Difficult Pronominal in Tamil
  • Subhash Chandra, Restructuring of Paninian Morphological Rules for Computer processing of Sanskrit Nominal Inflections
  • Praveen Dakwale, Himanshu Sharma and Dipti Misra Sharma, Anaphora Annotation in Hindi Dependency TreeBank
  • H. Mamata Devi, On the Development of Manipuri-Hindi Parallel Corpus
  • Madhav Gopal, Annotating Bundeli Corpus Using the BIS POS Tagset
  • Madhav Gopal and Girish Nath Jha, Developing Sanskrit Corpora Based on the National Standard: Issues and Challenges
  • Ajit Kumar and Vishal Goyal, Practical Approach For Developing Hindi-Punjabi Parallel Corpus
  • Sachin Kumar, Girish Nath Jha and Sobha Lalitha Devi, Challenges in Developing Named Entity Recognition System for Sanskrit
  • Swaran Lata and Swati Arora, Exploratory Analysis of Punjabi Tones in relation to orthographic characters: A Case Study
  • Diwakar Mishra, Kalika Bali and Girish Nath Jha, Grapheme-to-Phoneme converter for Sanskrit Speech Synthesis
  • Aparna Mukherjee, Phonetic Dictionary for Indian English
  • Sibansu Mukhapadyay, Tirthankar Dasgupta and Anupam Basu, Development of an Online Repository of Bangla Literary Texts and its Ontological Representation for Advance Search Options
  • Kumar Nripendra Pathak, Challenges in Sanskrit-Hindi Adjective Mapping
  • Nikhil Priyatam Pattisapu, Srikanth Reddy Vadepally and Vasudeva Varma, Hindi Web Page Collection tagged with Tourism Health and Miscellaneous
  • Arulmozi S, Balasubramanian G and Rajendran S, Treatment of Tamil Deverbal Nouns in BIS Tagset
  • Gurpreet Singh, Letter-to-Sound Rules for Gurmukhi Panjabi (Pa): First step towards Text-to-Speech for Gurmukhi
  • Silvia Staurengo, TschwaneLex Suite (5.0.0.414) Software to Create Italian-Hindi and Hindi-Italian Terminological Database on Food, Nutrition, Biotechnologies and Safety on Nutrition: a Case Study.

11:00-12:00 – Paper Session II
Chairperson: Kalika Bali
  • Shahid Mushtaq Bhat and Richa Srishti, Building Large Scale POS Annotated Corpus for Hindi & Urdu
  • Vijay Sundar Ram R, Bakiyavathi T, Sindhujagopalan R, Amudha K and Sobha Lalitha Devi, Tamil Clause Boundary Identification: Annotation and Evaluation
  • Manjira Sinha, Tirthankar Dasgupta and Anupam Basu, A Complex Network Analysis of Syllables in Bangla through SyllableNet
  • Pinkey Nainwani, Blurring the demarcation between Machine Assisted Translation (MAT) and Machine Translation (MT): the case of English and Sindhi

12:00-12:40 – Panel discussion: India and Europe - making a common cause in LTRs
Coordinator: Nicoletta Calzolari
Panelists: Khalid Choukri, Joseph Mariani, Pushpak Bhattacharyya, Swaran Lata, Monojit Choudhury, Zygmunt Vetulani, Dafydd Gibbon

12:40-12:55 – Valedictory Address by Prof Nicoletta Calzolari, Director ILC-CNR, Italy

12:55-13:00 – Vote of Thanks