WILDRE
7th Workshop on Indian Language Data: Resources and Evaluation

25 May 2024, Lingotto Congress Centre, Torino (Italy)

(Organized under LREC-COLING 2024, May 20-25, 2024)

 

Motivation and Aim

In the past couple of decades, the Indian NLP and Speech Technology community has shown an ever-increasing interest in the development of Language Resources for Indian languages. As the community grew, increasing research in and development of Language Technology brought out an acute awareness of the serious lack of appropriate resources across the languages of India. A number of initiatives have been taken to address this issue by the Government of India as well as by academia and industry. Many of these initiatives have targeted specific NLP and Speech technologies, fostering collaborations between several academic institutions across the country and the active involvement of industry partners. As expected, when a number of resources are being developed simultaneously by several research groups across many languages, the need for standards also takes on some urgency. In the past few years, the Govt. of India, in consultation with experts from academia and industry, has taken the lead in developing appropriate standards for NLP resources. This concentrated effort has resulted in a number of resources, standards, tools and technologies becoming available for many Indian languages. While the activity in the Indian language community may still not be comparable to, for example, the work done on European languages, we firmly believe that the community has come of age and is at a point where sharing of ideas and experience is necessary, not only within the community but also with other communities working in similar situations, so that India can move forward in planning its future language technology resources and requirements while maintaining its linguistic diversity.

India has 4 major language families, as reported by the 2011 GOI census. These have contributed 22 constitutionally recognized (‘scheduled’) languages: Indo-Aryan (15 scheduled languages spoken by 78.05% of speakers), Dravidian (4 scheduled languages spoken by 19.64% of speakers), Austro-Asiatic (1 scheduled language spoken by 1.11% of speakers) and Tibeto-Burman (2 scheduled languages spoken by 1.01% of speakers). Out of these scheduled languages, Hindi has ‘official’ status in addition to ‘scheduled’ status. The other 21 scheduled languages may be official languages in the states where they are spoken. Besides these, India has 270 mother tongues (with a speaker base >=10000) reported by the 2011 census, and more than 1700 languages and dialects. Of the major Indian languages, Hindi is spoken in 10 (out of a total of 28) states of India with a total population of over 60%, followed by English, Bangla and Telugu. There are more than 18 scripts in India which need to be standardized and supported by technology. Devanagari is the most widely used script, being used by more than 10 languages.

Indian languages are under the exclusive control of the respective states in which they are spoken. Therefore, every state may decide on measures to promote its language. However, since these 22 languages are scheduled languages, the center (Union of India) also has a responsibility towards each of them, though it has certain additional responsibilities towards Hindi, which is a scheduled language as well as the official language of the Indian union. From time to time, minor/neglected languages claim constituent status. The situation becomes more complex when such a language becomes the rallying point for the demand for a new state or autonomous region.

This complex linguistic scene in India is a source of tremendous pressure on the Indian government to not only have comprehensive language policies, but also to create resources for their maintenance and development. In the age of information technology, there is a greater need to have a fine balance between allocation of resources to each language keeping in view the political compulsions, electoral potential of a linguistic community and other issues.

Language promotion and maintenance by the Ministry of Education (MoE)

Commission for Scientific and Technical Terminology (CSTT)

The GOI set up a standing commission called the Commission for Scientific and Technical Terminology (CSTT) in 1961, in pursuance of a Presidential Order dated April 27, 1960, to evolve scientific and technical terminologies in Hindi and other Indian languages. The CSTT is currently a subordinate office under the MoE and has the following responsibilities.

CSTT has been working for the last 62 years to:
  • Evolve terminologies in all scheduled Indian languages in the scientific and technical domains
  • Collaborate with State Governments, Universities, Regional Text-Book Boards and State Granth Academies to facilitate the evolution of scientific and technical terminology and reference material in Hindi and other Indian languages

To date, CSTT has published more than 350 glossaries, definitional dictionaries, text-books, reference materials and monographs in the 22 scheduled languages. CSTT publishes quarterly journals named Vigyan Garima Sindhu and Gyan Garima Sindhu and many more works of a similar nature. CSTT has also taken care of administrative and various departmental glossaries that are widely used by Government Departments, Institutions, Research Laboratories, Autonomous Organizations, PSUs etc. CSTT regularly organizes workshops, seminars, symposia, conferences, and orientation and training programmes to increase the use of and popularize the standard terminology of Hindi and other Indian languages.

Linguistic Data Consortium for Indian Languages (LDC-IL)

The MoE, through its language agency CIIL and many academic institutions across the country, has set up a Linguistic Data Consortium for Indian Languages (LDC-IL). This consortium, set up along the lines of the LDC at the University of Pennsylvania (USA), will not only create and manage large Indian language databases but also provide a forum for researchers in India and other countries working on Indian languages to publish and build products based on such databases that would not otherwise be possible.

LDC-IL is expected to:
  • Become a repository of linguistic resources in all Indian languages in the form of text, speech and lexical corpora.
  • Facilitate creation of such databases by different organizations which could contribute and enrich the main LDC-IL repository.
  • Set appropriate standards for data collection and storage of corpora for different research and development activities.
  • Support language technology development and sharing of tools for language-related data collection and management.
  • Facilitate training and manpower development in these areas through workshops, seminars etc. in technical as well as process related issues.
  • Create and maintain the LDC-IL web-based services that would be the primary gateway for accessing its resources.
  • Design or provide help in the creation of appropriate language technology based on the linguistic data for mass use.
  • Provide the necessary linkages between academic institutions, individual researchers and the masses.

The Technology Development for Indian Languages (TDIL) program of the Ministry of Communications and IT (MCIT)

The MCIT started a program called TDIL in 1991 for building technology solutions for Indian languages. The stated objectives of TDIL are:

(i) to develop information processing tools and techniques,
(ii) to facilitate human-machine interaction without language barrier,
(iii) to create and access multilingual knowledge resources and integrate them to develop innovative user products and services.

The TDIL has made available in the public domain many basic software tools and fonts for 22 Indian languages. On the language resources front, TDIL is running several language corpora projects in consortium mode. Some of the significant projects are:

  • Development of LRs for English to Indian Languages Machine Translation (MT) System
  • Development of LRs for Indian Language to Indian Language Machine Translation System
  • Development of LRs for Sanskrit-Hindi Machine Translation
  • Development of LRs for Robust Document Analysis & Recognition System for Indian Languages
  • Development of LRs for On-line Handwriting Recognition System
  • Development of LRs for Cross-lingual Information Access
  • Development of Speech Corpora/Technologies
  • Parallel Language Corpora development in all 22 national languages (ILCI)

Apart from the consortium-based efforts, there have been several institution- and organization-specific efforts to develop standard resources for Indian languages. Prominent examples include the Hindi Wordnet developed at IIT Bombay and the POS-tagged corpora in Bangla, Hindi and Sanskrit developed by Microsoft Research India in collaboration with Jawaharlal Nehru University, New Delhi.

WILDRE-1 (under LREC-2012, Istanbul, May 21-27, 2012)

Given the amount of activity in the area of Language Technology Resources at the government, institution and individual researcher levels, we organized the first workshop in Istanbul in 2012. The workshop was a huge success in terms of participation and number of submissions. For the half-day workshop, we selected 8 full papers and 18 posters. The workshop featured three distinguished speakers in the inaugural session: Mrs. Swaran Lata (Head, TDIL, Dept of IT, Govt of India), Khalid Choukri (CEO, ELDA) and Prof. Pushpak Bhattacharyya (IIT Bombay). The workshop also featured a panel discussion on “India and Europe - making a common cause in LTRs”, in which seven distinguished panelists participated: Khalid Choukri, Joseph Mariani, Pushpak Bhattacharyya, Swaran Lata, Monojit Choudhury, Zygmunt Vetulani and Dafydd Gibbon. The valedictory address was given by Nicoletta Calzolari, Director, ILC-CNR, Italy.

WILDRE-2 (under LREC-2014, Reykjavik, Iceland, May 26-31, 2014)

The 2nd Workshop for Indian Language Resources and Evaluation was organized on 27 May 2014 at the Harpa Conference Centre, Reykjavik, Iceland. The workshop was a big success, with 7 full papers and 11 posters/demos selected for presentation in the half-day workshop. The workshop featured an inaugural address by Nicoletta Calzolari and a keynote by Dafydd Gibbon. The panel discussion on “India and Europe - making a common cause in LTRs” was coordinated by Hans Uszkoreit and included among its panelists scholars such as Joseph Mariani, Shyam Aggarwal, Zygmunt Vetulani, Dafydd Gibbon and Panchanan Mohanty. The second workshop was remarkable on another count: it saw a collaboration emerging between Indian and European partners on two platforms, IMAGACT and TypeCraft, which led to joint poster presentations by researchers from India and Europe. The seminar ended with a valedictory address by Mrs. Swaran Lata, head of the TDIL program of the Government of India.

WILDRE-3 (under LREC-2016, Portorož, Slovenia, May 23-28, 2016)

The 3rd Workshop for Indian Language Resources and Evaluation was organized on 24 May 2016 at the Grand Hotel Bernardin Conference Center, Portorož, Slovenia. The workshop was a big success, with 7 full papers, 5 short papers and 11 posters/demos selected for presentation in the half-day workshop. The workshop featured an inaugural address and keynote by Nicoletta Calzolari. The panel discussion on "Structured Language Resources (SLRs) in India and Europe - avenues for closer collaboration" was coordinated by Jan Hajik and included among its panelists scholars such as Joseph Mariani, Zygmunt Vetulani, Jalpa Zaladi and Sunayana Sitaram. The seminar ended with a valedictory address by Zygmunt Vetulani, Adam Mickiewicz University, Poznan, Poland.

WILDRE-4 (under LREC-2018, Miyazaki (Japan), May 07-12, 2018)

The 4th Workshop for Indian Language Resources and Evaluation was organized on 12 May 2018 at the Phoenix Seagaia Resort, Miyazaki (Japan). The workshop was a big success, with 2 full papers, 3 short papers and 10 posters/demos selected for presentation in the half-day workshop. The workshop featured an inaugural address by Khalid Choukri (ELRA, France) and a keynote by Chris Cieri (LDC, Philadelphia, USA). The panel discussion on "Language Technology Resource – Exploring new frontiers of collaborative R & D" was coordinated by Zygmunt Vetulani (Adam Mickiewicz University, Poland) and included among its panelists scholars such as Daan van Esch (Google), Kalika Bali (Microsoft Research India) and Alessandro Panunzi (University of Florence, Italy). The seminar ended with a valedictory address by Joseph Mariani (LIMSI-CNRS, Paris).

WILDRE-5 (under LREC-2020, Marseille (France) May 11-16, 2020)

The 5th Workshop for Indian Language Resources and Evaluation was organized on 24th May 2020 (Online). The workshop was a big success, with 5 papers and 7 posters selected for presentation in the half-day workshop. The workshop featured an inaugural address by M Jagdish Kumar (VC, JNU) and a keynote by Anoop Kunchukuttan (Microsoft). The panel discussion on "New directions for Indian language technology resources" was coordinated by Kalika Bali (Microsoft Research India) and included among its panelists scholars such as Monojit Choudhury (Microsoft Research India), Pushpak Bhattacharyya (IIT Bombay/Patna), Dafydd Gibbon (Universität Bielefeld, Germany), SS Aggarwal (KIIT), Zygmunt Vetulani (Adam Mickiewicz University, Poland), Patrick Paroubek (LIMSI-CNRS, France) and Vijay Kumar (TDIL, Govt of India). The seminar ended with a valedictory address by Panchanan Mohanty (GLA, Mathura).

WILDRE-6 (under LREC-2022, Marseille (France) June 20-25, 2022)

The 6th Workshop for Indian Language Resources and Evaluation was organized on 20th June 2022. The workshop was a big success, with 3 papers and 13 posters selected for presentation in the half-day workshop. The workshop featured an inaugural address and keynote by Monojit Choudhury (MSRI, Bangalore). The seminar ended with a valedictory address by Chris Cieri (LDC, Philadelphia, USA).

The broader objectives of WILDRE-7 will be:
  • To map the status of Indian Language Resources
  • To investigate challenges related to creating and sharing various levels of language resources
  • To promote a dialogue between language resource developers and users
  • To provide opportunity for researchers from India to collaborate with researchers from other parts of the world

Description of the Topic

    WILDRE-7 will have a special focus on Demos of Indian Language Technology. In the past few years, as more resources have been developed and made available, there has been an increased activity in developing usable technology using these. WILDRE-7 would therefore like to encourage and widen the Demo track to allow the community to showcase their demos and have mutually beneficial interactions with each other as well as resource developers.

WILDRE will invite technical, policy and position paper submissions on the following topics related to Indian Language Resources:

  • Digital Humanities, heritage computing
  • Corpora - text, speech, multimodal, methodologies, annotation and tools
  • Lexicons and Machine-readable dictionaries
  • Ontologies
  • Grammars
  • Language resources for basic NLP, IR and Speech Technology tasks, tools and Infrastructure for constructing and sharing language resources
  • Standards or specifications for language resources applications
  • Licensing and copyright issues

Shared Task

    Following the success of the six WILDRE workshops, WILDRE-7 will include two shared tasks for Indian languages. The organizers of the shared tasks will provide datasets and evaluation platforms to evaluate the systems developed by the participants.

Both submission and review processes will be handled electronically. The review process will be double-blind.

 

7th WORKSHOP ON INDIAN LANGUAGE DATA: RESOURCES AND EVALUATION (WILDRE)

Date: Saturday, 25th May 2024

Venue: Lingotto Congress Centre, Torino, Italy (Organized under the platform of LREC-COLING 2024 (20-25 May 2024))

Website:

  • Main website - http://sanskrit.jnu.ac.in/conf/wildre7
  • Submit papers on - https://softconf.com/lrec-coling2024/WILDRE-7
  • LREC-COLING website - https://lrec-coling-2024.org/

    WILDRE, the 7th Workshop on Indian Language Data: Resources and Evaluation, is being organized in Torino (Italy) on 25th May 2024 under the LREC-COLING 2024 platform. India has huge linguistic diversity and has seen concerted efforts from the Indian government and industry towards developing language resources. The European Language Resources Association (ELRA) and its associate organizations have been very active and successful in addressing the challenges and opportunities related to language resource creation and evaluation. It is therefore a great opportunity for resource creators of Indian languages to showcase their work on this platform and also to interact with and learn from those involved in similar initiatives all over the world. The broader objectives of WILDRE will be:

    • To map the status of Indian Language Resources
    • To investigate challenges related to creating and sharing various levels of language resources
    • To promote a dialogue between language resource developers and users
    • To provide opportunity for researchers from India to collaborate with researchers from other parts of the world
    DATES

    March 06, 2024 (extended from February 28, 2024): Paper submissions due
    March 28, 2024: Paper notification of acceptance
    April 10, 2024: Camera-ready papers due
    May 25, 2024: Workshop

    SUBMISSIONS

    Papers must describe original, completed or in progress, and unpublished work. Each submission will be reviewed by three program committee members.

    Accepted papers will be given up to 10 pages (for full papers) or 5 pages (for short papers and posters) in the workshop proceedings, and will be presented as an oral presentation or a poster.

    Papers should be formatted according to the style-sheet provided on the LREC-COLING 2024 website (https://lrec-coling-2024.org/authors-kit/). Papers should be completely anonymised, and anything pointing to the author(s) of the paper should be removed. Papers should be submitted in PDF format to the LREC-COLING website.

    We are seeking submissions in the following categories:

  • Full papers (10 pages)
  • Short papers (work in progress – 5 pages)
  • Posters (innovative ideas/proposals, research proposal of students - 1 page poster sample)
  • Demo (of working online/standalone systems)

    Both submission and review processes will be handled electronically. The review process will be double-blind. The workshop website will provide the submission guidelines and the link for electronic submission. When submitting a paper from the START page, authors will be asked to provide essential information about resources (in a broad sense, i.e. also technologies, standards, evaluation kits, etc.) that have been used for the work described in the paper or are a new result of their research. Moreover, ELRA encourages all LREC-COLING authors to share the described LRs (data, tools, services, etc.) to enable their reuse and the replicability of experiments, including evaluation ones.

    For further information on this initiative, please refer to https://lrec-coling-2024.org/

    Contact:

    Dr. Atul Kr. Ojha, University of Galway, Ireland & Panlingua Language Processing LLP, India shashwatup9k@gmail.com

    Seventh Workshop on Indian Language Data: Resources and Evaluation (WILDRE-7) Shared Tasks

    The seventh Workshop on Indian Language Data: Resources and Evaluation (WILDRE-7) at LREC-COLING-2024 will include two shared tasks on the sentiment analysis and machine translation topics:

    Code-mixed Less-Resourced Sentiment analysis (Code-mixed)

    This shared task addresses the complexities of code-mixed data from less-resourced similar languages and focuses on sentiment analysis. The task builds on code-mixed sentiment analysis but introduces language pairs and triplets of less-resourced, closely related languages: Magahi-Hindi-English, Maithili-Hindi, Bangla-English, and Hindi-English. These four languages come from the Indo-Aryan language family and are spoken in eastern India.
    Keeping in view the challenges of processing closely related languages in code-mixed and low-resourced settings, we challenge the participants to explore different Machine Learning and Deep Learning approaches, training their models on the given training and validation datasets while testing on a surprise language. In this context, we will provide Hindi-English, Bangla-English and Magahi-Hindi-English datasets for training and validation. The Maithili-Hindi dataset, however, will be provided only in the test set, as a surprise language pair. Participants may use any approach to train their model, but it should be robust enough to perform well on a dataset from a closely related language. This would also allow us to understand the language representation in various code-mixed settings and which language speakers prefer for expressing their emotions in each language pair.

    Discourse Machine Translation (DiscoMT)

    The primary objective of this shared task is to advance the state-of-the-art in machine translation by promoting research and development of systems that can seamlessly handle discourse-level information with a focus on Indian languages. Participants are encouraged to explore innovative approaches that go beyond sentence-level translation and incorporate discourse markers, anaphora resolution, and cohesive structures to produce more contextually aware and natural-sounding translations.
    In the shared task, participants will be provided with training and dev datasets containing parallel texts in Indian languages at both sentence and discourse levels. These datasets may include diverse genres and domains to ensure the generalizability of the proposed models. The focus will be on languages with varying discourse structures, encouraging the exploration of language-specific challenges in discourse-aware translation. The evaluation set will be designed to test the models' ability to handle unseen examples of discourse and maintain coherence in various communicative contexts. Submitted systems will be evaluated using both automatic machine translation metrics (such as BLEU, chrF++ and TER) and human evaluation metrics designed to assess discourse coherence.

    Shared Task Dates

    Dec 22, 2023: Registration
    Jan 09, 2024: Train and Validation Data set Release [to get the data please register]
    Feb 15, 2024: Test Set Release
    Feb 23, 2024: System Submission Due
    Feb 29, 2024: System Results
    March 15, 2024: System Description Paper Due
    March 28, 2024: Paper notification of acceptance

    Task Organizers

    Code-mixed:

    • Priya Rani, SFI Centre for Research and Training in AI, DSI, University of Galway
    • Gaurav Negi, Insight SFI Research Centre for Data Analytics, DSI, University of Galway
    • Shardul Suryawanshi, Insight SFI Research Centre for Data Analytics, DSI, University of Galway
    • Saroj Jha, IIT-Patna
    • John P. McCrae, Insight SFI Research Centre for Data Analytics, DSI, University of Galway
    • Paul Buitelaar, Insight SFI Research Centre for Data Analytics, DSI, University of Galway

    Overview:

    • Atul Kr. Ojha, Girish Nath Jha, Sobha L., Kalika Bali

    Contact

    For questions related to shared tasks, please send an email to wildre-shared-tasks@googlegroups.com.
    For urgent/specific queries on the workshop and/or shared tasks please contact Dr. Atul Kr. Ojha at atulkumar.ojha@insight-centre.org.

     

    Conference Chairs and Organizing Committee

    Girish Nath Jha, Chairman, Commission for Scientific and Technical Terminology, MoE, GOI (on deputation from Jawaharlal Nehru University, India)
    Kalika Bali, Microsoft Research India Lab, Bangalore
    Sobha L, AU-KBC, Anna University
    Atul Kr. Ojha, Insight SFI Research Centre for Data Analytics, Data Science Institute, University of Galway, Ireland & Panlingua, India

    Details of the Conference Chairs

    Girish Nath Jha
    Chairman,
    Commission for Scientific and Technical Terminology, MoE, GOI
       &
    Professor in Computational Linguistics,
    School of Sanskrit and Indic Studies,
    J.N.U., New Delhi - 110067
    Phone: +91-11-26741308 (o) Email: girishjha@gmail.com

    Prof. Girish Nath Jha teaches Computational Linguistics at the School of Sanskrit and Indic Studies in Jawaharlal Nehru University (JNU) and is currently the Chairman, Commission for Scientific and Technical Terminology, MoE, GOI. He also holds concurrent appointments in JNU’s Center of Linguistics and Special Center of E-Learning, and is an Associated Faculty in the ABV School of Management and Entrepreneurship. Prof Jha was previously the director of JNU’s International Collaboration during 2016-18. His research interests include Indian languages corpora and standards, Sanskrit and Hindi linguistics, Science & Technology in ancient texts, Lexicography, Machine Translation, e-learning, web based technologies, RDBMS, software design and localization. Details on his work can be obtained from http://sanskrit.jnu.ac.in. Prof Jha did collaborative research with the Center for Indic Studies, University of Massachusetts, Dartmouth, MA, USA as "Mukesh and Priti Chatter Distinguished Professor of History of Science" during 2009-12, and was a visiting professor at the Yogyakarta State University, Indonesia in 2013. He was awarded DAAD fellowships in 2014 and 2016 to teach Computational Linguistics in the Digital Humanities department at the University of Würzburg, Germany, and was a visiting Professor at the University of Florence in the summer of 2016. Prof. Jha did his M.A., M.Phil. and Ph.D. in Linguistics (Computational Linguistics) at JNU and then obtained another master's degree in Linguistics (specializing in Natural Language Interface) from the University of Illinois, Urbana Champaign, USA in 1999. He then worked as a software engineer and software development specialist in the USA before joining JNU in 2002. Prof Jha has published books with publishers such as Springer Verlag and Cambridge Scholar Publishing, and has over 133 research papers/presentations/publications and over 178 invited talks.
    Prof Jha has had several consultancies, including those from Nuance, Swiftkey, Microsoft Research USA, Microsoft Research India, Microsoft Corporation, Linguistic Data Consortium (University of Pennsylvania), University of Massachusetts Dartmouth and EZDI, among others. Prof Jha has completed several sponsored projects for Indian language technology development and has led a consortium of 17 Indian universities/institutes for developing corpora and standards for Indian languages, sponsored by the Ministry of Electronics and IT (MEITY), Govt. of India. Prof Jha has been chair/co-chair for at least 13 international seminars/conferences and has been a nominated member of more than 30 committees. He was nominated to the editorial board of a leading journal from Springer and has been a reviewer for many leading journals and proceedings in the area of NLP. He has supervised 42 M.Phil. and 46 Ph.D. students. Prof Jha's efforts in collaboration with the software industry have led to the development of key technologies for Indian languages, including English-Urdu MT for Microsoft Bing Translator and predictive keyboards for several Indian languages by Swiftkey. His awards include the Datta Peetha award for Sanskrit linguistics (2017) and the KECSS Felicitation award for promotion of Sharada script (2016).

    Kalika Bali
    Researcher (Multilingual Systems)
    Microsoft Research Labs India
    Address: “Vigyan” #9 Lavelle Road, Bangalore 560025 India
    Phone: +91-80-66586218 Email: kalikab@microsoft.com

    Kalika Bali is a Principal Researcher at Microsoft Research India working in the areas of Machine Learning, Natural Language Systems and Applications, as well as Technology for Emerging Markets. Her research interests lie broadly in the area of Speech and Language Technology, especially in the use of linguistic models for building technology that offers more natural Human-Computer as well as Computer-Mediated interactions, and technology for Low Resource Languages. She is currently working on Project Mélange, which tries to understand, process and generate Code-mixed language data for both text and speech. She is also interested in how social and pragmatic functions affect language use, in code-mixed as well as monolingual conversations, and how to build effective computational models of sociolinguistics and pragmatics that can lead to more aware Artificial Intelligence. She is very passionate about NLP and Speech technology for Indian Languages. She believes that local language technology, especially with speech interfaces, can help millions of people gain entry into a world that has until now been almost inaccessible to them. She has served, and continues to serve, on several government and other committees that work on Indian Language Technologies as well as Linguistic Resources and Standards for NLP/Speech.

    Sobha L.
    CLRG Group
    AU-KBC Research Centre
    MIT campus of Anna University
    Chennai-600044
    Phone: +91-44-22232711 Email: sobha@au-kbc.org

    Sobha Lalitha Devi is a scientist with the Information Sciences Division of AU-KBC Research Centre, Anna University, Chennai, India. Sobha’s research interests are in the fields of Discourse Analysis, Information Extraction and Retrieval. She specializes in the area of Anaphora Resolution. She is one of the key organizers of the Discourse Anaphora and Anaphor Resolution Colloquium (DAARC). In addition to the above areas, she also works on automatic detection of plagiarism and organizes tracks in plagiarism detection. In the area of information retrieval, she along with her students started the Tamil search engine www.searchko.in. She is involved in two major consortium projects funded by the Department of Information Technology, Government of India, on Cross Lingual Information Access and Indian Language to Indian Language Machine Translation System (Tamil to Hindi bidirectional), and in a European Union (EU) funded project, WIQ-EI (Web Information Quality Evaluation Initiative). She has been visiting faculty at universities in the UK, Spain and Portugal. She was an Erasmus Mundus coordinator for 2010-2012 and is associated with the University of Wolverhampton.

     

    Program Committee (to be updated soon)
    • Adil Amin Kak, Kashmir University
    • Anil Kumar Singh, IIT BHU, Benaras
    • Anupam Basu, Director, NIT, Durgapur
    • Anoop Kunchukuttan, Microsoft AI and Research, India
    • Arul Mozhi, University of Hyderabad
    • Asif Iqbal, IIT Patna, Patna
    • Atul Kr. Ojha, University of Galway, Ireland & Panlingua Language Processing LLP, India
    • Bogdan Babych, Heidelberg University, Germany
    • Dafydd Gibbon, Universität Bielefeld, Germany
    • Daan van Esch, Google, USA
    • Dipti Mishra Sharma, IIIT, Hyderabad
    • Diwakar Mishra, Amazon-Bangalore, India
    • Elizabeth Sherley, IITM-Kerala, Trivandrum
    • Eveline Wandl-Vogt, Austrian Academy of Sciences, Austria
    • Georg Rehm, DFKI, Germany
    • Girish Nath Jha, Chairman, Commission for Scientific and Technical Terminology, MoE, GOI and JNU, New Delhi
    • Jolanta Bachan, Adam Mickiewicz University, Poland
    • Joseph Mariani, LIMSI-CNRS, France
    • Kalika Bali, MSRI, Bangalore
    • Khalid Choukri, ELRA, France
    • Lars Hellan, NTNU, Norway
    • Malhar Kulkarni, IIT Bombay
    • Manji Bhadra, Bankura University, West Bengal
    • Massimo Moneglia, University of Florence, Italy
    • Monojit Choudhury, Mohamed bin Zayed University of Artificial Intelligence, Abu Dhabi
    • Narayan Choudhary, CIIL, Mysore
    • Niladri Shekhar Dash, ISI Kolkata
    • Panchanan Mohanty, GLA, Mathura
    • Rajeev R R, ICFOSS, Trivandrum
    • Shantipriya Parida, Silo AI, Finland
    • Shagun Sinha, Amity University, Noida, India
    • Shivaji Bandhopadhyay, Director, NIT, Silchar
    • Sobha L, AU-KBC Research Centre, Anna University
    • Subhash Chandra, Delhi University
    • Swaran Lata, Retired Head, TDIL, MCIT, Govt of India
    • Vijay Sundar Ram, AU-KBC Research Centre, Anna University
    • Virach Sornlertlamvanich, Thammasat University, Bangkok, Thailand
    • Zygmunt Vetulani, Adam Mickiewicz University, Poland

     

     

     

     

     

     

     

    Workshop Program

    Saturday, May 25, 2024 (Time in GMT+2/CEST)

    14:00–14:05 Welcome by Workshop Chairs
    14:05–14:15 Inaugural session
    Neeta Prasad, Head, Language Division, MoE
    14:15–15:00 Keynote Lecture
    15:00–16:00 Oral Session-I
    15:00–15:20 Towards Disfluency Annotated Corpora for Indian Languages
    Chayan Kochar, Vandan Vasantlal Mujadia, Pruthwik Mishra, Dipti Misra Sharma
    15:20–15:40 EmoMix-3L: A Code-Mixed Dataset for Bangla-English-Hindi for Emotion Detection
    Nishat Raihan, Dhiman Goswami, Antara Mahmud, Antonios Anastasopoulos, Marcos Zampieri
    15:40–16:00 Findings of the WILDRE Shared Task on Code-mixed Less-resourced Sentiment Analysis for Indo-Aryan Language
    Priya Rani, Gaurav Negi, Saroj Jha, Shardul Suryawanshi, Atul Kr. Ojha, Paul Buitelaar, John P. McCrae
    16:00–16:30 Coffee break/Poster Session
    16:00–16:30 Multilingual Bias Detection and Mitigation for Indian Languages
    Ankita Maity, Anubhav Sharma, Rudra Dhar, Tushar Abhishek, Manish Gupta and Vasudeva Varma
    16:00–16:30 Dharmaśāstra Informatics: Concept Mining System for Socio-Cultural Facet in Ancient India
    Arooshi Nigam, Subhash Chandra
    16:00–16:30 Exploring News Summarization and Enrichment in a Highly Resource-Scarce Indian Language: A Case Study of Mizo
    Abhinaba Bala, Ashok Urlana, Rahul Mishra, Parameswari Krishnamurthy
    16:00–16:30 Finding the Causality of an Event in News Articles
    Sobha Lalitha Devi, Pattabhi RK Rao
    16:00–16:30 Creating Corpus of Low Resource Indian Languages for Natural Language Processing: Challenges and Opportunities
    Pratibha Dongare
    16:00–16:30 FZZG at WILDRE-7: Fine-tuning Pre-trained Models for Code-mixed, Less-resourced Sentiment Analysis
    Gaurish Thakkar, Marko Tadić, Nives Mikelic Preradovic
    16:00–16:30 MLInitiative@WILDRE7: Hybrid Approaches with Large Language Models for Enhanced Sentiment Analysis in Code-Switched and Code-Mixed Texts
    Hariram Veeramani, Surendrabikram Thapa, Usman Naseem
    16:00–16:30 Toxic Comment Classification in Hindi using different Neural Network and Transformer Models
    Devendra Kumar Tayal, Ayush Gupta, Sistla Venkata Sai Shruthi, Abhilasha Singh, Shaily Sehgal
    16:30–16:55 Oral Session-II
    16:30–16:55 Aalamaram: A Large-Scale Linguistically Annotated Treebank for the Tamil Language
    A M Abirami, Wei Qi Leong, Hamsawardhini Rengarajan, D Anitha, R Suganya, Himanshu Singh, Kengatharaiyer Sarveswaran, William Chandra Tjhi, Rajiv Ratn Shah
    16:55–17:40 Panel discussion
    Pushpak Bhattacharyya (IIT Bombay), Kalika Bali (Microsoft Research India), Sobha L. (AU-KBC, Anna University), Zygmunt Vetulani (Adam Mickiewicz University, Poland), Daan van Esch (Google, USA), Girish Nath Jha (Chairman, Commission for Scientific and Technical Terminology, MoE, GOI)
    Topic – Newer directions for Indian language technology resources
    Panel Coordinator: Atul Kr. Ojha (University of Galway)
    17:40–17:50 Vote of Thanks

     

     

     

     

     




  • MSR has been a consistent sponsor of all the WILDRE events so far

       


     

    Benefits to our Sponsors

  • Opportunity to demo your technology
  • Present a poster of your research
  • Opportunity to participate in the panel discussion

    Please contact Prof. Girish Nath Jha (girishjha@jnu.ac.in) for sponsorship-related queries.