Constantine Lignos

Assistant Professor of Computational Linguistics
Michtom School of Computer Science, Brandeis University
Email: lastname at brandeis dot edu

I direct the Broadening Linguistic Technologies Lab at Brandeis University, where I am affiliated with the Michtom School of Computer Science and the Computational Linguistics Program. The overarching goal of my research is to broaden the depth and breadth of human language technology, with a focus on understudied problems in natural language processing (NLP).

My current work focuses on eliminating the barriers to useful language technology for every living written language, especially lower-resourced and minoritized languages. I have previously worked on human-robot interaction and the representation of language in the mind, including language acquisition, processing, and change.

I did my graduate work in Computer Science at the University of Pennsylvania (Ph.D. 2013), advised by Mitch Marcus and Charles Yang. I then completed a postdoctoral fellowship at the Children’s Hospital of Philadelphia exploring clinical applications of statistical models of language processing. I was a researcher at BBN Technologies and the USC Information Sciences Institute. In summer 2019, I joined the computational linguistics faculty at Brandeis University.

Links

Lab GitHub: BLTLab
My GitHub: ConstantineLignos

Recent News

Jan 01, 2025	I’ve been elected Vice President of the ACL Special Interest Group on Writing Systems and Written Language (SIGWrit).
Jan 01, 2025	After 15 years of simple HTML bliss, I’ve launched my new website!

Selected Publications

Language Model Priors and Data Augmentation Strategies for Low-resource Machine Translation: A Case Study Using Finnish to Northern Sámi

Jonne Sälevä and Constantine Lignos

In Findings of the Association for Computational Linguistics: ACL 2024, Aug 2024

We investigate ways of using monolingual data in both the source and target languages for improving low-resource machine translation. As a case study, we experiment with translation from Finnish to Northern Sámi. Our experiments show that while conventional backtranslation remains a strong contender, using synthetic target-side data when training backtranslation models can be helpful as well. We also show that monolingual data can be used to train a language model which can act as a regularizer without any augmentation of parallel data.
CoNLL#: Fine-grained Error Analysis and a Corrected Test Set for CoNLL-03 English

Andrew Rueda, Elena Álvarez-Mellado, and Constantine Lignos

In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), May 2024

Modern named entity recognition systems have steadily improved performance in the age of larger and more powerful neural models. However, over the past several years, the state of the art has seemingly hit another plateau on the benchmark CoNLL-03 English dataset. In this paper, we perform a deep dive into the test outputs of the highest-performing NER models, conducting a fine-grained evaluation of their performance by introducing new document-level annotations on the test set. We go beyond F1 scores by categorizing errors in order to interpret the true state of the art for NER and guide future work. We review previous attempts at correcting the various flaws of the test set and introduce CoNLL#, a new corrected version of the test set that addresses its systematic and most prevalent errors, allowing for low-noise, interpretable error analysis.
ParaNames 1.0: Creating an Entity Name Corpus for 400+ Languages Using Wikidata

Jonne Sälevä and Constantine Lignos

In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), May 2024

We introduce ParaNames, a massively multilingual parallel name resource consisting of 140 million names spanning over 400 languages. Names are provided for 16.8 million entities, and each entity is mapped from a complex type hierarchy to a standard type (PER/LOC/ORG). Using Wikidata as a source, we create the largest resource of this type to date. We describe our approach to filtering and standardizing the data to provide the best quality possible. ParaNames is useful for multilingual language processing, both in defining tasks for name translation/transliteration and as supplementary data for tasks such as named entity recognition and linking. We demonstrate the usefulness of ParaNames on two tasks. First, we perform canonical name translation between English and 17 other languages. Second, we use it as a gazetteer for multilingual named entity recognition, obtaining performance improvements on all 10 languages evaluated.
QueryNER: Segmentation of E-commerce Queries

Chester Palen-Michel, Lizzie Liang, Zhe Wu, and Constantine Lignos

In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), May 2024

We present QueryNER, a manually-annotated dataset and accompanying model for e-commerce query segmentation. Prior work in sequence labeling for e-commerce has largely addressed aspect-value extraction, which focuses on extracting portions of a product title or query for narrowly defined aspects. Our work instead focuses on the goal of dividing a query into meaningful chunks with broadly applicable types. We report baseline tagging results and conduct experiments comparing token and entity dropping for null and low-recall query recovery. Challenging test sets are created using automatic transformations and show how simple data augmentation techniques can make the models more robust to noise. We make the QueryNER dataset publicly available.
LR-Sum: Summarization for Less-Resourced Languages

Chester Palen-Michel and Constantine Lignos

In Findings of the Association for Computational Linguistics: ACL 2023, Jul 2023

We introduce LR-Sum, a new permissively-licensed dataset created with the goal of enabling further research in automatic summarization for less-resourced languages. LR-Sum contains human-written summaries for 40 languages, many of which are less-resourced. We describe our process for extracting and filtering the dataset from the Multilingual Open Text corpus (Palen-Michel et al., 2022). The source data is public domain newswire collected from Voice of America websites, and LR-Sum is released under a Creative Commons license (CC BY 4.0), making it one of the most openly-licensed multilingual summarization datasets. We describe abstractive and extractive summarization experiments to establish baselines and discuss the limitations of this dataset.
Detecting Unassimilated Borrowings in Spanish: An Annotated Corpus and Approaches to Modeling

Elena Álvarez-Mellado and Constantine Lignos

In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), May 2022

This work presents a new resource for borrowing identification and analyzes the performance and errors of several models on this task. We introduce a new annotated corpus of Spanish newswire rich in unassimilated lexical borrowings—words from one language that are introduced into another without orthographic adaptation—and use it to evaluate how several sequence labeling models (CRF, BiLSTM-CRF, and Transformer-based models) perform. The corpus contains 370,000 tokens and is larger, more borrowing-dense, OOV-rich, and topic-varied than previous corpora available for this task. Our results show that a BiLSTM-CRF model fed with subword embeddings along with either Transformer-based embeddings pretrained on codeswitched data or a combination of contextualized word embeddings outperforms results obtained by a multilingual BERT-based model.
Multilingual Open Text Release 1: Public Domain News in 44 Languages

Chester Palen-Michel, June Kim, and Constantine Lignos

In Proceedings of the Thirteenth Language Resources and Evaluation Conference, Jun 2022

We present Multilingual Open Text (MOT), a new multilingual corpus containing text in 44 languages, many of which have limited existing text resources for natural language processing. The first release of the corpus contains over 2.8 million news articles and an additional 1 million short snippets (photo captions, video descriptions, etc.) published between 2001 and 2022 and collected from Voice of America’s news websites. We describe our process for collecting, filtering, and processing the data. The source material is in the public domain, our collection is licensed using a Creative Commons license (CC BY 4.0), and all software used to create the corpus is released under the MIT License. The corpus will be regularly updated as additional documents are published.
SeqScore: Addressing Barriers to Reproducible Named Entity Recognition Evaluation

Chester Palen-Michel, Nolan Holley, and Constantine Lignos

In Proceedings of the 2nd Workshop on Evaluation and Comparison of NLP Systems, Nov 2021

To address a looming crisis of unreproducible evaluation for named entity recognition, we propose guidelines and introduce SeqScore, a software package to improve reproducibility. The guidelines we propose are extremely simple and center around transparency regarding how chunks are encoded and scored. We demonstrate that despite the apparent simplicity of NER evaluation, unreported differences in the scoring procedure can result in changes to scores that are both of noticeable magnitude and statistically significant. We describe SeqScore, which addresses many of the issues that cause replication failures.