Proceedings of the 1st Workshop on Ecology, Environment, and Natural Language Processing (NLP4Ecology2025)
Permanent URI for this collection: https://hdl.handle.net/10062/107174
Recently added

Item: 1st Workshop on Ecology, Environment, and Natural Language Processing. Proceedings of NLP4Ecology2025
(University of Tartu Library, 2025-03-02) Basile, Valerio; Bosco, Cristina; Grasso, Francesca; Ibrahim, Muhammad Okky; Skeppstedt, Maria; Stede, Manfred

Item: The Accuracy, Robustness, and Readability of LLM-Generated Sustainability-Related Word Definitions
(University of Tartu Library, 2025-03) Heiman, Alice; Basile, Valerio; Bosco, Cristina; Grasso, Francesca; Ibrahim, Muhammad Okky; Skeppstedt, Maria; Stede, Manfred
A common language with shared standard definitions is essential for effective climate conversations. However, there is concern that LLMs may misrepresent and/or diversify climate-related terms. We compare 305 official IPCC glossary definitions with those generated by OpenAI's GPT-4o-mini and investigate their adherence, robustness, and readability using a combination of SBERT sentence embeddings and statistical measures. The LLM definitions received average adherence and robustness scores of $0.58 \pm 0.15$ and $0.96 \pm 0.02$, respectively. Both sustainability-related terminologies remain challenging to read, with model-generated definitions varying mainly among words with multiple or ambiguous definitions. Thus, the results highlight the potential of LLMs to support environmental discourse while emphasizing the need to align model outputs with established terminology for clarity and consistency.

Item: Efficient Scientific Full Text Classification: The Case of EICAT Impact Assessments
(University of Tartu Library, 2025-03) Brinner, Marc Felix; Zarrieß, Sina; Basile, Valerio; Bosco, Cristina; Grasso, Francesca; Ibrahim, Muhammad Okky; Skeppstedt, Maria; Stede, Manfred
This study explores strategies for efficiently classifying scientific full texts using both small, BERT-based models and local large language models like Llama-3.1 8B.
We focus on developing methods for selecting subsets of input sentences to reduce input size while simultaneously enhancing classification performance. To this end, we compile a novel dataset consisting of full-text scientific papers from the field of invasion biology, specifically addressing the impacts of invasive species. These papers are aligned with publicly available impact assessments created by researchers for the International Union for Conservation of Nature (IUCN). Through extensive experimentation, we demonstrate that various sources like human evidence annotations, LLM-generated annotations or explainability scores can be used to train sentence selection models that improve the performance of both encoder- and decoder-based language models while optimizing efficiency through the reduction in input length, leading to improved results even if compared to models like ModernBERT that are able to handle the complete text as input. Additionally, we find that repeated sampling of shorter inputs proves to be a very effective strategy that, at a slightly increased cost, can further improve classification performance.

Item: Towards Addressing Anthropocentric Bias in Large Language Models
(University of Tartu Library, 2025-03) Grasso, Francesca; Locci, Stefano; Di Caro, Luigi; Basile, Valerio; Bosco, Cristina; Grasso, Francesca; Ibrahim, Muhammad Okky; Skeppstedt, Maria; Stede, Manfred
The widespread use of Large Language Models (LLMs), particularly among non-expert users, has raised ethical concerns about the propagation of harmful biases. While much research has addressed social biases, few works, if any, have examined anthropocentric bias in Natural Language Processing (NLP) technology. Anthropocentric language prioritizes human value, framing non-human animals, living entities, and natural elements solely by their utility to humans; a perspective that contributes to the ecological crisis.
In this paper, we evaluate anthropocentric bias in OpenAI’s GPT-4o across various target entities, including sentient beings, non-sentient entities, and natural elements. Using prompts eliciting neutral, anthropocentric, and ecocentric perspectives, we analyze the model’s outputs and introduce a manually curated glossary of 424 anthropocentric terms as a resource for future ecocritical research. Our findings reveal a strong anthropocentric bias in the model's responses, underscoring the need to address human-centered language use in AI-generated text to promote ecological well-being.

Item: No AI on a Dead Planet: Sentiment and Emotion Analysis Across Reddit Communities on AI and the Environment
(University of Tartu Library, 2025-03) Longo, Arianna; Longo, Alessandro Y.; Basile, Valerio; Bosco, Cristina; Grasso, Francesca; Ibrahim, Muhammad Okky; Skeppstedt, Maria; Stede, Manfred
This paper investigates how different online communities perceive and discuss the environmental impact of AI through sentiment analysis and emotion detection. We analyze Reddit discussion from r/artificial and r/climatechange, using pre-trained models fine-tuned on social media data. Our analysis reveals distinct patterns in how these communities engage with AI's environmental implications: the AI community demonstrates a shift from predominantly neutral and positive sentiment in posts to more balanced perspectives in comments, while the climate community maintains a more critical stance throughout discussions.
The findings contribute to our understanding of how different communities conceptualize and respond to the environmental challenges of AI development.

Item: Analyzing the Online Communication of Environmental Movement Organizations: NLP Approaches to Topics, Sentiment, and Emotions
(University of Tartu Library, 2025-03) Barz, Christina; Siegel, Melanie; Hanss, Daniel; Basile, Valerio; Bosco, Cristina; Grasso, Francesca; Ibrahim, Muhammad Okky; Skeppstedt, Maria; Stede, Manfred
This project employs state-of-the-art Natural Language Processing (NLP) techniques to analyze the online communication of international Environmental Movement Organizations (EMOs). First, we introduce our overall EMO dataset and describe it through topic modeling. Second, we evaluate current sentiment and emotion classification models for our specific dataset. Third, as we are currently in our annotation process, we evaluate our current progress and issues to determine the most effective approach for creating a high-quality annotated dataset that captures the nuances of EMO communication. Finally, we emphasize the need for domain-specific datasets and tailored NLP tools and suggest refinements for our annotation process moving forward.

Item: Quantification of Biodiversity from Historical Survey Text with LLM-based Best-Worst-Scaling
(University of Tartu Library, 2025-03) Haider, Thomas; Perschl, Tobias; Rehbein, Malte; Basile, Valerio; Bosco, Cristina; Grasso, Francesca; Ibrahim, Muhammad Okky; Skeppstedt, Maria; Stede, Manfred
In this paper, we evaluate methods to determine biodiversity via quantity estimation from historical survey text. To that end, we formulate classification tasks and finally show that this problem can be successfully framed as regression based on best-worst-scaling with LLMs.
We find that this approach is more cost effective and similarly robust to a fine-grained multi-class approach, allowing automated quantity estimation across species.

Item: Entity Linking using LLMs for Automated Product Carbon Footprint Estimation
(University of Tartu Library, 2025-03) Castle, Steffen; Moreno Schneider, Julian; Basile, Valerio; Bosco, Cristina; Grasso, Francesca; Ibrahim, Muhammad Okky; Skeppstedt, Maria; Stede, Manfred
Growing concerns about climate change and sustainability are driving manufacturers to take significant steps toward reducing their carbon footprints. For these manufacturers, a first step towards this goal is to identify the environmental impact of the individual components of their products. We propose a system leveraging large language models (LLMs) to automatically map components from manufacturer Bills of Materials (BOMs) to Life Cycle Assessment (LCA) database entries by using LLMs to expand on available component information. Our approach reduces the need for manual data processing, paving the way for more accessible sustainability practices.

Item: Thematic Categorization on Pineapple Production in Costa Rica: An Exploratory Analysis through Topic Modeling
(University of Tartu Library, 2025-03) Beckles, Valentina Tretti; Heidke, Adrian Vergara; Basile, Valerio; Bosco, Cristina; Grasso, Francesca; Ibrahim, Muhammad Okky; Skeppstedt, Maria; Stede, Manfred
Costa Rica is one of the largest producers and exporters of pineapple in the world. This status has encouraged multinational companies to use plantations in this Central American country for experimentation and the cultivation of new varieties, such as the Pinkglow pineapple. However, pineapple monoculture has significant socio-environmental impacts on the regions where it is cultivated. In this exploratory study, we aimed to analyze how pineapple production is portrayed on the Internet.
To achieve this, we collected a corpus of texts in Spanish and English from online sources in two phases: using the BootCat tool and manual search on newspaper websites. The Hierarchical Dirichlet Process (HDP) topic model was then applied to identify dominant topics within the corpus. These topics were subsequently classified into thematic categories, and the texts were categorized accordingly. The findings indicate that environmental issues related to pineapple cultivation are underrepresented on the Internet, particularly in comparison to the extensive focus on topics related to pineapple production and marketing.

Item: Communicating urgency to prevent environmental damage: insights from a linguistic analysis of the WWF24 multilingual corpus
(University of Tartu Library, 2025-03) Bosco, Cristina; Pagano, Adriana Silvina; Chierchiello, Elisa; Basile, Valerio; Bosco, Cristina; Grasso, Francesca; Ibrahim, Muhammad Okky; Skeppstedt, Maria; Stede, Manfred
Contemporary environmental discourse focuses on effectively communicating ecological vulnerability to raise public awareness and encourage positive actions. Hence there is a need for studies to support accurate and adequate discourse production, both by humans and computers. Two main challenges need to be tackled. On the one hand, the language used to communicate about environment issues can be very complex for human and automatic analysis, there being few resources to train and test NLP tools. On the other hand, in the current international scenario, most texts are written in multiple languages or translated from a major to minor language, resulting in different meanings in different languages and cultural contexts.
This paper presents a novel parallel corpus comprising the text of World Wide Fund (WWF) 2024 Annual Report in English and its translations into Italian and Brazilian Portuguese, and analyses their linguistic features.

Item: Large Language Models as Annotators of Named Entities in Climate Change and Biodiversity: A Preliminary Study
(University of Tartu Library, 2025-03) Volkanovska, Elena; Basile, Valerio; Bosco, Cristina; Grasso, Francesca; Ibrahim, Muhammad Okky; Skeppstedt, Maria; Stede, Manfred
This paper examines whether few-shot techniques for Named Entity Recognition (NER) utilising existing large language models (LLMs) as their backbone can be used to reliably annotate named entities (NEs) in scientific texts on climate change and biodiversity. A series of experiments aim to assess whether LLMs can be integrated into an end-to-end pipeline that could generate token- or sentence-level NE annotations; the former being an ideal-case scenario that allows for seamless integration of existing with new token-level features in a single annotation pipeline. Experiments are run on four LLMs, two NER datasets, two input and output data formats, and ten and nine prompt versions per dataset. The results show that few-shot methods are far from being a silver bullet for NER in highly specialised domains, although improvement in LLM performance is observed for some prompt designs and some NE classes.
Few-shot methods would find better use in a human-in-the-loop scenario, where an LLM's output is verified by a domain expert.

Item: Mining for Species, Locations, Habitats, and Ecosystems from Scientific Papers in Invasion Biology: A Large-Scale Exploratory Study with Large Language Models
(University of Tartu Library, 2025-03) D'Souza, Jennifer; Laubach, Zachary; Mustafa, Tarek Al; Zarrieß, Sina; Frühstückl, Robert; Illari, Phyllis; Basile, Valerio; Bosco, Cristina; Grasso, Francesca; Ibrahim, Muhammad Okky; Skeppstedt, Maria; Stede, Manfred
This study explores the use of large language models (LLMs), specifically GPT-4o, to extract key ecological entities (species, locations, habitats, and ecosystems) from invasion biology literature. This information is critical for understanding species spread, predicting future invasions, and informing conservation efforts. Without domain-specific fine-tuning, we assess the potential and limitations of GPT-4o, out-of-the-box, for this task, highlighting the role of LLMs in advancing automated knowledge extraction for ecological research and management.

Item: Perspectives on Forests and Forestry in Finnish Online Discussions - A Topic Modeling Approach to Suomi24
(University of Tartu Library, 2025-03) Peura, Telma; Krizsán, Attila; Kuusalu, Salla-Riikka; Laippala, Veronika; Basile, Valerio; Bosco, Cristina; Grasso, Francesca; Ibrahim, Muhammad Okky; Skeppstedt, Maria; Stede, Manfred
This paper explores how forests and forest industry are perceived on the largest online discussion forum in Finland, Suomi24 ('Finland24'). Using 30,636 posts published in 2014–2020, we investigate what kind of topics and perspectives towards forest management can be found. We use BERTopic as our topic modeling approach and evaluate the results of its different modular combinations. As the dataset is not labeled, we demonstrate the validity of our best model through illustrating some of the topics about forest use.
The results show that a combination of UMAP and K-means leads to the best topic quality. Our exploratory qualitative analysis indicates that the posts reflect polarized discourses between the forest industry and forest conservation adherents.

Item: From Data to Grassroots Initiatives: Leveraging Transformer-Based Models for Detecting Green Practices in Social Media
(University of Tartu Library, 2025-03) Glazkova, Anna; Zakharova, Olga; Basile, Valerio; Bosco, Cristina; Grasso, Francesca; Ibrahim, Muhammad Okky; Skeppstedt, Maria; Stede, Manfred
Green practices are everyday activities that support a sustainable relationship between people and the environment. Detecting these practices in social media helps track their prevalence and develop recommendations to promote eco-friendly actions. This study compares machine learning methods for identifying mentions of green waste practices as a multi-label text classification task. We focus on transformer-based models, which currently achieve state-of-the-art performance across various text classification tasks. Along with encoder-only models, we evaluate encoder-decoder and decoder-only architectures, including instruction-based large language models. Experiments on the GreenRu dataset, which consists of Russian social media texts, show the prevalence of the mBART encoder-decoder model. The findings of this study contribute to the advancement of natural language processing tools for ecological and environmental research, as well as the broader development of multi-label text classification methods in other domains.

Item: 1st Workshop on Ecology, Environment, and Natural Language Processing.
(University of Tartu Library, 2025-03) Basile, Valerio; Bosco, Cristina; Grasso, Francesca; Ibrahim, Muhammad Okky; Skeppstedt, Maria; Stede, Manfred