publications
preprints
2023
- Conversations in Galician: a Large Language Model for an Underrepresented Language. Eliseo Bao, Anxo Pérez, and Javier Parapar. arXiv preprint, 2023
The recent proliferation of Large Conversation Language Models has highlighted the economic significance of widespread access to this type of AI technology in the current information age. Nevertheless, prevailing models have primarily been trained on corpora consisting of documents written in popular languages. The dearth of such cutting-edge tools for low-resource languages further exacerbates their underrepresentation in the current economic landscape, thereby impacting their native speakers. This paper introduces two novel resources designed to enhance Natural Language Processing (NLP) for the Galician language. We present a Galician adaptation of the Alpaca dataset, comprising 52,000 instructions and demonstrations. This dataset proves invaluable for enhancing language models by fine-tuning them to more accurately adhere to provided instructions. Additionally, as a demonstration of the dataset's utility, we fine-tuned LLaMA-7B, following the Alpaca format, to comprehend and respond in Galician, a language not originally supported by the model. This work contributes to the research on multilingual models tailored for low-resource settings, a crucial endeavor in ensuring the inclusion of all linguistic communities in the development of Large Language Models. Another noteworthy aspect of this research is the exploration of how knowledge of a closely related language, in this case Portuguese, can assist in generating coherent text when training resources are scarce. Both the Galician Alpaca dataset and the fine-tuned model, Cabuxa-7B, are publicly accessible on our Hugging Face Hub, and we have made the source code available to facilitate replication of this experiment and encourage further advancements for underrepresented languages.
- Explainable Depression Symptom Detection in Social Media. Eliseo Bao, Anxo Pérez, and Javier Parapar. arXiv preprint, 2023
Users of social platforms often perceive these sites as supportive spaces to post about their mental health issues. Those conversations contain important traces about individuals’ health risks. Recently, researchers have exploited this online information to construct mental health detection models, which aim to identify users at risk on platforms like Twitter, Reddit or Facebook. Most of these models are centred on achieving good classification results, ignoring the explainability and interpretability of the decisions. Recent research has pointed out the importance of using clinical markers, such as symptoms, to improve health professionals’ trust in the computational models. In this paper, we propose using transformer-based architectures to detect and explain the appearance of depressive symptom markers in the users’ writings. We present two approaches: i) training one model to classify and a separate one to explain the classifier’s decision, and ii) unifying the two tasks in a single model. For the latter approach, we also investigated the performance of recent conversational LLMs when using in-context learning. Our natural language explanations enable clinicians to interpret the models’ decisions based on validated symptoms, enhancing trust in the automated process. We evaluate our approach using recent symptom-based datasets, employing both offline and expert-in-the-loop metrics to assess the quality of the explanations generated by our models. The experimental results show that it is possible to achieve good classification results while generating interpretable symptom-based explanations.
theses
2022
- B.Sc. Thesis. Ranking of Reddit users using Relevance Models for depressive disorders. Eliseo Bao. Jul 2022
Depressive disorders are among the most common groups of illnesses in the world. Although effective treatments exist, a lack of resources or the stigma still associated with these disorders means that in many cases the consequences for those who suffer from them are devastating. Because the language used by people with these disorders can carry evidence of their mental health, this project aims to exploit Relevance-Based Language Models for early detection. Specifically, taking the CLEF eRisk collections as a starting point, the goal is to build depression vocabularies. These vocabularies identify terms of weight and relevance in people with depressive tendencies, and must undergo phases of evaluation and comparison with other validated lexicons. In addition, we focus on ranking: given texts written by a number of people, ordering those people according to their possible degree of depression. The project was managed with an agile methodology, which made it possible to adapt the project to the results obtained during experimentation. Satisfactory results have been achieved, especially in ranking, and new avenues for experimentation and expansion have been opened.