Developing a generative AI tool for dialogue data collection in online environments

This study investigates the development and implementation of a generative AI tool for collecting online audio data to assess L2 English fluency, intelligibility, and comprehensibility among frequent online gamers. A pilot study revealed participants' foreign language speaking anxiety and preference for computer-mediated communication over human interaction. In response, we created a conversational agent that simulates dialogue and gathers audio data for research purposes. The program is accessible via a webpage, utilizing the browser API to collect audio data. The participant's audio is transcribed by whisper.cpp, and the resulting text is input into a libre and self-hosted Large Language Model running on llama.cpp. The LLM's output, acting as the dialogue partner, is fed into text-to-speech software and sent back to the browser, creating an interactive dialogue. The program is designed using libre software to ensure independence from commercial interests, compliance with European privacy data laws, and complete control of data throughout the process. Preliminary tests indicate such programs could enhance participants' comfort and willingness to communicate while reducing speaking anxiety. It could also replace the variable biases of human conversation partners with more stable, observable, and reproducible biases. As an application of a novel technology, the use of libre, self-hosted generative AI in collecting online dialogue data requires future research, particularly in evaluating the reliability compared to actual dialogue situations between humans. However, this study suggests that generative AI can be a valuable tool for collecting audio data from participants in online and particularly closed environments. 

» Titles and abstracts