Full Paper

Synthetic Signal Identification in LLM Datasets

Abstract

This research addresses the quality of training data for large language models (LLMs) using methods from signaling theory applied to the talk page metadata of Wikipedia articles. The significance of the method lies in lowering the cost of information quality assessment in datasets. Natural language processing over the metadata text produced sentiment, reading complexity, and self-reference scores, which serve as computationally derived signals. Results showed that indicators of information quality can be understood through textual computation over the metadata of article talk pages.
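As an illustration of the signal extraction described above, the sketch below computes sentiment, reading complexity, and self-reference scores for a snippet of talk page text. This is a minimal sketch, not the paper's implementation; the use of NLTK's VADER sentiment analyzer, the textstat readability library, and a simple first-person pronoun ratio for self-reference are assumptions made for illustration only.

```python
# Minimal sketch (not the paper's implementation): derive three signal scores
# from Wikipedia talk page text. Assumes nltk and textstat are installed.
import re

import nltk
import textstat
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)  # lexicon required by VADER

FIRST_PERSON = {"i", "me", "my", "mine", "we", "us", "our", "ours"}

def signal_scores(talk_text: str) -> dict:
    """Return sentiment, reading complexity, and self-reference scores."""
    sentiment = SentimentIntensityAnalyzer().polarity_scores(talk_text)["compound"]
    reading_complexity = textstat.flesch_kincaid_grade(talk_text)
    tokens = re.findall(r"[a-z']+", talk_text.lower())
    self_reference = sum(t in FIRST_PERSON for t in tokens) / max(len(tokens), 1)
    return {
        "sentiment": sentiment,                     # -1 (negative) to +1 (positive)
        "reading_complexity": reading_complexity,   # approximate U.S. grade level
        "self_reference": self_reference,           # share of first-person pronouns
    }

if __name__ == "__main__":
    sample = "I think we should revise this section; the sourcing seems weak to me."
    print(signal_scores(sample))
```

Any scoring function with the same shape (text in, numeric signals out) could be substituted; the point is only that the signals are computed directly from talk page text.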

Author information

Jim Hahn
Penn Libraries, United States

Cite this article

Hahn, J. (2024). Synthetic Signal Identification in LLM Datasets. International Conference on Dublin Core and Metadata Applications, 2024. https://doi.org/10.23106/dcmi.952493684

DOI: 10.23106/dcmi.952493684

Metadata and citations of this article are published under the Creative Commons Zero Universal Public Domain Dedication (CC0), allowing unrestricted reuse. Anyone can freely use the metadata from DCPapers articles for any purpose without limitations.
The full text of this article is published under the Creative Commons Attribution 4.0 International License (CC BY 4.0). This license allows use, sharing, adaptation, distribution, and reproduction in any medium or format, provided that appropriate credit is given to the original author(s) and the source is cited.