Abstract
This research assesses the quality of training data for large language models (LLMs) using methods from signaling theory applied to the talk page metadata of Wikipedia articles. The significance of the method is that it lowers the cost of information quality assessment in datasets. Natural language processing over the metadata text produced sentiment, reading complexity, and self-reference scores, which served as the computationally derived signals. Results showed that indicators of information quality can be identified through textual computation over article talk page metadata.
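The three signals the abstract names could be computed in many ways; the sketch below is one minimal, illustrative interpretation, not the paper's actual pipeline. The word lists, the syllable heuristic, and the function name `talk_page_signals` are all assumptions introduced here for illustration; a real pipeline would likely use trained sentiment models and established readability libraries.

```python
import re

# Hypothetical mini-lexicons for illustration only; a production pipeline
# would use a full sentiment model rather than these word lists.
POS_WORDS = {"good", "great", "helpful", "agree", "thanks"}
NEG_WORDS = {"bad", "wrong", "biased", "disagree", "vandalism"}
FIRST_PERSON = {"i", "me", "my", "we", "our", "us"}

def _count_syllables(word):
    # Crude vowel-group heuristic; adequate only for a rough readability score.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def talk_page_signals(text):
    """Return (sentiment, reading_ease, self_reference) for one talk page comment."""
    words = re.findall(r"[a-zA-Z']+", text)
    n = max(1, len(words))
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    lower = [w.lower() for w in words]

    # Sentiment: net positive-minus-negative word rate.
    sentiment = (sum(w in POS_WORDS for w in lower)
                 - sum(w in NEG_WORDS for w in lower)) / n
    # Reading complexity: the standard Flesch reading-ease formula.
    syllables = sum(_count_syllables(w) for w in words)
    reading_ease = 206.835 - 1.015 * (n / sentences) - 84.6 * (syllables / n)
    # Self-reference: rate of first-person pronouns.
    self_reference = sum(w in FIRST_PERSON for w in lower) / n
    return sentiment, reading_ease, self_reference
```

For example, `talk_page_signals("I think this edit is good. We should keep it.")` yields a positive sentiment score and a nonzero self-reference rate.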
Cite this article
Hahn, J. (2024). Synthetic Signal Identification in LLM Datasets. International Conference on Dublin Core and Metadata Applications, 2024. https://doi.org/10.23106/dcmi.952493684
DOI: 10.23106/dcmi.952493684
The metadata and citations of this article are published under the Creative Commons Zero
Universal Public Domain Dedication (CC0), allowing unrestricted reuse. Anyone may freely
use the metadata from DCPapers articles for any purpose without limitation.
The full text of this article is published under the Creative Commons Attribution 4.0
International License (CC BY 4.0). This license allows use, sharing, adaptation,
distribution, and reproduction in any medium or format, provided that appropriate credit
is given to the original author(s) and the source is cited.