Full Paper

Synthetic Signal Identification in LLM Datasets

Abstract

This research addresses the quality of training data for large language models (LLMs) using methods from signaling theory applied to the talk page metadata of Wikipedia articles. The significance of the method lies in lowering the cost of information quality assessment in datasets. Natural language processing over the metadata text produced sentiment, reading complexity, and self-reference scores, which serve as computationally derived signals. Results showed that indicators of information quality can be understood through textual computation over the metadata of article talk pages.
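As an illustration of the signal extraction described above, the sketch below computes sentiment, reading complexity, and self-reference scores for a snippet of talk page text. This is a minimal sketch, not the paper's implementation; the use of NLTK's VADER sentiment analyzer, the textstat readability library, and a simple first-person pronoun ratio for self-reference are assumptions made for illustration only.

```python
# Minimal sketch (not the paper's implementation): derive three signal scores
# from Wikipedia talk page text. Assumes nltk and textstat are installed.
import re

import nltk
import textstat
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)  # lexicon required by VADER

FIRST_PERSON = {"i", "me", "my", "mine", "we", "us", "our", "ours"}

def signal_scores(talk_text: str) -> dict:
    """Return sentiment, reading complexity, and self-reference scores."""
    sentiment = SentimentIntensityAnalyzer().polarity_scores(talk_text)["compound"]
    reading_complexity = textstat.flesch_kincaid_grade(talk_text)
    tokens = re.findall(r"[a-z']+", talk_text.lower())
    self_reference = sum(t in FIRST_PERSON for t in tokens) / max(len(tokens), 1)
    return {
        "sentiment": sentiment,                     # -1 (negative) to +1 (positive)
        "reading_complexity": reading_complexity,   # approximate U.S. grade level
        "self_reference": self_reference,           # share of first-person pronouns
    }

if __name__ == "__main__":
    sample = "I think we should revise this section; the sourcing seems weak to me."
    print(signal_scores(sample))
```

Any scoring function with the same shape (text in, numeric signals out) could be substituted; the point is only that the signals are computed directly from talk page text.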

Author information

Jim Hahn
Penn Libraries, United States

Cite this article

Hahn, J. (2024). Synthetic Signal Identification in LLM Datasets. International Conference on Dublin Core and Metadata Applications, 2024. https://doi.org/10.23106/dcmi.952493684

DOI: 10.23106/dcmi.952493684

Metadata and citations of this article are published under the Creative Commons Zero Universal Public Domain Dedication (CC0), allowing unrestricted reuse. Anyone can freely use the metadata from DCPapers articles for any purpose without limitations.
The full text of this article is published under the Creative Commons Attribution 4.0 International License (CC BY 4.0). This license allows use, sharing, adaptation, distribution, and reproduction in any medium or format, provided that appropriate credit is given to the original author(s) and the source is cited.