human-labeled data
Data annotated by humans, and by extension an approach to AI development that relies heavily on human input, such as manually labeling examples or hand-coding rules. 'The Bitter Lesson' argues this approach is ultimately inferior to methods that leverage scalable computation.
Created: 7/12/2025, 4:40:58 AM
Last updated: 7/12/2025, 5:01:47 AM
Research retrieved: 7/12/2025, 5:01:47 AM
Summary
Human-labeled data consists of raw data samples augmented with informative tags, typically by human annotators. It is fundamental to training supervised machine learning models, as the quality of these labels directly impacts model performance. However, human-labeled data is significantly more expensive and time-consuming to acquire than raw data, and it can suffer from quality issues and biases. In the current AI landscape, particularly with the rise of large language models (LLMs), there is debate about its long-term value: some advocate for scalable computation and synthetic data instead, in line with "The Bitter Lesson". Even so, human-labeled data remains crucial for fine-tuning and aligning advanced models such as OpenAI's GPT series and Meta's Llama 3.
Referenced in 1 Document
Research Data
Extracted Attributes
Cost
Significantly more expensive to obtain than raw unlabeled data.
Purpose
Crucial for training supervised machine learning models.
Challenges
Expensive, time-consuming, quality issues, and potential for bias.
Definition
Samples augmented with informative tags, often created by humans assessing unlabeled data.
Role in LLMs
Used for fine-tuning and building alignment data, despite pretraining often relying on unstructured data.
Historical Role
Played a crucial role in the 'AI 1.0' era for machine learning models.
Impact on Models
Quality directly influences the performance of supervised machine learning models.
Labeling Methods
Manual labeling, crowdsourcing, Human-in-the-Loop (HITL) labeling, and programmatic labeling (weak supervision); a sketch of the programmatic approach appears after this list.
Investment Thesis Debate
Some argue for a 'short half-life' on its value, suggesting that scalable computation and synthetic data may be more advantageous than systems heavily reliant on human input.
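The programmatic labeling (weak supervision) approach listed above can be made concrete with a small sketch: several noisy heuristic "labeling functions" vote on each example, and their votes are combined into a single weak label. The task, functions, and majority-vote combiner here are all hypothetical; production systems such as Snorkel learn per-source accuracies instead of simple voting.

```python
# Minimal sketch of programmatic labeling (weak supervision).
# Each labeling function encodes a noisy heuristic; ABSTAIN means "no opinion".
# Hypothetical task: classifying messages as SPAM (1) or HAM (0).
from collections import Counter

ABSTAIN, HAM, SPAM = -1, 0, 1

def lf_contains_link(text: str) -> int:
    return SPAM if "http://" in text or "https://" in text else ABSTAIN

def lf_mentions_invoice(text: str) -> int:
    return HAM if "invoice" in text.lower() else ABSTAIN

def lf_all_caps(text: str) -> int:
    return SPAM if text.isupper() else ABSTAIN

LABELING_FUNCTIONS = [lf_contains_link, lf_mentions_invoice, lf_all_caps]

def weak_label(text: str) -> int:
    """Combine noisy heuristics by majority vote; abstain on ties or no votes."""
    votes = [v for v in (lf(text) for lf in LABELING_FUNCTIONS) if v != ABSTAIN]
    if not votes:
        return ABSTAIN
    top = Counter(votes).most_common(2)
    if len(top) > 1 and top[0][1] == top[1][1]:
        return ABSTAIN  # tie between labels
    return top[0][0]

print(weak_label("CLICK NOW https://example.com"))  # -> 1 (SPAM)
```

Heuristics like these can label data orders of magnitude faster than humans, at the cost of noise that the combining step must absorb.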
Timeline
Historical
- Human-labeled data played a crucial role in the earlier era of AI, known as 'AI 1.0', where machine learning models heavily relied on such data. (Source: Web Search Results (IJCAI))
2018
- A study by Joy Buolamwini and Timnit Gebru demonstrated bias in facial analysis datasets (IJB-A and Adience) used to train facial recognition algorithms, highlighting issues with human-labeled data quality. (Source: Wikipedia)
2023
- The paper 'The Importance of Human-Labeled Data in the Era of LLMs' was published at IJCAI, discussing its continued relevance and challenges. (Source: Web Search Results (IJCAI, arXiv))
Ongoing
- Companies like OpenAI and Meta reported using human labelers to collect data for fine-tuning their GPT models and Llama 3, respectively, despite the rise of synthetic data. (Source: Web Search Results (Snorkel AI))
Wikipedia
Labeled data
Labeled data is a group of samples that have been tagged with one or more labels. Labeling typically takes a set of unlabeled data and augments each piece of it with informative tags. For example, a data label might indicate whether a photo contains a horse or a cow, which words were uttered in an audio recording, what type of action is being performed in a video, what the topic of a news article is, what the overall sentiment of a tweet is, or whether a dot in an X-ray is a tumor. Labels can be obtained by having humans make judgments about a given piece of unlabeled data. Labeled data is significantly more expensive to obtain than the raw unlabeled data. The quality of labeled data directly influences the performance of supervised machine learning models in operation, as these models learn from the provided labels.
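As a concrete illustration of the definition above, labeled data is just raw samples paired with informative tags, which supervised learning consumes as (input, label) pairs. The file names and labels in this sketch are invented:

```python
# A labeled dataset augments each raw sample with an informative tag,
# e.g. whether a photo contains a horse or a cow (per the excerpt above).
labeled_data = [
    {"sample": "photo_001.jpg", "label": "horse"},
    {"sample": "photo_002.jpg", "label": "cow"},
    {"sample": "photo_003.jpg", "label": "horse"},
]

# Supervised learning consumes these as (input, label) pairs:
X = [row["sample"] for row in labeled_data]  # inputs
y = [row["label"] for row in labeled_data]   # human-provided labels
```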
Web Search Results
- [PDF] The Importance of Human-Labeled Data in the Era of LLMs - IJCAI
1 Introduction Human-labeled data played a crucial role in the earlier era of AI, known as "AI 1.0," where machine learning models heavily relied on such data [Deng et al., 2009]. The celebrated supervised learning framework [Vapnik, 1999; LeCun et al., 2015] was designed and developed exactly for this paradigm. However, with the emergence of the new era of "GPT" models, the pretraining of large language models (LLMs) primarily involves unstructured and unsupervised Internet data. [...] For human labeling, we have a well-established "insecurity" of human-labeled data, and a number of "safety" protocols have been established to make sure the human-generated data meets certain performance requirements. These efforts include building incentive mechanisms [Liu and Chen, 2016; Witkowski et al., 2013], human spot-checking/auditing mechanisms [Shah and Zhou, 2015] and automatic error analysis in human labels [Zhu et al., 2022a; Zhu et al., 2021b]. More sophisticated systems can be [...] 8 Challenges and Opportunities Quality control of human-labeled data. Human labels continue to face quality issues, and in Section 3 we have highlighted that this issue persists in building alignment data for LLMs. Careless annotations will not only lower quality but also create a false sense of security [Zhu et al., 2023].
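The "safety protocols" the excerpt mentions (spot-checking, auditing, error analysis) usually start with measuring how well annotators agree with each other. A minimal sketch of Cohen's kappa, a standard inter-annotator agreement metric for two annotators (a generic metric, not taken from the cited papers):

```python
# Cohen's kappa: agreement between two annotators, corrected for chance.
# kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and
# p_e is the agreement expected if both annotators labeled at random
# according to their own label frequencies.
from collections import Counter

def cohens_kappa(labels_a: list, labels_b: list) -> float:
    assert labels_a and len(labels_a) == len(labels_b)
    n = len(labels_a)
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (p_o - p_e) / (1 - p_e)

# Two annotators labeling the same six tweets for sentiment:
a = ["pos", "pos", "neg", "neg", "pos", "neg"]
b = ["pos", "neg", "neg", "neg", "pos", "pos"]
print(round(cohens_kappa(a, b), 3))  # -> 0.333 (only fair agreement)
```

Low agreement is a signal to tighten annotation guidelines or re-audit the labels before they are used for training or alignment data.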
- Data labeling: a practical guide (2024) - Snorkel AI
Traditional data labeling processes are often expensive and time-consuming, requiring significant manual effort. However, using machine learning models to label data automatically can enhance efficiency by training on a human-labeled subset and progressively taking on more labeling tasks as its confidence increases. [...] ML researchers and practitioners at AI startups and large enterprises have relied on curating additional labeled data to achieve better performance on specific tasks. OpenAI has reported using human labelers to collect the data for fine-tuning its GPT models and reportedly hired hundreds or thousands of additional contractors since ChatGPT was released. Meta reported using 10 million human-annotated examples to train Llama 3. [...] Programmatic labeling (also known as “weak supervision”) stands apart from other data labeling approaches because it can incorporate all of them. This approach combines sources of supervision—ranging from human-supplied labels to noisy high-level heuristics to LLMs—to create large probabilistic data sets orders of magnitude faster than manual labeling.
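The confidence-based auto-labeling loop described in the excerpt (train on a human-labeled subset, then let the model take over labeling where it is confident) is commonly called self-training. A minimal sketch under assumed choices: scikit-learn's LogisticRegression, synthetic data, and a 0.9 confidence threshold, none of which come from the Snorkel article:

```python
# Self-training sketch: fit on a small human-labeled seed set, then adopt
# the model's own high-confidence predictions as labels for unlabeled data.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_seed = rng.normal(size=(40, 5))        # human-labeled seed set (synthetic)
y_seed = (X_seed[:, 0] > 0).astype(int)  # stand-in "human" labels
X_pool = rng.normal(size=(200, 5))       # unlabeled pool

X_train, y_train = X_seed.copy(), y_seed.copy()
for _ in range(3):  # a few self-training rounds
    model = LogisticRegression().fit(X_train, y_train)
    proba = model.predict_proba(X_pool)
    confident = proba.max(axis=1) >= 0.9  # assumed confidence threshold
    if not confident.any():
        break
    # Promote confident predictions to training labels; the rest stay unlabeled.
    X_train = np.vstack([X_train, X_pool[confident]])
    y_train = np.concatenate([y_train, proba.argmax(axis=1)[confident]])
    X_pool = X_pool[~confident]

print(f"auto-labeled {len(y_train) - len(y_seed)} of 200 pool examples")
```

Each round enlarges the training set cheaply, but mislabeled confident predictions can compound, which is why human spot-checks remain part of the loop.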
- The Importance of Human-Labeled Data in the Era of LLMs - arXiv
arXiv:2306.14910 [cs.CL]. Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG). Journal reference: IJCAI 2023 Early Career Spotlight.
- Labeled data - Wikipedia
Labels can be obtained by having humans make judgments about a given piece of unlabeled data. Labeled data is significantly more expensive to obtain than the raw unlabeled data. The quality of labeled data directly influences the performance of supervised machine learning models in operation, as these models learn from the provided labels. ## Crowdsourced labeled data [...] Samples can be misclassified if the labeled data available for training has not been representative of the population. In 2018, a study by Joy Buolamwini and Timnit Gebru demonstrated that two facial analysis datasets that have been used to train facial recognition algorithms, IJB-A and Adience, are composed of 79.6% and 86.2% lighter-skinned humans respectively.
- Why data labeling is crucial for AI model accuracy - Telnyx
Approaches to data labeling: There are several data labeling approaches, each with advantages and challenges. Manual data labeling involves human labelers examining and assigning labels to each data point; it ensures high-quality and precise labels but is time-consuming and expensive. [...] Automated data labeling uses machine learning models to label data automatically; it is fast and cost-effective but may struggle with unseen data and can propagate errors. Human-in-the-Loop (HITL) labeling combines automated labeling with human oversight, leveraging the strengths of both humans and machines to improve accuracy and efficiency. Best practices for data labeling: [...] To overcome these challenges, various strategies can be employed: 1. Combining human and machine efforts: HITL labeling can balance the strengths of humans and machines. 2. Automated labeling tools: Leveraging tools like Labelbox and SuperAnnotate can streamline the labeling process and reduce costs. 3. Continuous training and feedback: Updating instructions and golden datasets as edge cases are encountered can improve labeling quality.
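The HITL pattern described above can be sketched as a simple dispatcher: confident model predictions become labels automatically, and everything else is queued for human annotators. The threshold and the toy predictor here are hypothetical:

```python
# Human-in-the-Loop routing: auto-accept confident model labels,
# escalate low-confidence items to human annotators.
from typing import Callable

def hitl_label(items, predict: Callable, threshold: float = 0.85):
    """predict(item) -> (label, confidence); returns accepted labels + a review queue."""
    accepted, review_queue = {}, []
    for item in items:
        label, confidence = predict(item)
        if confidence >= threshold:
            accepted[item] = label     # machine-labeled
        else:
            review_queue.append(item)  # routed to a human annotator
    return accepted, review_queue

# Hypothetical predictor: a length-based toy rule with a made-up confidence.
def toy_predict(text: str):
    return ("long" if len(text) > 10 else "short",
            0.95 if len(text) != 10 else 0.5)

auto, queue = hitl_label(["hi", "a much longer text", "exactly10!"], toy_predict)
print(auto)   # confident items, labeled by the "model"
print(queue)  # ['exactly10!'] goes to human review
```

Routing only the uncertain fraction to humans is what lets HITL pipelines keep manual effort, and therefore cost, roughly proportional to model uncertainty rather than to dataset size.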