NLPapers
Why Does ChatGPT Fall Short in Providing Truthful Answers?
Hallucinated but Factual! Inspecting the Factuality of Hallucinations in Abstractive Summarization
AgentBench: Evaluating LLMs as Agents
Survey of Hallucination in Natural Language Generation
A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions
The Troubling Emergence of Hallucination in Large Language Models - An Extensive Definition, Quantification, and Prescriptive Remediations
G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment
Evaluating the Rationale Understanding of Critical Reasoning in Logical Reading Comprehension
WikiQA: A Challenge Dataset for Open-Domain Question Answering
Self-Consistency Improves Chain of Thought Reasoning in Language Models
Robust MT Evaluation with Sentence-level Multilingual Augmentation
How NOT To Evaluate Your Dialogue System: An Empirical Study of Unsupervised Evaluation Metrics for Dialogue Response Generation
Direct Preference Optimization: Your Language Model is Secretly a Reward Model
Fakeddit: A New Multimodal Benchmark Dataset for Fine-grained Fake News Detection
Evaluating LLMs at Detecting Errors in LLM Responses
GAIA: a benchmark for General AI Assistants
Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks
Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization
Deep Learning in Spiking Neural Networks