 
      We study the problem of generating opinion highlights from large volumes of user reviews—often exceeding thousands per entity—where existing methods either fail to scale or produce generic, one-size-fits-all summaries that overlook users’ personalized information needs. Consumers typically seek context-specific insights tailored to their preferences (e.g., room cleanliness, public transport proximity, fitness facilities, or pet-friendly policies), but current approaches lack the flexibility to generate such query-specific summaries, limiting their usefulness in real-world decision-making.
To address this gap, we introduce OpinioRAG, a scalable, training-free framework that combines RAG-based evidence retrieval with large language models to generate accurate and user-centric summaries from long-form reviews. We further propose novel reference-free verification metrics tailored for sentiment-rich domains, enabling fine-grained and context-sensitive evaluation of factual consistency. To support research in this area, we contribute OpinioBank, the first large-scale dataset pairing expert summaries with thousands of user reviews and manually annotated queries. Through extensive experiments, we identify key challenges, provide actionable insights, and position OpinioRAG as a robust framework for structured, personalized opinion summarization at scale.
OpinioRAG is a scalable, training-free framework that combines the attributability and scalability of retrieval-augmented methods with the coherence and fluency of large language models. Built on the OpinioBank dataset, it generates user-centric opinion highlights from thousands of reviews, structured around specific user queries.
The framework operates in two sequential stages: (1) Retriever — extracts relevant review sentences as evidence for each query, reducing noise while maintaining comprehensive coverage; and (2) Synthesizer — uses LLMs to generate concise, structured, query-specific highlights in a key-point style. This design enables controllability (query-focused outputs), scalability (handling large corpora efficiently), modularity (plug-and-play retrievers and LLMs), and verifiability (fine-grained factual alignment through structured evidence).
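As a concrete illustration, the sketch below wires the two stages together in Python. The embedding model, prompt wording, and the `llm` callable are illustrative assumptions, not OpinioRAG's exact implementation.

# A minimal two-stage sketch (Retriever -> Synthesizer). The embedding
# model, the prompt, and the `llm` callable are assumptions for
# illustration, not OpinioRAG's actual implementation.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")

def retrieve_evidence(query, review_sentences, top_k=10):
    """Stage 1: rank review sentences by semantic similarity to the query."""
    query_emb = encoder.encode(query, convert_to_tensor=True)
    sent_embs = encoder.encode(review_sentences, convert_to_tensor=True)
    hits = util.semantic_search(query_emb, sent_embs, top_k=top_k)[0]
    return [review_sentences[h["corpus_id"]] for h in hits]

def synthesize_highlight(query, evidence, llm):
    """Stage 2: prompt any text-in/text-out LLM for a grounded key point."""
    prompt = (
        f"Query: {query}\n"
        "Evidence from user reviews:\n"
        + "\n".join(f"- {s}" for s in evidence)
        + "\nWrite one concise, query-specific highlight grounded only in the evidence."
    )
    return llm(prompt)

Because both stages are plain functions, the retriever and the LLM can be swapped independently, which is the modularity the design calls out.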
OpinioBank is a large-scale, high-quality dataset designed to support user-centric opinion summarization from extensive long-form reviews, such as those targeted by the OpinioRAG framework. Unlike existing datasets, which focus on short-form or synthetic review–summary pairs, OpinioBank contains entities with over a thousand user reviews each, paired with unbiased expert reviews and manually annotated queries. This makes it the first benchmark of its kind for advancing model development and evaluation on noisy, repetitive, and stylistically diverse inputs. For details, see our COLM 2025 paper.
            Data Sources
            User Reviews: Collected from TripAdvisor, which provides reviews that are, on average, three times longer than those on other platforms, making it ideal for studying long-form summarization with book-length inputs exceeding 100K tokens.
            Expert Reviews: Collected from Oyster, which provides professional hotel reviews written after on-site inspections and multi-source evaluations, ensuring high factual quality.
          
            Data Preparation
            We paired entities across platforms using unique identifiers (e.g., addresses, postal codes) and crawled both user and expert reviews. 
            Using a predefined list of gold query terms (e.g., room, location, ocean views), we manually annotated sentences to ensure query diversity and granularity. 
            Queries without matches were removed to maintain alignment quality. 
            Metadata from both the review text (ratings, helpful votes, dates) and reviewer profiles (number of reviews, cities visited, helpful votes) was integrated to support analyses of credibility and temporal trends. 
            A summary of the dataset statistics is provided below.

            [Table: OpinioBank dataset statistics]

The full dataset and accompanying metadata are available through the Hugging Face link above.
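For convenience, the dataset can be loaded with the Hugging Face datasets library. The placeholder ID and the field names in the comments below are assumptions; consult the dataset card at the link above for the actual identifier and schema.

# Loading sketch using the `datasets` library; replace the placeholder
# with the dataset ID from the Hugging Face link above. The field names
# in the comments are assumptions, not a guaranteed schema.
from datasets import load_dataset

ds = load_dataset("<dataset-id-from-the-link-above>")  # placeholder ID
entity = ds["train"][0]
# entity["reviews"]        -> user reviews plus rating/vote/date metadata
# entity["expert_review"]  -> the paired expert review
# entity["queries"]        -> manually annotated query terms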
            We introduce a novel, reference-free, and modular verification module that systematically evaluates whether LLM-generated highlights are faithfully grounded in the retrieved user review evidence. This is achieved by decomposing both the retrieved sentences and the generated highlights into Aspect–Opinion–Sentiment (AOS) triplets of the form aspect:opinion:sentiment. This structured decomposition enables fine-grained, interpretable alignment between evidence and generated text.
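A minimal sketch of such a decomposition step is shown below, using an LLM prompted to emit one triplet per line. The prompt, the parsing, and the `llm` callable are illustrative assumptions rather than the module's actual extractor.

# Illustrative AOS decomposition via an LLM prompt; the prompt format,
# the parsing, and the `llm` callable are assumptions, not the module's
# actual extractor.
from typing import NamedTuple

class AOS(NamedTuple):
    aspect: str     # e.g., "room"
    opinion: str    # e.g., "spotless"
    sentiment: str  # "positive" | "negative" | "neutral"

def extract_aos(text, llm):
    """Decompose text into aspect:opinion:sentiment triplets."""
    prompt = (
        "Extract aspect:opinion:sentiment triplets, one per line, "
        f"from the following text:\n{text}"
    )
    triplets = []
    for line in llm(prompt).splitlines():
        parts = [p.strip() for p in line.split(":")]
        if len(parts) == 3:
            triplets.append(AOS(*parts))
    return triplets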
          
The verification relies on three complementary metrics designed for sentiment-rich domains, each capturing a different facet of factual consistency; their definitions are given in the paper.
Together, these metrics enable fine-grained, interpretable, and reference-free evaluation, making the verification module easily extensible to stronger AOS extractors in the future.
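To make the idea concrete, here is one simple triplet-alignment score built on the AOS type from the sketch above: the fraction of highlight triplets matched by some evidence triplet on aspect and sentiment. It is a stand-in for illustration only, not one of the paper's three metrics.

# Illustrative triplet-alignment score, not one of the paper's metrics:
# the fraction of highlight triplets supported by some evidence triplet
# that agrees on both aspect and sentiment.
def alignment_score(highlight_triplets, evidence_triplets):
    if not highlight_triplets:
        return 0.0
    supported = sum(
        any(h.aspect == e.aspect and h.sentiment == e.sentiment
            for e in evidence_triplets)
        for h in highlight_triplets
    )
    return supported / len(highlight_triplets)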
@inproceedings{nayeem2025opiniorag,
  title={Opinio{RAG}: Towards Generating User-Centric Opinion Highlights from Large-scale Online Reviews},
  author={Mir Tafseer Nayeem and Davood Rafiei},
  booktitle={Second Conference on Language Modeling},
  year={2025},
  url={https://openreview.net/forum?id=R94bCTckhV}
}