 
      We study the problem of generating opinion highlights from large volumes of user reviews—often exceeding thousands per entity—where existing methods either fail to scale or produce generic, one-size-fits-all summaries that overlook users’ personalized information needs. Consumers typically seek context-specific insights tailored to their preferences (e.g., room cleanliness, public transport proximity, fitness facilities, or pet-friendly policies), but current approaches lack the flexibility to generate such query-specific summaries, limiting their usefulness in real-world decision-making.
To address this gap, we introduce OpinioRAG, a scalable, training-free framework that combines RAG-based evidence retrieval with large language models to generate accurate and user-centric summaries from long-form reviews. We further propose novel reference-free verification metrics tailored for sentiment-rich domains, enabling fine-grained and context-sensitive evaluation of factual consistency. To support research in this area, we contribute OpinioBank, the first large-scale dataset pairing expert summaries with thousands of user reviews and manually annotated queries. Through extensive experiments, we identify key challenges, provide actionable insights, and position OpinioRAG as a robust framework for structured, personalized opinion summarization at scale.
OpinioRAG is a scalable, training-free framework that combines the attributability and scalability of retrieval-augmented methods with the coherence and fluency of large language models. Built on the OpinioBank dataset, it generates user-centric opinion highlights from thousands of reviews, structured around specific user queries.
The framework operates in two sequential stages: (1) Retriever — extracts relevant review sentences as evidence for each query, reducing noise while maintaining comprehensive coverage; and (2) Synthesizer — uses LLMs to generate concise, structured, query-specific highlights in a key-point style. This design enables controllability (query-focused outputs), scalability (handling large corpora efficiently), modularity (plug-and-play retrievers and LLMs), and verifiability (fine-grained factual alignment through structured evidence).
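As a concrete illustration, the sketch below wires the two stages together in Python. The embedding model, prompt wording, and the `llm` callable are illustrative assumptions, not OpinioRAG's exact implementation.

# A minimal two-stage sketch (Retriever -> Synthesizer). The embedding
# model, the prompt, and the `llm` callable are assumptions for
# illustration, not OpinioRAG's actual implementation.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")

def retrieve_evidence(query, review_sentences, top_k=10):
    """Stage 1: rank review sentences by semantic similarity to the query."""
    query_emb = encoder.encode(query, convert_to_tensor=True)
    sent_embs = encoder.encode(review_sentences, convert_to_tensor=True)
    hits = util.semantic_search(query_emb, sent_embs, top_k=top_k)[0]
    return [review_sentences[h["corpus_id"]] for h in hits]

def synthesize_highlight(query, evidence, llm):
    """Stage 2: prompt any text-in/text-out LLM for a grounded key point."""
    prompt = (
        f"Query: {query}\n"
        "Evidence from user reviews:\n"
        + "\n".join(f"- {s}" for s in evidence)
        + "\nWrite one concise, query-specific highlight grounded only in the evidence."
    )
    return llm(prompt)

Because both stages are plain functions, the retriever and the LLM can be swapped independently, which is the modularity the design calls out.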
OpinioBank is a large-scale, high-quality dataset designed to support user-centric opinion summarization from extensive long-form reviews, such as those targeted by the OpinioRAG framework. Unlike existing datasets, which focus on short-form or synthetic review–summary pairs, OpinioBank contains entities with over a thousand user reviews each, paired with unbiased expert reviews and manually annotated queries. This makes it the first benchmark of its kind for advancing model development and evaluation on noisy, repetitive, and stylistically diverse inputs. For details, see our COLM 2025 paper.
            Data Sources
            User Reviews: Collected from TripAdvisor, which provides reviews that are, on average, three times longer than those on other platforms, making it ideal for studying long-form summarization with book-length inputs exceeding 100K tokens.
            Expert Reviews: Collected from Oyster, which provides professional hotel reviews written after on-site inspections and multi-source evaluations, ensuring high factual quality.
          
            Data Preparation
            We paired entities across platforms using unique identifiers (e.g., addresses, postal codes) and crawled both user and expert reviews. 
            Using a predefined list of gold query terms (e.g., room, location, ocean views), we manually annotated sentences to ensure query diversity and granularity. 
            Queries without matches were removed to maintain alignment quality. 
            Metadata from both the review text (ratings, helpful votes, dates) and reviewer profiles (number of reviews, cities visited, helpful votes) was integrated to support analyses of credibility and temporal trends. 
            A summary of the dataset statistics is provided below.

            [Table: OpinioBank dataset statistics]

The full dataset and accompanying metadata are available through the Hugging Face link above.
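For convenience, the dataset can be loaded with the Hugging Face datasets library. The placeholder ID and the field names in the comments below are assumptions; consult the dataset card at the link above for the actual identifier and schema.

# Loading sketch using the `datasets` library; replace the placeholder
# with the dataset ID from the Hugging Face link above. The field names
# in the comments are assumptions, not a guaranteed schema.
from datasets import load_dataset

ds = load_dataset("<dataset-id-from-the-link-above>")  # placeholder ID
entity = ds["train"][0]
# entity["reviews"]        -> user reviews plus rating/vote/date metadata
# entity["expert_review"]  -> the paired expert review
# entity["queries"]        -> manually annotated query terms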
            We introduce a novel, reference-free, and modular verification module that systematically evaluates whether LLM-generated highlights are faithfully grounded in the retrieved user review evidence. This is achieved by decomposing both the retrieved sentences and the generated highlights into Aspect–Opinion–Sentiment (AOS) triplets of the form aspect:opinion:sentiment. This structured decomposition enables fine-grained, interpretable alignment between evidence and generated text.
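A minimal sketch of such a decomposition step is shown below, using an LLM prompted to emit one triplet per line. The prompt, the parsing, and the `llm` callable are illustrative assumptions rather than the module's actual extractor.

# Illustrative AOS decomposition via an LLM prompt; the prompt format,
# the parsing, and the `llm` callable are assumptions, not the module's
# actual extractor.
from typing import NamedTuple

class AOS(NamedTuple):
    aspect: str     # e.g., "room"
    opinion: str    # e.g., "spotless"
    sentiment: str  # "positive" | "negative" | "neutral"

def extract_aos(text, llm):
    """Decompose text into aspect:opinion:sentiment triplets."""
    prompt = (
        "Extract aspect:opinion:sentiment triplets, one per line, "
        f"from the following text:\n{text}"
    )
    triplets = []
    for line in llm(prompt).splitlines():
        parts = [p.strip() for p in line.split(":")]
        if len(parts) == 3:
            triplets.append(AOS(*parts))
    return triplets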
          
The verification relies on three complementary metrics designed for sentiment-rich domains, each capturing a different facet of factual consistency; their definitions are given in the paper.
Together, these metrics enable fine-grained, interpretable, and reference-free evaluation, making the verification module easily extensible to stronger AOS extractors in the future.
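To make the idea concrete, here is one simple triplet-alignment score built on the AOS type from the sketch above: the fraction of highlight triplets matched by some evidence triplet on aspect and sentiment. It is a stand-in for illustration only, not one of the paper's three metrics.

# Illustrative triplet-alignment score, not one of the paper's metrics:
# the fraction of highlight triplets supported by some evidence triplet
# that agrees on both aspect and sentiment.
def alignment_score(highlight_triplets, evidence_triplets):
    if not highlight_triplets:
        return 0.0
    supported = sum(
        any(h.aspect == e.aspect and h.sentiment == e.sentiment
            for e in evidence_triplets)
        for h in highlight_triplets
    )
    return supported / len(highlight_triplets)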
@inproceedings{nayeem2025opiniorag,
  title={Opinio{RAG}: Towards Generating User-Centric Opinion Highlights from Large-scale Online Reviews},
  author={Mir Tafseer Nayeem and Davood Rafiei},
  booktitle={Second Conference on Language Modeling},
  year={2025},
  url={https://openreview.net/forum?id=R94bCTckhV}
}