DARE: Aligning LLM Agents with the R Statistical Ecosystem via Distribution-Aware Retrieval

Abstract

Large Language Model (LLM)-based agents have shown strong capabilities in automating data science workflows. However, many rigorous statistical methods implemented in the R ecosystem remain underutilized due to limitations in model knowledge and tool retrieval. Existing Retrieval-Augmented Generation (RAG) approaches rely primarily on function-level semantic similarity and often ignore the data distribution characteristics that determine whether a statistical method is applicable, leading to inaccurate retrieval. To address this limitation, we propose DARE (Distribution-Aware Retrieval Embedding), a lightweight, plug-and-play retrieval model that incorporates data distribution into function representations for R package retrieval. Our main contributions are threefold: (i) we construct RPKB, a curated knowledge base of 8,191 high-quality R packages whose functions span diverse statistical domains; (ii) we introduce DARE, an embedding model that incorporates data distribution for accurate R package retrieval; and (iii) we design RCodingAgent, an R-oriented LLM agent, together with a suite of R-based statistical analysis tasks for systematically evaluating LLM agents under realistic analytical scenarios. Experimental results show that DARE achieves an NDCG@10 of 93.47%, outperforming open-source state-of-the-art embedding models by up to 17% on R package retrieval while using only 23M parameters. Furthermore, integrating DARE into RCodingAgent significantly improves downstream statistical analysis performance across multiple analytical tasks. This work bridges modern LLM-based automation with established statistical computing ecosystems.
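The abstract's core idea, fusing a function's semantic embedding with features describing the query data's distribution, can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the toy hash-based text encoder, the particular distribution features (skewness, excess kurtosis, log-scale), and fusion by simple concatenation are not the paper's actual architecture.

```python
import hashlib
import numpy as np

def text_embedding(text: str, dim: int = 32) -> np.ndarray:
    """Toy deterministic text encoder (stand-in for a trained model)."""
    seed = int(hashlib.md5(text.encode()).hexdigest()[:8], 16)
    v = np.random.default_rng(seed).standard_normal(dim)
    return v / np.linalg.norm(v)

def distribution_features(x: np.ndarray) -> np.ndarray:
    """Summary statistics of the query data; the exact feature set
    (skewness, excess kurtosis, log-scale) is an illustrative choice."""
    mean, std = x.mean(), x.std()
    z = (x - mean) / (std + 1e-9)
    skew = np.mean(z ** 3)
    kurt = np.mean(z ** 4) - 3.0
    return np.array([skew, kurt, np.log1p(std)])

def distribution_aware_vector(query: str, data: np.ndarray) -> np.ndarray:
    """Fuse semantic intent with distribution features by concatenation."""
    return np.concatenate([text_embedding(query), distribution_features(data)])

def rank_functions(query_vec: np.ndarray, catalog: list) -> list:
    """Rank (name, vector) catalog entries by cosine similarity."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))
    return sorted(catalog, key=lambda nv: -cos(query_vec, nv[1]))
```

In this sketch, each R function in the index would be stored with a matching concatenated vector, so that retrieval scores reflect both the analyst's intent and whether the function suits the data's shape.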

Figure 1: DARE significantly improves retrieval accuracy compared with general-purpose embedding models by incorporating distribution-aware representations.

Figure 2: Overview of the DARE framework. DARE jointly models semantic intent and data distribution constraints to enable reliable statistical function retrieval.

Figure 3: Pipeline of Constructing Evaluation Tasks.

Figure 4: An example of integrating DARE into an LLM agent.

Experimental Results

Model                         Params   NDCG@10   MRR@10   Recall@10   Recall@1
Snowflake/arctic-embed-l      335M     0.7932    0.7510   0.9235      0.6549
intfloat/e5-large-v2          335M     0.7513    0.7086   0.8838      0.6152
jina-embeddings-v2-base-en    137M     0.7429    0.6965   0.8873      0.5969
BAAI/bge-m3                   568M     0.7308    0.6843   0.8758      0.5847
mxbai-embed-large-v1          335M     0.7068    0.6565   0.8639      0.5508
UAE-Large-V1                  335M     0.7066    0.6556   0.8658      0.5479
gte-large-en-v1.5             435M     0.6639    0.6122   0.8257      0.5040
all-mpnet-base-v2             110M     0.6606    0.6057   0.8330      0.4937
Base Model (MiniLM)           23M      0.6127    0.5553   0.7936      0.4412
DARE (Ours)                   23M      0.9347    0.9176   0.9863      0.8739
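For reference, the ranking metrics reported above can be computed as follows. This is a standard textbook implementation over binary relevance lists, not the paper's evaluation code.

```python
import math

def ndcg_at_k(rels, k=10):
    """NDCG@k over a ranked binary relevance list (1 = relevant)."""
    dcg = sum(r / math.log2(i + 2) for i, r in enumerate(rels[:k]))
    ideal = sorted(rels, reverse=True)
    idcg = sum(r / math.log2(i + 2) for i, r in enumerate(ideal[:k]))
    return dcg / idcg if idcg > 0 else 0.0

def mrr_at_k(rels, k=10):
    """Reciprocal rank of the first relevant hit within the top k."""
    for i, r in enumerate(rels[:k]):
        if r:
            return 1.0 / (i + 1)
    return 0.0

def recall_at_k(rels, k=10):
    """Fraction of all relevant items that appear in the top k."""
    total = sum(rels)
    return sum(rels[:k]) / total if total else 0.0
```

With a single gold function per query, Recall@1 reduces to top-1 accuracy, which is why it is the strictest column in the table.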

Figure 5: Efficiency comparison measured by queries per second (QPS). Despite its small parameter count, DARE achieves both superior accuracy and fast retrieval.

Downstream Agent Performance

Model              RCodingAgent (w/o DARE)   RCodingAgent (w/ DARE)   Absolute gain
claude-haiku-4.5   6.25%                     56.25%                   +50.00%
deepseek-v3.2      18.75%                    56.25%                   +37.50%
gpt-5.2            25.00%                    62.50%                   +37.50%
grok-4.1-fast      18.75%                    75.00%                   +56.25%
mimo-v2-flash      12.50%                    62.50%                   +50.00%
minimax-m2.1       12.50%                    68.75%                   +56.25%
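The integration pattern behind these gains (the agent retrieves candidate R functions conditioned on both the task and a summary of the data before generating code) can be sketched as below. The `build_agent_prompt` helper, the prompt layout, and the assumed `retriever.search(query, k)` interface are all hypothetical illustrations, not RCodingAgent's actual implementation.

```python
def build_agent_prompt(task: str, data_summary: str, retriever, k: int = 3) -> str:
    """Hypothetical DARE-in-the-loop step: query the retriever with the
    task plus a data-distribution summary, then splice the top-k hits
    into the coding agent's prompt before R code generation."""
    query = f"{task}\nData profile: {data_summary}"
    hits = retriever.search(query, k=k)
    tool_block = "\n".join(f"- {h}" for h in hits)
    return (
        f"Task: {task}\n"
        f"Data profile: {data_summary}\n"
        f"Candidate R functions:\n{tool_block}\n"
        "Write R code using the most applicable function."
    )
```

Because the retriever sees the distribution summary, a right-skewed sample can surface, say, a robust or nonparametric routine rather than a default normal-theory one, which is the behavior the table above rewards.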