DARE: Aligning LLM Agents with the R Statistical Ecosystem via Distribution-Aware Retrieval

Abstract

Large Language Model (LLM)-based agents have shown strong capabilities in automating data science workflows. However, many rigorous statistical methods implemented in the R ecosystem remain underutilized due to limitations in model knowledge and tool retrieval. Existing Retrieval-Augmented Generation (RAG) approaches rely primarily on function-level semantic similarity and often ignore the data distribution characteristics that determine whether a statistical method is applicable, leading to inaccurate retrieval. To address this limitation, we propose DARE (Distribution-Aware Retrieval Embedding), a lightweight, plug-and-play retrieval model that incorporates data distribution into function representations for R package retrieval. Our main contributions are threefold: (i) we construct RPKB, a curated knowledge base of 8,191 high-quality R packages whose functions span diverse statistical domains; (ii) we introduce DARE, an embedding model that incorporates data distribution for accurate R package retrieval; and (iii) we design RCodingAgent, an R-oriented LLM agent, together with a suite of R-based statistical analysis tasks for systematically evaluating LLM agents under realistic analytical scenarios. Experimental results show that DARE achieves an NDCG@10 of 93.47%, outperforming open-source state-of-the-art embedding models by up to 17% on R package retrieval while using only 23M parameters. Furthermore, integrating DARE into RCodingAgent significantly improves downstream statistical analysis performance across multiple analytical tasks. This work bridges modern LLM-based automation with established statistical computing ecosystems.
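The abstract's core idea, fusing a function's semantic embedding with features describing the query data's distribution, can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the toy hash-based text encoder, the particular distribution features (skewness, excess kurtosis, log-scale), and fusion by simple concatenation are not the paper's actual architecture.

```python
import hashlib
import numpy as np

def text_embedding(text: str, dim: int = 32) -> np.ndarray:
    """Toy deterministic text encoder (stand-in for a trained model)."""
    seed = int(hashlib.md5(text.encode()).hexdigest()[:8], 16)
    v = np.random.default_rng(seed).standard_normal(dim)
    return v / np.linalg.norm(v)

def distribution_features(x: np.ndarray) -> np.ndarray:
    """Summary statistics of the query data; the exact feature set
    (skewness, excess kurtosis, log-scale) is an illustrative choice."""
    mean, std = x.mean(), x.std()
    z = (x - mean) / (std + 1e-9)
    skew = np.mean(z ** 3)
    kurt = np.mean(z ** 4) - 3.0
    return np.array([skew, kurt, np.log1p(std)])

def distribution_aware_vector(query: str, data: np.ndarray) -> np.ndarray:
    """Fuse semantic intent with distribution features by concatenation."""
    return np.concatenate([text_embedding(query), distribution_features(data)])

def rank_functions(query_vec: np.ndarray, catalog: list) -> list:
    """Rank (name, vector) catalog entries by cosine similarity."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))
    return sorted(catalog, key=lambda nv: -cos(query_vec, nv[1]))
```

In this sketch, each R function in the index would be stored with a matching concatenated vector, so that retrieval scores reflect both the analyst's intent and whether the function suits the data's shape.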

Figure 1: DARE significantly improves retrieval accuracy compared with general-purpose embedding models by incorporating distribution-aware representations.

Figure 2: Overview of the DARE framework. DARE jointly models semantic intent and data distribution constraints to enable reliable statistical function retrieval.

Figure 3: Pipeline of Constructing Evaluation Tasks.

Figure 4: An example of integrating DARE into an LLM agent.

Experimental Results

Model                         Params   NDCG@10   MRR@10   Recall@10   Recall@1
Snowflake/arctic-embed-l      335M     0.7932    0.7510   0.9235      0.6549
intfloat/e5-large-v2          335M     0.7513    0.7086   0.8838      0.6152
jina-embeddings-v2-base-en    137M     0.7429    0.6965   0.8873      0.5969
BAAI/bge-m3                   568M     0.7308    0.6843   0.8758      0.5847
mxbai-embed-large-v1          335M     0.7068    0.6565   0.8639      0.5508
UAE-Large-V1                  335M     0.7066    0.6556   0.8658      0.5479
gte-large-en-v1.5             435M     0.6639    0.6122   0.8257      0.5040
all-mpnet-base-v2             110M     0.6606    0.6057   0.8330      0.4937
Base Model (MiniLM)           23M      0.6127    0.5553   0.7936      0.4412
DARE (Ours)                   23M      0.9347    0.9176   0.9863      0.8739
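For reference, the ranking metrics reported above can be computed as follows. This is a standard textbook implementation over binary relevance lists, not the paper's evaluation code.

```python
import math

def ndcg_at_k(rels, k=10):
    """NDCG@k over a ranked binary relevance list (1 = relevant)."""
    dcg = sum(r / math.log2(i + 2) for i, r in enumerate(rels[:k]))
    ideal = sorted(rels, reverse=True)
    idcg = sum(r / math.log2(i + 2) for i, r in enumerate(ideal[:k]))
    return dcg / idcg if idcg > 0 else 0.0

def mrr_at_k(rels, k=10):
    """Reciprocal rank of the first relevant hit within the top k."""
    for i, r in enumerate(rels[:k]):
        if r:
            return 1.0 / (i + 1)
    return 0.0

def recall_at_k(rels, k=10):
    """Fraction of all relevant items that appear in the top k."""
    total = sum(rels)
    return sum(rels[:k]) / total if total else 0.0
```

With a single gold function per query, Recall@1 reduces to top-1 accuracy, which is why it is the strictest column in the table.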

Figure 5: Efficiency comparison measured by queries per second (QPS). Despite its small parameter count, DARE achieves both superior accuracy and fast retrieval.

Downstream Agent Performance

Model              RCodingAgent (w/o DARE)   RCodingAgent (w/ DARE)   Absolute gain
claude-haiku-4.5   6.25%                     56.25%                   +50.00%
deepseek-v3.2      18.75%                    56.25%                   +37.50%
gpt-5.2            25.00%                    62.50%                   +37.50%
grok-4.1-fast      18.75%                    75.00%                   +56.25%
mimo-v2-flash      12.50%                    62.50%                   +50.00%
minimax-m2.1       12.50%                    68.75%                   +56.25%
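The integration pattern behind these gains (the agent retrieves candidate R functions conditioned on both the task and a summary of the data before generating code) can be sketched as below. The `build_agent_prompt` helper, the prompt layout, and the assumed `retriever.search(query, k)` interface are all hypothetical illustrations, not RCodingAgent's actual implementation.

```python
def build_agent_prompt(task: str, data_summary: str, retriever, k: int = 3) -> str:
    """Hypothetical DARE-in-the-loop step: query the retriever with the
    task plus a data-distribution summary, then splice the top-k hits
    into the coding agent's prompt before R code generation."""
    query = f"{task}\nData profile: {data_summary}"
    hits = retriever.search(query, k=k)
    tool_block = "\n".join(f"- {h}" for h in hits)
    return (
        f"Task: {task}\n"
        f"Data profile: {data_summary}\n"
        f"Candidate R functions:\n{tool_block}\n"
        "Write R code using the most applicable function."
    )
```

Because the retriever sees the distribution summary, a right-skewed sample can surface, say, a robust or nonparametric routine rather than a default normal-theory one, which is the behavior the table above rewards.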