By Yue Xu¹, Sirihaasa Nallamothu², Haoyang Liu², Haohan Wang² (¹Columbia University, ²UIUC)
GenePrep is an automated multi‑agent tool that streamlines preprocessing and analysis of large‑scale gene expression data from GEO and TCGA. Given a dataset and a set of trait–condition pairs, GenePrep validates the data, selects pairs, runs statistical tests, and writes reproducible logs and CSV results with minimal scripting.
End-to-end workflow for GenePrep.
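To make "genes associated with a trait while accounting for a condition" concrete, here is a minimal sketch using partial correlation. This is an illustrative stand-in only: the page does not say which statistical test GenePrep runs, and the function names and toy values below are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def partial_corr(gene, trait, condition):
    """Correlation of gene and trait after removing the linear effect of condition."""
    r_gt = pearson(gene, trait)
    r_gc = pearson(gene, condition)
    r_tc = pearson(trait, condition)
    return (r_gt - r_gc * r_tc) / math.sqrt((1 - r_gc ** 2) * (1 - r_tc ** 2))


# Hypothetical toy data (not from GenePrep): six samples, one gene,
# a binary trait, and a numeric condition such as age.
expression = [2.1, 2.9, 3.2, 4.8, 5.1, 6.0]
trait = [0, 0, 0, 1, 1, 1]
age = [30, 45, 38, 52, 41, 60]

print(round(partial_corr(expression, trait, age), 3))
```

A real analysis would repeat this per gene and correct for multiple testing; the sketch only shows what adjusting for a condition means for a single gene.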
CLI Preview
$ conda create -n agent python=3.10
$ conda activate agent
$ pip install -r requirements.txt
$ python main.py --version 1 \
    --model gemini-2.0-flash-002 --api 1
Commands shown for illustration; see manual for full options.
Automation
End‑to‑end workflow: validation, pair selection, tests, and result generation.
Modular Agents
Modular agents plan, execute, and debug each step iteratively, with minimal manual scripting.
Reproducibility
Terminal‑style logs and CSV outputs for transparent, repeatable results.
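As a rough illustration of what repeatable CSV and log output can look like, the sketch below writes result rows in a fixed order and logs without timestamps, so two runs on the same input produce files that can be diffed. The column names, the `write_results` helper, and the placeholder rows are assumptions made for this example; they are not GenePrep's actual output schema or API.

```python
import csv
import logging
import os
import sys
import tempfile

# Hypothetical column layout, chosen only for this example.
FIELDS = ["trait", "condition", "gene", "statistic"]


def write_results(rows, csv_path, log_path):
    """Write rows to a CSV in a stable order and record the run in a text log."""
    logger = logging.getLogger("pipeline_sketch")
    logger.setLevel(logging.INFO)
    handler = logging.FileHandler(log_path, mode="w")
    # No timestamp in the format, so logs from identical runs compare equal.
    handler.setFormatter(logging.Formatter("%(levelname)s %(message)s"))
    logger.addHandler(handler)
    try:
        # Sorting removes any dependence on the order results were produced in.
        ordered = sorted(rows, key=lambda r: (r["trait"], r["condition"], r["gene"]))
        with open(csv_path, "w", newline="") as fh:
            writer = csv.DictWriter(fh, fieldnames=FIELDS)
            writer.writeheader()
            writer.writerows(ordered)
        logger.info("python %s", sys.version.split()[0])
        logger.info("wrote %d rows to %s", len(ordered), os.path.basename(csv_path))
    finally:
        logger.removeHandler(handler)
        handler.close()
    return ordered


# Placeholder rows; the names and numbers are made up for the example.
rows = [
    {"trait": "trait_1", "condition": "cond_1", "gene": "GENE_B", "statistic": 0.42},
    {"trait": "trait_1", "condition": "cond_1", "gene": "GENE_A", "statistic": 0.17},
]

with tempfile.TemporaryDirectory() as out_dir:
    ordered = write_results(
        rows,
        os.path.join(out_dir, "results.csv"),
        os.path.join(out_dir, "run.log"),
    )
    print([r["gene"] for r in ordered])
```

Logging the interpreter version alongside the row count is one simple way to tie a result file to the environment that produced it.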
@inproceedings{nallamothu2025geneprep,
title = {GenePrep: Unified Genomic Data Formatter for Statistical Analysis},
author = {Yue Xu and Sirihaasa Nallamothu and Haoyang Liu and Haohan Wang},
booktitle = {Proceedings of the Machine Learning in Computational Biology Conference (MLCB)},
year = {2025},
month = jun,
address = {Vancouver, Canada},
publisher = {MLCB},
url = {https://www.mlcb.io/2025},
abstract = {GenePrep is an automated multi-agent system that streamlines preprocessing and analysis of large-scale gene expression data from GEO and TCGA. By simply installing the GenePrep package, users can perform end-to-end workflows including data validation, trait–condition pair selection, statistical testing, and result generation. Given a dataset and trait–condition pairs, GenePrep identifies genes associated with traits while accounting for conditions. Its modular agents enable iterative planning, execution, and debugging with minimal manual scripting. GenePrep reduces preprocessing overhead and improves reproducibility. This work builds on tools to create teams of AI scientists and automate gene expression data analysis, as presented in Liu et al. (2024, 2025).}
}