GenePrep

Unified Genomic Data Formatter for Statistical Analysis

By Yue Xu (1), Sirihaasa Nallamothu (2), Haoyang Liu (2), Haohan Wang (2)
(1) Columbia University, (2) UIUC

Overview

GenePrep is an automated multi-agent tool that streamlines preprocessing and analysis of large-scale gene expression data from GEO and TCGA. Given a dataset and a set of trait–condition pairs, GenePrep validates the data, selects pairs, runs statistical tests to identify genes associated with each trait while accounting for the condition, and writes reproducible logs and CSV results, all with minimal scripting.
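To make "genes associated with a trait while accounting for a condition" concrete, here is a minimal sketch of one common way to run such a test: an ordinary least squares regression per gene with the trait and the condition as covariates. This page does not say which statistical method GenePrep uses, so the function name `gene_trait_association`, the linear model, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np

def gene_trait_association(expression, trait, condition):
    """Per-gene OLS of expression on trait, adjusting for condition.

    Hypothetical helper, not part of GenePrep.
    expression: (n_samples, n_genes) array
    trait, condition: (n_samples,) arrays
    Returns the trait coefficient and its t-statistic for each gene.
    """
    n_samples = expression.shape[0]
    # Design matrix columns: intercept, trait, condition.
    X = np.column_stack([np.ones(n_samples), trait, condition])
    # Fit every gene in one call; beta has shape (3, n_genes).
    beta, _, _, _ = np.linalg.lstsq(X, expression, rcond=None)
    residuals = expression - X @ beta
    dof = n_samples - X.shape[1]
    sigma2 = (residuals ** 2).sum(axis=0) / dof
    # Standard error of the trait coefficient for each gene.
    xtx_inv = np.linalg.inv(X.T @ X)
    se_trait = np.sqrt(sigma2 * xtx_inv[1, 1])
    return beta[1], beta[1] / se_trait

# Synthetic data (assumed, for illustration only).
rng = np.random.default_rng(0)
n = 200
trait = rng.integers(0, 2, size=n).astype(float)
condition = rng.normal(size=n)
expression = rng.normal(size=(n, 3))
expression[:, 0] += 2.0 * trait      # gene 0 depends on the trait
expression[:, 1] += 2.0 * condition  # gene 1 depends on the condition only

coef, t_stat = gene_trait_association(expression, trait, condition)
```

In this setup gene 0 should show a large trait t-statistic, while gene 1, which is driven by the condition alone, should not, because the condition column absorbs its effect.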

Workflow

Figure: End-to-end workflow for GenePrep.

Quick Start

CLI Preview

$ conda create -n agent python=3.10
$ conda activate agent
$ pip install -r requirements.txt
$ python main.py --version 1 \
    --model gemini-2.0-flash-002 --api 1

Commands are shown for illustration; see the manual for the full set of options.

Features

Automation

End‑to‑end workflow: validation, pair selection, tests, and result generation.
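As a rough illustration of the validation step, the sketch below checks whether a trait–condition pair can be tested at all: the expression and clinical data must share samples, and each field needs at least two distinct recorded values. The checks GenePrep actually performs are not listed on this page, so `validate_pair`, the data layout, and the field names `Obesity` and `Age` are hypothetical.

```python
def validate_pair(expression_samples, clinical, trait, condition):
    """Return a list of problems; an empty list means the pair looks usable.

    Hypothetical helper, not part of GenePrep.
    expression_samples: sample IDs present in the expression matrix
    clinical: dict mapping sample ID -> dict of clinical fields
    trait, condition: names of the clinical fields to test
    """
    problems = []
    # Only samples present in both sources can be analysed.
    shared = [s for s in expression_samples if s in clinical]
    if not shared:
        return ["no samples shared between expression and clinical data"]
    for field in (trait, condition):
        values = [clinical[s].get(field) for s in shared]
        observed = [v for v in values if v is not None]
        if not observed:
            problems.append(f"{field}: no recorded values")
        elif len(set(observed)) < 2:
            # A constant field cannot explain any variation in expression.
            problems.append(f"{field}: only one distinct value")
    return problems

# Assumed toy clinical table.
clinical = {
    "S1": {"Obesity": 1, "Age": 54},
    "S2": {"Obesity": 0, "Age": 61},
    "S3": {"Obesity": 1, "Age": 47},
}

ok = validate_pair(["S1", "S2", "S3"], clinical, "Obesity", "Age")
bad = validate_pair(["S1", "S3"], clinical, "Obesity", "Age")
```

Here `ok` is an empty list, while `bad` reports that `Obesity` has only one distinct value once sample `S2` is missing from the expression data.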

Modular Agents

Modular agents plan, execute, and debug each step iteratively, so little manual scripting is needed.

Reproducibility

Terminal‑style logs and CSV outputs for transparent, repeatable results.

Paper

Read the full GenePrep paper below:


BibTeX

@inproceedings{nallamothu2025geneprep,
        title        = {GenePrep: Unified Genomic Data Formatter for Statistical Analysis},
        author       = {Yue Xu and Sirihaasa Nallamothu and Haoyang Liu and Haohan Wang},
        booktitle    = {Proceedings of the Machine Learning in Computational Biology Conference (MLCB)},
        year         = {2025},
        month        = jun,
        address      = {Vancouver, Canada},
        publisher    = {MLCB},
        url          = {https://www.mlcb.io/2025},
        abstract     = {GenePrep is an automated multi-agent system that streamlines preprocessing and analysis of large-scale gene expression data from GEO and TCGA. By simply installing the GenePrep package, users can perform end-to-end workflows including data validation, trait–condition pair selection, statistical testing, and result generation. Given a dataset and trait–condition pairs, GenePrep identifies genes associated with traits while accounting for conditions. Its modular agents enable iterative planning, execution, and debugging with minimal manual scripting. GenePrep reduces preprocessing overhead and improves reproducibility. This work extends on tools to create teams of AI scientists and automate gene expression data analysis, as presented in Liu et al. (2024, 2025).}
}