arXiv: Preprints citing preprints

I’ve been reading a lot of AI research lately. Something has been nagging at me: the modern AI research ecosystem has quietly abandoned peer review, and the field has largely stopped noticing. I’m wrestling with what this means, why it bothers me, and whether I simply need to adjust my expectations in these times of incredibly rapid, AI-assisted research on AI and AI-related topics.

A Brief History of arXiv

arXiv (pronounced “archive”) launched in 1991 as a preprint server for physicists who wanted to share results quickly before the long formal publication process concluded. The idea was simple and good: don’t lock up knowledge for 18 months behind a slow editorial queue. Let researchers read and build on each other’s work in near real-time.

It worked beautifully for physics. It has since become the dominant dissemination channel for computer science, mathematics, and AI. And in principle, that’s still a good thing. We now have a venue for research that is open, fast, and accessible to everyone regardless of institutional subscription status.

Is this OK? Or has something gone sideways?


The Citation Chain Problem

Here’s the pattern I keep running into: I open a recent AI preprint (something on large language model reasoning, reinforcement learning, or model training) and I look at its reference list. Nearly every citation reads something like:

arXiv preprint arXiv:2501.XXXXX

Not “published in NeurIPS 2025.” Not “Journal of Machine Learning Research, Vol. 22.” Just: preprint.

Some of the most widely cited papers in AI right now are unreviewed technical reports. The Qwen2.5 technical report (Qwen Team, 2024), describing one of the most widely used open-weights model families, is an arXiv preprint that (checking Google Scholar) has over 11,000 citations at the time of this post! The Qwen3 technical report (Yang et al., 2025)? Also a preprint. A paper I’ve recently been studying with my students, TTRL (Zuo et al., 2025), a paper on test-time reinforcement learning that entire lines of follow-up work have been building on, initially circulated as a preprint before being accepted to NeurIPS 2025, meaning it spent months in heavy citation before clearing any formal bar. The NeurIPS reference has a little over 200 citations already. With respect to the utility of arXiv, as a Mandalorian would put it, “This is the way.”

And then there’s the category that doesn’t even seek conference review: industry technical reports from Meta, Google, Alibaba, and DeepSeek, describing frontier model families. The Llama 3 technical report from Meta (Grattafiori et al., 2024) is an arXiv preprint. DeepSeek-R1 (Guo et al., 2025) – the paper on incentivizing reasoning in large language models via reinforcement learning – circulated as a preprint for months before eventually being published in Nature in September 2025. That eventual publication is worth noting: the community built heavily on it, launched dozens of follow-up preprints, and treated it as established science well before any independent reviewer had looked at it. The formal peer review was really a lagging footnote to a citation trail that was already quite deep and impressive.

What does this mean in practice for the AI researcher today? A new paper cites 30 sources, many of which were unreviewed at the time of citation. That new paper is itself a preprint. And within weeks, other preprints will cite it. The epistemic dependency chain is unvalidated all the way down. Should this be of concern?


Where Peer Review Has Gone

Traditional peer review assumed a sequential process: experiment → submission → expert review → revise → publish → cite. That model has effectively collapsed in AI/ML for a few compounding reasons:

  • Speed asymmetry. Peer review takes 6–18 months. The field moves in weeks. By the time a paper clears formal review, it may already have dozens of preprint descendants.
  • Venue congestion. Top conferences like NeurIPS, ICML, and ICLR, to name a few, typically accept only about 20–30% of submissions. This means a large fraction of legitimate, solid work never clears the bar, not because it’s wrong, but because the venues are already overwhelmed.
  • Industry reports bypassing the queue entirely. Many of the most-cited papers in AI right now aren’t rejected conference submissions languishing on arXiv — they’re technical reports from industrial labs that were never intended for peer-reviewed venues. They describe systems their authors built and deployed, not experiments submitted for external scrutiny. This is a different kind of problem from slow review: it’s the deliberate absence of any review at all.
  • Thin conference review. Even papers that do get accepted at major venues typically receive 2–3 reviews, often written in 2–3 weeks by reviewers who may themselves be preprint-only researchers working in adjacent subfields. Having been a reviewer myself, I can say it is a time-consuming process. The rewards are intrinsic, of course: it forces me to keep up to date. But in a field like AI today that is inundated with research, the process is too slow. And conference review is not the adversarial, months-long scrutiny that characterizes review in medicine or biology.

The result: the word “preprint” has functionally lost its meaning as a cautionary label. In AI, is a preprint now considered a de facto publication? It surely seems to be the case. It gets cited like one, benchmarked against like one, and built upon like one, often before anyone outside the authors’ own institution has carefully checked the work.


Citation Laundering

There’s a deeper problem I’d call citation laundering: a claim gets repeated across enough preprints that it acquires the social authority of an established fact without ever acquiring the epistemic warrant.

Consider how this plays out in practice. Paper A (a preprint) reports substantial accuracy gains on a reasoning benchmark using a new training method. Papers B, C, D, and E (all preprints) each cite Paper A as a foundational result and build refinements on top of it. Paper F then cites B through E, and its introduction reads as though A’s result is settled science. At no point in this chain has anyone outside the original research group independently verified the foundational claim. If Paper A overstates its result (through cherry-picked benchmarks, a subtle flaw in experimental design, or a training setup that doesn’t generalize at all), then all of B through F inherit that flaw.
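The chain above can be sketched as a small graph walk. This is only an illustration under stated assumptions: the `CITES` dictionary encodes the hypothetical Papers A through F from the example, and the function name `dependents` is made up for this sketch. It shows how far one unverified result reaches once follow-up work starts citing it.

```python
from collections import deque

# Hypothetical citation graph from the example: each paper maps to the papers it cites.
CITES = {
    "A": [],
    "B": ["A"],
    "C": ["A"],
    "D": ["A"],
    "E": ["A"],
    "F": ["B", "C", "D", "E"],
}

def dependents(cites, root):
    """Return every paper that builds on `root`, directly or through other papers."""
    # Invert the graph so we can ask "who cites this paper?"
    cited_by = {paper: [] for paper in cites}
    for paper, refs in cites.items():
        for ref in refs:
            cited_by.setdefault(ref, []).append(paper)

    # Breadth-first walk outward from the root claim.
    seen = set()
    queue = deque([root])
    while queue:
        current = queue.popleft()
        for citer in cited_by.get(current, []):
            if citer not in seen:
                seen.add(citer)
                queue.append(citer)
    return seen

print(sorted(dependents(CITES, "A")))
```

Paper F never cites A directly, yet it still lands in the set of papers that inherit whatever is wrong with A, which is the point of the example.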

This is not a hypothetical. There is a growing body of work raising exactly these concerns. Papers like “No Free Lunch: Rethinking Internal Feedback for LLM Reasoning” (Zhang et al., 2025) and “How Far Can Unsupervised RLVR Scale LLM Training?” (He et al., 2026) examine specific training paradigms — methods that use internal model signals instead of external reward supervision — and show that the gains they report tend to follow a rise-then-fall pattern: performance improves early in training, then collapses below the pre-training baseline. Aggregate benchmark numbers, looked at before that collapse sets in, would appear to be a genuine advance. Notably, both of those critical papers are also arXiv preprints.

The specific findings in those papers are about a particular class of methods, not a sweeping indictment of all AI benchmarking. But they illustrate the general risk well: a training approach can produce results that look compelling at the wrong moment in its training curve, get cited heavily in that window, and the problematic training dynamics surface only later when someone looks harder.

Perhaps this is the new form of peer review in rapidly evolving fields such as AI: release a preprint on arXiv, let it get some press, let it make the rounds on the socials where people (including the authors themselves) can post and hype up their work, or, much more preferably, get it discussed on review sites (e.g. https://gotit.pub), until another preprint comes out to critique and improve on prior work.


The Case for arXiv (Being Fair)

I don’t want to be entirely one-sided here, because the alternative, i.e. returning to traditional journal timelines, is not better. In fact, I could argue it is not suitable for AI at this time (though I would disagree with myself on that latter point!)

Peer review has its own well-documented failure modes. It is slow, biased toward established labs and prestigious institutions, and has failed catastrophically to catch replication crises in fields like social psychology and biomedical research that used it faithfully. Peer review is a filter, not a guarantee.

The open-access aspect of arXiv is also genuinely democratizing. Researchers at institutions without expensive journal subscriptions can fully participate in the conversation. That matters.

And the AI community does exercise a form of informal community review. As I mentioned above, it’s pretty safe to say that preprints risk getting publicly scrutinized on social media. Competing labs will work to replicate (or fail to replicate) the work, and its claims will be challenged by follow-up work. For code and results that can be independently reproduced, this is actually quite fast and sometimes more effective than formal review. DeepSeek-R1 is again instructive: within days of its preprint release, multiple groups were attempting to reproduce its results, flag discrepancies, and extend its methods. The stronger the hype behind a paper (remember that huge dive NVDA took after DeepSeek?!?), the more scrutiny it tends to attract. So, that review process is real, and seems to work… most of the time. It just happens in public, messily, over months, rather than privately before publication.

The limitation of that informal process is also real, though: it works well for claims that can be reproduced from public code and standard benchmarks. It performs poorly for claims that hinge on proprietary data, undisclosed training details, or subtle methodological choices not visible in the paper. Those claims can circulate unchallenged for a long time.


The Problem Is the Conflation

Here’s my actual concern, stated plainly: arXiv is enormously useful as a communication tool, but it has been mistaken for a validation tool.

When a paper citing dozens of unreviewed technical reports is itself cited as authoritative in another preprint, we are building an increasingly tall structure on an unverified foundation. For fast-moving engineering claims that get stress-tested by replication at competing labs, this is uncomfortable but arguably tolerable. For deeper scientific claims about what these models actually learn, how they generalize, and whether reported gains reflect genuine capability acquisition, it is a meaningful epistemic risk that the field has largely chosen to accept without much reflection.

What would be better? At minimum: more careful hedging when citing unreviewed work, especially in introductions and related-work sections where preprints often get laundered into settled background. Journals and conference proceedings could normalize explicitly flagging which cited works were unreviewed at time of publication. Science journalists covering AI results could routinely note preprint status the way they note funding sources. None of these are radical changes in the process. They just require some agility.

I know, I know. Agility is to academia as oil is to water.

Regardless, the speed of AI research is genuinely exciting. But speed without validation is just noise that moves fast.

It’s a reminder for all of us to be cautious and critical of all new preprints. Maybe we should have been doing this with all new research anyway.

Past Research Projects

The following are research projects that, for one reason or another, ended up falling down the priority list and are no longer being actively worked on. I list them here as a possible conversation starter with students looking for interesting work.

  • [IN PREP] Cowen R, Mitchel MW, Hare-Harris A, King BR. Incorporation of Brown’s stages of syntactic and morphological development in a word prediction model of conversational speech from young children.
  • [IN PREP] Cowen R, Mitchel MW, Hare-Harris A, King BR. An adaptive n-gram based stochastic word prediction model for conversational speech.
  • [IN PREP] Hare A, Essae E, King BR, Ledbetter DH, Martin CL. Determining the dosage effect of copy number variants in the human genome.
  • [IN PREP] Ren C, King BR. Protein residue contact map prediction using bagged decision trees.

Current Student Research

These are ongoing projects as of Summer 2019


Bhagawat Acharya ’20 – Using deep learning for handwriting text recognition.

  • This is a collaborative, interdisciplinary project with Katherine Faull (Comparative Humanities and German Studies) and Carrie Pirmann (Research Services Librarian). We are working together to develop an improved handwriting translation pipeline to increase the HTR throughput of 17th–18th century Moravian handwritten literature that is part of the Moravian archives.
  • Funding – Bucknell Emerging Scholars Summer Research Program

Taehwan Kim ’20 – Using Deep Learning to Forecast Monthly Extreme Temperatures over the United States

  • Undoubtedly, climate change is one of the most pressing, disconcerting issues of our time. Collaborating with atmospheric science and aerosol science expert Dabrina Dutcher, Assistant Prof. in Chemistry and Chemical Engineering, we are exploring the use of deep learning to develop advanced models that can improve future temperature predictions.
  • Funding – Katherine Mabis McKenna Environmental Internship

Lily Romano ’20 – Software for Aerosol Analysis

  • We are developing a new software toolkit to aid in the aerosol research of my colleagues in Chemical Engineering, Dabrina Dutcher, PhD and Timothy Raymond, PhD. Lily is resuming work that was initiated by former student Khai Nguyen ’18 on the software, including advancing the data analysis tools available for aerosol researchers.
  • Funding – Clare Boothe Luce Research Scholars Program

Kartikeya Sharma ’20 – Trajectory Gaze Path Analysis on Eye Tracking Data for Autism Spectrum Disorder Studies

  • This is a collaborative project with my colleagues, Vanessa Troiani, PhD and Antoinette Sabatino DiCriscio, PhD at the Geisinger Autism Developmental Medicine Institute. The primary aim is to develop a toolkit for the eye tracking research community that incorporates my novel method for extracting scanpath trends from group-level eye tracking data.
  • Funding – Ciffolillo Healthcare Technology Inventors Program

Yili Wang ’21 – Using deep learning to identify discriminative features of images with high interest of autistic children

  • This is a collaborative project with my colleague Vanessa Troiani, PhD at Geisinger Autism and Developmental Medicine Institute. This is also a continuation of a project with former student Tongyu Yang ’17, who is continuing to assist with the effort.
  • Funding – Bucknell Program for Undergraduate Research (PUR)

These are projects that are unfinished for a variety of reasons:

Summer 2016

It has been quite some time since I’ve updated current events. Thanks to our students, we have had a pretty active summer…

  • Robert Cowen is continuing his work with me on word prediction models. We have good results and are writing our first paper. The first draft should be complete by the beginning of September.
  • Morgan Eckenroth has started work on the development of a virtual reality app (using Google Cardboard) that will be used by autistic children to help assess (and hopefully retrain) biases in their visual processing.
  • Khai Nguyen is working on a collaborative project, funded together by the College of Engineering, Chemical Engineering, and Computer Science. The aim of the project is to develop a new application for aerosol researchers in Chem Eng.
  • Ryan Stecher is working on a collaborative project with Dr. Aaron Mitchel in Psychology to develop and finalize a web-based series of perception tests.
  • Tongyu Yang has been investigating the use of deep learning to help autism researchers better understand why autistic children have substantial interest in certain types of images.

Son Pham, ’17

Project: Using Deep Learning to Automatically Learn Feature Representation and Build a Better Classification Model on Protein Sequential Data
Started: Summer 2015
Funding: Bucknell University PUR

ABSTRACT

In theory, deep learning is not new. However, it has recently become one of the most exciting directions that machine learning has witnessed in years. It has had a tremendous impact on image classification. However, there are very few methods that have investigated its use on strictly sequential data, such as those found in biological sequences. This study will aim to investigate the use of deep learning to induce a protein sequence classifier that can outperform existing methods.

ACHIEVEMENTS

  • Poster Presentation – Sigma Xi 2015 Summer Research Symposium
  • Poster Presentation – Fifth Annual Susquehanna Valley Undergraduate Research Symposium, SVURS 2015, August 4, Bucknell University, Lewisburg, PA
  • Poster Presentation – Presented at 15th Annual Kalman Research Symposium, April 2, 2016, Bucknell University, Lewisburg, PA

POST GRADUATION UPDATES

Son graduated with his degrees in Computer Science and Engineering, together with Digital Studio Arts. He went on to work for Amazon as a Software Engineering Intern, then took a position at Google working with machine learning. Son graduated with the aim of going back to graduate school in 1-2 years.

Chuqiao Ren, ’15

Project: A novel ensemble classifier for protein contact map prediction
Duration: Summer 2013 – Spring 2015
Funding: Bucknell University Program for Undergraduate Research, BRK Startup Fund, Geisinger BGRI Grant, CS Dept. Fund

ABSTRACT

One of the greatest challenges in bioinformatics is how to predict the 3-D structure of a protein by understanding the relationship between its amino acid sequence and its structure. A protein contact map is a useful way of representing protein 3-D conformations. It is based on a distance matrix, which is a symmetric matrix that contains the Euclidean distance between the C-alpha atoms of each pair of residues in the folded protein.

Our goal is to improve existing machine learning algorithms for predicting a protein contact map from protein sequence, and develop a novel algorithm that improves the performance of existing contact map predictors.
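As a rough illustration of the representation described above, here is a minimal sketch that builds a distance matrix from C-alpha coordinates and thresholds it into a binary contact map. It is not the project's predictor, which works from sequence. The function names, the toy coordinates, and the 8 angstrom cutoff (a commonly used convention, assumed here) are all choices made for this sketch.

```python
import math

def distance_matrix(ca_coords):
    """Symmetric matrix of Euclidean distances between C-alpha atoms."""
    n = len(ca_coords)
    return [
        [math.dist(ca_coords[i], ca_coords[j]) for j in range(n)]
        for i in range(n)
    ]

def contact_map(ca_coords, threshold=8.0):
    """Binary map: 1 where two C-alpha atoms lie within `threshold`, else 0.

    The 8.0 default is an assumed cutoff in angstroms, not a value from the text.
    """
    return [
        [1 if d <= threshold else 0 for d in row]
        for row in distance_matrix(ca_coords)
    ]

# Toy C-alpha coordinates (angstroms) for four residues, invented for illustration.
coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (20.0, 0.0, 0.0)]

for row in contact_map(coords):
    print(row)
```

The first three residues sit close enough to count as mutual contacts, while the fourth only contacts itself. A sequence-based predictor would try to recover a map like this without ever seeing the coordinates.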

ACHIEVEMENTS

  • Honors Thesis – Successfully defended, April 2015
  • Short paper and poster – ACM BCB ’14 – ACM International Conference on Bioinformatics, Computational Biology and Biomedicine, Sept 20-23, Newport Beach, CA
  • Poster Presentation – Fourth Annual Susquehanna Valley Undergraduate Research Symposium, SVURS 2014, August 5, Geisinger Research, Danville, PA
  • Poster – Kalman Research Symposium 2014, March 29, Bucknell University, Lewisburg, PA.

POST GRADUATION UPDATES

Chuqiao successfully defended her honors thesis in April, 2015. She is staying for a bit longer this summer to help finish a journal publication and submit before she departs us. She is currently planning on pursuing her graduate degree in computer science at Columbia University, starting Fall 2015. Congratulations, Chuqiao!

Summer 2015

We have an active summer in store. Three students are working on entirely different research projects, while Rachel Ren is wrapping up her work.

  • Son Pham is working on investigating the use of Deep Learning for protein sequence classification. Deep Learning has recently gained substantial recognition due to its success with automated image recognition and speech classification. Very few have examined its use in bioinformatics. Son will help me explore this untapped area in bioinformatics.
  • Jason Hammett will be applying data mining techniques to years of regional climate data, including local stats for the Susquehanna River, to develop explanatory and predictive models for anomalistic weather events around the Susquehanna River Valley.
  • Robert Cowen will be continuing the wonderful work that I started with Bucknell Student Stephanie Gonthier last year on word prediction. Robert will be collaborating with myself and speech pathologists at the Geisinger-Bucknell Autism and Developmental Medicine Institute (ADMI) to develop a preliminary version of a new augmentative and alternative communication (AAC) app that will utilize my word prediction model. This first version will be developed to run on Android tablets.
  • Rachel Ren is graciously staying for a month after graduating to help submit a paper based on her extensive work completed for her honors thesis. Stay tuned!

Spring 2015

Rachel Ren successfully defended her honors thesis, titled “Predicting Protein Contact Maps by Bagging Decision Trees”. Additionally, Rachel will be attending graduate school starting in the fall at Columbia University, where she will pursue a Masters in Computer Science. Rachel intends to focus on research in machine learning.

Congratulations, Rachel! Bucknell is proud of you! We wish you the very best as you pursue your graduate work.

Stephanie Gonthier, ’15

Project: Using statistical learning to improve word prediction for augmentative and alternative communication
Duration: Summer 2014
Funding: Bucknell University Program for Undergraduate Research, Geisinger BGRI Grant

ABSTRACT

There are a multitude of reasons why people may be unable to communicate effectively through verbal speech, including disorders like ALS, MS, Cerebral Palsy and Autism. Some people use augmentative and alternative communication (AAC), which is simply any mode of communication besides verbal speech, including gestures, writing, facial expressions, pointing to pictures and so on. In recent decades, the field of AAC has been flooded by electronic devices which generate speech for these people based on combinations of pictures, symbols and/or words that are stored on the device. Unfortunately, these devices do present problems; notably, the communication rate with a device is reduced to a fraction of the communication rate of normal speakers. The average user of a device is only able to communicate 10 words per minute, compared to the 130-200 words per minute of an average speaker [ref]. This stark contrast can leave users frustrated, reducing the utility of such devices. The aim of this research is to develop a novel algorithm that would increase the communication rate for users of AAC devices.

ACHIEVEMENTS

  • Oral Presentation – Fourth Annual Susquehanna Valley Undergraduate Research Symposium, SVURS 2014, August 5, Geisinger Research, Danville, PA
    Winner for oral presentation – One of three chosen out of 86 submissions!
  • Poster Presentation – 2014 Sigma Xi Summer Student Research Symposium, July 24, Bucknell University, Lewisburg, PA

POST GRADUATION UPDATES

Elizabeth Dwornik, ’14

Project: Named-Entity Recognition
Duration: Summer 2013 – Spring 2014
Funding: Bucknell University Program for Undergraduate Research

ABSTRACT

Liz is working on a system that can annotate all of the named entities within a text. There are good systems that can identify named entities; however, identifying the type of named entity is a more challenging problem. Many successful systems use simple database lookup techniques and identify entities from a master gazetteer. We are working on a system that can distinguish among different types of named entities without a gazetteer. Our initial efforts will focus on distinguishing entities between location, organization, or person. We plan to start by developing a large set of regular expressions that can be used to classify the different types of entities.

ACHIEVEMENTS

  • Poster: Kalman Research Symposium 2013, April 13, Bucknell University, Lewisburg, PA

POST GRADUATION UPDATES

Liz pursued graduate school studies at Carnegie Mellon University, starting Fall 2014. She enrolled in the Software Management program in the Information Networking Institute. Congratulations, Liz!