Mapping the Increasing Use of LLMs in Scientific Papers

https://arxiv.org/abs/2404.01268

Computer Science > Computation and Language

arXiv:2404.01268 (cs)

Authors: Weixin Liang, Yaohui Zhang, Zhengxuan Wu, Haley Lepp, Wenlong Ji, Xuandong Zhao, Hancheng Cao, Sheng Liu, Siyu He, Zhi Huang, Diyi Yang, Christopher Potts, Christopher D Manning, James Y. Zou

Abstract: Scientific publishing lays the foundation of science by disseminating research findings, fostering collaboration, encouraging reproducibility, and ensuring that scientific knowledge is accessible, verifiable, and built upon over time. Recently, there has been immense speculation about how many people are using large language models (LLMs) like ChatGPT in their academic writing, and to what extent these tools might affect global scientific practices. However, we lack a precise measure of the proportion of academic writing substantially modified or produced by LLMs. To address this gap, we conduct the first systematic, large-scale analysis across 950,965 papers published between January 2020 and February 2024 on the arXiv, bioRxiv, and Nature portfolio journals, using a population-level statistical framework to measure the prevalence of LLM-modified content over time. Our statistical estimation operates at the corpus level and is more robust than inference on individual instances. Our findings reveal a steady increase in LLM usage, with the largest and fastest growth observed in Computer Science papers (up to 17.5%). In comparison, Mathematics papers and the Nature portfolio showed the least LLM modification (up to 6.3%). Moreover, at an aggregate level, our analysis reveals that higher levels of LLM modification are associated with papers whose first authors post preprints more frequently, papers in more crowded research areas, and papers of shorter lengths. Our findings suggest that LLMs are being broadly used in scientific writing.
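The corpus-level idea can be illustrated with a small sketch. The snippet below is a minimal, hypothetical illustration rather than the authors' implementation: assuming each document's log-likelihood under a human-written reference distribution and under an LLM-modified reference distribution is available, the fraction of LLM-modified documents is estimated by maximum likelihood over the whole corpus as a two-component mixture, instead of classifying papers one at a time. The marker-word rates and planted fraction in the toy corpus are invented for demonstration.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def estimate_llm_fraction(log_p_human, log_p_llm):
    """Corpus-level maximum-likelihood estimate of the fraction alpha of
    LLM-modified documents under the mixture
        P(doc) = (1 - alpha) * P_human(doc) + alpha * P_llm(doc)."""
    log_p_human = np.asarray(log_p_human, dtype=float)
    log_p_llm = np.asarray(log_p_llm, dtype=float)

    def neg_log_lik(alpha):
        a = np.log1p(-alpha) + log_p_human   # human-written component
        b = np.log(alpha) + log_p_llm        # LLM-modified component
        m = np.maximum(a, b)                 # log-sum-exp for numerical stability
        return -np.sum(m + np.log(np.exp(a - m) + np.exp(b - m)))

    result = minimize_scalar(neg_log_lik, bounds=(1e-6, 1 - 1e-6), method="bounded")
    return result.x

# Toy corpus (hypothetical numbers): each document is reduced to whether it
# contains a single marker word whose rate differs between human-written and
# LLM-modified text.
rng = np.random.default_rng(0)
n_docs, true_alpha = 100_000, 0.15
rate_human, rate_llm = 0.02, 0.20
is_llm = rng.random(n_docs) < true_alpha
has_word = rng.random(n_docs) < np.where(is_llm, rate_llm, rate_human)

log_p_human = np.where(has_word, np.log(rate_human), np.log(1 - rate_human))
log_p_llm = np.where(has_word, np.log(rate_llm), np.log(1 - rate_llm))
print(f"estimated fraction: {estimate_llm_fraction(log_p_human, log_p_llm):.3f}")
# prints roughly 0.15, close to the planted fraction
```

On this synthetic corpus the mixture estimate recovers the planted fraction closely even though no single document can be labeled with confidence, which is the sense in which corpus-level estimation is more robust than instance-level detection.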

Submission history

From: Weixin Liang
[v1] Mon, 1 Apr 2024 17:45:15 UTC (3,875 KB)

