Iterative Reasoning Preference Optimization

https://arxiv.org/abs/2404.19733

Authors: Richard Yuanzhe Pang and 5 other authors

Abstract: Iterative preference optimization methods have recently been shown to perform well for general instruction tuning tasks, but typically make little improvement on reasoning tasks (Yuan et al., 2024; Chen et al., 2024). In this work we develop an iterative approach that optimizes the preference between competing generated Chain-of-Thought (CoT) candidates by optimizing for winning vs. losing reasoning steps that lead to the correct answer. We train using a modified DPO loss (Rafailov et al., 2023) with an additional negative log-likelihood term, which we find to be crucial. We show reasoning improves across repeated iterations of this scheme. While only relying on examples in the training set, our approach results in increasing accuracy on GSM8K, MATH, and ARC-Challenge for Llama-2-70B-Chat, outperforming other Llama-2-based models not relying on additionally sourced datasets. For example, we see a large improvement from 55.6% to 81.6% on GSM8K and an accuracy of 88.7% with majority voting out of 32 samples.
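The abstract describes a loss that combines the DPO objective over (winning, losing) CoT pairs with an extra negative log-likelihood term on the winning sequence. Below is a minimal PyTorch-style sketch of such a combined loss, assuming summed per-sequence log-probabilities are precomputed, that the NLL term is length-normalized, and that it is weighted by a coefficient alpha; the exact normalization and hyperparameters are not specified in this abstract, so treat the values as illustrative.

```python
import torch
import torch.nn.functional as F

def dpo_nll_loss(logp_w_policy, logp_l_policy, logp_w_ref, logp_l_ref,
                 len_w, beta=0.1, alpha=1.0):
    """DPO loss on a (winning, losing) CoT pair plus an NLL term on the winner.

    Args (all log-probs are summed over the CoT + answer tokens):
        logp_w_policy: log-prob of the winning sequence under the current policy.
        logp_l_policy: log-prob of the losing sequence under the current policy.
        logp_w_ref, logp_l_ref: the same quantities under the frozen reference
            model (e.g. the model from the previous iteration).
        len_w: token length of the winning sequence, used to normalize the NLL term.
        beta, alpha: DPO temperature and NLL weight (illustrative values).
    """
    # Standard DPO term: push the policy to prefer the winner over the loser,
    # measured relative to the reference model.
    margin = beta * ((logp_w_policy - logp_w_ref) - (logp_l_policy - logp_l_ref))
    loss_dpo = -F.logsigmoid(margin)

    # Additional NLL term on the winning sequence, which the abstract reports
    # as crucial for the reasoning gains.
    loss_nll = -logp_w_policy / len_w

    return loss_dpo + alpha * loss_nll
```

In the iterative scheme, pairs are built from the model's own samples on training questions: generated CoTs whose final answer is correct serve as winners, incorrect ones as losers, and each round of DPO+NLL training produces the model (and reference) for the next iteration.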

Submission history

From: Richard Yuanzhe Pang
[v1] Tue, 30 Apr 2024 17:28:05 UTC (8,118 KB)
[v2] Tue, 7 May 2024 17:25:08 UTC (8,109 KB)
[v3] Wed, 26 Jun 2024 01:28:35 UTC (8,129 KB)
