{"title": "Regret Analysis for Continuous Dueling Bandit", "book": "Advances in Neural Information Processing Systems", "page_first": 1489, "page_last": 1498, "abstract": "The dueling bandit is a learning framework where the feedback information in the learning process is restricted to noisy comparison between a pair of actions. In this paper, we address a dueling bandit problem based on a cost function over a continuous space. We propose a stochastic mirror descent algorithm and show that the algorithm achieves an $O(\\sqrt{T\\log T})$-regret bound under strong convexity and smoothness assumptions for the cost function. Then, we clarify the equivalence between regret minimization in dueling bandit and convex optimization for the cost function. Moreover, considering a lower bound in convex optimization, it is turned out that our algorithm achieves the optimal convergence rate in convex optimization and the optimal regret in dueling bandit except for a logarithmic factor.", "full_text": "Regret Analysis for Continuous Dueling Bandit\n\nWataru Kumagai\n\nCenter for Advanced Intelligence Project\n\nRIKEN\n\n1-4-1, Nihonbashi, Chuo, Tokyo 103-0027, Japan\n\nwataru.kumagai@riken.jp\n\nAbstract\n\np\n\nThe dueling bandit is a learning framework wherein the feedback information in\nthe learning process is restricted to a noisy comparison between a pair of actions.\nIn this research, we address a dueling bandit problem based on a cost function over\na continuous space. We propose a stochastic mirror descent algorithm and show\nthat the algorithm achieves an O(\nT log T )-regret bound under strong convexity\nand smoothness assumptions for the cost function. Subsequently, we clarify the\nequivalence between regret minimization in dueling bandit and convex optimiza-\ntion for the cost function. 
Moreover, when considering a lower bound in convex optimization, our algorithm is shown to achieve the optimal convergence rate in convex optimization and the optimal regret in dueling bandit except for a logarithmic factor.

1 Introduction

Information systems and computer algorithms often have many parameters which should be tuned. When cost or utility is explicitly given as numerical values or concrete functions, the system parameters can be appropriately determined depending on those values or functions. However, in a human-computer interaction system, it is difficult or impossible for users of the system to provide their preferences as numerical values or concrete functions. Dueling bandit was introduced to model such situations in Yue and Joachims (2009) and enables us to appropriately tune the parameters based only on comparison results between two parameters given by the users. In the learning process of a dueling bandit algorithm, the algorithm chooses a pair of parameters called actions (or arms) and receives only the corresponding comparison result. Since dueling bandit algorithms do not require an individual evaluation value for each action, they can be applied to wider areas that cannot be formulated using the conventional bandit approach.

When action cost (or user utility) implicitly exists, the comparison between two actions is modeled via a cost (or utility) function, which represents the degree of the cost (or utility), and a link function, which determines the noise in the comparison results. We refer to such a modeling method as the cost-based (or utility-based) approach and employ it in this research.
Yue and Joachims (2009) first introduced the utility-based approach as a model for a dueling bandit problem.

The cost-based dueling bandit relates to function optimization with noisy comparisons (Jamieson et al., 2012; Matsui et al., 2016) because in both frameworks an oracle compares two actions and the feedback from the oracle is represented by binary information. In particular, the same algorithm can be applied to both frameworks. However, as different performance measures are applied to the algorithms in function optimization and dueling bandit, it has not been demonstrated that an algorithm that works efficiently in one framework will also perform well in the other framework. This study clarifies the relation between function optimization and dueling bandit through their regret analysis.

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

1.1 Problem Setup

In the learning process of the dueling bandit problem, a learner presents two points, called actions, in a space $A$ to an oracle, and the oracle returns one-bit feedback to the learner based on which action wins (i.e., which action is more preferable for the oracle). Here, we denote by $a \succ a'$ the event that $a$ wins $a'$ and by $P(a \succ a')$ the probability that $a \succ a'$ happens. In other words, we assume that the feedback from the oracle follows the following two-valued random variable:

$$F(a, a') := \begin{cases} 1 & \text{w.p. } P(a \succ a') \\ 0 & \text{w.p. } 1 - P(a \succ a'), \end{cases} \qquad (1)$$

where the probability $P(a \succ a')$ is determined by the oracle. We refer to this type of feedback as noisy comparison feedback. Unlike conventional bandit problems, the learner has to make a decision that is based only on the noisy comparison feedback and cannot access the individual values of the cost (or utility) function.
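To make the feedback model concrete, an oracle of the form (1) can be simulated in a few lines. The quadratic cost and logistic link below are illustrative choices (any pair satisfying the assumptions of Section 1.2 would do), not ones fixed by the paper; a minimal sketch:

```python
import numpy as np

def make_comparison_oracle(cost, link, rng):
    """Noisy comparison oracle in the spirit of (1): returns 1 iff a beats a'.

    Under the cost-based model, P(a beats a') = link(cost(a') - cost(a)),
    so the lower-cost action wins more often; each call uses fresh,
    independent randomness, and the learner only ever sees the bit.
    """
    def oracle(a, a_prime):
        p = link(cost(a_prime) - cost(a))
        return int(rng.random() < p)
    return oracle

# Illustrative choices (not from the paper): quadratic cost, logistic link.
cost = lambda a: 0.5 * float(np.dot(a, a))
link = lambda x: 1.0 / (1.0 + np.exp(-x))
rng = np.random.default_rng(0)
oracle = make_comparison_oracle(cost, link, rng)

# The minimizer a* = 0 should beat a distant point most of the time:
# here P(0 beats (2,0)) = link(2), roughly 0.88.
wins = sum(oracle(np.zeros(2), np.array([2.0, 0.0])) for _ in range(1000))
```

Over 1000 queries, `wins` concentrates near $1000 \cdot \sigma(2) \approx 880$; the learner, of course, never observes these probabilities, only the individual bits.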
We further assume that each comparison between a pair of actions is independent of the other comparisons.

The learner makes a sequence of decisions based on the noisy comparisons provided by the oracle. After receiving $F(a_t, a'_t)$ at time $t$, the learner chooses the next two actions $(a_{t+1}, a'_{t+1})$. As a performance measure for an action $a$, we introduce the minimum win probability:

$$P^*(a) = \inf_{a' \in A} P(a \succ a').$$

We next quantify the performance of the algorithm using the expected regret as follows:1)

$$\mathrm{Reg}^{DB}_T = \sup_{a \in A} \mathbb{E}\left[\sum_{t=1}^{T} \left\{(P^*(a) - P^*(a_t)) + (P^*(a) - P^*(a'_t))\right\}\right]. \qquad (2)$$

1.2 Modeling Assumption

In this section, we clarify some of the notations and assumptions. Let the action space $A \subset \mathbb{R}^d$ be a compact convex set with non-empty interior. We denote the Euclidean norm by $\|\cdot\|$.

Assumption 1. There exist functions $f: A \to \mathbb{R}$ and $\sigma: \mathbb{R} \to [0, 1]$ such that the probability in noisy comparison feedback can be represented as follows:

$$P(a \succ a') = \sigma(f(a') - f(a)). \qquad (3)$$

In the following, we call $f$ in Assumption 1 a cost function and $\sigma$ a link function. Here, the cost function and the link function are fixed for each query to the oracle. In this sense, our setting is different from online optimization, where the objective function changes.

Definition 1. (Strong Convexity) A function $f: \mathbb{R}^d \to \mathbb{R}$ is $\alpha$-strongly convex over the set $A \subset \mathbb{R}^d$ if for all $x, y \in A$ it holds that

$$f(y) \geq f(x) + \nabla f(x)^\top (y - x) + \frac{\alpha}{2}\|y - x\|^2.$$

Definition 2. (Smoothness) A function $f: \mathbb{R}^d \to \mathbb{R}$ is $\beta$-smooth over the set $A \subset \mathbb{R}^d$ if for all $x, y \in A$ it holds that

$$f(y) \leq f(x) + \nabla f(x)^\top (y - x) + \frac{\beta}{2}\|y - x\|^2.$$

Assumption 2. The cost function $f: A \to \mathbb{R}$ is twice continuously differentiable, $L$-Lipschitz, $\alpha$-strongly convex and $\beta$-smooth with respect to the Euclidean norm.

From Assumption 2, there exists a unique minimizer $a^*$ of the cost function $f$ since $f$ is strictly convex. We set $B := \sup_{a, a' \in A} f(a') - f(a)$.

Assumption 3. The link function $\sigma: \mathbb{R} \to [0, 1]$ is three times differentiable and rotation-symmetric (i.e., $\sigma(-x) = 1 - \sigma(x)$). Its first derivative is positive and monotonically non-increasing on $[0, B]$.

1) Although the regret in (2) appears superficially different from that in Yue and Joachims (2009), the two regrets can be shown to coincide with each other under Assumptions 1-3 in Subsection 1.2.

For example, the standard logistic distribution function, the cumulative standard Gaussian distribution function and the linear function $\sigma(x) = (1 + x)/2$ can be taken to be link functions that satisfy Assumption 3. We note that link functions often behave like cumulative probability distribution functions. This is because the sign of the difference between two noisy function values can be regarded as feedback of the form (1) which satisfies Assumption 1, and then the link function $\sigma$ coincides with the cumulative probability distribution function of the noise (see Section 2 of Jamieson et al. (2012) for more details). We will discuss the relation of noisy comparison feedback to noisy function values in Section 5.

1.3 Related Work and Our Contributions

Dueling bandit on the continuous action space relates to various optimization methods.
We summarize related studies in the following.

Dueling bandit problem: Yue and Joachims (2009) formulated information retrieval systems as a dueling bandit problem. They reduced this to a problem of optimizing an "almost"-concave function and presented a stochastic gradient ascent algorithm based on one-point bandit feedback. Subsequently, they showed that their algorithm achieves an $O(T^{3/4})$-regret bound under differentiability and strict concavity of the utility function. Ailon et al. (2014) presented reduction methods from dueling bandit to the conventional bandit under the strong restriction that the link function is linear and showed that their algorithm achieves an $O(\sqrt{T \log^3 T})$-regret bound. We note that dueling bandit has a number of other formulations (Yue and Joachims, 2011; Yue et al., 2012; Busa-Fekete et al., 2013, 2014; Urvoy et al., 2013; Zoghi et al., 2014; Jamieson et al., 2015).

Optimization with one-point bandit feedback: In conventional bandit settings, various convex optimization methods have been studied. Flaxman et al. (2005) showed that the gradient of a smoothed version of a convex function can be estimated from one-point bandit feedback and proposed a stochastic gradient descent algorithm which achieves an $O(T^{3/4})$ regret bound under the Lipschitzness condition. Moreover, assuming strong convexity and smoothness of the convex function, Hazan and Levy (2014) proposed a stochastic mirror descent algorithm which achieves an $O(\sqrt{T \log T})$ regret bound and showed that the algorithm is near optimal because the upper bound matches the lower bound of $\Omega(\sqrt{T})$ derived by Shamir (2013) up to a logarithmic factor in bandit convex optimization.

Optimization with two-point bandit feedback: Dueling bandit algorithms require two actions at each round, in common with two-point bandit optimization. In the context of online optimization, Agarwal et al.
(2010) first considered convex optimization with two-point feedback. They proposed a gradient descent-based algorithm and showed that the algorithm achieves regret bounds of $O(\sqrt{T})$ under the Lipschitzness condition and $O(\log T)$ under the strong convexity condition. In stochastic convex optimization, Duchi et al. (2015) showed that a stochastic mirror descent algorithm achieves an $O(\sqrt{T})$ regret bound under the Lipschitzness (or the smoothness) condition and proved the upper bound to be optimal by deriving a matching lower bound $\Omega(\sqrt{T})$. Moreover, in both online and stochastic convex optimization, Shamir (2017) showed that a gradient descent-based algorithm achieves an $O(\sqrt{T})$ regret bound with optimal dependence on the dimension under the Lipschitzness condition. However, those two-point bandit algorithms strongly depend on the availability of the difference of function values and cannot be directly applied to the case of dueling bandit, where the difference of function values is compressed to one bit in noisy comparison feedback.

Optimization with noisy comparison feedback: The cost-based dueling bandit relates to function optimization with noisy comparisons (Jamieson et al., 2012; Matsui et al., 2016) because in both frameworks the feedback is represented by preference information. Jamieson et al. (2012) proposed a coordinate descent algorithm and proved that the convergence rate of the algorithm achieves an optimal order.2) Matsui et al. (2016) proposed a Newton method-based algorithm and proved that its convergence rate is almost equivalent to that of Jamieson et al. (2012). They further showed that their algorithm can easily be parallelized and performs better numerically than the dueling bandit algorithm in Yue and Joachims (2009). However, since they considered only the unconstrained case
However, since they considered only the unconstrained case\nin which A = Rd, it is not possible to apply their algorithm to the setting considered here, in which\nthe action space is compact.\n\np\n\np\n\n2)The optimal order changes depending on the model parameter (cid:20) (cid:21) 1 of the pairwise comparison oracle in\n\nJamieson et al. (2012).\n\n3\n\n\fOptimization with one-bit feedback: The optimization method of the dueling bandit algorithm is\nbased on one-bit feedback. In related work, Zhang et al. (2016) considered stochastic optimization\nunder one-bit feedback. However, since their approach was restricted to the problem of linear opti-\nmization with feedback generated by the logit model, it cannot be applied to the problem addressed\nin the current study.\nOur contributions: In this paper, we consider the cost-based dueling bandit under Assumptions\n1-3. While the formulation is similar to that of Yue and Joachims (2009), Assumptions 2 and 3 are\nstronger than those used in that study. On the other hand, we impose the weaker assumption on\nthe link function than that of Ailon et al. (2014). Yue and Joachims (2009) showed that a stochastic\ngradient descent algorithm can be applied to dueling bandit. Thus, it is naturally expected that a\nstochastic mirror descent algorithm, which achieves the (near) optimal order in convex optimiza-\ntion with one/two-point bandit feedback, can be applied to dueling bandit setting and achieves good\nperformance. Following this intuition, we propose a mirror descent-based algorithm. 
Our key contributions can be summarized as follows:

• We propose a stochastic mirror descent algorithm with noisy comparison feedback.
• We provide an $O(\sqrt{T \log T})$-regret bound for our algorithm in dueling bandit.
• We clarify the relation between the cost-based dueling bandit and convex optimization in terms of their regrets and show that our algorithm can be applied to convex optimization.
• We show that the convergence rate of our algorithm is $O(\sqrt{\log T / T})$ in convex optimization.
• We derive a lower bound in convex optimization with noisy comparison feedback and show our algorithm to be near optimal in both dueling bandit and convex optimization.

2 Algorithm and Main Result

2.1 Stochastic Mirror Descent Algorithm

We first prepare the notion of a self-concordant function, on which our algorithm is constructed (see e.g., Nesterov et al. (1994), Appendix F in Griva et al. (2009)).

Definition 3. A function $R: \mathrm{int}(A) \to \mathbb{R}$ is considered self-concordant if the following two conditions hold:

1. $R$ is three times continuously differentiable and convex, and approaches infinity along any sequence of points approaching the boundary of $\mathrm{int}(A)$.

2. For every $h \in \mathbb{R}^d$ and $x \in \mathrm{int}(A)$, $|\nabla^3 R(x)[h, h, h]| \leq 2(h^\top \nabla^2 R(x) h)^{3/2}$ holds, where

$$\nabla^3 R(x)[h, h, h] := \left.\frac{\partial^3}{\partial t_1 \partial t_2 \partial t_3} R(x + t_1 h + t_2 h + t_3 h)\right|_{t_1 = t_2 = t_3 = 0}.$$

In addition to these two conditions, if $R$ satisfies the following condition for a positive real number $\nu$, it is called a $\nu$-self-concordant function:

3. For every $h \in \mathbb{R}^d$ and $x \in \mathrm{int}(A)$, $|\nabla R(x)^\top h| \leq \nu^{1/2} (h^\top \nabla^2 R(x) h)^{1/2}$.

In this paper, we assume the Hessian $\nabla^2 R(a)$ of a $\nu$-self-concordant function to be full-rank over $A$ and $\nabla R(\mathrm{int}(A)) = \mathbb{R}^d$.
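As a concrete instance, the log barrier for the open unit ball is a standard self-concordant function, and the Dikin-ellipsoid property that underlies the algorithm's action choice can be checked numerically. The sketch below (the barrier choice and test point are illustrative, not taken from the paper) verifies that points of the form $a + \nabla^2 R(a)^{-1/2} u$ stay strictly inside the ball for unit vectors $u$:

```python
import numpy as np

def barrier_hessian(a):
    """Hessian of the ball barrier R(a) = -log(1 - ||a||^2).

    In closed form: with s = 1 - ||a||^2, grad R = 2a/s and
    hess R = (2/s) I + (4/s^2) a a^T.
    """
    s = 1.0 - float(np.dot(a, a))
    return (2.0 / s) * np.eye(len(a)) + (4.0 / s**2) * np.outer(a, a)

def hess_inv_sqrt(a):
    # H^{-1/2} via the eigendecomposition of the symmetric Hessian.
    w, V = np.linalg.eigh(barrier_hessian(a))
    return V @ np.diag(w ** -0.5) @ V.T

# Dikin-ellipsoid check at an interior point: a + H^{-1/2} u should
# remain inside the open unit ball for random unit directions u.
rng = np.random.default_rng(1)
a = np.array([0.6, -0.5])
M = hess_inv_sqrt(a)
norms = []
for _ in range(200):
    u = rng.standard_normal(2)
    u /= np.linalg.norm(u)
    norms.append(float(np.linalg.norm(a + M @ u)))
worst = max(norms)
```

Here `worst` stays below 1 even though `a` is already close to the boundary: the Hessian metric automatically shrinks steps taken toward the boundary, which is exactly what makes the perturbed comparison point in the algorithm of the next subsection feasible.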
Bubeck and Eldan (2014) showed that such a $\nu$-self-concordant function satisfying $\nu = (1 + o(1))d$ always exists as long as the dimension $d$ is sufficiently large. We next propose Algorithm 1, which we call NC-SMD. This can be regarded as stochastic mirror descent with noisy comparison feedback.

Algorithm 1 Noisy Comparison-based Stochastic Mirror Descent (NC-SMD)
Input: learning rate $\eta$, $\nu$-self-concordant function $R$, time horizon $T$, tuning parameters $\lambda, \mu$
Initialize: $a_1 = \mathrm{argmin}_{a \in A} R(a)$.
for $t = 1$ to $T$ do
  Update $R_t(a) = R(a) + \frac{\lambda \eta}{2} \sum_{i=1}^{t} \|a - a_i\|^2 + \mu \|a\|^2$
  Pick a unit vector $u_t$ uniformly at random
  Compare $a_t$ and $a'_t := a_t + \nabla^2 R_t(a_t)^{-1/2} u_t$ and receive $F(a'_t, a_t)$
  Set $\hat{g}_t = F(a'_t, a_t) \, d \, \nabla^2 R_t(a_t)^{1/2} u_t$
  Set $a_{t+1} = \nabla R_t^{-1}(\nabla R_t(a_t) - \eta \hat{g}_t)$
end for
Output: $a_{T+1}$

We make three remarks on Algorithm 1. First, the function $R_t$ is self-concordant though not $\nu$-self-concordant. The second remark is as follows. Let us denote the local norm by $\|a\|_w = \sqrt{a^\top \nabla^2 R(w) a}$. Then, if $R$ is a self-concordant function for $A$, the Dikin ellipsoid $\{a' \in A \mid \|a' - a\|_a \leq 1\}$ centered at $a$ is entirely contained in $\mathrm{int}(A)$ for any $a \in \mathrm{int}(A)$. Thus, $a'_t := a_t + \nabla^2 R_t(a_t)^{-1/2} u_t$ in Algorithm 1 is contained in $\mathrm{int}(A)$ for any $a_t \in \mathrm{int}(A)$ and any unit vector $u_t$. This shows a comparison between the actions $a_t$ and $a'_t$ to be feasible. Our third remark is as follows. Since the self-concordant function $R_t$ at round $t$ depends on the past actions $\{a_i\}_{i=1}^{t}$, it may be thought that those past actions must be stored during the learning process. However, note that only $\nabla R_t$ and $\nabla^2 R_t$ are used in the algorithm; $\nabla R_t$ depends only on the sum $\sum_{i=1}^{t} a_i$, and $\nabla^2 R_t$ does not depend on the past actions. Thus, only the sum of past actions must be stored, rather than all past actions.

2.2 Main Result: Regret Bound

From Assumption 2 and the compactness of $A$, the diameter $R$ of $A$ and $B := \sup_{a, a' \in A} f(a') - f(a)$ are finite. From Assumption 3, there exist positive constants $l_0$, $L_0$, $B_2$ and $L_2$ such that the first derivative $\sigma'$ of the link function is bounded as $l_0 \leq \sigma' \leq L_0$ on $[-B, B]$ and the second derivative $\sigma''$ is bounded above by $B_2$ and is $L_2$-Lipschitz on $[-B, B]$. We use these constants below. The following theorem shows that with appropriate parameters, NC-SMD (Algorithm 1) achieves an $O(\sqrt{T \log T})$-regret bound.

Theorem 4. Set $C := \nu + \frac{B_2 L_2 + (L+1) L_0 \beta}{2\lambda}$. When the tuning parameters satisfy $\lambda \leq l_0 \alpha / 2$ and $\mu \geq (L_0^3 L_2 / \lambda)^2$, and the total number $T$ of rounds satisfies $T \geq C \log T$, Algorithm 1 with a $\nu$-self-concordant function and the learning parameter $\eta = \frac{1}{2d}\sqrt{\frac{C \log T}{T}}$ achieves the following regret bound under Assumptions 1-3:

$$\mathrm{Reg}^{DB}_T \leq 4d\sqrt{C T \log T} + 2 L L_0 R. \qquad (4)$$

3 Regret Analysis

We prove Theorem 4 in this section. The proofs of the lemmas in this section are provided in the supplementary material.

3.1 Reduction to Locally-Convex Optimization

We first reduce the dueling bandit problem to a locally-convex optimization problem. We define $P_b(a) := \sigma(f(a) - f(b))$ for $a, b \in A$ and $P_t(a) := P_{a_t}(a)$. For a cost function $f$ and a self-concordant function $R$, we set $a^* := \mathrm{argmin}_{a \in A} f(a)$, $a_1 := \mathrm{argmin}_{a \in A} R(a)$ and $a^*_T := \frac{1}{T} a_1 + (1 - \frac{1}{T}) a^*$. The regret of dueling bandit is bounded as follows.

Lemma 5. The regret of Algorithm 1 is bounded as follows:

$$\mathrm{Reg}^{DB}_T \leq 2\,\mathbb{E}\left[\sum_{t=1}^{T} (P_t(a_t) - P_t(a^*_T))\right] + \frac{L L_0 \beta}{\lambda \eta} \log T + 2 L L_0 R. \qquad (5)$$

The following lemma shows that $P_b$ inherits the smoothness of $f$ globally.

Lemma 6. The function $P_b$ is $(L_0 \beta + B_2 L_2)$-smooth for an arbitrary $b \in A$.

Let $\mathbb{B}$ be the unit Euclidean ball, $B(a, \delta)$ the ball centered at $a$ with radius $\delta$ and $L(a, b)$ the line segment between $a$ and $b$. In addition, for $a, b \in A$, let $A_\delta(a, b) := \bigcup_{a' \in L(a, b)} B(a', \delta) \cap A$. The following lemma guarantees the local strong convexity of $P_b$.

Lemma 7. The function $P_b$ is $\frac{1}{2} l_0 \alpha$-strongly convex on $A_\delta(a^*, b)$ when $\delta \leq \frac{l_0 \alpha}{2 L_0^3 L_2}$.

3.2 Gradient Estimation

We note that $a_t + \nabla^2 R_t(a_t)^{-1/2} x$ for $x \in \mathbb{B}$ is included in $A$ due to the properties of the Dikin ellipsoid. We introduce the smoothed version of $P_t$ over $\mathrm{int}(A)$:

$$\hat{P}_t(a) := \mathbb{E}_{x \in \mathbb{B}}\left[P_t(a + \nabla^2 R_t(a_t)^{-1/2} x)\right] \qquad (6)$$
$$= \mathbb{E}_{x \in \mathbb{B}}\left[\sigma(f(a + \nabla^2 R_t(a_t)^{-1/2} x) - f(a_t))\right]. \qquad (7)$$

Next, we adopt the following estimator for the gradient of $\hat{P}_t$:

$$\hat{g}_t := F(a_t + \nabla^2 R_t(a_t)^{-1/2} u_t, a_t) \, d \, \nabla^2 R_t(a_t)^{1/2} u_t,$$

where $u_t$ is drawn uniformly from the unit sphere $S$. We then derive the unbiasedness of $\hat{g}_t$ as follows.

Lemma 8. $\mathbb{E}[\hat{g}_t \mid a_t] = \nabla \hat{P}_t(a_t)$.

3.3 Regret Bound with Bregman Divergence

From Lemma 5, the regret analysis in dueling bandit is reduced to the minimization problem of the regret-like value of $P_t$. Since $P_t$ is globally smooth and locally strongly convex from Lemmas 6 and 7, we can employ convex-optimization methods. Moreover, since $\hat{g}_t$ is an unbiased estimator for the gradient of the smoothed version of $P_t$ from Lemma 8, it is expected that stochastic mirror descent (Algorithm 1) with $\hat{g}_t$ is effective for the minimization problem.
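The key property behind this style of estimator — a single comparison bit taken against a random sphere perturbation yields, in expectation, a scaled gradient — can be checked numerically in a stripped-down setting. The sketch below replaces the local metric $\nabla^2 R_t(a_t)$ with the identity (plain $\delta$-radius sphere sampling), and the quadratic cost and logistic link are illustrative choices, so for small $\delta$ the estimates should average to roughly $\sigma'(0)\nabla f(a)$:

```python
import numpy as np

rng = np.random.default_rng(0)
d, delta, N = 2, 0.2, 200_000
a = np.array([1.0, 0.0])                      # current action, grad f(a) = a
f = lambda X: 0.5 * np.sum(X * X, axis=-1)    # toy cost
sigma = lambda x: 1.0 / (1.0 + np.exp(-x))    # logistic link, sigma'(0) = 1/4

# Draw N uniform unit vectors and query one comparison bit per vector:
# bit = 1 with probability sigma(f(a + delta*u) - f(a)), i.e. the event
# that the perturbed point looks worse than the current one.
U = rng.standard_normal((N, d))
U /= np.linalg.norm(U, axis=1, keepdims=True)
p = sigma(f(a + delta * U) - f(a))
bits = (rng.random(N) < p).astype(float)

# One-point estimate per query: g_hat = bit * (d / delta) * u.
g_mean = ((d / delta) * bits[:, None] * U).mean(axis=0)
```

With these choices the average lands near $\sigma'(0)\nabla f(a) = (0.25, 0)$, up to Monte-Carlo noise and a small smoothing bias; in the algorithm proper, a single such bit per round (measured in the local Hessian metric) drives the mirror-descent update.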
In the following, making use of the properties of stochastic mirror descent, we bound the regret-like value of $P_t$ by the Bregman divergence, and subsequently prove Theorem 4.

Definition 9. Let $R$ be a continuously differentiable strictly convex function on $\mathrm{int}(A)$. Then, the Bregman divergence associated with $R$ is defined by

$$D_R(a, b) = R(a) - R(b) - \nabla R(b)^\top (a - b).$$

Lemma 10. When $\lambda \leq l_0 \alpha / 2$ and $\mu \geq (L_0^3 L_2 / \lambda)^2$, the regret of Algorithm 1 is bounded for any $a \in \mathrm{int}(A)$ as follows:

$$\mathbb{E}\left[\sum_{t=1}^{T} (P_t(a_t) - P_t(a))\right] \leq \frac{1}{\eta}\left(R(a) - R(a_1) + \mathbb{E}\left[\sum_{t=1}^{T} D_{R_t^*}(\nabla R_t(a_t) - \eta \hat{g}_t, \nabla R_t(a_t))\right]\right) + \frac{L_0 \beta + B_2 L_2}{\lambda \eta} \log T, \qquad (8)$$

where $R_t^*(a) := \sup_{x \in \mathbb{R}^d} \langle x, a \rangle - R_t(x)$ is the Fenchel dual of $R_t$.

The Bregman divergence in Lemma 10 is bounded as follows.

Lemma 11. When $\eta \leq \frac{1}{2d}$, the sequence $a_t$ output by Algorithm 1 satisfies

$$D_{R_t^*}(\nabla R_t(a_t) - \eta \hat{g}_t, \nabla R_t(a_t)) \leq 4 d^2 \eta^2. \qquad (9)$$

[Proof of Theorem 4] From Lemma 4 of Hazan and Levy (2014), the $\nu$-self-concordant function $R$ satisfies

$$R(a^*_T) - R(a_1) \leq \nu \log \frac{1}{1 - \pi_{a_1}(a^*_T)},$$

where $\pi_a(a') := \inf\{r \geq 0 \mid a + r^{-1}(a' - a) \in A\}$ is the Minkowski function. Since $\pi_{a_1}(a^*_T) \leq 1 - T^{-1}$ from the definition of $a^*_T$, we obtain $R(a^*_T) - R(a_1) \leq \nu \log T$. Note that the condition $\eta \leq \frac{1}{2d}$ in Lemma 11 is satisfied due to $T \geq C \log T$.
Combining Lemmas 5, 10 and 11, we have

$$\mathrm{Reg}^{DB}_T \leq \frac{2}{\eta}\left(\nu \log T + 4 d^2 \eta^2 T\right) + \frac{B_2 L_2 + (L+1) L_0 \beta}{\lambda} \cdot \frac{\log T}{\eta} + 2 L L_0 R = \left(2\nu + \frac{B_2 L_2 + (L+1) L_0 \beta}{\lambda}\right)\frac{\log T}{\eta} + 8 d^2 T \eta + 2 L L_0 R.$$

Thus, when $\eta$ is defined as in Theorem 4, the regret bound (4) is obtained.

4 Convergence Rate in Convex Optimization

In the previous sections, we considered the minimization problem for the regret of dueling bandit. In this section, as an application of the approach, we show that the averaged action of NC-SMD (Algorithm 1) minimizes the cost function $f$ in (3).

To derive the convergence rate of our algorithm, we introduce the regret of function optimization and establish a connection between the regrets of dueling bandit and function optimization. In convex optimization with noisy comparison feedback, the learner chooses a pair $(a_t, a'_t)$ of actions in the learning process and suffers a loss $f(a_t) + f(a'_t)$. Then, the regret of the algorithm in function optimization is defined as follows:

$$\mathrm{Reg}^{FO}_T := \mathbb{E}\left[\sum_{t=1}^{T} (f(a_t) - f(a^*)) + (f(a'_t) - f(a^*))\right], \qquad (10)$$

where $a^* = \mathrm{argmin}_{a \in A} f$.

Recalling that the positive constants $l_0$ and $L_0$ satisfy $l_0 \leq \sigma' \leq L_0$ on $[-B, B]$, where $B := \sup_{a, a' \in A} f(a') - f(a)$, the regrets of function optimization (10) and dueling bandit (2) are related as follows:

Lemma 12.

$$\frac{\mathrm{Reg}^{DB}_T}{L_0} \leq \mathrm{Reg}^{FO}_T \leq \frac{\mathrm{Reg}^{DB}_T}{l_0}. \qquad (11)$$

Theorem 4 and Lemma 12 give an $O(\sqrt{T \log T})$-upper bound on the regret of function optimization of Algorithm 1 under the same conditions as Theorem 4.
Given the convexity of $f$, the average of the chosen actions of any dueling bandit algorithm, $\bar{a}_T := \frac{1}{2T}\sum_{t=1}^{T}(a_t + a'_t)$, satisfies

$$\mathbb{E}[f(\bar{a}_T) - f(a^*)] \leq \frac{\mathrm{Reg}^{FO}_T}{2T}. \qquad (12)$$

Thus, if an optimization algorithm has a sub-linear regret bound, the above online-to-batch conversion guarantees convergence to the optimal point.

Theorem 13. Under Assumptions 1-3, the averaged action $\bar{a}_T$ satisfies the following when $T \geq C \log T$:

$$\mathbb{E}[f(\bar{a}_T) - f(a^*)] \leq \frac{1}{l_0}\left(2d\sqrt{\frac{C \log T}{T}} + \frac{L L_0 R}{T}\right),$$

where $C$ is the constant defined in Theorem 4.

Theorem 13 shows the convergence rate of NC-SMD (Algorithm 1) to be $O(d\sqrt{\log T / T})$.

5 Lower Bound

We next derive a lower bound in convex optimization with noisy comparison feedback. To do so, we employ a lower bound of convex optimization with noisy function feedback. In a setting where the function feedback is noisy, we query a point $a \in A$ and obtain a noisy function value $f(a) + \xi$, where $\xi$ is a zero-mean random variable with a finite second moment that is independent for each query.3)

Theorem 14. Assume that the action space $A$ is the $d$-dimensional unit Euclidean ball $\mathbb{B}_d$ and that the link function $\sigma_G$ is the cumulative distribution function of the zero-mean Gaussian random variable with variance 2. Let the number of rounds $T$ be fixed. Then, for any algorithm with noisy comparison feedback, there exists a function $f$ over $\mathbb{B}_d$ which is twice continuously differentiable, 0.5-strongly convex and 3.5-smooth such that the output $a_T$ of the algorithm satisfies

$$\mathbb{E}[f(a_T) - f(a^*)] \geq 0.004 \min\left\{1, \frac{d}{\sqrt{2T}}\right\}. \qquad (13)$$

3) In general, the noise $\xi$ can depend on the action $a$. See e.g.
Shamir (2013) for more details.

[Proof] The probability distribution of noisy comparison feedback $F(a, a')$ with the link function $\sigma_G$ can be realized by noisy function feedback with standard Gaussian noise as follows. Two noisy function values $f(a) + \xi$ and $f(a') + \xi'$ for arbitrary $a, a' \in A$ can be obtained by using the noisy function feedback twice, where $\xi$ and $\xi'$ are independent standard Gaussian random variables. Then, the probability distribution of the following random variable coincides with that of $F(a, a')$:

$$\mathrm{sign}(f(a) + \xi - (f(a') + \xi')) = \mathrm{sign}(f(a) - f(a') + (\xi - \xi')). \qquad (14)$$

Here, note that $\xi - \xi'$ is a zero-mean Gaussian random variable with variance 2. Thus, a single noisy comparison feedback with the link function $\sigma_G$ for any pair of actions can be obtained by using noisy function feedback with standard Gaussian noise twice. This means that if any algorithm with $2T$-times noisy function feedback is unable to achieve a certain performance, any algorithm with $T$-times noisy comparison feedback is similarly unable to achieve that performance. Thus, to derive Theorem 14, it is sufficient to show a lower bound of the convergence rate with noisy function feedback. The following lower bound is derived by Theorem 7 of Shamir (2013).

Theorem 15. (Shamir, 2013) Let the number of rounds $T$ be fixed. Suppose that the noise $\xi$ at each round is a standard Gaussian random variable.
Then, for any algorithm with noisy function feedback, there exists a function $f$ over $\mathbb{B}_d$ which is twice continuously differentiable, 0.5-strongly convex and 3.5-smooth such that the output $a_T$ satisfies

$$\mathbb{E}[f(a_T) - f(a^*)] \geq 0.004 \min\left\{1, \frac{d}{\sqrt{T}}\right\}.$$

By the above discussion and from Theorem 15, we obtain Theorem 14.

Combining Theorem 13 and Theorem 14, the convergence rate of NC-SMD (Algorithm 1) is near optimal with respect to the number of rounds $T$. In addition, when the parameter $\nu$ of the self-concordant function is of constant order with respect to the dimension $d$ of the space $A$, the convergence rate of NC-SMD is also optimal with respect to $d$. However, it should be noted that the parameter $\nu$ of a self-concordant function is often of the order of $\Theta(d)$ for compact convex sets, including the simplex and the hypercube.

As a consequence of Lemma 12, (12), and Theorems 4 and 14, the optimal regrets of dueling bandit and function optimization are of the order $\sqrt{T}$ except for the logarithmic factor, and NC-SMD achieves this order. To the best of our knowledge, this is the first algorithm with the optimal order in the continuous dueling bandit setting with a non-linear link function.

Finally, we provide an interesting observation on convex optimization. When noisy function feedback is available, the optimal regret of function optimization is of the order $\Theta(\sqrt{T})$ under strong convexity and smoothness conditions (Shamir, 2013).
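The reduction used in the proof of Theorem 14 is easy to verify numerically: the sign of the difference of two noisy function values reproduces comparison feedback whose link is the CDF of an $N(0, 2)$ variable. A small simulation (the cost gap `c` below is an arbitrary illustrative value):

```python
import math
import numpy as np

# Reduction (14): with independent standard Gaussian noises xi, xi',
# the event  f(a) + xi - (f(a') + xi') > 0  occurs with probability
# Phi((f(a) - f(a')) / sqrt(2)) -- the Gaussian link evaluated at the
# cost gap -- since xi - xi' ~ N(0, 2).
rng = np.random.default_rng(0)
c = 0.7                        # illustrative gap f(a) - f(a')
N = 200_000
xi = rng.standard_normal(N)
xi_prime = rng.standard_normal(N)
empirical = float(np.mean(c + xi - xi_prime > 0))

# CDF of N(0, 2) at c, written via erf: 0.5 * (1 + erf(c / 2)).
theoretical = 0.5 * (1.0 + math.erf(c / 2.0))
```

Since the two laws agree, $T$ noisy-comparison queries can be simulated by $2T$ noisy-value queries, which is exactly how the lower bound of Shamir (2013) transfers to the comparison setting.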
However, even when noisy function feedback is \"compressed\" into one-bit information as in (14), our results show that NC-SMD (Algorithm 1) achieves almost the same order $O(\sqrt{T \log T})$ for the regret of function optimization, as long as the cumulative probability distribution of the noise satisfies Assumption 3.4)

6 Conclusion

We considered a dueling bandit problem over a continuous action space and proposed a stochastic mirror descent algorithm. By introducing Assumptions 1-3, we proved that our algorithm achieves an $O(\sqrt{T \log T})$-regret bound. We further considered convex optimization under noisy comparison feedback and showed that the regrets of dueling bandit and function optimization are essentially equivalent. Using the connection between the two regrets, it was shown that our algorithm achieves a convergence rate of $O(\sqrt{\log T / T})$ in the framework of function optimization with noisy comparison feedback. Moreover, we derived a lower bound of the convergence rate in convex optimization and showed that our algorithm achieves near optimal performance in dueling bandit and convex optimization. Some open questions still remain. While we have only dealt with bounds which hold in expectation, the derivation of a high-probability bound is a problem that has not been solved. While the analysis of our algorithm relies on strong convexity and smoothness, a regret bound without these conditions is also important.

4) Jamieson et al. (2012) provided a similar observation. However, their upper bound of the regret was derived only when the action space is the whole of Euclidean space (i.e., $A = \mathbb{R}^d$), and their assumption for noisy comparison feedback is different from ours (Assumption 1).

Acknowledgment

We would like to thank Professor Takafumi Kanamori for helpful comments. This work was supported by JSPS KAKENHI Grant Number 17K12653.

References

[1] A. Agarwal, O. Dekel, and L.
Xiao (2010) \u201cOptimal Algorithms for Online Convex Optimization with\n\nMulti-Point Bandit Feedback.,\u201d in COLT, pp. 28\u201340, Citeseer.\n\n[2] N. Ailon, T. Joachims, and Z. Karnin (2014) \u201cReducing dueling bandits to cardinal bandits,\u201d arXiv\n\npreprint arXiv:1405.3396.\n\n[3] S. Bubeck and R. Eldan (2014) \u201cThe entropic barrier: a simple and optimal universal self-concordant\n\nbarrier,\u201d arXiv preprint arXiv:1412.1587.\n\n[4] R. Busa-Fekete, E. H\u00fcllermeier, and B. Sz\u00f6r\u00e9nyi (2014) \u201cPreference-based rank elicitation using statis-\ntical models: The case of Mallows,\u201d in Proceedings of the 31st International Conference on Machine\nLearning (ICML-14), pp. 1071\u20131079.\n\n[5] R. Busa-Fekete, B. Szorenyi, W. Cheng, P. Weng, and E. H\u00fcllermeier (2013) \u201cTop-k selection based on\nadaptive sampling of noisy preferences,\u201d in International Conference on Machine Learning, pp. 1094\u2013\n1102.\n\n[6]\n\nJ. C. Duchi, M. I. Jordan, M. J. Wainwright, and A. Wibisono (2015) \u201cOptimal rates for zero-order\nconvex optimization: The power of two function evaluations,\u201d IEEE Transactions on Information Theory,\nVol. 61, pp. 2788\u20132806.\n\n[7] A. D. Flaxman, A. T. Kalai, and H. B. McMahan (2005) \u201cOnline convex optimization in the bandit set-\nting: gradient descent without a gradient,\u201d in Proceedings of the sixteenth annual ACM-SIAM symposium\non Discrete algorithms, pp. 385\u2013394, Society for Industrial and Applied Mathematics.\n\n[8]\n\nI. Griva, S. G. Nash, and A. Sofer (2009) Linear and nonlinear optimization: Siam\nAppendix\nhttp://math.gmu.edu/~igriva/book/topics.html.\n\navailable\n\ncontains\n\nF\n\nwhich\n\n(F.2)\n\nis\n\nat\n\nthe\n\nfollowing\n\nURL:\n\n[9] E. Hazan and K. Levy (2014) \u201cBandit convex optimization: Towards tight bounds,\u201d in Advances in\n\nNeural Information Processing Systems, pp. 784\u2013792.\n\n[10] K. G. Jamieson, S. Katariya, A. Deshpande, and R. D. 
Nowak (2015) "Sparse Dueling Bandits," in AISTATS.

[11] K. G. Jamieson, R. Nowak, and B. Recht (2012) "Query complexity of derivative-free optimization," in Advances in Neural Information Processing Systems, pp. 2672-2680.

[12] K. Matsui, W. Kumagai, and T. Kanamori (2016) "Parallel distributed block coordinate descent methods based on pairwise comparison oracle," Journal of Global Optimization, pp. 1-21.

[13] Y. Nesterov, A. Nemirovskii, and Y. Ye (1994) Interior-Point Polynomial Algorithms in Convex Programming, Vol. 13: SIAM.

[14] O. Shamir (2013) "On the Complexity of Bandit and Derivative-Free Stochastic Convex Optimization," in COLT, pp. 3-24.

[15] O. Shamir (2017) "An Optimal Algorithm for Bandit and Zero-Order Convex Optimization with Two-Point Feedback," The Journal of Machine Learning Research, Vol. 18, pp. 1-11.

[16] T. Urvoy, F. Clerot, R. Féraud, and S. Naamane (2013) "Generic Exploration and K-armed Voting Bandits," in ICML (2), pp. 91-99.

[17] Y. Yue, J. Broder, R. Kleinberg, and T. Joachims (2012) "The k-armed dueling bandits problem," Journal of Computer and System Sciences, Vol. 78, pp. 1538-1556.

[18] Y. Yue and T. Joachims (2009) "Interactively optimizing information retrieval systems as a dueling bandits problem," in Proceedings of the 26th Annual International Conference on Machine Learning, pp. 1201-1208, ACM.

[19] Y. Yue and T. Joachims (2011) "Beat the mean bandit," in Proceedings of the 28th International Conference on Machine Learning (ICML-11), pp. 241-248.

[20] L. Zhang, T. Yang, R. Jin, Y. Xiao, and Z.-H. Zhou (2016) "Online stochastic linear optimization under one-bit feedback," in International Conference on Machine Learning, pp. 392-401.

[21] M. Zoghi, S. Whiteson, R. Munos, and M. de Rijke (2014) "Relative upper confidence bound for the k-armed dueling bandit problem," in JMLR Workshop and Conference Proceedings, No. 32, pp. 10-18, JMLR.