How to Rebut a Review That Says "This Paper Just Combines Existing Methods"
A. Combining known techniques is not always easy in the first place
Example:
it is known that gradient descent is optimal among first-order methods for finding first-order stationary points in the non-convex smooth case. Therefore, it is in general impossible to combine it with Nesterov's acceleration to achieve a theoretically better rate.
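In symbols (our gloss on the known result, not part of the original rebuttal; see, e.g., the lower bounds of Carmon, Duchi, Hinder, and Sidford): for an $L$-smooth non-convex $f$ with initial gap $\Delta = f(x_0) - \inf_x f(x)$, gradient descent already matches the first-order lower bound, so no combination with acceleration can improve the worst-case rate:

\[
T_{\mathrm{GD}}(\epsilon) = O\!\left(\frac{L\Delta}{\epsilon^{2}}\right)
\quad\text{and}\quad
T(\epsilon) = \Omega\!\left(\frac{L\Delta}{\epsilon^{2}}\right) \text{ for every first-order method},
\]

where $T(\epsilon)$ is the number of gradient queries needed to find a point with $\|\nabla f(x)\| \le \epsilon$.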
B. Even when techniques can be combined, proposing a good combination and analyzing it is non-trivial
Example:
For example, Nesterov's acceleration was discovered in the 1980s and SGD was proposed in the 1950s, but the first Accelerated SGD was proposed and analyzed only in 2012.
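For context (our addition, assuming the rebuttal refers to Ghadimi and Lan's 2012 accelerated stochastic approximation): the combined analysis is valuable precisely because, for convex $L$-smooth problems with gradient-noise variance $\sigma^2$, it attains both the accelerated deterministic term and the optimal stochastic term,

\[
\mathbb{E}\!\left[f(x_k)\right] - f^{\star} = O\!\left(\frac{L\|x_0 - x^{\star}\|^{2}}{k^{2}} + \frac{\sigma\|x_0 - x^{\star}\|}{\sqrt{k}}\right),
\]

a rate that neither plain SGD nor deterministic Nesterov acceleration achieves on its own.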
C. Point to page 7, where the paper explains just how non-trivial the combination is
Reviewer comment:
This paper combines several techniques from existing works, such as VR-MARINA and the robust aggregator. This makes the novelty boil down to applying existing works to the Byzantine setting, which seems not sufficient for conferences like ICLR.
Response:
We believe that it is not always an easy task to combine known techniques to obtain improvements in the convergence rates. Some techniques cannot be combined in general: for example, it is known that gradient descent is optimal among first-order methods for finding first-order stationary points in the non-convex smooth case. Therefore, it is in general impossible to combine it with Nesterov's acceleration to achieve a theoretically better rate. Next, even when the techniques are combinable, it can be a non-trivial task to propose a good combination and to rigorously analyze it. For example, Nesterov's acceleration was discovered in the 1980s and SGD was proposed in the 1950s, but the first Accelerated SGD was proposed and analyzed only in 2012.
Moreover, in the paragraph "Challenges in designing variance-reduced algorithm with tight rates and provable Byzantine-robustness" (page 7), we explain why it is not trivial to achieve the results that we derive. Although our work positions Byz-VR-MARINA as a natural combination of variance reduction and Byzantine-robustness, it was not clear beforehand whether VR-MARINA and robust aggregation could be combined, whether such a combination was worth considering, or whether it would lead to new SOTA theoretical results in Byzantine-robust distributed optimization.
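To make the combination concrete, here is a minimal sketch (ours, hypothetical, and deliberately naive; it is not Byz-VR-MARINA itself) of the pattern the rebuttal discusses: each worker sends a gradient estimate, and the server replaces plain averaging with a robust aggregator such as the coordinate-wise median. The names (coordinate_wise_median, robust_vr_step) are illustrative.

import numpy as np

def coordinate_wise_median(msgs: np.ndarray) -> np.ndarray:
    """A classical robust aggregator: per-coordinate median over workers.

    msgs has shape (n_workers, dim); a bounded fraction of rows may be
    arbitrary (Byzantine), but each coordinate's median stays close to
    the honest majority's values.
    """
    return np.median(msgs, axis=0)

def robust_vr_step(x, worker_grad_estimates, lr=0.1):
    """One server step: robustly aggregate the estimates, then descend.

    In a variance-reduced method, worker_grad_estimates would be
    SARAH/PAGE-style recursive estimators g_i(x). The subtlety the
    rebuttal points to: plain averaging keeps the variance-reduction
    analysis linear, but the median is a biased, nonlinear aggregator,
    so the standard proof does not apply as-is.
    """
    g = coordinate_wise_median(np.stack(worker_grad_estimates))
    return x - lr * g

# Toy usage on f(x) = ||x||^2 / 2 (so grad f(x) = x), with plain noisy
# gradients standing in for VR estimators: 10 honest workers plus
# 2 Byzantine workers sending arbitrary junk.
rng = np.random.default_rng(0)
x = rng.normal(size=5)
for _ in range(100):
    honest = [x + 0.01 * rng.normal(size=5) for _ in range(10)]
    byzantine = [1e6 * rng.normal(size=5) for _ in range(2)]
    x = robust_vr_step(x, honest + byzantine)
print(np.linalg.norm(x))  # small: the median ignored the outliers

With plain averaging, the two Byzantine rows would dominate the aggregate and the iterates would diverge; the point of the sketch is only that swapping in the median is easy to write down, while proving tight rates for the resulting nonlinear update is the hard part.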