Methods of Federated Learning


1. Introduction

・The flow of federated learning is as follows.

①:A central server distributes the global model to each local terminal.
②:Each local terminal trains on the data it holds and updates the model.
③:The models updated by each local terminal are aggregated centrally to create a new global model.
④:Repeat ①〜③.

・When the local models are aggregated into a global model at the center in step ③, an algorithm called "FedAvg" (taking the average of the models) is typically used.
・However, FedAvg is not perfect and may produce an inaccurate global model.

Causes of poor global model accuracy
・A model that is clearly different from the others (an outlier) is mixed into the aggregation.
・When each local terminal updates the model with its own data, bias in that data prevents a meaningful average from being obtained.

・The following algorithms are used to solve these problems.
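
Before looking at them, here is a minimal sketch of one plain FedAvg round for comparison (the function names and the flat-array model representation are illustrative assumptions, not from a specific library).

```python
import numpy as np

# A minimal sketch of one round of FedAvg.
# Each model is represented as a flat NumPy array of parameters.

def fedavg(local_models, weights=None):
    """③: aggregate local models by (weighted) parameter averaging."""
    return np.average(np.stack(local_models), axis=0, weights=weights)

def federated_round(global_model, client_datasets, local_update):
    local_models = []
    for data in client_datasets:
        # ①: distribute the global model, ②: each client trains on its own data
        local_models.append(local_update(global_model.copy(), data))
    # ③: aggregate the updated models into a new global model
    return fedavg(local_models)
```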

2. Krum

Krum is an aggregation rule that can tolerate up to $f$ attackers (clients that intentionally interfere with model aggregation).

・Gradient estimates $(V_1,...,V_n)$ of the cost function are collected from the local terminals.
・For each estimate $V_i$, the squared Euclidean distances to its nearest $n-f-2$ estimates are summed to obtain $scores(i)$.
・The Krum function returns the estimate $V_{i_{*}}$ for the index $i_*$ that minimizes $scores(i)$.

\newcommand{\argmin}{\mathop{\rm arg~min}\limits}
\begin{align}
scores(i)&=\sum_{i \rightarrow j}||V_i-V_j||^2 \\
i_*&=\argmin_{i} scores(i)\\
Krum&=V_{i_{*}}
\end{align}\\

\left\{
\begin{array}{ll}
V_i: Gradient estimates collected from local models\\
i \rightarrow j: the n-f-2 estimates nearest to V_i\\
V_{i_{*}}: estimated value obtained by the Krum function\\
f: Number of attackers (abnormal models)\\
n: Number of clients\\
i_*: i that minimizes scores(i)
\end{array}
\right.
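
Below is a minimal NumPy sketch of the Krum rule described above (the function name `krum` and the array layout are assumptions for illustration).

```python
import numpy as np

def krum(V, f):
    """V: array of shape (n, d), one gradient estimate per client.
    f: assumed number of attackers. Returns V_{i*}."""
    n = len(V)
    # pairwise squared Euclidean distances between all estimates
    dists = np.sum((V[:, None, :] - V[None, :, :]) ** 2, axis=-1)
    scores = []
    for i in range(n):
        d = np.sort(np.delete(dists[i], i))   # distances from V_i to the others
        scores.append(d[: n - f - 2].sum())   # sum over the n-f-2 nearest
    return V[int(np.argmin(scores))]          # the estimate minimizing scores(i)
```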

3. Geometric Median

The global model is updated with the "median" of the aggregated local models' parameters.

・In FedAvg, the global model was created by averaging the models; with the geometric median, a median-like point is computed instead to create the global model.
・This is meant to be a robust mean estimator: outlier values from adversarial submissions or ordinary errors have less effect on the final result.
・The median is calculated as follows.

\newcommand{\argmin}{\mathop{\rm arg~min}\limits}
\begin{align}
Geometric Median&=\argmin_{y\in\mathbb{R}^{n}}\sum_{i=1}^{m}||x_i-y||_2 \\
&=\argmin_{y\in\mathbb{R}^{n}}\sum_{i=1}^{m}\sqrt{\sum_{j=1}^{n}(x_{i,j}-y_j)^2}
\end{align}\\


\left\{
\begin{array}{ll}
y: global model\\
x_i: local model\\
m: number of local models\\
n: number of model parameters
\end{array}
\right.
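
The argmin above has no closed form; Weiszfeld's algorithm is one standard way to approximate it (using it here is my assumption, not something stated in the article).

```python
import numpy as np

def geometric_median(X, n_iter=100, eps=1e-8):
    """X: array of shape (m, n), one local model per row."""
    y = X.mean(axis=0)                        # start from the plain average
    for _ in range(n_iter):
        d = np.linalg.norm(X - y, axis=1)     # ||x_i - y||_2 for each model
        w = 1.0 / np.maximum(d, eps)          # inverse-distance weights
        y_new = (w[:, None] * X).sum(axis=0) / w.sum()
        if np.linalg.norm(y_new - y) < eps:   # stop once the point settles
            return y_new
        y = y_new
    return y
```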

4. Coordinate-Wise Median

The global model is updated with the coordinate-wise "median" of the aggregated local models' parameters.

・As with the Geometric Median, a median rather than an average is used to create the global model.
・Unlike the Geometric Median, the distance between two points is the Manhattan distance ($||x_i-y||_1$). Since this objective decomposes per coordinate, the minimizer is simply the median of each parameter taken independently.
・This is meant to be a robust mean estimator: outlier values from adversarial submissions or ordinary errors have less effect on the final result.

\newcommand{\argmin}{\mathop{\rm arg~min}\limits}
\begin{align}
CoordinateWise Median&=\argmin_{y\in\mathbb{R}^{n}}\sum_{i=1}^{m}||x_i-y||_1 \\
&=\argmin_{y\in\mathbb{R}^{n}}\sum_{i=1}^{m}\sum_{j=1}^{n}|x_{i,j}-y_j|
\end{align}\\

\left\{
\begin{array}{ll}
y: global model\\
x_i: local model\\
m: number of local models\\
n: number of model parameters
\end{array}
\right.
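
Because the minimizer is the per-coordinate median, the aggregation is a one-liner in NumPy (a sketch under the same flat-array assumption as above).

```python
import numpy as np

def coordinate_wise_median(X):
    """X: array of shape (m, n), one local model per row."""
    return np.median(X, axis=0)               # median over clients, per coordinate
```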

5. FedProx

A proximal term is added to the local training loss function to limit how much the local model can change.

・The proximal term is the term $(||w−w_g||^2)$ that represents the difference between the locally obtained weights and the global model's weights.
・This restriction prevents the performance degradation caused by non-IID data, i.e., data that is distributed non-uniformly across the local terminals.
・Data heterogeneity: if there is a gap between clients' data (one client has only images of dogs, another only images of cats), the aggregated global model performs poorly. Using FedProx produces an agnostic global model (predicting both dogs and cats) that maintains its performance.

\tilde{L}(w)=L(w)+α||w-w_g||^2 \\

\left\{
\begin{array}{ll}
\tilde{L}(w): loss function constrained by FedProx\\
L(w): local loss function\\
w: locally obtained weights\\
w_g: global model weights\\
α: hyperparameter controlling the proximal term
\end{array}
\right.
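
A minimal PyTorch sketch of this loss (the function name and parameter handling are illustrative assumptions; the FedProx paper writes the coefficient as $\mu/2$ rather than $α$).

```python
import torch

def fedprox_loss(base_loss, local_params, global_params, alpha):
    """base_loss: L(w). Returns L~(w) = L(w) + alpha * ||w - w_g||^2."""
    prox = sum(((w - w_g.detach()) ** 2).sum()
               for w, w_g in zip(local_params, global_params))
    return base_loss + alpha * prox

# usage during a client's local step (model and global_model assumed defined):
# loss = fedprox_loss(criterion(model(x), t),
#                     model.parameters(), global_model.parameters(), alpha=0.01)
```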

6. FedCurv

The Fisher information matrix is used to evaluate the importance of each parameter in the model, and larger penalties are imposed on changes to more important parameters.

・The penalty is imposed to reduce the performance degradation caused by uneven computational processing speeds across local terminals.
・Computational processing speed: on local terminals, differences in machine performance cause differences in processing speed, which degrades the quality of the aggregation.

\tilde{L}_s(w)=L_s(w)+α\sum_{j∈S\setminus\{s\}}(w-\hat{w}_j)^T diag(\hat{I}_j)(w-\hat{w}_j) \\

\left\{
\begin{array}{ll}
\tilde{L}_s(w): loss function constrained by FedCurv\\
L_s(w): local loss function\\
S: nodes participating in training\\
s: current node\\
\hat{w}_j: weights obtained by training node j\\
\hat{I}_j: Fisher information matrix corresponding to \hat{w}_j\\
w: locally obtained weights\\
α: hyperparameter controlling the penalty term
\end{array}
\right.
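
A minimal PyTorch sketch of the penalty above (the function name and the `(w_hat_j, fisher_diag_j)` pairing are illustrative assumptions; only the diagonal of each Fisher matrix is used, as in the formula).

```python
import torch

def fedcurv_loss(base_loss, local_params, others, alpha):
    """others: list of (w_hat_j, fisher_diag_j) pairs for nodes j in S \\ {s},
    each a list of tensors shaped like the local parameters."""
    penalty = 0.0
    for w_hat_j, fisher_j in others:
        for w, wj, ij in zip(local_params, w_hat_j, fisher_j):
            # (w - w_hat_j)^T diag(I_hat_j) (w - w_hat_j), written elementwise
            penalty = penalty + (ij * (w - wj) ** 2).sum()
    return base_loss + alpha * penalty
```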

7. Trimmed Mean

A global model is created by averaging the models with a trimmed mean.

・The trimmed mean sorts the data in ascending order, excludes the top and bottom XX%, and averages the remaining data.
・The models collected from each local terminal are sorted and the trimmed mean is taken to create the global model.
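
A minimal NumPy sketch of coordinate-wise trimmed-mean aggregation (the 10% default and the per-coordinate sorting are assumptions for illustration).

```python
import numpy as np

def trimmed_mean(X, trim=0.1):
    """X: array of shape (m, n), one local model per row.
    trim: fraction removed from each end of every coordinate."""
    m = X.shape[0]
    k = int(m * trim)                         # models cut from each end
    X_sorted = np.sort(X, axis=0)             # sort every coordinate ascending
    return X_sorted[k : m - k].mean(axis=0)   # average the remaining values
```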
