# An exercice with convex analysis and neural networks

This is a note about a simple use of convex analysis in relation with neural networks. There are many points of contact between convex analysis and neural networks, but I have not been able to locate this one, thanks for pointing me to a source, if any.

Let’s start with a directed graph with set of nodes $N$ (these are the neurons) and a set of directed bonds $B$. Each bond has a source and a target, which are neurons, therefore there are source and target functions

$s:B \rightarrow N$   , $t:B \rightarrow N$

so that for any bond $x \in B$ the neuron $a = s(x)$ is the source of the bond and the neuron $b = t(x)$ is the target of the bond.

For any neuron $a \in N$:

• let $in(a) \subset B$ be the set of bonds $x \in B$ with target $t(x)=a$,
• let $out(a) \subset B$ be the set of bonds $x \in B$ with source $s(x)=a$.

A state of the network is a function $u: B \rightarrow V^{*}$ where $V^{*}$ is the dual of a real vector space $V$. I’ll explain why in a moment, but it’s nothing strange: I’ll suppose that $V$ and $V^{*}$ are dual topological vector spaces, with duality product denoted by $(u,v) \in V \times V^{*} \mapsto \langle v, u \rangle$ such that any linear and continuous function from $V$ to the reals is expressed by an element of $V^{*}$ and, similarly, any linear and continuous function from $V^{*}$ to the reals is expressed by an element of $V$.

If you think that’s too much, just imagine $V=V^{*}$ to be finite euclidean vector space with the euclidean scalar product denoted with the $\langle , \rangle$ notation.

A weight of the network is a function $w:B \rightarrow Lin(V^{*}, V)$, you’ll see why in a moment.

Usually the state of the network is described by a function which associates to any bond $x \in B$ a real value $u(x)$. A weight is a function which is defined on bonds and with values in the reals. This corresponds to the choice $V = V^{*} = \mathbb{R}$ and $\langle v, u \rangle = uv$. A linear function from $V^{*}$ to $V$ is just a real number $w$.

The activation function of a neuron $a \in N$ gives a relation between the values of the state on the input bonds and the values of the state of the output bonds: any value of an output bond is a function of the weighted sum of the values of the input bonds. Usually (but not exclusively) this is an increasing continuous function.

The integral of an increasing continuous function is a convex function. I’ll call this integral the activation potential $\phi$ (suppose it does not depends on the neuron, for simplicity). The relation between the input and output values is the following:

for any neuron $a \in N$ and for any bond $y \in out(a)$ we have

$u(y) = D \phi ( \sum_{x \in in(a)} w(x) u(x) )$.

This relation generalizes to:

for any neuron $a \in N$ and for any bond $y \in out(a)$ we have

$u(y) \in \partial \phi ( \sum_{x \in in(a)} w(x) u(x) )$

where $\partial \phi$ is the subgradient of a convex and lower semicontinuous activation potential

$\phi: V \rightarrow \mathbb{R} \cup \left\{ + \infty \right\}$

Written like this, we are done with any smoothness assumptions, which is one of the strong features of convex analysis.

This subgradient relation also explains the maybe strange definition of states and weights with the vector spaces $V$ and $V^{*}$.

This subgradient relation can be expressed as the minimum of a cost function. Indeed, to any convex function $phi$ is associated a sync  (means “syncronized convex function, notion introduced in [1])

$c: V \times V^{*} \rightarrow \mathbb{R} \cup \left\{ + \infty \right\}$

$c(u,v) = \phi(u) + \phi^{*}(v) - \langle v, u \rangle$

where $\phi^{*}$ is the Fenchel dual of the function $\phi$, defined by

$\phi^{*}(v) = \sup \left\{ \langle v, u \rangle - \phi(u) \right\}$

This sync has the following properties:

• it is convex in each argument
• $c(u,v) \geq 0$ for any $(u,v) \in V \times V^{*}$
• $c(u,v) = 0$ if and only if $v \in \partial \phi(u)$.

With the sync we can produce a cost associated to the neuron: for any $a \in N$, the contribution to the cost of the state $u$ and of the weight $w$ is

$\sum_{y \in out(a)} c(\sum_{x \in in(a)} w(x) u(x) , u(y) )$.

The total cost function $C(u,w)$ is

$C(u,w) = \sum_{a \in N} \sum_{y \in out(a)} c(\sum_{x \in in(a)} w(x) u(x) , u(y) )$

and it has the following properties:

• $C(u,w) \geq 0$ for any state $u$ and any weight $w$
• $C(u,w) = 0$ if and only if for any neuron $a \in N$ and for any bond $y \in out(a)$ we have

$u(y) \in \partial \phi ( \sum_{x \in in(a)} w(x) u(x) )$

so that’s a good cost function.

Example:

• take $\phi$ to be the softplus function $\phi(u) =\ln(1+\exp(x))$
• then the activation function (i.e. the subgradient) is the logistic function
• and the Fenchel dual of the softplus function is the (negative of the) binary entropy $\phi^{*}(v) = v \ln(v) + (1-v) \ln(1-v)$ (extended by $0$ for $v = 0$ or $v = 1$ and equal to $+ \infty$ outside the closed interval $[0,1]$).

________

[1] Blurred maximal cyclically monotone sets and bipotentials, with Géry de Saxcé and Claude Vallée, Analysis and Applications 8 (2010), no. 4, 1-14, arXiv:0905.0068

_______________________________

# A symplectic Brezis-Ekeland-Nayroles principle

We submitted to the arXiv the article

Marius Buliga, Gery de Saxce, A symplectic Brezis-Ekeland-Nayroles principle

You can find here the slides of two talks given in Lille and Paris a while ago,  where the article has been announced.

UPDATE: The article appeared, as  arXiv:1408.3102

This is, we hope, an important article! Here is why.

The Brezis-Ekeland-Nayroles principle appeared in two articles from 1976, the first by Brezis-Ekeland, the second by Nayroles. These articles appeared too early, compared to the computation power of the time!

We call the principle by the initials of the names of the inventors: the BEN principle.

The BEN principle asserts that the curve of evolution of a elasto-plastic body minimizes a certain functional, among all possible evolution curves which are compatible with the initial and boundary conditions.

This opens the possibility to find, at once the evolution curve, instead of constructing it incrementally with respect to time.

In 1976 this was SF for the computers of the moment. Now it’s the right time!

Pay attention to the fact that a continuous mechanics system has states belonging to an infinite dimensional space (i.e. has an infinite number of degrees of freedom), therefore we almost never hope to find, nor need the exact solution of the evolution problem. We are happy for all practical purposes with approximate solutions.

We are not after the exact evolution curve, instead we are looking for an approximate evolution curve which has the right quantitative approximate properties, and all the right qualitative exact properties.

In elasto-plasticity (a hugely important class of materials for engineering applications) the evolution equations are moreover not smooth. Differential calculus is conveniently and beautifully replaced by convex analysis.

Another aspect is that elasto-plastic materials are dissipative, therefore there is no obvious hope to treat them with the tools of hamiltonian mechanics.

Our symplectic BEN principle does this: one principle covers the dynamical, dissipative evolution of a body, in a way which can be reasonably easy amenable to numerical applications.

_______________________________________

# Heisenberg group, convex analysis, alike or not?

I posted on mathoverflow a question, with the purpose of clarifying the feelings I have concerning the formal resemblance between Heisenberg group and cyclically monotone operators from convex analysis. The question got an answer which made me realize something trivial (but sometimes we need somebody else to point to the obvious). However, I still think there is something worthy of further consideration here, that’s why I post it.

Setting: Let $S$ be a semigroup (i.e. has an associative operation with neutral element $e$) and let $(A,+)$ be a commutative group (with neutral element $0$).

Let’s say that a semigroup extension of $S$ with $A$ is any operation on $S \times A$, of the form

$(s,a)(s',a') = (s s' , a+a'+ \lambda(s,s'))$

where $\lambda: S \times S \rightarrow A$ is a function such that $S \times A$ with the mentioned operation is a semigroup with neutral element $(e,0)$. Obviously, the operation is encoded by the function $\lambda$, which has to satisfy:

$\lambda(s s', s") + \lambda(s,s') = \lambda(s, s's") + \lambda(s',s")$  (from associativity)

$\lambda(e,s) = \lambda(s,e) = 0$  (from the neutral element condition).

Here are two examples. Both are written in the setting of bipotentials, namely   $X$ and $Y$ are topological, locally convex, real vector spaces of dual variables $x \in X$ and $y \in Y$, with the duality product $\langle \cdot , \cdot \rangle : X \times Y \rightarrow \mathbb{R}$.

The spaces $X, Y$ have topologies compatible with the duality product, in the sense that for any continuous linear functional on $X$ there is an $y \in Y$ which puts the functional into the form $x \mapsto \langle x,y\rangle$ (respectively any continuous linear functional on $Y$ has the form $y \mapsto \langle x,y\rangle$, for an $x \in X$).

Example 1: (Heisenberg group) Let $S = X \times Y$ with the operation of addition of pairs of vectors and let $A = \mathbb{R}$ with addition. We may define a Heisenberg group over the pair $(X,Y)$ as $H(X,Y) = S \times A$ with the operation

$(x,y,a)(x',y',b) = (x+x', y+y', a+a'+ \langle x, y'\rangle - \langle x', y \rangle)$
Fact: There is no injective morphism from $(X\times Y, +)$ to $H(X,Y)$.

Example 2: (convex analysis) Let this time $S = (X \times Y)^{*}$, the free semigroup generated by $X \times Y$, i.e. the collection of all finite words with letters from $X \times Y$, together with the empty word $e$, with the operation of concatenation of words.

Let $A$ be the set of bi-affine real functions on $X \times Y$, i.e. the collection of all functions $a: X \times Y \rightarrow \mathbb{R}$ which are affine and continuous in each argument. $A$ is a commutative group with the addition of real valued functions operation.

We define the function $\lambda: S \times S \rightarrow A$ by:

$\lambda(e, c)(x,y) = \lambda(c,e)(x,y)=0$ for any $c \in S$ and any $(x,y) \in X \times Y$.
– if $c, h \in S$ are words $c = (x_{1},y_{1})...(x_{n}, y_{n})$ and $h = (u_{1},v_{1})...(u_{m}, v_{m})$, with $m,n \geq 1$, then

$\lambda(c,h)(x,y) = \langle u_{1} - x , y_{n} - y \rangle$ for any $(x,y) \in X \times Y$.

This $\lambda$ induces a semigroup extension operation on $S \times A$.

Fact: there is an injective morphism $F: S \rightarrow S \times A$, with the expression $F(c) = (c, E(c))$.

Here, for any $c = (x_{1},y_{1})...(x_{n}, y_{n})$ the expression $E(c)(x,y)$ is the well known circular sum associated to the “dissipation during the discrete cycle” $(x_{1},y_{1})...(x_{n}, y_{n})(x,y)$, namely:

$E(c)(x,y) = \langle x_{1},y\rangle + \langle x,y_{n}\rangle - \langle x,y \rangle + \sum_{k=1}^{n-1}\langle x_{k+1}, y_{k}\rangle - \sum_{k=1}^{n} \langle x_{k},y_{k} \rangle$

which appears in convex analysis, related to cyclically monotone operators.

The main difference between those two examples is that the example 2. is a direct product of a semigroup with a group. There are many resemblances though.

# Bipotentials, variational formulations

The paper on the use of bipotentials in variational formulations is finally submitted, also available on arxiv here. See also this presentation.

In case you wonder how this could be related with other subjects commented on this blog, then wait to see “A gallery of emergent algebras”, where it shall be explained the connection between convex analysis and an emergent algebra related to a semidirect product between the semigroup of words over a symplectic space and $\mathbb{R}$.