An exercise with convex analysis and neural networks

[pdf version]

This is a note about a simple use of convex analysis in relation with neural networks. There are many points of contact between convex analysis and neural networks, but I have not been able to locate this particular one; thanks in advance for pointing me to a source, if one exists.

Let’s start with a directed graph with a set of nodes N (these are the neurons) and a set of directed bonds B. Each bond has a source and a target, which are neurons, therefore there are source and target functions

s: B \rightarrow N , t: B \rightarrow N

so that for any bond x \in B the neuron a = s(x) is the source of the bond and the neuron b = t(x) is the target of the bond.

For any neuron a \in N:

  • let in(a) \subset B be the set of bonds x \in B with target t(x)=a,
  • let out(a) \subset B be the set of bonds x \in B with source s(x)=a.
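
As a small sketch (the neuron and bond names below are invented for illustration), such a graph and the maps in(a), out(a) can be written in Python from the source and target maps alone:

    # hypothetical source and target maps s, t: B -> N for a graph with neurons a, b, c, d
    s = {"x1": "a", "x2": "b", "y": "c"}
    t = {"x1": "c", "x2": "c", "y": "d"}

    def in_bonds(a):                      # in(a): bonds x with t(x) = a
        return [x for x in t if t[x] == a]

    def out_bonds(a):                     # out(a): bonds x with s(x) = a
        return [x for x in s if s[x] == a]

    print(in_bonds("c"), out_bonds("c"))  # ['x1', 'x2'] ['y']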

A state of the network is a function u: B \rightarrow V^{*}, where V^{*} is the dual of a real vector space V. I’ll explain why in a moment, but it’s nothing strange: I’ll suppose that V and V^{*} are dual topological vector spaces, with the duality product denoted by (u,v) \in V \times V^{*} \mapsto \langle v, u \rangle, such that any linear and continuous function from V to the reals is expressed by an element of V^{*} and, similarly, any linear and continuous function from V^{*} to the reals is expressed by an element of V.

If you think that’s too much, just imagine V = V^{*} to be a finite dimensional euclidean vector space, with the euclidean scalar product denoted by the \langle , \rangle notation.

A weight of the network is a function w: B \rightarrow Lin(V^{*}, V); you’ll see why in a moment.

Usually the state of the network is described by a function which associates to any bond x \in B a real value u(x), and a weight is a function defined on bonds, with values in the reals. This corresponds to the choice V = V^{*} = \mathbb{R} and \langle v, u \rangle = uv; a linear function from V^{*} to V is then just a real number w.

The activation function of a neuron a \in N gives a relation between the values of the state on the input bonds and the values of the state on the output bonds: any value of an output bond is a function of the weighted sum of the values of the input bonds. Usually (but not exclusively) this is an increasing continuous function.

The integral of an increasing continuous function is a convex function. I’ll call this integral the activation potential \phi (suppose it does not depend on the neuron, for simplicity). The relation between the input and output values is the following:

for any neuron a \in N and for any bond y \in out(a) we have

u(y) = D \phi ( \sum_{x \in in(a)} w(x) u(x) ).
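
As a minimal numerical sketch of this relation (assuming the softplus potential from the example at the end, with V = V^{*} = \mathbb{R}; the neuron, its bonds, the weights and the input values are made up):

    import math

    def softplus_grad(s):                 # D phi for phi(u) = ln(1 + exp(u)): the logistic function
        return 1.0 / (1.0 + math.exp(-s))

    # hypothetical neuron a with input bonds x1, x2 and one output bond y
    u = {"x1": 0.3, "x2": -1.2}           # state on the input bonds
    w = {"x1": 2.0, "x2": 0.5}            # weights; here each w(x) is just a real number

    weighted_sum = sum(w[x] * u[x] for x in u)
    u_y = softplus_grad(weighted_sum)     # u(y) = D phi( sum_{x in in(a)} w(x) u(x) )
    print(u_y)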

This relation generalizes to:

for any neuron a \in N and for any bond y \in out(a) we have

u(y) \in  \partial \phi ( \sum_{x \in in(a)} w(x) u(x) )

where \partial \phi is the subgradient of a convex and lower semicontinuous activation potential

\phi: V \rightarrow \mathbb{R} \cup \left\{ + \infty \right\}

Written like this, we have dispensed with any smoothness assumptions, which is one of the strong features of convex analysis.
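
For a concrete nonsmooth illustration (my example, just to show what the subgradient buys us): take \phi(u) = \max(0, u). Then \partial \phi(u) = \{0\} for u < 0, \partial \phi(u) = \{1\} for u > 0 and \partial \phi(0) = [0,1], so the relation u(y) \in \partial \phi(\dots) describes a threshold (Heaviside-like) unit, which has no classical activation function at the threshold.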

This subgradient relation also explains the maybe strange definition of states and weights with the vector spaces V and V^{*}.

This subgradient relation can be expressed as the minimum of a cost function. Indeed, to any convex function \phi is associated a sync (short for “synchronized convex function”, a notion introduced in [1])

c: V \times V^{*} \rightarrow \mathbb{R} \cup \left\{ + \infty \right\}

c(u,v) = \phi(u) + \phi^{*}(v) - \langle v, u \rangle

where \phi^{*} is the Fenchel dual of the function \phi, defined by

\phi^{*}(v) = \sup_{u \in V} \left\{ \langle v, u \rangle - \phi(u) \right\}

This sync has the following properties:

  • it is convex in each argument
  • c(u,v) \geq 0 for any (u,v) \in V \times V^{*}
  • c(u,v) = 0 if and only if v \in \partial \phi(u).
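
Here is a small numerical check of these properties, assuming the softplus / binary entropy pair from the example at the end of the note (a sketch, with all names mine):

    import math

    def phi(u):                           # softplus potential
        return math.log(1.0 + math.exp(u))

    def phi_star(v):                      # its Fenchel dual: (negative of the) binary entropy
        if not 0.0 <= v <= 1.0:
            return float("inf")
        if v in (0.0, 1.0):
            return 0.0
        return v * math.log(v) + (1.0 - v) * math.log(1.0 - v)

    def sync(u, v):                       # c(u, v) = phi(u) + phi*(v) - <v, u>
        return phi(u) + phi_star(v) - v * u

    u = 0.7
    print(sync(u, 0.2))                            # strictly positive: 0.2 is not in the subgradient at 0.7
    print(sync(u, 1.0 / (1.0 + math.exp(-u))))     # v = D phi(u): zero up to rounding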

With the sync we can produce a cost associated to each neuron: for any a \in N, the contribution to the cost of the state u and of the weight w is

\sum_{y \in out(a)} c(\sum_{x \in in(a)} w(x) u(x) , u(y) ) .

The total cost function C(u,w) is

C(u,w) = \sum_{a \in N} \sum_{y \in out(a)} c(\sum_{x \in in(a)} w(x) u(x) , u(y) )

and it has the following properties:

  • C(u,w) \geq 0 for any state u and any weight w
  • C(u,w) = 0 if and only if for any neuron a \in N and for any bond y \in out(a) we have

u(y) \in  \partial \phi ( \sum_{x \in in(a)} w(x) u(x) )

so that’s a good cost function.
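
A minimal sketch of the total cost on a toy two-neuron chain (the graph, weights and state values are invented; phi, phi_star and sync are as in the previous snippet, repeated so this runs on its own):

    import math

    def phi(u):
        return math.log(1.0 + math.exp(u))

    def phi_star(v):
        if not 0.0 <= v <= 1.0:
            return float("inf")
        if v in (0.0, 1.0):
            return 0.0
        return v * math.log(v) + (1.0 - v) * math.log(1.0 - v)

    def sync(u, v):
        return phi(u) + phi_star(v) - v * u

    # toy chain: bond "in1" feeds neuron a, bond "mid" goes from a to b, bond "out1" leaves b
    in_bonds  = {"a": ["in1"], "b": ["mid"]}
    out_bonds = {"a": ["mid"], "b": ["out1"]}

    def total_cost(u, w):
        # C(u, w) = sum over neurons a and bonds y in out(a) of c( sum_{x in in(a)} w(x) u(x), u(y) )
        return sum(sync(sum(w[x] * u[x] for x in in_bonds[a]), u[y])
                   for a in in_bonds for y in out_bonds[a])

    w = {"in1": 1.5, "mid": -0.8}
    u = {"in1": 0.4, "mid": 0.55, "out1": 0.3}
    print(total_cost(u, w))               # nonnegative; zero exactly when the subgradient relations hold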

Example:

  • take \phi to be the softplus function \phi(u) = \ln(1+\exp(u))
  • then the activation function (i.e. the subgradient) is the logistic function
  • and the Fenchel dual of the softplus function is the (negative of the) binary entropy \phi^{*}(v) = v \ln(v) + (1-v) \ln(1-v) (extended by 0 for v = 0 or v = 1 and equal to + \infty outside the closed interval [0,1]).
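
A quick way to check the last item: for 0 < v < 1 the supremum in the definition of \phi^{*} is attained where the derivative of u \mapsto vu - \ln(1+\exp(u)) vanishes, that is where v = \exp(u)/(1+\exp(u)), i.e. u = \ln(v/(1-v)); substituting back gives

\phi^{*}(v) = v \ln \frac{v}{1-v} + \ln(1-v) = v \ln(v) + (1-v) \ln(1-v) .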

________

[1] Blurred maximal cyclically monotone sets and bipotentials, with Géry de Saxcé and Claude Vallée, Analysis and Applications 8 (2010), no. 4, 1-14, arXiv:0905.0068
