
An exercise with convex analysis and neural networks

This is a note about a simple use of convex analysis in relation to neural networks. There are many points of contact between convex analysis and neural networks, but I have not been able to locate this particular one; thanks in advance for pointing me to a source, if there is one.

Let’s start with a directed graph with a set of nodes N (these are the neurons) and a set of directed bonds B. Each bond has a source and a target, which are neurons, so there are source and target functions

s:B \rightarrow N , t:B \rightarrow N

so that for any bond x \in B the neuron a = s(x) is the source of the bond and the neuron b = t(x) is the target of the bond.

For any neuron a \in N:

  • let in(a) \subset B be the set of bonds x \in B with target t(x)=a,
  • let out(a) \subset B be the set of bonds x \in B with source s(x)=a.
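
Here is a minimal sketch, in Python, of the graph data just described: bonds with source and target functions and the in/out sets of a neuron. The bond and neuron names are made up for illustration.

# A minimal sketch of the graph data: bonds B with source and target
# functions s, t : B -> N, and the sets in(a), out(a) of a neuron a.
# The bond and neuron names below are made up.

source = {"x1": "a", "x2": "b", "x3": "a"}   # s : B -> N
target = {"x1": "b", "x2": "c", "x3": "c"}   # t : B -> N

def in_bonds(a):
    """in(a): the bonds whose target is the neuron a."""
    return {x for x in target if target[x] == a}

def out_bonds(a):
    """out(a): the bonds whose source is the neuron a."""
    return {x for x in source if source[x] == a}

print(in_bonds("c"))    # {'x2', 'x3'}
print(out_bonds("a"))   # {'x1', 'x3'}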

A state of the network is a function u: B \rightarrow V^{*} where V^{*} is the dual of a real vector space V. I’ll explain why in a moment, but it’s nothing strange: I’ll suppose that V and V^{*} are dual topological vector spaces, with duality product denoted by (u,v) \in V \times V^{*} \mapsto \langle v, u \rangle such that any linear and continuous function from V to the reals is expressed by an element of V^{*} and, similarly, any linear and continuous function from V^{*} to the reals is expressed by an element of V.

If you think that’s too much, just imagine V=V^{*} to be a finite dimensional euclidean vector space, with the euclidean scalar product denoted by the \langle , \rangle notation.

A weight of the network is a function w:B \rightarrow Lin(V^{*}, V), you’ll see why in a moment.

Usually the state of the network is described by a function which associates to any bond x \in B a real value u(x), and a weight is a function defined on bonds, with values in the reals. This corresponds to the choice V = V^{*} = \mathbb{R} and \langle v, u \rangle = uv; a linear function from V^{*} to V is then just a real number w.

The activation function of a neuron a \in N gives a relation between the values of the state on the input bonds and the values of the state on the output bonds: the value on any output bond is a function of the weighted sum of the values on the input bonds. Usually (but not exclusively) this is an increasing continuous function.

The integral of an increasing continuous function is a convex function. I’ll call this integral the activation potential \phi (suppose, for simplicity, that it does not depend on the neuron). The relation between the input and output values is the following:

for any neuron a \in N and for any bond y \in out(a) we have

u(y) = D \phi ( \sum_{x \in in(a)} w(x) u(x) ).
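
As a concrete illustration of this relation in the scalar case V = V^{*} = \mathbb{R}, here is a small sketch which takes \phi to be the softplus function (the same choice as in the example at the end), so that D\phi is the logistic function; the bonds, weights and state values are made up.

import math

# Smooth case with V = V* = R and phi = softplus, so D phi is the logistic
# function. The bond names, weights and state values are made up.

def dphi(s):
    """Derivative of softplus, i.e. the logistic function."""
    return 1.0 / (1.0 + math.exp(-s))

u = {"x1": 0.3, "x2": -1.2}        # state on the input bonds of a neuron a
w = {"x1": 2.0, "x2": 0.25}        # weights on those bonds

s = sum(w[x] * u[x] for x in u)    # weighted sum over in(a)
u_y = dphi(s)                      # state on every output bond y of a
print(u_y)                         # ~ 0.574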

This relation generalizes to:

for any neuron a \in N and for any bond y \in out(a) we have

u(y) \in  \partial \phi ( \sum_{x \in in(a)} w(x) u(x) )

where \partial \phi is the subgradient of a convex and lower semicontinuous activation potential

\phi: V \rightarrow \mathbb{R} \cup \left\{ + \infty \right\}

Written like this, we have dispensed with any smoothness assumptions, which is one of the strong features of convex analysis.
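
Here is a tiny sketch of what the nonsmooth case buys us, under the illustrative assumption that the activation potential is \phi(u) = \max(u,0): the subgradient at 0 is the whole interval [0,1], so the relation above admits a range of output values there.

# Nonsmooth illustration: phi(u) = max(u, 0) is convex but not differentiable
# at 0, so its subgradient there is the whole interval [0, 1] and the relation
# u(y) in subgradient(phi)(s) admits a range of output values.

def subgradient(s):
    """Subgradient of phi(u) = max(u, 0), returned as an interval [lo, hi]."""
    if s > 0.0:
        return (1.0, 1.0)
    if s < 0.0:
        return (0.0, 0.0)
    return (0.0, 1.0)              # at s = 0 every value in [0, 1] is admissible

def satisfies_relation(u_y, s):
    lo, hi = subgradient(s)
    return lo <= u_y <= hi

print(satisfies_relation(0.3, 0.0))    # True: any u(y) in [0, 1] is allowed
print(satisfies_relation(0.3, -2.0))   # False: the only admissible value is 0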

This subgradient relation also explains the maybe strange definition of states and weights with the vector spaces V and V^{*}.

This subgradient relation can be expressed as the minimum of a cost function. Indeed, to any convex function \phi is associated a sync (short for “synchronized convex function”, a notion introduced in [1])

c: V \times V^{*} \rightarrow \mathbb{R} \cup \left\{ + \infty \right\}

c(u,v) = \phi(u) + \phi^{*}(v) - \langle v, u \rangle

where \phi^{*} is the Fenchel dual of the function \phi, defined by

\phi^{*}(v) = \sup \left\{ \langle v, u \rangle - \phi(u) \mid u \in V \right\}

This sync has the following properties:

  • it is convex in each argument
  • c(u,v) \geq 0 for any (u,v) \in V \times V^{*}
  • c(u,v) = 0 if and only if v \in \partial \phi(u).
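
Here is a minimal numerical check of these properties, with \phi(u) = u^{2}/2 chosen only because it is its own Fenchel dual, which gives the sync the explicit form c(u,v) = (u-v)^{2}/2.

# Check of the three properties for phi(u) = u^2 / 2, which equals its own
# Fenchel dual, so c(u, v) = u^2/2 + v^2/2 - u*v = (u - v)^2 / 2.

def phi(u):
    return 0.5 * u * u

def phi_star(v):
    return 0.5 * v * v

def sync(u, v):
    return phi(u) + phi_star(v) - v * u

print(sync(1.3, 0.4))   # 0.405 > 0
print(sync(1.3, 1.3))   # 0.0, and indeed D phi(1.3) = 1.3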

With the sync we can produce a cost associated to each neuron: for any a \in N, the contribution of the state u and of the weight w to the cost is

\sum_{y \in out(a)} c(\sum_{x \in in(a)} w(x) u(x) , u(y) ) .

The total cost function C(u,w) is

C(u,w) = \sum_{a \in N} \sum_{y \in out(a)} c(\sum_{x \in in(a)} w(x) u(x) , u(y) )

and it has the following properties:

  • C(u,w) \geq 0 for any state u and any weight w
  • C(u,w) = 0 if and only if for any neuron a \in N and for any bond y \in out(a) we have

 

u(y) \in  \partial \phi ( \sum_{x \in in(a)} w(x) u(x) )

so that’s a good cost function.
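
Here is a small numerical sketch of the total cost on a made-up network with a single interior neuron (the boundary neurons are left out of the sum, to avoid discussing how the input bonds are clamped); \phi is taken to be the softplus function from the example below.

import math

# Total cost on a made-up network: bonds x1, x2 enter a neuron "a", bond y
# leaves it. phi is softplus; its Fenchel dual is the binary entropy
# expression from the example below. Only the interior neuron "a" contributes.

def phi(s):
    return math.log(1.0 + math.exp(s))

def phi_star(v):
    if not 0.0 <= v <= 1.0:
        return float("inf")
    h = lambda t: 0.0 if t == 0.0 else t * math.log(t)
    return h(v) + h(1.0 - v)

def sync(s, v):
    return phi(s) + phi_star(v) - v * s

source = {"x1": "i1", "x2": "i2", "y": "a"}
target = {"x1": "a", "x2": "a", "y": "o"}
neurons = {"a"}

def total_cost(u, w):
    C = 0.0
    for a in neurons:
        s = sum(w[x] * u[x] for x in target if target[x] == a)
        C += sum(sync(s, u[y]) for y in source if source[y] == a)
    return C

w = {"x1": 2.0, "x2": 0.5}
u = {"x1": 0.3, "x2": -1.2}
s = w["x1"] * u["x1"] + w["x2"] * u["x2"]
u["y"] = 1.0 / (1.0 + math.exp(-s))    # logistic of the weighted sum
print(total_cost(u, w))                # ~ 0: the subgradient relation holds
u["y"] = 0.9
print(total_cost(u, w))                # > 0: the relation is violated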

Example:

  • take \phi to be the softplus function \phi(u) = \ln(1+\exp(u))
  • then the activation function (i.e. the subgradient) is the logistic function
  • and the Fenchel dual of the softplus function is the (negative of the) binary entropy \phi^{*}(v) = v \ln(v) + (1-v) \ln(1-v) (extended by 0 for v = 0 or v = 1 and equal to + \infty outside the closed interval [0,1]).
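
As a quick sanity check of the last point, the claimed closed form can be compared with the definition \phi^{*}(v) = \sup \left\{ \langle v, u \rangle - \phi(u) \right\}, evaluated here by a crude grid search (the grid bounds and step are arbitrary).

import math

# Compare the claimed closed form of phi* with the definition
# phi*(v) = sup_u { v*u - phi(u) }, evaluated by a crude grid search.

def softplus(u):
    return math.log(1.0 + math.exp(u))

def closed_form_dual(v):               # the claimed dual, for v in (0, 1)
    return v * math.log(v) + (1.0 - v) * math.log(1.0 - v)

def dual_by_grid(v, lo=-30.0, hi=30.0, steps=60001):
    grid = (lo + i * (hi - lo) / (steps - 1) for i in range(steps))
    return max(v * u - softplus(u) for u in grid)

v = 0.73
print(closed_form_dual(v))             # ~ -0.5833
print(dual_by_grid(v))                 # close to the same value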

 

________

[1] Blurred maximal cyclically monotone sets and bipotentials, with Géry de Saxcé and Claude Vallée, Analysis and Applications 8 (2010), no. 4, 1-14, arXiv:0905.0068

_______________________________

 

Do triangulations of oriented surfaces compute?

In a precise sense, which I shall explain, they do. But the way they do it is hidden behind the fact that the rewrites seem non local.

  1. They compute, because ribbon graphs with colored, trivalent nodes and directed edges do compute, via the encoding of untyped lambda terms into this family of graphs provided by chemlambda. Indeed, a chemlambda molecule is a ribbon graph with these properties. If you want to encode a lambda term into chemlambda there is a simple procedure: start from the lambda term in a form which eliminates the need for any alpha conversion. Then build the syntactic tree and replace the nodes by A nodes for application and L nodes for lambda abstraction (don’t forget that L nodes have one in and 2 out ports, unlike the syntactic tree node for lambda abstraction). Then eliminate the variables at the leaves: graft trees of FO (green fanout) nodes from the lambda abstraction node to the places where the variable occurs, graft T (terminal) nodes to any lambda node which issues a variable that does not occur later, and simply erase the labels of those variables which are not issued from an abstraction. That’s it: you get a ribbon graph which is open (it has at least the root half-edge and maybe the half-edges for the variables which don’t come from an abstraction), but then you may add FRIN (free in) and FROUT (free out) nodes, think of them as tadpoles, and you get a trivalent ribbon graph. The dual of this graph is (equivalent to) a triangulated, oriented surface, with colored faces (corresponding to the nodes of the graph) and directed edges, such that no face has its 3 edges directed in a cyclic way.
  2. How do they compute? Chemlambda uses a set of graph rewrites which includes some classic ones, like the Wadsworth-Lamping graphical version of the beta move, but it has two types of fanouts (FO and FOE), one FANIN, and rules for distributivity which differ from the usual ones. Look at the moves page to see them. All these rewrites are local, in the sense that there is a small number, fixed a priori, which is an upper bound for the number of nodes and edges which enter (in any way) into the graph rewrite (as a condition, as the left pattern, or as the right pattern). The algorithm of application of the rewrites is a very important piece which is needed to make a model of computation. The algorithm is very simple and it can be deterministic or random: in the deterministic case it consists of the application of as many rewrites as possible, with priority given to the distributivity moves in case of conflict; in the random case it is just the random application of rewrites.

Here is an example, where I play with the reduction of false omega id in chemlambda

  3. Now let’s pass to the duals, the triangulated surfaces. The nodes of the triangulated surface correspond to the faces of the ribbon graph. But the faces of the ribbon graph are global notions, because they are the orbits of a permutation (see the sketch after this list). After one of the rewrites, the faces (of the ribbon graph) change in a way which has to be non local, because one has to compute again the orbits of the permutation for the new graph, and there is no upper bound on the number of half-edges which have to be visited for doing that.
  4. So triangulated, oriented surfaces do compute, but the rewrites and the algorithm of application are hidden behind this duality. They are non-local for triangulated surfaces, but local for ribbon graphs.
  5. Finally, a word of attention: these surfaces do not compute by being arrows in a category. They don’t compute in this usual, say Turaev kind of way. They compute by (the duals of) the rewrites; there is nothing else than triangulated surfaces, colored by 3 colors (red, green, yellow), and there is no decoration which actually does the computation by substitution and evaluation. I don’t know why, but this seems very hard for many to understand. Really, these surfaces compute by rewrites on the triangulations, not by anything else.
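
To make the non-locality claim in point 3 concrete, here is a sketch which uses the standard combinatorial map encoding of a ribbon graph (nothing chemlambda specific): darts, a vertex rotation \sigma and an edge involution \alpha, with the faces given, in one common convention, by the orbits of \sigma \circ \alpha. Finding an orbit may require walking over darts far away from a rewritten spot. The triangle below is a made-up example.

# A ribbon graph as a combinatorial map: darts, a vertex rotation sigma and
# an edge involution alpha. The faces are the orbits of sigma o alpha.

def orbits(perm):
    """Decompose a permutation (dict dart -> dart) into its orbits."""
    seen, result = set(), []
    for start in perm:
        if start in seen:
            continue
        cycle, d = [], start
        while d not in seen:
            seen.add(d)
            cycle.append(d)
            d = perm[d]
        result.append(cycle)
    return result

alpha = {1: 2, 2: 1, 3: 4, 4: 3, 5: 6, 6: 5}   # edges {1,2}, {3,4}, {5,6}
sigma = {1: 6, 6: 1, 2: 3, 3: 2, 4: 5, 5: 4}   # rotations at the 3 vertices

phi = {d: sigma[alpha[d]] for d in alpha}      # face permutation
print(orbits(phi))                             # [[1, 3, 5], [2, 6, 4]]: the 2 faces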

ADDED: If you look at the tadpoles as pinches, then make the easy effort to see what the SKI formalism looks like, you’ll see funny things. The I combinator is the sphere with one pinch (the plane), the K combinator is the sphere with two pinches (the cylinder) and the S combinator is the torus with one pinch. But what is SKK=I? What is KAB=A? What you see in the dual (i.e. in the triangulation) depends globally on the whole term, so these reductions do not appear as the same topological manipulations in different contexts.