Tag Archives: Open Science

Transparency is superior to trust

I am fascinated by this quote. I think it’s the most beautiful quote, in it’s terseness, I’ve seen since a long time. Wish I invented it!

It is not, though, the motto of Wikileaks, it’s taken from the section on Reproducibility of this Open Science manifesto.

To me, this quote means that validation is superior to peer review.

It is also significant that the quote says nothing about the publishing aspects of Open Science. That is because, I believe, we should split publishing from the discussion about Open Science.

Publishing, scientific publishing I mean, is simply irrelevant at this point. The strong part of Open Science, the new, original idea it brings forth is validation.

Sci-Hub acted as the great leveler, as concerns scientific publication. No interested reader cares, at this point, if an article is hostage behind a paywall or if the author of the article paid money for nothing to a Gold OA publisher.

Scientific publishing is finished. You have to be realistic about this thing.

But science communication is a far greater subject of interest. And validation is one major contribution to a superior scientific method.

More experiments with Open Science

I still don’t know which format is better for Open Science. I’m long past the article format for obvious reasons. Validation is a good word and concept because you don’t have to rely absolutely on opinions of others and that’s how the world works. This is not all the story though.

I am very fortunate to be a mathematician, not a biologist or biochemist. Still I long for the good format for Open Science, even if, as a mathematician, I don’t have the problems biologists or chemists have, namely loads and loads of experimental data and empirical approaches. I do have a world of my own to experiment with, where I do have loads of data and empirical constructs. My mind, my brain are real and I could understand myself by using tools of chemists and biologists to explore the outcomes of my research. Funny right? I can look at myself from the outside.

That is why  I chose to not jump directly to make Hydrogen, but instead to treat the chemlambda  world, again, as a guinea pig for Open Science.

There are 427 well written molecules in the chemlambda library of molecules on Github. There are 385 posts in the chemlambda collection on Google+, most of them with animations from simulations of those molecules. It is a world, how big is it?

It is easy to make first a one page direct access to the chemlambda collection. It is funnier to build a phylogenetic tree of the molecules, based on their genes. That’s what I am doing now, based on a work in progress.

Each molecule can be decomposed in “genes” say, by a sequencer program. Then one can use a distance between these genes to estimate first how they cluster and later to make a phylogenetic tree.

Here is the first heatmap (using the edit distance between single occurrences of genes in molecules) of the 427 molecules.

Screenshot-22

Is a screenshot, proving that my custom programs work 🙂 (one understands more by writing some scripts than by taking tools ready made from others, at least at this stage of research).

By using the edit distance I can map the explored chemlambda molecules. In the following image the 427 molecules from the library are represented as nodes and for each pair of molecules at an edit distance at most 20 there is a link. The nodes are in a central gravitational field, each node has the same charge and the links between nodes act as springs.

Screenshot-29_map

This is a screenshot of the result, showing clusters and trees, connecting them. Not very sophisticated, but enough to give a sense of the explored territory. In the curated collection, such a map would be useful to navigate through the molecules, as well as for giving ideas about which parts are not as well explored. I have not yet made clear which parts of the map cover lambda terms, which cover quines, etc.

Moreover, I see structure! The 427 molecules are made of copies of  605 different linear “genes” (i.e. sticks with colored ends)  and 38 ring shaped ones.  (Is easy to prove that lambda terms have no rings, when turned into molecules.) There are some interesting curved features visible in the edit distance of the sticks.

screenshot-23

They don’t look random enough.

Is clear that a phylogenetic tree is in reach, then what else than connecting the G+ collection posts with the molecules used, arranged along the tree…?

Can I discover which molecules are coming from lambda terms?

Can I discover how my mind worked when building these molecules?

Which are the neglected sides, the blind places?

I hope to be able to tell by the numbers.

Which brings me to the main subject of this post: which is a good format for an Open Science piece of research?

Right now I am in between two variants, which may turn out to not be as different as they seem. An OS research vehicle could be:

  • like a viable living organism, literary
  • or like a viable world, literary.

Only the future will tell which is which. Maybe both!

Update the Panton Principles please

There is a big contradiction between the text of The Panton Principles and the List of the Recommended Conformant Licenses. It appears that it is intentional, I’ll explain in a moment why I write this.

This contradiction is very bad for the Open Science movement. That is why, please, update your principles.

Here is the evidence.

1. The second of the Panton Principles is:

“2. Many widely recognized licenses are not intended for, and are not appropriate for, data or collections of data. A variety of waivers and licenses that are designed for and appropriate for the treatment of data are described [here](http://opendefinition.org/licenses#Data). Creative Commons licenses (apart from CCZero), GFDL, GPL, BSD, etc are NOT appropriate for data and their use is STRONGLY discouraged.

*Use a recognized waiver or license that is appropriate for data.* ”

As you can see, the authors clearly state that “Creative Commons licenses (apart from CCZero) … are NOT appropriate for data and their use is STRONGLY discouraged.”

2. However, if you look at the List of Recommended Licenses, surprise:

Creative Commons Attribution Share-Alike 4.0 (CC-BY-SA-4.0) is recommended.

3. The CC-BY-SA-4.0 is important because it has a very clear anti-DRM part:

“You may not offer or impose any additional or different terms or conditions on, or apply any Effective Technological Measures to, the Licensed Material if doing so restricts exercise of the Licensed Rights by any recipient of the Licensed Material.” [source CC 4.0 licence: in Section 2/Scope/a. Licence grant/5]

4. The anti-DRM is not a “must” in the Open Definition 2.1. Indeed, the Open Definition clearly uses “must” in some places and “may” in another places.  See

“2.2.6 Technical Restriction Prohibition

The license may require that distributions of the work remain free of any technical measures that would restrict the exercise of otherwise allowed rights. ”

5. I asked why is this here. Rufus Pollock, one of the authors of The Panton Principles and of the Open Definition 2.1, answered:

“Hi that’s quite simple: that’s about allowing licenses which have anti-DRM clauses. This is one of the few restrictions that an open license can have.”

My reply:

“Thanks Rufus Pollock but to me this looks like allowing as well any DRM clauses. Why don’t include a statement as clear as the one I quoted?”

Rufus:

“Marius: erm how do you read it that way? “The license may prohibit distribution of the work in a manner where technical measures impose restrictions on the exercise of otherwise allowed rights.”

That’s pretty clear: it allows licenses to prohibit DRM stuff – not to allow it. “[Open] Licenses may prohibit …. technical measures …”

Then:

“Marius: so are you saying your unhappy because the Definition fails to require that all “open licenses” explicitly prohibit DRM? That would seem a bit of a strong thing to require – its one thing to allow people to do that but its another to require it in every license. Remember the Definition is not a license but a set of principles (a standard if you like) that open works (data, content etc) and open licenses for data and content must conform to.”

I gather from this exchange that indeed the anti-DRM is not one of the main concerns!

6. So, until now, what do we have? Principles and definitions which aim to regulate what Open Data means which avoid to take an anti-DRM stance. In the same time they strongly discourage the use of an anti-DRM license like CC-BY-4.0. However, on a page which is not as visible they recommend, among others, CC-BY-4.0.

There is one thing to say: “you may use anti-DRM licenses for Open Data”. It means almost nothing, it’s up to you, not important for them. They write that all CC licenses excepting CCZero are bad! Notice that CC0 does not have anything anti-DRM.

Conclusion. This ambiguity has to be settled by the authors. Or not, is up to them. For me this is a strong signal that we witness one more attempt to tweak a well intended  movement for cloudy purposes.

The Open Definition 2.1. ends with:

Richard Stallman was the first to push the ideals of software freedom which we continue.

Don’t say, really? Maybe is the moment for a less ambiguous Free Science.

The price of publishing with GitHub, Figshare, G+, etc

Three years ago I posted The price of publishing with arXiv. If you look at my arXiv articles then you’ll notice that I barely posted on arXiv.org since then. Instead I went into territory which is even less recognized as serious by a big part of academia. I used:

The effects of this choice are put in front of my homepage, so go there to read them. (Besides, it is a good exercise to remember how to click on links and use them, that lost art from the age when internet was free.)

In this post I want to explain what is the price I paid for these choices and what I think now about them.

First, it is a very stressful way of living. I am not joking, as you know stress comes from realizing that there are many choices and one has to choose. Random reward from the social media is addictive. The discovery that there is a way to get out from the situation which keeps us locked into the legacy publishing system (validation). The realization that the problem is not technical but social. A much more cynical view of the undercurrents of the social life of researchers.

The feeling that I can really change the world with my research. The worries that some possible changes might be very dangerous.

The debt I owe concerning the scarcity of my explanations. The effort to show only the aspects I think are relevant, putting aside those who are not. (Btw, if you look at my About page then you’ll read “This blog contains ideas from the future”. It is true because I already pruned the 99% of the paths leading nowhere interesting.)

The desire to go much deeper, the desire to explain once again what and why, to people who seem either lacking long term attention capability or having shallow pet theories.

Is like fishing for Moby Dick.

How to use the chemlambda collection of simulations

The chemlambda_casting folder (1GB) of simulations is now available on Figshare [1].

How to use the chemlambda collection of simulations? Here’s an example. The synthesis from a tape video [2] is reproduced here with a cheap animated gif. The movie records the simulation file 3_tape_long_5346.html which is available for download at [1].

That simple.

If you want to run it in your computer then all you have to do is to download 3_tape_long_5346.html from [1], download from the same place d3.min.js and jquery.min.js (which are there for your convenience). Put the js libs in the same folder as the html file. Open the html file with a browser, strongly recommend Safari or Chrome (not Firefox which blocks with these d3.js animations, for reasons related to d3). In case your computer has problems with the simulation (I used a macbook pro with safari) then slow it like this: edit the html file (with any editor) and look for the line starting with

return 3000 + (4*(step+(Math.random()*

and replace the “4” by “150”, it should be enough.

Here is a longer explanation. The best would be to read carefully the README [4].
“Advanced”: If you want to make another simulation for the same molecule then follow the steps.

1. The molecule used is 3_tape_long_5346.mol which is available at the library of chemlambda molecules [3].

2. So download the content of the gh-pages branch of the chemlambda repository at github [4] as explained in that link.

3. then follow the steps explained there and you’ll get a shiny new 3_tape_long_5346.html which of course may be different in details than the initial one (it depends on the script used, if you use the random rewrites scripts then of course the order of rewrites may be different).

[1] The Chemlambda collection of simulations
https://doi.org/10.6084/m9.figshare.4747390.v1

[2] Synthesis from a tape
https://plus.google.com/+MariusBuliga/posts/Kv5EUz4Mdyp

[3] The library of chemlambda molecules
https://github.com/chorasimilarity/chemlambda-gui/tree/gh-pages/dynamic/mol

[4] Chemlambda repository (readme) https://github.com/chorasimilarity/chemlambda-gui/blob/gh-pages/dynamic/README.md

The chemlambda collection is a social hack, here’s why

 

People from data deprived places turn to available sources for scientific information. They have the impression that Social Media may be useful for this. Reality is that it is not, by design.

But we can socially hack the Social Media for the benefit of Open Science.

Social Media is not fit for Open Science by design. They are Big Data gatherers, therefore they are interested not in the content per se, but in the metadata. The huge quantity of metadata they suck from the users tells them about the instantaneous interests and social links or preferences. That is why cat pics are everywhere: the awww moment is data poor but metadata rich.

Open Science has as aim to share scientific data and rigorous validation means. For free! Therefore Open Science is data rich. It is also, by design, metadata poor, because at least if a piece of research is not yet popular, there is not much interaction (useful for example to advertisers or to tech companies or govenrnments) to be encoded in
metadata.

The public impression is that science is hard and many times boring. There are however many people interested in science, like for example smart kids or creative people living in data deprived places. There are so many people with access to the Social Media so that, in principle, even the most seemingly boring science project may gather the attentions of tens of thousands of them. If well done!

Such science projects may never see the light of the media attention because classical media works with big numbers and very low level content. Classical media has still to adapt to the new realities of the Net. One of them is that the Net people are in such a great number that there is no need to adapt a message for a majority of people which is not, generically, interested in science.

Likewise, Social Media is by design driven by big numbers (of metadata, this time). They couldn’t care less about the content provided that it generates big data exhaust (Zuboff, Big other: surveillance capitalism and the prospects of an information civilization).

They can be tricked!

This was the purpose of the chemlambda collection: beautiful animations, data rich content hidden behind for those interested. My previous attempts to use classical channels for Open Science gave only very limited results. Indeed, the same is true for a smart kid or a creative person from Africa.

If you are not born in the right place, studied at the right university and made the right friends then your ideas will not spread through the classical channels, unless your
ideas are useful to a privileged team. You, smart kid or creative person from Africa, will never advance your ideas to the world unless they are useful first not to you, but to privileged people from far away places. If this happens, the best you can expect is to be an useful servant for them.

So, with these ideas and experiences, I tried to socially hack the Big Data gatherers. I presented short animations (under 10s) obtained from real scientific simulations. I chose them among those which are visually appealing. Each of them can be reproduced and researched by anybody interested via a GitHub repository.

It worked. The Algorithmic Gods from Google decided to make chemlambda a featured collection. I had more than 50 000 followers and more than 50 millions views of these scientific, original simulations.

To compare, another collection, dedicated to censorship on social media, had no views!

I shall make, acording to my access to data, which is limited, an analysis of people who saw the collection.

It seems to me that there were far more women that men. Probably the algorithms used the prior that women, stupid as they are, are more interested in pictures than text. Great, let’s hack this stupid prior and turn it into a chance to help Women access to science 🙂

There were far more people from Asia and Africa than from the West. Because, of course, they are stupid and don’t speak the language (English), but they can look at the pictures. Great, let’s turn this snobbery into an advantage, because they are the main public which could benefit from Open Science.
The amazing (for me) popularity of this experiment showed that there is something more to dig in this direction!
Science can be made interesting and remain rigorous too.

Science and art are not as different as they look, in particular for this project the visual arts.

And the chemlambda project is very interesting, of course, because it a take on life at molecular level done by a mathematician. The biologists need this, not only mathematical tools, but also mathematical minds. Biologists, as the Social Media companies, sit on heaps of Big Data.

Finally, there is the following question I’d like to ask.
Scientific data is, in bits, a tiny proportion of the Big Data gathered everyday. Is tiny, ridiculously tiny.

Question: where to put it freely, so that it stays free and is treated properly, I mean as visible and easy to access as a cat pic? Would it be so hard to dedicate something like 1/10 000 of the servers used for Big Data in order to keep Open Science alive? In order to not let it rot along with older cat pics?