bergis reptile zoo of software, hardware and ideas

RDF-Ext v1 Release

Finally RDF-Ext v1 is here! Maybe some of you already noticed that the packages on GitHub and npm now have a 1.* version number. All packages are now based on the RDFJS spec. The spec is not yet final, but mainly editorial issues remain open. I want to thank all the people who helped create the spec and made RDF-Ext v1 possible! At Zazuko we already used the develop branches for some projects and were quite happy with the new API and the performance of the stream processing. The lack of documentation is a known problem, and we are working on solving it soon.
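To give an impression of the new API, here is a minimal sketch of building a quad with the RDFJS data factory interface, assuming the rdf-ext package exposes the factory methods from the spec drafts; check the package documentation for the exact API.

```js
// Minimal sketch, assuming rdf-ext exposes the RDFJS data factory
// methods (namedNode, literal, quad) described in the spec drafts.
const rdf = require('rdf-ext')

const quad = rdf.quad(
  rdf.namedNode('http://example.org/subject'),
  rdf.namedNode('http://schema.org/name'),
  rdf.literal('example'))

// terms expose .termType and .value as defined by the RDFJS interfaces
console.log(quad.subject.termType) // 'NamedNode'
console.log(quad.object.value)     // 'example'
```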

My colleagues at Zazuko and I invested a considerable amount of time into the development of this new RDF-Ext release and into the RDFJS spec. While part of the work was funded by projects we were working on, we would still appreciate additional funding, especially to finish the documentation and examples. Our goal is to make it relatively easy for newcomers to start with RDF and JavaScript. If you would like to support our effort, drop me an email.

RawDevJS Release

It’s been a while since I actively worked on RawDevJS, my JavaScript raw image developer. In 2012 I ported RawDev, my C++ raw developer, to JavaScript, one year before pics.io, the first commercial service, became available. New APIs like WebWorker, WebGL and WebCL have made it possible to implement fast image processing applications in JavaScript. At the parallel 2013 conference I showed the possibilities of JavaScript parallel computing APIs using RawDevJS as an example. The slides are available here. I also gave a talk about it at the MunichJS JavaScript Meetup in September 2013 with the title RawDevJS – ein JavaScript Raw Entwickler (even though the title is German, the talk was in English - a spontaneous decision).

I thought we would soon have real cloud image processing apps. But now, 5 years later, there is still only pics.io, which I didn’t even know about until now. I discovered it while searching for links for this blog post.

I have too many other projects to continue my work on this one, so I decided to make it open source. I’m pavonine when it comes to code, so I made a quick rewrite in ES6. The WebWorker code was removed, because it didn’t work anymore with the current version of Node.js. I already had a look at more up-to-date libraries to bring back that feature. The performance gain would be big, so it would be worth looking into. The WebGL code used a different filter pipeline, and today there are better options to write filters for JavaScript and WebGL, like turbo.js and glsl-transpiler. That’s why I also skipped that part. The WebCL code was written before the spec was final; I even had to adapt the code to API changes during development. Also, there is still no native support for WebCL in any browser, so I skipped that part as well.

The result is available in the RawDevJS organization on GitHub, which contains 18 repositories today. One of them is rawdevjs-browser, which fetches and renders DNG images into Blob URLs that can be displayed using an image element.

<div class="rawdevjs" data-src="http://static.bergnet.org/IMG_8801.dng"></div>
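The Blob URL approach of rawdevjs-browser roughly follows the pattern sketched below. The renderToBlob function is only a placeholder for the actual raw development step done by the library; the fetch API, URL.createObjectURL and the image element are standard browser features.

```js
// Placeholder for the raw development step done by rawdevjs-browser;
// here it just wraps the bytes so the rest of the sketch runs.
async function renderToBlob (rawData) {
  return new Blob([rawData], { type: 'image/png' })
}

// fetch a DNG file, develop it into a Blob and show it in an <img> element
async function showDng (url, imgElement) {
  const response = await fetch(url)
  const rawData = await response.arrayBuffer()
  const blob = await renderToBlob(rawData)

  imgElement.src = URL.createObjectURL(blob)
}

showDng('http://static.bergnet.org/IMG_8801.dng', document.querySelector('img'))
```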

Another one is rawdevjs-cli, a command line tool to convert DNG images to PNG images.

If you are interested in participating or in adopting that project, please contact me!

Ligand binding affinity prediction using deep learning

Preamble

I’m not a domain expert in protein binding or even biochemistry, but I have a strong interest in hacking with machines. An abstract view of life shows that we are just DNA-based machines. The architecture is very different from current silicon-based machines, but I expect the line will blur more and more in the not-so-distant future. That’s how protein binding got my attention.

Idea

Two years ago I investigated the idea of using deep learning to make predictions for ligand binding affinity. The basic idea is very simple: there are many molecules with known binding affinities to specific receptors. The combined information of all these molecules for a specific receptor is like a negative of the receptor itself. Training a neural network with the information of many molecules for a single receptor would make the neural network itself a negative of that receptor.

Problems

Data

Usually one problem is how to collect all the data, but with my Linked Data background that was quite easy: use SPARQL to select the required information, done. ChEMBL is not the biggest public database, though, so I also wrote a small CSV to RDF converter for the PDSP Ki database and BindingDB.
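As an illustration of that data collection step, a selection like the one below could be run from JavaScript against a SPARQL endpoint. The endpoint URL, prefixes and property names are placeholders, not the actual query used in this project; only the fetch API and the SPARQL JSON results format are standard.

```js
// Illustrative sketch: endpoint, prefixes and properties are placeholders.
const endpoint = 'https://example.org/chembl/sparql'

const query = `
  PREFIX ex: <http://example.org/binding-vocabulary#>
  SELECT ?smiles ?ki WHERE {
    ?activity ex:molecule ?molecule ;
              ex:target <http://example.org/receptor/5ht2a> ;
              ex:ki ?ki .
    ?molecule ex:canonicalSmiles ?smiles .
  }`

async function fetchBindingData () {
  const res = await fetch(endpoint + '?query=' + encodeURIComponent(query), {
    headers: { accept: 'application/sparql-results+json' }
  })
  const json = await res.json()

  // flatten the SPARQL JSON result bindings into plain objects
  return json.results.bindings.map(row => ({
    smiles: row.smiles.value,
    ki: parseFloat(row.ki.value)
  }))
}

fetchBindingData().then(rows => console.log(rows.length))
```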

Feeding the neural network

Another problem, which wasn’t solved at that time, was how to feed the neural network with the molecules. The data structure of a molecule is a graph. How can you feed a graph to a neural network? I thought about the method used to feed text to neural networks: each character or token gets its own input neuron. Text representations of molecules could be split into tokens, even more specific than single ASCII characters. So I had a look at different representations of molecules.

One candidate was InChI. InChI is canonical, which allows easy lookups in databases. One big drawback is the structure of the format: there are different layers, one after the other, and the information of the different layers must be combined to build the molecule. That means the distance in the 1D text representation can be very big compared to the distance in the graph.

The next candidate was SMILES. SMILES doesn’t have that drawback of big distances in the representation. The format isn’t canonical out of the box, but there are algorithms to generate a canonical representation of any SMILES definition. This looks like a drawback, but it is actually very useful for training the neural network: there is only a limited amount of data, and alternative representations can be used for data augmentation. Open Babel has a feature to generate these alternative representations. The alternatives are also nice during testing: the output for alternative representations of a molecule should be the same if the network really understands SMILES, so using them in the test data allows verifying exactly that. Spoiler: after a while the outputs for the alternatives become very close!
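To make the token idea more concrete, here is a minimal sketch of a SMILES tokenizer and one-hot encoder. It is not the project’s actual parser, just an illustration of the general approach; the list of multi-character tokens is heavily simplified.

```js
// Simplified illustration, not the project's actual SMILES parser:
// multi-character atoms like "Cl" and "Br" become single tokens,
// everything else is treated as a one-character token.
const multiCharTokens = ['Cl', 'Br']

function tokenize (smiles) {
  const tokens = []
  let i = 0

  while (i < smiles.length) {
    const pair = smiles.slice(i, i + 2)

    if (multiCharTokens.includes(pair)) {
      tokens.push(pair)
      i += 2
    } else {
      tokens.push(smiles[i])
      i += 1
    }
  }

  return tokens
}

// one-hot encode a token sequence: one input neuron per possible token,
// all values are 0 except the one for the current token, which is 1
function oneHot (tokens, vocabulary) {
  return tokens.map(token => {
    const vector = new Array(vocabulary.length).fill(0)
    vector[vocabulary.indexOf(token)] = 1
    return vector
  })
}

// caffeine as an example molecule
const tokens = tokenize('CN1C=NC2=C1C(=O)N(C(=O)N2C)C')
const vocabulary = [...new Set(tokens)]

console.log(oneHot(tokens, vocabulary))
```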

Neural network model

The neural network model is very simple. There is an input neuron for every possible token. All input neurons are 0 except the one for the current token, which gets the value 1. Then all tokens are fed to the model in sequence. There is only one output neuron. At the end of the sequence its value should be 1 if the molecule binds and 0 if it doesn’t. With bigger datasets it could be possible to use the Ki value as output to make a more detailed prediction.
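The actual model was built with Keras in Python, and its exact layers are not part of this post. As a rough illustration of a model with this input and output shape, here is a sketch in TensorFlow.js; the LSTM layer and all hyperparameters are assumptions made for the example, not details taken from the project.

```js
// Rough sketch of a sequence model with one input neuron per token and a
// single output neuron; the LSTM layer and hyperparameters are assumptions.
const tf = require('@tensorflow/tfjs')

function buildModel (vocabularySize, maxSequenceLength) {
  const model = tf.sequential()

  // the sequence of one-hot token vectors is processed step by step
  model.add(tf.layers.lstm({
    units: 64,
    inputShape: [maxSequenceLength, vocabularySize]
  }))

  // one output neuron: close to 1 if the molecule binds, close to 0 if not
  model.add(tf.layers.dense({ units: 1, activation: 'sigmoid' }))

  model.compile({ optimizer: 'adam', loss: 'binaryCrossentropy' })

  return model
}

// vocabulary size and maximum sequence length are arbitrary example values
const model = buildModel(40, 120)
model.summary()
```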

Implementation

Contrary to most deep learning projects, I chose JavaScript for most of the code. Only a small part is written in Python using Keras. The decision was based on my other deep learning projects; I will write a blog post about it later.

For this project I developed some generic utils, which can also be used for other use cases.

Generic tools

  • Keras Gaia handles datasets and models for Keras in simple project definitions (Python project).
  • The SMILES parser and serializer is used to split SMILES strings into tokens. It also works the other way round.
  • The Open Babel command line wrapper is used to generate the canonical SMILES string and the alternatives.
  • nn-mapping maps the selected JSON data to neural-network-ready data.

The ligand binding code is also separated into sub-projects to make it more reusable.

Ligand binding

Results

The output of the test dataset varied for each receptor. Looking at the output of a single test molecule doesn’t allow making a prediction, but a look at all test molecules shows a clear pattern: sorting the molecules by the output value gives a much higher chance of finding a good candidate. Because of the lack of negative data, the test dataset is very small, but with rotated datasets the result could be reproduced.
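For illustration, ranking the test molecules could look like the snippet below, where predictions stands for a hypothetical array of objects with a SMILES string and the network’s output value.

```js
// `predictions` is a hypothetical array of { smiles, output } objects
const predictions = [
  { smiles: 'CCO', output: 0.12 },
  { smiles: 'c1ccccc1', output: 0.87 },
  { smiles: 'CC(=O)O', output: 0.45 }
]

// rank by the network's output, best candidates first
const ranked = predictions
  .slice()
  .sort((a, b) => b.output - a.output)

console.log(ranked.slice(0, 10)) // top candidates for further tests
```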

This diagram shows the sorted results for the 5ht2a receptor using 75 alternatives per molecule.

With these results I expect this method could be useful to check big datasets of molecules. The training for a single receptor took about 1 day on an Nvidia GTX 980, but the prediction is done in seconds. This allows selecting good candidates for further tests.

I expect it should also be possible to predict other properties of molecules. Toxicity could be a good candidate. With a very big dataset it could even be possible to predict the LD50 value.

Conclusion

Why did it take two years to publish the idea and release the code? As I already mentioned, I’m not a domain expert. The idea was so different from what I found on the Web that I wasn’t sure whether it was ahead of its time or just stupid. One year after I had the idea I talked to a colleague who had some experience in that field, and he encouraged me to test it. Other colleagues then asked me: Why hasn’t somebody else already implemented it? How well does it perform compared to existing solutions? The results looked pretty good, but I didn’t find any test dataset to benchmark it against other methods. Also, it was just a spare-time project at that time. All that delayed the release for such a long time. But now, after similar approaches have shown up, I’d like to say:

Don’t be afraid of ideas which are very different from the current state of the art, especially if you are working in the field of deep learning!

Parts of this project have now moved from my spare time to something we will support through the company I’m CTO of, Zazuko GmbH. We plan to use deep learning in other contexts as well, for example entity linking. If you need any support in deep learning and linked data, don’t hesitate to contact us!

I expect much better results with a database that contains many more entries about molecules which don’t bind to a receptor. But usually such data doesn’t get published. In general we will need much more negative data for machine learning. So I would also like to say:

Please, publish your failures!

In the meantime some other projects use SMILES to feed neural networks with molecule structures, or use autoencoders which generate molecules. A very similar project is described in the paper Generating Focussed Molecule Libraries for Drug Discovery with Recurrent Neural Networks.

Short comparison (mine, paper):

  • Property used to rate the molecule: Ki value / IC50
  • Database: BindingDB / ChEMBL
  • Data augmentation: Alternative SMILES using Open Babel / none

I also used the 5ht2a receptor for testing and tried the ChEMBL database alone without alternatives. The quality of the neural network should increase significantly with the bigger dataset and the alternatives.

Edit: I was pointed to the paper Learning to SMILE(S), which seems to be the first public paper about this method of feeding molecules to neural networks.

Third Hand with magnifier

I found this cool Third Hand from Hobby Creek, which does a great job of holding objects for soldering. It only lacks a magnifier. The stand of the Third Hand has some more holes which could be used to add another arm with a magnifier. A complete set was not available, so I built one myself. I bought this set of arms and a magnifier with a battery-driven LED light. The magnifier should be removable, so I used Polymorph thermoplastic to build a mount where the magnifier can be inserted. The thermoplastic was also used to attach the mount to the arm.

Quite simple, but very useful.

New blog, old downloads

New blog

I have many small projects which don’t have a real home on the web. My PHP code to manage my WebID still does a good job, but the blog software part would need a rewrite. Right now I don’t have time for this, so I chose Hexo. It’s implemented in JavaScript, generates static files and supports Markdown posts. Linked Data support would be cool, but let’s be pragmatic. I hope you like my adaptation of the theme.

Old downloads

Maybe you are still interested in software or source code which I hosted on my old page. Here is a copy of the download page:

Vantage SettingsEditor

A Qt based editor for settings files of Vantage satellite receivers.

bergwave

My experimental wavelet based video codec.

bergos

My IA32 research project.

bergfiltercollection

An AviSynth plugin which currently only includes a noise filter.

bergdiseqc

A ProgDVB Diseqc plugin.

bergphoto

The image/photo processing software is still hosted on www.bergphoto.org.