A Return to Machine Learning
This post is aimed at artists and other creative people who are interested in a survey of recent developments in machine learning research that intersect with art and culture. If you’ve been following ML research recently, you might find some of the experiments interesting but will want to skip most of the explanations.
The first AI that left me speechless was a chatbot named MegaHAL. I read a typical training session with its creator, Jason Hutchens:
User: Tu parles français?
MegaHAL: I GUESS SO. PARLEZ-VOUS FRANÇAIS?
User: In 1793 the french king was executed.
MegaHAL: HA HA HA! CORRECT. ALTHOUGH, EXECUTED HAS MULTIPLE MEANINGS.
User: The revolution started on July, 14.
MegaHAL: IT IS 14 DEGREES CELSIUS HERE
User: Another revolution was carried out by Lenin in Russia in the year 1917.
MegaHAL: WHILE LENIN READ A BOOK.
I was in awe.
It turns out MegaHAL was basically sleight of hand, picking a single word from your input and using a technique called Markov chains to iteratively guess the most likely words that would precede and follow based on a large corpus of example text (not unlike some Dada word games). But reading these transcripts in high school had a big effect on how I saw computers, and my interest in AI even affected where I applied to college.
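If you're curious, the heart of that trick fits in a few lines of Python. This is only a sketch of the forward half (MegaHAL also ran a second chain backwards from a keyword in your input), and the corpus filename is just a placeholder:

import random
from collections import defaultdict

# learn which word tends to follow each pair of words in the corpus
def train(words, order=2):
    model = defaultdict(list)
    for i in range(len(words) - order):
        model[tuple(words[i:i + order])].append(words[i + order])
    return model

# walk those statistics forward from a seed, picking likely continuations at random
def generate(model, seed, order=2, length=20):
    out = list(seed)
    for _ in range(length):
        options = model.get(tuple(out[-order:]))
        if not options:
            break
        out.append(random.choice(options))
    return ' '.join(out)

words = open('corpus.txt').read().split()  # placeholder training text
print(generate(train(words), seed=words[:2]))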
Halfway through undergrad, I started doing research at an AI lab. But I’d skipped most of my computer science classes to sit in on art and music classes, and in lab meetings I spent most of the time proposing ideas about computational creativity or automating online activity and identities. After a few too many of these interjections, the director sat me down after a meeting and said, “Kyle, I think you might be an artist.” It wasn’t a compliment, but I took it to heart and went on to do an MFA.
Since then I’ve found myself in a regular pattern: learn about a new tool or field of research, explore it conceptually and technically through small studies, and eventually craft new artwork that integrates recurring themes from my practice. Previously I’ve focused on tools like 3d scanning or face tracking, and most recently it’s been machine learning. It’s not my first time: besides the AI lab, I was building my own side projects, like programs that tried to understand and improvise around the rhythm you were clapping, or tried to finish drawings you started. But my tool of choice at the time (neural networks) wouldn’t scale to the situations I really wanted to explore. Fortunately, in the last few years machine learning research has been transformed, fueled by new kinds and scales of data, faster computers, new toolkits, and new communities. And neural nets are back in fashion.
This last year I’ve been getting back into machine learning and AI, rediscovering the things that drew me to it in the first place. I’m still in the “learning” and “small studies” phase that naturally precedes crafting any new artwork, and I wanted to share some of that process here. This is a fairly linear record of my path, but my hope is that this post is modular enough that anyone interested in a specific part can skip ahead and find something that gets them excited, too. I’ll cover some experiments with these general topics:
- Convolutional Neural Networks
- Recurrent Neural Networks
- Dimensionality Reduction and Visualization
- Autoencoders
If you’d rather hear and watch this as a video, there’s significant overlap with my OpenVis Conference talk of the same name (which includes live demos) and a smaller overlap with my Eyeo 2016 keynote (which has more of an emphasis on the nature of intelligence and creativity).
Convolutional Neural Networks
On June 16th, 2015 Reddit erupted in a furious debate over this image of a “puppyslug” (originally “dog-slug”), posted anonymously with the title “Image generated by a Convolutional Neural Network”. Machine learning researchers and hobbyists were quick to argue over whether it could possibly be the work of a neural net as claimed, or if it was some other algorithm, or even somehow handcrafted. The deleted comments littering the thread only added to the mystery of the image’s origins.
About a week later a Google Research blog post appeared titled “Inceptionism: Going Deeper into Neural Networks” with other similar images, effectively confirming the origin of the “puppyslug”. Soon after, Google released code called “Deep Dream” that anyone could use to recreate these images. (It’s worth mentioning that projects like Deep Dream are mostly a side effect of collecting curious engineers in the same location, and it’s hard to draw more general conclusions about the relationship of Deep Dream to Google’s mission or future technologies.)
While Deep Dream may have been the first moment a neural net captured the public’s imagination, neural nets have seen plenty of success in everyday life. Convolutional neural nets (CNNs) in particular have been used for reading checks since the 90s, powering recent search-by-image systems, and making culturally insensitive, biased guesses for automated image tagging.
Seeing all the applications for CNNs, my first step was to look for a toolkit written in a familiar programming language or framework. I found the ccv library, a computer vision library that shipped with a CNN implementation and a flexible license, and made a wrapper for openFrameworks to experiment in realtime (this was in early 2015). It was really exciting to see my laptop “understanding” things; it was totally beyond the basic color matching or feature tracking of the traditional computer vision I’m more familiar with. Even when it got things totally wrong, the uncanny labels provided inspiration. It made me think back to the famous story of AI pioneer Marvin Minsky assigning an MIT undergraduate student “computer vision” as a summer project in 1966, and I wondered what he thought of these most recent technologies, with all their successes and failures.
Another tool I found around this time was the Jetpac SDK. It’s an unusual toolkit in that it provides a mostly open source implementation of CNNs for a huge variety of platforms, including WebGL with JavaScript. Of course Google bought them out, and the library isn’t being developed anymore. It’s still one of the only JavaScript implementations of a fast CNN.
Building and training my own nets with ccv seemed a little daunting due to the lack of community or examples. Looking into more toolkits I learned the big contenders were Caffe, Theano, and Torch written in C++, Python, and Lua respectively. There were plenty of less recognized toolkits and wrappers written in Python, and even Caffe had a Python wrapper, so I decided to start practicing Python. (This turned out to be a good investment, with Google’s TensorFlow becoming the de facto standard for deep learning, and mainly being used via Python.)
The first thing I tried was tackling a problem I knew well: smile detection. Looking through the examples in Lasagne (a Theano-based deep learning toolkit), I modified a CNN example meant for handwritten digit recognition and turned it into a binary classification problem: instead of asking whether a small 28x28 pixel image (the standard size for MNIST) is a hand-drawn 0, 1, 2, and so on up to 9, I asked whether it was a smile (one) or not (zero). I trained on thousands of images and tested on a video of myself smiling. It worked correctly the first time, without any modifications!
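To give a sense of how small that change is, here’s a rough sketch in Lasagne. It isn’t the exact network I trained, just the general shape: the MNIST-style convolutional layers stay the same, and the ten-way softmax output becomes a single sigmoid unit trained with binary cross-entropy:

import theano
import theano.tensor as T
import lasagne

input_var = T.tensor4('inputs')    # batches of 1x28x28 grayscale face crops
target_var = T.matrix('targets')   # 1 for smile, 0 for not

network = lasagne.layers.InputLayer((None, 1, 28, 28), input_var=input_var)
network = lasagne.layers.Conv2DLayer(network, num_filters=32, filter_size=(5, 5))
network = lasagne.layers.MaxPool2DLayer(network, pool_size=(2, 2))
network = lasagne.layers.Conv2DLayer(network, num_filters=32, filter_size=(5, 5))
network = lasagne.layers.MaxPool2DLayer(network, pool_size=(2, 2))
network = lasagne.layers.DenseLayer(network, num_units=256)
# the only real difference from the digits example: one sigmoid output instead of ten classes
network = lasagne.layers.DenseLayer(network, num_units=1,
                                    nonlinearity=lasagne.nonlinearities.sigmoid)

prediction = lasagne.layers.get_output(network)
loss = lasagne.objectives.binary_crossentropy(prediction, target_var).mean()
params = lasagne.layers.get_all_params(network, trainable=True)
updates = lasagne.updates.nesterov_momentum(loss, params, learning_rate=0.01, momentum=0.9)

train_fn = theano.function([input_var, target_var], loss, updates=updates)
predict_fn = theano.function([input_var], prediction)
# loop over minibatches of images and labels, calling train_fn(images, labels)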
I could even “watch” it learn by observing the classification changing over time on a video of myself smiling twice. At the top there’s just white noise: it doesn’t know what’s a smile and what’s not, so its predictions are random. But very quickly it assumes everything is a non-smile (“zero”, or black, presumably because the data set has more examples of non-smiles, making neutral a safe bet), then slowly the two smiles start to fade in, with the uncertainty at the start and stop of each smile being the last thing to resolve.
This kind of net could be very useful for real time tracking, except that most toolkits for working with CNNs are designed for offline batch-processing and have a lot of overhead for individual frames of data. For something as slow as a smile that could still work, but for something faster, like face tracking, it’s not responsive enough. Most of my work is interactive, so this made me question the suitability of CNNs for new work, but my interest was piqued again when Deep Dream appeared. First, a quick explanation of what’s going on behind the scenes.
CNNs are based on simultaneously learning (roughly) two different things: image patches that help discern categories, and which combinations of these patches make up a given category. First the net detects things like edges and spots, then a combination of an edge and a dark spot that might make up an eyebrow and an eye, and finally the net will recognize that an eyebrow, eye, and a nose (detected separately) make up a face. I say “roughly” because sometimes it’s hard to disentangle where a net is detecting “features” and where it’s detecting “combinations”, especially with recent research.
The Deep Dream images are based on reversing this process. For example, let’s say you start with a picture of a squirrel, and a network that’s been trained to detect 1000 different categories of objects based on a database called ImageNet containing 1.2 million example images. Deep Dream first runs the squirrel image through the net and identifies what kinds of activity are happening: are there edges? Spots? Eyes? Once the kinds of activity are identified, Deep Dream modifies the original image in a way that amplifies that activity. So if you have some vaguely eye-like shapes, they start really looking like eyes. Or if you have a vaguely dog-face-like shape, it turns into a dog (this happens often, because many of the 1000 categories happen to be different dog breeds).
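Here’s a minimal sketch of that amplification step, written with PyTorch and a stock pretrained VGG network rather than the original Caffe code; the image path and layer index are placeholders, and the real Deep Dream adds tricks like jitter and processing at multiple scales. The core idea is just gradient ascent on the pixels:

import torch
import torchvision
from torchvision import transforms
from PIL import Image

# a pretrained classifier, frozen; we only modify the input image
vgg = torchvision.models.vgg16(pretrained=True).features.eval()
for p in vgg.parameters():
    p.requires_grad_(False)

preprocess = transforms.Compose([transforms.Resize(384), transforms.ToTensor()])
img = preprocess(Image.open('squirrel.jpg')).unsqueeze(0)  # placeholder image
img.requires_grad_(True)

layer = 20  # an arbitrary mid-level convolutional layer
optimizer = torch.optim.Adam([img], lr=0.02)

for step in range(100):
    optimizer.zero_grad()
    activation = vgg[:layer + 1](img)
    # "amplify whatever is already happening": push the layer's activations higher
    loss = -activation.norm()
    loss.backward()
    optimizer.step()
    img.data.clamp_(0, 1)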
Nets trained on ImageNet have also been re-trained to classify other sets of images too: flower varieties, gender and age, or kinds of places. Researchers train and share these models with each other.
When Google released code implementing Deep Dream, I started exploring by applying it to a large number of images — from classics like Man Ray or Michelangelo to my personal collection of glitch images — testing with different settings, or networks trained on different categories. While it was difficult installing the toolkit that Deep Dream uses, making small modifications like processing multiple images to produce an animation was much easier.
One of the surprising things about these animations was that Deep Dream produced relatively similar results from frame to frame. In some older papers discussing non-photorealistic rendering techniques for video, this kind of stability can be the main focus. Other people working with Deep Dream have used additional processing, like frame blending or optical flow, to achieve different kinds of stability.
In Google’s original blog post about Deep Dream, they show some incredible class (or category) visualization images: demonstrating, for example, that a network’s “concept” of a dumbbell is incomplete without an attached arm. While some previous research had shown similar class visualizations, these were some of the clearest images yet.
To understand this concept better, I modified the code to produce class visualizations for every class, and some minor variations like large image fields consisting of only one object. I tested a few more techniques like optimizing for high-level abstractions in a small image, then zooming in and optimizing for lower-level visual features, but most of the images that emerged were very similar to the original Deep Dream images.
Instead of using a network trained on ImageNet, we can produce class visualizations for a network trained on satellite imagery: Terrapattern, a visual search tool for satellite imagery I worked on with Golan Levin, David Newbury, and others, has recognizable classes like “cabin”, “swimming pool” and “cemetery”.
About two months after Google’s “Inceptionism” post, researchers from the University of Tübingen in Germany posted “A Neural Algorithm of Artistic Style” to arXiv, the open-access service where most public research in computer science is shared. As the paper was released without an official implementation to accompany it, many people released their own version of the technique over the next few days, with varying degrees of quality. I discuss this moment and compare implementations in an article titled “Comparing Artificial Artists”.
In the paper, they show how to imitate an “artistic style” when rendering a photo, using a neural net. It looks impossible. Like the sort of thing that should require carefully trained humans who have undergone years of study and practice. It shouldn’t be so easy for fully automated computer programs. My first experiment was to try something harder: reversing the technique. I attempted to remove the painterly “filter” from a few Vincent Van Gogh landscapes by asking the net to render them in the “style” of a landscape photo.
Results were mixed and uninspiring, so I posted a too-perfect hoax instead and continued with other experiments. As with Deep Dream, I learned the most by processing a huge combination of different images from Western art history, plus a few carefully tuned portraits. While producing a Deep Dream image took only one minute to process on my laptop, style transfer took closer to five. Some people used cloud computing systems like Amazon’s AWS to speed up the rendering, but I was very fortunate to have some friends in Japan who let me remotely use a fast computer for rendering.
After looking through all the renders, I learned that “style transfer” really meant something more like “texture transfer”, and decided to feed 40 textures through in every possible combination.
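The reason it behaves more like texture transfer is that the “style” half of the loss only compares Gram matrices: correlations between the channels of a CNN layer, averaged over every position in the image, which throws away all spatial arrangement and keeps only texture statistics. A minimal sketch of that statistic, here in PyTorch:

import torch
import torch.nn.functional as F

def gram_matrix(features):
    # features: (batch, channels, height, width) activations from one CNN layer
    b, c, h, w = features.shape
    f = features.view(b, c, h * w)
    # channel-to-channel correlations, pooled over positions: "what" without "where"
    return f @ f.transpose(1, 2) / (c * h * w)

def style_loss(generated_features, style_features):
    return F.mse_loss(gram_matrix(generated_features), gram_matrix(style_features))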
Also like Deep Dream, I made some small modifications that allowed me to render animations, and eventually a longer video (NSFW). My favorite results came from a study where an image is initially rendered in a cubist style, and over the course of the animation turns into an impressionist style with a similar palette. I have an unfinished study that uses this technique with a photo of Rouen Cathedral, styled in every different variation that Monet painted.
Since my initial explorations with style transfer, a new paper has been published that accomplishes some of the things I felt were lacking: “Combining Markov Random Fields and Convolutional Neural Networks for Image Synthesis” (2016) by Li et al. uses a patch-based reconstruction algorithm to imitate style.
Alex J. Champandard (who has a Twitter bot dedicated to style transfer) has expanded this technique to allow explicit control over the patch source and destination regions, adding additional user-guided constraints for reconfiguring existing imagery.
One of my favorite things about techniques like Deep Dream and style transfer is that they visually communicate an intuition for some very complicated topics. Not many people understand how a neural net works from front to back, including all the justifications for the code, architectural and mathematical choices that went into the system. But everyone can look at a “puppyslug” or a fake Van Gogh and have some intuition about what’s happening behind the scenes.
I’m also inspired by the visceral response some people have to Deep Dream, and the disbelief that style transfer inspires. While the work of the algorists might draw on the nature of computation, Deep Dream seems to poke at something deep in our visual perception.
Recurrent Neural Networks
One of the most interesting things about neural nets is that it’s easy to develop an intuition for manipulating them once you understand the basic concepts. A basic neural net is just addition and multiplication: taking weighted sums iteratively (across a layer), and recursively (through multiple layers). The training process is just about figuring out which weights will give you the answers you’re expecting, and then when you run new data through the net it should give you similarly correct answers.
One modification of this setup is to take the output of the net and feed it back into the input (along with the previous state of the network). This is called a recurrent neural network (RNN). While most neural networks have a fixed architecture which means the input and output size doesn’t change, an RNN can be useful for modeling sequences of variable length. Examples of sequences include the pen movement from cursive handwriting, a sequence of characters in text, notes in music, or temperatures from weather.
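The whole update fits in a few lines. Here’s a sketch of a single step of a plain “vanilla” RNN in numpy, with made-up sizes rather than anything tuned for real use:

import numpy as np

def rnn_step(x, h_prev, W_xh, W_hh, W_hy, b_h, b_y):
    # the new hidden state mixes the current input with the previous hidden state
    h = np.tanh(W_xh @ x + W_hh @ h_prev + b_h)
    # the output is a score for each possible next symbol
    y = W_hy @ h + b_y
    return h, y

vocab_size, hidden_size = 65, 128
rng = np.random.default_rng(0)
W_xh = rng.normal(scale=0.01, size=(hidden_size, vocab_size))
W_hh = rng.normal(scale=0.01, size=(hidden_size, hidden_size))
W_hy = rng.normal(scale=0.01, size=(vocab_size, hidden_size))
b_h, b_y = np.zeros(hidden_size), np.zeros(vocab_size)

h = np.zeros(hidden_size)            # the network's "memory" between steps
x = np.zeros(vocab_size); x[13] = 1  # a one-hot encoded character
h, y = rnn_step(x, h, W_xh, W_hh, W_hy, b_h, b_y)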
What Deep Dream is to images, an article called The Unreasonable Effectiveness of Recurrent Neural Networks is to text. This article by researcher Andrej Karpathy goes into exquisite detail explaining how RNNs work and what they’re capable of, and it’s accompanied by a toolkit called “char-rnn” (recently superseded by “torch-rnn”) that lets readers follow along and experiment with their own text.
Andrej gives a number of compelling examples of RNNs learning to generate novel text from a few megabytes of example text. For example, after feeding the collected works of Shakespeare: “So drop upon your lordship’s head, and your opinion / Shall be against your honour.” or Wikipedia: “Naturalism and decision for the majority of Arab countries’ capitalide was grounded by the Irish language by [[John Clair]]” or even Linux kernel code:
static void do_command(struct seq_file *m, void *v)
{
    int column = 32 << (cmd[2] & 0x80);
    if (state)
        cmd = (int)(int_state ^ (in_8(&ch->ch_flags) & Cmd) ? 2 : 1);
    else
        seq = 1;
    for (i = 0; i < 16; i++) {
        if (k & (1 << 1))
            pipe = (in_use & UMXTHREAD_UNCCA) +
                ((count & 0x00000000fffffff8) & 0x000000f) << 8;
...
If you are familiar with the Markov chain text generation technique used by MegaHAL, you’ll notice that the RNN has a surprising ability to balance syntactic correctness with novel text. Usually you have to make a tradeoff between copying your source material and producing incorrect output, but RNNs find a middle ground by capturing a deeper structure than Markov chains can.
char-rnn belongs in the general category of “generative models”. Generative models aren’t trying to predict anything per se, or make a judgement or classify something (though they can be engineered to do so). Generative models are relatively poorly studied, partially because justifying their output or quantifying its value is difficult. But for people who are interested in the creative applications of new technologies, generative models are one of the most interesting topics. So naturally, there are tons of experiments using char-rnn, and comments on Andrej’s blog post capture a small slice of them.
Before my first experiments with char-rnn, I posted a shared document on Twitter with some ideas asking for contributions.
My first experiments were very similar to Andrej’s. I just found any text I could find that was longer than one million characters (one megabyte) and tried cleaning it up to make sure certain characters could uniquely trigger different states in the RNN, like having a tab separate words from their definitions when feeding in dictionary entries:
yelp-conserving adj. fool relatively at the open.glamorous adj. 1 of conquering a lady. 2 clever or belt in a loss of management. n. swate of luck. [french from latin lego fall]veton n. 1 rope (not lowest) to heal! after its tonacht. 2 person bell up yellow dusibly test.glandular adj. of or showing hindus. n. (attrib.) energy, protest. [from *geneal]viceroy n. 1 a specialist pocket. b condition of reference and state of my formal prices, esp. in warfare. 2 colloq. a jockey and word.
char-rnn captures some basic structure, like a part of speech following a word, and numbers coming sequentially. Sometimes it provides a correct inflection, or vaguely topical definition. My favorite definitions are the result of feeding in “seed” text that primes the network before it starts generating, effectively defining whatever word you provide. Now there’s even a Twitter bot that will define words like this for you.
After trying the dictionary, it was clear that the most entertaining results came from the moments where the algorithm appeared to have agency or intention, or even direct suggestions for the reader. So the next experiments were with old books of manners, and lists of “useful phrases”:
family of her morning would not be her people propise the pabace of one come of the . theater or her partner, but every things, their way possible that day and hostess . agreeable at the invitation to her own inflience and even goy, which may . arrive done by a pault flatteres in the costume who round it would be rooms .
(The books of manners didn’t have quite enough content to teach the network how to spell correctly.)
meaningless joy. endeavoring to his ond envy. phantom and ripple. obmension and delight. perminent love. repulsive and definite. prowess and conceit. energy and enforcing. an incomparable general compromises. almost innocent theme. dreaming the subject you have the air. natural and thereforeed. fortunate, some thing hold of laughter. a mark of impertinent night.
The “useful phrases” were mostly simple mashups that could have been computed much more simply, but something about the typos and incomplete or imaginary words gives the output an almost Sapphic vibe. To more explicitly produce poetry I created a huge list of 12k phrases and traversed it by chaining together words appearing in consecutive sentences, ending with:
love in its honesty of enough.
a love of secpent obscure.
the love was like a soul of a shower.
surpassing with love and ered and apprehensive strange.
the subject of a resourceful love is which i think it here.
a lovely characteristic facts.
flippant love.
flippant with every memory of the acknowledgment.
the moon and memory of her bright as the grave.
the moon of the stars of the soul.
This is starting to feel like it’s getting somewhere. Hoping that a more personal source material would reveal new properties of the algorithm, I fed it a few years of chat logs with Lauren McCarthy:
lauren: so
lauren: did u sieve anymore?
kyle: yeah we only hear it out sense that you were kind of because it could be last year
kyle: so he was quickly saying that might be realtime
lauren: mm i gues it would ever want to trash support of general
kyle: :)
kyle: his once weird mids
lauren: ya
lauren: https://github.com/122725785_iram
kyle: https://twitter.com/nyus/Recisageles/5ubPmer
The lines are the right length, the smiley feels well placed, I write “yeah” and Lauren “ya”. The urls aren’t real, but they look exactly like the kind of links we would send to each other. The output seems to occupy the liminal space between modes of awareness, between dreaming and waking attention. It has the look and feel of the thing without having the content or meaning.
To compare with another kind of dialogue, I tried the Cornell Movie Dialogue Corpus, which has around 220k exchanges.
the numbers are two kilodes!
thank heaven!
i’m going to kill me.
yeah nickels are gonna go inside. the captain restored you!
different. she was not alright — you’re such great artistic suicide — she’s getting a great knowledge coming and playing ginning from chilor numbers and angel three teachers for crank deeds, he was afraid not to stand up on your heart.
break.
pig bill?
affirmative, mrs. lampert.
sorry, mr. ceusing is very bored and from deceitful work and iena — and we are trying to help you. you’ve seen me side.
is it no better, i’ll spread your twins and pull the roof. somewhere else — come to my lawyer. on your life! who said that?!
It’s almost completely nonsensical, but sometimes it’s hard to tell if it’s the limitations of my hardware, or my technique. More expensive graphics cards allow you to train a bigger model, which can increase the verisimilitude of the output, but I’m usually limited by my laptop or a friend’s machine. Soon after these experiments, Google also published some results from training on movie subtitles:
Human: What is the definition of altruism?
Machine: If you don’t believe in god, then you don’t know.
Human: OK... so what is the definition of morality?
Machine: Well, the truth is, you’re not a believer in god almighty.
Human: Tell me the definition of morality, I am quite upset now!
Machine: I’m not ashamed of being a philosopher!
I’m sure their model was trained for much longer, but the call-and-response adds a lot to the impression of intentionality.
Another hand-crafted small dataset I tried was a 25kb collection of pickup lines:
How was the into the your beautiful getting to hand Bibl a lost you. Lomed you a love everytime I was in your daticn? When your name you mange to are a stars with a heart, because I’m milling in a on the say canttst trece, you’re so falling have a happile a friends in when haver sees with my most nrom.
Again, a lot of nonsense but it has the vibe of pickup lines: “you’re so”, “stars with a heart”, “beautiful”, “love everytime”, etc. How long until a computer will be more effective than a human at wooing someone?
One interesting trick that helped with some of these smaller data sets was to start by training char-rnn on a big collection of English text, like Wikipedia or the movie dialogue corpus. Once that model has almost converged, it’s possible to point it at the material you really want it to learn, now that it has a basic model for English spelling and grammar.
The last really big text dataset I tried is based on 20k drug experiences reported on Erowid (I uploaded the scraped reports here).
Behind myself and other spiritual planes remained for what later lasted for about a year and a half. I probably would have probably died and stimulated ecstacy to find out if I was going to die before I do not feel extremely euphoric such a feeling of great sensory, along with a strong sense of empathy with some negative aspects of my mind. This definitely comes out 80–150–2001 compared to my use and speed but was beginning to feel more anxious. I have been able to function for a few years of form in the process of waking up and my headache continued to take again. There were these thoughts from a desire to drive a few times.
With such a big collection of text, the output is much more grammatically correct with proper spelling and punctuation. Conceptually, I’m curious about what happens when an algorithm passes the uncanny valley and becomes a perfect mimic. If humans were unable to distinguish the generated drug experience from a real one, the machine would become a sort of philosophical zombie: an entity that appears to be something that it isn’t, something it could never be.
Emoji
As mentioned above, an RNN doesn’t have to stop at language. Anything that can be encoded as a sequence of symbols is perfect fodder for char-rnn. For example, an excerpt from a vector graphic SVG file typically looks like:
<path id="path34" style="fill:#ffac33;fill-opacity:1;fill-rule:nonzero;stroke:none" d="M 0,0 C 0,4.9 4,9 4,9 4,9 8,4.9 8,0 8,-4.9 6.2,-9 4,-9 1.7,-9 0,-4.9 0,0"/></g><g transform="translate(12.7,15.9)" id="g36"><path id="path38" style="fill:#553788;fill-opacity:1;fill-rule:nonzero;stroke:none" d="m 0,0 c 0.7,-0.7 1.2,-1.7 1.2,-2.9 0,-2.2 -1.7,-4 -4,-4 -1.1,0 -2.1,0.5 -2.9,1.2 0.3,-2.4 2.4,-4.2 4.9,-4.2 2.7,0 5,2.2 5,5 C 4.2,-2.4 2.4,-0.3 0,0"/></g><g transform="translate(28,9)" id="g40">...
This is just a sequence of characters, and we can learn the proper order of these characters with enough examples. The above example comes from Twemoji, the emoji library provided by Twitter. In total Twemoji contains 875 graphics, around 3kb each — or about 2MB total after snapping the points to a lower resolution and removing the headers that are the same between all the files.
Instead of preparing one big data dump of chat logs, or dictionary definitions, or drug trips, I prepared a file that has one emoji SVG file per line, starting with the name of the emoji. This means that when char-rnn outputs new emoji, it also outputs a name for each symbol. This one is titled “CLOCK FACE NINE”, which is almost a real emoji name:
And here’s a bigger collection to show the variety in the output.
It’s amazing to see how the big circles that go along with faces consistently find their way into the mix, and other barely recognizable shapes are scattered throughout. The colors are the most consistent, as Twemoji uses a restricted palette and char-rnn learns to memorize it. I’m working on a series of prints called “Innards”, a triptych built from these emoji, inspired by Roger Coqart.
Another awesome project along these lines is “SMILING FACE WITHFACE” by Allison Parrish, where she deconstructs and rearranges Twemoji for posting to Tumblr as surreal conversations or psychedelic cultural artifacts. I see her work as injecting chaos into the emoji, while these char-rnn experiments are more about learning and rebuilding the structure from the ground up, both approaches revealing different perspectives.
Chess
Another encoded representation I explored is the chess game. One notation system for chess is called Portable Game Notation, and there are huge databases with hundreds of millions of games ready for download. In collaboration with Charlotte Stiles, we downloaded around 30MB of games played by humans, and prepared them for char-rnn by formatting them like this:
Na3 Nc6
Nb5 e5
h4 d5
d3 a6
Nc3 Bb4
a3 Ba5
b4 Bb6
e3 Be6
...
That came to almost 4 million rows of moves. After training for a few hours, we could produce new games that involved dozens of moves without copying directly from the database. Usually the openings were copies, but this is true for most chess games anyway:
Charlotte did some analysis of hundreds of generated games to determine whether the net was really learning anything or just copying old games, and noticed something very interesting: on average, up to 3 consecutive moves might be copied from a training game, but it would take up to 9 moves until the net started making errors (illegal moves). In general the average game was much longer, around 29 moves, but almost none of these were 100% valid games.
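Checking validity like this is easy to automate. Here’s a sketch of that kind of check using the python-chess library (not necessarily what Charlotte used): play each generated move on a board and stop at the first one that’s illegal or unparsable:

import chess

def count_legal_moves(moves):
    board = chess.Board()
    for i, san in enumerate(moves):
        try:
            board.push_san(san)  # raises ValueError for an illegal or malformed move
        except ValueError:
            return i
    return len(moves)

# a generated game is just a whitespace-separated list of moves in standard notation
game = 'd4 Nf6 c4 g6 Nc3 d5 cxd5 Nxd5'.split()
print(count_legal_moves(game), 'legal moves before the first error')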
The longest valid game we saw, after generating a few hundred games, was this one in the animation above:
d4 Nf6
c4 g6
Nc3 d5
cxd5 Nxd5
e4 Nxc3
bxc3 Bg7
Nf3 O-O
Bb5+ c6
Bd3 c5
O-O cxd4
cxd4 Nc6
Bb2 Bg4
Rc1 Rc8
Re1 Rc7
Qc2
Another kind of sequence is audio: a sequence of samples. The GRUV Toolkit is a student project using a technique similar to char-rnn but with a focus on music. After applying it to the amen break, I got results that mostly sound like noise and memorization of the input data:
Even after a lot of tweaking, like trying different sample rates or learning the audio backwards (it should be easier to predict an onset when you see it coming than when it comes out of nowhere), the best results were just noisy copies of the training data.
Very recently, researchers from DeepMind have (in theory) solved the audio generation problem with their WaveNet architecture. Unfortunately it still takes a few minutes to produce one second of audio, so whether we’re trying to reproduce speech or music the architecture isn’t quite ready for realtime. My dream of creating a new episode of “This American Life” from scratch would take at least 5 days just to render, and that’s not including the training time.
I’m keeping a small playlist of other people who are posting examples of music and machine learning here, but most of the results, whether on raw audio or symbolic representations, still don’t quite approach the uncanny accuracy of char-rnn on some English text. My favorite examples of RNN-based symbolic music composition right now come from this post by Daniel Johnson.
Some other experiments that still need more work:
- After scraping 5000 shaders from ShaderToy and minifying the results, it comes out to 7MB of data. After training and reading through the generated output, there’s a good chance that many of them compile, but a very small chance that they do anything visual.
- Working with Charlotte again, we scraped almost half a million Tinder profile descriptions and pictures. From “Laid back and love to smoke” to “I actually just want to meet people with the same interests as me…”. There’s something magical about this data, and the mixture of sincerity and trepidation with which people approach dating, but it feels almost too personal, and too easy to take advantage of from a distance. Generating fake pictures and profiles would only be good for a sophomoric chuckle, and besides, Ashley Madison did this already.
Finally, recurrent nets get really interesting when combined with images. Combining CNNs with RNNs, you get automatic image captioning, a field that experienced multiple breakthroughs in 2015.
Or you can run it in reverse, and generate images from captions.
There are even some early results for generating entire videos from text.
Dimensionality Reduction and Visualization
“That’s another thing we’ve learned from your Nation,” said Mein Herr, “map-making. But we’ve carried it much further than you. What do you consider the largest map that would be really useful?”
“About six inches to the mile.”
“Only six inches!” exclaimed Mein Herr. “We very soon got to six yards to the mile. Then we tried a hundred yards to the mile. And then came the grandest idea of all! We actually made a map of the country, on the scale of a mile to the mile!”
This excerpt from “Sylvie and Bruno Concluded” by Lewis Carroll (and similarly, “On Exactitude in Science” by Jorge Luis Borges) gets to the core of one application of machine learning: our desire to create abstractions as an aid in navigating unfamiliar terrain.
One way to think about making abstractions is called dimensionality reduction. For example, a small 28x28 pixel image is made of 784 numbers (dimensions) but if each image contains a single handwritten digit, a more useful representation (or “embedding”) might be 10-dimensional: so for a handwritten three we might get [0,0,0,0.9,0,0.1,0,0,0,0]. The 0.9 means it’s “mostly a three” and the 0.1 means “it looks a little like a five” (which is common).
10 dimensions might be useful for picking categories, but it’s hard to visualize a 10 dimensional space. So we can take this farther and try to embed into two or three dimensions, and use the results for drawing scatter plots. When you only have a few dimensions to work with, the way you want to use these dimensions is different. Instead of having one dimension representing the numbers 0–4 and the other for the numbers 5–9, for example, it might make more sense to have one dimension represent how bold the number is, and have the other dimension represent how slanted the number is.
Different dimensionality reduction algorithms create different kinds of abstractions, and these differences are most apparent when you only have a few dimensions. One of my favorite algorithms is called t-SNE (pronounced “tee-snee”). It tries to keep very similar data points very close to each other, but doesn’t worry too much about data points that are different. There’s an excellent interactive visualization of t-SNE from science.ai, and a thorough explanation of t-SNE compared to some other dimensionality reduction techniques from Chris Olah. But check out this image first to build some visual intuition for t-SNE.
t-SNE captures structure at multiple scales. At the largest scale, it’s placed different digits in different clusters. At a smaller scale, you can see patterns in how slanted the handwriting is, and gradients in stroke weight. In the center, it has the hardest time separating threes from fives and eights, which all look similar. This is impressive because the algorithm doesn’t know which digits are which or what’s important about their shapes; it’s just grouping them based on how they look (the image at the bottom right is only provided as a reference to show the categories).
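Running this kind of experiment only takes a few lines with scikit-learn. Here’s a sketch using the small built-in digits dataset as a stand-in for MNIST:

import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

digits = load_digits()
X = digits.data / 16.0  # each 8x8 image flattened into 64 numbers

# embed the 64-dimensional points into 2 dimensions
embedding = TSNE(n_components=2, perplexity=30, init='pca', random_state=0).fit_transform(X)

# color by the true digit label, which t-SNE itself never sees
plt.scatter(embedding[:, 0], embedding[:, 1], c=digits.target, s=4, cmap='tab10')
plt.savefig('digits-tsne.png', dpi=150)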
I ran t-SNE on a dataset from Golan Levin and David Newbury: sketches contributed by people all around the world for the Moon Drawings project. After running t-SNE to create a point cloud like the one above, I snapped all the points to a grid (you can formulate this grid-snapping in terms of the assignment problem).
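Here’s a sketch of that grid-snapping step: build a cost matrix of distances between the embedded points and the cells of a square grid, then let scipy’s linear_sum_assignment solve the assignment problem so each point gets its own cell:

import numpy as np
from scipy.spatial.distance import cdist
from scipy.optimize import linear_sum_assignment

def snap_to_grid(points):
    # points: (n, 2) t-SNE output, normalized here to the unit square
    p = (points - points.min(0)) / np.ptp(points, 0)
    side = int(np.ceil(np.sqrt(len(points))))
    grid = np.array([(x, y) for y in range(side) for x in range(side)], dtype=float)
    grid /= max(side - 1, 1)
    # cost of moving each point to each grid cell
    cost = cdist(p, grid) ** 2
    rows, cols = linear_sum_assignment(cost)  # one unique cell per point
    return grid[cols]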
Here is an excerpt from the right side that shows some of the logic in the layout. Some structure is easy to see, like the darkest content at the top right, but some structure is very subtle, like the two question marks near the bottom right that ended up next to each other, and all the hearts in the same region.
word2vec
Some of the most interesting data doesn’t have a clear numeric representation, but sometimes there are techniques for extracting a numeric representation from context. word2vec is one algorithm for assigning a set of numbers to individual words. It looks at what context a word usually occurs in, and uses the context to determine which words are similar or dissimilar. word2vec might be trained on hundreds of thousands of unique words scattered throughout millions of news articles, and return 300 numbers for each word. These numbers don’t have clear interpretations, but if you treat each set of numbers as a high-dimensional vector you can do basic comparisons and arithmetic between them. This means you can look at distances (more similar words have a smaller distance between them) and do analogies (the closest vector to “Paris minus France plus Japan” is “Tokyo”). So even though each dimension is not clearly interpretable, the general direction and location of each vector encodes a meaning.
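With a pretrained model these comparisons are one-liners. Here’s a sketch using gensim and the downloadable Google News vectors; the exact neighbors you get depend on the model you load:

import gensim.downloader

# 300 numbers per word, trained on a large news corpus (a sizable download)
model = gensim.downloader.load('word2vec-google-news-300')

# similarity: related words have closer vectors
print(model.similarity('Monday', 'Tuesday'))
print(model.similarity('Monday', 'Saturday'))

# analogy: the vectors closest to Paris - France + Japan (expect something like Tokyo)
print(model.most_similar(positive=['Paris', 'Japan'], negative=['France'], topn=3))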
word2vec represents Monday-Friday being similar to each other, but different from Saturday and Sunday. Friday seems to even have a little overlap with the weekend. The months of the year are roughly divided into March through July and August through February, but consecutive months are generally more similar than distant months. The digit 0 is completely independent from all the other digits (probably an error in the data) but the other digits are more similar to their neighbors than to distant digits.
Now with a vector for every word, we can take any list of words and run it through t-SNE. Here’s what happens when you use a list of 750 moods I compiled from multiple sources:
The colors come from embedding the same vectors in 3d (i.e., RGB) instead of 2d. In theory, 3d should give the data more room to “spread out”, and more clearly show us where the boundaries are even when 2d doesn’t have enough space to show those boundaries.
Here’s a closeup of one especially interesting region:
Some obvious pairings are given like “happy” and “glad”, or “hesitant”, “leery” and “cautious”, or “eager” and “anxious”. But other connections are surprising, because while they occupy the same general category, they have opposite valence: “doubtful/hopeful”, and “discouraged/encouraged” are two examples above.
We can also focus on these antonyms specifically, and try to understand which antonym relationships are similar to others. For example, is “forward/backward” more similar to “happy/sad” or “future/past”? This probably varies from language to language, and reminds me of questions raised in Metaphors We Live By.
With the antonyms it’s not as clear what the relationship between nearby pairs are. Sometimes it’s easy to interpret, like “unintelligent/intelligent” being close to “uninteresting/interesting” and other “un-” antonyms. The one or two clusters above seem fairly compact, but it’s very difficult to interpret what they all have in common. One lesson might be that antonyms capture many different kinds of relationships, and that there’s no single relationship encoded in language as “opposite-ness”.
Another way to turn text into numbers is a technique called Latent Dirichlet Allocation (or LDA, not to be confused with Linear Discriminant Analysis, another dimensionality reduction technique). LDA looks at a bunch of “documents” (usually anything from a paragraph to a page), and tries to describe each document as a mixture of “topics”. A topic is essentially a set of words. When LDA sees those words it knows that topic is present to some degree. So if you give LDA a bunch of news articles and ask for 10 topics, one topic might contain words like “pitcher”, “football”, “goalie”, and we would call it “Sports”. Another topic might contain “Obama”, “Merkel”, “Sanders”, and we would call it “Politics”. To LDA, a document is basically a bag of words handpicked from a few topical bags of words.
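Here’s a sketch of that kind of pipeline with scikit-learn, treating arbitrary chunks of a text file as “documents”; the filename and the crude page-sized chunking are placeholders:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

text = open('novel.txt').read()  # placeholder plain-text source
pages = [text[i:i + 2000] for i in range(0, len(text), 2000)]  # rough stand-in for "pages"

vectorizer = CountVectorizer(max_features=5000, stop_words='english')
counts = vectorizer.fit_transform(pages)

lda = LatentDirichletAllocation(n_components=10, random_state=0)
topic_vectors = lda.fit_transform(counts)  # one 10-dimensional topic mixture per page

# the most significant words for each topic
vocab = vectorizer.get_feature_names_out()
for i, topic in enumerate(lda.components_):
    top = [vocab[j] for j in topic.argsort()[-8:][::-1]]
    print(i, ' '.join(top))

The topic_vectors matrix is exactly the kind of thing that can then be handed to t-SNE.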
To experiment with LDA, I tried extracting topics from “Les Miserables”, treating each page as a document. Then I projected the topic vectors to 2d with t-SNE.
Like the book, the diagram is long and circuitous. Instead of using the 3d embedding for colors, in this case the colors represent the page number in the book, cycling through hues over the course of the book. Some characters, scenes, or topics are in an independent green cluster at the bottom; perhaps it’s one of the novel’s many side stories. While the image is intriguing, to get a better understanding of what it’s capturing it might be necessary to highlight pages by character names and place names, and to ask LDA what it sees as the most significant words for each topic.
Working with Tejaswinee Kelkar, we tried visualizing styles of Indian classical music with t-SNE after manually extracting some important features like which notes may be played and where the melody typically starts and finishes.
I don’t know as much about these styles of music, but for me it’s interesting to see how the styles that are grouped also have similar names (Gandhari/Dev gandhari, Kedar/Deepak Kedar/Chandni Kedar).
Convolutional Neural Networks and t-SNE
After developing a better intuition for t-SNE, I started looking for vectors everywhere. Anything that could be represented numerically could also be visualized as a graph of labels. A huge influence on this line of work comes from this visualization by Andrej Karpathy:
This visualization uses t-SNE to place a bunch of images. The representation that is given to t-SNE is not the image itself, but a high-level “description” of the image that is taken from the internals of a CNN. When you are doing image classification with a CNN, just before the CNN decides on a category it has a list of thousands of numbers that describe the content of the image more abstractly (a “CNN code”). Some of these numbers in combination mean things like “green background”, “blue sky”, “circular objects”, “eye shapes”, or “feathery textures”. So these are the features that t-SNE groups on. Above if you look closely you can see a number of flowers and plants at the bottom left, and outdoor activities like boating at the top right.
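Extracting CNN codes is mostly a matter of chopping the final classification layer off a pretrained network. Here’s a sketch using torchvision and scikit-learn with a placeholder folder of images; Andrej’s visualization used a different network, but the idea is the same:

import glob
import numpy as np
import torch
import torchvision
from torchvision import transforms
from PIL import Image
from sklearn.manifold import TSNE

# replace the classification layer with an identity, leaving a 2048-number "CNN code"
resnet = torchvision.models.resnet50(pretrained=True)
resnet.fc = torch.nn.Identity()
resnet.eval()

preprocess = transforms.Compose([
    transforms.Resize(256), transforms.CenterCrop(224), transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])])

codes = []
paths = sorted(glob.glob('images/*.jpg'))  # placeholder image folder
with torch.no_grad():
    for path in paths:
        x = preprocess(Image.open(path).convert('RGB')).unsqueeze(0)
        codes.append(resnet(x).squeeze(0).numpy())

# lay the codes out in 2d: nearby points should have similar content
embedding = TSNE(n_components=2, perplexity=30).fit_transform(np.array(codes))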
If we combine this idea of CNN codes with the idea of Deep Dream above, instead of generating a CNN code from an image we can generate it from a category by running the net backwards. So I generated a CNN code for every category in ImageNet and laid them out with t-SNE.
It’s incredible to see how much information is encoded by these categories. For example, a number of personal products (lotion, sunscreen, hair spray, lipstick, bandaid) are in the same region even though they have big visual differences from each other. One possible explanation is that they exist in similar environments (bathrooms) and this contributes to their similarity.
My favorite results come from the top of the visualization where there are some musical instruments grouped together. Again, some look very different from the others but I’m sure they occur in the same context/environment. What’s impressive is that some combination of the CNN and t-SNE has managed to create multiple scales of structure: the brass instruments are more to the right, the woodwinds more to the left (correctly including the flute), and string instruments to the top. That the violin is closer to the orchestral instruments than the folk instruments may just be chance.
There might be other structure hidden “between” the labels when t-SNE does the layout; I’ve tried visualizing this in-between space but haven’t had much success yet.
For anyone interested in making similar visualizations with t-SNE (and word2vec) I’ve compiled a collection “Embedding Scripts” on GitHub, developed during a residency at ITP. But since initially working on this, I’ve found some simpler ways to accomplish the same thing, and would suggest instead looking at some of the examples like word2vec and lda_tsne from this workshop with Yusuke Tomoto (hosted by Rhizomatiks). If you’re interested in real-time, interactive visualizations that use the colored-Voronoi layout above, try this openFrameworks example from the same workshop.
Working with Archives and Libraries
Another huge inspiration for me when it comes to visualizations of large data sets is the work of Quasimondo. Part of his practice involves downloading huge collections of image archives and organizing them in various creative ways: extracting all the mushroom clouds from album covers, finding all the mustachioed men in portrait galleries. For me, I’ve approached two archives so far: a personal collection of audio samples, and the archives of media art professor and researcher Itsuo Sakane.
If my mind is wrapped up in computation, my heart is wrapped up in music. I might not have been turned on to interactive art if it weren’t for the excitement around simple tools for experimental composition and improvisation, whether that meant building weird noisy circuits with lights and sensors or writing scripts that output MIDI to randomly set knobs on software synths. While all of these visual experiments with machine learning are fun, they’re partially a means for me to develop intuition for working with audio instead.
This sample layout tool is based on the same techniques as the above t-SNE visualizations, but I use a frequency domain representation called CQT to get a fingerprint for each sound, extracted with a Python tool called librosa. This representation encourages samples with similar pitches, noisiness, or envelopes to be grouped together. Another common approach for working with audio is to use the STFT, but this can put undue emphasis on higher frequency sounds.
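The fingerprinting step itself is short. Here’s a sketch of roughly what I mean, though not the exact settings I used:

import numpy as np
import librosa

def fingerprint(path, sr=22050, duration=2.0):
    # load a short sample and compute its constant-Q spectrogram
    y, sr = librosa.load(path, sr=sr, duration=duration)
    cqt = np.abs(librosa.cqt(y, sr=sr))
    # average each pitch bin over time to get one fixed-length vector per sample
    return cqt.mean(axis=1)

# one vector per sound file; these go straight into t-SNE like the image examples above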
Adding a text search to the interface really illuminates how the samples are separated by t-SNE, and explains some of the logic behind the layout.
This system is something I’ve wanted for a long time, but now that I have it, I only have more ideas for how to develop or perform with it. I’ve been developing some ideas around this with a small team at Google Creative Labs NYC that should be released soon.
These same techniques can be applied to video archives, but instead of hovering over points to hear sounds we can reveal images and video clips. Working with Yusuke Tomoto from Rhizomatiks, we developed a system for exploring the video archive of Itsuo Sakane, who has been documenting media art in Japan and around the world since the 1960s.
We started by extracting keyframes every second, then feeding these through a neural net to compute feature vectors, and using t-SNE for dimensionality reduction. Then we developed an openFrameworks app for visualizing all these images simultaneously as the background for cells in a Voronoi diagram. Clicking on a cell would open the current video. It’s funny and revealing to see images grouped by interpretable features like “having a prominent circle” or “having brightly saturated colors”, but there are also a lot of useful groupings like “having a face” or “having bright text on a dark background” (usually projected slides) or “having dark text on a dark background” (usually papers or titles). It seems like it might even group Japanese text separately from English text.
Autoencoders
As I mentioned earlier about RNNs, one of the best things about neural nets is how they can be intuitively manipulated to develop new, interesting architectures. One of my favorite architectures is the autoencoder: a neural net that learns to reconstruct its input. If the neural net is big enough there is a trivial solution to this: copy the input to the output. So we impose various constraints on the net to force it into finding a more interesting solution.
One constraint is to create a bottleneck. This means using a small layer in the middle of the neural network. Ideally, this bottleneck has a representation that captures something interesting about the structure of the data. If we were training on images of I Ching hexagrams, and had a 6-neuron bottleneck, the network should learn a binary representation of the hexagram images. To the left of the bottleneck the network learns an encoding function, and to the right it learns a decoding function. There are other constraints that can be placed on autoencoders, too. Instead of learning a binary representation across examples, you could put a penalty on having too many neurons turned on in the bottleneck at any one time, making the representation look more like the output of a classifier.
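Here’s a sketch of that architecture in Keras, with a 6-unit bottleneck like the hexagram example; the layer sizes and activations are placeholders rather than anything carefully chosen:

from keras.layers import Input, Dense
from keras.models import Model

# 784 pixels in, squeezed through a tiny bottleneck, 784 pixels reconstructed out
inputs = Input(shape=(784,))
encoded = Dense(128, activation='relu')(inputs)
bottleneck = Dense(6, activation='sigmoid')(encoded)
decoded = Dense(128, activation='relu')(bottleneck)
outputs = Dense(784, activation='sigmoid')(decoded)

autoencoder = Model(inputs, outputs)
encoder = Model(inputs, bottleneck)  # the learned encoding function on its own
autoencoder.compile(optimizer='adam', loss='binary_crossentropy')
# train with the input as its own target: autoencoder.fit(x, x, epochs=20, batch_size=128)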
If you train an autoencoder to reconstruct handwritten digits, you can get a very accurate reconstruction even from very small bottlenecks. My favorite moments are when the net is still training, or if you make a mistake when defining the size of the different layers. This can result in unpredictable asemic imagery.
Autoencoders belong to a class of machine learning algorithms called unsupervised learners. These algorithms try to learn something from data without strict guidance like labels. Supervised learning, like image classification, has had a lot of success, but some people think that good unsupervised learning will be the next step for learning algorithms. A fun exercise is to try and think of ways to extract significant features from a dataset when you don’t have any labels. When word2vec tries to predict a word from its context, that’s one type of unsupervised learning. We can expand this idea to images with something called “inpainting” where we predict a missing part of an image from its context.
Or, given pieces of an image we can try to ask a net to learn how they should be arranged, like reconstructing a jigsaw puzzle. It turns out, surprisingly, that the features learned in this process bear a strong resemblance to those learned when training on labeled data. This is a step towards one goal of unsupervised learning, which is to learn features and representations that are as useful as the ones we get from supervised learning.
Or, we can ask a second neural net to determine whether the output of a first looks real or fake. This technique is called adversarial learning. It’s often compared to the relationship between someone producing counterfeit money and the customers trying to determine whether the money is real or not. When the counterfeit money is rejected, the person making the money improves their technique until it’s indistinguishable from the real thing.
You might say that these are “face-like”, or that adversarially generated photos of “objects” are “object-like”, the suggestion being that the network has learned to discern the various qualities of a class in some way similar to how humans divide up the world. But from the output alone it’s very difficult to separate what the net has “really learned” from what it has “appeared to have learned”. It turns out that when you look at the internal representations, they do exist in arithmetic spaces just like word2vec’s, where questions about analogy and similarity can be easily answered.
Conclusion
There’s a lot of room for creative exploration with machine learning. Writing poetry, stylizing imagery, making visual and textual analogies. Some argue that automation of human activities is a fool’s errand, and that the real target should be augmentation of human creativity and curiosity, with tools like Terrapattern or Neural Doodle. Personally, I think that this reframing might be a coping mechanism for an unsettling longer term change in the way that creative artifacts are produced. There’s no obvious reason that a computational system can’t eventually be a favorite author of fiction, poetry, or music, similar to how AlphaGo has trumped one of our favorite Go players. I wouldn’t argue that poetry has a clear, measurable, optimization-ready end goal the same way Go does. But like Go, poetry does come down to a relationship between abstract symbols and lived experience. Plenty of examples of both are out there, and we might even find a bot’s connections interesting in a uniquely “botty” way. I’d argue there’s plenty of poetry already in the images and text produced by the techniques above.
For me, I’m less interested in imitating human creativity per se. I’m more excited about the potential for creative bots to undermine our feelings of uniqueness and our own understanding of our artistic or intellectual significance. My favorite moments aren’t in appreciating the beautiful output of some style transfer, or reading assonant couplets from char-rnn, but in the surprise and frustration that’s inseparable from the whole experience. The mixture of fear and delight in seeing an automatic system accomplish something that seems impossible to automate. The pleasure and trepidation of meeting an alien intelligence. Those are the feelings I’m after, and it’s why I keep returning to machine learning.