AI-powered art scene explodes as hackers create revolutionary new tools CLIP + VQ-GAN


In recent months, an AI-powered art scene has exploded as hackers modified an OpenAI model to create amazing image generation tools.

All you have to do to guide these systems is prompt them with a description of the image you want. For example, you can prompt them with the text: “a fantastic world”. With this prompt, the author of this article generated the image you see above.

The crisp, consistent, and high-resolution quality of the images created by these tools sets them apart from the artistic AI tools that came before. The tools are very iterative – in the video below you can see the generation of an image based on the words “a man tortured to death by a demon”.

The main engine inside the new tools is a cutting-edge image classification AI called CLIP. It was announced in January by OpenAI, the company renowned for creating GPT-3, which was itself announced only in May 2020. GPT-3 can generate remarkably general, human-like text from nothing more than a simple prompt.

While the new CLIP-based systems are reminiscent of GPT-3 in their “promptability”, their internal workings are very different. CLIP was designed to be a narrow in scope, yet extremely powerful, tool. It is a general-purpose image classifier that can decide how well an image matches a prompt, for example, matching the image of an apple with the word “apple”. But that’s all. “It was not clear that it could be used to generate art,” Charlie Snell, a computer science student at the University of California at Berkeley who has followed the new scene, said in an interview.
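To make the idea of scoring how well an image matches a prompt concrete, here is a minimal sketch. CLIP maps images and text into a shared vector space and compares them by cosine similarity; the three-number vectors and the labels below are hand-written placeholders for illustration, not real CLIP outputs, and `best_label` is a hypothetical helper.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings. A real system would obtain these from the
# image encoder and the text encoder; these numbers are made up.
image_embedding = [0.9, 0.1, 0.2]
text_embeddings = {
    "apple": [1.0, 0.0, 0.1],
    "car": [0.0, 1.0, 0.3],
    "sunset": [0.1, 0.2, 1.0],
}

def best_label(image_vec, label_vecs):
    """Return the label whose text vector is most similar to the image vector."""
    return max(
        label_vecs,
        key=lambda label: cosine_similarity(image_vec, label_vecs[label]),
    )

print(best_label(image_embedding, text_embeddings))
```

With these made-up vectors, the image lands closest to “apple”. The classifier itself never draws anything; it only ranks how well candidates match.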

But soon after its release, hackers like Ryan Murdock, an artist and machine learning engineer, discovered how to connect other AIs to CLIP, creating an image generator. “A few days after I started playing with it, I realized I could generate images,” Murdock said in an interview.

Over a series of weeks and months, hackers experimented with connecting CLIP to better and better AIs. On March 4, Murdock successfully connected CLIP and VQ-GAN, another type of cutting-edge AI that was released in a preprint in December 2020. “It took a long time to figure out how to make the system work properly,” said Murdock. He continued to fine-tune the system until he was able to produce crisp images. Combinations of CLIP and VQ-GAN are now the most widely used versions of the new tools.

These tools have recently become popular and have led to a new computer-generated art scene.

“These are the first ones that have been made available to the public,” Snell said. “These systems are the first to respond in some way to the ‘text-to-image promise’.”

Snell believes they are perhaps the biggest innovation in AI art since DeepDream, a 2015 AI that became widely used to create hallucinogenic interpretations of images. “It’s definitely the most important thing I’ve seen,” Snell said.

Previously, the most powerful public image generation tools were neural networks called generative adversarial networks, or GANs, of which VQ-GAN is a specific example. Once trained on a large body of images, these networks can synthesize new images of a similar type. However, GANs by themselves cannot generate images from a prompt. Other types of networks besides GANs can be prompted, but not very well. “They just weren’t very good,” Snell said. “It’s kind of a new approach.”

The new tools are readily available to anyone who wants to use them. On June 27, Twitter user @images_ai tweeted a popular tutorial from computer scientist Katherine Crowson on how to use one of the latest models. By following the instructions, a knowledgeable user can run the system in minutes from a web-based programming notebook.

“The results are so shocking that to many they seem to defy belief,” Crowson said in an email. “CLIP is trained on 400 million image/text pairs,” she continued. “At this scale, we start to see abilities that we had previously only seen in human artists, such as abstraction and analogies.”

There is already a vast body of mind-boggling works. There are beautiful images of abstract sunsets, for example. There are also idyllic country houses and giant towns. There are weapons portrayed with unsettling animosity and Escher-like structures that fold in on themselves.

People became fascinated with the capabilities of the tools, and artists began to adopt them widely. “There’s a lot of buzz about it on machine learning and art Twitter,” Murdock said.

Users began to develop an art specific to the tools. One of the quirks of the systems is that you have to figure out how to word your prompt so that the generated image comes as close as possible to your intention. Snell has observed on his Twitter feed that artists have gradually evolved the way they prompt.

“They’re constantly trying new tweaks to try and make it better,” Snell said. “And it’s getting better. Each week they feel like they are seeing improvement.”

The new tools have limits, such as the size of their generated images. The images themselves can often be unexpected and weird. But the fact that the tools could be built was a surprise.

On the same day it announced CLIP, OpenAI also announced a powerful AI called DALL·E, which was designed directly for image generation. The company published a handful of its results, which made it look like a true image generation analogue of GPT-3, something that could expertly render anything. However, DALL·E was not made public, neither its code nor the trained model, which was likely very expensive to train. In contrast, OpenAI released the CLIP model in its entirety. “The hardware to produce these neural networks is relatively inexpensive,” said Crowson.

The new tools have shown that CLIP provides a sort of backdoor method to replicate the capabilities of DALL·E. Given that OpenAI chose to hold back DALL·E, it looks like the company was caught off guard. “I really suspect they’re a little surprised that it could do all of this,” Murdock said.

Snell described it this way: “They teased us with DALL·E. They were like, ‘We have this thing.’ And they didn’t release it,” he said. “And everyone was like, ‘We want it though.’ So they kind of did it themselves.”

The hacked-together CLIP-based tools work very differently from DALL·E. DALL·E directly produces images that match the text. By contrast, Snell describes CLIP-based systems as something like AI interpretation tools. When VQ-GAN and CLIP work together, the first model creates an image and the second model indicates how well it matches the prompt. The two iterate until the image matches the prompt as closely as possible. The iteration says something about the imagery that CLIP associates with certain words.
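The generate-then-score loop described above can be sketched in a few lines. This is a toy illustration under loud assumptions, not the actual tools: `generate` stands in for VQ-GAN, `score` stands in for CLIP, the “image” is just a short list of numbers, and the search tries random tweaks and keeps the ones that score better. Real implementations typically adjust the generator's input using gradients from the scoring model rather than random search.

```python
import random

def generate(latent):
    """Stand-in generator: turn a latent vector into an 'image' with values in [0, 1]."""
    return [min(1.0, max(0.0, v)) for v in latent]

def score(image, target):
    """Stand-in scorer: higher means a better match (negative squared distance)."""
    return -sum((p - t) ** 2 for p, t in zip(image, target))

def optimize(target, steps=2000, step_size=0.1, seed=0):
    """Repeatedly tweak the latent, keeping only tweaks that raise the score."""
    rng = random.Random(seed)
    latent = [rng.random() for _ in target]
    best = score(generate(latent), target)
    history = [best]
    for _ in range(steps):
        # Propose a small random change to the generator's input.
        candidate = [v + rng.gauss(0, step_size) for v in latent]
        candidate_score = score(generate(candidate), target)
        # Keep the change only if the scorer says the match improved.
        if candidate_score > best:
            latent, best = candidate, candidate_score
        history.append(best)
    return generate(latent), history

# Hypothetical 'prompt', already reduced to the numbers the scorer compares against.
image, history = optimize([0.2, 0.8, 0.5, 0.1])
print(history[0], history[-1])
```

The score in `history` never goes down, because the loop discards any tweak the scorer rates as worse. That is the sense in which the scoring model steers the generator.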

So CLIP-based models are a whole new kind of artistic tool, a new kind of computer brush. They don’t feel perfect yet, Snell points out. “You have some control over it, but you can’t control it completely. And you’re still, like, a little surprised.” But this quality of human ingenuity is a big part of the appeal of the new tools.

It remains to be seen what impact they will have. It looks like it will be easy for companies and collaborations to dramatically improve the tools, given that the current ones have been made mostly by individuals. But they are already very powerful. Many people seem likely to adopt them for art, work and pleasure. Creating art has become as easy as using language, allowing everyone to be their own kind of lyrical Picasso.

