Good afternoon! I am currently a Master's student at the UvA on the Preservation and Presentation of the Moving Image programme, working on a thesis regarding the sonification of visual data - as such, I'm really interested in your 'we live to rock' clip, and was wondering if you would be so kind as to share some info as to the methods used. Any info would be very gratefully received!
Hello Dogbeak, thanks for your message!
That video was the result of a little experiment. Following Guy Sherwin’s idea of optical sound, we used the images as input for the soundtrack in the bluntest way possible: by interpreting the pixel data as raw audio data using a program called SoX. By tweaking the sample rate, and eventually stretching the audio a little, we made the audio the same length as the video, so that it directly reflected the texture of the image.
If you have any other questions, please don’t hesitate to ask!
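For readers who want to try the pixel-to-audio idea described in the reply above, here is a minimal sketch. It is not the SoX command that was actually used; it only assumes that the frames are already available as a flat run of 8-bit pixel values, and it picks the sample rate so that the audio lasts exactly as long as the video. The function name and parameters are hypothetical.

```python
import wave


def pixels_to_wav(pixel_bytes: bytes, video_seconds: float, path: str) -> int:
    """Write raw 8-bit pixel values to a mono WAV file as unsigned 8-bit PCM.

    The sample rate is derived from the video duration, so the sound is as
    long as the footage it comes from. Returns the sample rate that was used.
    """
    if not pixel_bytes or video_seconds <= 0:
        raise ValueError("need pixel data and a positive duration")
    # One pixel value becomes one audio sample.
    sample_rate = max(1, round(len(pixel_bytes) / video_seconds))
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)            # mono
        wav.setsampwidth(1)            # 1 byte per sample = unsigned 8-bit
        wav.setframerate(sample_rate)
        wav.writeframes(pixel_bytes)   # bytes are written untouched
    return sample_rate
```

With full-resolution frames the derived rate would be far above any usual audio rate, which is presumably where tweaking the sample rate and stretching the audio come in.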
One essential task for which Jan Bot relies on AI technology is image recognition. It is used to provide the metadata for Bits & Pieces, the collection of orphan footage we use to generate films. Nowadays there is considerable demand for this kind of product, especially from marketing agencies, who use it to research consumers’ online behavior.
We tested image recognition software from five different companies: Clarifai, CloudSight, Google Cloud Vision, IBM Watson and Microsoft Computer Vision. It is worth mentioning that the tests we ran used the standard databases provided by each service as parameters.
For this test we used the following shot from Bits & Pieces. The results are displayed below.
Clarifai
The window on the right shows the output provided by Clarifai. Next to each tag there is an index that expresses the degree of certainty with which Clarifai made its prediction. On this scale, 1 would be a perfect match between word and image. As can be seen, Clarifai is remarkably accurate in its predictions. Even the more abstract tags, such as “offense”, “danger” and “war”, are within acceptable bounds.
CloudSight
CloudSight’s outcome is still accurate, but considerably more laconic than Clarifai’s. The words “zwarte revolver” (“black revolver” in English) are the only output provided. This kind of service could work for tasks where images need to be reduced to one term. That is clearly not our case, since Jan Bot needs a good number of tags per shot to create a rich semantic print that can connect with news items.
Google Cloud Vision
If Google aspired to make Cloud Vision the René Magritte of the image recognition market, the word “image” would be a respectable output in our test. But this is not the case. We know this is an image, and while Google is 90% certain, we are, a priori, fully convinced of it.
Google Cloud Vision provided us with only two tags: one too general (“image”), the other too specific and, as if that were not problem enough, wrong. Tagging those hands holding a gun as “soldier” is too adventurous to be right.
IBM Watson
This is Watson, the supercomputer that in 2016 supposedly edited the trailer for a horror film. At least that’s what the clickbait headlines announced. Unsurprisingly, that is a misleading claim that not even IBM can sustain for more than 30 seconds (although that should be enough to capture the full attention span of most people who like to believe in technology as science fiction). “Watson is the tool who helps to arrange the visuals, but it still needs the human element,” says the director of the trailer cagily; he actually got a preselection of shots from Watson and then had to do the real work of editing himself.
Coming back to the analysis of our shot, Watson is the least effective of all the programs we tested. “The score for this image is not above the threshold of 0.5” is another way of saying “image not recognized”. Although the edges of the objects in the image are clear, Watson was unable to recognize anything in this shot.
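To make the threshold message concrete, here is a small sketch of how tags with confidence scores can be filtered against a cut-off. It is not Watson’s code or API; the function name is hypothetical and any scores passed to it in the usage note are made up for illustration, not results from our test.

```python
def tags_above_threshold(scores: dict[str, float],
                         threshold: float = 0.5) -> list[tuple[str, float]]:
    """Keep only tags whose score is strictly above the threshold.

    The result is sorted from most to least certain. An empty list is the
    equivalent of "image not recognized".
    """
    kept = [(tag, score) for tag, score in scores.items() if score > threshold]
    return sorted(kept, key=lambda item: item[1], reverse=True)
```

Calling it with invented scores such as `{"gun": 0.4, "hand": 0.3}` returns an empty list, while `{"gun": 0.9, "hand": 0.3}` keeps only the “gun” tag.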
Microsoft Computer Vision
Finally, we have Microsoft Computer Vision describing the content of the image as “a person holding a cell phone”. The generated tags (“name”, “gun” and “confidence”) seem much closer to describing what most of us would agree the image is about. These results are certainly appealing for Jan Bot, for they put human imagination to work in a way that an accurate output wouldn’t. Nonetheless, to ensure more interesting results at the end of Jan Bot’s workflow, we decided to stick with Clarifai.
Conclusion
Some people and companies like to believe that AI has brought image recognition software to its golden age, yet so far that is not the case. This simple test does a good job of revealing that. Image recognition is developing fast, but it is not there yet. From an aesthetic point of view, however, the results provided by these programs are very interesting. They reveal the variety of sensibilities that current technology can adopt around the same image. Such diversity can be attributed to the technical features of each piece of software, but that is only one part. Given that competing products in the field of technology tend to develop quite fast and along similar lines, it is crucial to realize that the output provided by image recognition software may be influenced by the editorial criteria of those who design the software. The bottom line of all this is the following: behind every artificial intelligence, a group of human minds is always involved.
Film generated from the trending topic Pep Guardiola, the football coach who had been appointed manager of Manchester City football club.
On January 3rd 2017, Pep Guardiola appeared in news items such as:
“Petulant Pep Guardiola struggles to hide strain of turbulent start and claims Man City are picked on by officials”
and,
“Pep Guardiola’s winter of discontent is nothing new in Premier League circus”
Footage Selection
Using a free version of cortical, a Natural Language Processor (NLP), we matched the news items related to Pep Guardiola with the tags generated for each shot of Bits & Pieces. The top results are the following:
watch (0.26)
guy (0.25)
looking (0.24)
crowd (0.24)
dinner jacket (0.23)
club (0.23)
give voice (0.23)
friendship (0.23)
season (0.23)
minute hand (0.22)
These tags led to the following selection of footage:
(For more details see http://selection.janbot.nl, date 03-01-2017)
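The step from scored tags to a footage selection can be sketched as follows. This is only an illustration under stated assumptions: the similarity scores are taken as given (the example values in the usage note come from the list above), the shot identifiers are hypothetical, and ranking shots by how many top tags they carry is our simplification, not necessarily the rule Jan Bot applies.

```python
def top_tags(similarity: dict[str, float], k: int = 10) -> list[str]:
    """Return the k tags with the highest similarity to the news items.

    Ties are broken alphabetically so the result is reproducible.
    """
    ranked = sorted(similarity.items(), key=lambda kv: (-kv[1], kv[0]))
    return [tag for tag, _ in ranked[:k]]


def select_shots(shot_tags: dict[str, set[str]], wanted: list[str]) -> list[str]:
    """Return the shots that carry at least one wanted tag.

    Shots sharing more tags with the wanted list come first.
    """
    wanted_set = set(wanted)
    hits = {shot: len(tags & wanted_set) for shot, tags in shot_tags.items()}
    matching = [shot for shot, count in hits.items() if count > 0]
    return sorted(matching, key=lambda shot: (-hits[shot], shot))
```

For example, `top_tags({"watch": 0.26, "guy": 0.25, "looking": 0.24, "crowd": 0.24}, k=3)` keeps “watch”, “guy” and “crowd”, and `select_shots` then returns every shot in a (hypothetical) tag index that carries at least one of them.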
Film Composition
This video has been edited using several algorithms designed by the authors of Jan Bot. These algorithms are experiments where scene compositions are preconceived and tested with non-specific footage.
In the experiments currently in development, the duration of each pre-montage is divided into basic temporal units called Acts. Pep Guardiola, this experiment, is composed of 3 acts: the first runs from second 0 to 10, the second from second 10 to 20, and the third from second 20 to 30.
Each act contains several video generators, which are algorithms designed by the authors of Jan Bot inspired by editing principles observed in films or conceived as translations of basic editing principles. In Pep Guardiola, 14 generators have been applied.
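The division into acts can be illustrated with a short sketch. It assumes acts of equal length, which matches the 30-second example above but is not stated as a general rule; the function names are hypothetical, and the generators themselves are not modelled here.

```python
def split_into_acts(total_seconds: float, act_count: int) -> list[tuple[float, float]]:
    """Divide a duration into equally long acts, as (start, end) pairs in seconds."""
    if act_count < 1 or total_seconds <= 0:
        raise ValueError("need a positive duration and at least one act")
    length = total_seconds / act_count
    return [(i * length, (i + 1) * length) for i in range(act_count)]


def act_at(acts: list[tuple[float, float]], second: float) -> int:
    """Return the index of the act that contains the given moment.

    Each act includes its start and excludes its end.
    """
    for index, (start, end) in enumerate(acts):
        if start <= second < end:
            return index
    raise ValueError("moment lies outside the film")
```

`split_into_acts(30, 3)` yields the three acts described above, and `act_at` tells a generator which act a given second falls into.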
Intertitles
Generated with Jan Bot’s catachrestic intertitle generator, which selects snippets from news items and replaces some articles for tags connected to Bits & Pieces. More info about this operation at http://research.janbot.nl/synopsis.
For Pep Guardiola, Jan Bot’s catachrestic intertitle generator gave the following output, from which the first 3 clauses of the first 3 sentences were picked as intertitles.
A minute hand sarcastic struggled to hide the looking
The looking on as the claimed his the looking
Season moved to up to the looking
A club has been a diverting season
A guy his with the after the new against the with just about everything a crowd
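The exact substitution rule is not spelled out in this post, so the following is only one possible reading, written as a sketch: whenever an article appears, the word that follows it is swapped for the next tag in the list. The function name, the article list and the example sentence in the usage note are assumptions made for illustration, not the actual generator.

```python
ARTICLES = {"a", "an", "the"}


def swap_after_articles(sentence: str, tags: list[str]) -> str:
    """Replace the word following each article with a tag.

    Tags are used in order and start over when the list runs out.
    This is an assumed rule, not Jan Bot's implementation.
    """
    result = []
    tag_index = 0
    replace_next = False
    for word in sentence.split():
        if replace_next and tags:
            result.append(tags[tag_index % len(tags)])
            tag_index += 1
            replace_next = False
        else:
            result.append(word)
            replace_next = word.lower() in ARTICLES
    return " ".join(result)
```

Fed with a made-up sentence modelled on the headline, `swap_after_articles("Guardiola struggles to hide the strain of a turbulent start", ["looking", "season"])` gives “Guardiola struggles to hide the looking of a season start”.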
Soundtrack
Sound recorded using a piezo microphone attached to Jan Bot’s computer processor while generating the film.
Most audiovisual narratives make use of the intrinsic ability of the film camera to represent the world in the form of photographic documentation. This photographic nature, built on optics that capture light on a photosensitive support, enables a form of expression that can be accessed by the human intellect without the need for translation. Roland Barthes refers to this as “messages without code”; unlike written text, drawings or even music, we don’t need to learn how to read film images to understand their basic message, because the representation of the world they offer matches the ways of human perception.
This idea gains clarity when we take as an example the spectatorial experience of mainstream cinema or TV shows. Almost instantly and without resistance we are transported into the virtual world that resides on the other side of the screen. Once there, we don’t need to make much effort to understand the setting, to empathise with the characters or to understand the interactions between them. It can be a TV presenter talking to the screen or a historical drama: in most cases we will easily inhabit the story and forget about the screen that separates our world from the self-contained universe built by the imagery of film.
A CINEMA OF THE MIND
Ernie Gehr, an experimental filmmaker, uses the term “representational film” to address the majority of films, which build their narratives on this photo-realistic approach. With a critical attitude that develops further through his work, he claims that “film is a real thing and as a real thing it is not imitation [of reality]”. Since the 1970s Gehr has been part of a film avant-garde which introduced a completely new approach to film narration and which, during that decade, became known as the “structural film” movement.
Unlike representational films, structural cinema explores the physical limits of its medium, which, as described by Gehr, consists of “a variable intensity of light, an internal balance of time, a movement within a given space.” By doing this, structural film reminds us that realism in film is essentially an optical illusion which results from the impact of light on photosensitive material, be that celluloid, magnetic tape or a chip.
P. Adams Sitney, the academic who coined the term structural film, described it as a “cinema of the mind rather than the eye”, because it tries to raise awareness about the components of film and the narrative devices that prompt them to become aesthetic objects.
An inspiring execution of this idea is the work of Stan Brakhage. In his 1987 film series The Dante Quartet, he intervenes in the film footage by painting directly on it. The result is an abstract composition of colours, contrasts and movements.
JAN BOT: A SYNTACTIC APPROACH TO FILM ARCHIVES
Although Jan Bot’s films are not structural, our approach to film and narrative has a strong affinity with the ethos of structural film. This association wasn’t a priori, but rather something we realised retrospectively during one of our research iterations. Therefore, instead of listing reasons why Jan Bot relates to structural cinema, it seems more valuable to describe how such an approach to film narration emerged as a natural decision from the very first discussions that originated this project. So this is how it began.
We had known about Bits & Pieces for some time before we decided to work with it. This is a curated group of orphan films that the EYE institute has been systematically collecting since 1989 according to the aesthetic judgement of its curators. Another characteristic of this collection is that hardly any of its fragments have been identified. Unlike traditional film collections that make reference to historical events or particular networks of the film industry, Bits & Pieces only makes reference to its own imagery.
Given that film archives are mostly exhibited with the intention of illustrating historical events, we knew that the public presentation of a collection without historical records like Bits & Pieces was a serious challenge. Assuming that beauty (the main criterion for assembling this collection) lies in the eye of the beholder, and that neither a historical context nor a particular subject binds this collection together, we asked ourselves the following question: what could be a common denominator to all the pieces of this archive that would allow us to reflect on it and make it relevant for public presentation?
Our answer was a structural approach to the collection. Instead of establishing semantic criteria to classify the images and speculate about their potential signification, we decided to question the very essence of Bits & Pieces, starting from the digital materiality in which its images are normally presented and distributed. We decided to think of it not as a group of digital images, but rather as an agglomeration of pixels structured in the form of images that, to the human eye, yield an implicit meaning. This approach helped us to find common denominators in the archive related to the quality of the images. We called this a syntactic analysis of film.
PIXELS AS UNITY
Such a structural approach to Bits & Pieces corresponds exactly to one of the most elementary and at the same time essential procedures employed by computer software to process images. Unlike humans, computers are not natural readers of photographic images. They need to decode their meaning, and they do this by parsing each of the pixels that compose them. Only after this process is done can further implementations, such as image recognition, take place.
Using software called Shotdetect, we processed the almost seven hours of digitised reels that make up Bits & Pieces. This allowed us to organise its fragments according to several image properties, such as color, motion and definition. The following image illustrates a fragment of indexed material. The four rows display: 1. still images, 2. the amount of motion (i.e. the difference between two frames), 3. the average brightness and color, and 4. the color separated by its RGB components.
An interactive version can be found at http://analysis.janbot.nl
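The measurements shown in the four rows can be approximated with very little code. The sketch below is not how Shotdetect works internally; it assumes a frame is simply a list of (R, G, B) pixels with values from 0 to 255, takes brightness as the plain mean of the three channels, and measures motion as the mean absolute difference between two frames. All names are hypothetical.

```python
Pixel = tuple[int, int, int]


def average_rgb(frame: list[Pixel]) -> tuple[float, float, float]:
    """Average colour of a frame, separated into its R, G and B components."""
    if not frame:
        raise ValueError("empty frame")
    count = len(frame)
    return (
        sum(p[0] for p in frame) / count,
        sum(p[1] for p in frame) / count,
        sum(p[2] for p in frame) / count,
    )


def brightness(frame: list[Pixel]) -> float:
    """Average brightness, taken here as the plain mean of the three channels."""
    r, g, b = average_rgb(frame)
    return (r + g + b) / 3


def motion(previous: list[Pixel], current: list[Pixel]) -> float:
    """Amount of motion: mean absolute per-channel difference between two frames."""
    if not current or len(previous) != len(current):
        raise ValueError("frames must be non-empty and of equal size")
    total = sum(abs(a - b)
                for p, q in zip(previous, current)
                for a, b in zip(p, q))
    return total / (3 * len(current))
```

Two identical frames give a motion of 0, while a cut from a black frame to a fully red one gives 85, the same value as the red frame’s brightness.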
COMPUTER-AIDED FILMMAKING
In this article we will not go into a detailed analysis of the results obtained after the indexation of Bits & Pieces. That will be the topic of another text dedicated to the algorithmic editing of films based on selected footage. For the time being, we would like to highlight that carrying out this indexation at an early stage of research has given us enough confidence to believe that a structural approach to found footage filmmaking aided by computers is not only possible, but also promising.
The influence of computer performance and algorithmic workflows is gaining more and more momentum in our global culture. In the media industry, just as in many other fields, there is a tendency to delegate the fine work of crafting content and making small decisions to these entities. The current popularisation of artificial intelligence is one significant step towards that. Considering this as well as other technologies on their way to commercial use, we must start to familiarise ourselves with the idea that software is becoming, if it is not already, a fundamental mediator between the world and our minds.
THE RELEVANCE OF A STRUCTURAL APPROACH NOW
The way this mediation is taking place is through the creation of machines that can simulate human understanding. Artificial intelligence is precisely that: computers with the potential to adopt communication skills that make interaction with humans swift and effortless. Just as cinema creates the illusion of perceiving the world with our own eyes, computers are increasingly being used to formulate an understanding of the world that matches human perception and cognition. By doing this, they are becoming an invisible medium, and our notions of actual and virtual reality, a fuzzy domain.
Although we are not against this new relationship between computers and society, we do think that such a transition shouldn’t be underestimated, nor simplified with utopian or dystopian auguries. That is why our structural approach towards digital images is so timely and relevant. Just as structural filmmakers want to raise awareness about the virtuality and malleability of film as a medium, we want to make computer mediation as visible and concrete as possible in order to explore its poetic use. By using computers for expression instead of simulation of reality, we dive into a realm of aesthetic experiences that exist beyond the pragmatism of goal-minded interactions.
Instead of following the current tendency that tries to integrate the film medium into the increasingly hegemonic narrative of computers mediating between reality and the mind, a structural approach to filmmaking can offer a way to integrate the computer narrative into the film medium. We believe this can help us to reflect, visualise and speculate about the workings of algorithms, and thus expand film towards new realms of knowledge, where imagination is key.
Sketch of Jan Bot inspired by the article “Red Hot Chili Peppers' Anthony Kiedis Hospitalized” (here indexation)
This is an updated version of a previous experiment. We have added a soundtrack inspired by Guy Sherwin’s notion of “optical sound”, which in this case means transcoding image data to use it for sound.
The origin of the term comes from the ancients. The concept became more precise with the use of variables in mathematics. The algorithm, in the sense now used by computers, appeared as soon as the first mechanical engines were invented.
A definition of algorithm: http://www.scriptol.com/programming/algorithm-definition.php
On classification of algorithms: http://www.scriptol.com/programming/algorithms-classification.php
In his books “Film Form” and “The Film Sense”, pioneer film director Sergei Eisenstein developed and extensively studied montage, a process of assembling, juxtaposing, overlaying and overlapping images over time. He treated montage not merely as a series of visual techniques but as a methodology that conveyed meaning beyond that given by filmed content. Eisenstein described five formal categories of montage: metric, rhythmic, tonal, overtonal and intellectual. These montage types were guided by different rules of construction and assorted treatments of the part-whole relationship; many of them functioned according to musical principles.
An algorithm uses an image to generate prose and poetry, while a surveillance camera scans for faces and describes what it sees in “spoken” words.
The creative applications of artificial intelligence technology are becoming increasingly refined. A perfect example is this algorithm that, without human intervention, uses imagery to generate prose and poetry. Take any sort of photo you want, upload it, and the word.camera app will transpose the image into ornate text. A picture of a dead pigeon on a sidewalk might trigger a reflection on mortality; wearing a funny party hat might inspire the app to come up with a joke. This multimedia project uses artificial intelligence algorithms to generate textual descriptions of images. This could be the beginning of a new kind of camera, or a new kind of photography.
John Whitney created one of the first computer-generated images in history. In the early 1960s he built an analog computer from anti-missile detectors designed during WWII. The original machines worked with two men manually operating telescopes to follow missiles. Their motion provided the necessary data for analog computers to determine ballistic equations that helped predict their direction. According to those calculations, the missiles would be targeted and destroyed.
A visualization of the results obtained by this complex mechanism somewhat resembles the workings of a spirograph. This is noticeable in the first collection of experiments made by Whitney and his machine, published in 1961 in the form of a promotional reel titled "Catalog".
In the book “Expanded Cinema”, Gene Youngblood explains:
“An M-5 Antiaircraft Gun Director provided the basic machinery for Whitney's first mechanical analogue computer in the late 1950's. This complex instrument of death now became a tool for producing benevolent and beautiful graphic designs. Later Whitney augmented the M-5 with the more sophisticated M-7, hybridizing the machines into a mammoth twelve-foot-high device of formidable complexity upon which most of the business of Motion Graphics was conducted for many years.”
“Unlike the digital computer, which requires only a mathematical code as its input, the mechanical analogue computer as used by the Whitneys requires some form of input that directly corresponds to the desired output. That is, at least a basic element of the final image we see on the screen must first be drawn, photographed, pasted together, or otherwise assembled before it is fed into the analogue equipment for processing. This means that a great deal of handicraft still is involved, though its relation to the final output is minimal. The original input may be as simple as a moiré pattern or as complex as a syncretistic field of hand-painted dots— but some form of handmade or physically demonstrable information is required as input in the absence of conventional computer software.”
Filmmaker Guy Maddin says that watching a movie has more in common with a paranormal séance than meets the eye; he remembers the first time he realized that the French word for a movie screening is "séance," which translates into "a sitting." Both activities take place in the dark; both...