Popular text-to-image generators like DALL-E, Midjourney, and Stable Diffusion have introduced what some believe to be a new creative frontier in the architectural design process. However, they also encourage us to reflect on a series of relationships at the heart of design, representation, and space: those of human versus machine 'understanding,' of two versus three dimensions, and of chaos versus creativity.
Below, the artist, researcher, author, and academic Amanda Wasielewski unpacks such relationships and the role of text-to-image generators as tools in the architecture and design arsenal, with reflections building upon her upcoming book Computational Formalism: Art History and Machine Learning, published by MIT Press.
This article is part of the Archinect In-Depth: Artificial Intelligence series.
Over the last year, generative AI went mainstream thanks to text and image generation platforms like OpenAI’s DALL-E 2 and ChatGPT, which have made it possible for anyone with an internet connection to harness the power of machine learning for creative endeavors. No technical skills or programming required; no need for a special hardware setup or computer. Given the ease and accessibility of these tools, it is not surprising that the response has been equal parts worry and excitement. The growing public awareness and rapid uptake of AI have been accompanied by pressing questions about the future of creative industries. Though the full impact of consumer AI tools remains to be seen, it is already clear that they are game changers.
Popular text-to-image generators like DALL-E 2, Midjourney, and Stable Diffusion are so-called multimodal tools, meaning they operate by connecting two categories of data (text and image) and different deep learning models that process them. Multimodality was the real breakthrough for opening up AI image generators to a wider audience. It means that a simple text-based description is all that is necessary to generate a variety of impressive imagery. Other AI image generation techniques, such as GANs (generative adversarial networks), have been part of the public discourse for some time, but they are relatively inflexible. The new image creation tools that are built on diffusion models, however, are far more flexible for a range of different applications, including architecture and design renderings.
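To make the division of labor between the text model and the image model a little more concrete, the sketch below shows roughly what calling an open-source diffusion model looks like in code. It is a minimal sketch, assuming the Hugging Face diffusers library, PyTorch, and a publicly released Stable Diffusion checkpoint; the prompt and file name are purely illustrative.

```python
# Minimal text-to-image sketch using an open-source diffusion model.
# Assumes the Hugging Face diffusers library and PyTorch are installed
# and that a GPU is available; prompt and file name are illustrative.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",   # publicly released checkpoint
    torch_dtype=torch.float16,
).to("cuda")

prompt = "an architectural rendering of a timber pavilion in a public park"
# The pipeline's text encoder turns the prompt into a numerical representation,
# which conditions the image model as it denoises random noise into a picture.
image = pipe(prompt).images[0]
image.save("pavilion.png")
```

The multimodality described above lives in that single call: a language model encodes the prompt, and an image model generates pixels conditioned on that encoding.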
Architects and designers can use these more general text-to-image platforms to create new renderings or enhance existing designs, but there are also now profession-specific plugins for popular architecture and 3D software using the same type of technology. The ArkoAI plugin, for example, works with SketchUp, Revit, and Rhino3D, while the Veras plugin works with Revit (with plans for integration with other software in the near future). The idea behind these tools is that, with a simple text description and the click of a button, an architectural sketch can be fleshed out with materials, environment, and inhabitants in a matter of seconds.
The sense that these tools have an understanding and intelligence is more a facet of human empathy than their actual capabilities.
One of the main tasks facing these architecture and design-specific plugins is that they must somehow deal with the concept of three-dimensionality. Training AI tools to ‘understand’ and somehow render a convincing representation of space is, in fact, harder than it sounds. The word ‘understand’ is in quotes because, in this context, it is more metaphor than reality. We often use words like ‘understand’ to describe the way that deep learning models process data, but as scholars like Emily M. Bender remind us, the sense that these tools have an understanding and intelligence is more a facet of human empathy than their actual capabilities.
In fact, a core issue with the current range of available text-to-image tools is that they have no concept of or means by which to map out three-dimensional space. Any illusion of space is essentially a surface effect. Computer vision research is concerned with how digital images and video can be used to help a computer ‘see’ and discriminate between different people or objects. In applications like facial or object recognition, identification via matching is key. DALL-E and its ilk have been trained on large datasets of two-dimensional digital images (many of which are digital photographs) that have been labeled for object recognition tasks. Image generation is thus a kind of reverse engineering of simple category matching.
Complicating matters even further, the primary method of producing training data has involved vast amounts of real human labor. Workers on Amazon’s Mechanical Turk and similar platforms have been paid pennies to monotonously label the objects in some of the early and influential large datasets. In their investigation of the labels applied to human beings in the ImageNet training set, scholar Kate Crawford and artist Trevor Paglen point out the large-scale replication of human bias that can stem from this invisible labor. They also produced a viral artwork, ImageNet Roulette (2019), that allowed users to upload a photo and see what bizarre, misogynistic, or racist labels came out of the data.
It is important to remember that these labels tag pixel patterns in digital images and are thus exclusively focused on two-dimensional representation. Generative models have been trained to connect the variety of pixel patterns that are labeled with text like, for example, “tree” or “dog.” When someone writes a text prompt for a dog or a tree, the tool can compose an image with a collection of pixels that we recognize as a tree or a dog. As accurate and efficient as these tools are, however, they have no understanding of what a tree or a dog actually is because they do not have experience of a spatial world, as humans do.
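The “category matching” that generation reverse engineers can be illustrated with one widely used matching model, CLIP, which scores how well a caption fits an image. The sketch below is only illustrative, assuming the Hugging Face transformers library and the public CLIP checkpoint; the image path and captions are placeholders.

```python
# Illustrative sketch of text-image matching with OpenAI's CLIP model,
# via the Hugging Face transformers library (assumed installed).
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("photo.jpg")                 # any 2D digital image
captions = ["a dog", "a tree", "a staircase"]   # candidate labels

inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)

# One score per caption: the model matches pixel patterns to text labels.
# It has no notion of the three-dimensional things those labels refer to.
probs = outputs.logits_per_image.softmax(dim=1)
print(dict(zip(captions, probs[0].tolist())))
```

A generator, in effect, runs this matching in reverse, producing pixels that score highly against a prompt, which is why the results can look right while containing no spatial knowledge at all.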
Text-to-image generators have no such spatial awareness or understanding, and we cannot take for granted that they ‘perceive’ space in two-dimensional images the same way we do.
We tend to look at representational images, such as photographs or architectural drawings, with an innate sense of space. The perspective or projection easily disappears in our mind’s eye. As three-dimensional beings, we humans intuitively understand and even expect to perceive 3D space. Indeed, the vast majority of Western art and imagery since the Renaissance has been focused on creating illusions or representations of space in two dimensions.
Maps, plans, and renderings all present an understanding of three-dimensional space in two dimensions. Even rotatable 3D digital objects are typically viewed on flat 2D screens. We are so used to representing the 3D world in two dimensions that this act of transposition often becomes invisible. We take it for granted. But text-to-image generators have no such spatial awareness or understanding, and we cannot take for granted that they ‘perceive’ space in two-dimensional images the same way we do.
For the past year, I have been following communities of amateurs on Facebook and Reddit who post about their experiences using popular text-to-image generators. One recent post from a designer showed a simple mockup of a bar interior that he had created in SketchUp. This designer then fed the sketch into Midjourney and ArkoAI to see what kind of renderings these tools would come up with. The resultant images produced by both tools show fleshed-out scenes, complete with photorealistic textures and materials filled into the design. However, the AI-generated renderings did not faithfully follow the original sketch. For example, the sketch clearly shows arched windows on either side of the bar at the center, but the AI did not recognize that these were meant to be windows. So, it just reproduced arches over a blank wall space instead. The form of the ceiling and the bottle rack in the back of the bar also deviated significantly from the original sketch.
In other words, the renderings produced by text-to-image generators exhibit something that their designers sometimes call creativity… or chaos. Many of these systems allow the user to toggle a metric for how much creativity or deviation from the sketch is allowed. Generally, more deviation (or creativity/chaos) creates more dynamic and impressive images, but it also means liberties are taken with regard to the description in the text prompt or the details of the input image.
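For readers curious what that dial looks like in practice, the sketch below shows a sketch-to-render workflow with an explicit deviation setting. It is a minimal example assuming the open-source diffusers library and a Stable Diffusion checkpoint; commercial tools like Midjourney or ArkoAI expose a similar control under different names, and the file names here are invented.

```python
# Sketch of the 'creativity versus fidelity' dial using image-to-image
# generation from the diffusers library (assumed installed, GPU available).
# The strength parameter (0 to 1) controls how far the output may drift
# from the input sketch; file names and prompt are illustrative.
import torch
from PIL import Image
from diffusers import StableDiffusionImg2ImgPipeline

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

sketch = Image.open("bar_interior_sketch.png").convert("RGB")
prompt = "a photorealistic bar interior with arched windows on either side"

# Low strength stays close to the sketch; high strength takes more liberties.
faithful = pipe(prompt=prompt, image=sketch, strength=0.3).images[0]
creative = pipe(prompt=prompt, image=sketch, strength=0.8).images[0]
faithful.save("rendering_faithful.png")
creative.save("rendering_creative.png")
```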
Happy accidents in AI can be useful in their own right.
From my time spent on amateur groups, it is clear that those who regularly use text-to-image generators are accustomed to getting output images that deviate from their ideas or intentions. Most of the posts on these groups are seeking answers for why a particular prompt does not work or asking how to formulate a prompt to produce the desired effects. A new type of job, Prompt Engineer, has even been proposed in the wake of AI tools such as these. Companies of all sorts may soon employ experts to craft prompts for AI-generated content. However, it is also clear that happy accidents in AI can be useful in their own right. The ‘creativity’ of such platforms can be used to generate or brainstorm ideas quickly.
For those seeking to transform precise sketches into detailed renderings, however, there are a few significant problems. One issue that impacts the precision and utility of renderings is the fact that text-to-image generators can’t count. This means that asking for a specific number of something in an image often does not produce the desired effect. For example, when asked to produce a rendering of a building with a certain number of floors, these tools struggle to deliver. To illustrate this, I generated a few images with Midjourney from the prompt “a 4-story apartment building in an urban environment,” and the resulting images showed buildings that typically had between 6 and 10 stories.
The same thing can happen even if you feed in a reference sketch with a specific number of elements. AI-generated imagery is iterative in a way that makes it likely to riff on certain motifs that the generator latches onto. When you feed Midjourney a sketch that has two chandeliers, for example, it may riff on the chandelier in the reference sketch and create six chandeliers instead, as it did in the case of the bar design example described above. When a deep neural network is trained on images, it treats them as collections of parts rather than as wholes. When generating, it has equally little awareness of the total composition beyond the smaller-scale relations learned from the training data. This may pose a significant problem for any field that depends on the precision of numbers and measurements.
This issue with counting is perhaps most apparent in images of human bodies, particularly hands. Text-to-image generators have become notorious for their failures to produce naturalistic hands with only five fingers. Midjourney version 4 and DALL-E have made some efforts to fix this issue, but hands are often still a dead giveaway that an image has been AI-generated. For example, the Midjourney-created image of the Pope in a white puffer jacket, which went viral on Twitter because so many people believed it was real, has a tell that it is AI-generated. If you look at the Pope’s visible hand in the image, you can see that it is significantly distorted. Most people on social media, however, never look this closely.
The issue with counting and the problem of rendering hands are both essentially spatial problems. Hands and human bodies are complex three-dimensional forms that are extremely variable in their appearance in two dimensions. They are also made up of a very specific configuration of parts that must all be in their right places to read convincingly as human bodies or hands. Outlandish deviations from this, such as toes for fingers or fingers that melt into a body, can read as disturbing or uncanny. As good as AI image generators are at the tasks they perform, they can only understand a body through a wide array of perspectives in two dimensions, piece by piece rather than as a whole.
For text-to-image generators, the three-dimensional world is a kind of Möbius strip that they travel around, never getting anywhere and never seeing the world as a spatial whole.
The same holds true for other complex three-dimensional objects rendered with text-to-image generators. A simple prompt in Midjourney illustrates this point. I input the text, “an architectural rendering of a staircase inside a brutalist style museum building,” and the output images produced a variety of staircases that could only be described as Escher-esque. Midjourney understands the style cues and even the content cues quite well, but the treatment of a staircase in three dimensions falls spectacularly short.
As the staircase example shows, text-to-image generators tend to riff on the various perspectives of an object they have ‘seen’ in the training data. For more complex three-dimensional objects, this can mean that any larger understanding of the space those objects occupy is lost. AI tools tend to construct space, in these cases, in odd and fantastic ways because they do not understand a staircase, for example, as a spatial construction with a distinctive practical function.
For text-to-image generators, the three-dimensional world is a kind of Möbius strip that they travel around, never getting anywhere and never seeing the world as a spatial whole. Their perspective is limited.
They exist in Plato’s allegory of the cave, able to only perceive and name the shadows (the images) we create of the three-dimensional world.
In a classic episode of Cosmos, Carl Sagan explains how we perceive different dimensions based on the limits of our own dimensionality. Citing the satirical novella Flatland (1884) by Edwin Abbott Abbott as a basis, he describes how creatures in a totally flat, two-dimensional land would have no way to even imagine the third dimension. He demonstrates this with a three-dimensional object, an apple, that he shows ‘entering’ Flatland. Cutting the apple in half, he explains that it would only be perceived by the 2D inhabitants of Flatland as successive slices passing through rather than as a whole.
This is also an instructive way to understand text-to-image generators. They exist in Plato’s allegory of the cave, able to only perceive and name the shadows (the images) we create of the three-dimensional world. Trained on two-dimensional images alone, they cannot even begin to conceive of what three-dimensional space is. All they know is the slices of it passing through their two-dimensional world, the digital images used to train them.
Text-to-image generators are proving to be a useful tool in the architecture and design arsenal, but they still have many blind spots and limitations for designing and building in the three-dimensional world.
Amanda Wasielewski is Associate Senior Lecturer of Digital Humanities and Associate Professor (Docent) of Art History at Uppsala University. Her writing and research investigate the use of digital technology in relation to art/visual culture and spatial practice. Her recent focus has been on the ...
9 Comments
Even supposedly rational and realistic linear perspective of long ago presented problems for painters that required adjustment. Pure geometry didn't produce 2D images that corresponded with the way we see and understand the outside world.
As I dimly understand it, AI taps extensive databases of existing images, which will only get larger. The designs it comes up with will be some variation and combination of what already exists. A lot of architecture does just that, repeat the known in predictable patterns, understandably, necessarily. It's not hard to imagine AI quickly coming up with designs for a whole neighborhood of standard houses, with variations of size and detail and adjustment to site, eventually, once the bugs are worked out. The next BIG or Heatherwick will be a snap. I wonder, however, if it will struggle with site. Could AI ever come up with a house as sensitive to site as Wright's Fallingwater?
But it won't be able to come up with anything new. If the Gothic style had never developed, AI wouldn't be able to create it on its on and we'd never have seen flying buttresses, etc., though likely it would figure out that a pointed arch had better load distributing characteristics than the round arch or lintel. And could AI ever comprehend the spirit of religion? In the arts, AI, could never have come up with Cubism from a database of previous paintings, Cezanne and the others.
The strength of AI-generated images depends on the depth of the questioner and the quality of the questions asked and the range of the AI's databases. The fear is that both humans and the databases get caught in a closed loop of the familiar, without breaking out. With the human, as we see so often now, choices may be determined by whatever is most known, most popular, most recent.
https://www.midjourney.com/showcase/recent/
Midjourney's images may not show its full potential, but you see repetition of the too familiar, idealized and trivialized models, superheroes, etc. Their only originally comes from different and odd combinations of what we have already seen so many times (cf. Trump in outer space). The images are quite "realistic"—and wholly simplistic and artificial.
And thanks.
"Their only originality. . . ." I keep meaning to turn off auto suggestion—but appropriate for this post.
"create it on its own"—arrrrgh.
Love the connection to Flatland, thank you! That helps me understand how these image generators "work" so much better than other articles I've read.
I also have a question following up on Gary's comment about it being a closed loop of generation: I assume the algorithms are looking at all of the images of Gothic buildings that humans have input *as well as* images of "Gothic buildings" generated by AI, so as the generated images proliferate the output will veer further and further away from being influenced by actual Gothic buildings, right? Or is there a safety that prevents that kind of self-consumption?
Great point. When does the system become inbred, and why wouldn't pranksters flood the space with crap? We'll still need humans to verify what is worth two bits despite how slick and neat-o this stuff is. So far at least, we have nothing to fear so long as we put real humans at the center of our work. Thanks for the great article.
Yep! In such a scenario, "the results tend to be digital poop." See https://www.techspot.com/news/99064-ai-feedback-loop-spell-death-future-generative-models.html
H. T. Webster drew this cartoon in 1923:
wow... "in the year 2023..." here we are...
I have experimented with prompting the system to image a "floating four dimensional space time curve". It actually attempts this but I wonder how the system can construct something that no human has (or can) experience.