Space in Photographs

Abstract
Photographs are flat objects, but we have no difficulty perceiving spatial depth in them. What makes this possible are embedded perspectival cues in the form of changes in color and brightness. There is a rich history of thought about perspective in the context of painting and drawing, as well as, more recently, in the context of computer graphics and machine vision. The relevant photographic thinking, by contrast, is often confused and ill-informed. Revisiting the Renaissance discovery of perspective nearly six hundred years ago may help improve the situation. A simple idea from Leon Battista Alberti's essay on painting from 1435–36 can be developed into an intuitive understanding of how photographs depict space and of the effects of different lenses and camera angles. The idea also serves to frame the problem of tonal reproduction, i.e. how to translate the rich colors and tones of the photographic subject into the restricted palette of a print or a computer screen. Digital image processing offers powerful tools to shape this translation, with important consequences for perceived depth.
Prologue: Parallax
How is it that our surroundings look 3-dimensional to us? A large part of the answer lies in the fact that we are endowed with binocular depth perception, or stereopsis. Close one eye, don't move your head, and much of your visual sense of space is lost. Two eyes provide two vantage points from which the world looks ever so slightly different. The displacement of features from one retinal image to the other is called parallax. Your visual system can measure the parallax and reconstruct spatial relationships from it, in effect triangulating the lay of the land in front of you.
Hold out a hand and look at it first with one eye closed, then with the other eye closed. You see the hand shift sideways relative to the background. The direction and size of the shift enable your brain to compute a spatial representation of the scene before your eyes.
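For readers who like numbers, the triangulation can be sketched in a few lines. This is a minimal illustration, not a model of the visual system: it assumes two idealized pinhole "eyes" looking straight ahead in parallel, and the function names, the 65 mm eye separation, and the 17 mm image distance are assumptions made for the example.

```python
def shift_between_views(baseline, image_distance, depth):
    """Sideways shift (parallax) of a point between the two views.

    Assumed setup: two pinholes `baseline` apart with parallel lines of
    sight, each projecting onto a surface `image_distance` behind it.
    """
    return baseline * image_distance / depth


def depth_from_parallax(baseline, image_distance, shift):
    """Invert the relation above: recover depth from the measured shift."""
    if shift <= 0:
        raise ValueError("no shift: the point is effectively infinitely far away")
    return baseline * image_distance / shift


# Illustrative numbers in millimetres (assumptions, not measurements).
EYE_SEPARATION = 65.0
IMAGE_DISTANCE = 17.0

hand = shift_between_views(EYE_SEPARATION, IMAGE_DISTANCE, depth=500.0)
wall = shift_between_views(EYE_SEPARATION, IMAGE_DISTANCE, depth=5000.0)

# The nearby hand shifts ten times as much as the wall ten times farther away.
print(hand / wall)
print(depth_from_parallax(EYE_SEPARATION, IMAGE_DISTANCE, hand))
```

The point of the sketch is only that the shift is inversely proportional to distance, which is why the hand appears to jump while the background barely moves.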
Parallax makes stereo photography possible. Two photographs taken from two sideways displaced vantage points are served up separately to each eye in order to create the impression of seeing the world through eyes placed at these vantage points. The stereo pair below is prepared for cross-eyed viewing, which is my preferred method absent fancier viewing technology. Look at the pair with your eyes crossed in such a way that the left eye looks at the right photograph and the right eye looks at the left photograph. Your eyes are toed in correctly when you see three images float side by side. The one in the middle will suddenly gain depth, at which point you can slowly relax and begin to scan it. (It may help to hold up a finger in such a way that, when seen with the right eye, the finger points at the left photo, and when seen with the left eye, it points at the right photo. Focus on the finger, then let your gaze drift to the image floating behind it.)
Ganter bridge in the Swiss Alps, designed by the late Christian Menn, from a 3D-photography project I did in the early aughts. You wouldn't get the same impression of depth if you viewed the scene directly. The stereo pair was shot with two large-format cameras set up about three feet apart, giving you the view of a giant with a 3-foot inter-ocular distance. The scene looks correspondingly small.
In stereopsis, we experience parallax simultaneously. Our visual system is also able to process parallax information that is presented sequentially. The impression of space is less powerful than with simultaneous parallax, but it is there. You can verify this by moving your head side-to-side with one eye closed.
The movie industry understood this early on. It took the camera off the tripod and put it on moving dollies, booms, cars, helicopters, drones, etc. The spatial impression from these devices can be very strong, so strong, in fact, that true 3D-photography doesn’t add enough extra pop to make the added effort, at both the production and the consumption end, worthwhile. This, I suspect, is the main reason why 3D-films never really caught on. 
A single photograph contains no parallax information. So it should be a poor device for conveying space. But this is not the case. Even a flat photograph can convey a sense of depth, albeit less vivid than a stereo pair. How is this possible? Answer: through a whole range of cues including geometry, shading, occlusion, texture gradients, depth-of-field, fading colors, etc. Many, perhaps all, of these cues can be gathered under the rubric of perspective.
Perspective
Perspective was (re)discovered in the early 1400s by Filippo Brunelleschi in Florence and codified into a set of drawing techniques by Leon Battista Alberti. Everything you need to know about photographic perspective flows from Alberti’s insight that realistic depiction of space is like painting on a window, and that viewing the result is like looking through that window from the painter's point of view: 
A “truthful” picture (Alberti’s term) is like a window on the world, and such a picture can be made by tracing the view of the world as if on a windowpane. 
Illustration: Jacopo Barozzi da Vignola (showing a rather thick window). Notice that the projected image is oddly slanted. This is a curiosity of many didactic illustrations of perspective: they are full of little perspectival errors.
It’s there already in the name: perspicere, to look through (from per- ‘through’ + specere ‘to look’). Alberti’s window is not just a clever metaphor. It is a precise model of how to construct 2D-pictures that, under the right viewing conditions, are indistinguishable from the 3D-configurations they represent. Geometrically speaking, Alberti’s eye-window-world arrangement is a linear projection in which the window defines an area on the projection or image plane and the eye of the painter or viewer is the so-called “central” or projection point. (In recent years, “central point” perspective has come to epitomize Western “hegemonic” business, for reasons that I cannot fathom. The “central” point here isn't even central, unlike in photography, where it lies between the object and the image plane. To avoid distraction, I will mostly use the less charged expression “projection point”.) 
In Alberti’s setup, the image of a point O on the far side of the image plane is the intersection with the image plane of the sightline that connects O to the projection point. The image of an object is the set of images of the points that constitute the object and are visible from the projection point. (This needs refining to deal with partially transparent objects, but it will do for now.) 
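The construction is simple enough to write down as a function. The following is a minimal sketch under a coordinate convention of my own choosing (an assumption, not something Alberti specifies): the projection point sits at the origin, the window is the plane z = d, and the scene lies beyond it. The name `project` is likewise just an illustrative label.

```python
def project(point, d=1.0):
    """Image of a 3D point on Alberti's window.

    Assumed setup: projection point at the origin, window on the plane
    z = d, scene at larger z. Returns window coordinates (x, y).
    """
    x, y, z = point
    if z <= 0:
        raise ValueError("the point must lie in front of the projection point")
    scale = d / z  # similar triangles: image offset / d = object offset / z
    return (x * scale, y * scale)


# A point 4 units away, 2 to the right and 1 up, seen through a window 1 unit away.
print(project((2.0, 1.0, 4.0), d=1.0))

# A point twice as far away on the same sightline lands on the same spot.
print(project((4.0, 2.0, 8.0), d=1.0))
```

All points on one sightline share a single image, which is why only the nearest opaque one is visible from the projection point.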
If a camera is positioned so that the optical center of its lens is at the location of the painter’s eye and the sensor is aligned with Alberti’s window, it will produce a photograph that is “similar” to the image on the window in the strict geometric sense that it agrees with the image in every respect but scale. (If you want to verify this, think about similar triangles spanned by the sightlines.)

Alberti's window to the left of the projection point P, the photograph to the right of P

This means that Alberti’s perspectival projection shares its geometric properties with photography. To begin with, straight lines are imaged as straight lines. This can be gleaned from the diagram below. The straight line L, together with the projection point P not on L, defines a flat plane, G (green). The sightlines (dotted) from P to any point on L all lie on G. Therefore L’s image, consisting of the intersection points of the sightlines with the image plane, also lies on G, namely, at the place where G and the image plane intersect. And the place where two (flat) planes intersect is always a straight line. 
As a straight line recedes from the image plane, its image approaches a “vanishing” point beyond which it does not extend. The image lies between the vanishing point and the line’s intersection with the image plane. In the diagram, L recedes from the image plane towards the background. We draw a line parallel to L through P, which lies on plane G and intersects the image plane at some point V. The area that is swept out by the sightlines from P to L is bounded on the left by the line PV and on the right by the line PQ, where Q is L’s intersection with the image plane. The bounds exist because sightlines must first pass through the image plane before they meet L, and no sightline passing through the image plane to the left of V or to the right of Q meets L behind the image plane. The image L' is therefore bounded by V and Q. 
Vanishing points become interesting when more than one line is involved: the image of a bundle of straight lines that are parallel to each other but not parallel to the image plane is a bundle of straight lines that converge at the same vanishing point. This is because the relevant property of L that determines the vanishing point is L’s direction, not its location in space: we found the vanishing point by drawing a line through P in the same direction as L. Any line L* parallel to L has the same direction and therefore the same vanishing point. (You may think of generating the image of L* by rotating the green plane around the line PV until it meets L*.)
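The argument can be checked numerically. This is a minimal sketch using an assumed coordinate convention (projection point at the origin, image plane at z = d); the function names are hypothetical labels for the steps described above.

```python
def project(point, d=1.0):
    """Image of a 3D point on the plane z = d, seen from the origin."""
    x, y, z = point
    return (d * x / z, d * y / z)


def vanishing_point(direction, d=1.0):
    """Where the line through the projection point with the given
    direction meets the image plane z = d."""
    dx, dy, dz = direction
    if dz == 0:
        raise ValueError("line is parallel to the image plane: no vanishing point")
    return (d * dx / dz, d * dy / dz)


def image_far_along(start, direction, t, d=1.0):
    """Image of the point reached by travelling t units of `direction` from `start`."""
    far_point = tuple(s + t * k for s, k in zip(start, direction))
    return project(far_point, d)


direction = (1.0, 0.0, 2.0)        # shared direction of two parallel lines
line_a_start = (-1.0, -1.0, 2.0)   # the lines differ only in location
line_b_start = (3.0, -1.0, 2.0)

v = vanishing_point(direction)
a_far = image_far_along(line_a_start, direction, t=1e6)
b_far = image_far_along(line_b_start, direction, t=1e6)

# Both images crowd towards the same point, which depends on direction alone.
print(v, a_far, b_far)
```

Note that `vanishing_point` never looks at where a line is located, only at its direction, which is the whole content of the paragraph above.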
Floor corrugations are not parallel to the image plane and converge at a vanishing point (which, for lines perpendicular to the image plane, is the point on the horizon where a perpendicular dropped from the projection point meets the image plane).
The columns and girders in the photograph do not converge. They represent the special case of bundles of straight lines that are parallel, not only to each other, but also to the image plane. Such bundles are imaged as bundles of parallel straight lines. You may take this on faith and skip to the next paragraph. Otherwise, pick a line L parallel to the image plane and again draw a line A parallel to L through the projection point P. A does not intersect the image plane, so there is no vanishing point. But the plane AL, defined by A and L, does intersect the image plane, and the image L' of L will lie on this intersection. (In the limiting case where AL is parallel to the image plane, the distance between L and A is infinite and there is no image.) L' and A must be parallel. If they weren’t, then A would intersect the image plane, and so would L, in virtue of being parallel to A, but this contradicts the assumption that L is parallel to the image plane. For any other line L* parallel to L, the argument is exactly the same: its image L*' is parallel to A, and thus all images are parallel to each other. End of gymnastics. 
One often encounters talk about one-, two-, and three-point perspective, characterized by the number of vanishing points. (What is bandied about as four-, five-, or six-point perspective are altogether different kinds of projection from Alberti’s.) This tripartite classification is really a classification of drawing techniques for certain simple spatial arrangements, not a typology of perspective. As a typology of perspective, it would be woefully incomplete because there is no upper limit on the number of vanishing points. There are as many vanishing points in the image plane as there are bundles of parallel lines in the object space that are not parallel to each other or to the image plane. The following photograph has no fewer than seven vanishing points. 
If we placed a cube on the floor of the building in such a way that its sides aren’t aligned with the walls, we would bring the number of vanishing points to nine. And so on. The number of vanishing points tells us something about the depicted scene, not about the manner of depiction. 
Alberti and others devised techniques for producing perspectival images that combine geometric construction with physical machinery. Some of it gets rather involved. One may wonder if the engraving on the right from Albrecht Dürer's Underweysung der Messung (Measurement Tutorial) from 1525 was meant to embellish the logic of projection or illustrate a practical method. 
You’ll notice that the subject in Dürer’s engraving is treated as if it were made entirely of outlines: a wireframe model whose image is a set of points connected by lines. But of course the visual world is not a wireframe model, and paintings aren't line drawings. What is missing is color and tone.
Our model of projection is an idealization with distinct theoretical advantages, but here we confront one of its shortcomings. The model deals with points and lines in space and their images – points and lines again – in the projection plane. But points and lines lack extension and therefore color. What we see and paint and photograph are not mathematical points but tiny (yet still extended) colored blots. To conceptualize this, we need to replace sightlines with steep visual pyramids. These pyramids, with their apexes at the projection point, intersect the image plane in square pixels and the visible surfaces of objects in messy cross-sections. 
Working this out in detail is tedious and doesn’t matter here. What matters is that a conceptual shift from points to pixels is a natural way to incorporate tonal reproduction into the treatment of perspective. Tonal reproduction is the translation of the subject's colors into the colors of the painting (or representation, more generally). If a painting is supposed to be like a window onto the world, then it needs to get not only the world's geometry right but also the world's colors and tones—and not as they are (whatever that means, exactly), but as they appear from the painter's point of view.
There is dramatic variation in the concrete floor, from dark brown to almost white. The floor itself is greyish brown, but how it looks depends very much on the point of view. It acts as a dull mirror, and mirrors show different things from different angles. For example, the highlights move across the floor as you wander around the space (looking at the real floor, not at the photograph). All surfaces are like that, to a greater or lesser extent. So color and tone are a matter of the projection point and therefore a matter of perspective. Aerial perspective – a brightening of colors and a shift towards blue with increasing distance, caused by atmospheric light scattering – becomes part of perspective proper. I will have to say more about tonal reproduction towards the end of this piece. (Homework assignment: generalize to reproduction with a restricted color palette, with black-and-white as the limiting case.)
Let's return to the techniques and devices proposed for constructing perspectives. In Dürer’s engraving, we got a glimpse of some of the complications involved. There is an elegant shortcut that does away with all the complications, at least in principle: the camera obscura (literally, dark chamber, which, by the way, goes back at least to the 11th-century Egyptian scholar Alhazen).

Athanasius Kircher, Large portable camera obscura, 1646, cut open for illustration

A small hole projects an upside-down image on the back of the dark chamber which is then traced by the painter.
The camera obscura offers an elegant solution to the problem of perspectival representation—at least in principle. In practice, things are less elegant. The camera is called “obscura” for a reason: inside of it, it is very dark, so dark, in fact, that tracing outlines is a challenge, and getting colors right is nigh impossible. You’d have to mix your pigments in the dark and paint each shade of color under very dim light of exactly that same color. This really messes with your judgment. Much better to evict the painter from the camera and instead place light-sensitive material at the back. This move of course had to wait for the invention of the right chemistry in the 19th century.
The image recorded at the back of a photographic camera is an upside-down version of Alberti’s window. In order to see it properly, we need to turn it right-side up, switch it from negative to positive (unless we use slide or Polaroid film), and place our eye in the correct viewing position, like so:
You may think of what’s going on here as running the camera in reverse, turning it into a projector. We project the image from the back of the camera through the hole or lens outwards onto a screen. When one looks at that projected image from the point where the light enters the camera, which is the central point of the projection, then the image is exactly superimposed onto the world: the illusion is as good as it gets. (You will have to be clever to avoid blocking the projection with your head when you peek through the hole.)
I have glossed over the role of the photographic lens in all this. This is because the lens has no bearing whatsoever on perspective. All it does is project a brighter and sharper image than the image projected by a little hole. Some lenses project larger image circles that include more subject matter, others project smaller image circles that include less subject matter. But their images all match up perfectly when scaled to the same magnification (provided they are focused on the same object plane and shot through the same size lens opening—not f-stop).
For the painter inside the camera obscura, a lens trades one problem—darkness—for another: focus. In order to let in the maximum amount of light, the lens opening has to be quite large. This means that only a thin object plane will be in focus at any given time because depth of field is inversely related to aperture. To paint the out-of-focus parts of the scene, the painter has to refocus the camera. This can be done by moving the back of the camera, which changes the magnification of the image. Or it can be done by moving the lens, which changes the viewpoint and hence the perspective. A headache either way, and something to ponder in the debate about whether painters like Vermeer actually used lenses in this way. 
More on the correct viewpoint
Alberti’s idea implies that, in order for a perspectival picture to look right, the viewer’s eye must be in the same place relative to the picture as the camera lens was relative to the film or the digital sensor. He writes: “Know that a painted thing can never appear truthful where there is not a definite distance for seeing it” (De Pictura, Book 1).
When you take a photo or draw a perspective, you know where the correct viewpoint is because you got to choose it. But what if you are handed a picture for which you don’t know the viewpoint?
One option is to go to the original site and superimpose the picture on it until it blends in seamlessly. There is a whole cottage industry devoted to this. 
If you are sufficiently obsessive, you can fine-tune the alignment and mix older and newer photography:

Shawn Clover, San Francisco 1906 earthquake material

Without the benefit of the original site behind the picture, we can often still determine the right viewpoint experimentally by closing one eye and moving our head around in front of the picture until it looks spatially most compelling. Closing one eye is essential. With both eyes open, there is a conflict between stereopsis telling us that we are looking at a flat surface and perspective cues suggesting depth. Closing one eye turns off the conflicting stereoptic information. 
Pieter Saenredam, Great Church Haarlem, 1648, Scottish National Gallery, viewed from the correct viewpoint
I highly recommend this exercise. You can perform it with images on a reasonably sized computer screen or with prints in books or magazines. But the real fun is with large pictures in a gallery or a museum. It doesn't work with your phone unless you are extremely near-sighted because you cannot see the screen in focus from close enough.
In some cases, the viewpoint can be constructed geometrically. The construction is always based on certain plausible but unproven assumptions. This is different from stereopsis where depth information can be mathematically derived from the visual input without any further assumptions.
What kinds of assumptions have to be made? The simplest ones are that certain lines in the image that converge at a point represent parallel lines in the world, i.e. that the point is their common vanishing point; and that certain angles in the image represent right angles in the world. These assumptions are a robust guide to the architectural world surrounding us, which is full of rectilinear structure.

Giovanni Panini’s Interior of the Pantheon, circa 1734

Let’s try it for Panini’s painting of the Pantheon. First, find the vanishing point V for orthogonals to the image plane. Draw the horizon line through V. Then find the intersection E of the horizon line and diagonals of squares that are orthogonal to the image plane. The viewpoint lies on the orthogonal through V at distance |VE| in front of the painting. (The relevant geometry is explained here.)
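Why |VE| gives the viewing distance can be seen in a small forward-and-back check. This is a minimal sketch under the same assumptions as the construction: the picture plane is vertical, and the floor squares have one pair of sides orthogonal to it, so their diagonals run at 45 degrees. The coordinate convention (viewpoint at the origin, picture on the plane z = d) and the function names are mine, chosen for illustration.

```python
import math


def vanishing_point(direction, d):
    """Vanishing point on the picture plane z = d for lines of a given direction."""
    dx, dy, dz = direction
    return (d * dx / dz, d * dy / dz)


def viewing_distance(v, e):
    """Distance of the viewpoint in front of the picture, in picture units,
    given the orthogonals' vanishing point V and the diagonals' vanishing point E."""
    return math.hypot(e[0] - v[0], e[1] - v[1])


# Forward direction: paint a scene from a known distance (illustrative value).
true_distance = 2.5
V = vanishing_point((0.0, 0.0, 1.0), true_distance)  # orthogonals to the picture plane
E = vanishing_point((1.0, 0.0, 1.0), true_distance)  # diagonals of the floor squares

# Backward direction: recover the distance from the picture alone.
print(viewing_distance(V, E))
```

If the tiles are not in fact square, or not aligned with the picture plane, the recovered distance is off accordingly; the construction is only as good as its assumptions.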
Putting your eye at the right viewpoint would be quite uncomfortable. You would be so close to the painting that you could barely see the upper parts without tilting your head. But as soon as you tilt your head, you will lose the viewpoint. I will return to this observation shortly when I talk about wide-angle lenses. 
Lucas Cranach, Melancholy Allegory, Scottish National Gallery 
The viewpoint for Cranach’s Melancholy Allegory is actually to the right of the frame, a hint that the painting at one point probably extended much further to the right and was subsequently cropped. The painting is now hung so high that one cannot put one’s eye in the right place. Sadly, this happens all too often in museums. For example, Canaletto’s vedute of Venice are almost always hung so high that one cannot appreciate their hypnotic spatiality. Besides, when paintings are high up on the wall, they tend to be overpowered by reflections from overhead light. Even paintings hung at the right height are all too often impossible to view from the right vantage point because of reflections. I, for one, am mystified by this common disregard for viewing conditions in the very temples of art. 
The assumptions underlying the viewpoint reconstruction for the Panini painting would be of little help if our built environment looked like this:

Earth House, Austin, TX, early 1970s, by Tao design group

For Renaissance artists, depiction of space and architectural regularity went hand in hand. It is tempting to infer from this that architectural regularity is required to sustain depth in pictures. But the Earth House photograph demonstrates that this is not the case. What is true is that constructing a perspectival view of the idealized Renaissance city by geometric means is quite straightforward, compared to the tedious work that would be required to construct a perspective of the Earth House. And determining the correct viewpoint for the Earth House photograph is difficult as well without the help of vanishing points and the like. 
The simple assumptions underlying the reconstruction in the Panini example and others like it are a special case and not universally applicable. But where they are applicable, they can be very powerful, so powerful that they override other, contravening depth cues and give rise to strong optical illusions. A famous example is the Ames room which exploits our proclivity to perceive right angles and parallel lines where in reality there aren't any:
There are many other delightful ways to fool the default assumptions we make about our environment's geometry. The Japanese mathematician and artist Kokichi Sugihara has turned this into an art form:
Notice how motion parallax in Sugihara's video enables an intuitive grasp of space. The illusions rub in the point that inferences about space based on observation from a single viewpoint always go out on a limb. The pièce de résistance in this regard is Bernard Pras’s anamorphic sculpture of postman Ferdinand Cheval:
Wide-angle and telephoto lenses
Being clear about viewpoints resolves a lot of confusion in the photography world about how different lenses affect perspective. Wide-angle lenses are said to introduce “perspective distortion”, telephoto lenses are said to “collapse perspective” or “foreshorten distances”. What people have in mind is the following.
The building wings in this telephoto image look “telescoped together”; it’s hard to tell how deep the space between them is.
A wide-angle lens appears to exaggerate depth and distort objects near the edges of the photo. The red oblong table in the foreground is circular in real life. 
A wide-angle example bordering on the grotesque is a White House photograph in which President and Mrs. Biden appear to dwarf President and Mrs. Carter. The New York Times, upon consulting with experts, attributes the effect to the “perspective distortion” inherent in wide-angle lenses.
The clout of the Times notwithstanding, what we see here are not distortions of perspective, for the simple reason that such distortions do not exist. This is not to say that nothing interesting is going on; it is just to say that lens properties aren’t the issue. Lenses do not affect perspective (unless they are badly flawed or of highly unusual design); therefore there cannot be wide-angle distortion or telephoto collapse of perspective. What lenses of different focal lengths do is this: they draw dimmer or brighter, smaller or larger, more or less inclusive circular images on the projection plane. That's all there is to it. Once we adjust for brightness and scale, all lenses depict the world in the same way. They just take a wider or narrower view. A wide-angle lens draws a circle whose center portion exactly matches the circle drawn by a telephoto lens; the photo taken with the telephoto lens is simply a small part of the photo taken with the wide-angle lens.
To illustrate, here is a wide angle shot.
And here is a moderate telephoto shot from the same camera position. Except for scale, it is indistinguishable from the central portion of the wide-angle shot.
Actually, that was a lie. The second photo is not a telephoto shot; it is a crop from the wide-angle shot. But a telephoto shot would look identical. If I showed it to you, you wouldn’t know the difference, which is why I didn't bother.
Don’t take my word for it. Do the experiment for yourself. Photograph a scene with a wide-angle and a telephoto lens from the same vantage point. Then resize and overlay one shot on the other. 
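If you have no lenses at hand, the experiment can also be run in a few lines of code. This is a minimal sketch under idealized assumptions: both lenses are treated as pinholes at the same position, focused at infinity so that the image distance equals the focal length, and free of optical flaws. The scene points and the two focal lengths are made up for the example.

```python
import math


def photograph(point, focal_length):
    """Image of a scene point on the sensor of an idealized camera.

    Assumed setup: pinhole at the origin, sensor on the plane
    z = -focal_length, hence the inverted (negative) coordinates.
    """
    x, y, z = point
    return (-focal_length * x / z, -focal_length * y / z)


# A few scene points at various depths (illustrative, in arbitrary units).
scene = [(0.0, 0.0, 10.0), (0.3, -0.2, 4.0), (-0.5, 0.1, 25.0), (0.05, 0.4, 7.0)]

WIDE, TELE = 14.0, 100.0
wide_shot = [photograph(p, WIDE) for p in scene]
tele_shot = [photograph(p, TELE) for p in scene]

# Enlarge the wide-angle shot by the ratio of focal lengths and overlay.
enlarged = [(u * TELE / WIDE, v * TELE / WIDE) for u, v in wide_shot]

shots_match = all(
    math.isclose(a, b, rel_tol=1e-9, abs_tol=1e-12)
    for wide_pt, tele_pt in zip(enlarged, tele_shot)
    for a, b in zip(wide_pt, tele_pt)
)
print(shots_match)
```

The focal length enters only as a common scale factor; the relative positions of the image points are fixed by the vantage point alone.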
In reproductions of the same size, where this sort of resizing wasn't done, the correct vantage point for the telephoto image is at quite a distance, and the one for the wide-angle image is uncomfortably close. Yet we tend to look at both images from the same comfortable middle distance, which isn't appropriate for either. The impression of distortion is entirely the result of the inappropriate viewing distance: 
Photographs viewed from the correct vantage point do not show any perspectival distortions. 
Here is a painting from the 15th century that is very instructive in this context:

Andrea Mantegna’s Lamentation of Christ, circa 1480

Many art historians consider Christ's feet too small. They reckon that Mantegna departed from perspective to make the feet “look right”. But are the feet really too small? From close up, it surely looks that way. But that’s the least of Mantegna’s problems: from close up, the entire body looks ill-proportioned. As you move away, however, things begin to look more proportionate. 
From somewhere around 30 times the image width, the body parts fall into place, the figure looks natural. The way to convince yourself of this is not by shrinking the painting (because it just gets too small) but by photographing a real person from far away and comparing the photograph to the painting.
The comparison suggests that Mantegna painted the figure with the narrow angle of a super-telephoto lens. He did this a century before the invention of the telescope. How did he do it? I have no idea. It is like painting your tennis partner from the other end of the court, for which you need a hawk's vision. And why would Mantegna have done this when it makes the painting look strange from any more reasonable viewing distance?
There is another puzzle, to do with the marble slab on which the body lies. From afar, the slab sticks up way too much for a horizontal object the length of Christ. If the slab is indeed horizontal, rectangular, and twice as long as it is wide, then it is painted with a very different perspective from the figure, one that induces a viewing distance of only 3.5 times the image width.
The painting was found in Mantegna’s studio after his death, and nobody knows how he would have displayed it. In some of his other works, Mantegna chose quite unusual viewpoints, some even outside the frame, but never anything this radical. For those other works, which are frescoes, the unusual viewpoints cohere very well with the works’ installations.
So much for telephoto effects. Before I get to wide-angle “distortion”, let me address what lies between wide-angles and telephotos.
The idea of the normal lens
In the days before zoom lenses came to dominate the landscape, camera bodies were often sold together with a “normal” lens. A normal lens is conventionally understood as having a fixed focal length about equal to the diagonal of the film or sensor onto which it projects. The diagonal of 35mm film and so-called full-frame sensors is 43mm. The most common normal lens happens to be a bit longer, 50mm, a discrepancy that continues to irk many photographers. But what is normal about such a lens in the first place, and what does it have to do with perspective?
The answer that is repeated ad nauseam is that a normal lens produces an image that roughly matches what the human eye sees and therefore makes for natural viewing. Is this true?
This is a sketch drawn by the physicist Ernst Mach at the end of the 19th century (the Mach familiar from airspeed measurement in terms of the speed of sound). The sketch shows Mach’s study as seen out of his left eye, with his nose framing the image on the right side.
I reclined on my couch at home and took a photo in the direction of my feet with a normal lens. Here is part of that photo overlaid on Mach's sketch, with me stepping into his right shoe to make sure I get the scale right.
And here is the entire frame captured by the normal lens.
As you can see, the frame includes only a fraction of what the eye sees. The normal (50mm equivalent) lens covers an angle of about 46 degrees, measured diagonally across the frame. The angle of view of the human eye is on the order of 210 degrees, and that is without moving the fixation point.
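The 46-degree figure follows from elementary trigonometry. Here is a minimal sketch, assuming an idealized rectilinear lens focused at infinity and the nominal 24 mm x 36 mm frame; the function name is a hypothetical label.

```python
import math


def angle_of_view(focal_length, frame_dimension):
    """Angle of view in degrees across a given frame dimension.

    Assumes an idealized rectilinear lens focused at infinity, so the
    image is formed one focal length behind the projection point.
    """
    return math.degrees(2 * math.atan(frame_dimension / (2 * focal_length)))


FRAME_DIAGONAL_MM = math.hypot(24.0, 36.0)  # nominal 35mm frame, about 43 mm

normal = angle_of_view(50.0, FRAME_DIAGONAL_MM)
print(round(normal, 1))  # a little under 47 degrees, i.e. "about 46"
```

The same function gives the angle across the long or short side of the frame if you pass 36 or 24 instead of the diagonal.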
You might say: the claim about normal lenses was meant to be restricted to our sharp central vision. But that cannot be right either: the angle of foveal, sharp vision is only 1.5–2 degrees. The angle of macular vision (where things are still somewhat sharp) is only about 17 degrees. What makes a lens normal must therefore be something different. It has to do, I submit, with conventions about viewing conditions.
Most people these days, photographers included, don’t fuss over viewing conditions. Photos taken with vastly different lenses all get enlarged to various sizes and viewed from various distances—in an album, a book, a gallery, or on a smartphone—without any consideration for the focal lengths involved. But there is one particular viewing condition that was considered normal long ago when prints were the preferred currency of photography. And for that condition, the normal lens stands out as the right focal length to produce photos that look spatially natural—or perspectivally correct—when viewed under that condition.
Here is how it works. Once upon a time, the standard print size for photographs was 8" x 10". When you hold an 8" x 10" print at a comfortable reading distance, which for most people is in the 14"–16" range, the distance between eye and print is slightly greater than the print diagonal, which is 13". Likewise, the focal length of a normal lens is slightly greater than the film diagonal. Therefore the geometric relationship between eye and print is the same as the relationship between normal lens and film (or sensor). In fact, the correspondence is extremely close. If we let the reading distance fall in the middle of the range, at 15", then it exceeds the print diagonal by a factor of 15/13. Multiply the diagonal of a 35mm frame by 15/13 and you get 49.6mm! So stop complaining about 50mm being too long for a normal lens. (Homework assignment: a 35mm negative’s aspect ratio of 2/3 does not exactly match the 4/5 print ratio. How does that affect the argument?)
To recap: if you hold an 8" x 10" print from a 35mm negative exposed with a 50mm lens at reading distance, then everything will look spatially natural. The print is a perspectivally sound representation because it works exactly like Alberti’s window.
It so happens that a normal lens is also the easiest and cheapest focal length to design and manufacture, which helps explain how it became everybody's first lens. I don’t know whether there is a deeper cause for this coincidence or whether Alberti’s ghost simply ordained it.
Wide-angle lenses
As I said, we tend to view photos taken with wide-angle and telephoto lenses in the same manner as we view photos taken with a normal lens. Consequently, we view wide-angle photos from farther away than their viewpoint, and we view long-lens photos from too close up. This is where the lore of perspectival distortion originates. Mantegna’s Christ looks distorted, all right, but as we saw, this is no flaw of the painting but a consequence of our looking at it from too close.
If you want to use a telephoto lens and show convincing depth, you must place your viewers in the right position. This means you must either show very small prints or else keep people at a distance. But this is not what you bought that big expensive lens for. So the lens is a poor choice for conveying space. A normal lens is good with respect to perspective under normal viewing conditions. But it is often still too narrow to capture what we are interested in. So we are forced into using wide-angles. What about their distortions?
Let’s begin by convincing ourselves that, as with telephoto lenses, the distortion is not a lens defect but the consequence of looking at the photo from the wrong distance.

14mm lens on “full frame” 35mm sensor

The round objects near the edges in this photo, taken with a 14mm wide-angle lens on a full-frame sensor, look seriously deformed. Could this be just a matter of viewing distance?
The right viewing distance for this photo is about half its height, a little less than 5" from my 15" laptop screen when the photo takes up the entire screen. My eyesight fails me at that distance. But my phone can do it, and here is how it sees the orange in the bottom left corner:
The orange now looks right, the distortion is gone. (You may notice that this phone picture has its own wide-angle “distortions” towards the edges. This is because the phone’s lens is itself a moderate wide-angle, corresponding to a 28mm lens on a full-frame sensor.) You can easily do the experiment for yourself by clicking on the photo to enlarge it on your computer screen and looking at it through your phone. Keep in mind that the phone lens must be centered on the screen and at a distance of half the image height. 
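The half-the-image-height rule follows from scaling the focal length by the enlargement factor. A small Python sketch, with a hypothetical function name and an assumed on-screen image height of 8" (my guess for a photo filling a 15" laptop screen, not a measured value):

```python
def correct_viewing_distance(focal_length_mm, sensor_height_mm, displayed_height):
    """Viewing distance, in the unit of displayed_height, at which the
    displayed image subtends the same angles as the scene did at the lens."""
    enlargement = displayed_height / sensor_height_mm
    return focal_length_mm * enlargement

SENSOR_HEIGHT_MM = 24.0        # full-frame 35mm sensor, 36mm x 24mm
SCREEN_IMAGE_HEIGHT_IN = 8.0   # assumed height of the photo on the screen

wide = correct_viewing_distance(14, SENSOR_HEIGHT_MM, SCREEN_IMAGE_HEIGHT_IN)
normal = correct_viewing_distance(50, SENSOR_HEIGHT_MM, SCREEN_IMAGE_HEIGHT_IN)

print(f'14mm photo: view from {wide:.1f}" ({wide / SCREEN_IMAGE_HEIGHT_IN:.2f} x image height)')
print(f'50mm photo: view from {normal:.1f}" ({normal / SCREEN_IMAGE_HEIGHT_IN:.2f} x image height)')
```

For the 14mm lens the ratio 14/24 is a little over one half, hence the rule of thumb. The same image height shot with a 50mm lens asks for a viewing distance of roughly two image heights, which is comfortable reading distance.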
The following video is my attempt to scan the entire wide-angle photo from the correct viewpoint.
As you can see, there is nothing wrong with the 14mm lens. There is no wide-angle distortion. Spheres look appropriately spherical when viewed from the right vantage point. What you can also see is how difficult it is for me to hold the phone in the right place. We encountered this problem of maintaining the viewpoint already in the discussion of Panini’s painting of the Pantheon.
But now, what about barrel and pincushion distortion? They surely exist! Yes, there are distortions that are lens flaws, but they are different from what we have been looking at. Let me show you the barrel distortion that is common in wide-angle lenses. 
You'll notice the outward bend of the table’s edge in the foreground. Longer lenses tend to exhibit the opposite bend, called pincushion distortion. Barrel and pincushion distortions do violate perspective because straight lines are imaged as curves. Both distortions can be corrected in software, which mostly happens automatically these days. The version of the photo you saw earlier had the distortion corrected.
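To give a feel for what such a software correction involves, here is a minimal Python sketch based on a one-coefficient radial model. This is a common textbook simplification, not the method of any particular raw converter, and the coefficient value is hypothetical. Coordinates are normalized so that the image center is at the origin.

```python
def distort(x, y, k1):
    """Apply a one-coefficient radial distortion to normalized image
    coordinates. Negative k1 pulls distant points inward (barrel)."""
    scale = 1.0 + k1 * (x * x + y * y)
    return x * scale, y * scale

def undistort(xd, yd, k1, iterations=20):
    """Invert distort() by fixed-point iteration on the undistorted point."""
    xu, yu = xd, yd
    for _ in range(iterations):
        scale = 1.0 + k1 * (xu * xu + yu * yu)
        xu, yu = xd / scale, yd / scale
    return xu, yu

K1 = -0.1  # hypothetical coefficient; negative means barrel distortion

# A straight horizontal line above the image center, like a table edge.
line = [(x / 4.0, 0.5) for x in range(-4, 5)]
bent = [distort(x, y, K1) for x, y in line]
restored = [undistort(x, y, K1) for x, y in bent]

for (xb, yb), (xr, yr) in zip(bent, restored):
    print(f"distorted y = {yb:.4f}   corrected y = {yr:.4f}")
```

In the distorted version the ends of the line sit closer to the center than its middle, so the line bows outward. After the correction all points are back at the same height.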
By contrast, the perspectival “distortion” that bothers many people must be corrected, not by software, but by getting into the right viewing position. How can this be achieved? By printing large, really large.
This photo was made on 4" x 5" sheet film with the equivalent of a 14mm lens. The right viewing distance is again 1/2 of the image height, too close for a normal print. When I had the photo in a show (about a decommissioned coal-fired power plant), it was printed at 8 ft x 10 ft. At that size, it can be viewed comfortably and correctly from 4 ft away, without danger of losing the right viewpoint by tilting one's head. The spatial impression thus created is strong and convincing (e.g. in the appearance of the stairs top-left or the fire extinguisher).
What if you can’t print large enough? Maybe software can help after all. There is a “Volume Anamorphosis” tool by a well-known French software company that claims to correct wide-angle distortion. Let's give it a try. First the original photo, then the results of running two different correction algorithms.

Original photo with “distorted” bass drum and drummer. The ceiling is slanted in reality.

Output of the first algorithm

In the output from the first algorithm, the drum, and especially the drummer, look better from a distance, but now we have noticeably bent lines on the left wall and ceiling that should be straight.

Output of the second algorithm

In the output from the second algorithm, the drum is nicely round, the drummer looks good, but we have barrel distortion on the right wall and the floor.
It is like trying to straighten out a bulging carpet. All the software can do is push the bulge around; “correcting” one part of the image messes up another part. How obvious this is depends on the image. In most architectural photographs, I wouldn't consider the tradeoff. I might paste one of the “corrected” versions of drum and drummer into the original photo, but only in the context of architectural photography, where marketing is the name of the game. I wouldn’t do it in the context of documentary photography, where I am willing to live with the “distortions”. I try to minimize the problem by keeping round objects away from the edges of the frame.
Since wide-angle photos are hardly ever viewed from sufficiently close up, it remains an interesting question to ask what people make of the perceived “distortions”. (For illustration, it may be helpful to refer back to the photo of my cluttered dining table.) 
The architectural historian Claire Zimmerman summarizes her misgivings about a set of architectural photographs from the early 1930s depicting the famous Villa Tugendhat as follows:
“the relative dimensions of the space were altered in the photographs. This is an effect of the wide-angle lens. [...] the wide-angle lens stretches unevenly, elongating the space closest to the camera and compressing the space behind.” (Zimmerman, Photographic Architecture in the Twentieth Century, University of Minnesota Press 2014, pp. 111-2) 
This sounds to me like a deformation in one dimension, perpendicular to the image plane. Later, Zimmerman says:
“the wide angle lens [...] distorts perspectival space into a trumpet shape” (ibid. p. 227). 
This sounds like a different, more complex deformation.
Let’s try to figure out how exactly the space in front of the camera would have to be stretched or squeezed in order to give rise to the offending image. For this exercise, we assume that the (too distant) viewpoint is in fact the projection point and work backwards from the image to possible shapes in the world. We are not interested in just any of the infinitely many possible shapes, only in ones that can be produced from the actual shapes by fairly simple squeezing and stretching. How might the object space be transformed in order to look like that?
The following sketch illustrates two possible transformations, one being a stretch perpendicular to the image plane, the other my best stab at Zimmerman's trumpet shape. We look at the situation from above. The point E on the bottom right is the correct viewpoint for the sphere in the object space on the left, E' is the actual viewpoint. The sphere with E as the projection point yields the same image as the ellipsoid or the squished form with E' as the projection point. 
Here is a video to illustrate the simpler of the two transformations. It shows how a regular and an orthogonally stretched scene can look exactly alike when viewed from different viewpoints. It starts out slow to give you time to contemplate the scene.

Thanks to Jane Foote for the modeling.

If stretching in depth is the relevant transformation, then it has to be uniform, not “uneven” as Zimmerman claims. I wish I could show you the trumpet-shaped distortion. But the architectural modeling program we used couldn't compute the more complex transformation. 
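The uniformity claim can be checked with a few lines of Python. This is a sketch under simple assumptions: an ideal central projection, depth measured from the projection point along the line of sight, and made-up numbers for the sphere and the two viewing distances.

```python
import math

def project(point, eye_to_image):
    """Central projection onto an image plane at distance eye_to_image
    from the eye, which sits at the origin looking along +z."""
    x, y, z = point
    return (eye_to_image * x / z, eye_to_image * y / z)

def stretch_depth(point, factor):
    """Uniform stretch in depth: every z is multiplied by the same factor."""
    x, y, z = point
    return (x, y, factor * z)

def sphere_points(center, radius, n=12):
    """Sample points on the surface of a sphere."""
    cx, cy, cz = center
    points = []
    for i in range(1, n):
        polar = math.pi * i / n
        for j in range(n):
            azimuth = 2.0 * math.pi * j / n
            points.append((
                cx + radius * math.sin(polar) * math.cos(azimuth),
                cy + radius * math.sin(polar) * math.sin(azimuth),
                cz + radius * math.cos(polar),
            ))
    return points

D = 14.0  # correct viewing distance (hypothetical units)
K = 3.0   # the actual viewpoint is K times too far away

sphere = sphere_points(center=(20.0, 0.0, 40.0), radius=5.0)
ellipsoid = [stretch_depth(p, K) for p in sphere]

image_from_E = [project(p, D) for p in sphere]
image_from_E_prime = [project(p, K * D) for p in ellipsoid]

max_difference = max(
    max(abs(a[0] - b[0]), abs(a[1] - b[1]))
    for a, b in zip(image_from_E, image_from_E_prime)
)
print(f"largest difference between the two images: {max_difference:.2e}")
```

The two images agree to within rounding error. One and the same factor K is applied at every depth, so near and far parts of the scene are stretched alike.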
So much for wide-angle lenses. For what it's worth, my own preferred general-purpose focal length is 35mm. At normal viewing distance, the slightly exaggerated depth compared to a normal lens is not too obvious, and the lens allows me to include more context surrounding the main subject, which is something that I often like to do.
Straight Verticals
With the examples I have shown so far, I have smuggled in a convention that has been in force in architectural photography from the very beginning. In fact, the convention predates photography by centuries. It stipulates that real-world verticals align with the sides of the picture. The origin of the convention lies, as far as I can tell, in architectural practice and ultimately in gravity.
Let’s take Alberti at his word and envisage a real window, cut into a real wall. Real walls are mostly vertical because the ones that aren’t tend to have a short lifespan. A practical opening in a vertical wall has vertical sides. These sides run parallel to all other verticals in the vicinity (we may ignore the Earth’s curvature in this context). In a central projection with the window as the projection plane, all these verticals are parallel (remember the remarks about the behavior of bundles of parallel lines).
Here I am pointing the camera horizontally, so that its sensor is parallel to the window whose verticals are parallel to the building verticals. The convention is satisfied.
Trouble arises as soon as I point the camera up or down. Maybe I want to show less of the foreground and more of the buildings. After all, I am violating the compositional rule of thirds, if that’s the sort of thing about which you care. So I tilt the camera up:
I tilt the window together with the camera; the two continue to be aligned. This means that the sides of the window remain parallel to the sides of the image, but they are no longer parallel to the verticals in the world. There is nothing perspectivally wrong here. This is the way things look through a slanted window that is perpendicular to the raised line of sight. But it is not what we want. What we want is a view through a vertical window, like this:
The building verticals are once again aligned with the window verticals (the grid lines). But neither is parallel to the photo’s sides. What can be done?
First option: crop. To do that, we need a wider lens to begin with, one that allows us to include the building tops while the camera is level, like so:
Same camera position as before, focal length now 14mm instead of 24mm. We crop around the window.
In the old world of view cameras with flexible bellows, the cropping was done in camera. That’s what the fabled “perspective controls” of these cameras amount to. View camera lenses project image circles that are considerably larger than the film area. This gives the leeway to frame the image by sliding the front or back of the camera sideways and up-and-down, thereby moving the film holder across the image circle. 
This is the image circle of a lens designed for a 4" x 5" view camera projected on 8" x 10" film. We can slide a 4" x 5" frame around in it and pick the part of the image that we want. Tilt-shift lenses sold for digital cameras are more expensive tools to accomplish the same thing.
Second option: transform in software, then crop.
This is geometrically equivalent to the first option. Make sure to give yourself enough of a margin for cropping.
The “corrected” photos are still perspectivally sound. The only difference is that the viewpoint is no longer centered on the photo but has moved down with the horizon. (The viewpoint now lies on a plane that includes the horizon and forms a right angle with the image plane.)
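Here is a minimal Python sketch of why the two options agree, assuming an ideal pinhole camera. The focal length, tilt angle, and building coordinates are hypothetical, and the function names are mine. Image coordinates are in millimeters on the sensor, measured from its center.

```python
import math

def project_level(point, f):
    """Pinhole projection with a level camera (x right, y up, z forward)."""
    x, y, z = point
    return (f * x / z, f * y / z)

def project_tilted(point, f, pitch):
    """Pinhole projection with the camera pitched upward by `pitch` radians."""
    x, y, z = point
    yc = y * math.cos(pitch) - z * math.sin(pitch)
    zc = y * math.sin(pitch) + z * math.cos(pitch)
    return (f * x / zc, f * yc / zc)

def correct_keystone(image_point, f, pitch):
    """Re-project a point of the tilted image onto a vertical image plane."""
    u, v = image_point
    denom = f * math.cos(pitch) - v * math.sin(pitch)
    return (f * u / denom,
            f * (v * math.cos(pitch) + f * math.sin(pitch)) / denom)

F = 24.0                     # focal length in mm (hypothetical)
PITCH = math.radians(15.0)   # upward tilt (hypothetical)

# One vertical building edge: same x and z, increasing height y.
edge = [(-5.0, height, 30.0) for height in (0.0, 10.0, 20.0)]

tilted = [project_tilted(p, F, PITCH) for p in edge]
corrected = [correct_keystone(q, F, PITCH) for q in tilted]
level = [project_level(p, F) for p in edge]

# Where the center of the tilted frame ends up after the correction.
center_after = correct_keystone((0.0, 0.0), F, PITCH)

print("tilted x:   ", [round(u, 3) for u, _ in tilted])
print("corrected x:", [round(u, 3) for u, _ in corrected])
print("frame center now sits", round(center_after[1], 2), "mm above the horizon")
```

In the tilted image the x coordinate of the edge drifts with height, which is the familiar converging vertical. After the transformation it is constant and identical to what a level camera records. The old frame center lands above the horizon by the focal length times the tangent of the tilt angle, and that offset is the rise you would have dialed in on a view camera.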
After years of working with view cameras, I now prefer the transformation in software. The reason is that I can use the smaller image circles of less extreme wide-angle lenses to their full potential, instead of using the outer parts of the larger image circles of more extreme wide-angle lenses, where things tend to get quite dark and blurry. The images from most digital cameras have enough pixels to survive the transformation without much loss of definition. 
The straight-verticals convention works nicely with the preferred display mode for photographs: in tasteful frames on a vertical wall. The display ensures that real-world verticals and verticals in the photograph are parallel and that the image horizon is level. This makes it possible to view the photographic space as continuous with the real space occupied by the viewer. The same continuity would obtain if we hung perspectivally “uncorrected” photos at the appropriate angle, like skylight windows in a pitched roof. But I am not aware of anyone ever having done this.
There are some bits of practical advice for depicting space that can be drawn from the discussion.
1. Use the least extreme wide-angle lens you can.
2. When you are excited about a sense of space and want to photograph the scene, stop moving (to avoid motion parallax) and close one eye (to avoid stereoptic parallax). Then see if the remaining depth cues are powerful enough to work in a photo. 
3. Inspect the scene on your camera’s back screen. The screen being so small, your actual viewing distance is much greater than the perspectivally correct distance. This will accentuate any “distortion” problems towards the edges and corners.
4. If you want to be informative about a space, choose a projection point that is robust in the sense that changing it slightly doesn’t change the depth cues dramatically. This is violated in the illusions in the videos. On the other hand, a specific projection point may give a preferred 2D composition or juxtaposition.
Tonal Reproduction
You have no problem discerning the dominant shape in this picture, and it is not because of perspectival cues like lines and angles. Maybe you think this is easy because you see a familiar object. (By the way, this is a medical teaching model.)
But look, even Photoshop can figure it out. I asked Photoshop to make a 3D model from the original photo and then render it. I doubt that Photoshop has a shape library that contains buttocks. If it did, it shouldn’t make the mistake of turning highlights into protrusions. Much of the information that makes the 3D elaboration possible lies in the gradients of light and shade.
There are plenty of tutorials, in books and on the web, both for photographers and painters, that teach how the direction and quality of light affect the modeling of form. It's fun to experiment by taking a simple object like an egg and casting different kinds of light on it. (You want to play with a light gray or white object to better see the shadows and not be distracted by colors.) Harsh on-axis front lighting is the worst for modeling form; soft but still directional side or back lighting is the best. The tea-sieve figure in the cup showcases soft window light coming from the side and back and being reflected from the inside of the cup. The spatial information is all in the tonal gradients. I won't go into much detail here. I only want to talk briefly about one aspect that specifically concerns architectural spaces.
Architectural interiors are technically challenging to photograph because they present an enormous brightness or dynamic range, from the darkest shadow inside to the brightest highlight outside. This range outstrips the recording capacity of both traditional film and digital sensors: you can’t capture it all in one exposure. You have to make a decision. You can base your exposure on the interior light, but then the exterior is blown out.

Soviet fighter jet bunker in east Germany

Or you can base the exposure on the exterior, but then the interior will be too dark and shadows all blocked up.
Not good if what we aim for is something like this, which is much closer to what the place looks like:
A further complication is that architectural interiors are mostly painted in light colors, so they should look reasonably bright in a photograph. But equally often, they are not exposed to enough light and hence will come out looking murky.
In the olden days, solving these problems required some heavy lifting. In order to balance interior and exterior light levels and color temperature, architectural photographers (or their assistants) replaced incandescent lightbulbs with brighter and bluer ones, put colored gels on fluorescent tubes, covered the windows with dark transparent foil, and brought in studio lighting to brighten the interior. The results can look very slick, but often we sense the artifice.
Things have become easier with digital technology. We now can make a number of different exposures to capture a wide dynamic range and then stack them in software. This allows us to take the highlights from one frame and the shadows from another frame (so-called High Dynamic Range processing). One could do this in principle with film, but in practice it is impossible to line up the frames precisely enough because film always buckles a little.
Of course there is no free lunch. We are trying to squeeze a vast dynamic range into the much smaller range that can be displayed on a computer screen or printed on paper. For those of you with a feel for f-stops (remember: each f-stop amounts to a doubling or halving of the light level): an architectural interior often encompasses 15 stops from shadows to highlights. A print can hold about 5 stops. So we are talking about a compression by 10 stops or a factor of 2^10 = 1024. No matter how you do the compression, you will lose contrast in the process. And contrast is what models shape.
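The numbers are easy to verify in Python. The sketch below also shows what the compression does to local contrast if we assume the crudest possible global mapping, one that is linear in stops. Real tone curves are not that simple, so treat this only as an illustration of the order of magnitude.

```python
def stops_to_ratio(stops):
    """Brightness ratio spanned by a given number of f-stops."""
    return 2.0 ** stops

def compress_stop(scene_stop, scene_range=15.0, print_range=5.0):
    """Map a scene tone, given in stops above the darkest shadow, onto the
    print's range with a single global mapping that is linear in stops."""
    return scene_stop * print_range / scene_range

scene_ratio = stops_to_ratio(15)           # interior shadows to exterior highlights
print_ratio = stops_to_ratio(5)            # what a print can hold
compression = scene_ratio / print_ratio    # 10 stops

# Two neighboring scene tones that are one stop apart (a 2:1 ratio).
tone_a, tone_b = 7.0, 8.0
printed_gap_stops = compress_stop(tone_b) - compress_stop(tone_a)
printed_ratio = stops_to_ratio(printed_gap_stops)

print(f"scene range 1:{scene_ratio:.0f}, print range 1:{print_ratio:.0f}")
print(f"compression factor: {compression:.0f}")
print(f"a 2:1 step in the scene becomes {printed_ratio:.2f}:1 in the print")
```

Under this mapping a full stop of difference in the scene shrinks to a third of a stop in the print. That loss is why the image looks flat and why local adjustments become necessary.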
This kitchen provides a pedestrian example. Above is the best overall exposure straight from the camera. Shadows are okay, highlights lack detail or are entirely blown out: backsplash, grill, counter top, and pendants.
This is after HDR processing. We now have highlight detail, but the image looks very flat. I could have tried to jazz it up in the HDR software, but that tends to yield the kind of horrible result with which the web is flooded:
Besides the kitchen looking flat, the view through the screen doors is way too blue; it was pouring rain outside. So let’s adjust the white balance accordingly:
Bad! There is no setting that makes both exterior and interior color look acceptable. I had no choice but to adjust the color locally. This was especially tricky in the reflections off the chairs and floor and in the pastel paint in the alcove (which was of great importance to the architect). I also adjusted the overall contrast and colors and brightened up the ceiling. Never mind the details; the end result is this:
We finally have decent modeling of shapes by tonal gradients, and the colors look plausible. Coloring in Alberti’s window turned out to be a non-trivial exercise.
An interesting alternative to HDR processing is a technique that sometimes goes by the name "flambient", where the interior is photographed separately under ambient light and lit up with a powerful flash, and the exposures are cleverly blended in software. The ambient light provides the atmosphere, the flash gives neutral color information. The blending needs to be done very gingerly lest the flash impose an unnatural, almost clinical look on the interior, visible in a lot of real estate photography. The technique has another trick for recovering exterior detail. A flash overexposure of the interior is used as a mask for pasting in the "correctly" exposed exterior. This is very effective but can make the exterior look like a landscape photograph stuck in the window frame. One telltale sign of a poor flambient job is too close a match in brightness, contrast, and white balance between interior and exterior. 
Starting with Alberti’s thoughts about the “truthful” representation of space, I urged that we include tonal reproduction under the heading of perspective. Now we see that in order to make a photo look right, we need to deviate in subtle and not-so-subtle ways from straightforward, globally defined tonal mappings. Put bluntly, we need to cheat locally in order to approach a more global semblance of truth. It is this massaging of tones, together with the choice of a good vantage point, that makes shapes come alive in a flat picture.

Annunciation with St Emidius, 1486, by Carlo Crivelli. UK National Gallery

Painters have known this since at least the Renaissance. The interior in Carlo Crivelli’s Annunciation is almost as bright as the exterior, which wouldn’t be the case in reality. In fact, Crivelli is so concerned about enlivening local contrast that he gives up much of the more global modeling through light. Many of his contemporaries worked similarly, relying more on geometrical form than light for overall spatial impact.
Pieter Janssens, Interior with Painter, Reading Lady, and Sweeping Maid, ca. 1665 – 1670, Städel Museum, Frankfurt am Main, Germany. The reproduction exaggerates the luminosity a little; the original is more subdued.
Compare this treatment by Pieter Janssens nearly 200 years later. Janssens presents light in an almost photorealistic way, more than two centuries before color film. If you have ever tried to photograph such a scene, you know how extensively you need to rework local color and contrast to arrive at such a vivid result. It would be interesting to study the dramatic shadows for verisimilitude. Hint: the chair’s shadow on the back wall is wrong.
Classical painters are believed to have purposefully manipulated perspective not just with respect to color but also with respect to shape. The sphere at the bottom of Cranach’s Melancholy Allegory may be a case in point. Another example may be the apple and cucumber on the foreground ledge in Crivelli’s painting. They are certainly placed prominently enough to draw attention to the painter's skills. 

Viewpoint reconstruction

Comparison photograph with 18mm lens and apple in place

Comparison of the apples. Mine looks much more elongated than Crivelli's. Which one looks better?

When centered in the image, my apple resembles Crivelli’s more closely. Is this the look he preferred? Did he purposely distort his apple to make it look better, in the way in which I might paste a "corrected" version of the drum and drummer into an otherwise perspectivally sound photo?  
Distorted apples aside, Crivelli’s pickle is humungous. 