
Friday, November 16, 2007

Strong May

Man... this is crappy... I should really reflect on myself. My techniques are so bad.

May

An elegant girl... if it ever looks like one.

Thursday, November 01, 2007

Preamble to the Computer Vision Problem

It's perhaps too immature to think that there's an easy way to solve the computer vision problem, i.e. creating a vision system that analyzes a scene the way human eyes and the visual cortex do, and doing so at a computationally affordable level. What kinds of techniques do we have, and what are their limitations? Although computer science is a relatively nascent field, with artificial intelligence younger still and vision systems younger yet, decades have passed since the foundations of these fields were laid, and we still have not progressed to a technologically satisfactory point. For starters, here are the tools at our disposal (a small sketch of a few of them chained together follows the list).

* Differencing
* Thresholding
* Connected Components Analysis
* Optical Flow
* Gaussian/Mean Blur (For noise reduction)
* Shape recognition
* Edge detection
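
Just to make a few of these concrete, here is a minimal sketch of chaining blurring, differencing, thresholding, and connected components analysis, assuming OpenCV (cv2) is available; the file names, blur kernel, and threshold value are placeholders of my own choosing, not anything prescribed.

```python
# Minimal sketch: blur -> difference -> threshold -> connected components.
# File names and constants are illustrative assumptions only.
import cv2

prev = cv2.imread("frame_prev.png", cv2.IMREAD_GRAYSCALE)
curr = cv2.imread("frame_curr.png", cv2.IMREAD_GRAYSCALE)

# Gaussian blur for noise reduction before comparing frames.
prev_blur = cv2.GaussianBlur(prev, (5, 5), 0)
curr_blur = cv2.GaussianBlur(curr, (5, 5), 0)

# Differencing: pixels that changed between the two frames.
diff = cv2.absdiff(curr_blur, prev_blur)

# Thresholding: keep only changes above a fixed intensity.
_, mask = cv2.threshold(diff, 25, 255, cv2.THRESH_BINARY)

# Connected components analysis: label each contiguous blob of change.
num_labels, labels, stats, centroids = cv2.connectedComponentsWithStats(mask)

# Report every blob except label 0, which is the background.
for i in range(1, num_labels):
    x, y, w, h, area = stats[i]
    print(f"blob {i}: bbox=({x},{y},{w},{h}) area={area} centroid={centroids[i]}")
```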

There are more, but we need not concern ourselves with the other, more sophisticated ones, since they are very likely to be computationally expensive. Most computing systems are discrete in nature; this may be an advantage or a disadvantage, depending on your point of view. Analog vision systems, if they ever exist, would very likely be incomprehensibly sophisticated and hard to quantify (think of how you would represent an image abstractly, and how you would translate that into a physical representation). With discrete systems, we can build an image out of pixels, basic individual units upon which we compute. While this simplifies the math, it also means we are severely incapacitated, in that we can only look at one pixel at a time, so to speak. With enormous processing speed and parallel processing, though, this limitation may be masked well enough to fool our perception.

So I talk about an easy way to solve the computer vision problem. No, don't get excited yet. I do not promise anything; however, it would be good to list out the kinds of ideas I have, no matter how puerile they may seem. Before we move on to that, let's talk a little bit about the visual cortex and its various regions.

So basically, the visual cortex is separated into regions which we call V1, V2, V3, V4, and V5. Each of these regions is responsible for processing various aspects of the incoming visual stimulus. Such processing includes, but is not limited to, motion, shape, depth, edges, color, and orientation, up to more complicated things like object recognition, facial recognition, and spatiotemporal features like speed and direction.

Let's not be ambitious here; after all, complex Fourier transforms of high-bandwidth signals and whatnot on primitive computer chips are, as the name suggests, complex and complicated, not to mention impractical. We aim to produce a relatively simple processing system that is a concoction of several of the filters mentioned earlier, coupled with simplifying assumptions about the scenery perceived by the system. My belief is that there is a simple way to approximate primate visual processing capabilities very closely, or at least well enough for practical purposes, without resorting to insane mathematical models.

First off, we need to list the kinds of information we can obtain from a stream of visual stimuli. What are they? Patterns. Shapes. Edges. Motion. Orientation. Direction. Depth. Texture. Color. Some of these sit higher in a hierarchy of signal processing, with each layer receiving information from the layers below and integrating it, effectively compressing it into less data while keeping only the pertinent visual information. Also, research evidence shows that the processing is not necessarily one-way: information from the top also affects the bottom. We may see this as some kind of feedback loop.
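
To make the layered idea a little more tangible, here is a toy sketch of a bottom-up hierarchy with a crude top-down feedback channel. The stage names, the stubbed-out bodies, and the "expectation" passed back down are all my own invention for illustration, not a real architecture.

```python
# Toy sketch of a bottom-up feature hierarchy with a top-down feedback channel.
# Stage names and data are invented; each real stage would do actual filtering.

def extract_edges(pixels, expectation=None):
    # Bottom layer: raw pixels in, edge map out (stubbed pass-through here).
    return {"edges": pixels, "hint": expectation}

def extract_shapes(edge_map, expectation=None):
    # Middle layer: group edges into candidate shapes (stubbed).
    return {"shapes": ["blob"], "hint": expectation}

def recognize_objects(shape_map):
    # Top layer: integrate shapes into object hypotheses (stubbed) and emit
    # an expectation the lower layers can use on the next frame.
    return {"objects": ["unknown object"]}, {"expect": "same object nearby"}

feedback = None
for frame in [[0, 1, 0], [0, 1, 1]]:          # stand-in "frames"
    edges = extract_edges(frame, feedback)
    shapes = extract_shapes(edges["edges"], feedback)
    objects, feedback = recognize_objects(shapes)
    print(objects, "| feedback for next frame:", feedback)
```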

Since the visual cortex is highly modularized, I believe we should do the same. Take the motion processing out, and the subject sees only discontinuous images without fluid motion. Take the shape processing out, and the subject wouldn't be able to tell a sphere from a cube. So it seems there are many modules here, and we can't afford to lose even one of them. But the average computer can only do so much. What are our options here?

Perhaps we should review our priorities. Let's think for a second about what we intend to get out of this system. I think it suffices for a system to be able to differentiate objects and track them reliably. Also, as the object's visual stimulus transforms (whether due to lighting, shadow, motion, or rotation), the inner representation of the object must not change; i.e. we should know it is the same blob we saw even when it looks different. So, obviously, we need to be able to perceive motion and individual objects. What kinds of simple processing do we need here?
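
The "same blob as before" requirement can be approximated very cheaply. Here is a minimal sketch that keeps object identity across frames by matching each blob to the nearest centroid from the previous frame; the blob format, IDs, and distance threshold are assumptions made purely for illustration.

```python
# Minimal sketch: keep object identity by nearest-centroid matching per frame.
import math

def match_blobs(prev_blobs, curr_blobs, max_dist=40.0):
    """prev_blobs/curr_blobs: dicts of id -> (x, y) centroid."""
    assignments = {}
    used = set()
    for cid, (cx, cy) in curr_blobs.items():
        best_id, best_d = None, max_dist
        for pid, (px, py) in prev_blobs.items():
            if pid in used:
                continue
            d = math.hypot(cx - px, cy - py)
            if d < best_d:
                best_id, best_d = pid, d
        if best_id is not None:
            assignments[cid] = best_id   # same object as last frame
            used.add(best_id)
        else:
            assignments[cid] = None      # treat as a newly appeared object
    return assignments

prev = {1: (100.0, 80.0), 2: (200.0, 150.0)}
curr = {"a": (104.0, 83.0), "b": (320.0, 40.0)}
print(match_blobs(prev, curr))   # {'a': 1, 'b': None}
```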

Let's focus on recognizing individual objects first. How do we differentiate one object from another? Or rather, how do we classify a blob of pixels as a particular distinct object? Usually, we make the distinction based on edges, textures, and colors. Once we identify a particular object, despite the changes it goes through in the next moment, we should still be able to identify it and the pixels associated with it.

Edges and textures seem to be the more important ones here, since a color-blind person can still perceive individual objects. An edge defines the boundary within which an object resides, and texture identifies that object and separates it from other objects. There's also the issue of overlapping objects. Nevertheless, let's not get ahead of ourselves. We should review the most basic cases first.
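
As a rough sketch of these two cues, edges can come from a standard detector and texture from something as crude as the local standard deviation of intensity. The code below assumes OpenCV and NumPy; the file name, window size, and thresholds are arbitrary illustration values, and "Canny" is simply one common edge detector standing in for whichever one we end up using.

```python
# Sketch: edges as boundaries, local standard deviation as a crude texture cue.
import cv2
import numpy as np

img = cv2.imread("scene.png", cv2.IMREAD_GRAYSCALE)   # placeholder file name

# Edge cue: Canny gives candidate object boundaries.
edges = cv2.Canny(img, 50, 150)

# Texture cue: standard deviation in a small neighbourhood around each pixel.
img_f = img.astype(np.float32)
mean = cv2.blur(img_f, (9, 9))
mean_sq = cv2.blur(img_f * img_f, (9, 9))
texture = np.sqrt(np.maximum(mean_sq - mean * mean, 0))

# Very rough grouping: pixels that are texturally "busy" and not sitting on an
# edge could be treated as candidate object interiors.
interior = (texture > 10) & (edges == 0)
print("candidate interior pixels:", int(interior.sum()))
```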

Finding edges is simple, since we have good algorithms for it. What's more important is, given a simple high-contrast image of shape outlines, how do we make out which line belongs to which shape, and exactly what shapes can we infer from it? Psychology has provided us with some good ideas. What we intend to do here is grouping, essentially bringing order and form to the sensations we perceive. They are grouped using several rules. Proximity. Similarity. Continuity. Connectedness. Closure. There may be more, but we'll focus on these.
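
To show what even the simplest of these rules might look like in code, here is a tiny sketch of the proximity rule: line segments whose endpoints come close to one another get merged into the same tentative group, using a plain union-find. The segments and the distance threshold are made up for illustration.

```python
# Tiny sketch of the Gestalt proximity rule over line segments via union-find.
import math

def endpoint_distance(s1, s2):
    """Smallest distance between any endpoint of s1 and any endpoint of s2."""
    return min(math.hypot(p[0] - q[0], p[1] - q[1]) for p in s1 for q in s2)

def group_by_proximity(segments, threshold=5.0):
    parent = list(range(len(segments)))   # union-find over segment indices

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(segments)):
        for j in range(i + 1, len(segments)):
            if endpoint_distance(segments[i], segments[j]) < threshold:
                parent[find(i)] = find(j)

    groups = {}
    for i in range(len(segments)):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())

# Each segment is a pair of (x, y) endpoints.
segments = [((0, 0), (10, 0)), ((11, 1), (11, 10)), ((50, 50), (60, 50))]
print(group_by_proximity(segments))   # [[0, 1], [2]]
```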

So we have an image outline, with only lines and nothing else. This can be obtained with a Sobel filter followed by pruning/thinning. Some of the interesting features are curves/lines, joints, crosses, separation points, angles, closed loops, etc. How do we find these, and how do we encode them into usable units? There's also the issue of very messy interconnected lines, like a web, which can admit many different interpretations. Detecting this many features definitely requires a lot of processing. Perhaps it'd be wise to cut down on the number of features we want to detect. What are the minimum features that can still reliably describe an image? Let's discuss these features one by one.
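
Before going feature by feature, here is a sketch of the Sobel-then-thin step, plus detection of two of the simplest features: line endpoints and joints, read straight off the thinned skeleton by counting each pixel's neighbours. It assumes scikit-image and SciPy are available, and the file name and edge threshold are arbitrary illustration values.

```python
# Sketch: Sobel magnitude -> threshold -> thinning -> endpoints and joints.
import numpy as np
from scipy.ndimage import convolve
from skimage.filters import sobel
from skimage.io import imread
from skimage.morphology import skeletonize

img = imread("outline_test.png", as_gray=True)   # placeholder file name

# Sobel magnitude, then a fixed threshold to get a binary edge map.
edges = sobel(img) > 0.1

# Thinning: reduce the edge map to one-pixel-wide lines.
skeleton = skeletonize(edges)

# Count the 8-connected neighbours of every skeleton pixel.
kernel = np.array([[1, 1, 1],
                   [1, 0, 1],
                   [1, 1, 1]])
neighbours = convolve(skeleton.astype(np.uint8), kernel, mode="constant")

# On a one-pixel-wide skeleton: 1 neighbour = line endpoint, 3+ = joint/cross.
endpoints = skeleton & (neighbours == 1)
joints = skeleton & (neighbours >= 3)
print("endpoints:", int(endpoints.sum()), "joints:", int(joints.sum()))
```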

TO BE CONTINUED