Home > CSc-387 Parallel Processing

# CSc-387 Parallel Processing

CSc-387 Parallel Processing

CHAPTER 11 : IMAGE PROCESSING ALGORITHMS
========================================

11.1. Low-level Image Processing
--------------------------------

- Gray scale: 0 (black) - 255 (white), pixels are represented in 8 bits
- Color images: (R)ed, (G)reen, (B)lue 8 bits each. 24 bits total
or pixel value may be an index to a look-up table which stores colors
- assume that the stored image uses a coordinate system as shown in:

(* T Figure 11.1 *)

- Low-level image operations: embarrassingly parallel
e.g. noise reduction, edge detection, object labelling, template matching

Computational Requirements: 1024x1024 pixels ==> 2^20 (~10^6) = 1 MBytes
--------------------------
suppose each pixel must be operated once, using a computer with
speed = 10^{-8} sec/oper, it takes 10 ms to process 2^20 pixels.
In real time, we need to process 60-85 frames/second (1 frame in 12-16 ms).
Typically, many high-complexity operations must be performed on the
image. Therefore, computational requirements are very high and often
special purpose hardware is used.

POINT PROCESSING:
-----------------
embarrassingly parallel.

THRESHOLDING: if (Xi < threshold) Xi=0; else Xi=1

CONTRAST STRETCHING: map the interval (x_l, x_h) =====> (X_L, X_H)

X_L ............................ X_H
\ \ /
\ \ /
\ \ /
\ \ /
x_l ....x_i.... x_h

x_i = (x_i - x_l)*[(X_H - X_L)/(x_h - x_l)] + X_L

GREY LEVEL REDUCTION: reduce the number of bits to represent the pixels
in order to reduce storage requirements

11.3 HISTOGRAM
============== a global operation (all pixels are used)
mainly used before a thresholding operation

(* T Figure 11.2 *)

for(i=0; i < height_max; i++)
for(j=0; j < width_max; j++) hist[p[i][j]]++ ;

To do it in parallel, processors locally update their histogram arrays
and then perform a Vector Global Sum (global sum for each vector entry).
or send it to a master processor for a global update.

11.4 SMOOTHING, SHARPENING, and NOISE REDUCTION
===============================================

- local operations which require the value of neighboring pixels

- Smoothing suppresses large fluctuations in intensity by reducing the
high frequency content

- Sharpening accentuates the transitions, enhancing the detail. Two ways:
reduce low-freq. content or accentuate changes through differentiation

- Noise reduction: suppresses noise through averaging and median filtering

Smoothing:
----------
new value for x4 (x4') is obtained by taking the average of 9 pixels
in the 3x3 region around x4:
(* T Figure 11.3 *)

new value for x4 can also be obtained by finding the MEDIAN of the
pixels. This obviously requires sorting 9 values. Efficient
approximate sorting strategies (i.e. shearsort) can be employed.

Parallel Code: suppose one processor is assigned for each pixel.
-------------- Then, each PE perfroms 4 communications with
its L, R, U, and D neighbors to get neighboring pixel values:

(* T Figure 11.4 *) (* T Figure 11.5 *)

After the values are obtained, new pixel values can be computed.
This is an embarrassingly parallel application and it is suitable
for synchronous computation in lock-step mode (SIMD type).

--------------
Operations described so far can be generalized in the form of a
"weighed mask operation" as shown in:

(* T Figure 11.7 *)

For example, mask for computing the mean is: (* T Figure 11.8 *)
Mask for noise reduction: (* T Figure 11.9 *)
A High-pass sharpening filter mask: (* T Figure 11.10 *)

===================
11.5 EDGE DETECTION
===================

Edge Detection using differentiation: (* T Figure 11.11 *)

Two important parameters to determine: (* T Figure 11.12 *)

Gradient (magnitude) = Df = SQRT[(df/dx)^2 + (df/dy)^2]

Gradient Direction = Theta(x,y) = Tan^{-1}[(df/dy) / (df/dx)]

- Gradient (magnitude) can be approximated to |df/dx| + |df/dy|

=====================
For discrete functions, the derivatives are approximated by
differences:
df/dx = (x5 - x3) df/dy = (x7 - x1)

Gradient = Df = |x5 - x3| + |x7 - x1|

Prewitt Operator:
-----------------
Better results can be obtained by using more pixel values.
For example:
(* T Figure 11.13 *)

df/dy = (x6 - x0) + (x7 - x1) + (x8 - x2)
df/dx = (x2 - x0) + (x5 - x3) + (x8 - x6)

Df = |df/dy| + |df/dx|

Sobel Operator:
--------------- (* T Figure 11.14 *)

df/dy = (x6 + 2x7 + x8) - (x0 + 2x1 + x2)
df/dx = (x2 + 2x5 + x8) - (x0 + 2x3 + x6)

Edge Detection Example (with Sobel operator): (* T Figure 11.15 *)

Laplace Operator: second-order derivative which can be approximated to:
-----------------
(* T Figure 11.16 *) (* T Figure 11.17 *)

D^2 f = 4.x4 - (x1+x3+x5+x7)

Notice that this computation will yield to zero for a constant change
in gray level across the mask.

- Usually, a threshold is applied after the edge operators to get
either black or white pixels:

(* T Figure 11.18 *) shows the result of Laplace Operator.

11.6 THE HOUGH TRANSFORM
========================

- Edge detection described so far identifies the pixels which might be
on an edge, but it does not tell much about how they connect to each
other to make an edge. Hough transform tries to do exactly that.

Line equation: y = ax + b b = -xa + y

- A single pixel at (x1, y1) has an infinite number of lines that
could pass through it.
- If a and b are discretized, a finite set of lines will exist.

(* T Figure 11.19 *)

- There will be a common point (S, T) in the parameter space (a-b)
into which all the points that lie on a specific line in the x-y
space will map. These pixels may also map onto other points in (a-b)
space, but (S, T) will get the most hits.
If we keep a count of these hits, then those (a, b) values with most
points mapped onto will be the best candidates for edges in the image.

- However, in order to make this method to work with
vertical/horizontal lines, we need to rewrite the line
equation in polar coordinates:

y = ax + b ============> (polar) r = xcos(A) + ysin(A)

where r is the perpendicular distance from the origin to the line,
and (A) is the angle between r and x-axis.

(* T Figure 11.20 *)

Implementation:
---------------

- r is always positive, (A) is between 0 - 360 degrees
- The parameter space is divided into small rectangular regions.
r is divided into increments of 5 and (A) increments of 10 degrees.
An accumulator is provided for each rectangular region:

(* T Figure 11.22 *)

- Accumulators of those regions that a pixel maps into are incremented.
Therefore, the computation time will depend upon the number of regions.
- If there are n pixels and k discrete values of (A) are tried,
then complexity is O(kn)
- To reduce the computational complexity, we might make k smaller.
If only the Gradient Angle at that pixel is used, then k=1.
In this case, time complexity for n pixels is O(n).

SEQUENTIAL CODE
---------------
for(x=0; x < Xmax; x++)
for(y=0; y < Ymax; y++) {
sobel(x, y, dx, dy);
if(magnitude > treshold) {
Theta = grad_dir(dx, dy); /* atan2() function */
Theta = Theta_quantize(Theta);
r = x * cos(Theta) + y * sin(Theta);
r = r_quantize(r);

acc[r][Theta]++; /* increment accumulator */

append(r, Theta, x, y); /* append point to line */
}

}

- In the end, the most likely lines are those with the most pixels,
and accumulators with local maxima are chosen.

PARALLEL CODE
-------------
Given
P: number of processors N*N : Image size
M1: number of discrete intervals used for r
M2: number of discrete intervals used for Theta

If [ (M1 > M2) and (P < N) and (P < M1) ]
Then a PARALLEL ALGORITHM with time complexity O(N^2/P + M1*M2*logM1)
can be developed to perform Hough Transform on the given image:

- Computation for each accumulator is independent of the others.
Image could be sliced into strips and assigned onto processors.
Each PE locally updates the accumulators corresponding to the pixels
in its own region. When done, accumulators must be merged.
Creative solutions could be adapted for the final merge step.

- One simple solution is that a Master PE could collect all the
results and add them up to form a Global Accumulator Matrix.

- Or PEs may use a Binary Tree approach for merging the local
matrices. This takes O(M1*M2*logP) where M1 and M2 are the number
of discrete intervals used for r and (Theta) respectively.
Hence M1*M2 is the size of the Accumulator Matrix.
After the matrices are merged, entries must be analyzed to find the
local maximas. This can also be done through sorting which takes
O(M1*M2*logM1*M2) = O(M1*M2*(logM1 + logM2))

- How about a special mergeANDsort operation in which processors try
to sort their accumulator values? Whenever they come across the
same accumulator index, they reduce them into one value, and
participate in the sort as one value instead of two separate values.
- Indeed, this can be a nice term project for those who are interested!

11.7.2 DISCRETE FOURIER TRANSFORM (DFT)
=======================================
Given N real discrete values x0, x1, ..., x_(N-1), DFT is computed as:

X_k = 1/N * SUM_{j=0}^{N-1} xj * e^{-2*PI*ijk/N} for k = 0,1,...N-1

Inverse DFT is:

x_k = SUM_{j=0}^{N-1} Xj * e^{2*PI*ijk/N} for k = 0,1,...N-1

- X_k s are complex values whereas xk's are real.
- Note: by definition, e^{iA} = cosA + i*sinA
- Sometimes the DFT is given without the scale factor 1/N

ONE APPLICATION: Earlier, Frequency Filtering was achieved with
weighted masks, which can be described by the CONVOLUTION operation:

h(j, k) = g(j, k) * f(j, k)

where g(j, k) describes the weighted mask and f(j, k) describes the image.

The CONVOLUTION of two functions can be obtained by taking the Fourier
transforms of each function (H(j,k), G(j,k), F(j,k)) and multiplying them:

H(j, k) = G(j, k) * F(j, k) (* T Figure 11.24 *)

- Doing it this way does not reduce the time complexity but working in
Frequency domain has other advantages, e.g. eliminating Very High
and Very Low frequencies, etc.

IMPLEMENTATION of DFT:
======================
- Sequential Time Complexity for computing DFT of N values: O(N^2)

w = e^{-2*PI*i/N};
for (k=0; k < N; k++) /* for every point */
X[k] = 0;
for (j=0; j < N; j++) /* compute summation */
X[k] = X[k} + w^{j*k} * x[j] ;

- Note that the summation step requires complex number arithmetic.
Since w^{(j-1)*k}*w^k = w^{j*k}, the code could be rewritten as:

w = e^{-2*PI*i/N};
for (k=0; k < N; k++) { /* for every point */
X[k] = 0; a = 1;
for (j=0; j < N; j++) { /* compute summation */
X[k] = X[k} + a * x[j] ;
a = a * w^k ;
}
}

PARALLEL IMPLEMENTATION of DFT:
--------------------------------
- N DFTs and N processors. Each PE computes one DFT. Pi computes Xi
- Each PE needs a copy of N discrete values x0, x1, ..., x_(N-1)
- Therefore, Tpar = O(N) with N processors
- If P << N, then each PE computes N/P DFTs.

FAST FOURIER TRANSFORM (FFT)
============================
- FFT is a fast method of obtaining DFT
- has a time complexity of O(NlogN) instead of O(N^2)
- In the following, it will be assumed that N is a power of 2.

- It can be shown that, DFT for Xk can be rewritten as:

Xk = 1/2 [ X_even + w^k * X_odd ] for k = 0,1, ..., N-1

X_{k+N/2} = 1/2 [ X_even + w^{k+N/2} * X_odd ]
= 1/2 [ X_even - w^k * X_odd ]
(because w^{N/2} = -1)

where X_even is the N/2 point DFT of the numbers with even indices
(x0, x2, x4, .....) and X_odd is the N/2 point DFT of the numbers
with odd indices (x1, x3, x5, .....).

(* T Figure 11.28 *)

- Using the above formula in a recursive manner, we can further
represent X_even and X_odd in terms of 2 summations each.
This yields a recursive divide-and-conquer approach to computing FFT:

(* T Figure 11.29 *)
(* T Figure 11.30 *)
(* T Figure 11.31 *)

SEQUENTAL IMPLEMENTATION of FFT:
--------------------------------
There are logN phases in the computation of FFT (see Figure 11.31)
each phase taking O(N) time. Therefore,

Tseq = O(NlogN)

PARALLEL IMPLEMENTATION of FFT:
--------------------------------
- N FFTs and N processors. Each PE computes one FFT.
- Pi computes Xi
- Unlike the DFT algorithm, PEs do not need to access all of the N
discrete values x0, x1, ..., x_(N-1) at once. At each step PEs
need only two values to compute (one stored locally another one
received from the neighbor processor which differs in only one bit.

- Processors run a Binary Exchange Algorithm in which they interact
(send/receive) with those processors who differ in exacly one
bit. At each phase the differing bit changes moving from left to
right.

If P=N then Tpar = O(logN) (* Linear speedup *)

- If P << N, then each PE computes N/P FFTs. After logP steps, they
can find the data they need locally. In this case,

Tpar = O(NlogN/P) (* Linear speedup *)

Search more related documents:CSc-387 Parallel Processing