Tuesday, July 17, 2007

Nutch

I have been trying to run Nutch on my own laptop. I faced many problems just getting it to run, mainly due to my unfamiliarity with Unix and Cygwin.

Problem 1: JAVA_HOME not set

Go to the Windows System -> Environment settings and add a new variable 'JAVA_HOME' whose value is the path to the Java folder. Restart Cygwin and the change will be detected.

Problem 2: NoClassDefFoundError

The problem was the space in the folder name. I renamed the folder from 'Nutch 0.9' to 'Nutch-0.9'.
Also, in the environment settings, add a variable CLASSPATH with the path to the jars in the Nutch folder.
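
A quick way to check both settings from Java itself is sketched below. This is only a diagnostic under my own assumptions: the class name EnvCheck and the method entriesWithSpaces are made up for illustration and are not part of Nutch. It prints JAVA_HOME as seen by the JVM and flags any classpath entry that contains a space.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch.
class EnvCheck {
    // Returns the classpath entries that contain a space character.
    static List<String> entriesWithSpaces(String classpath, String separator) {
        List<String> result = new ArrayList<>();
        if (classpath == null || classpath.isEmpty()) {
            return result;
        }
        for (String entry : classpath.split(Pattern.quote(separator))) {
            if (entry.contains(" ")) {
                result.add(entry);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String javaHome = System.getenv("JAVA_HOME");
        System.out.println("JAVA_HOME = " + (javaHome == null ? "(not set)" : javaHome));
        String cp = System.getProperty("java.class.path");
        for (String entry : entriesWithSpaces(cp, File.pathSeparator)) {
            System.out.println("Classpath entry contains a space: " + entry);
        }
    }
}
```

Run it from the same Cygwin shell that is used to start Nutch, so that it sees the same environment.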


I Googled for an entire day just to fix this. This post is mainly to remind myself how to fix the problem should I run into it again.

Monday, June 25, 2007

Update - 250607

- Completed block segmentation to separate individual blocks from the slides, notably to extract visual elements such as graphs, charts and tables (see Fig. 1).

- By segmenting the original image instead of the extracted slides, the problems of blurry and incomplete images can be solved (see Fig. 1).

- Found a program, pdf2Text, to extract all the text from a PDF slide set into text files. Using that, the next step is to decide which keywords should be linked to a particular visual element.

- Started reading up on how Google carries out queries and the data structures used to store the index.

TO DO:
- Which keywords should be linked to a visual element in a slide?
- Identify what kind of images the segmented images are: text, tables, charts, pictures etc.

Fig. 1: The segmented visual elements (example) cropped from the original slide to prevent any loss of quality.

Thursday, June 14, 2007

Update - 15/6/07

Updates

- Successfully separated the background image from a series of slides (see Fig. 1)
- Extraction of foreground images from the slides (see Fig. 2)
- Detected some memory leaks which caused the program to be slow

- The extracted background may contain missing or distorted pixels, which may make some foreground images unreadable.
- Tried to smooth the extracted images by:
a) Taking the dominant colour of the surrounding pixels
b) Taking the average RGB values of the surrounding pixels
- Results were unsatisfactory
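
For reference, option (b) can be sketched roughly as follows. This is not my actual code, just a minimal illustration with plain BufferedImage. It assumes the missing pixels are already marked in a boolean mask (indexed [y][x]); the class name NeighbourAverage and the mask are made up for the example.

```java
import java.awt.image.BufferedImage;

// Hypothetical sketch of smoothing option (b): average of surrounding pixels.
class NeighbourAverage {
    // Replaces each pixel flagged in `missing` with the average RGB of its
    // non-missing 8-neighbours; pixels with no valid neighbour are left as-is.
    static BufferedImage fill(BufferedImage src, boolean[][] missing) {
        int w = src.getWidth(), h = src.getHeight();
        BufferedImage out = new BufferedImage(w, h, BufferedImage.TYPE_INT_RGB);
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                int rgb = src.getRGB(x, y);
                if (missing[y][x]) {
                    int r = 0, g = 0, b = 0, n = 0;
                    for (int dy = -1; dy <= 1; dy++) {
                        for (int dx = -1; dx <= 1; dx++) {
                            int nx = x + dx, ny = y + dy;
                            if (dx == 0 && dy == 0) continue;
                            if (nx < 0 || ny < 0 || nx >= w || ny >= h) continue;
                            if (missing[ny][nx]) continue;
                            int p = src.getRGB(nx, ny);
                            r += (p >> 16) & 0xFF;
                            g += (p >> 8) & 0xFF;
                            b += p & 0xFF;
                            n++;
                        }
                    }
                    if (n > 0) {
                        rgb = ((r / n) << 16) | ((g / n) << 8) | (b / n);
                    }
                }
                out.setRGB(x, y, rgb);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Tiny synthetic demo: a 3x3 grey image with a damaged centre pixel.
        BufferedImage img = new BufferedImage(3, 3, BufferedImage.TYPE_INT_RGB);
        for (int y = 0; y < 3; y++)
            for (int x = 0; x < 3; x++)
                img.setRGB(x, y, 0x808080);
        img.setRGB(1, 1, 0x000000);
        boolean[][] missing = new boolean[3][3];
        missing[1][1] = true;
        BufferedImage fixed = fill(img, missing);
        System.out.println(Integer.toHexString(fixed.getRGB(1, 1) & 0xFFFFFF));
    }
}
```

Averaging tends to blur edges and mix unrelated colours, which may be part of why the results did not look good.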

- Started on block segmentation
- Successfully implemented code to push all the pixels to the left (see Fig. 3) and to the bottom to form the histograms needed for segmentation
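
The "push the pixels" idea amounts to counting foreground pixels per row and per column. A rough sketch of it is below, not my exact code. It assumes a pixel counts as foreground when it differs from a single known background colour, which is a simplification; the class and method names are made up for the example.

```java
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import java.util.Arrays;
import javax.imageio.ImageIO;

// Hypothetical sketch of the row/column histograms used for segmentation.
class ProjectionProfile {
    // Assumption: foreground means "differs from the given background colour".
    static boolean isForeground(BufferedImage img, int x, int y, int backgroundRgb) {
        return (img.getRGB(x, y) & 0xFFFFFF) != (backgroundRgb & 0xFFFFFF);
    }

    // Foreground pixels in each row: the bar lengths after pushing left.
    static int[] rowCounts(BufferedImage img, int backgroundRgb) {
        int[] counts = new int[img.getHeight()];
        for (int y = 0; y < img.getHeight(); y++)
            for (int x = 0; x < img.getWidth(); x++)
                if (isForeground(img, x, y, backgroundRgb)) counts[y]++;
        return counts;
    }

    // Foreground pixels in each column: the bar heights after pushing down.
    static int[] columnCounts(BufferedImage img, int backgroundRgb) {
        int[] counts = new int[img.getWidth()];
        for (int y = 0; y < img.getHeight(); y++)
            for (int x = 0; x < img.getWidth(); x++)
                if (isForeground(img, x, y, backgroundRgb)) counts[x]++;
        return counts;
    }

    // Draws the pushed-left picture: row y gets rowCounts[y] ink pixels from x = 0.
    static BufferedImage pushLeft(BufferedImage img, int backgroundRgb, int inkRgb) {
        int[] rows = rowCounts(img, backgroundRgb);
        BufferedImage out = new BufferedImage(img.getWidth(), img.getHeight(),
                BufferedImage.TYPE_INT_RGB);
        for (int y = 0; y < img.getHeight(); y++)
            for (int x = 0; x < img.getWidth(); x++)
                out.setRGB(x, y, x < rows[y] ? inkRgb : backgroundRgb);
        return out;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java ProjectionProfile <image>");
            return;
        }
        BufferedImage img = ImageIO.read(new File(args[0]));
        if (img == null) throw new IOException("Not a readable image: " + args[0]);
        int bg = img.getRGB(0, 0); // assumption: top-left pixel is background
        System.out.println(Arrays.toString(rowCounts(img, bg)));
    }
}
```

Runs of zeros in either histogram are empty rows or columns, and those gaps are the natural places to cut the slide into blocks.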

TO DO:
- Refine the background / foreground extraction method to produce better quality output
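- Analyse the histograms produced in order to segment the images

Figure 1: Background image separated from a series of slides

Figure 2: Foreground images extracted from the slides

Figure 3: Pixels pushed to the left to form a histogram for segmentation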

Thursday, May 31, 2007

Update

Log of project (in point form for simplicity)

*Unable to find a suitable viewer or utility to make use of the .sep file that was generated with the gsdjvu library.

*Started work on writing a program to separate foreground/background.

*Began reading up on suitable libraries available.

*Tried the sixlegs PNG Java package -- found it unsuitable due to its lack of documentation.

*Found a suitable library -> Java Advanced Imaging (JAI)

*Started the implementation in Java, which includes:

- learning how to use JAI as well as some basics of image processing, since I have no prior experience in it.

- able to read image pixels from a png image and dump them to a text file

- able to read a folder of png images and dump the pixel values into their corresponding output text files

- ran into problems with heap size -> searched online for a suitable solution -> used the -Xmx parameter to solve it

- found out more about writing images by setting individual pixels' RGB values
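
The pixel-dumping step looks roughly like the sketch below. Note that this version uses the standard javax.imageio.ImageIO and BufferedImage classes instead of JAI, purely to keep the example self-contained, so it is not the JAI code I actually wrote. The class name PixelDump and the "x y r g b" output format are made up for illustration.

```java
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import java.io.PrintWriter;
import javax.imageio.ImageIO;

// Hypothetical sketch: dump PNG pixel values to text files (ImageIO, not JAI).
class PixelDump {
    // Writes one line per pixel in the form "x y r g b".
    static void dump(File png, File txt) throws IOException {
        BufferedImage img = ImageIO.read(png);
        if (img == null) {
            throw new IOException("Not a readable image: " + png);
        }
        try (PrintWriter out = new PrintWriter(txt, "UTF-8")) {
            for (int y = 0; y < img.getHeight(); y++) {
                for (int x = 0; x < img.getWidth(); x++) {
                    int rgb = img.getRGB(x, y);
                    int r = (rgb >> 16) & 0xFF;
                    int g = (rgb >> 8) & 0xFF;
                    int b = rgb & 0xFF;
                    out.println(x + " " + y + " " + r + " " + g + " " + b);
                }
            }
        }
    }

    // Dumps every .png in inDir to a .txt file of the same base name in outDir.
    static void dumpFolder(File inDir, File outDir) throws IOException {
        File[] files = inDir.listFiles((dir, name) -> name.toLowerCase().endsWith(".png"));
        if (files == null) {
            throw new IOException("Not a directory: " + inDir);
        }
        for (File f : files) {
            String base = f.getName().substring(0, f.getName().length() - 4);
            dump(f, new File(outDir, base + ".txt"));
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: java PixelDump <inputDir> <outputDir>");
            return;
        }
        File outDir = new File(args[1]);
        outDir.mkdirs();
        dumpFolder(new File(args[0]), outDir);
    }
}
```

If the heap runs out on large slides, the limit can be raised when starting the program, for example java -Xmx512m PixelDump slides out (the 512m value and folder names are only examples).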


TODO:

Start to implement a counter for the dominant pixel value across a set of slides, and use that pixel value to write to an image file, to see if an accurate background pixel can be obtained.
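
A first sketch of what this counter could look like is below. It is only a plan, not tested on real slides. It assumes all slides have the same dimensions and uses plain BufferedImage rather than JAI; the class name DominantBackground is made up for the example.

```java
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import javax.imageio.ImageIO;

// Hypothetical sketch: per-pixel dominant colour across a set of slides.
class DominantBackground {
    // For every (x, y), picks the RGB value that occurs most often across the
    // slides. Assumption: all slides have the same width and height.
    static BufferedImage estimate(List<BufferedImage> slides) {
        if (slides.isEmpty()) {
            throw new IllegalArgumentException("no slides given");
        }
        int w = slides.get(0).getWidth(), h = slides.get(0).getHeight();
        BufferedImage background = new BufferedImage(w, h, BufferedImage.TYPE_INT_RGB);
        Map<Integer, Integer> counts = new HashMap<>();
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                counts.clear();
                int best = 0, bestCount = 0;
                for (BufferedImage slide : slides) {
                    int rgb = slide.getRGB(x, y) & 0xFFFFFF;
                    int c = counts.merge(rgb, 1, Integer::sum);
                    if (c > bestCount) {
                        bestCount = c;
                        best = rgb;
                    }
                }
                background.setRGB(x, y, best);
            }
        }
        return background;
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            System.err.println("usage: java DominantBackground <out.png> <slide.png>...");
            return;
        }
        List<BufferedImage> slides = new ArrayList<>();
        for (int i = 1; i < args.length; i++) {
            BufferedImage img = ImageIO.read(new File(args[i]));
            if (img == null) throw new IOException("Not a readable image: " + args[i]);
            slides.add(img);
        }
        ImageIO.write(estimate(slides), "png", new File(args[0]));
    }
}
```

Exact colour matching is probably too strict for slides with compression noise, so the counter may need to group similar colours together.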

Wednesday, May 23, 2007

.sep files

Finally managed to compile the gsdjvu source. It took me more than a day to do so, first trying on Windows with Cygwin and then on Linux, which I am new to.

One of gsdjvu's methods is supposed to separate the foreground and background of a PS or PDF file. However, it generates its output as a .sep file, which I have not figured out how to read.

It seems to be a TIFF separation file or something like that. So far, all efforts to open it with existing viewers have failed.

Wednesday, May 16, 2007

DjVu Image Compression Format and Ruby on Rails

While reading up on foreground/background separation techniques, I came across the DjVu image compression format.

DjVu is able to store compressed images of documents as very small files without compromising readability. It first separates the foreground and the background of the document, and then compresses the background while keeping the text at high resolution. In addition, the foreground text can be OCR-ed as well.

This seems to be related to what I need for the first portion of the project and I shall see whether I can make use of this over the next few days.

There is also a Ruby on Rails meeting coming up later and I shall be attending it to see if it can be useful for the project. In any case, it is perhaps good to learn about it, though I currently have no knowledge of Ruby, let alone Rails.

I also need to decide on the software side: the programming languages to be used, libraries etc.

Thursday, May 10, 2007

Project Initialisation

The project for my honours year is called "Visual Slide Presentation Analysis". The main goal of the project is to be able to analyse and extract valuable visual information from slides as well as to segment and classify them.

The first portion of the project will involve the foreground and background separation of the slides, which I will be finding out more about over the next few weeks.

The following are the tasks that are lined up for the next few weeks as I begin the project.

1) Find out more about the PNG image format
2) Start learning a scripting language (either Perl or Ruby)
3) Find out more about existing techniques (if any) on foreground/background separation
4) Find out what users look out for in slides and whether most slides add any value or are just summarized versions of papers