Spreadsheet Toolkit

corpus.gobbler
Class GoogleHTTPGetSearch

java.lang.Object
  |
  +--corpus.gobbler.SearchMethod
        |
        +--corpus.gobbler.GoogleHTTPGetSearch

public class GoogleHTTPGetSearch
extends SearchMethod

Copyright: Copyright (c) 2002

Company: VUW:MCS


Field Summary
 
Fields inherited from class corpus.gobbler.SearchMethod
fileType, numPerPage, result, searchString
 
Constructor Summary
GoogleHTTPGetSearch(java.lang.String searchString, java.lang.String fileType, int numPerPage, Gobbler gobbler)
           
 
Method Summary
protected  int EstimateTotalResults()
           
protected  java.lang.String sendSocket(java.lang.String query, java.lang.String filetype, int numPerPage, int startnum, java.lang.String host, int port)
          Opens a socket connection to the host (Google) and sends an HTTP GET request containing the search string.
protected  SearchResult StartSearch(int startNumber)
          Performs an HTTP-based search.
protected  java.lang.String[] stripURLs(java.lang.String htmlpage, java.lang.String[] excludes)
          Parses the HTML and uses a regular expression to extract the http://....... URLs.
 
Methods inherited from class corpus.gobbler.SearchMethod
html, newEstimate, performSearch, status, urlCountTick
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

GoogleHTTPGetSearch

public GoogleHTTPGetSearch(java.lang.String searchString,
                           java.lang.String fileType,
                           int numPerPage,
                           Gobbler gobbler)
Method Detail

EstimateTotalResults

protected int EstimateTotalResults()
Specified by:
EstimateTotalResults in class SearchMethod

StartSearch

protected SearchResult StartSearch(int startNumber)
Performs an HTTP-based search.

Specified by:
StartSearch in class SearchMethod
Returns:
The updated array of results

sendSocket

protected java.lang.String sendSocket(java.lang.String query,
                                      java.lang.String filetype,
                                      int numPerPage,
                                      int startnum,
                                      java.lang.String host,
                                      int port)
Opens a socket connection to the host (Google) and sends an HTTP GET request containing the search string.

Parameters:
query - The query terms, separated by spaces.
filetype - The file type to ask for.
numPerPage - How many results to ask for on each page.
startnum - Ask for results starting from this point.
host - Should be www.google.com at this stage.
port - Should be 80.
Returns:
The resulting HTML.
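
The socket exchange described above can be sketched roughly as follows. This is a minimal illustration, not the class's actual implementation: the class name HttpGetSketch, the helper buildRequest, the "/search" path, the q/num/start parameter names, and the use of a "filetype:" query operator are all assumptions made for the example.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.Socket;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

class HttpGetSketch {

    // Builds the raw HTTP/1.0 request text.
    // Assumption: the "/search" path, the q/num/start parameters and the
    // "filetype:" operator are illustrative; the source does not state them.
    static String buildRequest(String query, String filetype, int numPerPage,
                               int startnum, String host) {
        String terms = query + " filetype:" + filetype;
        String path = "/search?q=" + URLEncoder.encode(terms, StandardCharsets.UTF_8)
                + "&num=" + numPerPage
                + "&start=" + startnum;
        return "GET " + path + " HTTP/1.0\r\n"
                + "Host: " + host + "\r\n"
                + "Connection: close\r\n"
                + "\r\n";
    }

    // Opens a socket to host:port, writes the GET request, and reads the
    // whole response (status line, headers and body) until the server closes.
    static String sendSocket(String query, String filetype, int numPerPage,
                             int startnum, String host, int port) throws IOException {
        String request = buildRequest(query, filetype, numPerPage, startnum, host);
        try (Socket socket = new Socket(host, port)) {
            OutputStream out = socket.getOutputStream();
            out.write(request.getBytes(StandardCharsets.US_ASCII));
            out.flush();

            InputStream in = socket.getInputStream();
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            byte[] chunk = new byte[4096];
            int n;
            while ((n = in.read(chunk)) != -1) {
                buffer.write(chunk, 0, n);
            }
            return new String(buffer.toByteArray(), StandardCharsets.UTF_8);
        }
    }
}
```

Keeping the request text in a separate helper lets the query encoding be checked without a network connection. A caller that only wants the HTML would still need to strip the response headers from the returned string.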

stripURLs

protected java.lang.String[] stripURLs(java.lang.String htmlpage,
                                       java.lang.String[] excludes)
Parses the HTML and uses a regular expression to extract the http://....... URLs. Google uses http://images.google.com, http://www.google.com and http://groups.google.com within its page. There are also references to http://www.dictionary.com that we don't require.

Parameters:
htmlpage - The source code of the web page to extract the URLs from.
excludes - Any URL that starts with one of these URLs will be excluded from the results.
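
The extraction and prefix-based exclusion described above might look something like the sketch below. The class name StripUrlsSketch, the helper isExcluded, and the regular expression itself are assumptions for illustration; the pattern actually used by this class is not documented here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class StripUrlsSketch {

    // Assumed pattern: "http://" followed by any run of characters that
    // are not whitespace, quotes or angle brackets.
    private static final Pattern URL = Pattern.compile("http://[^\\s\"'<>]+");

    // Returns every matched URL, in page order, that does not start with
    // one of the excluded prefixes.
    static String[] stripURLs(String htmlpage, String[] excludes) {
        List<String> urls = new ArrayList<>();
        Matcher m = URL.matcher(htmlpage);
        while (m.find()) {
            String url = m.group();
            if (!isExcluded(url, excludes)) {
                urls.add(url);
            }
        }
        return urls.toArray(new String[0]);
    }

    // True when the URL begins with any of the given prefixes.
    private static boolean isExcluded(String url, String[] excludes) {
        for (String prefix : excludes) {
            if (url.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }
}
```

With excludes set to the Google and dictionary.com addresses listed above, only links pointing at external result pages would remain in the returned array.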
