XINABSE

common.pl

This include file contains functions used by the spider.

getDBH

returns a new database handle

deprecated - use doSQL instead.

doSQL query

Takes an SQL statement as its argument and executes it. If no global database handle is available, one is created. Returns a two-dimensional array with the result set. For an INSERT statement, [0][0] holds the insert ID.

utf8_quote $string

quotes a string for the database and converts Perl 5.8+ Unicode strings back to Latin-1

returns quoted string

storeKeyword $keyword,$weight,$siteid

Inserts a keyword for a site into the database

returns keyid

storeSite $url,$lastmod,$title,$level[,$extract]

Inserts (updates) a site and its data into the database

returns siteid

deleteSite $url

Deletes a site from the database

returns deleted siteid or 0 if site wasn't found

splitURL $url

splits URL into parts

returns domain,path,page
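
To illustrate the three return values, here is a minimal sketch of such a split. It is not the code in common.pl: the name split_url_sketch, the regular expression, and the choice to return "/" for an empty path are all assumptions, and query strings are not treated specially.

```perl
use strict;
use warnings;

# Hypothetical sketch, not the actual splitURL implementation.
sub split_url_sketch {
    my ($url) = @_;
    # scheme://domain, an optional directory path, then the last path segment
    my ($domain, $path, $page) =
        $url =~ m{^\w+://([^/]+)(/(?:[^/]*/)*)?([^/]*)$}
        or return;                       # empty list if the URL does not match
    $path = '/' unless defined $path;    # assumption: bare domains get "/"
    return ($domain, $path, $page);
}

my ($domain, $path, $page) =
    split_url_sketch('http://example.com/docs/index.html');
# $domain is 'example.com', $path is '/docs/', $page is 'index.html'
```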

fetch $url,$level[,$startid,$starturl]

fetches the URL and passes the content to the parsers

returns nothing

countKeywords $hashref,$text [,$multiplier]

counts all keywords in $text (each count is multiplied by $multiplier) and stores the totals in the given hash

returns nothing
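
The effect of the multiplier can be sketched as follows. This is an assumption-laden illustration, not the real countKeywords: the name count_keywords_sketch, the whitespace tokenization, and the default multiplier of 1 are all hypothetical.

```perl
use strict;
use warnings;

# Hypothetical sketch, not the actual countKeywords implementation.
sub count_keywords_sketch {
    my ($counts, $text, $multiplier) = @_;
    $multiplier = 1 unless defined $multiplier;   # assumed default
    # Assumption: every whitespace-separated token is a keyword.
    for my $word (split ' ', $text) {
        $counts->{$word} += $multiplier;
    }
    return;
}

my %counts;
count_keywords_sketch(\%counts, 'perl spider perl');   # weight 1
count_keywords_sketch(\%counts, 'spider', 3);          # weight 3
# %counts now holds perl => 2, spider => 4
```

Passing the hash by reference lets several calls with different multipliers accumulate into the same totals.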

parseDocument $header,$document,$level,$url,$base,$startid,$starturl

parses header (HTTP::Headers) and document

returns nothing

stripHTML $html

strips tags and newlines from HTML

returns textonly
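
A naive regex-based version of this step might look like the sketch below. The name strip_html_sketch and the exact substitutions are assumptions; the real stripHTML may behave differently, and a regex cannot handle every HTML construct (a parser such as HTML::Parser is more robust).

```perl
use strict;
use warnings;

# Hypothetical sketch, not the actual stripHTML implementation.
sub strip_html_sketch {
    my ($html) = @_;
    $html =~ s/<[^>]*>/ /g;   # replace each tag with a space
    $html =~ s/\s+/ /g;       # collapse newlines and whitespace runs
    $html =~ s/^ | $//g;      # trim the leading and trailing space
    return $html;
}

my $text = strip_html_sketch("<p>Hello\n<b>world</b></p>");
# $text is 'Hello world'
```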

stripSpecials $text

strips special characters and converts the text to lowercase

returns textonly
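
What counts as a "special character" is not stated here. The sketch below assumes that everything except ASCII letters, digits, and whitespace is special, which would also drop Latin-1 characters such as umlauts; the real stripSpecials may well keep them. The name strip_specials_sketch is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical sketch, not the actual stripSpecials implementation.
sub strip_specials_sketch {
    my ($text) = @_;
    $text = lc $text;
    $text =~ s/[^a-z0-9\s]+/ /g;   # assumption: ASCII-only keep list
    $text =~ s/\s+/ /g;            # collapse whitespace runs
    $text =~ s/^ | $//g;           # trim
    return $text;
}

my $clean = strip_specials_sketch('Hello, World! (2004)');
# $clean is 'hello world 2004'
```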

stripHTML $siteid

removes all hits for a site

returns nothing

findLinks $html,$base,$level,$startid,$starturl,$domain

finds all links and stores them in the database

returns nothing

addURL $url,$level[,$startid]

adds a new URL.

returns nothing

debug $verbositylevel,$message

prints the message if $VERBOSITY permits messages of $verbositylevel

returns nothing
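
The exact rule for when a message is printed is not documented here. One common convention, sketched below purely as an assumption, is to print when the message level does not exceed the global verbosity. The name debug_sketch and the use of STDERR are hypothetical.

```perl
use strict;
use warnings;

our $VERBOSITY = 1;   # assumed global; higher means more output

# Hypothetical sketch, not the actual debug implementation.
sub debug_sketch {
    my ($level, $message) = @_;
    return unless $level <= $VERBOSITY;   # assumed comparison
    print STDERR "$message\n";
    return;
}

debug_sketch(1, 'fetching start URL');   # printed at verbosity 1
debug_sketch(2, 'parsed 14 links');      # suppressed at verbosity 1
```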

clearDB

deletes everything in database

returns nothing

getOutdatedURL $reindexperiod

fetches a URL that is older than $reindexperiod (in hours)

returns true|false

markURL $url

updates the lastcheck field of a URL so that the next run doesn't pick that URL again.

returns nothing

loadStopwords $file

loads stopwords from $file

returns @stopwords
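
The stopword file format is not described here. The sketch below assumes one word per line, skips blank lines, and lowercases each entry; the name load_stopwords_sketch and these choices are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical sketch, not the actual loadStopwords implementation.
sub load_stopwords_sketch {
    my ($file) = @_;
    open my $fh, '<', $file or die "Cannot open $file: $!";
    my @stopwords;
    while (my $line = <$fh>) {
        chomp $line;
        next if $line =~ /^\s*$/;     # skip blank lines
        push @stopwords, lc $line;    # assumption: one word per line
    }
    close $fh;
    return @stopwords;
}
```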

isin $needle,@haystack

checks if $needle is in @haystack

returns true|false
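
A membership test of this kind can be sketched as a linear scan with string comparison. The name isin_sketch and the use of eq (rather than a numeric or case-insensitive comparison) are assumptions.

```perl
use strict;
use warnings;

# Hypothetical sketch, not the actual isin implementation.
sub isin_sketch {
    my ($needle, @haystack) = @_;
    for my $item (@haystack) {
        return 1 if $item eq $needle;   # exact string match
    }
    return 0;
}

my @stopwords = ('the', 'and', 'of');
my $found   = isin_sketch('and',    @stopwords);   # 1
my $missing = isin_sketch('spider', @stopwords);   # 0
```

For large stopword lists, a hash lookup would avoid scanning the whole list on every call.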

printStatistics

prints statistical data

returns nothing

showMostCommon $limit

shows the $limit most common keywords

returns nothing

showStartURLs

prints all startURLs

returns nothing

readConfig $config

reads a config file

returns hash with config data
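
The config file syntax is not specified here. The sketch below assumes simple "key = value" lines with "#" comments; the name read_config_sketch and the key names used in the comments are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical sketch; the format read by the real readConfig may differ.
sub read_config_sketch {
    my ($file) = @_;
    my %config;
    open my $fh, '<', $file or die "Cannot open $file: $!";
    while (my $line = <$fh>) {
        chomp $line;
        next if $line =~ /^\s*(?:#|$)/;   # skip comments and blank lines
        my ($key, $value) = $line =~ /^\s*([^=\s]+)\s*=\s*(.*?)\s*$/
            or next;                       # ignore malformed lines
        $config{$key} = $value;
    }
    close $fh;
    return %config;
}

# With a file containing the line "verbosity = 2", the returned hash
# would include verbosity => '2'.
```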

cleanUp

removes duplicates and meaningless data from the database

returns nothing

deleteSite url

Deletes a startURL and all dependent sites and exits.

returns nothing