I am using the script below to do part-of-speech tagging on textual data. It looks up each word in a lexicon table in MySQL and retrieves a tag for it.
I usually perform these analyses on small datasets. I am trying to run the script on a large dataset and it would take weeks to complete at this speed.
I think the script is so slow because the MySQL query runs inside the foreach loop, once per token. I tried to move it out of the loop, without success. I think it is too complicated for my skill level.
Any advice on how to proceed?
Thanks a lot.
class PosTagger {
    /*
    private $dict;
    /*
    public function __construct($lexicon) {
        $fh = fopen($lexicon, 'r');
        while($line = fgets($fh)) {
            $tags = explode(' ', $line);
            $this->dict[strtolower(array_shift($tags))] = $tags;
        }
        fclose($fh);
    }
    */

    public function tag($text) {
        preg_match_all("/[\w\d\.]+/", $text, $matches);
        $nouns = array('NN', 'NNS');
        $return = array();
        $i = 0;
        foreach($matches[0] as $token) {
            // default to a common noun
            $return[$i] = array('token' => $token, 'tag' => 'NN');

            // remove trailing full stops
            if(substr($token, -1) == '.') {
                $token = preg_replace('/\.+$/', '', $token);
            }

            // get from dict if set
            /*if(isset($this->dict[strtolower($token)])) {
                $return[$i]['tag'] = $this->dict[strtolower($token)][0];
            }
            */
            if($row = mysql_fetch_array(mysql_query('SELECT tags FROM lexicon WHERE lemma = \''.mysql_real_escape_string($token).'\''),MYSQL_ASSOC)) {
                $return[$i]['tag'] = $row['tags'][0];
            }

            // If we get noun noun, and the second can be a verb, convert to verb
            if($i > 0) {
                if(in_array($return[$i]['tag'], $nouns) && in_array($return[$i-1]['tag'], $nouns) && isset($this->dict[strtolower($token)])) {
                    if(in_array('VBN', $this->dict[strtolower($token)])) {
                        $return[$i]['tag'] = 'VBN';
                    } else if(in_array('VBZ', $this->dict[strtolower($token)])) {
                        $return[$i]['tag'] = 'VBZ';
                    }
                }
            }
            $i++;
        }
        return $return;
    }
}
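
To illustrate what moving the query out of the loop could look like, here is a minimal sketch: collect the distinct tokens first, fetch their tags in a few batched `IN (...)` queries, and keep the result in an array for lookups. It is not a drop-in fix for the script above, and it rests on several assumptions. It uses PDO instead of the `mysql_*` functions and needs PHP 7 or later. The helper name `fetchTagsForTokens` and the `$chunkSize` parameter are hypothetical. It assumes the `lemma` column compares case-insensitively, which depends on the column's collation. Only the table and column names (`lexicon`, `lemma`, `tags`) come from the original query.

```php
<?php
// Hypothetical helper: look up the tags for many tokens with a few batched
// queries instead of one query per token.
// Assumes a PDO connection and the lexicon(lemma, tags) table used above.
function fetchTagsForTokens(PDO $pdo, array $tokens, int $chunkSize = 1000): array
{
    $map = array();

    // Lowercase and de-duplicate so each distinct token is looked up once.
    // Assumption: lemma matching is case-insensitive in the database.
    $unique = array_values(array_unique(array_map('strtolower', $tokens)));

    // Query in chunks so the IN (...) list stays a manageable size.
    foreach (array_chunk($unique, $chunkSize) as $chunk) {
        // One "?" placeholder per token: lemma IN (?, ?, ...)
        $placeholders = implode(', ', array_fill(0, count($chunk), '?'));
        $stmt = $pdo->prepare(
            "SELECT lemma, tags FROM lexicon WHERE lemma IN ($placeholders)"
        );
        $stmt->execute($chunk);

        // Build an in-memory map: lowercased lemma => tags string.
        foreach ($stmt->fetchAll(PDO::FETCH_ASSOC) as $row) {
            $map[strtolower($row['lemma'])] = $row['tags'];
        }
    }

    return $map;
}
```

In `tag()`, the idea would be to call this once before the foreach loop, passing `$matches[0]` with trailing full stops already removed, and then check `isset($map[strtolower($token)])` inside the loop instead of running a query. Whether this is fast enough on the large dataset would still need to be measured.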