I'm running the following code using numpy arrays. I get a MemoryError on Ubuntu, while the same code runs on Mac OS X (where paging is set up automatically). The process consumes around 30 GB.
I'm trying to replace the numpy array with another structure that doesn't consume so much memory, and to improve the for loops.
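For a sense of scale, here is a rough back-of-the-envelope estimate of the arrays involved. It uses the sample count (306404), vocabulary size (7579) and description length (34) printed in the output at the end of this post. The bytes per element are an assumption: which float type `to_categorical` returns depends on the Keras version, so both 4 and 8 bytes are shown. The description length is assumed to be the `max_length` used for padding. The photo feature arrays are left out because their size is not given here.

```python
# Rough size estimate for the arrays built by create_sequences.
# Counts come from the output printed in this post; element sizes are assumptions.
n_samples = 306404   # number of (photo, input sequence, word) triples
vocab_size = 7579    # width of each one-hot row
max_length = 34      # padded input length (assumed equal to "Description Length")


def gigabytes(n_elements, bytes_per_element):
    """Size in GB (10**9 bytes) of an array with the given element count."""
    return n_elements * bytes_per_element / 1e9


one_hot_f32 = gigabytes(n_samples * vocab_size, 4)  # one-hot y stored as float32
one_hot_f64 = gigabytes(n_samples * vocab_size, 8)  # one-hot y stored as float64
int_labels = gigabytes(n_samples, 4)                # one int32 word index per sample
padded_seqs = gigabytes(n_samples * max_length, 4)  # padded inputs stored as int32

print("one-hot y (float32): %.2f GB" % one_hot_f32)
print("one-hot y (float64): %.2f GB" % one_hot_f64)
print("integer y (int32):   %.4f GB" % int_labels)
print("padded X2 (int32):   %.3f GB" % padded_seqs)
```

Under these assumptions the one-hot target array alone is roughly 9.3 GB as float32, while one integer index per sample is on the order of a megabyte. If the model can be trained with integer targets (for example with Keras's `sparse_categorical_crossentropy` loss), the one-hot encoding may not be needed at all.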
    # Imports assumed from the names used below; they are not shown in the original post.
    from numpy import array
    from keras.preprocessing.sequence import pad_sequences
    from keras.utils import to_categorical


    def create_sequences(tokenizer, max_length, descriptions, photos):
        """Creates sequences of images, input sequences and output words for an image.

        X1,     X2 (text sequence),                          y (word)
        photo   startseq,                                    little
        photo   startseq, little,                            girl
        photo   startseq, little, girl,                      running
        photo   startseq, little, girl, running,             in
        photo   startseq, little, girl, running, in,         field
        photo   startseq, little, girl, running, in, field,  endseq

        :param tokenizer:
        :param max_length:
        :param descriptions:
        :param photos:
        :return:
        """
        X1, X2, y = [], [], []
        # Walk through each image identifier.
        for desc_key, desc_list in descriptions.iteritems():
            # Walk through each description for the image.
            for desc in desc_list:
                # Encode the sequence.
                seq = tokenizer.texts_to_sequences([desc])[0]
                # Split one sequence into multiple X,Y pairs.
                for i in range(1, len(seq)):
                    # Split into input and output pair.
                    in_seq, out_seq = seq[:i], seq[i]
                    # Pad input sequence.
                    in_seq = pad_sequences([in_seq], maxlen=max_length)[0]
                    # Encode output sequence.
                    # Note: vocab_size is not a parameter; it is read from the enclosing module.
                    out_seq = to_categorical([out_seq], num_classes=vocab_size)[0]
                    # Store.
                    X1.append(photos[desc_key][0])
                    X2.append(in_seq)
                    y.append(out_seq)
        print len(X1), len(X2), len(y)
        print type(X1[0])
        #return array(X1), array(X2), array(y)
Output:
    Dataset: 6000 train images.
    Descriptions: train=6000
    Vocabulary Size: 7579
    Photos: train=6000
    Description Length: 34
    Preparing text sequences for training.
    306404 306404 306404
    <type 'numpy.ndarray'>
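One possible direction, sketched here under assumptions and not a confirmed fix, is to stop materializing all 306404 samples at once and yield them in batches from a generator. The sketch below is self-contained and uses only numpy. The names `pad_left` and `sequence_batches` and the toy data are hypothetical. It assumes the descriptions have already been integer-encoded (what `tokenizer.texts_to_sequences` does in `create_sequences`), and it keeps the target as an integer word index instead of a one-hot row. `pad_left` is a stand-in for the default left-padding behaviour of `pad_sequences`.

```python
import numpy as np


def pad_left(seq, max_length):
    """Left-pad with zeros (or keep the last max_length items) to a fixed length."""
    seq = list(seq)[-max_length:]
    return np.array([0] * (max_length - len(seq)) + seq, dtype=np.int32)


def sequence_batches(encoded, photos, max_length, batch_size):
    """Yield (X1, X2, y) batches instead of building the full lists in memory.

    encoded: dict mapping image id -> list of integer-encoded descriptions
    photos:  dict mapping image id -> array whose first row is the feature vector
    y holds integer word indices, not one-hot rows.
    """
    X1, X2, y = [], [], []
    for key, seqs in encoded.items():
        for seq in seqs:
            # Split one sequence into multiple (prefix, next word) pairs.
            for i in range(1, len(seq)):
                X1.append(photos[key][0])
                X2.append(pad_left(seq[:i], max_length))
                y.append(seq[i])
                if len(y) == batch_size:
                    yield np.array(X1), np.array(X2), np.array(y, dtype=np.int32)
                    X1, X2, y = [], [], []
    # Emit the final, possibly smaller, batch.
    if y:
        yield np.array(X1), np.array(X2), np.array(y, dtype=np.int32)


# Toy data (hypothetical): two images, three short encoded descriptions.
encoded = {"img1": [[1, 5, 6, 2]], "img2": [[1, 7, 2], [1, 8, 9, 2]]}
photos = {
    "img1": np.zeros((1, 4), dtype=np.float32),
    "img2": np.ones((1, 4), dtype=np.float32),
}

batches = list(sequence_batches(encoded, photos, max_length=5, batch_size=4))
for X1_b, X2_b, y_b in batches:
    print(X1_b.shape, X2_b.shape, y_b.shape)
```

With this layout only one batch is held in memory at a time. To feed it to Keras's `fit_generator`, the generator would need to loop over the data indefinitely, so the body would have to be wrapped in a `while True:` loop. If the model requires one-hot targets, the conversion could be applied per batch just before the `yield`, which keeps the one-hot array at `batch_size` rows.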