Francisca Dias
from IPython.display import Image
Image("stack_overflow.png")
Above are two questions that were asked on Stack Overflow. The first is tagged .net and the second is tagged ruby-on-rails.
In this report I will build a model that can predict the tag on new questions on Stack Overflow.
The dataset contains 20 tags, each corresponding to a programming language or a technology-related topic:
c#, python, css, javascript, android, iphone, ios, etc.
This dataset can be found here.
import pandas as pd
data = pd.read_csv("stack-overflow-data.csv")
data.head()
How many instances does our data have?
len(data)
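As a quick sanity check (a minimal sketch, assuming the data DataFrame loaded above), we can confirm the 20 tags and see how the posts are distributed across them:
# Count the distinct tags and how many posts carry each one
print(data['tags'].nunique())
print(data['tags'].value_counts())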
Split data into train and test
train_size = int(len(data) * .8)
print("Train size: %d" % train_size)
print("Test size: %d" % (len(data) - train_size))
train_comments = data['post'][:train_size]
train_labels = data['tags'][:train_size]
test_comments = data['post'][train_size:]
test_labels = data['tags'][train_size:]
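Note that this split is purely positional, so it assumes the rows are not ordered by tag. If that is in doubt, a shuffled split is safer; here is a sketch using scikit-learn's train_test_split as an alternative to the slicing above:
from sklearn.model_selection import train_test_split

# Shuffled 80/20 split; random_state makes it reproducible
train_comments, test_comments, train_labels, test_labels = train_test_split(
    data['post'], data['tags'], test_size=0.2, random_state=42)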
We have to transform our features into a format that Keras can read.
from keras.preprocessing import text
max_words = 1000
t = text.Tokenizer(num_words=max_words, char_level=False)
Fit the tokenizer on the training data:
t.fit_on_texts(train_comments)
There are 129402 distinct words in the training comments:
print("There are %d distinct words in the training comments" % len(t.word_counts))
print("In %d documents or instances" % t.document_count)
Documents then have to be encoded with the Tokenizer by calling texts_to_matrix():
X_train = t.texts_to_matrix(train_comments)
X_test = t.texts_to_matrix(test_comments)
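To make the encoding concrete, here is a toy illustration (with a made-up mini-corpus) of what texts_to_matrix produces: by default each document becomes a binary row of length num_words, with a 1 in the column of every word that appears in it.
toy = text.Tokenizer(num_words=10)
toy.fit_on_texts(["python list comprehension", "css center a div", "python dict"])
# Each row is one document; column j is 1 if the word with index j occurs in it
print(toy.texts_to_matrix(["python dict", "center a div"]))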
Use the LabelEncoder class to convert the label strings to integers:
from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()
encoder.fit(train_labels)
y_train = encoder.transform(train_labels)
y_test = encoder.transform(test_labels)
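LabelEncoder simply maps each distinct tag string to an integer index, sorted alphabetically; a toy sketch, independent of our data:
toy_enc = LabelEncoder()
# classes_ will be ['c#' 'ios' 'python'], so the encoding is [2 0 2 1]
print(toy_enc.fit_transform(['python', 'c#', 'python', 'ios']))
print(toy_enc.classes_)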
print('X_train shape:', X_train.shape)
print('X_test shape:', X_test.shape)
print('y_train shape:', y_train.shape)
print('y_test shape:', y_test.shape)
Now let us convert the labels to a one-hot representation
# Converts the labels to a one-hot representation
import numpy as np
from keras import utils
num_classes = np.max(y_train) + 1
y_train = utils.to_categorical(y_train, num_classes)
y_test = utils.to_categorical(y_test, num_classes)
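to_categorical turns each integer label into a vector with a single 1 at the label's index; a tiny sketch:
# Label 0 -> [1, 0, 0], label 2 -> [0, 0, 1], etc.
print(utils.to_categorical([0, 2, 1], num_classes=3))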
print('X_train shape:', X_train.shape)
print('X_test shape:', X_test.shape)
print('y_train shape:', y_train.shape)
print('y_test shape:', y_test.shape)
from keras.models import Sequential
from keras.layers import Dense, Activation, Dropout

model = Sequential()
model.add(Dense(512, input_shape=(max_words,)))
model.add(Activation('relu'))
model.add(Dropout(0.5))
model.add(Dense(num_classes))
model.add(Activation('softmax'))

model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])
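To check the resulting architecture and parameter counts before training, we can print a summary:
model.summary()  # shows the two Dense layers and their parameter counts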
Let us walk through the model defined above. The input layer is a Dense layer, which means each neuron in this layer is fully connected to all neurons in the next layer.
We pass the Dense layer two parameters:
1- the dimensionality of the layer's output (number of neurons): it's common to use a power of 2, so we start with 512.
2- the shape of our input data: a vector of length max_words (1000) for each comment.
The number of rows in our input data will be the number of posts we feed the model at each training step (the batch size), and the number of columns will be the size of our vocabulary.
With that, the Dense input layer is fully specified.
The activation function tells our model how to calculate the output of a layer (you can read more about ReLU here).
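ReLU itself is just max(0, x) applied element-wise; a one-line sketch with NumPy:
# ReLU keeps positive values and zeroes out negatives
print(np.maximum(0, np.array([-2.0, -0.5, 0.0, 1.5])))  # [0.  0.  0.  1.5]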
history = model.fit(X_train, y_train,
                    batch_size=32,
                    epochs=2,
                    verbose=1,
                    validation_split=0.1)
Evaluate the model
score = model.evaluate(X_test, y_test,
                       batch_size=32, verbose=1)
print('Test score:', score[0])
print('Test accuracy:', score[1])
Let's make predictions on real examples
Remember that test_comments contains the raw text of the test comments:
test_comments.head()
Let us see how well the model predicts the tags for the first five test comments:
for i in range(5):
    prediction = model.predict(np.array([X_test[i]]))
    predicted_label = encoder.classes_[np.argmax(prediction)]
    print(test_comments.iloc[i][:50], "...")
    print('Actual label: ' + test_labels.iloc[i])
    print("Predicted label: " + predicted_label + "\n")
It looks like out of these five questions the model missed only one: the fourth question is about c++, but the model predicted c#.
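Finally, the same pipeline works for a brand-new question that is not in the dataset; the question text below is a made-up example:
new_post = ["how do I center a div horizontally with flexbox"]  # hypothetical question
x_new = t.texts_to_matrix(new_post)  # encode with the tokenizer fitted on the training data
pred = model.predict(x_new)
print(encoder.classes_[np.argmax(pred)])  # hopefully 'css'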