Document sentiment classification using bag-of-words in Python


Sentiment classification is a supervised machine learning task within natural language processing (NLP). A classification algorithm, such as logistic regression, is trained on text documents together with their labels, e.g. positive and negative, and can then predict the label of text it has not seen.

A sentiment classification implementation typically consists of the following steps:

The raw text data are cleaned and preprocessed, which includes removing unwanted markup such as HTML tags.
The text is tokenized into words (tokens), and a tf-idf matrix is built from the token counts.
A machine learning model is trained on the tf-idf matrix.
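
To make the tf-idf step concrete, here is a minimal sketch in plain Python. It assumes the smoothed idf variant that scikit-learn documents for smooth_idf=True, with no length normalization. The three toy sentences are made up for illustration and are not part of the IMDB data.

```python
import math
from collections import Counter

# Toy documents (illustrative only)
docs = [
    "the movie was great",
    "the movie was boring",
    "great acting and a great story",
]

tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: number of documents that contain each term
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

def idf(term):
    # Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1
    return math.log((1 + n_docs) / (1 + doc_freq[term])) + 1

def tfidf_vector(tokens):
    # Raw term counts weighted by idf, without length normalization
    counts = Counter(tokens)
    return {term: count * idf(term) for term, count in counts.items()}

# One row per document, one column per vocabulary term
vocabulary = sorted(doc_freq)
matrix = []
for tokens in tokenized:
    weights = tfidf_vector(tokens)
    matrix.append([weights.get(term, 0.0) for term in vocabulary])

for doc, row in zip(docs, matrix):
    print(doc, [round(value, 2) for value in row])
```

Terms that occur in fewer documents, such as "boring", receive a larger idf weight than terms shared by several documents, such as "the".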

In the following example, we perform sentiment classification on movie review data from IMDB.

After the data is downloaded and extracted, we load it into the Python working session.

import os
import sys

import pandas as pd
import pyprind

# Folder created by extracting the downloaded IMDB archive
basepath = 'aclImdb'

labels = {'pos': 1, 'neg': 0}
pbar = pyprind.ProgBar(50000, stream=sys.stdout)
rows = []

for s in ('test', 'train'):
    for l in ('pos', 'neg'):
        path = os.path.join(basepath, s, l)
        for file in sorted(os.listdir(path)):
            with open(os.path.join(path, file),
                      'r', encoding='utf-8') as infile:
                txt = infile.read()
            # Collect rows in a list; concatenating inside the loop is slow
            rows.append([txt, labels[l]])
            pbar.update()

# Build the data frame once, after all files are read
df = pd.DataFrame(rows)

We then shuffle the rows of the data frame and inspect its contents.

import numpy as np

df.columns = ['review', 'sentiment']
# Fix the seed so the shuffle is reproducible
np.random.seed(0)
df = df.reindex(np.random.permutation(df.index))
print(df.head(3))
print(df.shape)

The following code creates a tf-idf transformer and preprocesses the raw text, removing HTML tags while keeping emoticons. It also defines the tokenizer functions and loads the English stop words used later.


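import re

import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import TfidfTransformer

# Standalone tf-idf transformer; the grid search below uses TfidfVectorizer instead
tfidf = TfidfTransformer(use_idf=True, norm=None, smooth_idf=True)

def preprocessor(text):
    # Strip HTML tags
    text = re.sub(r'<[^>]*>', '', text)
    # Collect emoticons such as :) ;-( =D before punctuation is removed
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)',
                           text)
    # Lowercase, replace non-word characters with spaces, append the emoticons
    text = (re.sub(r'[\W]+', ' ', text.lower()) +
            ' '.join(emoticons).replace('-', ''))
    return text

df['review'] = df['review'].apply(preprocessor)
porter = PorterStemmer()

def tokenizer(text):
    return text.split()

def tokenizer_porter(text):
    # Split on whitespace and reduce each word to its Porter stem
    return [porter.stem(word) for word in text.split()]

nltk.download('stopwords')
stop = stopwords.words('english')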
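
To see what the preprocessing does to a single review, the sketch below applies the same preprocessor and the whitespace tokenizer to one string. The sample string is made up for illustration. The functions are repeated here so that the snippet runs on its own.

```python
import re

def preprocessor(text):
    # Strip HTML tags
    text = re.sub(r'<[^>]*>', '', text)
    # Remember the emoticons before punctuation is removed
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text)
    # Lowercase, replace non-word characters with spaces,
    # then append the emoticons with the "nose" character removed
    text = (re.sub(r'[\W]+', ' ', text.lower()) +
            ' '.join(emoticons).replace('-', ''))
    return text

def tokenizer(text):
    return text.split()

# Illustrative input containing a stray tag and three emoticons
sample = "</a>This :) is :( a test :-)!"

cleaned = preprocessor(sample)
print(cleaned)             # this is a test :) :( :)
print(tokenizer(cleaned))  # ['this', 'is', 'a', 'test', ':)', ':(', ':)']
```

The emoticons are moved to the end of the text, which is harmless here because a bag-of-words model ignores word order.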

Next, we train a logistic regression model on the data. A grid search with 5-fold cross-validation selects the best hyperparameter combination.

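from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Split by position: the index labels were shuffled above, so label-based
# .loc slicing would not return the first 25,000 rows
X_train = df['review'].values[:25000]
y_train = df['sentiment'].values[:25000].astype(int)
X_test = df['review'].values[25000:]
y_test = df['sentiment'].values[25000:].astype(int)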
tfidf = TfidfVectorizer(strip_accents=None,
                        lowercase=False,
                        preprocessor=None)
small_param_grid = [{'vect__ngram_range': [(1, 1)],
                     'vect__stop_words': [None],
                     'vect__tokenizer': [tokenizer, tokenizer_porter],
                     'clf__penalty': ['l2'],
                     'clf__C': [1.0, 10.0]},
                    {'vect__ngram_range': [(1, 1)],
                     'vect__stop_words': [stop, None],
                     'vect__tokenizer': [tokenizer],
                     'vect__use_idf':[False],
                     'vect__norm':[None],
                     'clf__penalty': ['l2'],
                     'clf__C': [1.0, 10.0]},
                    ]

lr_tfidf = Pipeline([('vect', tfidf),
                     ('clf', LogisticRegression(solver='liblinear'))])
gs_lr_tfidf = GridSearchCV(lr_tfidf, small_param_grid,
                           scoring='accuracy',
                           cv=5,
                           verbose=1,
                           n_jobs=-1)
gs_lr_tfidf.fit(X_train, y_train)
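
To get a feel for how much work the grid search does, the following sketch counts the hyperparameter combinations in the grid above using only the standard library. The strings 'tokenizer', 'tokenizer_porter' and 'stop' are stand-ins for the real objects, since only the number of options per hyperparameter matters here. The helper name expand is hypothetical.

```python
from itertools import product

# Same shape as small_param_grid, with string stand-ins for functions and lists
param_grid = [
    {'vect__ngram_range': [(1, 1)],
     'vect__stop_words': [None],
     'vect__tokenizer': ['tokenizer', 'tokenizer_porter'],
     'clf__penalty': ['l2'],
     'clf__C': [1.0, 10.0]},
    {'vect__ngram_range': [(1, 1)],
     'vect__stop_words': ['stop', None],
     'vect__tokenizer': ['tokenizer'],
     'vect__use_idf': [False],
     'vect__norm': [None],
     'clf__penalty': ['l2'],
     'clf__C': [1.0, 10.0]},
]

def expand(grid):
    # Yield every combination of values within one grid dictionary
    keys = sorted(grid)
    for values in product(*(grid[key] for key in keys)):
        yield dict(zip(keys, values))

# Each dictionary is expanded separately; the results are concatenated
candidates = [combo for grid in param_grid for combo in expand(grid)]
n_folds = 5

print(len(candidates))            # 8
print(len(candidates) * n_folds)  # 40
```

Each of the two dictionaries yields 4 combinations, so with cv=5 the search fits 40 models during cross-validation before refitting the best one on the full training set.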

After training finishes, we print the hyperparameters of the best model, along with its cross-validation accuracy on the training data and its accuracy on the test data.

print(f'Best parameter set: {gs_lr_tfidf.best_params_}')
print(f'CV Accuracy: {gs_lr_tfidf.best_score_:.3f}')
clf = gs_lr_tfidf.best_estimator_
print(f'Test Accuracy: {clf.score(X_test, y_test):.3f}')
# Output:
Best parameter set: {'clf__C': 10.0, 'clf__penalty': 'l2', 'vect__ngram_range': (1, 1), 'vect__stop_words': None, 'vect__tokenizer': <function tokenizer at 0x000001EB20C0B380>}
CV Accuracy: 0.897
Test Accuracy: 0.899
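
Once a pipeline is fitted, it can label new reviews directly from raw text. The sketch below shows the idea on a tiny made-up training set, with default TfidfVectorizer settings rather than the tuned ones. The texts, labels and variable names are assumptions for illustration, and a model trained on four sentences is not expected to be accurate.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Tiny illustrative training set: 1 = positive, 0 = negative
train_texts = [
    "a wonderful and moving film",
    "terrible plot and awful acting",
    "great story with wonderful actors",
    "boring, awful and far too long",
]
train_labels = [1, 0, 1, 0]

# Vectorizer and classifier chained in one pipeline
model = Pipeline([('vect', TfidfVectorizer()),
                  ('clf', LogisticRegression(solver='liblinear'))])
model.fit(train_texts, train_labels)

new_reviews = ["what a wonderful story", "awful and boring"]

# predict returns one class label per review;
# predict_proba returns one probability per class and review
predictions = model.predict(new_reviews)
probabilities = model.predict_proba(new_reviews)

for review, label, proba in zip(new_reviews, predictions, probabilities):
    print(review, label, proba.round(3))
```

With the grid search from this article, the same two calls can be made on gs_lr_tfidf.best_estimator_.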

If you want to look at the code in more detail, you can download the Python source file ch08.py from the following link.

You can also watch the video for this application on our YouTube channel.

Published by wilsonzhang746 on August 30, 2025
