Abstract
Sentiment analysis is essential for understanding consumer opinions, yet selecting the optimal models and embedding methods remains challenging, especially when handling ambiguous expressions, slang, or mismatched sentiment-rating pairs. This study provides a comprehensive comparative evaluation of sentiment classification models across three paradigms: traditional machine learning, pre-transformer deep learning, and transformer-based models. Using the Amazon Magazine Subscriptions 2023 dataset, we evaluate a range of embedding techniques, including static embeddings (GloVe, FastText) and contextual transformer embeddings (e.g., BERT, DistilBERT). To capture predictive confidence and model uncertainty, we include categorical cross-entropy as a key evaluation metric alongside accuracy, precision, recall, and F1-score. In addition to detailed quantitative comparisons, we conduct a systematic qualitative analysis of misclassified samples to reveal model-specific patterns of uncertainty. Our findings show that FastText consistently outperforms GloVe in both traditional and LSTM-based models, particularly in recall, owing to its subword-level semantic richness. Transformer-based models demonstrate superior contextual understanding, with DistilBERT achieving the highest accuracy (92%) and lowest cross-entropy loss (0.25), indicating well-calibrated predictions. To validate the generalisability of our results, we replicated our experiments on the Amazon Gift Card Reviews dataset, where similar trends were observed. We also adopt a resource-aware approach by reducing the dataset size from 25K to 20K reviews to reflect real-world hardware constraints. This study contributes to both sentiment analysis and sustainable AI by offering a scalable, entropy-aware evaluation framework that supports informed, context-sensitive model selection in practical applications.