Ali Safaya

Koç University, CS – PhD Student


NLP, Transformers, BERT, CNN

KUISAIL at SemEval-2020 task 12: BERT-CNN for offensive speech identification in social media.

Ali Safaya, Moutasem Abdullatif, and Deniz Yuret. 2020. KUISAIL at SemEval-2020 task 12: BERT-CNN for offensive speech identification in social media. In Proceedings of the Fourteenth Workshop on Semantic Evaluation, pages 2054–2059, Barcelona (online), December. International Committee for Computational Linguistics.

Offensive Speech Identification

Recently, we have observed an increase in social media usage and a corresponding increase in hateful and offensive speech. Solutions to this problem range from manual moderation to rule-based filtering systems; however, these methods are either time-consuming or prone to errors when the full context is not taken into consideration while assessing the sentiment of the text (Saif et al. 2016).

We joined the shared task of Multilingual Offensive Language Identification (OffensEval2020), where we focused on detecting offensive language on social media platforms, more specifically on Twitter. The task provides data sets of tweets that were collected from Twitter and annotated manually. We worked on tweets in Arabic, Greek, and Turkish. The task falls under text classification. In this blog post, we explain our work on the task and how our model ranked among the top four models.



Arabic:  @USER يا مريوووم كل سنة و انتي طيبة يا قلبي و يا رب دايما مبسوطة كدا و عقبال العمر كله يا قمر لاڤ يو ♥️♥️♥️ — Not Offensive
Turkish: @USER Hepimize şerefsiz dedin farkındamısın? *** bey zavallı dedi diye kıyamet koparıyorsun. — Offensive
Greek:   @USER Και οι μαλάκες που δικαιολογησαν τον μπάτσο επίσης — Offensive

Example tweets from the data sets

Each dataset consists of two columns for each tweet: the preprocessed text of the tweet and a binary label, which is either “Offensive” (positive) or “Not Offensive” (negative). As the table below shows, the labels are distributed in an imbalanced way. This label imbalance could lead to a biased evaluation if the wrong metric were used; for that reason, the task is evaluated using the macro-averaged F1-Score.
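To make the choice of metric concrete, here is a minimal sketch of the macro-averaged F1-Score in plain Python (the toy labels below are illustrative, not from the shared task; the result matches scikit-learn's f1_score with average="macro"). Because the per-class F1 scores are averaged with equal weight, the minority “Offensive” class counts as much as the majority class:

```python
def macro_f1(y_true, y_pred, labels=(0, 1)):
    """Macro-averaged F1: compute F1 per class, then average with equal weight."""
    f1s = []
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * precision * recall / (precision + recall)
                   if precision + recall else 0.0)
    return sum(f1s) / len(f1s)

# Toy illustration: 0 = Not Offensive (majority), 1 = Offensive (minority)
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 1, 1, 0]
print(macro_f1(y_true, y_pred))  # 0.625
```

A plain accuracy on the same toy labels would be 4/6 ≈ 0.67 and would reward a classifier that simply predicts the majority class, which is exactly what the macro average guards against.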


                 Arabic                 Greek                  Turkish
                 Train   Dev   Test    Train   Dev   Test     Train    Dev    Test
Not Offensive    5,785   626   1,607   5,642   616   1,120    23,084   2,543  2,740
Offensive        1,415   174   393     2,228   258   424      4,885    632    788
Total            7,200   800   2,000   7,869   874   1,545    28,581   3,175  3,528

Distribution of tweets over the data sets

Previous work and baselines

Extensive work has been performed on the task of offensive speech identification, which falls under text classification. Approaches to the problem range from lexical resources, linguistic features, and meta-information (Schmidt and Wiegand 2017), to machine learning (ML) models (Davidson et al. 2017), and more recently, deep neural models such as CNNs, Long Short-Term Memory (LSTM) networks, and their derivatives (Zhang et al. 2018).

Following previous work, we started experimenting by building a baseline model using a classic ML approach for text classification. Then, we worked with more recent approaches.


We implemented the first baseline model using scikit-learn. This model combines Term Frequency–Inverse Document Frequency (TF-IDF) with a Support Vector Machine (SVM). We found that a vectorizer with a feature set size of 3,000 was the most effective for this task.
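A minimal sketch of such a baseline as a scikit-learn pipeline, assuming a TfidfVectorizer capped at 3,000 features and a linear-kernel SVM (the toy corpus and labels here are invented for illustration; the actual KUISAIL setup may differ in vectorizer settings and SVM hyperparameters):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Toy stand-in corpus; the real training data are the annotated OffensEval tweets
texts = ["you are wonderful", "have a great day",
         "you are an idiot", "shut up you fool"]
labels = [0, 0, 1, 1]  # 0 = Not Offensive, 1 = Offensive

# TF-IDF features (capped at 3,000 dimensions) fed into a linear SVM
baseline = Pipeline([
    ("tfidf", TfidfVectorizer(max_features=3000)),
    ("svm", LinearSVC()),
])
baseline.fit(texts, labels)
print(baseline.predict(["you fool"]))
```

The pipeline keeps vectorization and classification in one object, so the same fitted vocabulary is applied at prediction time.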


Following Convolutional Neural Networks for Sentence Classification (Kim 2014), we built a CNN-based classification model. This CNN-Text model uses randomly initialized embeddings of size 300, which are trained along with the model. The difference between the results obtained using pre-trained BERT and randomly initialized embeddings was significant, as we will see in the Results section.
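A minimal PyTorch sketch of a Kim (2014)-style CNN-Text classifier, assuming trainable random embeddings of size 300 and global max-pooling over each filter's outputs (the filter counts and kernel sizes here are illustrative choices, not necessarily the exact CNN-Text configuration we used):

```python
import torch
import torch.nn as nn

class CNNText(nn.Module):
    """Kim (2014)-style text CNN with randomly initialized embeddings."""
    def __init__(self, vocab_size, emb_dim=300, n_filters=32,
                 kernel_sizes=(1, 2, 3, 4, 5)):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        # One 1-D convolution per kernel size, sliding over token positions
        self.convs = nn.ModuleList(
            nn.Conv1d(emb_dim, n_filters, k) for k in kernel_sizes)
        self.fc = nn.Linear(n_filters * len(kernel_sizes), 1)

    def forward(self, token_ids):                      # (batch, seq_len)
        x = self.embedding(token_ids).transpose(1, 2)  # (batch, emb, seq)
        # ReLU, then global max-pooling over the sequence for each filter
        pooled = [torch.relu(conv(x)).max(dim=2).values for conv in self.convs]
        return torch.sigmoid(self.fc(torch.cat(pooled, dim=1)))  # (batch, 1)

model = CNNText(vocab_size=1000)
probs = model(torch.randint(0, 1000, (8, 64)))  # a batch of 8 tweets, 64 tokens
print(probs.shape)  # torch.Size([8, 1])
```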


While CNNs can capture local features of the text, Long Short-Term Memory (LSTM) networks (Hochreiter and Schmidhuber 1997), which have also shown remarkable performance in text classification tasks, can capture temporal information. In our experiments, we used two layers of Bidirectional LSTM (BiLSTM) with a hidden size of 128 and randomly initialized embeddings of size 300. However, this model was still outperformed by CNN-Text on average.
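The BiLSTM baseline can be sketched in the same way; a minimal version assuming two bidirectional layers of hidden size 128 and classification from the last time step (how the BiLSTM states are pooled into a single vector is an illustrative choice here):

```python
import torch
import torch.nn as nn

class BiLSTMText(nn.Module):
    """Two-layer BiLSTM classifier with randomly initialized embeddings."""
    def __init__(self, vocab_size, emb_dim=300, hidden=128):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden, num_layers=2,
                            bidirectional=True, batch_first=True)
        self.fc = nn.Linear(2 * hidden, 1)  # forward + backward hidden states

    def forward(self, token_ids):                       # (batch, seq_len)
        out, _ = self.lstm(self.embedding(token_ids))   # (batch, seq, 2*hidden)
        # Classify from the representation at the last time step
        return torch.sigmoid(self.fc(out[:, -1, :]))    # (batch, 1)

model = BiLSTMText(vocab_size=1000)
out = model(torch.randint(0, 1000, (8, 64)))
print(out.shape)  # torch.Size([8, 1])
```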


Pre-trained masked language models such as BERT (Devlin et al. 2019) have gained popularity due to their proven performance. These models can be fine-tuned or used directly as feature extractors for various textual tasks. In our experiments, three pre-trained language-specific BERT models were used along with the Multilingual BERT (mBERT) model: GreekBERT for Greek, BERTurk for Turkish, and ArabicBERT for Arabic. We used BERT for classification by feeding the pooled representation into a linear projection layer on top, as explained in Devlin et al. (2019).


Our proposed model maximizes the utilization of the knowledge embedded in pre-trained BERT language models by feeding the contextualized embeddings output by its last four hidden layers into the filters and convolution layers of a CNN. Finally, the output of the CNN is passed to a linear layer to obtain the predictions.

Architecture of BERT-CNN model
Devlin et al. (2019) showed, by comparing different combinations of BERT's layers, that the combined output of the last four hidden layers encodes more information than the output of the top layer alone.

We set the maximum sequence length of each text sample (tweet) to 64 tokens and feed the tokens to BERT. We then concatenate the output of the last four hidden layers of the base-sized pre-trained BERT to get vector representations of size 768×4×64, as shown in the figure above. Next, these embeddings are passed in parallel through 160 convolutional filters of five different sizes (768×1, 768×2, 768×3, 768×4, 768×5), with 32 filters per size. Each kernel takes the output of the last four hidden layers of BERT as four different channels and applies a convolution operation to it. After that, the output is passed through a ReLU activation function and a global max-pooling operation. Finally, the pooled outputs are concatenated and flattened, then passed through a dense layer and a Sigmoid function to obtain the final binary label.
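The CNN head described above can be sketched in PyTorch as follows. To keep the sketch self-contained, the BERT encoder itself is omitted and the head takes the stacked hidden states as input (in practice these would come from a transformers model called with output_hidden_states=True); the module below is our reading of the description, not the exact KUISAIL implementation:

```python
import torch
import torch.nn as nn

class BertCNNHead(nn.Module):
    """CNN head over BERT's last four hidden layers, treated as 4 channels."""
    def __init__(self, hidden=768, n_filters=32, kernel_sizes=(1, 2, 3, 4, 5)):
        super().__init__()
        # 2-D convolutions: 4 input channels, kernels spanning the hidden dim,
        # 32 filters per size -> 160 filters in total
        self.convs = nn.ModuleList(
            nn.Conv2d(4, n_filters, (k, hidden)) for k in kernel_sizes)
        self.fc = nn.Linear(n_filters * len(kernel_sizes), 1)

    def forward(self, layers):
        # layers: (batch, 4, seq_len, hidden) - last four hidden layers stacked
        feats = []
        for conv in self.convs:
            x = torch.relu(conv(layers)).squeeze(3)  # (batch, filters, seq-k+1)
            feats.append(x.max(dim=2).values)        # global max-pooling
        # Concatenate pooled features, then dense layer + Sigmoid
        return torch.sigmoid(self.fc(torch.cat(feats, dim=1)))  # (batch, 1)

head = BertCNNHead()
hidden_states = torch.randn(2, 4, 64, 768)  # stand-in for real BERT outputs
probs = head(hidden_states)
print(probs.shape)  # torch.Size([2, 1])
```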

This model was trained for 10 epochs with a learning rate of 2e-5, and the checkpoint with the best macro-averaged F1-Score on the development set was saved.
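The keep-the-best-dev-checkpoint loop can be sketched generically as below; the function name, the Adam optimizer choice, and the dev_eval callback are illustrative assumptions, not details from our actual training code:

```python
import torch

def train_with_best_checkpoint(model, loss_fn, train_loader, dev_eval,
                               epochs=10, lr=2e-5):
    """Train for a fixed number of epochs; keep the weights with the best
    dev-set score (here, macro-averaged F1 returned by dev_eval)."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    best_f1, best_state = -1.0, None
    for _ in range(epochs):
        for xb, yb in train_loader:
            opt.zero_grad()
            loss = loss_fn(model(xb), yb)
            loss.backward()
            opt.step()
        f1 = dev_eval(model)  # macro-averaged F1 on the development set
        if f1 > best_f1:
            best_f1 = f1
            # Snapshot the current weights as the best checkpoint so far
            best_state = {k: v.clone() for k, v in model.state_dict().items()}
    model.load_state_dict(best_state)  # restore the best checkpoint
    return best_f1
```

Restoring the best checkpoint at the end matters because the final epoch is not necessarily the best one on the development set.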


As mentioned above, the macro-averaged F1-Score was used for evaluation in this shared task. The results of our submissions are shown in comparison with the other experiments in the table below.


Model               Arabic   Greek   Turkish   Average
SVM with TF-IDF     0.772    0.823   0.685     0.760
Multilingual BERT   0.808    0.807   0.774     0.796
BiLSTM              0.822    0.826   0.755     0.801
CNN-Text            0.840    0.825   0.751     0.805
BERT                0.884    0.822   0.816     0.841
BERT-CNN (Ours)     0.897    0.843   0.814     0.851

Macro averaged F1-Scores of our submissions and the other experiments on test data

Comparing against the average results of the BERT model on its own, we can see the improvement achieved by combining BERT with a CNN. Additionally, we can clearly observe the advantage of using language-specific pre-trained models over the multilingual one.

In conclusion, our proposed model, with minimal text pre-processing, achieved strong results on average, and our team ranked among the top four participating teams for all languages in the scope of OffensEval2020.