ROCLING 2025 Shared Task
Chinese Dimensional Sentiment Analysis for Medical Self-Reflection Texts (DSA-MST)

Organizers

李龍豪 Lung-Hao Lee、林孜彌 Tzu-Mi Lin
國立陽明交通大學 智能系統所
Institute of Artificial Intelligence Innovation
National Yang Ming Chiao Tung University

施琇敏 Hsiu-Min Shih、徐國鎧 Kuo-Kai Shyu
國立中央大學 電機工程學系
Department of Electrical Engineering
National Central University

許善淳 Anna S. Hsu、呂佩穎 Peih-Ying Lu
高雄醫學大學 醫學系 醫學人文與教育學科
Department of Medical Humanities and Education
Kaohsiung Medical University

Registration

CodaBench page: https://www.codabench.org/competitions/3306/

Contact

Lung-Hao Lee: lhlee@nycu.edu.tw

Tzu-Mi Lin: ltmdegf4.ii12@nycu.edu.tw

Background

Sentiment analysis has emerged as a leading technique to automatically identify affective information within texts. In sentiment analysis, affective states are generally represented using either categorical or dimensional approaches (Calvo and Kim, 2013). The categorical approach represents affective states as several discrete classes (e.g., positive, negative, neutral), while the dimensional approach represents affective states as continuous numerical values on multiple dimensions, such as the valence-arousal (VA) space (Russell, 1980), as shown in Fig. 1. Valence represents the degree of pleasant and unpleasant (or positive and negative) feelings, and arousal represents the degree of excitement and calm. Based on this two-dimensional representation, any affective state can be represented as a point in the VA coordinate plane by determining the degrees of valence and arousal of given words (Wei et al., 2011; Malandrakis et al., 2013; Wang et al., 2016; Du and Zhang, 2016; Wu et al., 2017; Yu et al., 2020; Deng et al., 2022) or texts (Kim et al., 2010; Paltoglou et al., 2013; Goel et al., 2017; Zhu et al., 2019; Wang et al., 2019; 2020; Deng et al., 2023).

The first dimensional sentiment analysis (DSA) task for Chinese words (Yu et al., 2016b) was organized at the IALP 2016 conference. The second edition of the DSA task was organized at the IJCNLP 2017 conference to include both Chinese words and phrases (Yu et al., 2017). The third edition was organized at the ROCLING 2021 conference to explore sentence-level dimensional sentiment analysis on educational texts (students’ self-evaluated comments) (Yu et al., 2021). This year, we organize the fourth edition of the DSA task to analyze medical multi-sentence texts (doctors’ self-reflections on their feelings).

Task Description

Dimensional sentiment analysis is an effective technique for recognizing valence-arousal ratings in texts, where valence ranges from most negative to most positive and arousal ranges from most calm to most excited. In this task, participants are asked to provide a real-valued score from 1 to 9 for both the valence and arousal dimensions of each doctor’s self-reflection text. The input format is “ID, text”, and the output format is “ID, valence_rating, arousal_rating”. The examples below illustrate the input and output formats.

Example 1

Input: ex01, 主治醫師曾經多次強調血液透析和輸血,以病人的狀況就是不建議,已經在加護病房積極治療了兩個禮拜,家屬却遲遲無法達到共識。

Output: ex01, 4.750, 2.750

Example 2

Input: ex02, 視病如親,這個成語一直是一個難以達成的理想,但在ICU我感受到醫療端與病人和家屬站在同一陣線、共同努力對抗病魔,完成病人的願望的努力,讓我十分的動容。

Output: ex02, 6.900, 5.600
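The input and output formats above can be handled with a few lines of Python. The following is a minimal sketch, not an official submission script: the helper names (parse_input_line, clip_rating, format_output_line) are hypothetical, splitting at the first ASCII comma is an assumption about how the ID is separated from the text, and the three-decimal formatting simply mirrors the examples above. The exact file layout expected on CodaBench should be checked on the competition page.

```python
def parse_input_line(line: str) -> tuple[str, str]:
    """Split an "ID, text" line into (id, text).

    Assumption: the ID never contains a comma, so only the first ASCII
    comma is treated as the separator; later commas stay in the text.
    """
    text_id, text = line.split(",", 1)
    return text_id.strip(), text.strip()


def clip_rating(value: float, low: float = 1.0, high: float = 9.0) -> float:
    """Keep a predicted rating inside the task's 1 to 9 scale."""
    return max(low, min(high, value))


def format_output_line(text_id: str, valence: float, arousal: float) -> str:
    """Build an "ID, valence_rating, arousal_rating" line.

    Assumption: three decimals, as in the examples; the task text itself
    only requires real-valued scores.
    """
    return f"{text_id}, {clip_rating(valence):.3f}, {clip_rating(arousal):.3f}"
```

For instance, format_output_line("ex01", 4.75, 2.75) produces the line shown as the output of Example 1.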

Data


Training Set: Chinese EmoBank

http://nlp.innobic.yzu.edu.tw/resources/ChineseEmoBank.html


The Chinese EmoBank (Lee et al., 2022) is a dimensional sentiment resource annotated with real-valued scores for both valence and arousal dimensions. Valence represents the degree of positive and negative sentiment, and arousal represents the degree of calm and excitement. Both dimensions range from 1 (highly negative or calm) to 9 (highly positive or excited). The Chinese EmoBank covers several levels of text granularity. It includes two lexicons, Chinese valence-arousal words (CVAW, 5,512 single words) and Chinese valence-arousal phrases (CVAP, 2,998 multi-word phrases), and two corpora, Chinese valence-arousal sentences (CVAS, 2,582 single sentences) and Chinese valence-arousal texts (CVAT, 2,969 multi-sentence texts).


Notes

  • The datasets are authorized for research use.
  • The policy of this shared task is an open test. Participating systems are allowed to use other publicly available data for this shared task, but the use of other data should be specified in the final system description paper.

Validation Set

There are 994 doctors' self-reflection texts for system development.

Test Set

We will provide at least 1,500 doctors’ self-reflection texts for system performance evaluation.

Evaluation

Performance is evaluated by examining the difference between machine-predicted ratings and human-annotated ratings (valence and arousal are treated independently). The evaluation metrics are Mean Absolute Error (MAE) and Pearson Correlation Coefficient (PCC), defined as follows:

$$ MAE = \frac{1}{n} \sum_{i=1}^{n}|a_{i}-p_{i}| $$

$$ PCC = \frac{1}{n-1} \sum_{i=1}^{n}(\frac{a_{i}-\mu_{A}}{\sigma_{A}})(\frac{p_{i}-\mu_{P}}{\sigma_{P}}) $$

where \( a_{i}\in{A} \) and \( p_{i}\in{P} \) respectively denote the i-th actual value and predicted value, n is the number of test samples, and \( \mu_{A} \) and \( \sigma_{A} \) respectively represent the mean value and the standard deviation of A, while \( \mu_{P} \) and \( \sigma_{P} \) respectively represent the mean value and the standard deviation of P.
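The two formulas translate directly into code. The sketch below is an unofficial reading of them in plain Python, not the organizers' scoring script. The function names are hypothetical, and the standard deviations are assumed to be sample standard deviations (divisor n − 1), which is the choice that keeps the PCC formula above within −1 and 1.

```python
import math


def mean_absolute_error(actual, predicted):
    """MAE = (1/n) * sum(|a_i - p_i|)."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def sample_std(values, mean):
    """Sample standard deviation (divisor n - 1); an assumption, see lead-in."""
    return math.sqrt(sum((v - mean) ** 2 for v in values) / (len(values) - 1))


def pearson_correlation(actual, predicted):
    """PCC = (1/(n-1)) * sum(((a_i - mu_A)/sigma_A) * ((p_i - mu_P)/sigma_P))."""
    n = len(actual)
    if n != len(predicted) or n < 2:
        raise ValueError("inputs must have equal length of at least 2")
    mu_a = sum(actual) / n
    mu_p = sum(predicted) / n
    sigma_a = sample_std(actual, mu_a)
    sigma_p = sample_std(predicted, mu_p)
    if sigma_a == 0 or sigma_p == 0:
        raise ValueError("PCC is undefined when either series is constant")
    total = sum(
        ((a - mu_a) / sigma_a) * ((p - mu_p) / sigma_p)
        for a, p in zip(actual, predicted)
    )
    return total / (n - 1)
```

Each function would be called twice per system, once with the valence ratings and once with the arousal ratings, since the two dimensions are scored independently.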

The actual and predicted values range from 1 to 9, so the MAE lies between 0 (no error) and 8 (maximum error). The PCC is a value between −1 and 1 that measures the linear correlation between the actual and predicted values. A lower MAE and a higher PCC indicate more accurate prediction performance. Each metric for the valence and arousal dimensions is ranked independently. A model’s overall ranking is computed from the mean rank across the four metrics (valence MAE, valence PCC, arousal MAE, and arousal PCC). The lower the mean rank, the better the system performance.
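The mean-rank computation can be sketched as follows. This is an illustration under stated assumptions, not the official ranking code: the team names and scores are invented, the function names are hypothetical, and tied scores are assumed to share the best available rank, since the task description does not say how ties are resolved.

```python
# Metric name -> True if a higher score is better.
METRICS = {"V-MAE": False, "V-PCC": True, "A-MAE": False, "A-PCC": True}


def rank_scores(scores, higher_is_better):
    """Rank teams on one metric; the best team gets rank 1.

    Assumption: tied teams share the same rank (competition ranking).
    """
    ranks = {}
    for team, value in scores.items():
        if higher_is_better:
            better = sum(1 for other in scores.values() if other > value)
        else:
            better = sum(1 for other in scores.values() if other < value)
        ranks[team] = better + 1
    return ranks


def mean_ranks(results):
    """Average each team's rank over the four metrics (lower is better)."""
    totals = {team: 0 for team in results}
    for metric, higher_is_better in METRICS.items():
        per_metric = {team: row[metric] for team, row in results.items()}
        for team, rank in rank_scores(per_metric, higher_is_better).items():
            totals[team] += rank
    return {team: total / len(METRICS) for team, total in totals.items()}


# Invented scores for three hypothetical teams.
toy_results = {
    "team_a": {"V-MAE": 0.40, "V-PCC": 0.80, "A-MAE": 0.70, "A-PCC": 0.60},
    "team_b": {"V-MAE": 0.45, "V-PCC": 0.85, "A-MAE": 0.75, "A-PCC": 0.55},
    "team_c": {"V-MAE": 0.60, "V-PCC": 0.60, "A-MAE": 0.90, "A-PCC": 0.30},
}

print(mean_ranks(toy_results))
# {'team_a': 1.25, 'team_b': 1.75, 'team_c': 3.0}
```

Here team_a ranks first on three metrics and second on valence PCC, giving a mean rank of 1.25 and the best overall position.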

Official Ranking

Notes: Each metric is ranked independently. (*) denotes the rank for each metric. A system’s overall ranking is computed from the mean rank across metrics. The lower the mean rank, the better the system performance.


Team          V-MAE   V-PCC   A-MAE   A-PCC   Overall Rank
CYUT-NLP      0.46    0.78    0.74    0.63    1
TCU           0.46    0.81    0.76    0.61    2
ntulaw        0.50    0.75    0.79    0.59    3
SCU-NLP       0.51    0.76    0.87    0.59    4
Monokeros     0.53    0.76    0.82    0.58    5
Hey Vergil    0.63    0.62    1.01    0.21    6

Important Dates

  • Release of test data: August 25, 2025
  • Testing results submission due: August 27, 2025
  • Release of evaluation results: August 29, 2025
  • System description paper due: September 26, 2025 (extended from September 12, 2025)
  • Notification of Acceptance: October 12, 2025
  • Camera-ready deadline: October 23, 2025

References

  • Rafael A. Calvo and Sunghwan Mac Kim. 2013. Emotions in text: dimensional and categorical models. Computational Intelligence, 29(3):527-543.
  • Munmun De Choudhury, Scott Counts, and Michael Gamon. 2012. Not all moods are created equal! Exploring human emotional states in social media. In Proc. of ICWSM-12, pages 66-73.
  • Yu-Chih Deng, Cheng-Yu Tsai, Yih-Ru Wang, Sin-Horng Chen, and Lung-Hao Lee. 2022. Predicting Chinese Phrase-level Sentiment Intensity in Valence-Arousal Dimensions with Linguistic Dependency Features. IEEE Access, 10:126612-126620.
  • Yu-Chih Deng, Yih-Ru Wang, Sin-Horng Chen, and Lung-Hao Lee. 2023. Towards Transformer Fusions for Chinese Sentiment Intensity Prediction in Valence-Arousal Dimensions. IEEE Access, 11:109974-109982.
  • Steven Du and Xi Zhang. 2016. Aicyber’s system for IALP 2016 shared task: Character-enhanced word vectors and Boosted Neural Networks. In Proc. of IALP-16, pages 161-163.
  • Pranav Goel, Devang Kulshreshtha, Prayas Jain, and Kaushal Kumar Shukla. 2017. Prayas at EmoInt 2017: An Ensemble of Deep Neural Architectures for Emotion Intensity Prediction in Tweets. In Proc. of WASSA-17, pages 58-65.
  • Sunghwan Mac Kim, Alessandro Valitutti, and Rafael A. Calvo. 2010. Evaluation of unsupervised emotion models to textual affect recognition. In Proc. of the NAACL HLT 2010 Workshop on Computational Approaches to Analysis and Generation of Emotion in Text, pages 62-70.
  • Lung-Hao Lee, Jian-Hong Li, and Liang-Chih Yu. 2022. Chinese EmoBank: Building Valence-Arousal Resources for Dimensional Sentiment Analysis. ACM Transactions on Asian and Low-Resource Language Information Processing, 21(4): Article 65, 1-18.
  • N. Malandrakis, A. Potamianos, E. Iosif, and S. Narayanan. 2013. Distributional semantic models for affective text analysis. IEEE Transactions on Audio, Speech, and Language Processing, 21(11): 2379-2392.
  • Myriam Munezero, Tuomo Kakkonen, and Calkin S. Montero. 2011. Towards automatic detection of antisocial behavior from texts. In Proc. of the Workshop on Sentiment Analysis where AI meets Psychology (SAAIP) at IJCNLP-11, pages 20-27.
  • Georgios Paltoglou, Mathias Theunis, Arvid Kappas, and Mike Thelwall. 2013. Predicting emotional responses to long informal text. IEEE Trans. Affective Computing, 4(1):106-115.
  • Jie Ren and Jeffrey V. Nickerson. 2014. Online review systems: How emotional language drives sales. In Proc. of AMCIS-14.
  • James A. Russell. 1980. A circumplex model of affect. Journal of Personality and Social Psychology, 39(6):1161.
  • Wen-Li Wei, Chung-Hsien Wu, and Jen-Chun Lin. 2011. A regression approach to affective rating of Chinese words from ANEW. In Proc. of ACII-11, pages 121-131.
  • Liang-Chih Yu, Cheng-Wei Lee, Huan-Yi Pan, Chih-Yueh Chou, Po-Yao Chao, Zhi-Hong Chen, Shu-Fen Tseng, Chien-Lung Chan, and K. Robert Lai. 2018. Improving early prediction of academic failure using sentiment analysis on self-evaluated comments. Journal of Computer Assisted Learning, 34(4):358-365.
  • Liang-Chih Yu, Lung-Hao Lee, Shuai Hao, Jin Wang, Yunchao He, Jun Hu, K. Robert Lai, and Xuejie Zhang. 2016a. Building Chinese affective resources in valence-arousal dimensions. In Proc. of NAACL/HLT-16, pages 540-545.
  • Liang-Chih Yu, Lung-Hao Lee, Jin Wang, and Kam-Fai Wong. 2017. IJCNLP-2017 Task 2: Dimensional sentiment analysis for Chinese phrases. In Proc. of IJCNLP-17, pages 9-16.
  • Liang-Chih Yu, Lung-Hao Lee, and Kam-Fai Wong. 2016b. Overview of the IALP 2016 shared task on dimensional sentiment analysis for Chinese words. In Proc. of IALP-16, pages 156-160.
  • Liang-Chih Yu, Jin Wang, K. Robert Lai, and Xuejie Zhang. 2020. Pipelined neural networks for phrase-level sentiment intensity prediction. IEEE Transactions on Affective Computing, 11(3):447-458.
  • Liang-Chih Yu, Jin Wang, Bo Peng, and Chu-Ren Huang. 2021. ROCLING-2021 shared task: dimensional sentiment analysis for educational texts. In Proc. of ROCLING-21, pages 385-388.
  • Jin Wang, Liang-Chih Yu, K. Robert Lai, and Xuejie Zhang. 2016. Community-based weighted graph model for valence-arousal prediction of affective words. IEEE/ACM Trans. Audio, Speech and Language Processing, 24(11):1957-1968.
  • Jin Wang, Liang-Chih Yu, K. Robert Lai, and Xuejie Zhang. 2020. Tree-structured regional CNN-LSTM model for dimensional sentiment analysis. IEEE/ACM Transactions on Audio, Speech and Language Processing, 28:581-591.
  • Chuhan Wu, Fangzhao Wu, Yongfeng Huang, Sixing Wu, and Zhigang Yuan. 2017. THU NGN at IJCNLP-2017 Task 2: Dimensional sentiment analysis for Chinese phrases with deep LSTM. In Proc. of IJCNLP-17, pages 42-52.
  • Suyang Zhu, Shoushan Li, and Guodong Zhou. 2019. Adversarial attention modeling for multi-dimensional emotion regression. In Proc. of ACL-19, pages 471-480.