TY - GEN
T1 - Automatic classification of securities using hierarchical clustering of the 10-Ks
AU - Yang, Hoseong
AU - Lee, Hye Jin
AU - Cho, Sungzoon
AU - Cho, Eugene
N1 - Publisher Copyright: © 2016 IEEE.
PY - 2016
Y1 - 2016
AB - Industry classification has been rigorously utilized in academic research and business analytics. The existing classification schemes, however, have been constructed and maintained manually by domain experts, which requires exhaustive time and human effort and leaves them vulnerable to subjectivity. Hence, the existing classification systems do not properly reflect the fast-changing trends of firms and the capital market. As a remedy to these shortcomings, this paper proposes a new classification scheme, namely Business Text Industry Classification (BTIC), that automatically clusters securities based on the textual information in corporate disclosures. BTIC exploits the business section of Form 10-K filings, in which firms describe their own identities in rich context. We employ doc2vec for document embedding and apply Ward's hierarchical clustering method to categorize securities into BTIC groups. Evaluation results using 12 financial ratios commonly found in financial research show that BTIC performs just as well as SIC and GICS in terms of inter- and intra-industry homogeneity, especially at the higher levels of clustering. Given that, we claim that BTIC outperforms SIC and GICS in four aspects: process automation, objectivity, clustering flexibility, and result interpretability.
KW - 10-K
KW - Capital market research
KW - Doc2vec
KW - GICS
KW - Hierarchical clustering
KW - Industry classification
KW - SIC
UR - http://www.scopus.com/inward/record.url?scp=85015187185&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85015187185&partnerID=8YFLogxK
U2 - 10.1109/BigData.2016.7841069
DO - 10.1109/BigData.2016.7841069
M3 - Conference contribution
AN - SCOPUS:85015187185
T3 - Proceedings - 2016 IEEE International Conference on Big Data, Big Data 2016
SP - 3936
EP - 3943
BT - Proceedings - 2016 IEEE International Conference on Big Data, Big Data 2016
A2 - Ak, Ronay
A2 - Karypis, George
A2 - Xia, Yinglong
A2 - Hu, Xiaohua Tony
A2 - Yu, Philip S.
A2 - Joshi, James
A2 - Ungar, Lyle
A2 - Liu, Ling
A2 - Sato, Aki-Hiro
A2 - Suzumura, Toyotaro
A2 - Rachuri, Sudarsan
A2 - Govindaraju, Rama
A2 - Xu, Weijia
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 4th IEEE International Conference on Big Data, Big Data 2016
Y2 - 5 December 2016 through 8 December 2016
ER -