Automatic classification of securities using hierarchical clustering of the 10-Ks

Hoseong Yang, Hye Jin Lee, Sungzoon Cho, Eugene Cho

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

7 Scopus citations

Abstract

Industry classification has been used extensively in academic research and business analytics. The existing classification schemes, however, are constructed and maintained manually by domain experts, which requires extensive time and human effort and leaves the schemes vulnerable to subjectivity. As a result, the existing classification systems do not properly reflect fast-changing trends among firms and in the capital market. To remedy these shortcomings, this paper proposes a new classification scheme, Business Text Industry Classification (BTIC), that automatically clusters securities based on textual information from corporate disclosures. BTIC exploits the business section of Form 10-K filings, in which firms describe their own identities in rich context. We employ doc2vec for document embedding and apply Ward's hierarchical clustering method to categorize securities into BTIC groups. Evaluation results using 12 financial ratios commonly found in financial research show that BTIC performs as well as SIC and GICS in terms of inter- and intra-industry homogeneity, especially at the higher levels of clustering. Given that, we claim that BTIC outperforms SIC and GICS in four aspects: process automation, objectivity, clustering flexibility, and result interpretability.
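
The clustering step named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the two-dimensional vectors are hypothetical stand-ins for doc2vec embeddings of 10-K business sections, the function name `ward_clusters` is invented for illustration, and the greedy merge loop is a plain-Python rendering of Ward's criterion (merge the pair of clusters whose union least increases the within-cluster sum of squares).

```python
from itertools import combinations


def ward_clusters(vectors, k):
    """Greedy Ward agglomeration: repeatedly merge the two clusters whose
    union least increases the within-cluster sum of squares, until k remain."""
    # Each cluster is a pair (member indices, centroid).
    clusters = [([i], list(v)) for i, v in enumerate(vectors)]

    def merge_cost(a, b):
        (ma, ca), (mb, cb) = a, b
        sq_dist = sum((x - y) ** 2 for x, y in zip(ca, cb))
        # Ward's increase in sum of squares for merging a and b.
        return len(ma) * len(mb) / (len(ma) + len(mb)) * sq_dist

    while len(clusters) > k:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: merge_cost(clusters[p[0]], clusters[p[1]]),
        )
        (ma, ca), (mb, cb) = clusters[i], clusters[j]
        na, nb = len(ma), len(mb)
        # Size-weighted mean of the two centroids.
        centroid = [(na * x + nb * y) / (na + nb) for x, y in zip(ca, cb)]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append((ma + mb, centroid))
    return sorted(sorted(members) for members, _ in clusters)


# Hypothetical 2-D stand-ins for doc2vec embeddings of six filings.
embeddings = [
    (0.0, 0.0), (0.1, 0.0),
    (5.0, 5.0), (5.1, 5.0),
    (9.0, 0.0), (9.0, 0.2),
]

fine = ward_clusters(embeddings, 3)    # lower, more granular level
coarse = ward_clusters(embeddings, 2)  # higher, broader level
```

Cutting the same hierarchy at different numbers of groups is what the abstract refers to as clustering flexibility: here `fine` keeps three tight pairs, while `coarse` merges two of them into one broader group.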

Original language: English (US)
Title of host publication: Proceedings - 2016 IEEE International Conference on Big Data, Big Data 2016
Editors: Ronay Ak, George Karypis, Yinglong Xia, Xiaohua Tony Hu, Philip S. Yu, James Joshi, Lyle Ungar, Ling Liu, Aki-Hiro Sato, Toyotaro Suzumura, Sudarsan Rachuri, Rama Govindaraju, Weijia Xu
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 3936-3943
Number of pages: 8
ISBN (Electronic): 9781467390040
State: Published - 2016
Externally published: Yes
Event: 4th IEEE International Conference on Big Data, Big Data 2016 - Washington, United States
Duration: Dec 5 2016 → Dec 8 2016

Publication series

Name: Proceedings - 2016 IEEE International Conference on Big Data, Big Data 2016

Conference

Conference: 4th IEEE International Conference on Big Data, Big Data 2016
Country/Territory: United States
City: Washington
Period: 12/5/16 → 12/8/16

All Science Journal Classification (ASJC) codes

  • Computer Networks and Communications
  • Information Systems
  • Hardware and Architecture

Keywords

  • 10-K
  • Capital market research
  • Doc2vec
  • GICS
  • Hierarchical clustering
  • Industry classification
  • SIC
