Anomaly detection is a fundamental problem in data mining with many real-world applications. The vast majority of existing anomaly detection methods focus predominantly on data collected from a single source. In real-world applications, however, instances often have multiple types of features, such as images (ID photos, fingerprints) and texts (bank transaction histories, users' online social media posts), resulting in so-called multi-modal data. In this paper, we focus on identifying anomalies whose patterns are disparate across different modalities, i.e., cross-modal anomalies. Such instances often do not appear anomalous when each modality is viewed separately, but exhibit inconsistent patterns when multiple modalities are considered jointly. The existence of multi-modal data in many real-world scenarios brings both opportunities and challenges to the canonical task of anomaly detection. On the one hand, information from different modalities may complement each other and improve detection performance. On the other hand, the complicated distributions across different modalities call for a principled framework to characterize their inherent and complex correlations, which are often difficult to capture with conventional linear models. To this end, we propose a novel deep structured anomaly detection framework to identify the cross-modal anomalies embedded in the data. Experiments on real-world datasets demonstrate the effectiveness of the proposed framework compared with state-of-the-art methods.