Software Dependability and Security
Software dependability and security are critical to assuring the resilience of complex systems. Despite decades of work in this area, software remains a weak link in system integrity, leading to failures that compromise safety and/or impose financial costs. The challenge is at once critically important and immense. We believe progress is best made through a new approach that focuses on mitigating the types of software bugs that are most difficult to address with conventional methods, and the team we have assembled is singularly well qualified to pursue this path. To meet the challenge, the proposed program will carry out education in the area of software faults, failures and their mitigations during the development cycle and, specifically, during system operation. We have pioneered a course for graduate students, young researchers and software engineers studying or working in the software engineering field. This new course is playing an important role in both the master's program for electrical and computer engineering and the undergraduate program for interdisciplinary data science at Duke Kunshan University.
Environment Diversity-based Software Fault Tolerance and Its Applications
Modern life depends on devices and systems containing a moderate to significant amount of software, whose reliability is critical to the reliability of the system as a whole. Software fault tolerance has hitherto been based on design diversity, and its high implementation cost has largely limited its application to safety-critical systems. This project studies affordable software fault tolerance using the newer notion of environmental diversity. The key idea is predicated on the existence of elusive software faults, known as environment-dependent bugs or Mandelbugs, whose manifestation is transient. The environment of a software system is here taken to mean the operating system resources and other concurrently running applications. The project focuses on four research aspects: environmental factor identification, key environmental control techniques, environmental diversity-based fault tolerance approaches, and applications to Android systems. Research is being carried out on failure data analysis, experiments with accelerated life testing, and analytic modeling and optimization techniques for open source software. The results of the research will help reduce the cost of software fault tolerance while reducing the impact of environment-dependent bugs on software reliability and availability. They will also contribute to the emergence and development of research on environment-dependent bugs in software engineering.
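The idea of environmental diversity can be sketched in a few lines: the same code is re-executed under a changed environment instead of switching to an independently designed version. The following is a minimal illustration only, not the project's implementation; `TransientFailure`, `flaky_operation`, the `free_memory_mb` field and the escalation order are all hypothetical names and values chosen for the example.

```python
from typing import Callable, Sequence


class TransientFailure(Exception):
    """Stands in for a failure triggered by an environment-dependent bug."""


def run_with_environment_diversity(
    operation: Callable[[dict], str],
    environments: Sequence[dict],
) -> str:
    """Re-execute the same operation under each environment until one succeeds."""
    last_error = None
    for env in environments:
        try:
            return operation(env)
        except TransientFailure as err:
            last_error = err  # same code, try the next environment
    raise RuntimeError("all environments exhausted") from last_error


def flaky_operation(env: dict) -> str:
    # Hypothetical Mandelbug: the failure shows up only under memory pressure.
    if env["free_memory_mb"] < 100:
        raise TransientFailure("allocation failed under memory pressure")
    return "ok"


# Hypothetical escalation: plain retry, then freeing memory, then a restart.
environments = [
    {"label": "retry", "free_memory_mb": 50},
    {"label": "after freeing memory", "free_memory_mb": 80},
    {"label": "after restart", "free_memory_mb": 512},
]

result = run_with_environment_diversity(flaky_operation, environments)
print(result)
```

Because only the environment changes between attempts, no second implementation of the operation is needed, which is where the cost advantage over design diversity would come from.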
Audio Speech and Language Processing
Prof. Ming Li and his lab conduct research in the area of Audio, Speech and Language Processing. In the 2021 calendar year, they published more than 10 papers in top conferences and journals in this field. The topics include speaker recognition, speaker diarization, speech synthesis, speech separation, paralinguistic speech attribute recognition, etc. They have worked with multiple industry leaders and local companies on collaborative research and technology transfer.
Multimodal Behavior Signal Analysis
Prof. Ming Li and his lab conduct research in the area of Multimodal Behavior Signal Analysis and Interpretation towards AI-assisted diagnosis of Autism Spectrum Disorder (ASD). They have developed an AI studio for the early screening of ASD. The studio’s four walls are programmable projection screens that can recreate a variety of settings, such as a forest environment, with sound delivered through multichannel audio equipment. The therapist can use the studio to interact with the child, for example by asking him or her to point at a certain object projected onto the wall and observing the reaction. At the same time, cameras capture the movements of the child and the therapist, including gestures, gazes and other actions. The studio is equipped with more than 10 technologies that have obtained or are in the process of obtaining patents. These include technologies that assist with gaze detection, human pose estimation, face detection, face recognition, speech recognition and paralinguistic attribute detection.
Predicting the Risk of Rupture for Vertebral Aneurysm based on geometric Features of Blood
A significant proportion of the adult population worldwide suffers from cerebral aneurysms. In this project, we investigate the possibility of using machine learning algorithms to predict the rupture risk of vertebral artery fusiform aneurysms based on geometric features of the blood vessels surrounding, but excluding, the aneurysm. A decision tree model using two of the features (standard deviation of the eccentricity of the proximal vessel, and diameter at the distal endpoint) achieved 83.8% classification accuracy. With support vector machine and logistic regression models, we also achieved 83.8% accuracy using another set of two features (ratio of mean curvature between distal and proximal parts, and diameter at the distal endpoint). When the aforementioned three features are combined with the integration of curvature of the proximal vessel and the ratio of the mean cross-sectional area between distal and proximal parts, these models achieve 94.6% accuracy. These results strongly suggest the usefulness of geometric features in predicting the risk of rupture.
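To make the two-feature decision tree concrete, the sketch below fits a fixed-structure, two-level tree by exhaustive threshold search and reports its training accuracy. It is an illustration under stated assumptions, not the study's model: the rows are synthetic toy values invented for the example, and the tree shape, the function names and the search procedure are all hypothetical.

```python
from typing import List, Tuple

# Synthetic toy rows (NOT study data):
# (sd of proximal eccentricity, distal endpoint diameter in mm, ruptured 0/1)
Row = Tuple[float, float, int]
samples: List[Row] = [
    (0.02, 2.1, 0),
    (0.03, 2.4, 0),
    (0.04, 3.6, 1),
    (0.09, 2.2, 1),
    (0.08, 3.9, 1),
    (0.02, 2.0, 0),
    (0.10, 3.5, 1),
    (0.03, 2.6, 0),
]


def predict(row: Row, t_ecc: float, t_diam: float) -> int:
    """Two-level tree: split on eccentricity first, then on diameter."""
    ecc_sd, diameter, _ = row
    if ecc_sd > t_ecc:
        return 1
    return int(diameter > t_diam)


def accuracy(predictions: List[int], labels: List[int]) -> float:
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return correct / len(labels)


def fit_tree(rows: List[Row]) -> Tuple[float, float, float]:
    """Try every pair of observed values as thresholds; keep the most accurate."""
    labels = [r[2] for r in rows]
    best = (-1.0, 0.0, 0.0)  # (accuracy, eccentricity threshold, diameter threshold)
    for t_ecc in sorted({r[0] for r in rows}):
        for t_diam in sorted({r[1] for r in rows}):
            preds = [predict(r, t_ecc, t_diam) for r in rows]
            acc = accuracy(preds, labels)
            if acc > best[0]:
                best = (acc, t_ecc, t_diam)
    return best


train_accuracy, t_ecc, t_diam = fit_tree(samples)
print(train_accuracy, t_ecc, t_diam)
```

Training accuracy on a toy set says nothing about generalization; a real evaluation would rely on held-out data or cross-validation.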
Data Analytics for Smart Manufacturing
To ensure high quality and yield, today’s advanced manufacturing systems are equipped with thousands of sensors that continuously collect measurement data for process monitoring, defect diagnosis and yield learning. In particular, the recent adoption of Industry 4.0 has promoted a set of enabling technologies for low-cost sensing, processing and storage of manufacturing process data. While the manufacturing industry has created a large amount of data, statistical algorithms, methodologies and tools are urgently needed to process these complex, heterogeneous and high-dimensional data in order to address the issues posed by process complexity, process variability and capacity constraints. The objective of this project is to explore the enormous opportunities for data analytics in the manufacturing domain and provide data-driven solutions for manufacturing cost reduction.
Artificial intelligence for DRAM RAS and Storage Unit Isolation
Reliability, Availability, and Serviceability (RAS) are core competencies of cloud services. As one of the fastest-growing components in the von Neumann computer architecture, memory plays an important role. Because memory directly provides the data cache to the central processor, a failure of the memory system can directly cause the processor to stop responding or even the system to crash. In this project, we propose a new uncorrectable error (UCE) prediction algorithm that applies machine learning to the spatial-temporal information of correctable errors (CEs). In tests on log data from 3113 servers in a public cloud, the recall is 20% higher than current industrial and academic results at the same precision. In addition, by combining cluster analysis with the physical mechanism of the faults, we propose a strategy for physical fault detection and risk region localization. Finally, a DRAM fault simulator has been built to study the RAS of DRAM.
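The kind of spatial-temporal CE information such a predictor could consume, and the precision and recall used to score it, can be sketched as follows. This is a hedged illustration, not the project's algorithm: the `CorrectableError` record layout, the feature names, the time window and the threshold rule that stands in for the learned model are all assumptions made for the example, and the events are invented.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple, Set, Tuple


class CorrectableError(NamedTuple):
    server: str
    timestamp: int  # seconds since an arbitrary epoch
    bank: int
    row: int
    column: int


def ce_features(
    events: List[CorrectableError], now: int, window: int
) -> Dict[str, Dict[str, int]]:
    """Summarize each server's CEs observed in the `window` seconds up to `now`."""
    recent = defaultdict(list)
    for e in events:
        if now - window <= e.timestamp <= now:
            recent[e.server].append(e)
    return {
        server: {
            "ce_count": len(ces),  # temporal intensity
            "distinct_banks": len({c.bank for c in ces}),  # spatial spread
            "distinct_rows": len({(c.bank, c.row) for c in ces}),
            "distinct_columns": len({(c.bank, c.column) for c in ces}),
        }
        for server, ces in recent.items()
    }


def precision_recall(predicted: Set[str], actual: Set[str]) -> Tuple[float, float]:
    """Precision and recall of the predicted at-risk servers."""
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall


# Invented events for illustration only.
events = [
    CorrectableError("s1", 100, 0, 7, 3),
    CorrectableError("s1", 160, 0, 7, 9),
    CorrectableError("s1", 170, 0, 7, 12),
    CorrectableError("s2", 150, 1, 4, 4),
    CorrectableError("s1", 10, 2, 1, 1),  # falls outside the window
]
features = ce_features(events, now=200, window=100)

# Hypothetical stand-in for a trained model: repeated CEs confined to one row.
predicted = {
    s for s, f in features.items() if f["ce_count"] >= 3 and f["distinct_rows"] == 1
}
precision, recall = precision_recall(predicted, actual={"s1", "s3"})
print(features, precision, recall)
```

In practice the threshold rule would be replaced by a classifier trained on such features, and comparing recall at a fixed precision requires sweeping the classifier's decision threshold.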
Digital Marketing Based on Data Analytics
In this project, we collaborate closely with leading domestic enterprises. Using sales and logistics data, we provide customers with guidance on pricing and discounts for all categories of products. The project is combined with new retailing, applying data-driven methodology to all aspects from production to sales and providing advice on enterprise data management.
Intelligent Seal Recognition and Authentication Based on Deep Learning
In this project, we aim to verify the stamps on scanned vouchers by comparing their images with the pre-saved (i.e., true) copies. The proposed algorithm flow is composed of two major steps: (i) extracting the binary masks for the stamp images from the scanned voucher and the true copy respectively, and (ii) comparing the resulting binary masks while accounting for environmental non-idealities such as shifting, rotation, scaling, illumination variations, background noise, etc.
To facilitate robust seal recognition and authentication, a number of novel techniques have been proposed based on deep learning. First, the problem of binary mask extraction is cast as a semantic segmentation task. By building an appropriate encoder-decoder based on a convolutional neural network (CNN), improving the loss function for classification and exploiting data augmentation, the proposed approach can accurately and efficiently extract the required binary masks for different colors and shapes in the presence of large illumination variations and background noise. Second, once the binary masks are available, a set of deep neural networks (DNNs) is further developed for efficient seal recognition and authentication, as shown in the following figure. In the proposed network architecture, a CNN is used for key-point detection, a graph neural network-based image registration is adopted to match the two masks from the scanned voucher and the true copy and, finally, a DNN is trained for error classification in order to generate the authentication outcome.
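Why step (ii) needs registration before comparison can be seen with a classical baseline: a small translation alone makes the raw overlap of two identical masks collapse. The sketch below searches over integer shifts and scores each alignment by intersection over union (IoU). It is not the graph neural network-based registration described above, handles translation only (no rotation or scaling), and every name and the toy masks are hypothetical.

```python
from typing import List, Tuple

Mask = List[List[int]]  # rows of 0/1 pixels


def shifted(mask: Mask, dy: int, dx: int) -> Mask:
    """Translate a binary mask by (dy, dx), padding uncovered cells with 0."""
    h, w = len(mask), len(mask[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            sy, sx = y - dy, x - dx
            if 0 <= sy < h and 0 <= sx < w:
                out[y][x] = mask[sy][sx]
    return out


def iou(a: Mask, b: Mask) -> float:
    """Intersection over union of two equally sized binary masks."""
    inter = sum(p & q for ra, rb in zip(a, b) for p, q in zip(ra, rb))
    union = sum(p | q for ra, rb in zip(a, b) for p, q in zip(ra, rb))
    return inter / union if union else 1.0


def best_shift_iou(scan: Mask, reference: Mask, max_shift: int = 2) -> Tuple[float, int, int]:
    """Return (best IoU, dy, dx) over all shifts within +/- max_shift pixels."""
    best = (-1.0, 0, 0)
    for dy in range(-max_shift, max_shift + 1):
        for dx in range(-max_shift, max_shift + 1):
            score = iou(shifted(scan, dy, dx), reference)
            if score > best[0]:
                best = (score, dy, dx)
    return best


# Toy masks: the same 2x2 stamp, offset by one pixel in each direction.
reference = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
]
scan = shifted(reference, 1, 1)

raw_score = iou(scan, reference)
aligned_score, dy, dx = best_shift_iou(scan, reference)
print(raw_score, aligned_score, dy, dx)
```

A brute-force search like this grows quickly once rotation and scale are added, which is one motivation for key-point-based registration.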
Two-Way Street: Cultural Exchanges Between the Chinese-Speaking World and the Portuguese- and Spanish-Speaking Worlds
The project, which combines Data Science with the Humanities, seeks to map out every single Chinese book that has been officially published in Portuguese and Spanish (be it a translation from the Chinese, or a topic that relates to the Chinese-speaking world), as well as every single Portuguese and Spanish book that has been officially published in Chinese (be it a translation from the Portuguese and the Spanish, or a topic that relates to the Portuguese- and Spanish-speaking worlds). No comprehensive studies on this topic have been carried out anywhere in the world.
The objective of the project is to map out the cultural exchanges among these three spheres, to establish a comprehensive chronology of what has been published and where, and to draw conclusions from these data.
- Liaise with National Libraries
- Follow-up on published titles with missing information
- Extract, clean, and structure data from the library catalogues
- Establish the chronology of the publications
We have completed our data extraction and have gathered over 5000 titles of books and articles translated from Portuguese and Spanish into Chinese. We are now analyzing these data in order to identify patterns and draw conclusions that will shed light on Sino-Portuguese-Spanish relations over the past hundred years.
The Mystery of China Innovation Quality
Technological progress propels economic growth. China’s economy grew dramatically, at an average annual rate of 8.7 percent, from 1980 to 2015. The miracle of China’s economic growth is mirrored by its immense investment in research & development (R&D), with spending as a percentage of GDP rising from 0.72 percent in 1990 to 2.13 percent in 2017. China’s R&D expenditure has reached $442.7 billion, making it the second-largest R&D spender after the United States. As the backbone of innovation, R&D investments have propelled China to become the world leader in patent applications. In 2019, China became the world leader in international patent applications with more than 1.4 million applications, overtaking the United States, which had been at the top for more than four decades according to the World Intellectual Property Organization. Several studies on the patent surge in China have summarized general data patterns and analyzed crucial factors behind innovation such as R&D investments, international trade, and foreign direct investment (Hu and Jefferson, 2009; Hu, Zhang, and Zhao, 2017; Wei, Xie, and Zhang, 2017). Despite this burgeoning literature, our understanding of innovation quality in China is limited. Important open questions include how the quality gap in patented innovation between China and other leading innovating countries has evolved, how it relates to industrial and public policies directed at science and technology, and what role the innovation network formed by inventors or assignees plays. This research seeks to provide quantitative evidence on these and related questions.
The Chinese Factory Project: A Data Analytics and Digital History Project
The Chinese Factory Project (CFP) is a multidisciplinary data analytics and digital history project, designed to collect, analyze, and publicize archival and quantitative data sources on the industrial factories in modern China. Rooted in a wealth of primary economic and industrial data sources, including both quantitative and archival sources, the CFP is developing an original database containing a sample of up to 2000 factory cases.
In the academic year 2021-2022, Dr. Zhaojin Zeng worked with a team of undergraduate research assistants to continue working on data collection. The student workers came from a wide range of majors, including Math, Data Science, Political Economy, and Global China Studies.
This year, we also developed collaborative relationships with the School of Economics at Fudan University. Their doctoral student Xiaomin Liu has worked actively to advise our undergraduate students and provided support to our research.
We were also invited to participate in a conference at the University of Pittsburgh, where we will present a research project examining chemical fertilizer factories and their production and operation in the late twentieth century.
Online Environmental Communication in China: A quantitative and qualitative analysis of internet text data
The project examines the role of the internet and social media in environmental communication in China. It aims to understand how, on the one hand, the Chinese state has come to use the internet and social media to communicate about its policies and promote its actions in the field of environment and, on the other hand, how Chinese people have come to use the internet to access legal information to resolve the environmental problems they face locally. The first component of the project analyzes the communication strategy of Chinese local Environmental Protection. The project studies the online and offline dynamics of environmental disputes between political actors and local environmental activists. We have completed two rounds of data collection and finished the preliminary analysis. We are now moving into the drafting process.
The second component of the project focuses on how the internet has eased access to environmental law for average Chinese people. The collected data corpus consists of almost 4,000 questions on environmental issues put forward by citizens to the public online legal advice platform China.findlaw.cn, together with the answers posted by online lawyers in response. The project originally aimed to produce a combined quantitative and content analysis of the data, investigating how, when, where, and how often common people use legal advice websites to seek environmental justice; how different actions taken by the state might affect the likelihood of common people seeking environmental justice; and how, when, where, and how often lawyers reply to questions posted on legal advice sites, comparing environmental questions with other types of questions. The paper is in its second submission process.
The third component of the project focuses on how China’s Environmental Protection Bureaus (EPBs) strategically use Weibo posts to shape public participation in environmental protection through “mobilizing distraction” discourse strategies. The collected data corpus consists of a full year (from January 2021 to December 2021) of Weibo data from 52 EPBs, including the Ministry of Ecology and Environment (MEE), 31 provincial EPBs and the top 20 municipal EPBs in the performance ranking list issued by the MEE. That is a total of around 260,000 posts. The project aims to produce a combined discourse analysis and potential quantitative analysis of the data. We have collected the data and are now moving forward with the manual coding.
Spatial Big Data Analysis of Socio-ecological Benefits from Urban Parks in China
This project analyzes spatial big data related to urban green spaces in China in order to empirically measure these green spaces’ contributions to the public’s wellbeing. The project used both social media data and satellite data. It collected social media data, including both texts and images, posted by urban park visitors in Shanghai, Suzhou, Kunshan, and Chengdu. These data reflect urban park visitors’ experiences in green spaces and indicate the quality of these green spaces (or cultural ecosystem services). The project also hired and trained eight DKU undergraduate students. Using Google Earth Engine, these student assistants mapped all urban parks in these cities and conducted land use classification within these green space areas. The students have analyzed the collected social media data using Python and natural language processing. The results of this project will help decision-makers in China improve urban park management by helping them understand the different social-ecological characteristics of urban parks revealed by these data. The results will also contribute to the literature by demonstrating applications of big data to monitoring the social-ecological quality of urban parks.
Antarctica Palmer Station: A Data-Driven Alternate Reality Game
The Antarctica: Palmer Station project is a mixed reality data-rich simulation game and VR experience that investigates the complex realities of life and work on the frozen continent. The piece presents a simulated reality through the lens of Palmer Station, a permanent U.S. research program based on Anvers Island on the tip of the Antarctic Peninsula. Through layering real-world information and data from scientific, historical, cultural, and geopolitical perspectives into the simulated reality, Antarctica: Palmer Station creates a parallel universe that allows audiences to experience and contemplate issues surrounding environmental and climate crises, ethics, and human survival.
The project is developed in collaboration with oceanographer Prof. Yajuan Lin, whose research in marine micro-plankton diversity and carbon cycles around the Southern Ocean informs the scientific component of the game world and provides the foundation for the project’s world view. Specifically, the work features Lin’s research data, collected between 2012 and 2017 near and around Palmer Station, which provide a window into the region’s climate fluctuations. The piece uses mixed reality approaches to turn complex abstract topics into tangible experiences for audiences. The piece is designed for VR platforms, but also includes an accessible video game version to allow broader public engagement beyond the gallery and museum space. The piece consists of a virtual game, a physical installation presence, and public education and citizen science programs.
This is the first in a series of case studies by designer/artist duo Benjamin Bacon and Vivian Xu in the development of a multi-chapter omniverse built on real-world environmental data. This project is funded by the Duke Kunshan University Data Science Research Center Data+X Grant 2021-2022 and supported by the Design, Technology and Radical Media Labs.
Digitizing Religion in China: Building an Integrative Data Science Platform
We have been working with a team of undergraduate students to build an open-source, interactive mapping website dedicated to Chinese religious sites and activities. The website is designed to collect information on religious ceremonial gatherings from religious scholars and followers. To this end, we have collected data on religious sites from religious authorities and scholars. As the next step, we plan to do the same with religious leaders and ordinary religious practitioners.
However, one major concern is how to conduct this study ethically when certain religious sites may be targeted in a discriminatory manner by some policy makers. How can we gather information on new religious sites while, at the same time, protecting the identity of our informants? To address this issue, we have designed a two-layer system. In the first layer, anonymous users can submit the location of a religious site to our WeChat public account. We only ask for approximate locations, for example street names. We then geo-locate the approximate location on our mapping platform, which is maintained on a server in the US. We have also collected data on religious activities from social media, such as WeChat and Weibo. We use machine-learning algorithms to filter out the recent posts sent by ordinary religious users. As the next step, we plan to integrate these religious posts from ordinary citizens into our WeChat public account and seek information from ordinary citizens.