Original

This paper introduces the background and survey process of CDL's DataUp project, analyzes scientists' research data management needs and the resulting requirements for the DataUp tool, presents the project's outcomes, and draws lessons for domestic research data management.


DataUp Project Research Design
Before designing the DataUp tool, the project team surveyed 133 researchers from August to December 2011 in order to better understand how earth, environmental, and ecological scientists manage their research data and to capture their management requirements. The team also collected a large number of professional recommendations from data management organizations such as academic libraries and data centers. In summary, the DataUp research design has the following characteristics: (1) Multi-channel collection of information. The research team used the project website, the Data Pub blog, Twitter, interviews, conferences, webinars, and professional seminars to collect information.
(2) Professional relevance of the questionnaire design. In order to understand how researchers in different fields manage data in Excel, and to develop highly practical tools, the survey team

Data Management Level of the Three Kinds of Researchers
The survey results show that the data management capabilities of the three kinds of researchers (earth, environmental, and ecological scientists) are generally weak in three respects: (1) They lack practical experience in scientific data management. (2) They know little about data centers and metadata standards.
(3) They are not fully aware of the value that data management and data sharing can bring.

Data Management Mode of the Three Kinds of Researchers
(1) Operating system. Windows dominates, used by 74% of the respondents; another 23% and 2% use Mac and Linux, respectively.
(2) Frequency of Excel use. 80% of respondents said they use Excel every day, while 8% use it often and 12% use it infrequently.
(3) Excel features. 97% of respondents said they often use header rows, 83% often use inline formulas, and 74% often use cell shading as an informal kind of metadata. In addition, 50%, 41%, and 32% of respondents often use Excel's cell, pivot table, and note features, respectively.
(4) Supplementary software. Besides Excel's own data functions, respondents also use software such as Microsoft Access, MATLAB, SigmaPlot, GIS, and SAS, in varying proportions, to supplement Excel's data management capabilities.

Scientific Researcher Excel Data Management Requirements Analysis
Many researchers have spontaneously adopted Excel for data management and storage. This has produced some results, but its drawbacks are increasingly apparent: Excel is not a specialized tool for scientific data management, and it does not provide an overview tool for the computing, management, and storage tasks that are critical to research work. Moreover, the use of simple spreadsheets can lead to errors in results derived from scientific data. According to the analysis of EuSpRIG (the European Spreadsheet Risks Interest Group), data management in Excel has the following specific deficiencies: (1) Poorly structured data tables.
(2) Missing metadata or inconsistent metadata standards.
(3) Embedded numbers, charts, and annotations make spreadsheets incompatible with non-Excel systems. (4) Lack of procedures for data calculations, statistics, and formula use.

DataUp Data Service Function Positioning
Combining the flaws of Excel-based data management with an in-depth analysis of the researchers' needs gathered during the survey, the DataUp project team drew up the detailed functional requirements for the tool shown in Table 1. (1)  certain identification symbol to store files, so as to facilitate long-term preservation and retrieval of documents. (6) Before a data file is officially saved to a specific database, DataUp checks whether the file has passed the three-step process covering uniform format, metadata, and reference file.
It then generates the technical metadata required by the database. (7)

DataUp Tool Publishing Form
The project team faced two choices for the form of the DataUp software: an Excel add-in to be downloaded and installed, or a web application. Although the former is more convenient and faster, it can only be downloaded and run on Windows, and it raises problems such as software compatibility and future update downloads.
The latter has drawbacks in implementing Excel functions. The project team consulted more than 200 researchers through social networks, questionnaires, and other channels. Of these, 95% were willing to download an add-in, but 83% of them assumed that the program would also run on the Mac, and 72% mentioned various obstacles to downloading and installing add-ins. Based on the survey results, the project team concluded that each distribution method had its own user group, and settled on a two-pronged release model for the DataUp tool.

DataUp Tool Operation Process
The operation process of the DataUp scientific data management tool strictly follows the project's established objectives, which form four steps: check the file against best practices, create standardized metadata, assign a unique identifier to the dataset, and upload the metadata to a repository. The implementation of each step is described below.
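The four steps can be pictured as a simple sequential pipeline. The following is only a minimal sketch under stated assumptions: the `Dataset` class, the step functions, the row-width rule, the placeholder identifier format, and the dictionary standing in for a repository are all hypothetical and are not taken from DataUp itself.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Dataset:
    """Hypothetical in-memory stand-in for a spreadsheet being processed."""
    name: str
    rows: List[List[str]]  # first row is treated as the header
    metadata: Dict[str, object] = field(default_factory=dict)
    identifier: Optional[str] = None


def check_best_practices(ds: Dataset) -> None:
    # Step 1 (placeholder rule): every row must be as wide as the header row.
    if not ds.rows:
        raise ValueError("dataset is empty")
    width = len(ds.rows[0])
    for number, row in enumerate(ds.rows[1:], start=2):
        if len(row) != width:
            raise ValueError(f"row {number} has {len(row)} cells, expected {width}")


def create_metadata(ds: Dataset) -> None:
    # Step 2: record a file-level title and the column names.
    ds.metadata = {"title": ds.name, "columns": list(ds.rows[0])}


def assign_identifier(ds: Dataset) -> None:
    # Step 3: placeholder identifier; a real one would come from an identifier service.
    ds.identifier = f"example-id/{ds.name}"
    ds.metadata["identifier"] = ds.identifier


def upload(ds: Dataset, repository: Dict[str, Dict[str, object]]) -> None:
    # Step 4: a plain dict stands in for the remote repository.
    repository[ds.identifier] = ds.metadata


def run_pipeline(ds: Dataset, repository: Dict[str, Dict[str, object]]) -> str:
    # Steps run in order; an exception in any step stops the pipeline before upload.
    check_best_practices(ds)
    create_metadata(ds)
    assign_identifier(ds)
    upload(ds, repository)
    return ds.identifier
```

Because the check runs first, a file that fails it never reaches the repository, which mirrors the ordering described above.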

Best Practices for Data Format Detection
The purpose of the best-practice check is to ensure that the data file is well formed and consistent with best management practices. The key is to identify hidden issues that hinder data storage and management. Some features that display normally in Excel, such as annotations, embedded charts, and graph cells, cannot be read by non-Excel programs. Based on extensive research, the project team summarized 11 types of hidden problems and corresponding suggestions for modification, as shown in Table 2.
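To illustrate what such a check might look like, the sketch below inspects an .xlsx file, which is a ZIP package, for two of the features mentioned above: annotations (cell comments) and embedded charts. It is not DataUp's implementation. The function names and warning texts are hypothetical, and the check relies on the part names Excel conventionally writes (`xl/comments*.xml` and `xl/charts/`), which is an assumption rather than a guarantee.

```python
import io
import zipfile
from typing import BinaryIO, List


def find_portability_issues(xlsx: BinaryIO) -> List[str]:
    """Return warnings for features that non-Excel programs may not read."""
    with zipfile.ZipFile(xlsx) as package:
        names = package.namelist()
    issues = []
    # Assumption: cell comments are conventionally stored as xl/comments*.xml.
    if any(name.startswith("xl/comments") for name in names):
        issues.append("cell comments found: move the notes into the metadata")
    # Assumption: embedded charts are conventionally stored under xl/charts/.
    if any(name.startswith("xl/charts/") for name in names):
        issues.append("embedded charts found: keep charts in a separate file")
    return issues


def make_demo_package(part_names: List[str]) -> io.BytesIO:
    """Build an in-memory ZIP with the given part names (demonstration only)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as package:
        for name in part_names:
            package.writestr(name, "<placeholder/>")
    buffer.seek(0)
    return buffer
```

For example, `find_portability_issues(make_demo_package(["xl/workbook.xml", "xl/comments1.xml"]))` returns a single warning about cell comments. A real file would be opened with `open(path, "rb")` instead of the demonstration helper.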

Standardized Metadata Creation
The DataUp project standardizes metadata at two levels: (1) file-level metadata, including specifications for file names, e-mail addresses, organizations, and dataset titles; (2) attribute-level metadata, including variable information, units of measure, and descriptions of the data columns in the dataset. In defining its metadata standard, the DataUp project drew selectively on EML (Ecological Metadata Language), a standard that is common in academia. This choice rests on two reasons. First, EML is widely used among DataUp's target users. Second, EML is flexible and extensible, so the project team can adapt it to the metadata model that best fits the project's needs.
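The two levels of metadata can be sketched as a small XML document. The example below is a simplified illustration, not schema-valid EML and not DataUp's actual output: the element names are loosely modeled on EML, the nesting is flattened (in particular, units are written as plain text), and the function name and dictionary keys are hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List


def build_metadata(title: str, email: str, organization: str,
                   columns: List[Dict[str, str]]) -> ET.Element:
    """Build a simplified, EML-like metadata tree (not validated against EML)."""
    root = ET.Element("eml")
    dataset = ET.SubElement(root, "dataset")

    # File-level metadata: dataset title, organization, e-mail address.
    ET.SubElement(dataset, "title").text = title
    creator = ET.SubElement(dataset, "creator")
    ET.SubElement(creator, "organizationName").text = organization
    ET.SubElement(creator, "electronicMailAddress").text = email

    # Attribute-level metadata: one entry per column of the data table.
    table = ET.SubElement(dataset, "dataTable")
    attribute_list = ET.SubElement(table, "attributeList")
    for column in columns:
        attribute = ET.SubElement(attribute_list, "attribute")
        ET.SubElement(attribute, "attributeName").text = column["name"]
        ET.SubElement(attribute, "attributeDefinition").text = column["description"]
        ET.SubElement(attribute, "unit").text = column["unit"]
    return root
```

Calling `ET.tostring(root, encoding="unicode")` on the result yields the XML text that could then be stored alongside the data file.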

Data Set Identifier Generation
In order to expand data sharing and citation, the DataUp project uses CDL's ARK (Archival Resource Key) identifiers to identify datasets. ARKs have advantages such as simplicity, versatility, transparency, and identifiability. The identifier is saved as part of the metadata.
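As a rough illustration, an ARK in its basic form looks like `ark:/NAAN/Name`, where the NAAN identifies the naming authority. The sketch below formats and parses that basic form and stores the identifier in a metadata dictionary. It is a simplification under stated assumptions: the pattern expects a numeric NAAN and ignores other ARK variants, the values used in the usage note are placeholders, and the helper names are hypothetical. Real ARKs are minted by an identifier service rather than composed locally.

```python
import re
from typing import Dict, NamedTuple


class Ark(NamedTuple):
    naan: str  # Name Assigning Authority Number
    name: str  # name assigned by that authority


# Simplified pattern for the basic "ark:/NAAN/Name" form only.
_ARK_PATTERN = re.compile(r"ark:/(\d+)/(\S+)")


def format_ark(naan: str, name: str) -> str:
    """Compose an ARK string from its two basic parts."""
    return f"ark:/{naan}/{name}"


def parse_ark(text: str) -> Ark:
    """Split a basic ARK string into NAAN and name, or raise ValueError."""
    match = _ARK_PATTERN.fullmatch(text)
    if match is None:
        raise ValueError(f"not a basic ARK: {text!r}")
    return Ark(naan=match.group(1), name=match.group(2))


def attach_identifier(metadata: Dict[str, str], ark: str) -> Dict[str, str]:
    """Return a copy of the metadata with the ARK stored as its identifier."""
    parse_ark(ark)  # reject malformed identifiers before saving them
    updated = dict(metadata)
    updated["identifier"] = ark
    return updated
```

For instance, `attach_identifier({"title": "Stream temperatures"}, format_ark("99999", "fk4example"))` returns the metadata with an `identifier` entry added, leaving the original dictionary unchanged.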

Metadata Upload and Storage
Once the metadata has been created, users can connect directly to the selected repository through the DataUp tool to upload and store the metadata files. DataUp currently has a dedicated companion repository, ONE Share, a public repository that anyone can use to store tabular data.

DataUp's Physical Connection System
As a networked data management tool, DataUp necessarily links with various code libraries and storage systems, which together form a networked data management system, as shown in Figure 1. DataUp's code library is built on the .NET framework and written in Visual Basic, while the online version is served through the Windows Azure cloud platform. Both the add-in client and the web application connect to one or more repositories through a unified intermediary web service. The add-in runs directly in the Windows environment on the user's computer. The web application and the intermediary service communicate using the OData protocol. All endpoints use the EML metadata standard, and the intermediary service is managed on Azure.

In-Depth Research and Development of Scientific Data Management Requirements and Practices
Starting from the actual needs and management practices of scientific data users, CDL's DataUp project team insisted on extensive research and in-depth analysis of every detail, in an effort to make the tool's design satisfy user needs as fully as possible. Specifically: (1) To understand the data management level and practices of the target researchers, the project team conducted surveys through blogs, Twitter, questionnaires, and other channels, and gained a thorough understanding of the researchers' data management status. (2) When deciding how to distribute the DataUp tool, the project team adopted a two-pronged approach on the basis of extensive consultation.
(3) When compiling the 11 types of hidden problems and the corresponding modification suggestions, the project team used interviews, questionnaires, and other methods to collect the practical experience of researchers, database administrators, and academic librarians with rich data management experience, and listed the potential hidden problems of spreadsheet files and the suggested fixes as completely as possible.

Solve Problems in Data Management from the Microscopic Point of View
The 11 types of hidden problems and the corresponding suggestions show that the project team took a rigorous attitude and attended to every small detail in designing the data management tool. In the data management process, one should carefully consider how the design of each minor step affects the final result, make bold assumptions and verify them carefully, and, through repeated practice and testing, catch and correct every minor factor that does not meet best-practice requirements, so as to achieve refined management of scientific data.

Unified Management, Expanding the Sharing of Research Data and Improving the Efficiency
CDL's DataUp project has standardized scientific data in academia by establishing uniform formats, metadata standards, and data identifiers for research data. A uniform format allows scientific data files to be recognized and accessed by different programs. A unified metadata standard makes the description of digital resources consistent, conveys the nature, details, and characteristics of resources, and promotes their sharing and use. Unique identifiers give data resources a unique identity and facilitate the long-term preservation, retrieval, and use of data. Domestic scientific data management likewise needs to establish unified data standards and languages, smooth the flow of data, expand data sharing, and improve the efficiency of data use.