TECHNICAL NOTE | doi:10.20944/preprints202211.0220.v2
Subject: Mathematics & Computer Science, Information Technology & Data Management Keywords: Relational Database; Columnar Storage; Bloom Filter; Skip List; Field Level Lock; Read Write Concurrency; OLTP; OLAP; LSM-Tree; Token Bucket Algorithm
Online: 14 November 2022 (03:02:09 CET)
Diversified, highly concurrent workloads in the Internet industry often require a heterogeneous deployment of several databases to meet their needs. This report introduces SG-ColBase, a database kernel we developed. It implements read and write concurrency control, data rollback, atomic log writing, and redo of data after downtime to provide complete transaction support. Field-level locks and snapshot reads extend the parallelism of kernel execution. A Bloom filter, resource cache pool, memory pool, skip list, non-blocking log cache, and asynchronous data writing mechanism improve the overall execution efficiency of the system. For data storage, column storage, logical keys, and an LSM-tree are introduced. These improve the data compression ratio and reduce data gaps, and all disk data operations are written in incremental order. Combined with asynchronous batch operation, this greatly improves data writing speed. Because column storage keeps the values of a column contiguous on disk, it reduces the disk scanning needed for column-wise traversal, which is a qualitative leap in efficiency over traditional relational databases in big data analysis scenarios. SG-ColBase can reduce the use of heterogeneous databases in business and improve R&D efficiency.
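The abstract lists a Bloom filter among the structures used to speed up execution but gives no implementation details. As a minimal sketch of the general technique only (not SG-ColBase's code), the class below shows how such a filter can rule out keys that were never written, so that a lookup can skip a disk read. The names `BloomFilter`, `add`, and `might_contain`, the bit-array size, and the SHA-256-based hashing are all assumptions made for illustration.

```python
import hashlib


class BloomFilter:
    """Illustrative Bloom filter; sizes and hashing scheme are assumptions."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        # One bit per slot, packed into a bytearray.
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key: bytes):
        # Derive num_hashes bit positions by salting SHA-256 with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(i.to_bytes(4, "little") + key).digest()
            yield int.from_bytes(digest[:8], "little") % self.num_bits

    def add(self, key: bytes) -> None:
        # Set every bit position derived from the key.
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key: bytes) -> bool:
        # False means "definitely absent"; True means "possibly present".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(key)
        )


bloom = BloomFilter()
bloom.add(b"row:42")
print(bloom.might_contain(b"row:42"))
```

A key that was added always reports as possibly present. A key that was never added usually reports as absent, but false positives are possible, so a positive answer still has to be confirmed against the stored data.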
ARTICLE | doi:10.20944/preprints202301.0402.v1
Subject: Mathematics & Computer Science, Numerical Analysis & Optimization Keywords: Sequence Encoder; Autoregressive Sequence; Separated Model; Statistical Test; Neural Network
Online: 23 January 2023 (08:30:48 CET)
Language models widely use the stop sign as an independent token to decide when the model should stop, but this may enlarge the vocabulary dimension and cause further problems. Similarly, current research on game algorithms usually estimates stopping-point-related problems from an evaluation of the winning rate. However, information redundancy may also exist in such models, which increases the training difficulty. These two types of tasks (and similar autoregressive tasks) share a common problem of stopping point prediction. In this paper, we describe a separated model design that tries to separate the complexity of stopping point prediction from the main task model, so that the information used for estimating the stopping point can be reduced. On this basis, to verify the rationality of using a separated model, we propose a model-free test method. It judges the separability of the transformed data based on point difference and sequence difference metrics. In this way, it can predict the credibility of the separated model's inference.
ARTICLE | doi:10.20944/preprints202301.0219.v1
Subject: Mathematics & Computer Science, Artificial Intelligence & Robotics Keywords: Large Language Model; Natural Language Processing; Reading Comprehension; Computational linguistics; Information Retrieval; BM25
Online: 12 January 2023 (08:58:03 CET)
Large language models (LLMs) represent a major advance in AI and have been used in many natural language processing tasks. Nevertheless, in different business scenarios, an LLM requires fine-tuning by engineers to achieve satisfactory performance, and the cost of fine-tuning may not be commensurate with the performance achieved. Based on the Baidu STI dataset, we study the upper bound of the performance that classical information retrieval methods can achieve in a specific business scenario, and compare it with the cost and performance of the LLM-based participating team. This paper gives an insight into the potential of classical computational linguistics algorithms, which can help decision-makers make reasonable choices between LLMs and low-cost methods in business R&D.
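The keywords name BM25 as one of the classical retrieval methods, but the abstract does not say which variant or parameters were used. As a rough sketch under those caveats, the function below scores pre-tokenized documents against a query with the standard Okapi BM25 formula. The function name `bm25_scores`, the defaults `k1=1.5` and `b=0.75`, the smoothed IDF form, and the toy documents are assumptions for illustration, not details taken from the paper.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Return one Okapi BM25 score per document (lists of tokens)."""
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # Smoothed IDF that stays positive for very common terms.
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            # Term-frequency saturation with document-length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    ["bloom", "filter", "index"],
    ["language", "model", "tuning"],
    ["bloom", "bloom", "filter"],
]
print(bm25_scores(["bloom"], docs))
```

A document that repeats a query term scores higher than one that mentions it once, but with diminishing returns controlled by `k1`, while `b` controls how strongly long documents are penalized.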
DATA DESCRIPTOR | doi:10.20944/preprints202111.0511.v1
Subject: Mathematics & Computer Science, Information Technology & Data Management Keywords: Social network analysis; Natural language processing; Dataset; Multimode; Opinion Dynamics
Online: 26 November 2021 (14:23:36 CET)
At the end of 2018, a high school student posted a question in the Zhihu community, claiming that he had proved Goldbach's conjecture. The question caused an explosive reaction: a large number of users joined the discussion, and it had widespread influence. On January 1, 2019, the questioner released his "proof", which was soon shown to be wrong. The heated discussion caused by the incident contains a great deal of information of value for social science analysis. We therefore followed the event from the outset and built a time series dataset for it. Taking the release of the questioner's "proof" as the dividing line, we recorded all the answers, comments, and sub-comments posted in the two days before and after it, together with information on the users who wrote them. This temporal information can reflect the dynamic features of the interaction between user opinions, and the impact of an exogenous shock (the proof release) on community opinion. The dataset can be used not only to demonstrate various social network analysis algorithms, but also for a range of natural language processing tasks such as fine-grained sentiment analysis of long texts, as well as multimodal tasks combining natural language processing and social network analysis. This paper introduces the characteristics and structure of the dataset, shows visualizations of the social network, and uses the dataset to train a benchmark model for sentiment analysis.
ARTICLE | doi:10.20944/preprints202111.0499.v1
Subject: Mathematics & Computer Science, Information Technology & Data Management Keywords: Social Networks; Data Mining; Graph Structure; Natural Language Processing; Machine Learning
Online: 26 November 2021 (10:45:06 CET)
The herd effect is a common phenomenon in society. Detecting it is of great significance in many tasks based on social network analysis, such as recommendation. However, research on social networks and natural language processing seldom focuses on this issue. In this paper, we propose an unsupervised data mining method to detect herding in social networks. Taking shopping reviews as an example, our algorithm can identify reviews that are influenced by earlier reviews and detect a herd effect chain. From an overall perspective, the cross effects of all views form the herd effect graph. Through this graph structure, the algorithm can be used in a wide variety of social network analysis methods, providing useful new features for many tasks.