[Journal article][Full-length article]


FastForest: Increasing random forest processing speed while maintaining accuracy

Authors:
Darren Yates; Md Zahidul Islam

Publication year: 2021

Pages: 130 - 152
Publisher: Elsevier BV


Abstract:

Random Forest remains one of data mining's most enduring ensemble algorithms, achieving well-documented levels of accuracy and processing speed, and regularly appearing in new research. However, with data mining now reaching hardware-constrained devices such as smartphones and Internet of Things (IoT) devices, there is a continued need for research into algorithm efficiency that delivers greater processing speed without sacrificing accuracy. Our proposed FastForest algorithm achieves this through a combination of three optimising components: Subsample Aggregating ('Subbagging'), Logarithmic Split-Point Sampling and Dynamic Restricted Subspacing. Empirical testing shows FastForest delivers an average 24% increase in model-training speed over Random Forest whilst maintaining (and frequently exceeding) classification accuracy across tests involving 45 datasets on both PC and smartphone platforms. Further tests show FastForest achieves favourable results against a number of ensemble classifiers, including implementations of Bagging and Random Subspace. With growing interest in machine learning on mobile devices, FastForest provides an efficient ensemble classifier that can achieve faster results on hardware-constrained devices such as smartphones.

Introduction

Ensemble classifiers have a rich history in data mining and continue to feature in new research covering topics as varied as gene selection [1], software defect prediction [2], emotion prediction in social media [3] and energy consumption of buildings [4]. Their ability to identify multiple information patterns within datasets has seen their popularity continue to grow. Well-known examples include Bootstrapped Aggregating (or 'Bagging') [5], which samples the dataset with replacement to create multiple dataset derivatives, from which multiple decision trees are built. Random Subspacing [6], another popular example, boosts diversity by randomly restricting the attributes available for selection at each tree node. However, it is Random Forest [7], the algorithm that essentially combines Bagging with Random Subspacing, that continues to feature heavily in new research [8], [9], [10], [11], [12], thanks largely to its ability to achieve high accuracy over a broad range of dataset types with comparative computational efficiency.
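To make the two sampling ideas just described concrete, the following is a minimal Java sketch (Java matching the paper's Weka-based implementation); all identifiers are illustrative and are not taken from the paper's source code:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Illustrative sketch only; names are not from the paper's GPL3 source.
public class EnsembleSampling {
    private static final Random RNG = new Random(42);

    // Bagging: draw n records *with* replacement from an n-record dataset,
    // giving each tree its own bootstrap-sampled derivative of the data [5].
    static <T> List<T> bootstrapSample(List<T> dataset) {
        List<T> sample = new ArrayList<>(dataset.size());
        for (int i = 0; i < dataset.size(); i++) {
            sample.add(dataset.get(RNG.nextInt(dataset.size())));
        }
        return sample;
    }

    // Random Subspacing: restrict a node's candidate attributes to a random
    // subset of size k out of m; Random Forest applies this at every node [6], [7].
    static List<Integer> randomSubspace(int m, int k) {
        List<Integer> indices = new ArrayList<>();
        for (int a = 0; a < m; a++) indices.add(a);
        Collections.shuffle(indices, RNG);
        return new ArrayList<>(indices.subList(0, k));
    }
}
```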
Random Forest has been further enhanced by parallelisation, allowing decision trees to be constructed in parallel on multi-core processors; multi-threaded versions of the algorithm have thereby enabled greater processing speeds [13]. With the rise of smartphones and the Internet of Things (IoT) in recent years, there is growing potential to bring data mining to hardware-constrained devices, despite the reduced processing speed they may offer. As an example, our previous research [14] brought locally-executed data mining to the Android platform through our open-source DataLearner app, available free on Google Play. Moreover, moves by cloud service providers such as Amazon Web Services to change the cost structure of these services from a per-hour to a per-second basis [15] suggest there can also be direct cost savings from implementing more efficient algorithms in cloud-based applications. Thus, the need continues for research into algorithms that deliver faster performance without sacrificing accuracy.

FastForest is an optimised derivative of Random Forest that achieves improved processing speed whilst maintaining (and frequently exceeding) the classification accuracy of Random Forest. This is achieved through three components detailed in Section 3: Subsample Aggregating (or 'subbagging'), Logarithmic Split-Point Sampling (LSPS) and Dynamic Restricted Subspacing (DRS). As will be detailed throughout this paper, the original contributions of this research include:

• Empirical testing of subbagging scale factors from 0.05 to 0.632 within Random Forest using 45 freely-available training datasets (Section 3.2).
• Development of the LSPS component, inspired by our previous work on split-point sampling in the single-tree SPAARC algorithm [16] (Section 3.3).
• Development of the DRS component, inspired by 'dynamic subspacing' [17], to achieve greater speed without loss of accuracy (Section 3.4).
• Combining the subbagging, LSPS and DRS components into an ensemble classifier called 'FastForest' that appears to be faster than Random Forest without sacrificing accuracy.
• Broad-scale empirical testing of FastForest over a cache of 45 freely-available categorical, numerical and mixed datasets on PC and smartphone hardware (Sections 4.1 to 4.5).
• Implementation of FastForest in the DataLearner data-mining app freely available on Google Play, development of GPL3-licenced Java source code to be made available on GitHub, and implementation of FastForest for potential availability in the Weka Package Manager.

Common notation and abbreviations used in this paper are shown in Table 1 and Table 2, respectively. This paper continues with Section 2 covering related work in ensemble classifiers and efforts in ensemble optimisation; Section 3 details the components of the proposed FastForest algorithm; Section 4 covers experimentation and comparison with various ensemble algorithms, including Random Forest itself; Section 5 presents a time-complexity analysis of Random Forest and FastForest; and Section 6 concludes the paper.

Figures

• Graphical framework of the FastForest algorithm.
• Total build speed and average classification accuracy of subbagging on the 30 datasets of Groups 1 and 2 (Table 3, Table 4), compared with Random Forest (RF: 85.77% accuracy / 3.45 s).
• Total build speed and average classification accuracy of subbagging on the 15 datasets of Group 3 (Table 5), compared with Random Forest (RF: 93.44% accuracy / 125.1 s).
• Split-point candidates selected in (a) Random Forest, between each adjacent pair of values; (b) SPAARC, using a fixed count of 20 candidates; and (c) LSPS, using log2 of the record count at each node (sketched below).
• Example showing DRS increasing the number of node attributes, k, only once the number of records at the node falls below 12.5% (1/8) of the total record count (sketched below).
• Algorithm code for (a) the FastForest algorithm, (b) the subbagging component, (c) the tree-building function incorporating DRS, and (d) the LSPS component.
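Based only on the component descriptions and the figure captions above, the three optimisations can be sketched in Java as follows. This is a hypothetical reconstruction; the paper's actual rules (Section 3) may differ, and every identifier and threshold here is assumed, not quoted:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical sketch of FastForest's three components as described in the
// text and captions; the paper's actual Section 3 definitions may differ.
public class FastForestComponents {
    private static final Random RNG = new Random(1);

    // Subbagging: sample *without* replacement at a scale factor s
    // (the paper empirically tests s from 0.05 to 0.632).
    static <T> List<T> subbag(List<T> dataset, double scale) {
        List<T> copy = new ArrayList<>(dataset);
        Collections.shuffle(copy, RNG);
        int size = (int) Math.round(scale * dataset.size());
        return new ArrayList<>(copy.subList(0, size));
    }

    // LSPS: rather than testing a split between every adjacent pair of sorted
    // values, test only ~log2(n) evenly spaced candidates, where n is the
    // record count at the current node.
    static List<Double> lspsCandidates(double[] sortedValues) {
        int n = sortedValues.length;
        int numCandidates = Math.max(1, (int) (Math.log(n) / Math.log(2)));
        List<Double> candidates = new ArrayList<>(numCandidates);
        for (int c = 1; c <= numCandidates; c++) {
            int idx = (int) ((long) c * n / (numCandidates + 1));
            candidates.add(sortedValues[idx]);
        }
        return candidates;
    }

    // DRS: keep the base subspace size k while a node still holds at least
    // 1/8 of the training records; below that threshold, enlarge k so deeper,
    // smaller nodes consider more attributes. The caption shows one increase;
    // the enlargement factor used here is an assumption.
    static int drsSubspaceSize(int nodeRecords, int totalRecords, int baseK, int m) {
        if (nodeRecords < totalRecords / 8.0) {
            return Math.min(m, baseK * 2); // assumed enlargement
        }
        return baseK;
    }
}
```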
Section snippets

Related work

Ensemble classifiers are classification systems that combine the output of multiple individual or 'base' classifiers to more accurately model not only training datasets, but also classify new records without a class value. By combining the outputs of multiple base classifiers, ensemble classifiers are often more robust to errors or 'noise' in the training dataset [18]. They are also better able than a single classifier to identify multiple patterns of knowledge within the data [19]. As such, …

Our 'FastForest' algorithm framework

This section introduces our proposed 'FastForest' algorithm, beginning with an overview (Section 3.1) before introducing the components used: Subbagging (3.2), LSPS (3.3) and DRS (3.4). It concludes with the algorithms for the proposed components (3.5).

Experiments

To quantify their effect on classification accuracy and processing speed, each of the three FastForest components is tested, both individually and in combination, as well as against other popular ensemble classifiers, including Random Forest itself. This section begins by identifying the datasets used for testing. To validate the methods proposed in Section 3, a total of 45 datasets in three groups were utilised. Group 1 consists of 15 datasets with …

Time-complexity analysis of Random Forest

The time complexity of a single decision tree can be estimated as O(m² · n), where m is the size of the non-class attribute space and n is the size of the training dataset [20], [49], [50]. Random Forest is an ensemble algorithm created from a combination of Bagging and Random Subspace. Bagging is the ensemble process of developing T decision trees from bootstrap-aggregated or 'bagged' samples of the original training dataset, D [5]. Thus, the complexity of generating the …
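The snippet's derivation is cut off, but it can be followed to its standard conclusion. The sketch below assumes the usual accounting for bagged ensembles and is a reconstruction, not a quotation from the paper's Section 5:

```latex
% Reconstruction under standard assumptions; not quoted from the paper.
% A single tree costs O(m^2 n) for m non-class attributes and n records.
% Bagging grows T such trees, each on a bootstrap sample that is still of
% size n, so the ensemble cost scales linearly in T:
\[
  \underbrace{O\!\left(m^{2} \cdot n\right)}_{\text{single tree}}
  \quad\Longrightarrow\quad
  \underbrace{O\!\left(T \cdot m^{2} \cdot n\right)}_{T\ \text{bagged trees}}
\]
% Intuitively, FastForest attacks each factor: subbagging shrinks the data
% term from n to s*n, subspacing (and DRS) shrinks the attribute factor from
% m towards k << m, and LSPS cuts the per-attribute split-evaluation work
% from O(n) candidates to O(log2 n).
```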
Conclusion

Through its combination of Bagging and Random Subspace techniques, Random Forest continues to rank amongst the top ensemble classifiers for classification accuracy and processing speed, and it continues to feature consistently in new research. However, with the ubiquity of smartphones and the rise of the Internet of Things, in conjunction with new data-mining platforms such as TensorFlow Lite and DataLearner, data mining on constrained hardware is seeing growing demand.

CRediT authorship contribution statement

Darren Yates: Conceptualization, Investigation, Methodology, Software, Writing - original draft, Writing - review & editing. Md Zahidul Islam: Supervision, Conceptualization, Methodology, Writing - review & editing.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgment

This research is supported by an Australian Government Research Training Program (RTP) scholarship.

References (50)

• Arie Ben-David. About the relationship between ROC curves and Cohen's kappa. Eng. Appl. Artif. Intell. (2008).
• Gonzalo Martínez-Muñoz et al. Out-of-bag estimation of the optimal sample size in bagging. Pattern Recogn. (2010).
• Jiawei Han et al. Data Mining: Concepts and Techniques (2011).
• Jiang Su et al. A fast decision tree learning algorithm. AAAI (2006).
• Eric Bauer and Ron Kohavi. An empirical comparison of voting classification algorithms: bagging, boosting, and …
• Emma V.A. Sylvester, Paul Bentzen, Ian R. Bradbury, Marie Clément, Jon Pearce, John Horne, and Robert G. Beiko. …
• Faisal Zaman et al. Effect of subsampling rate on subbagging and related ensembles of stable classifiers.
• Leo Breiman. Bagging predictors. Machine Learning, 24(2): 123–140, 1996.
• Peter Bühlmann and Bin Yu. Analyzing bagging. The Annals of Statistics, 30(4): 927–961, 2002.
• Leo Breiman. Random forests. Machine Learning, 45(1): 5–32, 2001.
• Jerome H. Friedman and Peter Hall. On bagging and nonlinear estimation. Journal of Statistical Planning and Inference, …
• Philipp Probst, Marvin N. Wright, and Anne-Laure Boulesteix. Hyperparameters and tuning strategies for random forest. …
• Zeyu Wang, Yueren Wang, Ruochen Zeng, Ravi S. Srinivasan, and Sherry Ahrentzen. Random forest based hourly building …
• L. Benali, G. Notton, A. Fouilloy, C. Voyant, and R. Dizene. Solar radiation forecasting using artificial neural network and …
• Andy Liaw et al. Classification and regression by randomForest. R News (2002).
• Jianguo Chen et al. A parallel random forest algorithm for big data in a Spark cloud computing environment. IEEE Trans. Parallel Distrib. Syst. (2016).
• Class RandomForest. n.d. …
• Juha Saarinen. AWS to switch to per-second billing for Linux instances, 2017. …
• Mark Hall, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann, and Ian H. Witten. The WEKA data mining …
• Md Nasim Adnan and Md Zahidul Islam. Effects of dynamic subspacing in random forest. In International Conference on …
• Peter Bühlmann. Bagging, Boosting and Ensemble Methods (2012).
• Kagan Tumer and Joydeep Ghosh. Error correlation and error reduction in ensemble classifiers. Connection Science, 8 …
• Patrice Latinne et al. Limiting the number of trees in random forests.
• Jianhua Jia, Zi Liu, Xuan Xiao, Bingxiang Liu, and Kuo-Chen Chou. iPPI-Esml: an ensemble classifier for identifying the …
• Thais Mayumi Oshiro et al. How many trees in a random forest?



Keywords:

None provided


Journal:
Information Sciences
ISSN: 0020-0255
Source: Elsevier BV