XGBoost reading guide and paper understanding
The technical content of the paper is divided into two parts: the boosting algorithm itself and the system design that makes it scalable.
XGBoost is a nonlinear additive model:

$$\hat{y}_i = \sum_{k=1}^{K} f_k(x_i), \quad f_k \in \mathcal{F},$$

where each $f_k$ is a regression tree.
If it is a regression problem, the loss is typically the squared error:

$$l(y_i, \hat{y}_i) = (y_i - \hat{y}_i)^2$$

For classification problems the loss is the cross entropy.

Binary classification:

$$l(y_i, \hat{y}_i) = -\left[\, y_i \log p_i + (1 - y_i) \log (1 - p_i) \,\right], \qquad p_i = \frac{1}{1 + e^{-\hat{y}_i}}$$

Multi-class classification:

$$l(y_i, \hat{y}_i) = -\sum_{k=1}^{K} y_{ik} \log p_{ik}, \qquad p_{ik} = \frac{e^{\hat{y}_{ik}}}{\sum_{j=1}^{K} e^{\hat{y}_{ij}}}$$
To review the cross-entropy and softmax formulas for binary and multi-class classification: binary classification is a special case of multi-class classification. With $K = 2$ classes the softmax reduces to the sigmoid:

$$p_1 = \frac{e^{z_1}}{e^{z_1} + e^{z_2}} = \frac{1}{1 + e^{-(z_1 - z_2)}}, \qquad p_2 = 1 - p_1$$
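As a concrete illustration (my own sketch, not from the paper): the built-in objectives of the xgboost Python package correspond to the losses above, and a custom objective only needs to supply the first- and second-order derivatives. The helper `custom_squared_error` below is hypothetical.

```python
import numpy as np
import xgboost as xgb

# Built-in objectives matching the losses above:
#   reg:squarederror -> squared error for regression
#   binary:logistic  -> binary cross entropy on sigmoid(y_hat)
#   multi:softprob   -> multi-class cross entropy on softmax(y_hat)

def custom_squared_error(preds, dtrain):
    """Hypothetical custom objective: gradient and Hessian of 0.5 * (y_hat - y)^2."""
    y = dtrain.get_label()
    grad = preds - y             # first-order derivative
    hess = np.ones_like(preds)   # second-order derivative
    return grad, hess

X, y = np.random.rand(200, 5), np.random.rand(200)
dtrain = xgb.DMatrix(X, label=y)
params = {"max_depth": 3, "eta": 0.1}

bst = xgb.train({**params, "objective": "reg:squarederror"}, dtrain, num_boost_round=10)
bst_custom = xgb.train(params, dtrain, num_boost_round=10, obj=custom_squared_error)
```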
The paper's description is the "default direction". My understanding: within one iteration, a given tree handles missing values of a feature in one consistent direction, but the directions chosen for different features are independent of one another, and trees from different iterations likewise have their own strategies. A sketch of how the direction is chosen follows below.
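A rough Python paraphrase of the sparsity-aware split finding idea (a hypothetical helper, not the real implementation): only the non-missing rows are scanned, and the aggregated statistics of the missing rows are tried on both sides; whichever side gives the larger gain becomes that feature's default direction at this node.

```python
def best_split_with_default_direction(values, grads, hess, miss_grad, miss_hess, lam=1.0):
    """Hypothetical sketch of sparsity-aware split finding for one feature at one node.

    values/grads/hess: statistics of the non-missing rows, sorted by feature value.
    miss_grad/miss_hess: summed gradient/Hessian of the rows missing this feature.
    """
    G, H = sum(grads) + miss_grad, sum(hess) + miss_hess

    def score(g, h):
        return g * g / (h + lam)

    best = (float("-inf"), None, None)   # (gain, threshold, default direction)
    gl, hl = 0.0, 0.0
    for i in range(len(values) - 1):
        gl, hl = gl + grads[i], hl + hess[i]
        thr = (values[i] + values[i + 1]) / 2
        # Option 1: send missing values to the right child.
        gain_right = score(gl, hl) + score(G - gl, H - hl) - score(G, H)
        # Option 2: send missing values to the left child.
        gain_left = (score(gl + miss_grad, hl + miss_hess)
                     + score(G - gl - miss_grad, H - hl - miss_hess) - score(G, H))
        if gain_right > best[0]:
            best = (gain_right, thr, "right")
        if gain_left > best[0]:
            best = (gain_left, thr, "left")
    return best
```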
In the process of building trees, the most time-consuming step is finding the optimal split point, and the most time-consuming part of that is sorting the data. To reduce sorting time, XGBoost stores the data in a block structure: within each block the data are stored in compressed sparse column (CSC) format, with each column sorted by the corresponding feature value. This sorting is done once before training and reused in every iteration.
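A toy illustration of the pre-sorted column block (not the actual CSC implementation): each feature's row indices are sorted by value once before training, so each subsequent split search is a linear scan plus a prefix sum over gradient statistics.

```python
import numpy as np

def build_column_blocks(X):
    """Toy column blocks: for each feature, store (sorted values, row indices), computed once."""
    blocks = []
    for j in range(X.shape[1]):
        order = np.argsort(X[:, j], kind="stable")   # the one-time sort per feature
        blocks.append((X[order, j], order))
    return blocks

# Enumerating split candidates for a feature is then a linear scan over its block;
# the left-child gradient sum at every candidate threshold is just a prefix sum.
X, grad = np.random.rand(8, 3), np.random.rand(8)
values, row_idx = build_column_blocks(X)[0]
left_grad_sums = np.cumsum(grad[row_idx])
```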
For the approximate algorithm, XGBoost uses multiple blocks, which can reside on multiple machines or on disk. Each block corresponds to a subset of the original data, and different blocks can be computed on different machines. This is especially effective for the local proposal strategy, because under the local strategy the candidate split points are regenerated after every branch.
One disadvantage of the block structure is that gradient statistics are fetched by row index, and the fetch order follows the sorted feature values. This leads to non-contiguous memory access, which can lower the CPU cache hit rate and hurt the efficiency of the algorithm.
In the exact (non-approximate) greedy algorithm, cache-aware prefetching is used. Specifically, each thread is allocated a contiguous buffer; gradient information is read into the buffer (turning non-contiguous access into contiguous access), and the gradient statistics are then accumulated from the buffer.
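A simplified, single-threaded sketch of the cache-aware idea (hypothetical names, not the real implementation): instead of accumulating grad[i] directly in the scattered order dictated by the sorted feature values, gather a mini-batch of gradients into a contiguous buffer first and accumulate from there.

```python
import numpy as np

def accumulate_with_buffer(row_indices, grad, hess, batch=4096):
    """Gather gradient statistics into a contiguous buffer, then accumulate."""
    g_sum, h_sum = 0.0, 0.0
    for start in range(0, len(row_indices), batch):
        idx = row_indices[start:start + batch]
        g_buf, h_buf = grad[idx], hess[idx]   # contiguous copies of a scattered read
        g_sum += g_buf.sum()
        h_sum += h_buf.sum()
    return g_sum, h_sum
```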
In the approximate algorithm, the block size must be set reasonably. The size of a block is defined as the maximum number of samples it contains, and choosing it well matters: too large easily leads to a low cache hit rate, while too small leads to low parallelization efficiency. Experiments found that 2^16 samples per block works well.
When the data are too large to fit in main memory, out-of-core computation is made possible by dividing the data into multiple blocks and storing them on disk. During computation, an independent thread pre-loads blocks into main memory, so the disk can be read while computation proceeds. However, disk IO is usually too slow to keep up with computation, so the throughput of disk IO has to be increased. XGBoost adopts two strategies:
Block compression: blocks are compressed by column (possibly the LZ4 compression algorithm), and another thread decompresses them on the fly when reading. For row indices, only the block's starting index is stored in full; every other row is stored as a 16-bit offset from that starting index. A block can therefore hold at most 2^16 samples.
Block sharding: the data are sharded across multiple disks, and each disk is assigned a prefetch thread that fetches its data into a memory buffer. The training thread then alternately reads data from each buffer. This helps improve disk-read throughput when multiple disks are available.
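A toy sketch of the sharding-plus-prefetching pattern (hypothetical file layout; Python threads, pickle and zlib stand in for the real C++ implementation): one prefetch thread per disk decompresses blocks into a bounded buffer, and the training loop consumes the buffers in turn.

```python
import pickle
import queue
import threading
import zlib  # stand-in compressor; the real library's choice of algorithm may differ

def prefetch_worker(block_paths, buf):
    """Prefetch thread for one disk: read, decompress and enqueue blocks."""
    for path in block_paths:
        with open(path, "rb") as f:
            buf.put(pickle.loads(zlib.decompress(f.read())))  # blocks when the buffer is full
    buf.put(None)  # end-of-shard marker

def train_out_of_core(shards, process_block):
    """shards: one list of block file paths per disk; process_block: per-block training work."""
    buffers = [queue.Queue(maxsize=2) for _ in shards]
    for paths, buf in zip(shards, buffers):
        threading.Thread(target=prefetch_worker, args=(paths, buf), daemon=True).start()
    live = list(buffers)
    while live:
        for buf in list(live):      # alternate between the disks' buffers
            block = buf.get()
            if block is None:
                live.remove(buf)    # this shard is exhausted
            else:
                process_block(block)
```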
References:

[1] R. Bekkerman. The present and future of the KDD cup competition: an outsider's perspective. (XGBoost application)
[2] R. Bekkerman, M. Bilenko, and J. Langford. Scaling Up Machine Learning: Parallel and Distributed Approaches. Cambridge University Press, New York, NY, USA, 2011. (parallel and distributed design)
[3] J. Bennett and S. Lanning. The Netflix Prize. In Proceedings of the KDD Cup Workshop 2007, pages 3–6, New York, August 2007. (XGBoost application)
[4] L. Breiman. Random forests. Machine Learning, 45(1):5–32, October 2001. (Breiman's random forest paper)
[5] C. Burges. From RankNet to LambdaRank to LambdaMART: An overview. Learning, 11:23–581, 2010.
[6] O. Chapelle and Y. Chang. Yahoo! Learning to Rank Challenge overview. Journal of Machine Learning Research - W&CP, 14:1–24, 2011. (XGBoost application)
[7] T. Chen, H. Li, Q. Yang, and Y. Yu. General functional matrix factorization using gradient boosting. In Proceedings of the 30th International Conference on Machine Learning (ICML'13), volume 1, pages 436–444, 2013. (general matrix factorization via gradient boosting)
[8] T. Chen, S. Singh, B. Taskar, and C. Guestrin. Efficient second-order gradient boosting for conditional random fields. In Proceedings of the 18th Artificial Intelligence and Statistics Conference (AISTATS'15), volume 1, 2015. (conditional random fields with second-order boosting)
[9] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin. LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research, 9:1871–1874, 2008. (XGBoost application)
[10] J. Friedman. Greedy function approximation: a gradient boosting machine. Annals of Statistics, 29(5):1189–1232, 2001. (GBM's greedy algorithm)
[11] J. Friedman. Stochastic gradient boosting. Computational Statistics & Data Analysis, 38(4):367–378, 2002. (stochastic gradient boosting)
[12] J. Friedman, T. Hastie, and R. Tibshirani. Additive logistic regression: a statistical view of boosting. Annals of Statistics, 28(2):337–407, 2000. (additive logistic regression)
[13] J. H. Friedman and B. E. Popescu. Importance sampled learning ensembles, 2003. (importance sampled learning)
[14] M. Greenwald and S. Khanna. Space-efficient online computation of quantile summaries. In Proceedings of the 2001 ACM SIGMOD International Conference on Management of Data, pages 58–66, 2001.
[15] X. He, J. Pan, O. Jin, T. Xu, B. Liu, T. Xu, Y. Shi, A. Atallah, R. Herbrich, S. Bowers, and J. Q. Candela. Practical lessons from predicting clicks on ads at Facebook. In Proceedings of the Eighth International Workshop on Data Mining for Online Advertising, ADKDD'14, 2014. (XGBoost application)
[16] P. Li. Robust LogitBoost and adaptive base class (ABC) LogitBoost. In Proceedings of the 26th Annual Conference on Uncertainty in Artificial Intelligence (UAI'10), pages 302–311, 2010. (LogitBoost)
[17] P. Li, Q. Wu, and C. J. Burges. McRank: Learning to rank using multiple classification and gradient boosting. In Advances in Neural Information Processing Systems 20, pages 897–904, 2008. (multi-class application)
[18] X. Meng, J. Bradley, B. Yavuz, E. Sparks, S. Venkataraman, D. Liu, J. Freeman, D. Tsai, M. Amde, S. Owen, D. Xin, R. Xin, M. J. Franklin, R. Zadeh, M. Zaharia, and A. Talwalkar. MLlib: Machine learning in Apache Spark. Journal of Machine Learning Research, 17(34):1–7, 2016. (distributed machine learning design)
[19] B. Panda, J. S. Herbach, S. Basu, and R. J. Bayardo. PLANET: Massively parallel learning of tree ensembles with MapReduce. Proceedings of the VLDB Endowment, 2(2):1426–1437, August 2009. (distributed machine learning design)
[20] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011. (sklearn)
[21] G. Ridgeway. Generalized Boosted Models: A guide to the gbm package.
[22] S. Tyree, K. Weinberger, K. Agrawal, and J. Paykin. Parallel boosted regression trees for web search ranking. In Proceedings of the 20th International Conference on World Wide Web, pages 387–396. ACM, 2011.
[23] J. Ye, J.-H. Chow, J. Chen, and Z. Zheng. Stochastic gradient boosted distributed decision trees. In CIKM'09: Proceedings of the 18th ACM Conference on Information and Knowledge Management, 2009.
[24] Q. Zhang and W. Wang. A fast algorithm for approximate quantiles in high speed data streams. In Proceedings of the 19th International Conference on Scientific and Statistical Database Management, 2007. (fast approximate quantile computation)
[25] T. Zhang and R. Johnson. Learning nonlinear functions using regularized greedy forest. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(5), 2014.