This work mainly considers the range of values taken by the molecular descriptors when a compound shows good biological activity and at least three of the five given ADMET properties are good. Proceeding from the actual situation, the problem is approached from two perspectives: the independent variables and the sample set. First, the sample data are cleaned a second time: molecular descriptors with a negative correlation coefficient between the independent variable and the dependent variable, or with a variance inflation factor greater than 10, are eliminated. Then the samples for which at least three of the ADMET properties are good are selected as the new training set. The remaining independent variables and the new training set are combined to establish an XGBoost molecular activity prediction model, which is validated against the actual values. Finally, a significant association analysis between the variables is carried out with an association rule algorithm, and intervals are specified to obtain the value range of each molecular descriptor. All simulations are based on the R project for statistical computing and Statistical Product and Service Solutions (SPSS).
2.1 Pearson correlation coefficient
The Pearson correlation coefficient measures linear correlation. A value of 0 only indicates that there is no linear correlation between the independent variable and the dependent variable, not that there is no correlation at all. The larger the absolute value of the correlation coefficient, the stronger the correlation: the closer the coefficient is to 1 or -1, the stronger the correlation; the closer it is to 0, the weaker the correlation. The Pearson correlation coefficient between each molecular descriptor and pIC50 was calculated, and variables with a correlation coefficient greater than 0.3 were selected. The Pearson correlation coefficient is computed as:

$$r = \frac{\sum_{i=1}^{n}\left(x_i-\bar{x}\right)\left(y_i-\bar{y}\right)}{\sqrt{\sum_{i=1}^{n}\left(x_i-\bar{x}\right)^2}\,\sqrt{\sum_{i=1}^{n}\left(y_i-\bar{y}\right)^2}}$$

where $x_i$ and $y_i$ are the values of the independent and dependent variables for sample $i$, and $\bar{x}$, $\bar{y}$ are their means.
By calculating the Pearson correlation coefficient, 83 independent variables with a correlation coefficient greater than 0.3 were selected from the 489 independent variables.
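As an illustration, this screening step takes only a few lines of R; the sketch below assumes the descriptors are columns of a data frame `descriptors` and the activity values form a vector `pIC50` (both names are hypothetical):

```r
# Pearson correlation of each molecular descriptor with pIC50
r <- sapply(descriptors, function(x) cor(x, pIC50, method = "pearson"))

# Keep the descriptors whose correlation coefficient exceeds 0.3
selected <- descriptors[, r > 0.3, drop = FALSE]
```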
2.2 Screening of variables and samples
After variable screening with the Pearson correlation coefficient, sample screening is carried out. For a sample to be retained, at least three of the five given ADMET properties must be good. For the regularized (0/1) data, the sum $I$ of the five ADMET values is calculated for each sample; if it is greater than or equal to 3, the sample is retained, otherwise it is discarded. The specific condition is:

$$I = \sum_{j=1}^{5} A_j \geq 3, \qquad A_j \in \{0, 1\}$$

where $A_j = 1$ indicates that the $j$-th ADMET property of the sample is good.
After this round of sample screening, 632 eligible samples were obtained from 1974 samples.
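A minimal R sketch of this sample screening, assuming `admet` is a 0/1 matrix with one column per ADMET property (1 = good) and `samples` holds the corresponding descriptor data (both names are hypothetical):

```r
# Sum of the five binarized ADMET labels for each sample
I <- rowSums(admet)

# Retain only the samples with at least three good properties
train <- samples[I >= 3, ]
```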
2.3 Variance Inflation Factor
For the variables retained so far, the variance inflation factor (VIF) is further calculated, and variables that are strongly correlated with the other variables are removed to obtain the main variables of the problem. The VIF is mostly used to test the independence of linear relationships: it can be expressed as the ratio of the variance of the estimated regression coefficient to the variance that would be obtained if the independent variables were not linearly correlated. The VIF measures the severity of multicollinearity in a multiple linear regression model, and its specific formula is as follows:

$$\mathrm{VIF}_i = \frac{1}{1 - R_i^2}$$

where $R_i^2$ is the coefficient of determination obtained by regressing the $i$-th independent variable on all the other independent variables.
The closer the value of VIF is to 1, the milder the multicollinearity, and vice versa. VIF = 10 is usually taken as the criterion: when VIF < 10 there is no multicollinearity; when 10 ≤ VIF < 100 there is strong multicollinearity between the variables; when VIF ≥ 100 the multicollinearity is considered severe. Using the 632 samples, the VIF was calculated for the 83 remaining independent variables with pIC50 as the dependent variable, and an independent variable was retained if its VIF was less than or equal to 10. Finally, 632 samples were obtained with ATSc4, C1SP3, minHBint10, maxssCH2 and MDEC_22 as independent variables and pIC50 as the dependent variable.
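The VIF screening can be reproduced with the car package in R; a sketch under the assumption that the 83 screened descriptors and pIC50 sit in a data frame `train83` (the name is hypothetical; a single-pass filter is shown, whereas in practice removal is often iterative because VIF values change after each deletion):

```r
library(car)  # provides vif()

# Regress pIC50 on the descriptors retained after the Pearson screening
fit <- lm(pIC50 ~ ., data = train83)

# Variance inflation factor of each independent variable
v <- vif(fit)

# Retain the variables with VIF <= 10 (weak multicollinearity)
keep <- names(v)[v <= 10]
```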
2.4 Data Generalization
Using the mean–standard deviation classification method, the original numerical data are generalized and a new set is constructed. SPSS was used for visual discretization of the data. Based on the mean and standard deviation of the scanned cases, the values of a variable are divided into 2i + 1 groups by adding and subtracting i standard deviations (i = 1, 2, 3) from the mean of the variable; n cut points generate n + 1 intervals. In this paper, taking the actual situation and the maximum value of the existing data into account, the mean plus or minus 1, 2 and 3 standard deviations is selected as the set of cut points, and the observed sample maximum is used to further refine each interval. Although some of the resulting intervals extend to negative values, no sample data fall into those intervals, so such intervals are not considered in the subsequent association analysis.
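The same mean–standard deviation binning can be sketched in base R for a single descriptor `x` (the name is hypothetical; SPSS visual binning was used in the actual study):

```r
m <- mean(x); s <- sd(x)

# Cut points at the mean plus/minus 1, 2 and 3 standard deviations
cuts <- m + c(-3, -2, -1, 1, 2, 3) * s   # 6 cut points -> 7 intervals

# Open-ended outer intervals absorb values beyond mean +/- 3 sd
x_binned <- cut(x, breaks = c(-Inf, cuts, Inf))
table(x_binned)  # intervals that receive no samples are ignored later
```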
2.5 Apriori Algorithm
The association rule algorithm finds relationships between itemsets in known data and extracts strong association rules. Association rules are commonly evaluated with three indicators (support, confidence and lift) that express the significance and correctness of a rule; they are calculated as follows:

$$\mathrm{Support}(X \Rightarrow Y) = P(X \cup Y)$$

$$\mathrm{Confidence}(X \Rightarrow Y) = P(Y \mid X) = \frac{\mathrm{Support}(X \cup Y)}{\mathrm{Support}(X)}$$

$$\mathrm{Lift}(X \Rightarrow Y) = \frac{\mathrm{Confidence}(X \Rightarrow Y)}{\mathrm{Support}(Y)}$$

where $X$ and $Y$ are disjoint itemsets.
The Apriori algorithm can be divided into the following steps (an R sketch with the arules package follows the list):
(a) Scan the database and search the itemsets from the bottom up, comparing each against the minimum support threshold; the itemsets that pass the threshold form the frequent 1-itemsets, denoted L1. Set k = 1.
(b) Set k = k + 1 and generate the new candidate k-itemsets. Delete every candidate k-itemset that has a (k-1)-item subset not belonging to Lk-1, and denote the filtered candidate set as Ck.
(c) Compute the support of each itemset in Ck and check that it is not lower than the preset minimum support; unqualified itemsets are deleted, yielding the frequent itemsets Lk.
(d) Determine whether all candidate itemsets have been searched. If so, go to the next step; otherwise, return to step (b) until the search is complete.
(e) Extract the significant association rules and make further decisions.
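These steps are implemented in the arules package in R; a minimal sketch, assuming the discretized descriptor bins are stored as factor columns of a data frame `binned_data` (hypothetical name) and using illustrative support/confidence thresholds rather than the values of this study:

```r
library(arules)

# Treat each sample's combination of descriptor bins as a transaction
trans <- as(binned_data, "transactions")

# Mine association rules; supp, conf and minlen here are assumed values
rules <- apriori(trans,
                 parameter = list(supp = 0.1, conf = 0.8, minlen = 2))

# Rank the rules by lift to pick out the significant associations
inspect(head(sort(rules, by = "lift"), 10))
```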
2.6 XGBoost Algorithm
XGBoost (Extreme Gradient Boosting) is a massively parallel boosted-tree method and one of the most widely used open-source tree boosting toolkits.[9] Boosting is a machine learning technique for reducing bias in supervised learning. Most boosting algorithms iteratively train weak learners and add their results to a final strong learner.[10,11] The weak learners are usually given different weights according to their classification accuracy, and after each weak learner is added the data are usually re-weighted to reinforce the treatment of previously mispredicted data points. The central idea of the XGBoost algorithm is to perform a second-order Taylor expansion of the objective function around the prediction of the previous iteration, and to introduce regularization terms that control the complexity of the model.
The objective function can be defined as:

$$\mathrm{Obj} = \sum_{i=1}^{n} l\left(y_i, \hat{y}_i\right) + \sum_{k=1}^{K} \Omega\left(f_k\right), \qquad \Omega(f) = \gamma T + \frac{1}{2}\lambda \lVert w \rVert^2$$

where $l$ is the loss between the prediction $\hat{y}_i$ and the label $y_i$, $f_k$ is the $k$-th tree, $T$ is the number of leaves of a tree and $w$ its leaf weights.
The newly generated tree needs to fit the residual of the previous prediction, so when the $t$-th tree is generated the objective function is rewritten as:

$$\mathrm{Obj}^{(t)} = \sum_{i=1}^{n} l\left(y_i,\ \hat{y}_i^{(t-1)} + f_t(x_i)\right) + \Omega(f_t) + \text{constant}$$
A second-order Taylor expansion of the objective function gives:

$$\mathrm{Obj}^{(t)} \simeq \sum_{i=1}^{n}\left[l\left(y_i, \hat{y}_i^{(t-1)}\right) + g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i)\right] + \Omega(f_t)$$

where $g_i = \partial_{\hat{y}^{(t-1)}} l\left(y_i, \hat{y}^{(t-1)}\right)$ and $h_i = \partial^2_{\hat{y}^{(t-1)}} l\left(y_i, \hat{y}^{(t-1)}\right)$ are the first- and second-order derivatives of the loss.
Since the prediction score of the first $t-1$ trees and the loss $l\left(y_i, \hat{y}_i^{(t-1)}\right)$ are constants that do not affect the optimization of the objective function, the objective function can be simplified as:

$$\mathrm{Obj}^{(t)} \simeq \sum_{i=1}^{n}\left[g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i)\right] + \Omega(f_t)$$
Combined with the above formula, the final objective function can be obtained:[12]

$$\mathrm{Obj}^{(t)} = -\frac{1}{2} \sum_{j=1}^{T} \frac{\left(\sum_{i \in I_j} g_i\right)^2}{\sum_{i \in I_j} h_i + \lambda} + \gamma T$$

where $I_j$ is the set of samples assigned to leaf $j$.
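A minimal sketch of the resulting regression model with the xgboost package in R, using the five retained descriptors; the hyperparameter values are illustrative assumptions, not the tuned settings of this study:

```r
library(xgboost)

# Design matrix of the five retained descriptors, pIC50 as the label
X <- as.matrix(train[, c("ATSc4", "C1SP3", "minHBint10",
                         "maxssCH2", "MDEC_22")])
dtrain <- xgb.DMatrix(data = X, label = train$pIC50)

# Squared-error regression; eta, max_depth and nrounds are assumed values
model <- xgb.train(params = list(objective = "reg:squarederror",
                                 eta = 0.1, max_depth = 6,
                                 lambda = 1, gamma = 0),
                   data = dtrain, nrounds = 200)

# Predictions on the training samples, to be compared with actual pIC50
pred <- predict(model, X)
```

The lambda and gamma parameters correspond to the $\lambda$ and $\gamma$ regularization terms in the objective function above.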