Posted on 4/27/2019 9:53:15 AM

Big data mining is the process of discovering valuable and potentially useful information and knowledge hidden in massive, incomplete, noisy, fuzzy, and random databases, and it is also a decision-support process. It draws mainly on artificial intelligence, machine learning, pattern learning, statistics, and related fields. The commonly used methods and algorithms are summarized below.
(1) Classification. Classification finds the common characteristics of a set of data objects in the database and divides them into different classes according to a classification model; the goal is to map the data items in the database onto a given category. It can be applied to category assignment and trend prediction: for example, a Taobao store can divide users' purchases over a period of time into different categories and recommend related products accordingly, thereby increasing the store's sales. Many algorithms can be used for classification, such as decision trees, k-NN, and Bayesian methods.
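As a hedged illustration (scikit-learn and its bundled iris dataset are assumptions of this sketch, not something the text specifies), a decision tree classifier could be trained and used to assign new records to categories like this:

```python
# Minimal classification sketch using scikit-learn (an assumption; the text
# only names decision trees, k-NN, and Bayesian methods in general terms).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)            # toy dataset standing in for database records
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

clf = DecisionTreeClassifier(max_depth=3, random_state=0)
clf.fit(X_train, y_train)                    # learn the classification model
pred = clf.predict(X_test)                   # map new data items to a category
print("accuracy:", accuracy_score(y_test, pred))
```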
(2) Regression analysis. Regression analysis reflects the characteristics of attribute values in the database and discovers dependencies between attribute values by expressing the mapping between data as a function. It can be applied to the prediction of data series and the study of correlations. In marketing, regression analysis can be applied in many ways; for example, regression analysis of sales in the current quarter can be used to predict the sales trend of the next quarter and make targeted marketing adjustments. Common regression algorithms include Ordinary Least Squares, Logistic Regression, Stepwise Regression, Multivariate Adaptive Regression Splines (MARS), and Locally Estimated Scatterplot Smoothing (LOESS).
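A minimal ordinary least squares sketch, assuming scikit-learn and made-up quarterly sales figures (the numbers are purely illustrative), might look like this:

```python
# Ordinary least squares sketch with scikit-learn (assumed library) on
# synthetic sales-like data; the fitted trend is extrapolated one quarter ahead.
import numpy as np
from sklearn.linear_model import LinearRegression

quarters = np.arange(1, 9).reshape(-1, 1)            # past 8 quarters
sales = np.array([10, 12, 13, 15, 18, 20, 21, 24])   # made-up sales figures

model = LinearRegression().fit(quarters, sales)       # fit y = a*x + b
next_quarter = model.predict([[9]])                   # predict the next quarter
print("predicted sales for quarter 9:", next_quarter[0])
```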
(3) Clustering. Clustering is similar to classification, but unlike classification it divides a set of data into categories based only on the similarities and differences among the data themselves, without predefined classes. The similarity between data in the same category is high, while the similarity between data in different categories is low. Common clustering algorithms include the k-Means algorithm and expectation maximization (EM).
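A k-Means sketch, assuming scikit-learn and synthetic two-dimensional data (both assumptions of this example), could look like this:

```python
# k-Means sketch with scikit-learn (assumed): groups points so that items in
# the same cluster are similar and items in different clusters are not.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (50, 2)),    # one blob around (0, 0)
               rng.normal(5, 0.5, (50, 2))])   # another blob around (5, 5)

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print("cluster centers:\n", km.cluster_centers_)
print("first five labels:", km.labels_[:5])
```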
(4) Association rules. Association rules describe hidden associations or relationships between data items; that is, the occurrence of one data item can be used to infer the occurrence of others. Association rule mining has two main stages: the first stage finds all frequent (high-frequency) itemsets in the raw data, and the second stage generates association rules from these frequent itemsets. Association rule mining has been widely used in the financial industry to predict customer needs; for example, banks improve their marketing by bundling information that customers may be interested in on their ATMs so that users can see and obtain it. Common algorithms include the Apriori algorithm and the Eclat algorithm.
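The two-stage idea can be sketched in plain Python on made-up shopping-basket transactions; this is a simplified Apriori-style pass over item pairs, not a full implementation of the algorithm:

```python
# Frequent-itemset / association-rule sketch in plain Python on hypothetical
# transactions (items, support threshold, and data are illustrative only).
from itertools import combinations
from collections import Counter

transactions = [
    {"milk", "bread", "butter"},
    {"bread", "butter"},
    {"milk", "bread"},
    {"milk", "butter"},
    {"bread", "butter", "jam"},
]
min_support = 0.4  # itemset must appear in at least 40% of transactions

# Stage 1: find frequent item pairs (the "high-frequency itemsets").
pair_counts = Counter(frozenset(p) for t in transactions for p in combinations(sorted(t), 2))
frequent = {p: c / len(transactions) for p, c in pair_counts.items()
            if c / len(transactions) >= min_support}

# Stage 2: derive rules A -> B with their confidence from the frequent pairs.
item_counts = Counter(i for t in transactions for i in t)
for pair, support in frequent.items():
    a, b = tuple(pair)
    conf_ab = support * len(transactions) / item_counts[a]
    print(f"{a} -> {b}: support={support:.2f}, confidence={conf_ab:.2f}")
```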
(5) Neural network method. As an advanced artificial intelligence technique, neural networks are well suited to nonlinear problems and to problems characterized by vague, incomplete, or inaccurate knowledge or data, which makes them a natural fit for data mining. Typical neural network models fall into three categories: the first is feedforward networks for classification, prediction, and pattern recognition, represented by functional networks and perceptrons; the second is feedback networks for associative memory and optimization, represented by Hopfield's discrete and continuous models; the third is self-organizing maps for clustering, represented by the ART model. Although there are many neural network models and algorithms, there is no uniform rule about which to use for a specific data mining task, and it is difficult for people to interpret a network's learning and decision-making process.
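As a rough illustration of a feedforward network (the XOR data, layer sizes, and learning rate are arbitrary choices for this sketch, not taken from the text), a tiny NumPy version might look like this:

```python
# Tiny feedforward network sketch in NumPy (illustrative only): one hidden
# layer trained by gradient descent on XOR, a classic nonlinear problem.
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)      # XOR targets

W1, b1 = rng.normal(size=(2, 4)), np.zeros(4)        # input -> hidden
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)        # hidden -> output
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

for _ in range(5000):
    h = sigmoid(X @ W1 + b1)                         # forward pass
    out = sigmoid(h @ W2 + b2)
    d_out = (out - y) * out * (1 - out)              # backpropagation
    d_h = d_out @ W2.T * h * (1 - h)
    W2 -= 0.5 * h.T @ d_out;  b2 -= 0.5 * d_out.sum(axis=0)
    W1 -= 0.5 * X.T @ d_h;    b1 -= 0.5 * d_h.sum(axis=0)

print(out.round(3))   # outputs should approach [0, 1, 1, 0]
```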
(6) Web data mining. Web data mining is a comprehensive technique: it discovers an implicit pattern P from the document structure and usage collection C on the Web. If C is regarded as the input and P as the output, the web mining process can be viewed as a mapping from input to output. More and more web data now arrives in the form of data streams, so mining web data streams is of great significance. Commonly used web data mining algorithms include the PageRank algorithm, the HITS algorithm, and the LOGSOM algorithm. The users considered by these three algorithms are general users; individual users are not distinguished. Web data mining currently faces several open problems, including user classification, the timeliness of website content, how long users stay on a page, and the number of inbound and outbound links of a page. With web technology developing rapidly, these problems are still worth studying and solving.
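A small PageRank sketch in the power-iteration form, on a hypothetical four-page link graph (the pages, links, and damping factor are assumptions of this example), might look like this:

```python
# PageRank sketch (power-iteration form) on a tiny made-up link graph.
damping = 0.85
links = {                         # page -> pages it links to
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}
pages = list(links)
rank = {p: 1.0 / len(pages) for p in pages}

for _ in range(50):               # iterate until the ranks stabilise
    new_rank = {}
    for p in pages:
        incoming = sum(rank[q] / len(links[q]) for q in pages if p in links[q])
        new_rank[p] = (1 - damping) / len(pages) + damping * incoming
    rank = new_rank

print({p: round(r, 3) for p, r in rank.items()})
```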
(7) Deep learning. Deep learning algorithms are a development of artificial neural networks. They have recently attracted a great deal of attention, especially in China after Baidu began to invest in deep learning. In a world where computing power is becoming ever cheaper, deep learning attempts to build much larger and more complex neural networks. Many deep learning algorithms are semi-supervised, designed to handle large datasets in which only a small portion of the data is labeled. Common deep learning algorithms include Restricted Boltzmann Machines (RBM), Deep Belief Networks (DBN), convolutional networks, and stacked auto-encoders.
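A minimal stacked auto-encoder sketch, assuming TensorFlow/Keras and random stand-in data (neither of which the text prescribes), could look like this:

```python
# Minimal stacked auto-encoder sketch with Keras (TensorFlow assumed to be
# installed); layer sizes and the random data are arbitrary illustrations.
import numpy as np
from tensorflow import keras

x = np.random.rand(1000, 64).astype("float32")                 # unlabeled data

inputs = keras.Input(shape=(64,))
encoded = keras.layers.Dense(32, activation="relu")(inputs)
encoded = keras.layers.Dense(8, activation="relu")(encoded)    # compressed code
decoded = keras.layers.Dense(32, activation="relu")(encoded)
decoded = keras.layers.Dense(64, activation="sigmoid")(decoded)

autoencoder = keras.Model(inputs, decoded)
autoencoder.compile(optimizer="adam", loss="mse")
autoencoder.fit(x, x, epochs=5, batch_size=32, verbose=0)      # reconstruct the input
print("reconstruction loss:", autoencoder.evaluate(x, x, verbose=0))
```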
(8) Ensemble algorithms. Ensemble algorithms train a number of relatively weak learning models independently on the same samples and then integrate their results to make an overall prediction. The main difficulties are deciding which weak learners to combine and how to combine their results. This is a very powerful and very popular class of algorithms. Common algorithms include Boosting, Bootstrapped Aggregation (Bagging), AdaBoost, Stacked Generalization (Blending), Gradient Boosting Machines (GBM), and Random Forests.
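A hedged ensemble sketch, assuming scikit-learn and a synthetic dataset (both choices of this example), comparing a single weak learner against bagging- and boosting-style ensembles:

```python
# Ensemble sketch with scikit-learn (assumed): a bagging-style random forest
# and a boosting-style GBM compared against a single shallow decision tree.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

for name, model in [
    ("single tree", DecisionTreeClassifier(max_depth=3, random_state=0)),
    ("random forest (bagging)", RandomForestClassifier(n_estimators=100, random_state=0)),
    ("gradient boosting (GBM)", GradientBoostingClassifier(random_state=0)),
]:
    score = cross_val_score(model, X, y, cv=5).mean()
    print(f"{name}: {score:.3f}")
```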
In addition, dimensionality reduction is very important in data analysis. Like clustering algorithms, dimensionality reduction algorithms try to analyze the internal structure of the data, but they do so in an unsupervised way, trying to summarize or explain the data with less information. These algorithms can be used to visualize high-dimensional data or to simplify data for supervised learning. Common algorithms include Principal Component Analysis (PCA), Partial Least Squares Regression (PLS), Sammon Mapping, Multi-Dimensional Scaling (MDS), Projection Pursuit, etc.
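A PCA sketch, assuming scikit-learn and its bundled digits dataset (assumptions of this example), might look like this:

```python
# PCA sketch with scikit-learn (assumed): summarise high-dimensional data with
# a few principal components, e.g. for visualisation before supervised learning.
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, _ = load_digits(return_X_y=True)          # 64-dimensional digit images
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)                  # project onto 2 components

print("original shape:", X.shape, "-> reduced shape:", X_2d.shape)
print("variance explained:", round(pca.explained_variance_ratio_.sum(), 3))
```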
For a detailed analysis of the advantages and disadvantages of these algorithms and for guidance on algorithm selection, take a look at the suitable scenarios, advantages, and disadvantages of several commonly used algorithms in the blog referenced below (it is very good).
The following paragraph is quoted from that blog as an algorithm selection reference:
I have translated some foreign articles before, and one of them gives a simple algorithm-selection technique:
Start with a simple model; if its performance is not good, its results can still serve as a baseline for comparison with other algorithms.
Then try decision trees (random forests) to see whether they dramatically improve model performance. Even if you do not use them as the final model, you can use a random forest to remove noise variables and select features.
If the number of features and the number of observations are both very large, then, when resources and time are sufficient (this premise is important), using an SVM is an option.
As a rough rule of thumb: XGBoost >= GBDT >= SVM >= RF >= AdaBoost >= others. Deep learning is now very popular and used in many fields; it is based on neural networks. I am still learning it myself and my theoretical understanding is not yet deep enough, so I will not introduce it here.
Algorithms are important, but good data beats a good algorithm, and designing good features is of great benefit. If you have a very large dataset, then whichever algorithm you use may not affect classification performance much, so you can choose based on speed and ease of use.
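As a sketch of the selection workflow described above (scikit-learn, its breast-cancer dataset, and logistic regression as the baseline are all assumptions of this example, not prescriptions from the quoted blog), the candidate models could be compared with cross-validation:

```python
# Sketch of the selection workflow: fit a simple baseline first, then check
# whether a random forest or an SVM gives a clear improvement.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

candidates = {
    "baseline (logistic regression)": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "random forest": RandomForestClassifier(n_estimators=200, random_state=0),
    "SVM (RBF kernel)": make_pipeline(StandardScaler(), SVC()),
}
for name, model in candidates.items():
    print(f"{name}: {cross_val_score(model, X, y, cv=5).mean():.3f}")
```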