Understanding before you get started with Hadoop

Little scum · Posted on 12/8/2017 1:33:48 PM

What is hadoop?
(1) Hadoop is an open-source framework for writing and running distributed applications to process large-scale data, designed for offline and large-scale data analysis, and is not suitable for the online transaction processing model of random reads and writes to several records. Hadoop = HDFS (file system, data storage technology related) + Mapreduce (data processing), Hadoop's data source can be in any form, it has better performance than relational databases in processing semi-structured and unstructured data, and has more flexible processing capabilities, regardless of whether any data form will eventually be converted into key/value, key/value is the basic data unit. Use functional expressions to replace SQL with Mapreduce, SQL is a query statement, and Mapreduce uses scripts and code, while for relational databases, Hadoop, which is accustomed to SQL, has an open source tool hive instead.
(2) Hadoop is a distributed computing solution.

What can hadoop do?
In 2009, 30% of non-programmers on Facebook used HiveQL for data analysis. Hive is also used for custom filters in Taobao search; Pig can also be used for advanced data processing, including Twitter and LinkedIn to discover people you may know, and can achieve Amazon.com-like collaborative filtering recommendation effects. Taobao's product recommendations are also recommended! In Yahoo! The40% of Hadoop jobs are run with pig, including spam identification and filtering, as well as user signature modeling. (New update on August 25, 2012, Tmall's recommendation system is hive, try mahout in small quantities!) ）
The latest version of hadoop download address: http://hadoop.apache.org/releases.html

Build and install Hadoop 2.x or later on Windows, link: https://wiki.apache.org/hadoop/Hadoop2OnWindows

1. Introduction

Hadoop version 2.2 and above includes native support for Windows. The official Apache Hadoop version does not include Windows binaries (as of January 2014). However, building a Windows package from source is fairly simple.

Hadoop is a complex system with many components. It's helpful to do some familiarization before attempting to build or install, or at a high level for the first time. If you need troubleshooting, you need to be familiar with Java.

Hadoop developers used Windows Server 2008 and Windows Server 2008 R2 during development and testing。 Windows Vista and Windows 7 may also work due to the similarity of the Win32 API to the respective server SKU. We haven't tested it on Windows XP or any earlier version of Windows, which is unlikely. Any issues reported in Windows XP or earlier versions will be considered invalid.

Do not try to run the installation in Cygwin. Cygwin neither requests nor supports it.

Understanding before you get started with Hadoop

Related Posts

Sections viewed