Tuesday, November 15, 2016

Loading Data into HBase from HDFS (MapR-FS/MFS)


Step 1: HDFS file data

[root@localhost mapr]# hadoop fs -cat /user/mapr/sample.txt
100,sachin,Mum,India,Cricket
200,Samantha,Chennai,india,Actress
300,Modi,Ahd,Ind,politician

Step 2: Creating a table in HBase

hbase(main):006:0> create 'blkimport','cf'
0 row(s) in 1.2990 seconds

=> Hbase::Table - blkimport
hbase(main):007:0> list
blkimport


hbase(main):010:0> scan 'blkimport'
ROW                                   COLUMN+CELL
0 row(s) in 0.0180 seconds

Step 3: Importing data from HDFS into HBase using ImportTsv

[mapr@localhost hbase-1.1.1]$ /opt/mapr/hbase/hbase-1.1.1/bin/hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.separator=',' -Dimporttsv.columns='HBASE_ROW_KEY,cf:pname,cf:loc,cf:country,cf:prefession'   blkimport /user/mapr/sample.txt

Here /user/mapr/sample.txt --> HDFS source file
     blkimport             --> HBase table
     cf                    --> column family
     HBASE_ROW_KEY         --> row key
     pname,loc,country,prefession --> column names
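
To spot-check a single imported row before running a full scan, a get on one of the row keys works as well (a quick sketch; '100' is one of the row keys from the sample file):

hbase> get 'blkimport', '100'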

Step 4: Scanning the HBase table

hbase(main):011:0> scan 'blkimport'
ROW                                   COLUMN+CELL
 100                                  column=cf:country, timestamp=1477509673574, value=India
 100                                  column=cf:loc, timestamp=1477509673574, value=Mum
 100                                  column=cf:pname, timestamp=1477509673574, value=sachin
 100                                  column=cf:prefession, timestamp=1477509673574, value=Cricket
 200                                  column=cf:country, timestamp=1477509673574, value=india
 200                                  column=cf:loc, timestamp=1477509673574, value=Chennai
 200                                  column=cf:pname, timestamp=1477509673574, value=Samantha
 200                                  column=cf:prefession, timestamp=1477509673574, value=Actress
 300                                  column=cf:country, timestamp=1477509673574, value=Ind
 300                                  column=cf:loc, timestamp=1477509673574, value=Ahd
 300                                  column=cf:pname, timestamp=1477509673574, value=Modi
 300                                  column=cf:prefession, timestamp=1477509673574, value=politician
3 row(s) in 0.0570 seconds



Spark-Hive integration



1. Connecting using the Hive Thrift Server (port 10000)
2. Connecting using the Spark Thrift Server (port 10001)

Environment:

# rpm -qa | grep mapr
mapr-hivemetastore-1.2.201609261225-1.noarch
mapr-spark-historyserver-1.6.1.201609271200-1.noarch
mapr-hiveserver2-1.2.201609261225-1.noarch
mapr-spark-1.6.1.201609271200-1.noarch
mapr-mapreduce2-2.7.0.37549.GA-1.x86_64
mapr-hive-1.2.201609261225-1.noarch

Configuration steps:

Add the properties below to the spark-defaults.conf file:
# vi /opt/mapr/spark/spark-1.6.1/conf/spark-defaults.conf
# Spark SQL and Hive integration
spark.yarn.dist.files=/opt/mapr/hive/hive-1.2/conf/hive-site.xml,/opt/mapr/hive/hive-1.2/lib/datanucleus-api-jdo-4.2.1.jar,/opt/mapr/hive/hive-1.2/lib/datanucleus-core-4.1.6.jar,/opt/mapr/hive/hive-1.2/lib/datanucleus-rdbms-4.1.7.jar
spark.sql.hive.metastore.version=1.2.0
spark.sql.hive.metastore.jars=/opt/mapr/hadoop/hadoop-2.7.0/etc/hadoop:/opt/mapr/hadoop/hadoop-2.7.0/share/hadoop/common/lib/*:/opt/mapr/hadoop/hadoop-2.7.0/share/hadoop/common/*:/opt/mapr/hadoop/hadoop-2.7.0/share/hadoop/mapreduce/*:/opt/mapr/hadoop/hadoop-2.7.0/share/hadoop/yarn/*:/opt/mapr/hive/hive-1.2/lib/accumulo-core-1.6.0.jar:/opt/mapr/hive/hive-1.2/lib/hive-contrib-1.2.0-mapr-1609.jar:/opt/mapr/hive/hive-1.2/lib/*
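
Before restarting anything, it is worth confirming that the hive-site.xml and the datanucleus jars referenced in spark.yarn.dist.files actually exist at those paths (a quick sanity check using the same paths as above):

# ls -l /opt/mapr/hive/hive-1.2/conf/hive-site.xml /opt/mapr/hive/hive-1.2/lib/datanucleus-*.jar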




Then add the export below to the spark-env.sh file:
#vi /opt/mapr/spark/spark-1.6.1/conf/spark-env.sh
export HIVE_SERVER2_THRIFT_PORT=10001
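
Once both files are updated, a quick way to confirm that Spark picks up the Hive metastore is to run a query through the Spark SQL CLI (a sketch, assuming spark-sql is available on the same node; the databases listed should match the Hive CLI output shown further below):

[mapr@VM204 spark-1.6.1]$ ./bin/spark-sql -e "show databases;"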

Start the Spark Thrift Server on port 10001 with the command below:

[root@VM204 spark-1.6.1]# /opt/mapr/spark/spark-1.6.1/sbin/start-thriftserver.sh --hiveconf hive.server2.thrift.port=10001

(OR, since HIVE_SERVER2_THRIFT_PORT=10001 is already exported in spark-env.sh, simply:)

[root@VM204 spark-1.6.1]# /opt/mapr/spark/spark-1.6.1/sbin/start-thriftserver.sh

Check whether the port is listening:

[root@VM204 spark-1.6.1]# netstat -plant | grep 10001
tcp        0      0 :::10001                    :::*                        LISTEN      13588/java
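
The same can be confirmed from the process list, since the Spark Thrift Server runs as a HiveThriftServer2 driver (a quick sketch using jps):

[root@VM204 spark-1.6.1]# jps -lm | grep -i thriftserver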

Checking databases, tables, and data from the Hive CLI:

[mapr@VM204 spark-1.6.1]$ hive

Logging initialized using configuration in file:/opt/mapr/hive/hive-1.2/conf/hive-log4j.properties
hive> show databases;
OK
default
dummy
Time taken: 1.57 seconds, Fetched: 2 row(s)

hive> use default;
OK
Time taken: 0.05 seconds

hive> show tables;
OK
b
b_ext
demotest1
test
Time taken: 0.049 seconds, Fetched: 4 row(s)

hive> select *from b;
OK
10      kuamr
20      nitin
30      shishir
Time taken: 1.138 seconds, Fetched: 3 row(s)
hive>

Check the same data from Spark's Beeline client:

1. Connecting using the Hive Thrift Server (port 10000)

[root@VM204 spark-1.6.1]# pwd
/opt/mapr/spark/spark-1.6.1

[mapr@VM204 spark-1.6.1]$ ./bin/beeline
Beeline version 1.6.1-mapr-1609 by Apache Hive

beeline> !connect jdbc:hive2://10.10.72.204:10000
Connecting to jdbc:hive2://10.10.72.204:10000
Enter username for jdbc:hive2://10.10.72.204:10000: mapr
Enter password for jdbc:hive2://10.10.72.204:10000: ****   <mapr>
Connected to: Apache Hive (version 1.2.0-mapr-1609)
Driver: Spark Project Core (version 1.6.1-mapr-1609)
Transaction isolation: TRANSACTION_REPEATABLE_READ

0: jdbc:hive2://10.10.72.204:10000> show databases;

+----------------+--+
| database_name  |
+----------------+--+
| default        |
| dummy          |
+----------------+--+

2. Connecting using the Spark Thrift Server (port 10001)
[mapr@VM204 spark-1.6.1]$ ./bin/beeline
Beeline version 1.6.1-mapr-1609 by Apache Hive

beeline> !connect jdbc:hive2://10.10.72.204:10001
Connecting to jdbc:hive2://10.10.72.204:10001
Enter username for jdbc:hive2://10.10.72.204:10001: mapr
Enter password for jdbc:hive2://10.10.72.204:10001: ****    <mapr>
Connected to: Spark SQL (version 1.6.1)
Driver: Spark Project Core (version 1.6.1-mapr-1609)
Transaction isolation: TRANSACTION_REPEATABLE_READ

0: jdbc:hive2://10.10.72.204:10001> show databases;
+----------+--+
|  result  |
+----------+--+
| default  |
| dummy    |
+----------+--+
2 rows selected (4.289 seconds)
0: jdbc:hive2://10.10.72.204:10001>

0: jdbc:hive2://10.10.72.204:10001> use default;
+---------+--+
| result  |
+---------+--+
+---------+--+
No rows selected (0.134 seconds)

0: jdbc:hive2://10.10.72.204:10001> show tables;
+------------+--------------+--+
| tableName  | isTemporary  |
+------------+--------------+--+
| b          | false        |
| b_ext      | false        |
| test       | false        |
+------------+--------------+--+
3 rows selected (0.077 seconds)

0: jdbc:hive2://10.10.72.204:10001> select *from b;
+-----+----------+--+
|  a  |    b     |
+-----+----------+--+
| 10  | kuamr    |
| 20  | nitin    |
| 30  | shishir  |
+-----+----------+--+
3 rows selected (1.916 seconds)
0: jdbc:hive2://10.10.72.204:10001>
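
Both connections can also be made non-interactively by passing the JDBC URL, user, and query on the Beeline command line (a sketch; the password placeholder is the same one used above):

[mapr@VM204 spark-1.6.1]$ ./bin/beeline -u jdbc:hive2://10.10.72.204:10001 -n mapr -p mapr -e "select * from b;"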



Loading data into MapR tables (MapR-DB) using ImportTsv


Logging into the HBase shell

[mapr@6VM201 root]$ hbase shell

Creating a MapR-DB table named "/user/mapr/blk"
hbase(main):018:0>  create '/user/mapr/blk','cf'
0 row(s) in 0.0390 seconds

=> Hbase::Table - /user/mapr/blk

File "/user/test.txt"  need to be load into MapRDB table
[root@6VM201 ~]# hadoop fs -cat /user/test.txt
761377,kuamr,hyd,india
761344,nitin,mum,india

Command to load the file into the MapR-DB table using ImportTsv:

[mapr@6VM201 hbase-1.1.1]$ /opt/mapr/hbase/hbase-1.1.1/bin/hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.columns=HBASE_ROW_KEY,cf:name,cf:loc,cf:country -Dimporttsv.separator=',' -Dimporttsv.bulk.output=/user/mapr/nitintab3 /user/mapr/blk /user/test.txt


Here /user/mapr/blk       --> MapR-DB table
     /user/test.txt       --> source file to be loaded into the MapR-DB table
     /user/mapr/nitintab3 --> output directory for the bulk-load files generated via -Dimporttsv.bulk.output

[mapr@6VM201 root]$ hbase shell

Check the data in the MapR-DB table:
hbase(main):019:0> scan '/user/mapr/blk'
ROW                                               COLUMN+CELL
 761344                                           column=cf:country, timestamp=1475197667463, value=india
 761344                                           column=cf:loc, timestamp=1475197667463, value=mum
 761344                                           column=cf:name, timestamp=1475197667463, value=nitin
 761377                                           column=cf:country, timestamp=1475197667463, value=india
 761377                                           column=cf:loc, timestamp=1475197667463, value=hyd
 761377                                           column=cf:name, timestamp=1475197667463, value=kuamr
2 row(s) in 0.0510 seconds
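
A single row can also be fetched directly by its row key (a quick sketch; 761377 is one of the keys from the source file):

hbase> get '/user/mapr/blk', '761377'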

Monday, November 14, 2016

MapR-Sqoop2-ORACLE data import steps


Steps:
1. Creating a link for maprfs
2. Creating a link for Oracle
3. Creating a job
4. Starting the job
5. Checking the status of the job

Environment
mapr-sqoop2-server-2.0.0.201607271151-1.noarch
mapr-sqoop2-client-2.0.0.201607271151-1.noarch

[mapr@VM207 sqoop-2.0.0]$ ./bin/sqoop2-shell
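
If the shell is not already pointed at the Sqoop2 server, the server URL can be set explicitly before running any commands (a sketch; the host and port are taken from the Server URL shown in the job output below):

sqoop:000> set server --host VM207 --port 12000 --webapp sqoop
sqoop:000> show version --all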

sqoop:000> show link
0 [main] WARN org.apache.hadoop.util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
+----+------+--------------+----------------+---------+
| Id | Name | Connector Id | Connector Name | Enabled |
+----+------+--------------+----------------+---------+
+----+------+--------------+----------------+---------+


sqoop:000> show connector
+----+------------------------+------------------+------------------------------------------------------+----------------------+
| Id | Name                   | Version          | Class                                                | Supported Directions |
+----+------------------------+------------------+------------------------------------------------------+----------------------+
| 1  | kite-connector         | 1.99.6-mapr-1607 | org.apache.sqoop.connector.kite.KiteConnector        | FROM/TO              |
| 2  | kafka-connector        | 1.99.6-mapr-1607 | org.apache.sqoop.connector.kafka.KafkaConnector      | TO                   |
| 3  | hdfs-connector         | 1.99.6-mapr-1607 | org.apache.sqoop.connector.hdfs.HdfsConnector        | FROM/TO              |
| 4  | generic-jdbc-connector | 1.99.6-mapr-1607 | org.apache.sqoop.connector.jdbc.GenericJdbcConnector | FROM/TO              |
+----+------------------------+------------------+------------------------------------------------------+----------------------+
sqoop:000>


1. Creating a link for maprfs
sqoop:000> create link -c 3
Creating link for connector with id 3
Please fill following values to create new link object
Name: maprfs
Link configuration
HDFS URI: maprfs://10.10.72.207:7222
Hadoop conf directory: /opt/mapr/hadoop/hadoop-2.7.0/etc/hadoop
New link was successfully created with validation status OK and persistent id 2


2. Creating a link for the Oracle JDBC connection
sqoop:000> create link -c 4
Creating link for connector with id 4
Please fill following values to create new link object
Name: oraclenew
Link configuration
JDBC Driver Class: oracle.jdbc.driver.OracleDriver
JDBC Connection String: jdbc:oracle:thin:@10.10.70.142:1521:VIJAYDB
Username: scott
Password: *****
JDBC Connection Properties:
There are currently 0 values in the map:
entry#
New link was successfully created with validation status OK and persistent id 5
sqoop:000> show link
+----+-----------+--------------+------------------------+---------+
| Id | Name      | Connector Id | Connector Name         | Enabled |
+----+-----------+--------------+------------------------+---------+
| 2  | maprfs    | 3            | hdfs-connector         | true    |
| 5  | oraclenew | 4            | generic-jdbc-connector | true    |
+----+-----------+--------------+------------------------+---------+
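
NOTE: the generic-jdbc-connector can only reach Oracle if the Oracle JDBC driver jar is available to the Sqoop2 server. A rough sketch of how that is usually done (the driver file name and the server lib directory are assumptions and depend on the Sqoop2 packaging):

# cp ojdbc6.jar /opt/mapr/sqoop/sqoop-2.0.0/server/webapps/sqoop/WEB-INF/lib/
# /opt/mapr/sqoop/sqoop-2.0.0/bin/sqoop2-server stop
# /opt/mapr/sqoop/sqoop-2.0.0/bin/sqoop2-server start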


3. Creating a job
sqoop:000> create job --from 5 --to 2
Creating job for links with from id 5 and to id 2
Please fill following values to create new job object
Name: newjob
From database configuration
Schema name:
Table name: TEST_TSS_ORDER_HEADERS_F_V
Table SQL statement:
Table column names:
Partition column name: ORDER_STATUS_LKP_KEY
Null value allowed for the partition column: false
Boundary query:
Incremental read
Check column:
Last value:
To HDFS configuration
Override null value:
Null value:
Output format:
0 : TEXT_FILE
1 : SEQUENCE_FILE
Choose: 0
Compression format:
0 : NONE
1 : DEFAULT
2 : DEFLATE
3 : GZIP
4 : BZIP2
5 : LZO
6 : LZ4
7 : SNAPPY
8 : CUSTOM
Choose: 0
Custom compression format:
Output directory: /user/mapr/ora4
Append mode:
Throttling resources
Extractors:
Loaders:
New job was successfully created with validation status OK and persistent id 11
sqoop:000> show job
+----+-----------+----------------+--------------+---------+
| Id | Name      | From Connector | To Connector | Enabled |
+----+-----------+----------------+--------------+---------+
| 11 | newjob    | 4              | 3            | true    |
+----+-----------+----------------+--------------+---------+


4. Starting the job
sqoop:000> start job -j 11
Submission details
Job ID: 11
Server URL: http://VM207:12000/sqoop/
Created by: mapr
Creation date: 2016-10-27 12:45:48 IST
Lastly updated by: mapr
External ID: job_1476676093162_0026
http://VM203:8088/proxy/application_1476676093162_0026/
Source Connector schema: Schema{name=TEST_TSS_ORDER_HEADERS_F_V,columns=[
  Decimal{name=BL_ORDER_KEY,nullable=true,type=DECIMAL,precision=0,scale=-127},
  Decimal{name=INCIDENT_ID,nullable=true,type=DECIMAL,precision=0,scale=-127},
  Decimal{name=HEADER_ID,nullable=true,type=DECIMAL,precision=0,scale=-127},
  Decimal{name=RMA_NUMBER,nullable=true,type=DECIMAL,precision=0,scale=-127},
  Text{name=CUST_SERIAL_NUMBER,nullable=true,type=TEXT,charSize=null},
  Date{name=ORDER_CREATION_DATE,nullable=true,type=DATE_TIME,hasFraction=true,hasTimezone=false},
  Date{name=ORDER_CLOSE_DATE,nullable=true,type=DATE_TIME,hasFraction=true,hasTimezone=false},
  Decimal{name=ORDER_STATUS_LKP_KEY,nullable=true,type=DECIMAL,precision=0,scale=-127}]}
2016-10-27 12:45:48 IST: BOOTING - Progress is not available


5. Checking the status of the running job:
sqoop:000> status job -j 11
Submission details
Job ID: 11
Server URL: http://VM207:12000/sqoop/
Created by: mapr
Creation date: 2016-10-27 12:45:48 IST
Lastly updated by: mapr
External ID: job_1476676093162_0026
http://VM203:8088/proxy/application_1476676093162_0026/
2016-10-27 12:46:26 IST: RUNNING - 15.00 %
sqoop:000> status job -j 11
Submission details
Job ID: 11
Server URL: http://VM207:12000/sqoop/
Created by: mapr
Creation date: 2016-10-27 12:45:48 IST
Lastly updated by: mapr
External ID: job_1476676093162_0026
http://VM203:8088/proxy/application_1476676093162_0026/
2016-10-27 12:47:36 IST: RUNNING - 45.00 %


Check the loaded file in HDFS.
[root@VM204 hadoop]# hadoop fs -cat /user/mapr/ora4/0119d883-fd76-4431-af85-cffa77f9c9e5.txt
3,3,3,3,'cust1','2014-07-03 06:13:00.000','2014-07-03 06:13:00.000',3
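
Depending on the number of extractors, the import can produce more than one output file; listing the output directory shows everything that was written (a sketch using the same directory as above):

[root@VM204 hadoop]# hadoop fs -ls /user/mapr/ora4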


MapR-Sqoop2-MySQL data import


Environment:
mapr-sqoop2-server-2.0.0.201607271151-1.noarch
mapr-sqoop2-client-2.0.0.201607271151-1.noarch


Steps
1. Checking available links and connectors
2. Creating a link for MySQL
3. Creating a link for maprfs
4. Creating a job
5. Starting the job
6. Checking the status of the job


Step 1:

Check available links

sqoop:000> show link
+----+------+--------------+----------------+---------+
| Id | Name | Connector Id | Connector Name | Enabled |
+----+------+--------------+----------------+---------+
+----+------+--------------+----------------+---------+

Check available Connectors.

sqoop:000> show connector
+----+------------------------+------------------+------------------------------------------------------+----------------------+
| Id | Name                   | Version          | Class                                                | Supported Directions |
+----+------------------------+------------------+------------------------------------------------------+----------------------+
| 1  | kite-connector         | 1.99.6-mapr-1507 | org.apache.sqoop.connector.kite.KiteConnector        | FROM/TO              |
| 2  | kafka-connector        | 1.99.6-mapr-1507 | org.apache.sqoop.connector.kafka.KafkaConnector      | TO                   |
| 3  | hdfs-connector         | 1.99.6-mapr-1507 | org.apache.sqoop.connector.hdfs.HdfsConnector        | FROM/TO              |
| 4  | generic-jdbc-connector | 1.99.6-mapr-1507 | org.apache.sqoop.connector.jdbc.GenericJdbcConnector | FROM/TO              |
+----+------------------------+------------------+------------------------------------------------------+----------------------+


Step 2:
Create a link for the RDBMS from which we would like to import data.
NOTE: Provide the connector Id of the generic-jdbc-connector as the -c argument; in this example the Id is 4.

Creating a link for MySQL:

sqoop:000> create link -c 4
Creating link for connector with id 4
Please fill following values to create new link object
Name: <mysql>
Link configuration
JDBC Driver Class: com.mysql.jdbc.Driver
JDBC Connection String: jdbc:mysql://<DB HostName>/<Database>
Username: <sqoop>
Password: <*****>
JDBC Connection Properties:<Optional>
There are currently 0 values in the map:
entry#
New link was successfully created with validation status OK and persistent id 2

sqoop:000> show link
+----+-------+--------------+------------------------+---------+
| Id | Name  | Connector Id | Connector Name         | Enabled |
+----+-------+--------------+------------------------+---------+
| 2  | mysql | 4            | generic-jdbc-connector | true    |
+----+-------+--------------+------------------------+---------+


Step 3:
Create a link for the import location, i.e. the MFS location.

sqoop:000> create link -c 3
Creating link for connector with id 3
Please fill following values to create new link object
Name: maprfs
Link configuration
HDFS URI: maprfs://<CLDB HostName>:7222
Hadoop conf directory: /opt/mapr/hadoop/hadoop-0.20.2/conf
New link was successfully created with validation status OK and persistent id 4

sqoop:000> show link
+----+--------+--------------+------------------------+---------+
| Id | Name   | Connector Id | Connector Name         | Enabled |
+----+--------+--------------+------------------------+---------+
| 2  | mysql  | 4            | generic-jdbc-connector | true    |
| 4  | maprfs | 3            | hdfs-connector         | true    |
+----+--------+--------------+------------------------+---------+


Step 4:
Creating a job

sqoop:000> create job --from 2 --to 4
Creating job for links with from id 2 and to id 4
Please fill following values to create new job object
Name: tetsjob
From database configuration
Schema name: mysql
Table name: <TableName>
Table SQL statement:<Optional>
Table column names:<Optional>
Partition column name: <Provide a ColumnNamefor Partitioning>
Null value allowed for the partition column: true
Boundary query:<Optional>
Incremental read
Check column:<Optional>
Last value:<Optional>
To HDFS configuration
Override null value:<Optional>
Null value:<Optional>
Output format:
0 : TEXT_FILE
1 : SEQUENCE_FILE
Choose: 0
Compression format:
0 : NONE
1 : DEFAULT
2 : DEFLATE
3 : GZIP
4 : BZIP2
5 : LZO
6 : LZ4
7 : SNAPPY
8 : CUSTOM
Choose: 0
Custom compression format:
Output directory: </MFS LOCATION NAME>
Append mode:<Optional>
Throttling resources
Extractors:<Optional>
Loaders:<Optional>
New job was successfully created with validation status OK and persistent id 12
sqoop:000> show job

Step 5:
Starting the job

sqoop:000> start job -j <Job Id>
Ex: start job -j 12
Submission details
Job ID: 12
Server URL:
Created by: mapr
Creation date:
Lastly updated by: mapr
External ID: job_<ID>
http://<Host>:8088/proxy/application_1461206632562_0005/
Source Connector schema: Schema{TABLE SCHEMA WILL BE DISPLAYED HERE}

Step 6:
Check the status of the running job:
sqoop:000> status job -j 12
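
Once the job reports SUCCEEDED, the imported data can be verified directly in MapR-FS (a sketch; substitute the output directory chosen while creating the job):

# hadoop fs -ls </MFS LOCATION NAME>
# hadoop fs -cat </MFS LOCATION NAME>/*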
