Tuesday, May 02, 2017

Using Hadoop Compression


Hadoop Compression

Hive can read data from a variety of sources, such as text files, sequence files, or even custom formats, using Hadoop's InputFormat APIs, and it can write data to various formats using the OutputFormat API. You can leverage Hadoop to store data compressed and save significant disk space. Compression can also increase throughput and performance. Compressing and decompressing data incurs extra CPU overhead; however, the I/O savings from moving fewer bytes into memory can result in a net performance gain.
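As an illustration of the write side (the table name translog_seq is hypothetical, and the compression settings are the ones discussed later in this post), a CTAS into a SequenceFile container with block compression might look roughly like this:

hive (scott)> set hive.exec.compress.output=true;
hive (scott)> set io.seqfile.compression.type=BLOCK;
hive (scott)> CREATE TABLE translog_seq STORED AS SEQUENCEFILE AS SELECT * FROM translog;

Because SequenceFile is a binary container, the compression is recorded inside the file itself rather than signalled by a file extension.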

Hadoop jobs tend to be I/O bound rather than CPU bound. If so, compression will improve performance. However, if your jobs are CPU bound, then compression will probably lower your performance. The only way to really know is to experiment with different options and measure the results.

Hadoop provides a number of compression schemes, called codecs (a shortened form of compressor/decompressor). Some codecs support splittable compression, in which files larger than the block size are split and the individual splits can be processed in parallel by different mappers. Splittable compression is only a factor for text files; for binary files, Hadoop compression codecs compress data within a binary-encoded container, depending on the file type (for example, a SequenceFile, Avro, or Protocol Buffers).
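A quick way to see which codec classes your cluster advertises is to echo the relevant Hadoop property from the Hive CLI; whether it is set, and which classes it lists, depends entirely on your core-site.xml:

hive (scott)> set io.compression.codecs;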

There are many different compression algorithms and tools, and their characteristics and strengths vary. The most common trade-off is between compression ratios (the degree to which a file is compressed) and compress/decompress speeds.



Below are some common codecs that are supported by the Hadoop framework.

Gzip: generates compressed files that have a .gz extension.

Bzip2: generates a better compression ratio than Gzip, but is much slower.

Snappy: modest compression ratios, but fast compression and decompression speeds.

LZO: similar to Snappy in speed, but supports splittable compression, which enables the parallel processing of compressed text file splits by your MapReduce jobs.
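As a sketch of how one of these codecs is selected (shown with the newer mapreduce.* property name; the older mapred.output.compression.codec also works on many versions), choosing Snappy for the final job output would look roughly like this:

hive (scott)> set hive.exec.compress.output=true;
hive (scott)> set mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;

Note that SnappyCodec relies on the native Snappy libraries being installed on the cluster nodes.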



Create a Compressed Table
First, let's enable intermediate compression. This won't affect the final output; however, the job counters will show less physical data transferred for the job, since the shuffle/sort data was compressed.

hive (scott)> set hive.exec.compress.intermediate=true;


hive (scott)> CREATE TABLE intermediate_comp_translog ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' AS SELECT * FROM translog;
......
Moving data to directory hdfs://nn01:9000/user/hive/warehouse/scott.db/.hive-staging_hive_2017-05-01_10-22-29_762_6373409087183829364-1/-ext-10002
Moving data to directory hdfs://nn01:9000/user/hive/warehouse/scott.db/intermediate_comp_translog
MapReduce Jobs Launched:
......
translog.record_id    translog.emp_id    translog.request_type    translog.json_input    translog.result_code    translog.result_description    translog.error_type    translog.start_time    translog.proccessing_time    translog.request_channel    translog.custom_1    translog.custom_2    translog.custom_3    translog.custom_4    translog.custom_5    translog.custom_6    translog.year
Time taken: 92.632 seconds
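If you want to confirm what the session is actually using, set followed by just a property name (no value) echoes the current setting back:

hive (scott)> set hive.exec.compress.intermediate;
hive (scott)> set mapred.map.output.compression.codec;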


As expected, intermediate compression did not affect the final output, which remains uncompressed, as the listing below shows.

hive (scott)> !hdfs dfs -ls /user/hive/warehouse/scott.db/intermediate_comp_translog;
Found 39 items
-rwxrwxrwx   3 hdpclient supergroup  334958784 2017-05-01 10:23 /user/hive/warehouse/scott.db/intermediate_comp_translog/000000_0
-rwxrwxrwx   3 hdpclient supergroup  249179231 2017-05-01 10:23 /user/hive/warehouse/scott.db/intermediate_comp_translog/000001_0
-rwxrwxrwx   3 hdpclient supergroup  248390875 2017-05-01 10:23 /user/hive/warehouse/scott.db/intermediate_comp_translog/000002_0
-rwxrwxrwx   3 hdpclient supergroup  248096285 2017-05-01 10:23 
.......
-rwxrwxrwx   3 hdpclient supergroup  243161657 2017-05-01 10:24 /user/hive/warehouse/scott.db/intermediate_comp_translog/000036_0
-rwxrwxrwx   3 hdpclient supergroup  243851779 2017-05-01 10:24 /user/hive/warehouse/scott.db/intermediate_comp_translog/000037_0
-rwxrwxrwx   3 hdpclient supergroup   85779553 2017-05-01 10:23 /user/hive/warehouse/scott.db/intermediate_comp_translog/000038_0
hive (scott)>




hive (scott)> !hdfs dfs -cat /user/hive/warehouse/scott.db/intermediate_comp_translog/000000_0;


273483692,1084721533,"LOAD_EMP_PROFILE",1084721533,0,"?? ??? ??????? ?????","",03-DEC-16 02.47.57.856000000 AM,7,"WEB","(Google Chrome): Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36","","","192.168.155.140 ","",""

We can also choose an intermediate compression codec other than the default codec.
hive (scott)> set mapred.map.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
hive (scott)> set hive.exec.compress.intermediate=true;
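Snappy is another common choice for intermediate (shuffle) data because it decompresses very quickly; a minimal sketch, assuming the same Snappy prerequisites mentioned earlier:

hive (scott)> set hive.exec.compress.intermediate=true;
hive (scott)> set mapred.map.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;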

Next, we can enable output compression:


hive (scott)> set hive.exec.compress.output=true;


hive (scott)> CREATE TABLE intermediate_comp_translog_gz ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' AS SELECT * FROM translog;

hive (scott)> !hdfs dfs -ls /user/hive/warehouse/scott.db/intermediate_comp_translog_gz;

Found 39 items
-rwxrwxrwx   3 hdpclient supergroup   34790085 2017-05-01 12:43 /user/hive/warehouse/scott.db/intermediate_comp_translog_gz/000000_0.deflate
-rwxrwxrwx   3 hdpclient supergroup   25862237 2017-05-01 12:43 /user/hive/warehouse/scott.db/intermediate_comp_translog_gz/000001_0.deflate
-rwxrwxrwx   3 hdpclient supergroup   26318840 2017-05-01 12:43 /
....
-rwxrwxrwx   3 hdpclient supergroup   26008312 2017-05-01 12:43 /user/hive/warehouse/scott.db/intermediate_comp_translog_gz/000005_0.deflate
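The .deflate extension indicates that Hadoop's DefaultCodec produced the final files: we only enabled hive.exec.compress.output and left the output codec at its default. If you would rather end up with .gz files, you could point the final-output codec at Gzip before running the CTAS. A sketch (the table name translog_gz2 is only illustrative):

hive (scott)> set hive.exec.compress.output=true;
hive (scott)> set mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.GzipCodec;
hive (scott)> CREATE TABLE translog_gz2 ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' AS SELECT * FROM translog;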

Trying to cat the file is not suggested, as you get binary output. However, Hive can query this data normally.

hive (scott)> !hdfs dfs -cat /user/hive/warehouse/scott.db/intermediate_comp_translog_gz/000000_0.deflate


Observe the file sizes now; the compressed files are roughly 90% smaller than the uncompressed ones.




Now try to query the compressed table:

hive (scott)> select * from  intermediate_comp_translog_gz limit 1;
273483692,1084721533,"LOAD_EMP_PROFILE",1084721533,0,"?? ??? ??????? ?????","",03-DEC-16 02.47.57.856000000 AM,7,"WEB","(Google Chrome): Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36","","","192.168.155.140 ","",""

You get the actual data. This ability to work seamlessly with compressed files is not Hive-specific; Hadoop's TextInputFormat is at work here. TextInputFormat understands file extensions such as .deflate or .gz and decompresses these files on the fly. Hive is unaware whether the underlying files are uncompressed or compressed with any of the supported compression schemes.
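The same transparency applies to files you load into HDFS yourself. As a sketch (the path and column list here are hypothetical), an external table defined over a directory of gzipped, tab-delimited text files can be queried directly, with no compression-related clause in the DDL:

hive (scott)> CREATE EXTERNAL TABLE translog_ext (record_id BIGINT, emp_id BIGINT, request_type STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LOCATION '/data/translog_gz/';
hive (scott)> SELECT * FROM translog_ext LIMIT 1;

Keep in mind that gzipped text files are not splittable, so each .gz file will be processed by a single mapper.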

