Hive error: GC overhead limit exceeded

Cause of the error

Tens of millions of rows were being inserted into a partitioned table, with the target partitions written via dynamic partitioning.

The resource usage on YARN showed that cluster memory, CPU, and core counts were all normal, yet the import task failed at about 88% with GC overhead limit exceeded.

The execution log showed that the import ran as a map-only job (no reduce tasks) over tens of millions of rows. In a map-only dynamic-partition insert, each mapper has to keep an open file writer for every partition value it encounters, which is what drives up the heap pressure.
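For reference, here is a minimal sketch of the kind of statement that triggers this, assuming a hypothetical target table ods.orders partitioned by dt and a hypothetical source table staging.orders_raw; with no DISTRIBUTE BY or CLUSTER BY, Hive compiles it into a map-only job:

-- enable dynamic partitioning for the session (nonstrict lets every
-- partition value be determined at run time)
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

-- dt is the dynamic partition column and must come last in the select
insert overwrite table ods.orders partition (dt)
select order_id, user_id, amount, dt
from staging.orders_raw;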

Setting JVM parameters

-- Session-level overrides. mapred.child.java.opts is the deprecated MR1
-- name; the per-task mapreduce.*.java.opts settings below supersede it.
set mapred.child.java.opts=-Xmx8000m;
-- JVM heap size for map and reduce tasks (in practice -Xmx is usually
-- kept below the container size to leave headroom for off-heap memory).
set mapreduce.map.java.opts=-Xmx8096m;
set mapreduce.reduce.java.opts=-Xmx8096m;
-- YARN container sizes for map and reduce tasks, in MB.
set mapreduce.map.memory.mb=8096;
set mapreduce.reduce.memory.mb=8096;

Raising the JVM heap lets the job complete, but as soon as the data volume doubles, the same GC overhead limit exceeded error comes back.

Optimization:

When inserting, append CLUSTER BY on the key data field (here, the partition key) after the SELECT. This spreads the rows across a reduce phase, so a set of reduce tasks is generated and each one only processes its share of the data, as shown in the sketch below.
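A sketch of the rewritten insert under the same hypothetical schema as above; CLUSTER BY dt hashes rows by the partition key and forces a reduce phase, so each reducer writes only its own subset of partitions:

insert overwrite table ods.orders partition (dt)
select order_id, user_id, amount, dt
from staging.orders_raw
cluster by dt;

A side benefit: because all rows for a given dt land on one reducer, each partition is written by a single task, which also keeps the number of small output files down.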
