Cause of the incident
Tens of millions of rows were being inserted into a partitioned table, with the target partition configured as a dynamic partition.
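For reference, a minimal sketch of the kind of statement involved (the table and column names here are illustrative, not taken from the actual job):

-- enable dynamic partitioning, required for this kind of insert
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

-- plain insert-select: runs as a map-only job, so every map task
-- buffers and writes files for every partition value it encounters
insert overwrite table ods_target partition (dt)
select col1, col2, dt
from ods_source;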
Resource usage on YARN showed that the cluster's memory, CPU, and core counts were all normal, yet once the data import task reached about 88% it failed with GC overhead limit exceeded. The execution log showed that the import ran as a map-only job (no reduce stage), while the data volume was in the tens of millions of rows.
Setting JVM parameters
set mapred.child.java.opts=-Xmx8000m;
set mapreduce.map.java.opts=-Xmx8096m;
set mapreduce.reduce.java.opts=-Xmx8096m;
set mapreduce.map.memory.mb=8096;
set mapreduce.reduce.memory.mb=8096;
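As a side note, the JVM heap (-Xmx in the *.java.opts settings) is normally kept somewhat below the container size (*.memory.mb) so that off-heap usage does not get the container killed by YARN; a more conventional sizing sketch, assuming 8 GB containers, would be:

set mapreduce.map.memory.mb=8192;
set mapreduce.map.java.opts=-Xmx6144m;
set mapreduce.reduce.memory.mb=8192;
set mapreduce.reduce.java.opts=-Xmx6144m;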
Increasing the JVM parameters lets the job run to completion, but as soon as the data volume doubles the same error is reported again:
GC overhead limit exceeded
Optimization scheme:
When inserting the data, add cluster by on the key data fields after the select that feeds the insert. This forces a shuffle that spreads the data out and generates a certain number of reduce tasks, so each task processes only part of the data instead of everything being written in the map stage.
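A sketch of the rewritten insert (same illustrative table names as above); the cluster by on the partition key forces a reduce stage, and each reducer only buffers the rows for its own partitions:

insert overwrite table ods_target partition (dt)
select col1, col2, dt
from ods_source
cluster by dt;

If the default number of reducers is too low for the data volume, it can be tuned with hive.exec.reducers.bytes.per.reducer or set explicitly with mapreduce.job.reduces.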