
Decimal calculation with double and float in Java is not exact

In most cases, results computed with double and float are accurate enough. But in systems with strict accuracy requirements, or whenever the exact decimal result matters, the imprecision becomes a serious problem.
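
For instance, here is a minimal sketch of the problem (the class name DoublePrecisionDemo is mine, chosen for illustration); on a standard JVM the plain double expressions print slightly-off values:

public class DoublePrecisionDemo {
    public static void main(String[] args) {
        // Plain double arithmetic: the printed results are not the exact decimal values
        System.out.println(0.05 + 0.01);   // 0.060000000000000005
        System.out.println(1.0 - 0.42);    // 0.5800000000000001
        System.out.println(4.015 * 100);   // 401.49999999999994
        System.out.println(123.3 / 100);   // 1.2329999999999999
    }
}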

Effective Java states a principle: float and double should be used only for scientific and engineering calculation; for commercial (monetary) calculation, use java.math.BigDecimal instead. The BigDecimal class, which the Java designers provide precisely for this purpose, solves the problem that float and double cannot perform decimal calculations exactly.

The example code is as follows:

package ex;

import java.math.*;

public class BigDecimalDemo {
    public static void main(String[] args) {
        System.out.println(ArithUtil.add(0.01, 0.05));
        System.out.println(ArithUtil.sub(1.0, 0.42));
        System.out.println(ArithUtil.mul(4.015, 100));
        System.out.println(ArithUtil.div(123.3, 100));
    }
}

class ArithUtil {
    // Default scale (number of digits after the decimal point) used for division
    private static final int DEF_DIV_SCALE = 10;

    private ArithUtil() {}

    // Exact addition
    public static double add(double d1, double d2) {
        BigDecimal b1 = new BigDecimal(Double.toString(d1));
        BigDecimal b2 = new BigDecimal(Double.toString(d2));
        return b1.add(b2).doubleValue();
    }

    // Exact subtraction
    public static double sub(double d1, double d2) {
        BigDecimal b1 = new BigDecimal(Double.toString(d1));
        BigDecimal b2 = new BigDecimal(Double.toString(d2));
        return b1.subtract(b2).doubleValue();
    }

    // Exact multiplication
    public static double mul(double d1, double d2) {
        BigDecimal b1 = new BigDecimal(Double.toString(d1));
        BigDecimal b2 = new BigDecimal(Double.toString(d2));
        return b1.multiply(b2).doubleValue();
    }

    // Division with the default scale
    public static double div(double d1, double d2) {
        return div(d1, d2, DEF_DIV_SCALE);
    }

    // Division rounded half-up to the given scale
    public static double div(double d1, double d2, int scale) {
        if (scale < 0) {
            throw new IllegalArgumentException("The scale must be a positive integer or zero");
        }
        BigDecimal b1 = new BigDecimal(Double.toString(d1));
        BigDecimal b2 = new BigDecimal(Double.toString(d2));
        return b1.divide(b2, scale, RoundingMode.HALF_UP).doubleValue();
    }
}
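
With these BigDecimal-based helpers, the four calls in main print the exact decimal results 0.06, 0.58, 401.5 and 1.233, instead of the slightly-off values produced by plain double arithmetic above.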

Now let us analyze in detail why floating-point operations lose precision.

1. Binary representation of decimals

 First we need to clarify the following two issues.

     (1) How to convert decimal integers to binary numbers

           The algorithm is simple. As an example, 11 is expressed as a binary number.

                     11/2=5 remainder 1

                       5/2=2 remainder 1

                       2/2=1 remainder 0

                       1/2=0 remainder 1

                          The division ends when the quotient reaches 0. Reading the remainders from bottom to top, 11 in binary is 1011.

          In other words, could this algorithm ever go on indefinitely for some integer? Absolutely not: the quotient keeps getting smaller, so every integer can be represented exactly and finitely in binary. Decimal fractions, as we will see next, cannot always be.
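
As an illustrative sketch (the class and method names here are my own), the repeated-division algorithm translates directly into Java and can be checked against the standard library's Integer.toBinaryString:

public class IntToBinary {
    // Convert a non-negative decimal integer to a binary string by repeated division by 2
    static String toBinary(int n) {
        if (n == 0) return "0";
        StringBuilder bits = new StringBuilder();
        while (n > 0) {
            bits.append(n % 2);  // the remainder is the next bit (lowest bit first)
            n /= 2;
        }
        return bits.reverse().toString();  // remainders are read from bottom to top
    }

    public static void main(String[] args) {
        System.out.println(toBinary(11));                // 1011
        System.out.println(Integer.toBinaryString(11));  // 1011 (library check)
    }
}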

      (2) How to convert decimal fractions to binary numbers

           The algorithm is to repeatedly multiply the fractional part by 2 and take the integer part as the next binary digit, until no fractional part remains. As an example, express 0.9 in binary:

                     0.9*2=1.8  take the integer part 1

                     0.8 (fractional part of 1.8)*2=1.6  take the integer part 1

                     0.6*2=1.2  take the integer part 1

                     0.2*2=0.4  take the integer part 0

                     0.4*2=0.8  take the integer part 0

                     0.8*2=1.6  take the integer part 1

                     0.6*2=1.2  take the integer part 1

                     .........

           Reading the integer parts from top to bottom, 0.9 in binary is 0.1110011001100110...

           Note: the calculation above is cyclic; multiplying by 2 never eliminates the fractional part, so the algorithm would go on forever. Clearly, some decimal fractions cannot be represented exactly in binary, for the same reason that 1/3 cannot be represented exactly in decimal. This is also where floating-point arithmetic loses precision: the stored operands are often only binary approximations of the intended decimal values.
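
A minimal sketch of this multiply-by-2 expansion (the names are mine; note that it operates on the double approximation of its input, so only the first few dozen digits are meaningful):

public class FractionToBinary {
    // Expand a fraction in (0, 1) into binary digits by repeatedly multiplying by 2,
    // stopping after maxBits digits if the expansion does not terminate
    static String toBinary(double fraction, int maxBits) {
        StringBuilder bits = new StringBuilder("0.");
        for (int i = 0; i < maxBits && fraction != 0.0; i++) {
            fraction *= 2;
            if (fraction >= 1.0) {
                bits.append('1');   // integer part 1 becomes the next binary digit
                fraction -= 1.0;
            } else {
                bits.append('0');   // integer part 0 becomes the next binary digit
            }
        }
        return bits.toString();
    }

    public static void main(String[] args) {
        System.out.println(toBinary(0.5, 30));  // 0.1 (terminates)
        System.out.println(toBinary(0.9, 30));  // 0.111001100110011... (never terminates)
    }
}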

2. Storage of float type in memory

It is well known that Java's float type takes up 4 bytes (32 bits) in memory. The 32 bits are structured as follows:

float memory storage structure (4 bytes):

             bit 31        sign of the real number
             bit 30        sign of the exponent
             bits 29-23    exponent
             bits 22-0     significant digits (mantissa)

        Here the sign bit 0 means positive and 1 means negative. There are 24 significant bits in total, but the leading 1 is implicit and not stored, so only 23 bits are kept.

         The steps to convert a float value to its memory storage format are:

        (1) First convert the absolute value of the real number to binary, using the methods for the integer and fractional parts discussed above.

        (2) Shift the decimal point of this binary number left or right by n places until it sits immediately to the right of the first significant digit.

        (3) Starting from the first digit to the right of the decimal point, take twenty-three digits and place them into bits 22 to 0.

        (4) If the real number is positive, put a "0" in bit 31; otherwise put a "1".

        (5) If the decimal point was shifted to the left, the exponent is positive; put a "1" in bit 30. If it was shifted to the right or n = 0, put a "0" in bit 30.

        (6) If n came from a left shift, subtract 1 from n, convert the result to binary, pad it with "0" on the left to seven bits, and place it into bits 29 to 23. If n came from a right shift or n = 0, convert n to binary, pad it with "0" on the left to seven bits, invert each bit, and place the result into bits 29 to 23.

          Example: Memory storage format of 11.9

       (1) Converting 11.9 to binary gives approximately "1011.11100110011001100110011001100...".

       (2) Shift the decimal point three places to the left so that it sits to the right of the first significant digit: "1.011 11100110011001100110". Keep 24 significant digits and truncate the extra bits on the right (this truncation is where the error is introduced).

       (3) There are now twenty-four significant digits; removing the leftmost "1" leaves the 23 bits "011 11100110011001100110", which go into bits 22 to 0.

       (4) Since 11.9 is positive, put a "0" in the sign bit, bit 31.

       (5) Since the decimal point was shifted to the left, put a "1" in the exponent sign bit, bit 30.

       (6) Since the decimal point was shifted left by 3 places, subtract 1 from 3 to get 2, convert it to binary and pad it to seven bits: 0000010; place this into bits 29 to 23.

           The final representation of 11.9 is: 0 1 0000010 011 11100110011001100110

           Another example: the memory storage format of 0.2356

      (1) Converting 0.2356 to binary gives approximately 0.00111100010100000100100000.

      (2) Shift the decimal point three places to the right to get 1.11100010100000100100000.

      (3) Take the twenty-three digits to the right of the decimal point, 11100010100000100100000, and place them into bits 22 to 0.

      (4) Since 0.2356 is positive, put a "0" in bit 31.

      (5) Since the decimal point was shifted to the right, put a "0" in bit 30.

      (6) Since the decimal point was shifted right by 3 places, convert 3 to binary and pad it to seven bits to get 0000011; inverting each bit gives 1111100, which goes into bits 29 to 23.

           The final representation of 0.2356 is: 0 0 1111100 11100010100000100100000

          The steps to convert a float from its memory storage format back to decimal are:

     (1) Write out the binary number in bits 22 to 0 and prepend a "1" on the left, giving twenty-four significant digits. Put the decimal point to the right of that leftmost "1".

     (2) Take the value n stored in bits 29 to 23. If bit 30 is "0", invert each bit of n; if bit 30 is "1", add 1 to n.

     (3) Shift the decimal point left by n places (if bit 30 is "0") or right by n places (if bit 30 is "1") to obtain the real number in binary.

     (4) Convert this binary real number to decimal and attach a plus or minus sign according to whether bit 31 is "0" or "1".
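
The layout walked through above is a simplified teaching model; Java's actual float storage follows IEEE 754, which uses a biased 8-bit exponent instead of a separate exponent sign bit, but the idea of a truncated 23-bit mantissa is the same. If you want to inspect the bits that are really stored, here is a small sketch (the class name is mine) using the standard Float.floatToIntBits:

public class FloatBitsDemo {
    public static void main(String[] args) {
        float f = 11.9f;
        int bits = Float.floatToIntBits(f);

        // IEEE 754 single precision: 1 sign bit, 8 exponent bits (biased by 127), 23 mantissa bits
        int sign = (bits >>> 31) & 0x1;
        int exponent = (bits >>> 23) & 0xFF;
        int mantissa = bits & 0x7FFFFF;

        System.out.println("raw bits : " + String.format("%32s", Integer.toBinaryString(bits)).replace(' ', '0'));
        System.out.println("sign     : " + sign);
        System.out.println("exponent : " + Integer.toBinaryString(exponent) + " (unbiased: " + (exponent - 127) + ")");
        System.out.println("mantissa : " + String.format("%23s", Integer.toBinaryString(mantissa)).replace(' ', '0'));

        // Because the mantissa is cut off at 23 bits, 11.9f is not exactly 11.9
        System.out.println(new java.math.BigDecimal(f));  // 11.8999996185302734375
    }
}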

3. Floating-point addition and subtraction

Floating-point addition and subtraction are more complex than the corresponding fixed-point operations. A floating-point addition or subtraction is carried out in roughly four steps:

(1) Check for a zero operand.
        If one of the two floating-point numbers being added or subtracted is 0, the result is known immediately and the remaining steps can be skipped.

(2) Compare the exponents and align them.
    To add or subtract two floating-point numbers, we must first check whether their exponents are equal, i.e. whether their binary points are aligned. If the exponents are equal, the points are aligned and the mantissas can be added or subtracted directly. If the exponents differ, the points are not aligned, and the two exponents must first be made equal; this process is called exponent alignment.

    How is the alignment done (suppose the two numbers have exponents Ex and Ey)?
    By shifting one mantissa and adjusting Ex or Ey until they are equal. Since floating-point numbers are normally stored in normalized form, shifting a mantissa to the left would lose its most significant bits and cause a large error, whereas shifting it to the right loses only the least significant bits and causes a much smaller error. Alignment therefore always shifts a mantissa to the right, and each right shift is compensated by adding 1 to that number's exponent, so its value is unchanged. Since an exponent is only ever increased, it must be the smaller exponent that is adjusted: the mantissa of the number with the smaller exponent is shifted right (equivalent to moving its point to the left) one bit at a time, adding 1 to its exponent each time, until the two exponents are equal. The number of right shifts equals the exponent difference ΔE. (A toy sketch of this procedure is given after the list below.)

(3) Add or subtract the mantissas (the significant digits).

(4) The result is normalized and rounded.
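
As promised above, here is a toy sketch of the four steps (entirely my own simplified model, representing a number as a mantissa times a power of two; it is not how the JVM or the hardware implements floating point):

public class ToyFloatAdd {
    static final int MANT_BITS = 24;  // toy significand width

    // Add two toy floats given as (mantissa, exponent), meaning mantissa * 2^exponent.
    // Returns {mantissa, exponent}.
    static long[] add(long m1, int e1, long m2, int e2) {
        // (1) zero-operand check
        if (m1 == 0) return new long[]{m2, e2};
        if (m2 == 0) return new long[]{m1, e1};

        // (2) exponent alignment: shift the mantissa with the smaller exponent to the right,
        //     adding 1 to its exponent each time, until the exponents match (low-order bits are lost here)
        while (e1 < e2) { m1 >>= 1; e1++; }
        while (e2 < e1) { m2 >>= 1; e2++; }

        // (3) add the aligned mantissas
        long m = m1 + m2;
        int e = e1;

        // (4) normalize: keep the mantissa within MANT_BITS bits, truncating low-order bits
        while (m >= (1L << MANT_BITS)) { m >>= 1; e++; }
        return new long[]{m, e};
    }

    public static void main(String[] args) {
        // 2.5 = 5 * 2^-1 and 0.75 = 3 * 2^-2; the exact sum is 3.25 = 13 * 2^-2.
        long[] r = add(5, -1, 3, -2);
        // Alignment shifts 3 (binary 11) right to 1, losing its lowest bit,
        // so the toy result is 6 * 2^-1 = 3.0 instead of 3.25 - a small alignment error.
        System.out.println(r[0] + " * 2^" + r[1]);
    }
}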