@node Arithmetic, Date and Time, Mathematics, Top
@c %MENU% Low level arithmetic functions
@chapter Arithmetic Functions
This chapter contains information about functions for doing basic
arithmetic operations, such as splitting a float into its integer and
fractional parts or retrieving the imaginary part of a complex value.
These functions are declared in the header files @file{math.h} and
@file{complex.h}.
@menu
* Integers:: Basic integer types and concepts
* Integer Division:: Integer division with guaranteed rounding.
* Floating Point Numbers:: Basic concepts. IEEE 754.
* Floating Point Classes:: The five kinds of floating-point number.
* Floating Point Errors:: When something goes wrong in a calculation.
* Rounding:: Controlling how results are rounded.
* Control Functions:: Saving and restoring the FPU's state.
* Arithmetic Functions:: Fundamental operations provided by the library.
* Complex Numbers:: The types. Writing complex constants.
* Operations on Complex:: Projection, conjugation, decomposition.
* Parsing of Numbers:: Converting strings to numbers.
* Printing of Floats:: Converting floating-point numbers to strings.
* System V Number Conversion:: An archaic way to convert numbers to strings.
@end menu
@node Integers
@section Integers
@cindex integer
The C language defines several integer data types: integer, short integer,
long integer, and character, all in both signed and unsigned varieties.
@w{ISO C99} added long long integers as well; the GNU C compiler also
accepts them as an extension in earlier language modes.
@cindex signedness
The C integer types were intended to allow code to be portable among
machines with different inherent data sizes (word sizes), so each type
may have different ranges on different machines. The problem with
this is that a program often needs to be written for a particular range
of integers, and sometimes must be written for a particular size of
storage, regardless of what machine the program runs on.
To address this problem, @theglibc{} contains C type definitions
you can use to declare integers that meet your exact needs. Because the
@glibcadj{} header files are customized to a specific machine, your
program source code doesn't have to be.
These @code{typedef}s are in @file{stdint.h}.
@pindex stdint.h
If you require that an integer be represented in exactly N bits, use one
of the following types, with the obvious mapping to bit size and signedness:
@itemize @bullet
@item int8_t
@item int16_t
@item int32_t
@item int64_t
@item uint8_t
@item uint16_t
@item uint32_t
@item uint64_t
@end itemize
If your C compiler and target machine do not allow integers of a certain
size, the corresponding type above does not exist.
If you don't need a specific storage size, but want the smallest type
with @emph{at least} N bits, use one of these:
@itemize @bullet
@item int_least8_t
@item int_least16_t
@item int_least32_t
@item int_least64_t
@item uint_least8_t
@item uint_least16_t
@item uint_least32_t
@item uint_least64_t
@end itemize
If you don't need a specific storage size, but want the type that
allows the fastest access while having at least N bits (and, among
types with the same access speed, the smallest one), use one of these:
@itemize @bullet
@item int_fast8_t
@item int_fast16_t
@item int_fast32_t
@item int_fast64_t
@item uint_fast8_t
@item uint_fast16_t
@item uint_fast32_t
@item uint_fast64_t
@end itemize
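As a rough illustration of the three families above, the following
sketch contrasts an exact-width type with the @samp{least} and
@samp{fast} variants. The names @code{wrap_add_u8} and @code{BITS_OF}
are invented for this example and are not part of @theglibc{}.

```c
#include <limits.h>
#include <stdint.h>

/* Hypothetical helper: add two 8-bit values with wraparound.
   Both operands are promoted to int before the addition, so the
   cast back to uint8_t reduces the sum modulo 256.  */
uint8_t
wrap_add_u8 (uint8_t a, uint8_t b)
{
  return (uint8_t) (a + b);
}

/* Hypothetical macro: storage size of TYPE in bits.  */
#define BITS_OF(type) (sizeof (type) * CHAR_BIT)
```

With these definitions, @code{BITS_OF (int32_t)} is exactly 32 wherever
@code{int32_t} exists, while @code{BITS_OF (int_least16_t)} and
@code{BITS_OF (int_fast16_t)} are only guaranteed to be 16 or more.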
If you want an integer with the widest range possible on the platform on
which it is being used, use one of the following. If you use these,
you should write code that takes into account the variable size and range
of the integer.
@itemize @bullet
@item intmax_t
@item uintmax_t
@end itemize
@Theglibc{} also provides macros that tell you the maximum and
minimum possible values for each integer data type. The macro names
follow these examples: @code{INT32_MAX}, @code{UINT8_MAX},
@code{INT_FAST32_MIN}, @code{INT_LEAST64_MIN}, @code{UINTMAX_MAX},
@code{INTMAX_MAX}, @code{INTMAX_MIN}. Note that there are no macros for
unsigned integer minima. These are always zero. Similarly, there
are macros such as @code{INTMAX_WIDTH} for the width of these types.
Those macros for integer type widths come from TS 18661-1:2014.
@cindex maximum possible integer
@cindex minimum possible integer
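One common use of the limit macros is to detect overflow before it
happens. The following is a minimal sketch; the function name
@code{add_overflows_i32} is an assumption made for this example.

```c
#include <stdint.h>

/* Hypothetical helper: return 1 if a + b would not fit in int32_t.
   Only the limit macros are consulted, so the overflowing addition
   itself is never performed.  */
int
add_overflows_i32 (int32_t a, int32_t b)
{
  if (b > 0)
    return a > INT32_MAX - b;
  return a < INT32_MIN - b;
}
```

The subtraction on the right-hand side cannot overflow in either
branch, because @code{b} is positive in the first and zero or negative
in the second.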
There are similar macros for use with C's built-in integer types which
should come with your C compiler. These are described in @ref{Data Type
Measurements}.
Don't forget you can use the C @code{sizeof} operator with any of these
data types to get the number of bytes of storage each uses.
@node Integer Division
@section Integer Division
@cindex integer division functions
This section describes functions for performing integer division. These
functions are redundant when GNU CC is used, because in GNU C the
@samp{/} operator always rounds towards zero. @w{ISO C99} requires the
same behavior, but in older C implementations @samp{/} may round
differently with negative arguments.
@code{div} and @code{ldiv} are useful because they specify how to round
the quotient: towards zero. The remainder has the same sign as the
numerator.
These functions are specified to return a result @var{r} such that the value
@code{@var{r}.quot*@var{denominator} + @var{r}.rem} equals
@var{numerator}.
@pindex stdlib.h
To use these facilities, you should include the header file
@file{stdlib.h} in your program.
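The identity above can be checked directly, and the toward-zero
quotient can serve as a building block for other rounding rules. The
following sketch assumes two helper names, @code{div_identity_holds}
and @code{floor_div}, which are not library functions.

```c
#include <stdlib.h>

/* Hypothetical helper: nonzero if the result of div satisfies
   r.quot * denominator + r.rem == numerator.  */
int
div_identity_holds (int numerator, int denominator)
{
  div_t r = div (numerator, denominator);
  return r.quot * denominator + r.rem == numerator;
}

/* Hypothetical helper: quotient rounded toward minus infinity,
   derived from the toward-zero quotient that div returns.  */
int
floor_div (int numerator, int denominator)
{
  div_t r = div (numerator, denominator);
  /* A nonzero remainder whose sign differs from the denominator's
     means the exact quotient is negative and was rounded up.  */
  if (r.rem != 0 && (r.rem < 0) != (denominator < 0))
    r.quot--;
  return r.quot;
}
```

For example, @code{div (-7, 2)} yields a quotient of @code{-3} and a
remainder of @code{-1}, whereas @code{floor_div (-7, 2)} yields
@code{-4}.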
@deftp {Data Type} div_t
@standards{ISO, stdlib.h}
This is a structure type used to hold the result returned by the @code{div}
function. It has the following members:
@table @code
@item int quot
The quotient from the division.
@item int rem
The remainder from the division.
@end table
@end deftp
@deftypefun div_t div (int @var{numerator}, int @var{denominator})
@standards{ISO, stdlib.h}
@safety{@prelim{}@mtsafe{}@assafe{}@acsafe{}}
@c Functions in this section are pure, and thus safe.
The function @code{div} computes the quotient and remainder from
the division of @var{numerator} by @var{denominator}, returning the
result in a structure of type @code{div_t}.
If the result cannot be represented (as in a division by zero), the
behavior is undefined.
Here is an example, albeit not a very useful one.
@smallexample
div_t result;
result = div (20, -6);
@end smallexample
@noindent
Now @code{result.quot} is @code{-3} and @code{result.rem} is @code{2}.
@end deftypefun
@deftp {Data Type} ldiv_t
@standards{ISO, stdlib.h}
This is a structure type used to hold the result returned by the @code{ldiv}
function. It has the following members:
@table @code
@item long int quot
The quotient from the division.
@item long int rem
The remainder from the division.
@end table
(This is identical to @code{div_t} except that the components are of
type @code{long int} rather than @code{int}.)
@end deftp
@deftypefun ldiv_t ldiv (long int @var{numerator}, long int @var{denominator})
@standards{ISO, stdlib.h}
@safety{@prelim{}@mtsafe{}@assafe{}@acsafe{}}
The @code{ldiv} function is similar to @code{div}, except that the
arguments are of type @code{long int} and the result is returned as a
structure of type @code{ldiv_t}.
@end deftypefun
@deftp {Data Type} lldiv_t
@standards{ISO, stdlib.h}
This is a structure type used to hold the result returned by the @code{lldiv}
function. It has the following members:
@table @code
@item long long int quot
The quotient from the division.
@item long long int rem
The remainder from the division.
@end table
(This is identical to @code{div_t} except that the components are of
type @code{long long int} rather than @code{int}.)
@end deftp
@deftypefun lldiv_t lldiv (long long int @var{numerator}, long long int @var{denominator})
@standards{ISO, stdlib.h}
@safety{@prelim{}@mtsafe{}@assafe{}@acsafe{}}
The @code{lldiv} function is like the @code{div} function, but the
arguments are of type @code{long long int} and the result is returned as
a structure of type @code{lldiv_t}.
The @code{lldiv} function was added in @w{ISO C99}.
@end deftypefun
@deftp {Data Type} imaxdiv_t
@standards{ISO, inttypes.h}
This is a structure type used to hold the result returned by the @code{imaxdiv}
function. It has the following members:
@table @code
@item intmax_t quot
The quotient from the division.
@item intmax_t rem
The remainder from the division.
@end table
(This is identical to @code{div_t} except that the components are of
type @code{intmax_t} rather than @code{int}.)
See @ref{Integers} for a description of the @code{intmax_t} type.
@end deftp
@deftypefun imaxdiv_t imaxdiv (intmax_t @var{numerator}, intmax_t @var{denominator})
@standards{ISO, inttypes.h}
@safety{@prelim{}@mtsafe{}@assafe{}@acsafe{}}
The @code{imaxdiv} function is like the @code{div} function, but the
arguments are of type @code{intmax_t} and the result is returned as
a structure of type @code{imaxdiv_t}.
See @ref{Integers} for a description of the @code{intmax_t} type.
The @code{imaxdiv} function was added in @w{ISO C99}.
@end deftypefun
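The wider variants are used in the same way as @code{div}. As a small
sketch, the hypothetical function @code{split_seconds} below (a name
chosen for this example) breaks a count of seconds into hours, minutes,
and seconds with two calls to @code{imaxdiv}.

```c
#include <inttypes.h>

/* Hypothetical helper: split TOTAL seconds into hours, minutes, and
   seconds.  Each imaxdiv call yields both the quotient and the
   remainder of one division by 60.  */
void
split_seconds (intmax_t total, intmax_t *hours,
               intmax_t *minutes, intmax_t *seconds)
{
  imaxdiv_t by_minute = imaxdiv (total, 60);
  imaxdiv_t by_hour = imaxdiv (by_minute.quot, 60);

  *seconds = by_minute.rem;
  *minutes = by_hour.rem;
  *hours = by_hour.quot;
}
```

Because the remainder has the same sign as the numerator, a negative
@code{total} produces components that are all negative or zero.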
@node Floating Point Numbers
@section Floating Point Numbers
@cindex floating point
@cindex IEEE 754
@cindex IEEE floating point
Most computer hardware has support for two different kinds of numbers:
integers (@math{@dots{}-3, -2, -1, 0, 1, 2, 3@dots{}}) and
floating-point numbers. Floating-point numbers have three parts: the
@dfn{mantissa}, the @dfn{exponent}, and the @dfn{sign bit}. The real
number represented by a floating-point value is given by
@tex
$(s \mathrel? -1 \mathrel: 1) \cdot 2^e \cdot M$
@end tex
@ifnottex
@math{(s ? -1 : 1) @mul{} 2^e @mul{} M}
@end ifnottex
where @math{s} is the sign bit, @math{e} the exponent, and @math{M}
the mantissa. @xref{Floating Point Concepts}, for details. (It is
possible to have a different @dfn{base} for the exponent, but all modern
hardware uses @math{2}.)
Floating-point numbers can represent a finite subset of the real
numbers. While this subset is large enough for most purposes, it is
important to remember that the only reals that can be represented
exactly are rational numbers that have a terminating binary expansion
shorter than the width of the mantissa. Even simple fractions such as
@math{1/5} can only be approximated by floating point.
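The effect of such approximations shows up as soon as values are
combined. The sketch below assumes that @code{double} is the
@w{IEEE 754} 64-bit binary format; the function name
@code{repeated_sum} is invented for this example.

```c
/* Hypothetical helper: add STEP to zero COUNT times.  The volatile
   qualifier forces every partial sum to be stored as a double, so
   wider registers cannot hide the rounding.  */
double
repeated_sum (double step, int count)
{
  volatile double sum = 0.0;
  for (int i = 0; i < count; i++)
    sum = sum + step;
  return sum;
}
```

Under that assumption, @code{repeated_sum (0.125, 8)} is exactly
@code{1.0}, because @math{1/8} has a short binary expansion, while
@code{repeated_sum (0.1, 10)} differs from @code{1.0}, because
@math{1/10} does not.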
Mathematical operations and functions frequently need to produce values
that are not representable. Often these values can be approximated
closely enough for practical purposes, but sometimes they can't.
Historically there was no way to tell when the results of a calculation
were inaccurate. Modern computers implement the @w{IEEE 754} standard
for numerical computations, which defines a framework for indicating to
the program when the results of calculation are not trustworthy. This
framework consists of a set of @dfn{exceptions} that indicate why a
result could not be represented, and the special values @dfn{infinity}
and @dfn{not a number} (NaN).
@node Floating Point Classes
@section Floating-Point Number Classification Functions
@cindex floating-point classes
@cindex classes, floating-point
@pindex math.h
@w{ISO C99} defines macros that let you determine what sort of
floating-point number a variable holds.
@deftypefn {Macro} int fpclassify (@emph{float-type} @var{x})
@standards{ISO, math.h}
@safety{@prelim{}@mtsafe{}@assafe{}@acsafe{}}
This is a generic macro which works on all floating-point types and
which returns a value of type @code{int}. The possible values are:
@vtable @code
@item FP_NAN
@standards{C99, math.h}
The floating-point number @var{x} is ``Not a Number'' (@pxref{Infinity
and NaN})
@item FP_INFINITE
@standards{C99, math.h}
The value of @var{x} is either plus or minus infinity (@pxref{Infinity
and NaN})
@item FP_ZERO
@standards{C99, math.h}
The value of @var{x} is zero. In floating-point formats like @w{IEEE
754}, where zero can be signed, this value is also returned if
@var{x} is negative zero.
@item FP_SUBNORMAL
@standards{C99, math.h}
Numbers whose absolute value is too small to be represented in the
normal format are represented in an alternate, @dfn{denormalized} format
(@pxref{Floating Point Concepts}). This format is less precise but can
represent values closer to zero. @code{fpclassify} returns this value
for values of @var{x} in this alternate format.
@item FP_NORMAL
@standards{C99, math.h}
This value is returned for all other values of @var{x}. It indicates
that there is nothing special about the number.
@end vtable
@end deftypefn
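A typical use of @code{fpclassify} is a @code{switch} over its result.
The following is a minimal sketch; the function name
@code{fp_class_name} and the strings it returns are assumptions made
for this example.

```c
#include <math.h>

/* Hypothetical helper: describe the class of X as a string.  */
const char *
fp_class_name (double x)
{
  switch (fpclassify (x))
    {
    case FP_NAN:
      return "NaN";
    case FP_INFINITE:
      return "infinite";
    case FP_ZERO:
      return "zero";
    case FP_SUBNORMAL:
      return "subnormal";
    case FP_NORMAL:
      return "normal";
    default:
      /* Reached only if fpclassify returns some other value.  */
      return "unknown";
    }
}
```

The @code{default} branch is a precaution for implementations that
define additional classification values.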
@code{fpclassify} is most useful if more than one property of a number
must be tested. There are more specific macros which only test one
property at a time. Generally these macros execute faster than
@code{fpclassify}, since there is special hardware support for them.
You should therefore use the specific macros whenever possible.
@deftypefn {Macro} int iscanonical (@emph{float-type} @var{x})
@standards{ISO, math.h}
@safety{@prelim{}@mtsafe{}@assafe{}@acsafe{}}
In some floating-point formats, some values have canonical (preferred)
and noncanonical encodings (for IEEE interchange binary formats, all
encodings are canonical). This macro returns a nonzero value if
@var{x} has a canonical encoding. It is from TS 18661-1:2014.
Note that some formats have multiple encodings of a value which are
all equally canonical; @code{iscanonical} returns a nonzero value for
all such encodings. Also, formats may have encodings that do not
correspond to any valid value of the type. In ISO C terms these are
@dfn{trap representations}; in @theglibc{}, @code{iscanonical} returns
zero for such encodings.
@end deftypefn
@deftypefn {Macro} int isfinite (@emph{float-type} @var{x})
@standards{ISO, math.h}
@safety{@prelim{}@mtsafe{}@assafe{}@acsafe{}}
This macro returns a nonzero value if @var{x} is finite: not plus or
minus infinity, and not NaN. It is equivalent to
@smallexample
(fpclassify (x) != FP_NAN && fpclassify (x) != FP_INFINITE)
@end smallexample
@code{isfinite} is implemented as a macro which accepts any
floating-point type.
@end deftypefn
@deftypefn {Macro} int isnormal (@emph{float-type} @var{x})
@standards{ISO, math.h}
@safety{@prelim{}@mtsafe{}@assafe{}@acsafe{}}
This macro returns a nonzero value if @var{x} is finite and normalized.
It is equivalent to
@smallexample
(fpclassify (x) == FP_NORMAL)
@end smallexample
@end deftypefn
@deftypefn {Macro} int isnan (@emph{float-type} @var{x})
@standards{ISO, math.h}
@safety{@prelim{}@mtsafe{}@assafe{}@acsafe{}}
This macro returns a nonzero value if @var{x} is NaN. It is equivalent
to
@smallexample
(fpclassify (x) == FP_NAN)
@end smallexample
@end deftypefn
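The single-property macros are convenient for filtering input. As a
sketch, the hypothetical function @code{sum_finite} below (not a
library function) uses @code{isfinite} to skip infinities and NaNs
while adding up an array.

```c
#include <math.h>
#include <stddef.h>

/* Hypothetical helper: sum the finite elements of VALUES and store
   the number of skipped (infinite or NaN) elements in *SKIPPED.  */
double
sum_finite (const double *values, size_t count, size_t *skipped)
{
  double sum = 0.0;

  *skipped = 0;
  for (size_t i = 0; i < count; i++)
    {
      if (isfinite (values[i]))
        sum += values[i];
      else
        ++*skipped;
    }
  return sum;
}
```

A single @code{isfinite} test is enough here; separate @code{isnan}
and infinity checks are needed only when the two cases must be handled
differently.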
@deftypefn {Macro} int issignaling (@emph{float-type} @var{x})
@standards{ISO, math.h}
@safety{@prelim{}@mtsafe{}@assafe{}@acsafe{}}
This macro returns a nonzero value if @var{x} is a signaling NaN
(sNaN). It is from TS 18661-1:2014.
@end deftypefn
@deftypefn {Macro} int issubnormal (@emph{float-type} @var{x})
@standards{ISO, math.h}
@safety{@prelim{}@mtsafe{}@assafe{}@acsafe{}}
This macro returns a nonzero value if @var{x} is subnormal. It is
from TS 18661-1:2014.
@end deftypefn
@deftypefn {Macro} int iszero (@emph{float-type} @var{x})
@standards{ISO, math.h}
@safety{@prelim{}@mtsafe{}@assafe{}@acsafe{}}
This macro returns a nonzero value if @var{x} is zero. It is from TS
18661-1:2014.
@end deftypefn
Another set of floating-point classification functions was provided by
BSD. @Theglibc{} also supports these functions; however, we
recommend that you use the ISO C99 macros in new code. Those are standard
and will be available more widely. Also, since they are macros, you do
not have to worry about the type of their argument.
@deftypefun int isinf (double @var{x})
@deftypefunx int isinff (float @var{x})
@deftypefunx int isinfl (long double @var{x})
@standards{BSD, math.h}
@safety{@prelim{}@mtsafe{}@assafe{}@acsafe{}}
This function returns @code{-1} if @var{x} represents negative infinity,
@code{1} if @var{x} represents positive infinity, and @code{0} otherwise.
@end deftypefun
@deftypefun int isnan (double @var{x})
@deftypefunx int isnanf (float @var{x})
@deftypefunx int isnanl (long double @var{x})
@standards{BSD, math.h}
@safety{@prelim{}@mtsafe{}@assafe{}@acsafe{}}
This function returns a nonzero value if @var{x} is a ``not a number''
value, and zero otherwise.
@strong{NB:} The @code{isnan} macro defined by @w{ISO C99} overrides
the BSD function. This is normally not a problem, because the two
routines behave identically. However, if you really need to get the BSD
function for some reason, you can write
@smallexample
(isnan) (x)
@end smallexample
@end deftypefun
@deftypefun int finite (double @var{x})
@deftypefunx int finitef (float @var{x})
@deftypefunx int finitel (long double @var{x})
@standards{BSD, math.h}
@safety{@prelim{}@mtsafe{}@assafe{}@acsafe{}}
This function returns a nonzero value if @var{x} is neither infinite nor
a ``not a number'' value, and zero otherwise.
@end deftypefun
@strong{Portability Note:} The functions listed in this section are BSD
extensions.
@node Floating Point Errors
@section Errors in Floating-Point Calculations
@menu
* FP Exceptions:: IEEE 754 math exceptions and how to detect them.
* Infinity and NaN:: Special values returned by calculations.
* Status bit operations:: Checking for exceptions after the fact.
* Math Error Reporting:: How the math functions report errors.
@end menu
@node FP Exceptions
@subsection FP Exceptions
@cindex exception
@cindex signal
@cindex zero divide
@cindex division by zero
@cindex inexact exception
@cindex invalid exception
@cindex overflow exception
@cindex underflow exception
The @w{IEEE 754} standard defines five @dfn{exceptions} that can occur
during a calculation. Each corresponds to a particular sort of error,
such as overflow.
When exceptions occur (when exceptions are @dfn{raised}, in the language
of the standard), one of two things can happen. By default the
exception is simply noted in the floating-point @dfn{status word}, and
the program continues as if nothing had happened. The operation
produces a default value, which depends on the exception (see the table
below). Your program can check the status word to find out which
exceptions happened.
Alternatively, you can enable @dfn{traps} for exceptions. In that case,
when an exception is raised, your program will receive the @code{SIGFPE}
signal. The default action for this signal is to terminate the
program. @xref{Signal Handling}, for how you can change the effect of
the signal.
@noindent
The exceptions defined in @w{IEEE 754} are:
@table @samp
@item Invalid Operation
This exception is raised if the given operands are invalid for the
operation to be performed. Examples are
(see @w{IEEE 754}, @w{section 7}):
@enumerate
@item
Addition or subtraction: @math{@infinity{} - @infinity{}}. (But
@math{@infinity{} + @infinity{} = @infinity{}}).
@item
Multiplication: @math{0 @mul{} @infinity{}}.
@item
Division: @math{0/0} or @math{@infinity{}/@infinity{}}.
@item
Remainder: @math{x} REM @math{y}, where @math{y} is zero or @math{x} is
infinite.
@item
Square root if the operand is less than zero. More generally, any
mathematical function evaluated outside its domain produces this
exception.
@item
Conversion of a floating-point number to an integer or decimal
string, when the number cannot be represented in the target format (due
to overflow, infinity, or NaN).
@item
Conversion of an unrecognizable input string.
@item
Comparison via predicates involving @math{<} or @math{>}, when one or
other of the operands is NaN. You can prevent this exception by using
the unordered comparison functions instead; see @ref{FP Comparison Functions}.
@end enumerate
If the exception does not trap, the result of the operation is NaN.
@item Division by Zero
This exception is raised when a finite nonzero number is divided
by zero. If no trap occurs the result is either @math{+@infinity{}} or
@math{-@infinity{}}, depending on the signs of the operands.
@item Overflow
This exception is raised whenever the result cannot be represented
as a finite value in the precision format of the destination. If no trap
occurs the result depends on the sign of the intermediate result and the
current rounding mode (@w{IEEE 754}, @w{section 7.3}):
@enumerate
@item
Round to nearest carries all overflows to @math{@infinity{}}
with the sign of the intermediate result.
@item
Round toward @math{0} carries all overflows to the largest representable
finite number with the sign of the intermediate result.
@item
Round toward @math{-@infinity{}} carries positive overflows to the
largest representable finite number and negative overflows to
@math{-@infinity{}}.
@item
Round toward @math{@infinity{}} carries negative overflows to the
most negative representable finite number and positive overflows
to @math{@infinity{}}.
@end enumerate
Whenever the overflow exception is raised, the inexact exception is also
raised.
@item Underflow
The underflow exception is raised when an intermediate result is too
small to be calculated accurately, or if the operation's result rounded
to the destination precision is too small to be normalized.
When no trap is installed for the underflow exception, underflow is
signaled (via the underflow flag) only when both tininess and loss of
accuracy have been detected. If no trap handler is installed the
operation continues with an imprecise small value, or zero if the
destination precision cannot hold the small exact result.
@item Inexact
This exception is signaled if a rounded result is not exact (such as
when calculating the square root of two) or a result overflows without
an overflow trap.
@end table
@node Infinity and NaN
@subsection Infinity and NaN
@cindex infinity
@cindex not a number
@cindex NaN
@w{IEEE 754} floating-point numbers can represent positive or negative
infinity, and @dfn{NaN} (not a number). These three values arise from
calculations whose result is undefined or cannot be represented
accurately. You can also deliberately set a floating-point variable to
any of them, which is sometimes useful. Some examples of calculations
that produce infinity or NaN:
@ifnottex
@smallexample
@math{1/0 = @infinity{}}
@math{log (0) = -@infinity{}}
@math{sqrt (-1) = NaN}
@end smallexample
@end ifnottex
@tex
$${1\over0} = \infty$$
$$\log 0 = -\infty$$
$$\sqrt{-1} = \hbox{NaN}$$
@end tex
When a calculation produces any of these values, an exception also
occurs; see @ref{FP Exceptions}.
The basic operations and math functions all accept infinity and NaN and
produce sensible output. Infinities propagate through calculations as
one would expect: for example, @math{2 + @infinity{} = @infinity{}},
@math{4/@infinity{} = 0}, atan @math{(@infinity{}) = @pi{}/2}. NaN, on
the other hand, infects any calculation that involves it. Unless the
calculation would produce the same result no matter what real value
replaced NaN, the result is NaN.
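The propagation rules above can be tried out with a few lines of C. In
this sketch the helper names @code{ieee_divide} and
@code{infects_arithmetic} are assumptions made for the example, and
default (non-trapping) exception handling is assumed.

```c
#include <math.h>

/* Hypothetical helper: divide at run time.  The volatile objects keep
   the compiler from evaluating the quotient at compile time.  */
double
ieee_divide (double numerator, double denominator)
{
  volatile double n = numerator;
  volatile double d = denominator;
  return n / d;
}

/* Hypothetical helper: nonzero if X turns both a sum and a product
   into NaN, as a NaN operand does.  */
int
infects_arithmetic (double x)
{
  volatile double v = x;
  return isnan (v + 1.0) && isnan (v * 0.0);
}
```

Here @code{ieee_divide (4.0, INFINITY)} is zero, and
@code{infects_arithmetic} is nonzero for the NaN returned by
@code{ieee_divide (0.0, 0.0)} but zero for @code{INFINITY}, whose sum
with @code{1.0} is still infinity.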
In comparison operations, positive infinity is larger than all values
except itself and NaN, and negative infinity is smaller than all values
except itself and NaN. NaN is @dfn{unordered}: it is not equal to,
greater than, or less than anything, @emph{including itself}. @code{x ==
x} is false if the value of @code{x} is NaN. You can use this to test
whether a value is NaN or not, but the recommended way to test for NaN
is with the @code{isnan} macro (@pxref{Floating Point Classes}). In
addition, @code{<}, @code{>}, @code{<=}, and @code{>=} will raise an
exception when applied to NaNs.
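When NaN operands are possible, a comparison can be written with the
@w{ISO C99} macros @code{isunordered}, @code{isless}, and
@code{isgreater}, which do not raise an exception for quiet NaNs. The
following is a minimal sketch; the function name @code{compare_quiet}
and its interface are invented for this example.

```c
#include <math.h>

/* Hypothetical helper: three-way comparison that tolerates NaN.
   Returns -1, 0, or 1; *UNORDERED is set to 1 if either operand is
   NaN, in which case the return value is 0 and carries no meaning.  */
int
compare_quiet (double a, double b, int *unordered)
{
  *unordered = isunordered (a, b) != 0;
  if (*unordered)
    return 0;
  if (isless (a, b))
    return -1;
  return isgreater (a, b) ? 1 : 0;
}
```

Reporting the unordered case separately avoids the common mistake of
treating a NaN operand as equal to the other value.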
@file{math.h} defines macros that allow you to explicitly set a variable
to infinity or NaN.
@deftypevr Macro float INFINITY
@standards{ISO, math.h}
An expression representing positive infinity. It is equal to the value
produced by mathematical operations like @code{1.0 / 0.0}.
@code{-INFINITY} represents negative infinity.
You can test whether a floating-point value is infinite by comparing it
to this macro. However, this is not recommended; you should use the
@code{isfinite} macro instead. @xref{Floating Point Classes}.
This macro was introduced in the @w{ISO C99} standard.
@end deftypevr
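One place where setting a variable to infinity deliberately is useful
is as the starting value of a search for a minimum. The sketch below
assumes a helper named @code{minimum_of}, which is not a library
function.

```c
#include <math.h>
#include <stddef.h>

/* Hypothetical helper: smallest element of VALUES, or positive
   infinity if COUNT is zero.  Starting from INFINITY means any
   finite element replaces the initial value.  */
double
minimum_of (const double *values, size_t count)
{
  double smallest = INFINITY;

  for (size_t i = 0; i < count; i++)
    if (values[i] < smallest)
      smallest = values[i];
  return smallest;
}
```

Because every comparison with NaN is false, NaN elements are simply
ignored by this loop.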
@deftypevr Macro float NAN
@standards{ISO, math.h}
An expression representing a value which is ``not a number''. This
macro is defined by @w{ISO C99}, but only on machines that support the
``not a number'' value, that is, on all machines that support IEEE
floating point.
You can use @samp{#ifdef NAN} after including @file{math.h} to test
whether the machine supports NaN.
@end deftypevr
@deftypevr Macro float SNANF
@deftypevrx Macro double SNAN
@deftypevrx Macro {long double} SNANL
@deftypevrx Macro _FloatN SNANFN
@deftypevrx Macro _FloatNx SNANFNx
@standards{TS 18661-1:2014, math.h}
@standardsx{SNANFN, TS 18661-3:2015, math.h}
@standardsx{SNANFNx, TS 18661-3:2015, math.h}
These macros, defined by TS 18661-1:2014 and TS 18661-3:2015, are
constant expressions for signaling NaNs.
@end deftypevr
@deftypevr Macro int FE_SNANS_ALWAYS_SIGNAL
@standards{ISO, fenv.h}
This macro, defined by TS 18661-1:2014, is defined to @code{1} in
@file{fenv.h} to indicate that functions and operations with signaling
NaN inputs and floating-point results always raise the invalid
exception and return a quiet NaN, even in cases (such as @code{fmax},
@code{hypot} and @code{pow}) where a quiet NaN input can produce a
non-NaN result. Because some compiler optimizations may not handle
signaling NaNs correctly, this macro is only defined if compiler
support for signaling NaNs is enabled. That support can be enabled
with the GCC option @option{-fsignaling-nans}.
@end deftypevr
@w{IEEE 754} also allows for another unusual value: negative zero. This
value is produced when you divide a positive number by negative
infinity, or when a negative result is smaller than the limits of
representation.
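Negative zero compares equal to positive zero, so detecting it requires
looking at the sign bit, for instance with the @w{ISO C99}
@code{signbit} macro. The following is a minimal sketch; the function
name @code{is_negative_zero} is an assumption made for this example.

```c
#include <math.h>

/* Hypothetical helper: nonzero if X is zero with its sign bit set.
   The equality test alone cannot tell the two zeros apart.  */
int
is_negative_zero (double x)
{
  return x == 0.0 && signbit (x);
}
```

For example, @code{is_negative_zero (1.0 / -INFINITY)} is nonzero,
while @code{is_negative_zero (0.0)} is zero.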
@node Status bit operations
@subsection Examining the FPU status word
@w{ISO C99} defines functions to query and manipulate the
floating-point status word. You can use these functions to check for
untrapped exceptions when it's convenient, rather than worrying about
them in the middle of a calculation.
These constants represent the various @w{IEEE 754} exceptions. Not all
FPUs report all the different exceptions. Each constant is defined if
and only if the FPU you are compiling for supports that exception, so
you can test for FPU support with @samp{#ifdef}. They are defined in
@file{fenv.h}.
@vtable @code
@item FE_INEXACT
@standards{ISO, fenv.h}
The inexact exception.
@item FE_DIVBYZERO
@standards{ISO, fenv.h}
The divide by zero exception.
@item FE_UNDERFLOW
@standards{ISO, fenv.h}
The underflow exception.
@item FE_OVERFLOW
@standards{ISO, fenv.h}
The overflow exception.
@item FE_INVALID
@standards{ISO, fenv.h}
The invalid exception.
@end vtable
The macro @code{FE_ALL_EXCEPT} is the bitwise OR of all exception macros
which are supported by the FP implementation.
These functions allow you to clear exception flags, test for exceptions,
and save and restore the set of exceptions flagged.
@deftypefun int feclearexcept (int @var{excepts})
@standards{ISO, fenv.h}
@safety{@prelim{}@mtsafe{}@assafe{@assposix{}}@acsafe{@acsposix{}}}
@c The other functions in this section that modify FP status register
@c mostly do so with non-atomic load-modify-store sequences, but since
@c the register is thread-specific, this should be fine, and safe for
@c cancellation. As long as the FP environment is restored before the
@c signal handler returns control to the interrupted thread (like any
@c kernel should do), the functions are also safe for use in signal
@c handlers.
This function clears all of the supported exception flags indicated by
@var{excepts}.
The function returns zero in case the operation was successful, a
non-zero value otherwise.
@end deftypefun
@deftypefun int feraiseexcept (int @var{excepts})
@standards{ISO, fenv.h}
@safety{@prelim{}@mtsafe{}@assafe{}@acsafe{}}
This function raises the supported exceptions indicated by
@var{excepts}. If more than one exception bit in @var{excepts} is set,
the order in which the exceptions are raised is undefined, except that
overflow (@code{FE_OVERFLOW}) or underflow (@code{FE_UNDERFLOW}) is
raised before inexact (@code{FE_INEXACT}). Whether the inexact
exception is also raised for overflow or underflow is likewise
implementation dependent.
The function returns zero in case the operation was successful, a
non-zero value otherwise.
@end deftypefun
@deftypefun int fesetexcept (int @var{excepts})
@standards{ISO, fenv.h}
@safety{@prelim{}@mtsafe{}@assafe{}@acsafe{}}
This function sets the supported exception flags indicated by
@var{excepts}, like @code{feraiseexcept}, but without causing enabled
traps to be taken. @code{fesetexcept} is from TS 18661-1:2014.
The function returns zero in case the operation was successful, a
non-zero value otherwise.
@end deftypefun
@deftypefun int fetestexcept (int @var{excepts})
@standards{ISO, fenv.h}
@safety{@prelim{}@mtsafe{}@assafe{}@acsafe{}}
Test whether the exception flags indicated by the parameter @var{excepts}
are currently set. If any of them are, a nonzero value is returned
which specifies which exceptions are set. Otherwise the result is zero.
@end deftypefun
To understand these functions, imagine that the status word is an
integer variable named @var{status}. @code{feclearexcept} is then
equivalent to @samp{status &= ~excepts} and @code{fetestexcept} is
equivalent to @samp{(status & excepts)}. The actual implementation may
be very different, of course.
Exception flags are only cleared when the program explicitly requests it,
by calling @code{feclearexcept}. If you want to check for exceptions
from a set of calculations, you should clear all the flags first. Here
is a simple example of the way to use @code{fetestexcept}:
@smallexample
@{
double f;
int raised;
feclearexcept (FE_ALL_EXCEPT);
f = compute ();
raised = fetestexcept (FE_OVERFLOW | FE_INVALID);
if (raised & FE_OVERFLOW) @{ /* @dots{} */ @}
if (raised & FE_INVALID) @{ /* @dots{} */ @}
/* @dots{} */
@}
@end smallexample
You cannot assign to the status word directly. You can, however,
save the state of the exception flags and restore it later. This is
done with the following functions:
@deftypefun int fegetexceptflag (fexcept_t *@var{flagp}, int @var{excepts})
@standards{ISO, fenv.h}
@safety{@prelim{}@mtsafe{}@assafe{}@acsafe{}}
This function stores in the variable pointed to by @var{flagp} an
implementation-defined value representing the current setting of the
exception flags indicated by @var{excepts}.
The function returns zero in case the operation was successful, a
non-zero value otherwise.
@end deftypefun
@deftypefun int fesetexceptflag (const fexcept_t *@var{flagp}, int @var{excepts})
@standards{ISO, fenv.h}
@safety{@prelim{}@mtsafe{}@assafe{}@acsafe{}}
This function restores the flags for the exceptions indicated by
@var{excepts} to the values stored in the variable pointed to by
@var{flagp}.
The function returns zero in case the operation was successful, a
non-zero value otherwise.
@end deftypefun
Note that the value stored in @code{fexcept_t} bears no resemblance to
the bit mask returned by @code{fetestexcept}. The type may not even be
an integer. Do not attempt to modify an @code{fexcept_t} variable.
@deftypefun int fetestexceptflag (const fexcept_t *@var{flagp}, int @var{excepts})
@standards{ISO, fenv.h}
@safety{@prelim{}@mtsafe{}@assafe{}@acsafe{}}
Test whether the exception flags indicated by the parameter
@var{excepts} are set in the variable pointed to by @var{flagp}. If
any of them are, a nonzero value is returned which specifies which
exceptions are set. Otherwise the result is zero.
@code{fetestexceptflag} is from TS 18661-1:2014.
@end deftypefun
@node Math Error Reporting
@subsection Error Reporting by Mathematical Functions
@cindex errors, mathematical
@cindex domain error
@cindex range error
Many of the math functions are defined only over a subset of the real or
complex numbers. Even if they are mathematically defined, their result
may be larger or smaller than the range representable by their return
type without loss of accuracy. These are known as @dfn{domain errors},
@dfn{overflows}, and
@dfn{underflows}, respectively. Math functions do several things when
one of these errors occurs. In this manual we will refer to the
complete response as @dfn{signalling} a domain error, overflow, or
underflow.
When a math function suffers a domain error, it raises the invalid
exception and returns NaN. It also sets @code{errno} to @code{EDOM};
this is for compatibility with old systems that do not support @w{IEEE
754} exception handling. Likewise, when overflow occurs, math
functions raise the overflow exception and, in the default rounding
mode, return @math{@infinity{}} or @math{-@infinity{}} as appropriate
(in other rounding modes, the largest finite value of the appropriate
sign is returned when appropriate for that rounding mode). They also
set @code{errno} to @code{ERANGE} if returning @math{@infinity{}} or
@math{-@infinity{}}; @code{errno} may or may not be set to
@code{ERANGE} when a finite value is returned on overflow. When
underflow occurs, the underflow exception is raised, and zero
(appropriately signed) or a subnormal value, as appropriate for the
mathematical result of the function and the rounding mode, is
returned. @code{errno} may be set to @code{ERANGE}, but this is not
guaranteed; it is intended that @theglibc{} should set it when the
underflow is to an appropriately signed zero, but not necessarily for
other underflows.
When a math function has an argument that is a signaling NaN,
@theglibc{} does not consider this a domain error, so @code{errno} is
unchanged, but the invalid exception is still raised (except for a few
functions that are specified to handle signaling NaNs differently).
Some of the math functions are defined mathematically to result in a
complex value over parts of their domains. The most familiar example of
this is taking the square root of a negative number. The complex math
functions, such as @code{csqrt}, will return the appropriate complex value
in this case. The real-valued functions, such as @code{sqrt}, will
signal a domain error.
Some older hardware does not support infinities. On that hardware,
overflows instead return a particular very large number (usually the
largest representable number). @file{math.h} defines macros you can use
to test for overflow on both old and new hardware.
@deftypevr Macro double HUGE_VAL
@deftypevrx Macro float HUGE_VALF
@deftypevrx Macro {long double} HUGE_VALL
@deftypevrx Macro _FloatN HUGE_VAL_FN
@deftypevrx Macro _FloatNx HUGE_VAL_FNx
@standards{ISO, math.h}
@standardsx{HUGE_VAL_FN, TS 18661-3:2015, math.h}
@standardsx{HUGE_VAL_FNx, TS 18661-3:2015, math.h}
An expression representing a particular very large number. On machines
that use @w{IEEE 754} floating point format, @code{HUGE_VAL} is infinity.
On other machines, it's typically the largest positive number that can
be represented.
Mathematical functions return the appropriately typed version of
@code{HUGE_VAL} or @code{@minus{}HUGE_VAL} when the result is too large
to be represented.
@end deftypevr
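A portable overflow test compares a result against @code{HUGE_VAL}
rather than against infinity. The sketch below shows the comparison;
the function name @code{is_huge} is hypothetical. Since some functions
can return @code{HUGE_VAL} as an exact result (for example when given an
infinite argument), a caller would normally combine this test with a
check of @code{errno} or of the overflow exception flag.

```c
#include <math.h>

/* Hypothetical helper: nonzero if RESULT is HUGE_VAL or -HUGE_VAL,
   the values math functions return when the result is too large.  */
int
is_huge (double result)
{
  return result == HUGE_VAL || result == -HUGE_VAL;
}
```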
@node Rounding
@section Rounding Modes
Floating-point calculations are carried out internally with extra
precision, and then rounded to fit into the destination type. This
ensures that results are as precise as the input data. @w{IEEE 754}
defines four possible rounding modes:
@table @asis
@item Round to nearest.
This is the default mode. It should be used unless there is a specific
need for one of the others. In this mode results are rounded to the
nearest representable value. If the result is midway between two
representable values, the even one is chosen. @dfn{Even} here
means the lowest-order bit is zero. This rounding mode avoids
statistical bias and helps numeric stability: the relative round-off
error of each individual @code{float} operation with a normalized
result is at most half of @code{FLT_EPSILON}.
@c @item Round toward @math{+@infinity{}}
@item Round toward plus Infinity.
All results are rounded to the smallest representable value
which is greater than or equal to the result.
@c @item Round toward @math{-@infinity{}}
@item Round toward minus Infinity.
All results are rounded to the largest representable value which is less
than or equal to the result.
@item Round toward zero.
All results are rounded to the largest representable value whose
magnitude is less than or equal to that of the result. In other words,
if the result is negative it is rounded up; if it is positive, it is
rounded down.
@end table
@noindent
@file{fenv.h} defines constants which you can use to refer to the
various rounding modes. Each one will be defined if and only if the FPU
supports the corresponding rounding mode.
@vtable @code
@item FE_TONEAREST
@standards{ISO, fenv.h}
Round to nearest.
@item FE_UPWARD
@standards{ISO, fenv.h}
Round toward @math{+@infinity{}}.
@item FE_DOWNWARD
@standards{ISO, fenv.h}
Round toward @math{-@infinity{}}.
@item FE_TOWARDZERO
@standards{ISO, fenv.h}
Round toward zero.
@end vtable
Underflow is an unusual case. Normally, @w{IEEE 754} floating point
numbers are always normalized (@pxref{Floating Point Concepts}).
Numbers whose magnitude is smaller than @math{2^r} (where @math{r} is
the minimum exponent, @code{FLT_MIN_EXP-1} for @code{float}) cannot be
represented as normalized numbers. Rounding all such numbers to zero or
@math{2^r} would cause some algorithms to fail at 0. Therefore, they
are left in denormalized form. That produces loss of precision, since
the leading bits of the mantissa are zero and no longer hold
significant digits.
If a result is too small to be represented as a denormalized number, it
is rounded to zero. However, the sign of the result is preserved; if
the exact result of the calculation was negative, the rounded result is
@dfn{negative zero}.
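Both behaviors can be checked with the classification macros from
@file{math.h}. The sketch below is illustrative; the function names are
hypothetical, and it assumes the default round-to-nearest mode and an
FPU that has not been switched into a flush-to-zero mode.

```c
#include <float.h>
#include <math.h>

/* Hypothetical check: half of the smallest normalized float is
   still representable, but only as a denormalized number.  */
int
half_of_min_is_subnormal (void)
{
  volatile float min = FLT_MIN;
  volatile float half = min / 2.0f;
  float h = half;
  return fpclassify (h) == FP_SUBNORMAL;
}

/* Hypothetical check: a negative result far too small for any
   denormalized float is rounded to zero with its sign kept.  */
int
tiny_negative_is_negative_zero (void)
{
  volatile float min = FLT_MIN;
  volatile float product = -min * min;
  float p = product;
  return p == 0.0f && signbit (p) != 0;
}
```

Under the stated assumptions both functions should return a nonzero
value.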