More than 5 years have passed since last update.

x87 instruction setを使う

Posted at 2018-06-23

はじめに

x87 instruction set は浮動小数点数を扱います。x86 instruction set の subset です。
x87 instruction setを使い方を説明します。次の Document の内容を抜粋します。

Intel 64 and IA-32 Architectures Software Developer's Manual
CHAPTER 8 PROGRAMMING WITH THE X87 FPU

sample code は MINGW64 の gcc で make します。

console

$ gcc --version
gcc.exe (Rev2, Built by MSYS2 project) 6.2.0
Copyright (C) 2016 Free Software Foundation, Inc.

Data Register

80bit の Data Register が 8個 (R0-R7) あります。

Data Register Stack

Data Register は Register Name(R0-R7) 指定の直接アクセスはできません。
Stack になっており Top-of-stack (TOP) からの相対位置でアクセスします。
ST(0)がTOPです。ST(0) から ST(7) がアクセス可能です。
TOPの初期値は0です。0-7の値を取ります。

FLD 命令は TOP をデクリメントし、
その後、Memoryからデータを読み込んで、80bitの形式に変換し、ST(0)に保存します。

FSTP 命令は ST(0)からデータを読み込んで、データ形式を変換し、Memoryに保存します。
その後、TOPをインクリメントします。 FLD の逆操作を行います。
データを Stack に残したい場合は FST 命令を使います。TOPは変化しません。

Data Type

80bit の Data Register に読み込める Memory 上のData Typeは次の通りです。
Data Register は常に80bitの形式で保持します。FLD / FST 命令時に自動変換します。

Instruction List

x87のInstruction Listを記載します。

Data Transfer Instructions

Load Constant Instructions

Name	Description
FLDZ	Load +0.0
FLD1	Load +1.0
FLDPI	Load $\pi$
FLDL2T	Load log2 10
FLDL2E	Load log2 e
FLDLG2	Load log10 2
FLDLN2	Load loge 2

Basic Arithmetic Instructions

Name	Description
FADD/FADDP	Add floating point
FIADD	Add integer to floating point
FSUB/FSUBP	Subtract floating point
FISUB	Subtract integer from floating point
FSUBR/FSUBRP	Reverse subtract floating point
FISUBR	Reverse subtract floating point from integer
FMUL/FMULP	Multiply floating point
FIMUL	Multiply integer by floating point
FDIV/FDIVP	Divide floating point
FIDIV	Divide floating point by integer
FDIVR/FDIVRP	Reverse divide
FIDIVR	Reverse divide integer by floating point
FABS	Absolute value
FCHS	Change sign
FSQRT	Square root
FPREM	Partial remainder
FPREM1	IEEE partial remainder
FRNDINT	Round to integral value
FXTRACT	Extract exponent and significand

Comparison and Classification Instructions

Name	Description
FCOM/FCOMP/FCOMPP	Compare floating point and set x87 FPU condition code flags.
FUCOM/FUCOMP/FUCOMPP	Unordered compare floating point and set x87 FPU condition code flags.
FICOM/FICOMP	Compare integer and set x87 FPU condition code flags.
FCOMI/FCOMIP	Compare floating point and set EFLAGS status flags.
FUCOMI/FUCOMIP	Unordered compare floating point and set EFLAGS status flags.
FTST	Test (compare floating point with 0.0).
FXAM	Examine.

Trigonometric Instructions

Name	Description
FSIN	Sine
FCOS	Cosine
FSINCOS	Sine and cosine
FPTAN	Tangent
FPATAN	Arctangent

Trigonometric Instructions

Name	Description
FINIT/FNINIT	Initialize x87 FPU
FLDCW	Load x87 FPU control word
FSTCW/FNSTCW	Store x87 FPU control word
FSTSW/FNSTSW	Store x87 FPU status word
FCLEX/FNCLEX	Clear x87 FPU exception flags
FLDENV	Load x87 FPU environment
FSTENV/FNSTENV	Store x87 FPU environment
FRSTOR	Restore x87 FPU state
FSAVE/FNSAVE	Save x87 FPU state
FINCSTP	Increment x87 FPU register stack pointer
FDECSTP	Decrement x87 FPU register stack pointer
FFREE	Free x87 FPU register
FNOP	No operation
WAIT/FWAIT	Check for and handle pending unmasked x87 FPU exceptions

sample code

fsin

x87 FPU には sin を取得する fsin 命令があります。

次にsample codeを示します。
fldl 命令は TOPをデクリメントして Memory から double のデータを ST(0) 保存します。
fsin 命令は ST(0)のデータをxとしてsin(x)を計算してST(0)に保存します。
fstpl 命令は ST(0)をMemoryに保存してTOPをインクリメントします。

test_s.s

    .globl  fsin
fsin:
    movq   %xmm0, -8(%rsp)
    fldl   -8(%rsp)
    fsin
    fstpl  -8(%rsp)
    movq   -8(%rsp), %xmm0
    ret

$-\pi$ ~ $\pi$ を1000分割してsin(x)を求めます。
fsin と C library sin の処理時間を比較します。

test.c

# include <stdio.h>
# include <stdint.h>
# include <inttypes.h>
# include <math.h>

# define NUM_OF_ARRAY (1024)

extern double sin(double x);
extern double fsin(double x);

static inline uint64_t RDTSCP()
{
  uint32_t hi, lo, aux;
  __asm__ volatile("rdtscp" : "=a" (lo), "=d" (hi), "=c" (aux));
  return ((uint64_t)hi << 32) | lo;
}

static double bufx[NUM_OF_ARRAY+1];
static double bufs[NUM_OF_ARRAY+1];
static double bufc[NUM_OF_ARRAY+1];

int main(int argc, char* argv[])
{
  uint64_t start, end, diffs, diffc;
  int count = NUM_OF_ARRAY, i;
  double xst=-1.0*M_PI, len = 2.0*M_PI;
  
  /* x */
  for (i=0; i<=count; i++) {
    double w = len * (double)i / count;
    bufx[i] = xst+w;
  }
  
  /* fsin */
  start = RDTSCP();
  for (i=0; i<=count; i++) {
    bufs[i]  = fsin(bufx[i]);
  }
  end = RDTSCP();
  diffs = end - start;
  
  /* sin(C) */
  start = RDTSCP();
  for (i=0; i<=count; i++) {
    bufc[i]  = sin(bufx[i]);
  }
  end = RDTSCP();
  diffc = end - start;
  
  /* print */
  printf("x\tfsin(x)\tsin(x)\n");
  for (i=0; i<=count; i++) {
    printf("%+.16e\t%+.16e\t%+.16e\n", bufx[i], bufs[i], bufc[i]);
  }
  
  printf("fsin tsc %" PRIu64 " sin tsc %" PRIu64 "\n", diffs/NUM_OF_ARRAY, diffc/NUM_OF_ARRAY);
  return 0;
}

makefile

CFLAGS=-I. -Wall -Werror -O2
INCS=
OBJS=test.o test_s.o
LIBS=
TARGET=test

all: $(TARGET)

%.o: %.c $(INCS)
	$(CC) $(CFLAGS) -c -o $@ $<
	objdump -d $@ > $@.disas

%.o: %.s $(INCS)
	$(CC) $(CFLAGS) -c -o $@ $<

$(TARGET): $(OBJS)
	$(CC) $(CFLAGS) -o $@ $^ $(LIBS)
	objdump -d $@.exe > $@.exe.disas

clean:
	rm -rf $(TARGET) *.o *.disas

make & execute します。

console

$ make && ./test.exe
cc -I. -Wall -Werror -O2 -c -o test.o test.c
objdump -d test.o > test.o.disas
cc -I. -Wall -Werror -O2 -c -o test_s.o test_s.s
cc -I. -Wall -Werror -O2 -o test test.o test_s.o
objdump -d test.exe > test.exe.disas
x       fsin(x) sin(x)
-3.1415926535897931e+000        -1.2246063538223773e-016        -1.2246063538223773e-016
-3.1354567304382504e+000        -6.1358846491547988e-003        -6.1358846491547988e-003
....
fsin tsc 370 sin tsc 445

C library の sin の方が遅くなっています。

fsin tsc 370 sin tsc 445

test.exe を　disassemble してみます。
C library の sin は内部で fsin 命令を使っていました。

0000000000402ca0 <__sinl_internal>:
  402ca0:	db 2a                	fldt   (%rdx)
  402ca2:	d9 fe                	fsin

dot product

test_s.s

    .globl  dot
# %rcx = buf
dot:
    fldl   0(%rcx)   # ST[0]  = buf[0]
    fmull  8(%rcx)   # ST[0]  = buf[0] * buf[1]
    fldl  16(%rcx)   # ST[0]  = buf[2],          ST[1] = buf[0] * buf[1]
    fmull 24(%rcx)   # ST[0]  = buf[2] * buf[3], ST[1] = buf[0] * buf[1]
    fadd  %st(1),%st # ST[0] += ST[1]
    fstpl -8(%rsp)
    ffree %st        # ST[0] <= empty
    movq  -8(%rsp), %xmm0
    ret

test.c

# include <stdio.h>
# include <stdint.h>
# include <inttypes.h>
# include <math.h>

/* ret = buf[0] * buf[1] + buf[2] * buf[3] */
extern double dot(double* buf);

static inline uint64_t RDTSCP()
{
  uint32_t hi, lo, aux;
  __asm__ volatile("rdtscp" : "=a" (lo), "=d" (hi), "=c" (aux));
  return ((uint64_t)hi << 32) | lo;
}

int main(int argc, char* argv[])
{
  uint64_t start, end;
  double buf[] = { 5.6, 2.4, 3.8, 10.3 }, v;
  
  start = RDTSCP();
  v = dot(buf);
  end = RDTSCP();
  printf("dot %+.16e tsc %" PRIu64 "\n", v, end - start);
  
  return 0;
}

console

$ make && ./test.exe
cc -I. -Wall -Werror -O2 -c -o test.o test.c
objdump -d test.o > test.o.disas
cc -I. -Wall -Werror -O2 -c -o test_s.o test_s.s
cc -I. -Wall -Werror -O2 -o test test.o test_s.o
objdump -d test.exe > test.exe.disas
dot +5.2579999999999998e+001 tsc 209

references

[1] Intel 64 and IA-32 Architectures Software Developer's Manual

You get articles that match your needs
You can efficiently read back useful information
You can use dark theme

What you can do with signing up