;-----------------------------------------------------------------------------
;
;       PREFCHSZ.ASM
;
;       Copyright (c) 1990, 1991, 1995-Present  Robert Collins
;
;       You have my permission to copy and distribute this software for
;       non-commercial purposes.  Any commercial use of this software or
;       source code is allowed, so long as the appropriate copyright
;       attributions (to me) are intact, *AND* my email address is properly
;       displayed.
;
;       Basically, give me credit, where credit is due, and show my email
;       address.
;
;-----------------------------------------------------------------------------
;
;       Robert R. Collins               email:  rcollins@x86.org
;
;-----------------------------------------------------------------------------
; PREFETCH_SIZE:  Determines the size of the prefetch queue using
;                 self-modifying code.
;-----------------------------------------------------------------------------
;
; *** WARNING ***  This algorithm ONLY works on 80386-DX and 80386-SX.
;
;-----------------------------------------------------------------------------
; On these CPUs the Bus Interface Unit (BIU) is constantly fetching opcodes
; when it is idle.  Memory operations preempt prefetches.  This knowledge is
; necessary for an accurate algorithm that determines the prefetch queue size.
;
; Such an algorithm must execute some self-modifying code.  If the modified
; code lies inside the prefetch queue, then it (the modified code) doesn't
; get executed.  If it lies outside the prefetch queue, then it does get
; executed.  Therefore to determine the size of the prefetch queue, we need
; to self-modify code until we determine that it gets executed.
;
; In order for this to work, we need the prefetch queue to be completely full,
; the decode unit must have its maximum number of instructions fully decoded,
; and no bus cycles pending.
; We can do this by executing instructions that take longer to execute than
; any prefetch and place no demands on the Bus Interface Unit (BIU), either
; by accessing memory or I/O.  BSF is ideal for this purpose because it can
; take up to 104 clocks on the '386 and up to 43 clocks on the '486.
;
; Filling the prefetch queue is non-trivial -- because of the 3-instruction
; decode unit.  Both the decode unit and prefetch queue have a point at which
; they get triggered to reload themselves.  It's hard to know exactly where
; this point is, but by writing code, ICEing it, and analyzing the results,
; it is possible to deduce where this point is.  So the trouble in filling
; the prefetch queue becomes one in which you get both the decode unit and
; prefetch unit in a known state -- regardless of prefetch queue size --
; before attempting to self-modify any code.
;
;-----------------------------------------------------------------------------
;
; The process of fetch/execute after a JuMP instruction always begins by
; filling the prefetch queue and immediately transferring those instructions
; to the decode unit.  During this process, the prefetch queue reaches its
; low water (LW) point and requests more data.  Likewise when the decode
; unit reaches its LW point, it requests more instructions from the prefetch
; unit.
;
;              16-byte               16-byte               12-byte
;    Decode   Prefetch     Decode   Prefetch     Decode   Prefetch
;     Unit      Unit        Unit      Unit        Unit      Unit
;              (LW=4)                (LW=8)                (LW=4)
;   +-----+    +----+     +-----+    +----+     +-----+    +----+
;   | X1  |<---| C  |     | X1  |<---| C  |     | X1  |<---| C  |
;   +-----+    +----+     +-----+    +----+     +-----+    +----+
;   | X2t |<---| C  |     | X2t |<---| Ct |     | X2t |<---| Ct |
;   +-----+    +----+     +-----+    +----+     +-----+    +----+
;   | X3  |<---| Ct |     | X3  |<---| C  |     | X3  |<---| C  |
;   +-----+    +----+     +-----+    +----+     +-----+    +----+
;   | X4t |<---| C  |     | X4t |<---| Ct |     | X4t |<---| Ct |
;   +-----+    +----+     +-----+    +----+     +-----+    +----+
;   |  4  |<---| C  |     |  4  |<---| C  |     |  4  |<---| C  |
;   +-----+    +----+     +-----+    +----+     +-----+    +----+
;   |  8  |<---| Ct |     |  8  |<---| Ct |     |  8  |<---| Ct |
;   +-----+    +----+     +-----+    +----+     +-----+    +----+
;   | 12  |<---| 12 |\    | 12  |<---| 12 |\    | 12  |<---| 12 |\
;   +-----+    +----+ |   +-----+    +----+ |   +-----+    +----+ |  Current
;              | 16 | |              | 16 | |              | 16 | |  Prefetch
;              +----+ |              +----+ |              +----+ |  Unit
;              | 20 | |              | 20 | |              | 20 |/
;              +----+ |              +----+ |              +----+
;              | 24 |/               | 24 |/
;              +----+                +----+
;
;       C   = Copied from prefetch unit to decode unit
;       Ct  = Copy triggered the prefetch unit to reload
;       Xn  = Execution "n" begins from the decode unit
;       Xnt = Execution "n" triggered the decode unit to reload
;
; Self-modifying code @ X4 will make the 16-byte prefetch queue appear to
; have 24 bytes, and a 12-byte prefetch queue appear to have 20 bytes.
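;
; Worked example (a sketch, not part of the original notes; numbers here are
; decimal).  It assumes, as the "subtract size of the decode unit" step later
; in this file states, that the decode unit accounts for the 8 extra bytes:
;
;       apparent size - decode unit = reported prefetch queue size
;            24       -      8      =  16   (16-byte queue)
;            20       -      8      =  12   (12-byte queue)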
;
;
;-----------------------------------------------------------------------------
; Input:   None
; Output:  BL = Prefetch size
; Register(s) modified:  AX, BX
;-----------------------------------------------------------------------------

;-----------------------------------------------------------------------------
; Compiler directives
;-----------------------------------------------------------------------------
        Page    60,132
        .radix  16
        .386P

;-----------------------------------------------------------------------------
; PUBLIC statements here
;-----------------------------------------------------------------------------
        Public  Prefetch_size

;-----------------------------------------------------------------------------
; Local variables here
;-----------------------------------------------------------------------------
        Max_size        equ     20h
        Current_size    equ     word ptr [bp-2]

_TEXT   SEGMENT PARA USE16 PUBLIC 'CODE'
        ASSUME  CS:_TEXT, DS:NOTHING, ES:NOTHING, SS:NOTHING

        align   10h
;-----------------------------------------------------------------------------
Prefetch_size   proc    near    ; Determines the size of the prefetch queue.
;-----------------------------------------------------------------------------
; Input:   None
; Output:  AL = Low Water point in prefetch queue
;          BL = Size of the prefetch queue
; Register(s) modified:  EAX, EBX
;-----------------------------------------------------------------------------
        pushf                           ; save direction flag for restoration
        push    ds
        push    ecx
        push    edx
        push    esi
        push    edi
        push    ebp                     ; create a small stack frame
        cli
        cld                             ; clear direction
        mov     bp,sp
        sub     sp,2
        mov     Current_size,4
        mov     bx,cs
        mov     ds,bx
        mov     cx,0

;-----------------------------------------------------------------------------
; Sizing the prefetch queue is an iterative process of self-modifying code
; and testing the results.  Whenever the hoped-for results aren't achieved,
; then we know that it was because the code was already in the prefetch
; queue.
; In such a case, we must restore the original code (that was
; self-modified), and continue the iterative process until we detect a change.
;
; This algorithm detects the change by looking for a change in CX.  The
; algorithm executes BSF EDX,EDI which places a 1F in EDX.  The
; self-modifying code attempts to change this op code to BSF ECX,EDI which
; will place the result in ECX not EDX.  Therefore when ECX=1F we have
; modified an instruction beyond the end of the prefetch queue.
;
; Just to make sure that the algorithm doesn't go wild, we should have a
; fail-safe way out.  Therefore when we get to a "MAX_SIZE" we will quit
; attempting the algorithm.  For further safety, all instructions within the
; actual algorithm are 4-byte opcodes.  This ensures that the decode unit,
; prefetch unit, and bus size are all operating on the same size data.
;-----------------------------------------------------------------------------
@Test:  add     Current_Size,4          ; continue in 4-byte granularity.
        cmp     Current_size,Max_size   ; should we stop this nonsense?
        ja      short @F                ; yep, we can't do this forever
        mov     bx,Current_Size         ; get current PF Q size
        lea     bx,[bx][offset @Modify[4]]; xlat as a pointer to the op code
                                        ; This actually points 1-byte beyond
                                        ; the desired op code.  This is
                                        ; compensated for later.
        mov     byte ptr cs:[bx-5],0d7h ; restore the previously modified instr.
        call    Prefetch_queue          ; do it
        cmp     cx,1fh                  ; CX get modified?
        jne     short @Test             ; NO

;-----------------------------------------------------------------------------
; When we get here, we have either determined the size of the prefetch queue,
; or failed to do so.  Either way, we need to subtract the size of the
; decode unit, and report the final size.
;-----------------------------------------------------------------------------
@@:     sub     Current_Size,8          ; subtract size of the decode unit

;-----------------------------------------------------------------------------
; Now we determine the LOW WATER point of the prefetch queue.
; This is done by successively reading the undocumented CPU register -- TR4.
; The first change in TR4 will indicate the granularity in which the prefetch
; queue requests data.  By subtracting this number from the prefetch queue
; size, we determine the LOW WATER point for the prefetch queue.
;-----------------------------------------------------------------------------
        call    Low_Water               ; get low water point
        sub     ebx,eax                 ; any difference?
        jnz     short @F                ; YES! we found the PF size granularity
        mov     ebx,ecx                 ; get next attempt
        sub     ebx,eax                 ; any difference?
        jnz     short @F                ; YES! we found the PF size granularity
        mov     ebx,edx                 ; get next attempt
        sub     ebx,eax                 ; any difference?
        jnz     short @F                ; YES! we found the PF size granularity
        mov     ebx,esi                 ; get next attempt
        sub     ebx,eax                 ; any difference?
        jnz     short @F                ; YES! we found the PF size granularity
        mov     ebx,edi                 ; get next attempt
        sub     ebx,eax                 ; any difference?

@@:     movzx   eax,Current_Size        ; get current size
        sub     al,bl                   ; AL=Low water point
        movzx   ebx,Current_Size        ; BL=Prefetch queue size

        mov     sp,bp                   ; restore stack frame
        pop     ebp
        pop     edi
        pop     esi
        pop     edx
        pop     ecx
        pop     ds
        popf                            ; restore DF because it got hosed
        ret                             ; and split!

;-----------------------------------------------------------------------------
Prefetch_queue:
;-----------------------------------------------------------------------------
;       * Preload the cache
;       * Align refresh request to prefetch queue
;       * Preload decode unit and prefetch queue
;       * Self-modify code
;-----------------------------------------------------------------------------
        mov     cx,@Modify_end
        mov     si,offset @Fill
        rep     lods byte ptr cs:[si]   ; force subroutine into cache
        mov     edi,80000000h
        call    Align_Refresh
        jmp     far ptr @Fill

        align   10h
;-----------------------------------------------------------------------------
; The next three instructions are needed to preload the decode unit and
; prefetch unit into a known state -- regardless of CPU stepping.
;-----------------------------------------------------------------------------
@Fill:  bsf     edx,edi                 ; Preload decode unit and prefetch
        bsf     edx,edi                 ; queue.
        bsf     edx,edi

@Modify:
        xor     [bx-1],byte ptr 18h     ; This will modify one of the following
                                        ; instructions to BSF ECX,EDI.  This
                                        ; form of XOR was chosen because it is
                                        ; a 4-byte op code.
        bsf     edx,edi
        bsf     edx,edi
        bsf     edx,edi
        bsf     edx,edi
        bsf     edx,edi
        bsf     edx,edi
        bsf     edx,edi
        bsf     edx,edi
        bsf     edx,edi
        bsf     edx,edi
        ret

;-----------------------------------------------------------------------------
Low_Water:
;-----------------------------------------------------------------------------
;       * Preload the cache
;       * Align refresh request to prefetch queue
;       * Preload decode unit and prefetch queue
;       * Successively load CPU registers with TR4
;-----------------------------------------------------------------------------
        push    ebp
        mov     cx,@Modify_end
        mov     si,offset @Fill
        rep     lods byte ptr cs:[si]   ; force subroutine into cache
        mov     edi,80000000h
        call    Align_Refresh
        jmp     far ptr @Water

        align   10h
@Water: bsf     ebp,edi                 ; Preload decode unit and prefetch
        bsf     ebp,edi                 ; queue.
        bsf     ebp,edi
        db      66h,0fh,24h,0e0h        ; mov eax,tr4
        bsf     ebp,edi                 ; don't let prefetch q get exhausted
        db      66h,0fh,24h,0e3h        ; mov ebx,tr4
        bsf     ebp,edi                 ; don't let prefetch q get exhausted
        db      66h,0fh,24h,0e1h        ; mov ecx,tr4
        bsf     ebp,edi                 ; don't let prefetch q get exhausted
        db      66h,0fh,24h,0e2h        ; mov edx,tr4
        bsf     ebp,edi                 ; don't let prefetch q get exhausted
        db      66h,0fh,24h,0e6h        ; mov esi,tr4
        bsf     ebp,edi                 ; don't let prefetch q get exhausted
        db      66h,0fh,24h,0e7h        ; mov edi,tr4
        pop     ebp
        ret
@Modify_end     equ     $-@Modify

Prefetch_size   endp

;-----------------------------------------------------------------------------
Align_Refresh   proc    near            ; Align to a refresh request by
;                                       ; polling port 61h, bit 4.
;-----------------------------------------------------------------------------
; Input:   None
; Output:  None
; Register(s) modified:  None
;-----------------------------------------------------------------------------
        push    ax
@@:     in      al,61h                  ; get hardware status port
        test    al,10h                  ; refresh request go low?
        jnz     short @B                ; nope, not yet

@@:     in      al,61h                  ; get hardware status port
        test    al,10h                  ; refresh request go high?
        jz      short @B                ; nope, not yet
        pop     ax
        ret
Align_Refresh   endp

_TEXT   ends
        end