Efficient BASIC coding for the ZX Spectrum (IV)

[Click here to read this in English ]

Éste es el cuarto de una serie de artículos que explican los fundamentos de la (in)eficiencia de los programas en BASIC puro para el ZX Spectrum:
I. Sobre los números de línea
II. Sobre las variables
III. Sobre las expresiones
IV. Funcionalidades diversas y medida del tiempo
V. Operaciones en la pantalla basadas en caracteres

En esta entrega hablaremos sobre la eficiencia de algunas funcionalidades importantes que ofrece el ZX Spectrum para los programadores en Sinclair BASIC, así como de la medida del tiempo de ejecución en los programas. Las operaciones de pantalla basadas en caracteres las dejamos para la siguiente entrada.

Para navegar más fácilmente por la entrada, éstos son los apartados que contiene:

Dibujar. Tiempos de ejecución de PLOT, DRAW, CIRCLE, POINT.
Leer el teclado. Tiempos de ejecución de INKEY$, PEEK (LAST_K) e IN (254).
Crear UDGs. Cómo disimular el tiempo de definición de los UDGs.
Copiar datos en memoria. El truco del DEFADD para copias rápidas de bloques de memoria.
Medir el tiempo de ejecución. De cualquier sentencia o parte de un programa BASIC.
La influencia de los conflictos CPU/ULA. Estos conflictos pueden retrasar (ligeramente) la ejecución del BASIC.

Dibujar

Especialmente sensibles en cuanto a velocidad de ejecución son las sentencias gráficas de dibujo (PLOT, DRAW, CIRCLE), como sabe cualquiera que haya programado lo más mínimo en este lenguaje. DRAW cuando dibuja arcos, y CIRCLE siempre, utilizan cálculos en coma flotante repetidamente, lo que las hace muy lentas a pesar de que sus implementaciones en la ROM están optimizadas. PLOT no es ineficiente en sí mismo (si se le pasan expresiones enteras, evitando así el cálculo en coma flotante), aunque para dibujar con PLOT algo decente hay que hacer muchos PLOT, y, por tanto, el coste se multiplica. DRAW utiliza el algoritmo de Bresenham para líneas rectas, el cual es bastante eficiente porque además no llama al calculador, pero, al igual que el resto de rutinas, tiene que establecer también los atributos de pantalla por los que pasa, además del modo (INVERSE, OVER), lo que lo enlentece.

Con el siguiente programa se pueden estimar los tiempos que tardan PLOT, POINT y DRAW cuando se llaman con parámetros numéricos literales enteros (y no usan modos de escritura especiales ni colores):

10  LET n = 10000
11  POKE 23672 , 0 :  POKE 23673 , 0 :  POKE 23674 , 0
15  FOR f = 1 TO n :  PLOT 100 , 100 :  NEXT f
25  LET T =  PEEK 23672 + 256 *  PEEK 23673 + 65536 *  PEEK 23674 :  PRINT  AT 0 , 0 ; “PLOT: “ ;  ( T * 0.02 / n ) 
50  LET n = 10000 :  LET k = 0
61  POKE 23672 , 0 :  POKE 23673 , 0 :  POKE 23674 , 0
75  FOR f = 1 TO n :  LET k =  POINT  ( 100 , 100 )  :  NEXT f
85  LET T =  PEEK 23672 + 256 *  PEEK 23673 + 65536 *  PEEK 23674 :  PRINT  AT 1 , 0 ; “POINT: “ ;  ( T * 0.02 / n ) 
100  LET n = 1000 :  PLOT 0 , 80
110  POKE 23672 , 0 :  POKE 23673 , 0 :  POKE 23674 , 0
150  FOR f = 1 TO n :  DRAW 255 , 0 :  DRAW  – 255 , 0 :  NEXT f
160  LET T =  PEEK 23672 + 256 *  PEEK 23673 + 65536 *  PEEK 23674 :  PRINT  AT 2 , 0 ; “HORIZ-DRAW: “ ;  ( T * 0.02 /  ( 2 * n )  ) 
200  LET n = 1000 :  PLOT 0 , 0
210  POKE 23672 , 0 :  POKE 23673 , 0 :  POKE 23674 , 0
250  FOR f = 1 TO n :  DRAW 0 , 175 :  DRAW 0 ,  – 175 :  NEXT f
260  LET T =  PEEK 23672 + 256 *  PEEK 23673 + 65536 *  PEEK 23674 :  PRINT  AT 3 , 0 ; “VERT-DRAW: “ ;  ( T * 0.02 /  ( 2 * n )  ) 

Los tiempos medios obtenidos por este programa para cada llamada individual a una de estas sentencias son: PLOT, 7.274 milisegundos; POINT, 9.208 milisegundos; DRAW (línea horizontal de 255 píxeles), 55.16 milisegundos; DRAW (línea vertical de 175 píxeles), 26.89 milisegundos. Curiosamente, trazar líneas verticales es mucho más rápido que hacerlas horizontales (el doble de rápido, a pesar de que las verticales que dibuja el programa tienen casi un 70% de la longitud de las horizontales), y PLOT es más rápido que POINT.

En cuanto a CIRCLE, modificando el programa anterior para que repita 100 veces este comando con distintos radios, obtenemos estos tiempos medios:

La figura muestra claramente cómo la rutina CIRCLE de la ROM divide la circunferencia en varios arcos que resuelve con líneas rectas, lo que produce los “escalones” que se observan (usa distinto número de divisiones de la circunferencia para distintos rangos de valores del radio; por ejemplo, usa el mismo número de escalones para radios entre 40 y 50 píxeles aproximadamente).

Despreciando el efecto cuadrático por su poca importancia en la ecuación, hacer círculos con parámetros enteros resulta aproximadamente proporcional al radio (o a la circunferencia), de forma que tarda unos 15 milisegundos más por cada incremento de 1 píxel en el radio. Nótese que los tiempos del eje vertical son muy elevados: incluso con círculos muy pequeños (1 píxel de radio), sólo podrían dibujarse 3 en 1 segundo…

En resumen:

Las rutinas de dibujo —y las de consulta gráfica, como POINT— deberían usarse esporádicamente, con argumentos enteros, y en lugares que no tengan requisitos de velocidad elevados, dando preferencia en todo caso al dibujo rectilíneo.

Leer el teclado []

Hay 3 formas de leer el teclado en BASIC: INKEY$, PEEK 23560 (variable del sistema LAST_K) e IN de cualquier puerto cuyo byte bajo sea 254. Para saber cuál es la más rápida se puede probar el siguiente programa, cambiando la forma en cuestión por la que queramos (su ejecución tarda unos 10 minutos):

10  POKE 23672 , 0 :  POKE 23673 , 0 :  POKE 23674 , 0 :  FOR f = 1 TO 100000 :  LET c =  PEEK 23560 :  NEXT f :  LET T =  PEEK 23672 + 256 *  PEEK 23673 + 65536 *  PEEK 23674 :  PRINT “PEEK 23560: “ ; T ; ”  “ ; T * 0.02 ; ”  “ ;  ( T * 0.02 / 100000 ) 

El último número mostrado en pantalla al terminar es el tiempo en segundos que tarda una de las 10000 iteraciones del bucle, lo que involucra la lectura de teclado y otras cosas (asignación de variable y salto de bucle), pero, si se compara con los resultados de las otras dos formas, siempre usando el resto del programa sin cambios, las diferencias reflejarán únicamente el modo de lectura. En particular, los resultados indican que, de media, INKEY$ es 367.2 microsegundos más lenta que las otras dos, que básicamente son indistinguibles entre sí con la precisión conseguida por el programa (las medias de tiempo obtenidas están, con alrededor de un 95% de probabilidad, en +/-36.5 microsegundos de la media).

Crear UDGs []

La creación de gráficos definidos por el usuario (UDGs), si se hace mediante lectura de sentencias DATA en lugar de cargarlos de cinta, es bastante lenta (y, si se hace desde cinta, también ;P), debido al propio bucle, a la lectura de los DATA y a que tendremos que actualizar un puntero a memoria donde hacer POKE con cada dato, lo que implica evaluar una expresión.

No es fácil mejorar esto, pero sí se puede hacer que la espera no sea insufrible para el usuario con una pequeña estrategia:

En cada iteración del bucle FOR se puede aprovechar para mostrar en pantalla parte de la ayuda del programa, o cualquier otro elemento que mantenga la atención lejos de la espera. Esta técnica resulta bastante efectiva en la práctica, y sencilla de implementar si la ayuda o el elemento que se ponga en pantalla se lee de los mismos DATA.

Existen otras posibilidades, como incluir los datos de los UDG al final de una línea del programa BASIC, ya sea en un comentario REM o en un exceso oculto de longitud de línea, y posteriormente establecer la variable del sistema UDG para que apunte allí, pero requieren una manipulación del programa BASIC en su forma binaria bastante delicada y tediosa. Además, esto supone un incremento de tiempo de carga equivalente al de tener que cargar los datos desde cinta.

Quizás la forma menos ofuscada de ahorrar memoria en el almacenamiento de los UDGs, si cargarlos directamente de un archivo con LOAD "" CODE es demasiado costoso en tiempo cuando son pocos, sea, en lugar de situarlos en sentencias DATA, ponerlos en una cadena de texto (un carácter por cada byte de los UDGs), que sea rastreada por el programa para pasarlos a la zona de memoria donde deban estar. Esto, de hecho, puede ser útil para almacenar cualquier conjunto de constantes numéricas enteras (que quepan en un byte) ocupando menos espacio que con DATA. La única restricción es que el valor 34 decimal (las comillas dobles) no puede estar en la secuencia a menos que haya otro 34 justo después, pues uno solo indicaría el fin del texto.

De esta manera, en lugar de tener el siguiente programa:

10  CLS  :  GOSUB 9000
20  PRINT  AT 0 , 0 ; “\(paper 0)\(ink 7)\(bright 1)Hello, world! \(paper 0)\(ink 2)\udg(A)\(paper 2)\(ink 6)\udg(A)\(paper 6)\(ink 4)\udg(A)\(paper 4)\(ink 5)\udg(A)\(paper 5)\(ink 0)\udg(A)\(paper 0)” :  PAUSE 0 :  STOP 
9000  REM — create udgs 
9010  RESTORE 9020 :  FOR f = 1 TO 8 :  READ b :  POKE  USR “a” + f – 1 , b :  NEXT f :  RETURN 
9020  DATA 1 , 3 , 7 , 15 , 31 , 63 , 127 , 255 :  REM “A” -> solid slash

se puede escribir éste:

10  CLS  :  GOSUB 9000
20  PRINT  AT 0 , 0 ; “\(paper 0)\(ink 7)\(bright 1)Hello, world! \(paper 0)\(ink 2)\udg(A)\(paper 2)\(ink 6)\udg(A)\(paper 6)\(ink 4)\udg(A)\(paper 4)\(ink 5)\udg(A)\(paper 5)\(ink 0)\udg(A)\(paper 0)” :  PAUSE 0 :  STOP 
9000  REM — create udgs 
9010  FOR f = 1 TO 8 :  POKE  USR “a” + f – 1 ,  CODE  ( “\[0103070f1f3f7fff]” ( f )  )  :  NEXT f :  RETURN 

lo que da el mismo resultado en ambos casos:

Esto sólo se puede hacer si disponemos de un editor de código fuente BASIC que admita escribir cualquier byte dentro de una cadena (cosa que no permite el editor de código del ZX original). El sintetizador de ZX-Basicus sí tiene esa capacidad, usando la sintaxis especial que se ha mostrado en los ejemplos anteriores.

Copiar datos en memoria []

A veces un programa BASIC se ve en la necesidad de copiar cierta cantidad de datos de una dirección de memoria a otra (por ejemplo, a o desde la pantalla). En BASIC esto sólo puede hacerse con un bucle FOR, lo que puede ser desesperadamente lento.

Hay un truco conocido para hacer esto mucho más rápidamente aprovechándose del comportamiento del intérprete de BASIC de la ROM, llamado brevemente truco del DEF FN o de DEFADD. De hecho, el sintetizador incluido en la herramienta ZX-Basicus puede generar código automáticamente para implementar este truco en programas BASIC existentes (opción --defadd). En lo que sigue explicamos con detalle en qué consiste.

Ya vimos en la segunda entrega de esta serie que las funciones definidas por el usuario con DEF FN traen consigo lo único parecido a un ámbito (scope) local de variables en un programa en BASIC para el ZX Spectrum: junto a cada parámetro de entrada a la función, en el mismo lugar del programa donde ésta se define con DEF FN, se reservan huecos para copiar, en tiempo de ejecución, los argumentos (parámetros reales) que se le pasan a la función cada vez que se llama con FN. Ese conjunto de huecos es el ámbito local al que nos referimos, y, en la práctica, es un área de variables separada del área global. Esta última es la que se usa la mayor parte del tiempo y se sitúa tras el programa BASIC.

Puesto que reservar huecos del mismo tamaño que los argumentos que va a recibir una función es totalmente impráctico cuando éstos son de tipo cadena de texto, ya que pueden ser de tamaños muy diferentes cada vez que la misma función se llame y copiarlos a donde está la sentencia DEF FN muy lento, el intérprete de ROM guarda, en los huecos, no la cadena de texto, sino un puntero a la misma, junto con su longitud. El tipo de datos puntero no existe en Sinclair BASIC salvo en este lugar.

El disponer de variables puntero, y el saber que cuando el intérprete copia una cadena sobre otra variable cadena lo hace rapidísimamente por medio de la instrucción ensamblador LDIR, es lo que permite engañar a la ROM para que haga copias de memoria de y a donde queramos.

Para ello, basta con crear, en algún sitio de memoria libre, una supuesta sentencia DEF FN que tenga como parámetros dos variables de tipo cadena de texto y sus correspondientes huecos reservados para los punteros. Luego, se fuerza al intérprete a creer que está evaluando una función de usuario FN aunque eso no sea cierto, indicándole que la DEF FN correspondiente está en ese lugar de memoria. El resultado es que, a partir de ahí, cualquier referencia en nuestro programa a alguna variable de las que aparecen como parámetros en nuestra falsa DEF FN será tomada de los punteros que hemos preparado; en particular, una asignación del valor de una de las dos cadenas a la otra hará una copia de todo el contenido de memoria de la primera en la segunda. ¿Qué sucede si la segunda es un puntero a la pantalla? Pues que podemos mover grandes bloques hacia ella de forma muy rápida (no tanto como LDIR por los procesos adicionales del intérprete, pero cerca). Y viceversa.

Más concretamente, el formato en memoria de la DEF FN falsa debe ser el siguiente (los números están en decimal):

donde cada celda es un byte y el conjunto de 9 celdas debe repetirse por cada una de las falsas variables locales (punteros) que se quiera tener. En particular, “v” debe ser la letra minúscula de la variable (una letra distinta para cada una, claro); aL y aH deben definir, en orden little endian, la dirección de comienzo de los datos de la falsa variable; sL y SH su longitud en bytes de memoria; y m es una marca de fin de variable: el código ASCII de una coma (“,”) si no es la última variable que estamos definiendo y el de un cierre de paréntesis (“)”) en caso contrario.

Una vez definida así, en alguna zona de memoria P, la falsa DEF FN, sus variables quedan a disposición del programa en el momento en que engañemos al intérprete indicándole que estamos dentro de una supuesta función FN, lo que se logra cambiando la variable del sistema DEFADD por la dirección P. Toda referencia a variable que no se encuentre entre las falsas será tomada sin problemas de la zona común de variables del programa, pero ésas no. Se puede desactivar el conjunto de falsas variables volviendo a cambiar el valor de DEFADD por 0. Para más señas, la variable del sistema DEFADD está situada en la dirección de memoria 23563 y tiene 2 bytes, también almacenados en orden little endian.

Para ilustrar la potencia de este truco, hemos preparado un programa en BASIC que mide su tiempo de ejecución a la hora de copiar bloques de memoria de diversa longitud hacia la pantalla (el contenido de los bloques no nos interesa aquí):

10  CLEAR 49999 :  GOTO 30 :  REM memory-copy is done in statement 20:2
20  FOR f = 1 TO N :  LET a$ = b$ :  POKE 50014 ,  INT  (  RND  * 30 )  :  NEXT f :  RETURN 
30  POKE 23672 , 0 :  POKE 23673 , 0 :  POKE 23674 , 0
40  LET x = 14 :  LET z = 0 :  LET notlastmark = 44 :  LET lastmark = 41
50  LET p = 50000 :  LET L = 6912 :  RANDOMIZE L
60  RESTORE 80 :  FOR f = 1 TO 2 * 9 :  READ b :  POKE p + f – 1 , b :  NEXT f
70  REM — artificial local scope variables definition —
80  DATA  CODE  ( “a” )  ,  CODE  ( “$” )  , x , z :  REM first variable: a$ -> data target
90  DATA 0 , 64 :  REM target address
100  DATA  PEEK 23670 ,  PEEK 23671 :  REM target length
110  DATA notlastmark
120  DATA  CODE  ( “b” )  ,  CODE  ( “$” )  , x , z :  REM second variable: b$ -> data origin
130  DATA 0 , 0 :  REM origin address
140  DATA  PEEK 23670 ,  PEEK 23671 :  REM origin length (must equal target length)
150  DATA lastmark :  REM ending mark of artificial local scope variables
160  REM — enable artificial local scope of variables —
170  RANDOMIZE p :  POKE 23563 ,  PEEK 23670 :  POKE 23564 ,  PEEK 23671
180  LET N = 60 * 120
190  BORDER 7 :  PAPER 7 :  INK 0 :  CLS 
200  REM — time measuring loop —
210  LET t0 =  PEEK 23672 + 256 *  PEEK 23673 + 65536 *  PEEK 23674
220  GOSUB 20
230  POKE 23563 , 0 :  POKE 23564 , 0
240  LET t1 =  PEEK 23672 + 256 *  PEEK 23673 + 65536 *  PEEK 23674
250  REM — end of time measuring loop
260  PRINT  AT 0 , 0 ; t0 ; ” frames0″ ‘ t1 ; ” frames 1″
270  PRINT  ( t1 – t0 )  * 0.020 ; ” total seconds”
280  PRINT  ( t1 – t0 )  * 0.020 / N ; ” secs per iteration”
290  REM — Time measurement of fast memory copies in the ZX —
300  REM — (c) Juan-Antonio Fernandez-Madrigal, april 2020 —
310  REM — http://jafma.net
320  REM — http://blog.jafma.net/2020/03/16/efficient-basic-coding-for-the-zx-spectrum-iv/

Nótese que hemos empleado el truco de RANDOMIZE para descomponer números enteros de 16 bits en sus dos partes de 8 bits, como se explicó en la entrega anterior de esta serie. Este programa ha sido ejecutado para distintos valores de L (el tamaño de los bloques a mover, o sea, los valores sL y sH explicados antes), obteniendo los siguientes resultados:

Como se ve, el comportamiento es prácticamente lineal en el número de bytes a copiar (eje horizontal), tardando, de media, 23 microsegundos por cada byte (la pendiente de la recta; el offset de 0.21 indica que nuestro bucle de medida añade siempre 210 milisegundos extra). Desde el punto de vista del BASIC, esto es rapidísimo, aunque desde el punto de vista de la CPU es casi 4 veces más lento que hacer un LDIR directamente en ensamblador, el cual llevaría 21 ciclos de reloj por byte, que con 3.5MHz de reloj serían 6 microsegundos.

También se puede modificar el programa para que no muestre datos aleatoriamente, sino siempre los mismos, y así hacer medidas más ajustadas del movimiento de memoria en sí. Haciendo eso hemos podido deducir que, de media, el cálculo aleatorio que tenía el programa original se llevaba unos 20 milisegundos.

El programa toma 7200 medidas de tiempo para calcular la media de todas ellas, por lo que el Teorema del Límite Central indica que los resultados expuestos arriba tienen una incertidumbre pequeña, del orden de 136 microsegundos si consideramos como fuente de incertidumbre original más importante los [0,20] milisegundos producidos por la discretización del tiempo en la variable del sistema FRAMES.

En cualquier caso, la conclusión es que este truco es muy potente para mover bloques de memoria en general, por ejemplo pantallas completas (6912 bytes; tardarían alrededor de un tercio de segundo en copiarse), aunque, en el caso de la pantalla del ZX Spectrum, no es muy útil ni para poner bloques gráficos en ella (“sprites”) ni para hacer scroll, pues la estructura del bitmap obliga a hacer saltos de dirección durante la copia en esos casos, algo imposible de esta forma (para hacer scroll de una línea de texto hacia arriba, bitmap y atributos, en cualquier momento se puede llamar directamente a la rutina de la ROM encargada de dicho trabajo: RANDOMIZE USR 3582).

Medir el tiempo de ejecución []

Estamos hablando mucho en esta serie de la eficiencia en tiempo, por lo que es conveniente referirnos con más detalle a la medida del mismo. El ZX no dispone de ninguna instrumentación en el intérprete para averiguar cuánto ha tardado una sentencia BASIC en ejecutar, por lo que sólo nos queda leer la cuenta de “frames” que, gracias a la ULA, se mantiene en la variable del sistema FRAMES. Esta cuenta comienza al arrancar el ordenador (al hacer un reset) y aumenta en 1 unidad por cada 20 milisegundos en los ZX europeos.

El problema que tiene acceder a esta cuenta para medir tiempos es que ese acceso no se puede hacer demasiado eficientemente en BASIC. La forma más rápida sería la expresión 65536 * PEEK 23674 + 256 * PEEK 23673 + PEEK 23672, la cual enlentece bastante de por sí al programa debido a las multiplicaciones y PEEKs, es decir, su complejidad es alta. Se puede acelerar algo si sabemos que los tiempos que debemos medir serán siempre menores de 65536 * 20 milisegundos (algo menos de 22 minutos). En este caso podemos hacer POKE con 0 en los tres bytes de la variable antes de comenzar a contar, y luego leer sólo los dos menos significativos. En caso de querer acelerar aún más tendríamos que asegurarnos de que los tiempos van a ser inferiores o iguales a 5 segundos; entonces podemos ignorar los dos bytes más significativos de la variable de igual forma, y quedarnos sólo con PEEK 23672.

A pesar de estas limitaciones, si se dispone de un buen emulador del ZX Spectrum que permita escribir en la ROM (algo imposible en un ZX real), se podría medir el tiempo que tarda una sección del código BASIC instrumentando dicha ROM y escribiendo un programa en ensamblador que extendiera su funcionalidad original.

Más concretamente, la idea básica sería intervenir la rutina de la ROM encargada de ejecutar sentencias individuales, llamada THE STATEMENT LOOP y ubicada en la dirección 0x1b28. Justo 5 bytes después, en la dirección 0x1b32, comienza la interpretación de la sentencia en curso, cuyo número de línea se encontrará en ese momento en la variable del sistema PPC y cuyo número de sentencia estará en SUBPPC. Por tanto, si escribiéramos en la ROM y sustituyéramos 3 bytes de la dirección 0x1b32 por una llamada a una rutina externa, llamada que ocupa exactamente 3 bytes de código máquina (C/M), y si esa rutina externa tomara nota del tiempo actual tras comprobar que el programa entra en o sale de la sección que queremos medir, sería posible almacenar dichas medidas para sacar luego estadísticas de la duración de su ejecución.

Esto precisamente es lo que hace la herramienta que se puede obtener aquí mismo. Su uso es un poco complicado, y no se recomienda para quien no tenga ya cierta experiencia en la programación del ZX Spectrum.

Se trata de un programa en C/M que ha de cargarse, después de cargar el programa BASIC a medir, en la dirección 60000 de memoria (debe haberse hecho un CLEAR previamente a alguna dirección inferior; este C/M respeta los UDGs). También debe habilitarse en el emulador la escritura sobre la ROM, como hemos dicho antes.

Una vez cargado este C/M, hay que: a) inicializar una tabla de datos para el mismo, así como crear algunas variables (todo esto se muestra en el ejemplo a continuación); b) definir el trozo de programa que se quiere medir, desde la primera sentencia a medir hasta la primera a no medir, mediante la creación de más variables y su almacenamiento en la tabla de datos anterior; c) almacenar en dicha tabla el número n de medidas de tiempo a tomar, que debe ser menor o igual a 1000, y activar el proceso de medida mediante RANDOMIZE USR 60000; d) dejar que se ejecute el programa BASIC mientras el medidor C/M almacena medidas de tiempo tomadas cada vez que se ejecuta la sección definida (si ésta se repite, se tomará más de una medida; si no, sólo una); e) cuando se desee dejar de medir, restaurar la ROM original haciendo RANDOMIZE USR 60038; f) interrumpit el programa BASIC y almacenar los datos obtenidos leyéndolos de memoria del emulador.

Un ejemplo de uso es el siguiente, en el que se mide lo que tarda la impresión en pantalla de la zona sombreada (línea 10, sentencia 2; las sentencias se numeran desde 1 en el ZX), que calcula el coseno de una variable entera:

1  CLEAR 59999 :  GOSUB 9999 :  REM load machine code and prepare table
2  LET firstline = 10 :  LET firststat = 2 :  LET endline = 10 :  LET endstat = 3 :  POKE postable + 6 ,  FN l ( n )  :  POKE postable + 7 ,  FN h ( n )  :  POKE postable ,  FN l ( firstline )  :  POKE postable + 1 ,  FN h ( firstline )  :  POKE postable + 2 , firststat :  POKE postable + 3 ,  FN l ( endline )  :  POKE postable + 4 ,  FN h ( endline )  :  POKE postable + 5 , endstat :  RANDOMIZE  USR 60000 :  REM activate
10  FOR f = 1 TO 1000 :  PRINT  AT 0 , 0 ;  COS  ( f )  :  NEXT f
20  RANDOMIZE  USR 60038 :  REM deactivate
30  STOP 
9999  LOAD “” CODE 60000 :  DEFFN h ( w )  =  INT  ( w / 256 )  :  DEFFN l ( w )  = w –  INT  ( w / 256 )  * 256 :  LET n = 1000 :  LET tam = n * 5 + 15 :  LET postable = 65536 –  ( 21 * 8 )  – tam :  POKE postable + 11 ,  FN l ( postable + 15 )  :  POKE postable + 12 ,  FN h ( postable + 15 )  :  POKE postable + 13 ,  FN l ( 60054 )  :  POKE postable + 14 ,  FN h ( 60054 )  :  POKE 23728 ,  FN l ( postable )  :  POKE 23729 ,  FN h ( postable )  :  RETURN 

La línea 9999 se encarga de cargar el C/M y crear la tabla de datos necesaria para éste (paso a). Nótese que para que funcione bien, este programa BASIC no debe tener funciones de usuario llamadas FN h o FN l. Los pasos b y c se implementan en la línea 2, el d es la línea 10, y el e la 20.

El ejemplo de arriba produce 1000 medidas de tiempo (las veces que se repite el bucle FOR, y el máximo que el programa en C/M es capaz de tomar), que se almacenan a partir de la dirección 60368 de la RAM, como se muestra en el visor de memoria del emulador fuse después de terminar el experimento:

Cada medida de tiempo son 5 bytes (la primera que se muestra en la figura de arriba es 03 00 00 60 04): los 3 primeros contienen un valor f igual a la diferencia de la variable del sistema FRAMES al principio y al final de la sección de código monitorizado; los 2 siguientes contienen un valor entero c de 16 bits que afina esa medida de tiempo. El tiempo total transcurrido, eliminando el propio tiempo incurrido por la rutina externa de C/M, puede calcularse a partir de f y c con la siguiente fórmula, que está diseñada específicamente para ella:

P_{ULA} - (35 \cdot c + 30) P_{CPU} + f \cdot P_{ULA} -(323+192)P_{CPU}

donde P_ULA es el período de refresco de la ULA (20 milisegundos en Europa) y P_CPU el período de reloj de la CPU (286 nanosegundos si ésta funciona a 3.5MHz). Esta fórmula no es del todo exacta porque el valor 30 de su segundo término puede estar en realidad en el intervalo [0,30] y porque puede ocurrir una interrupción de la ULA dentro de los 192 T-estados que aparecen en su último término, aumentándolos en lo que servir esa interrupción tarde (que dependerá del valor de la variable FRAMES y del estado del teclado, fundamentalmente). Sin embargo, la probabilidad de que ocurra dicha interrupción en ese período es de 192*P_CPU/0.02 = 0.0027, o sea, despreciable.

Los datos obtenidos de este ejemplo se han grabado en cinta desde el propio intérprete de BASIC (SAVE), se han extraído del fichero de cinta con un visor de datos binario para Linux (Bless), y se han interpretado con un script de Matlab (se podría haber hecho igualmente con cualquier otro lenguaje de scripting, como Python) según la fórmula anterior para obtener estadísticas de los mismos. Los resultados se muestran en la siguiente figura:

Como se puede observar, el ZX Spectrum tarda… ¡70 milisegundos en calcular y mostrar en pantalla el coseno de una variable entera! Le daría para imprimir solamente 14 en un segundo…

La influencia de la ULA []

En el ZX Spectrum, parte de la memoria RAM es accesible directamente por dos circuitos integrados distintos: la CPU y la ULA (Uncommited Logic Array). Este último es el análogo a la tarjeta gráfica de un ordenador actual, pero se ocupa de más cosas: sonido, cassette, teclado, etc., incluso de proporcionar la señal de reloj a la CPU.

La función más importante de la ULA es generar la señal de vídeo a partir de la memoria de imagen almacenada en las direcciones 16384 – 23295 de la RAM (los detalles se explican en la siguiente entrada de esta serie). Para ello, tiene que leer datos de esas direcciones; como la CPU también puede leer y escribir en ellas, y ambos pueden querer hacerlo a la vez, a veces se producen conflictos entre ellos que siempre se resuelven deteniendo a la CPU (de hecho, parándole la señal de reloj) durante el tiempo que la ULA espera necesitar para hacer sus cosas. Este retardo es de entre 1 y 6 ciclos de reloj, y se induce a la CPU de forma poco previsible, en general (para muchos más detalles, puede consultarse esto y esto).

Es más, no sólo se detiene la CPU cuando sucede el conflicto en zonas de pantalla: para simplificar la electrónica de la placa base, se hace esto cuando se accede a cualquier dirección dentro de los primeros 16K de RAM, es decir, en el rango 16384 – 32767. Así, sólo los 32K superiores de RAM (desde 32768 hasta 65535) y la ROM (desde 0 hasta 16383) están libres de conflicto.

También se inyectan retardos en la CPU al acceder a puertos de E/S, pero ésos son menos frecuentes e importantes para nosotros.

Si no tenemos conectado ningún dispositivo externo que inserte información extra sobre canales en la memoria RAM, el programa BASIC empezará en la dirección 23755. En caso contrario, la información de estos canales se almacena justo antes del programa, moviendo hacia arriba su comienzo. En cualquier caso, la variable del sistema PROG (en la dirección 23635, 2 bytes) contiene la dirección de comienzo en memoria de la primera línea de nuestro programa BASIC.

Esto significa que el programa BASIC comienza en memoria conflictiva. De hecho, si empieza en 23755, tendrá 9013 bytes en dicha memoria antes de alcanzar los 32K superiores. Es raro que podamos prescindir de esos 9K a la hora de escribir nuestro programa (9K es casi el 20% de la memoria RAM disponible para nosotros!), pero si quisiéramos hacerlo, bastaría con insertar una línea inicial con una sentencia REM que contuviera un comentario de longitud 9008 (caracteres), lo que haría que la siguiente línea ya estuviera en zona no conflictiva (si, además, queremos que la ejecución del resto de líneas no sufra por tener ésa antes, podemos cambiar el contenido de la variable del sistema PROG para que apunte a la dirección de la segunda línea; todo esto se detalla en la primera entrega de esta serie).

¿Cuál es el tiempo esperado de retardo que puede tener un programa BASIC a causa de los conflictos con la ULA? Bueno, aquí podemos tener en cuenta lo siguiente:

Los retardos fundamentales se producirán por las consultas del intérprete al programa que hay en memoria (leer sentencias para interpretarlas, buscar líneas, buscar variables, etc.), no tanto por los accesos del propio código BASIC a memoria conflictiva: el tiempo de acceso a esa memoria es mínimo comparado con el de interpretación y ejecución del programa.
Cada programa es diferente de cualquier otro, pero en general el intérprete siempre está haciendo lo mismo sin parar: leer la sentencia a ejecutar, interpretarla, ejecutarla y pasar a la siguiente. Por tanto, las diferencias en accesos a memoria de programa BASIC que haya entre programas distintos, y, en particular, para sentencias BASIC distintas, pueden ser pocas desde el punto de vista de los conflictos que pudieran producir con la ULA.

Asumiendo estas premisas, podemos estimar el tiempo incurrido en los conflictos ejecutando un pequeño programa primero en memoria conflictiva y luego en no conflictiva, y viendo las diferencias de tiempo. En la segunda entrega de la serie ya hicimos eso, pero no proporcionamos el código del programa movido a zonas no conflictivas. Repetimos aquí primero el programa original: es un bucle que sirve para medir el tiempo esperado (líneas 10 y 30) en hacer una operación muy sencilla de decremento de variable entera (línea 20):

10  POKE 16384 , 255 :  POKE 23672 , 0 :  POKE 23673 , 0 :  POKE 23674 , 0
20  POKE 16384 ,  PEEK 16384 – 1 :  IF  PEEK 16384 <> 0 THEN  GOTO 20
30  LET t =  PEEK 23672 + 256 *  PEEK 23673 + 65536 *  PEEK 23674 :  PRINT “time: “ ;  ( t * 0.02 / 255 )  ; ” secs”

La versión que desplaza el bucle principal a una zona no conflictiva es la siguiente:

1  REM The next REMs create 9013 bytes before the main program (line 15), thus placing it at 32768 if this is executed after a hard reset in the ZX, and hence using the non-contended memory for the main loop.1234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123
2  REM 0123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123
3  REM 0123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123
4  REM 0123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123
5  REM 0123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123
6  REM 0123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123
7  REM 0123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123
8  REM 0123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123
9  REM 01234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678
10  POKE 23635 , 0 :  POKE 23636 , 128
15  POKE 65535 , 255 :  POKE 23672 , 0 :  POKE 23673 , 0 :  POKE 23674 , 0
20  POKE 65535 ,  PEEK 65535 – 1 :  IF  PEEK 65535 <> 0 THEN  GOTO 20
30  LET t =  PEEK 23672 + 256 *  PEEK 23673 + 65536 *  PEEK 23674 :  PRINT “time: “ ;  ( t * 0.02 / 255 )  ; ” secs”
40  STOP 

Nótese que no sólo el bucle principal se pone en zona no conflictiva, sino también se accede desde él a una zona de memoria del mismo área. Asimismo, en lugar de usar un sólo REM para desplazar el bucle, hemos puesto 9 para que sea algo más fácil de visualizar el código.

Tras ejecutar estos programas, el tiempo por iteración del bucle del programa original es de 10.901961 milisegundos, mientras que el del programa modificado es de 10.823529 milisegundos. La diferencia es de 78.432 microsegundos para una iteración, o, lo que es lo mismo, un 0.72% del tiempo de iteración original. Para afinar más se puede considerar que una iteración del bucle principal consiste en: 1) decrementar el valor de un byte en memoria, 2) comprobar que no sea igual a 0 y 3) saltar al principio del bucle para repetir en caso de que no sea igual a 0. Si simplificamos la cuestión suponiendo que esas 3 sentencias van a tomarle el mismo tiempo al intérprete, cada una llevaría 3.633987 milisegundos, por lo que el retardo por iteración supondría un 2.16% del de cada sentencia.

Siendo impreciso, este valor nos puede servir de referencia para estimar cuánto podemos esperar que se enlentezca una sentencia de programa a causa de los conflictos con la ULA. Como se ve, es escaso, debido fundamentalmente a que el tiempo gastado por el intérprete haciendo su trabajo es mucho mayor que el utilizado para leer de memoria el programa BASIC.

Finalmente, nótese que en programas largos es de lo más probable que las variables sí estén en memoria no conflictiva (puesto que se almacenan tras el programa), por lo que no sufrirán de este problema.

.oOo.

[Click here to read this in Spanish ]

This is the fourth in a series of posts that explain the foundations of the (in)efficiency of pure BASIC programs written for the ZX Spectrum:
I. On line numbers
II. On variables
III. On expressions
IV. Some statements and time measurements
V. Screen operations based on characters

In this post we talk about the efficiency of some important functionality in the ZX Spectrum BASIC and also about time measurements. The screen operations based on characters are dealt with in the next post.

To navigate through this post more easily, these are the sections it contains:

Drawing. Execution times of PLOT, DRAW, CIRCLE, POINT.
Reading the keyboard. Execution times of INKEY$, PEEK (LAST_K) and IN (254).
Creating UDGs. How to hide the execution time of defining UDGs.
Copying memory blocks. The DEFADD trick for fast memory copies.
Measuring time. Of any statement or part of a BASIC program.
Influence of the ULA on a BASIC program. Contended memory produces delays that have a (slight) influence on BASIC.

Drawing

The drawing statements (PLOT, DRAW, CIRCLE) are specially sensitive regarding their execution speed, as anyone that has programmed sometime in the ZX knows well. When DRAW is used to draw arcs, and in all uses of CIRCLE, the interpreter does repeated floating point calculations, which makes them very slow in spite of having optimized ROM implementations. PLOT is not inefficient in principle (particularly if its arguments are integer expressions and do not use real numbers), but to draw something interesting only with PLOT you need a lot of plots!

Drawing straight lines with DRAW uses a version of the Bresenham’s algorithm which is quite efficient (it does not use the ROM calculator), but, like other drawing routines, it must set appropriately the color attributes of the screen cells which the line goes through, besides executing the INVERSE or OVER modes correctly. That reduces its efficiency.

With the next program you can estimate the time taken by PLOT, POINT and DRAW when they are called with integer numeric literals as parameters (and without special modes or colours):

10  LET n = 10000
11  POKE 23672 , 0 :  POKE 23673 , 0 :  POKE 23674 , 0
15  FOR f = 1 TO n :  PLOT 100 , 100 :  NEXT f
25  LET T =  PEEK 23672 + 256 *  PEEK 23673 + 65536 *  PEEK 23674 :  PRINT  AT 0 , 0 ; “PLOT: “ ;  ( T * 0.02 / n ) 
50  LET n = 10000 :  LET k = 0
61  POKE 23672 , 0 :  POKE 23673 , 0 :  POKE 23674 , 0
75  FOR f = 1 TO n :  LET k =  POINT  ( 100 , 100 )  :  NEXT f
85  LET T =  PEEK 23672 + 256 *  PEEK 23673 + 65536 *  PEEK 23674 :  PRINT  AT 1 , 0 ; “POINT: “ ;  ( T * 0.02 / n ) 
100  LET n = 1000 :  PLOT 0 , 80
110  POKE 23672 , 0 :  POKE 23673 , 0 :  POKE 23674 , 0
150  FOR f = 1 TO n :  DRAW 255 , 0 :  DRAW  – 255 , 0 :  NEXT f
160  LET T =  PEEK 23672 + 256 *  PEEK 23673 + 65536 *  PEEK 23674 :  PRINT  AT 2 , 0 ; “HORIZ-DRAW: “ ;  ( T * 0.02 /  ( 2 * n )  ) 
200  LET n = 1000 :  PLOT 0 , 0
210  POKE 23672 , 0 :  POKE 23673 , 0 :  POKE 23674 , 0
250  FOR f = 1 TO n :  DRAW 0 , 175 :  DRAW 0 ,  – 175 :  NEXT f
260  LET T =  PEEK 23672 + 256 *  PEEK 23673 + 65536 *  PEEK 23674 :  PRINT  AT 3 , 0 ; “VERT-DRAW: “ ;  ( T * 0.02 /  ( 2 * n )  ) 

The average times measured by the program for each individual call to one of these statements are: PLOT, 7.274 milliseconds; POINT, 9.208 milliseconds; DRAW (horizontal line with 255 pixels), 55.16 milliseconds; DRAW (vertical line with 175 pixels), 26.89 milliseconds. Maybe surprisingly, drawing vertical lines is much faster as drawing horizontal ones (twice faster, in spite of the former having almost a 70% of the length of the latter), and PLOT is faster than POINT.

As for CIRCLE, we can modify the previous program to repeat that statement a number of times with different radii. If we repeat it 100 times, we get the following average times (per circle):

The figure shows how the ROM CIRCLE routine divides the circumference in several arcs that are drawn with straight lines, which produces the “steps” in the red data points (it uses different number of divisions of the circumference for different ranges of values of the radius; for instance, it uses the same number of steps for radii in the approximate range [40,50]).

You can see that, discarding the small effect of the quadratic term in the equation, drawing circles with integer parameters is approximately proportional to their radii (or circumferences); particularly, it takes around 15 milliseconds more for each increment of the radius of the circle. Notice that all the times in the figure are quite high: even very small circles (radius = 1 pixel) can only be drawn at a pace of about 3 per second…

In summary:

Drawing statements (and bitmap consulting statements, like POINT) should be used sporadically, with integer arguments, in parts of the code that do not have special speed requirements, and preferably for straight drawings.

Reading the keyboard []

We have 3 ways of reading the keyboard in BASIC: INKEY$, PEEK 23560 (system variable LAST_K), and IN from any port which low byte is 254. To decide which one is the fastest, you can use the following program, changing the way of reading the keyboard by the one you wish (its complete execution takes around 10 minutes):

10  POKE 23672 , 0 :  POKE 23673 , 0 :  POKE 23674 , 0 :  FOR f = 1 TO 100000 :  LET c =  PEEK 23560 :  NEXT f :  LET T =  PEEK 23672 + 256 *  PEEK 23673 + 65536 *  PEEK 23674 :  PRINT “PEEK 23560: “ ; T ; ”  “ ; T * 0.02 ; ”  “ ;  ( T * 0.02 / 100000 ) 

The last number shown on the screen after finishing is the time in seconds taken by each one of the 100000 iterations of the loop (considering an European ZX), which includes the keyboard reading but also the rest of the program statements (assignment, loop). However, you can compare the results of the program with different ways of reading the keybard without changing anything else, and then the differences between those results will be due only to the keyboard reading. In particular, they indicate that, on average, INKEY$ is 367.2 microseconds slower than the other two, which are basically indistinguishable with the precision of this program (the time measurements have around a 95% of probability of be in +/-36.5 microseconds of the average).

Creating UDGs []

Creating UDGs means to fill a certain area of memory with the bitmaps of those UDGs. If this is done by reading DATA values and POKE-ing them, it is quite slow, due to the loop itself (consulting the FOR variable, jumping in NEXT…), reading the DATA (locating the line where the DATA is, recall the first post in this series), and maintaining the pointer to memory that holds the address where to POKE the next byte.

It is not easy to improve this (if you load the UDGs from tape, the loading time is not much better). However, there is a technique that can hide that time for the user, making him/her to be mentally active in the meanwhile instead of doing a simple wait till completion:

In each iteration of the loop that is defining the UDGs, you can show in the screen part of information needed to use your program, or any element that distracts the attention of the user. It is amazing how this simple strategy can make the wait more pleasant and even dissapear in the conscience of the viewer! It is also quite simple to implement; you even can use the same DATA statements that hold the UDG bitmaps to hold the information to display, thus READing it on the fly.

There are other possibilities, such as including the UDG data at the end of a line of the BASIC program, either as a REM comment or in a hidden excess of the line length, and set the system variable UDG to point there. However, they requiere a careful and tedious manipulation of the BASIC program in its binary form, and they involve, at the end, the same loading time as though the data is loaded separately from tape.

Maybe the less obfuscated way of saving space when storing the UDG definition data, without loading them with LOAD "" CODE because that takes too long if there are few UDGs, is, instead of placing their bytes into DATA statements, put them into a text string (one character per UDG byte), that can be scanned by the program in order to pass the bytes to the final UDG memory address. Actually, this can be useful to store any set of integer numeric constants (that fit into a byte) saving a lot of space with respect to DATA. The only restriction is that the decimal value 34 (double quotes) cannot be in the sequence unless another 34 is right after it, since only one value like that would mark the end of the string.

In this way, instead of the next program:

10  CLS  :  GOSUB 9000
20  PRINT  AT 0 , 0 ; “\(paper 0)\(ink 7)\(bright 1)Hello, world! \(paper 0)\(ink 2)\udg(A)\(paper 2)\(ink 6)\udg(A)\(paper 6)\(ink 4)\udg(A)\(paper 4)\(ink 5)\udg(A)\(paper 5)\(ink 0)\udg(A)\(paper 0)” :  PAUSE 0 :  STOP 
9000  REM — create udgs 
9010  RESTORE 9020 :  FOR f = 1 TO 8 :  READ b :  POKE  USR “a” + f – 1 , b :  NEXT f :  RETURN 
9020  DATA 1 , 3 , 7 , 15 , 31 , 63 , 127 , 255 :  REM “A” -> solid slash

you can write this one:

10  CLS  :  GOSUB 9000
20  PRINT  AT 0 , 0 ; “\(paper 0)\(ink 7)\(bright 1)Hello, world! \(paper 0)\(ink 2)\udg(A)\(paper 2)\(ink 6)\udg(A)\(paper 6)\(ink 4)\udg(A)\(paper 4)\(ink 5)\udg(A)\(paper 5)\(ink 0)\udg(A)\(paper 0)” :  PAUSE 0 :  STOP 
9000  REM — create udgs 
9010  FOR f = 1 TO 8 :  POKE  USR “a” + f – 1 ,  CODE  ( “\[0103070f1f3f7fff]” ( f )  )  :  NEXT f :  RETURN 

which produces the same result:

This can only be done (confortably) with a BASIC source code editor that admits escaped characters into strings (the original ROM editor does not). The synthesizer of ZX-Basicus allows for that using the special syntax shown in the previous examples.

Copying memory blocks []

Sometimes, a BASIC program needs to copy certain amount of data from a memory address to another (for example, to or from the screen). In BASIC, that can only be done through a FOR loop, which is usually desperately slow.

There is a trick to do that much faster, by leveraging the way the BASIC interpreter of the ROM deals with user functions, called the DEF FN or DEFADD trick. Actually, the synthesizer tool included in ZX-Basicus can generate code automatically for implementing this trick in existing BASIC programs (option --defadd). In the following we explain the trick in more detail.

We already saw in the second post of this series that user defined functions (DEF FN) produce the only thing close to a local scope of variables in a BASIC program written for the ZX Spectrum. This happens because, in the same place that the DEF FN is in the program memory, and alongside each of its input parameters, the interpreter reserves some space for copying there the actual parameters during each function call (FN). Those placeholders form the local scope, and, in practice, an area of variables different and isolated from the global, common one that exists in memory after the end of the program.

Since reserving space in memory in these placeholders when copying entire string parameters would be impractical (strings may have very diverse lengths and it would be certainly slow to copy them at each FN call), the interpreter stores there just pointers to the strings, and their lengths. The pointer data type does not exists in Sinclair BASIC except here.

It is worth to mention that the interpreter copies strings in a very fast way, using the assembly instruction LDIR. These fast copies plus the existence of pointers is what allows us to cheat the ROM to do fast memory block movements.

In order to do that, we need to store at some memory location a fake DEF FN statement that has as input parameters two string variables. Once that this is done, we will cheat the interpreter by indicating that it is evaluating a user function FN that corresponds to that DEF FN. That is not true, but from that point on, any reference in the program to one of the fake variables will use the pointers instead of consulting the global variable area of the program, and therefore any assignment of one of the variables to the other one will do a fast copy of the former pointed data to the latter in memory. For instance, what happens if the latter points to the screen? Aha! You can now move large blocks of memory to the screen very fast (well, not as fast as LDIR due to the rest of processes involved in the interpreter, but close enough).

The format in memory of the fake DEF FN must be this one (the numbers are in decimal):

Each cell in that figure is a byte, and the set of 9 cells must be repeated for all the fake local variables (pointers) that you need. In particular, “v” must be the lowercase letter of the variable (you must use a different letter for each variable); aL and aH must define, in little endian order, the starting address of the data of that fake variable; sL and SH must be its length in bytes; and m is a marker that must be the ASCII code of comma (“,”) if that is not the last fake variable in the DEF FN or a closing parenthesis (“)”) otherwise.

Once you have filled some memory place with the data defined above, starting at, let say, address P, those variables can be used by the program if we cheat the interpreter by indicating it is evaluating a user function. For doing that, we just have to store in the system variable DEFADD the address P. Any reference to a fake variable name will use the memory block pointed by it from that moment on, while references in the program to variables not defined at P will lead the interpreter to consult the set of global, common variables. For finishing the use of the fake variables, you need to reset the value of DEFADD to 0. By the way, the system variable DEFADD is located at memory address 23563 and it has 2 bytes that store a value in little endian.

To illustrate the power of this trick, we have prepared a BASIC program that measures the execution time of this kind of memory copies. We have used it with diverse memory block lengths to move them to the screen (the content of the blocks is not interesting):

10  CLEAR 49999 :  GOTO 30 :  REM memory-copy is done in statement 20:2
20  FOR f = 1 TO N :  LET a$ = b$ :  POKE 50014 ,  INT  (  RND  * 30 )  :  NEXT f :  RETURN 
30  POKE 23672 , 0 :  POKE 23673 , 0 :  POKE 23674 , 0
40  LET x = 14 :  LET z = 0 :  LET notlastmark = 44 :  LET lastmark = 41
50  LET p = 50000 :  LET L = 6912 :  RANDOMIZE L
60  RESTORE 80 :  FOR f = 1 TO 2 * 9 :  READ b :  POKE p + f – 1 , b :  NEXT f
70  REM — artificial local scope variables definition —
80  DATA  CODE  ( “a” )  ,  CODE  ( “$” )  , x , z :  REM first variable: a$ -> data target
90  DATA 0 , 64 :  REM target address
100  DATA  PEEK 23670 ,  PEEK 23671 :  REM target length
110  DATA notlastmark
120  DATA  CODE  ( “b” )  ,  CODE  ( “$” )  , x , z :  REM second variable: b$ -> data origin
130  DATA 0 , 0 :  REM origin address
140  DATA  PEEK 23670 ,  PEEK 23671 :  REM origin length (must equal target length)
150  DATA lastmark :  REM ending mark of artificial local scope variables
160  REM — enable artificial local scope of variables —
170  RANDOMIZE p :  POKE 23563 ,  PEEK 23670 :  POKE 23564 ,  PEEK 23671
180  LET N = 60 * 120
190  BORDER 7 :  PAPER 7 :  INK 0 :  CLS 
200  REM — time measuring loop —
210  LET t0 =  PEEK 23672 + 256 *  PEEK 23673 + 65536 *  PEEK 23674
220  GOSUB 20
230  POKE 23563 , 0 :  POKE 23564 , 0
240  LET t1 =  PEEK 23672 + 256 *  PEEK 23673 + 65536 *  PEEK 23674
250  REM — end of time measuring loop
260  PRINT  AT 0 , 0 ; t0 ; ” frames0″ ‘ t1 ; ” frames 1″
270  PRINT  ( t1 – t0 )  * 0.020 ; ” total seconds”
280  PRINT  ( t1 – t0 )  * 0.020 / N ; ” secs per iteration”
290  REM — Time measurement of fast memory copies in the ZX —
300  REM — (c) Juan-Antonio Fernandez-Madrigal, april 2020 —
310  REM — http://jafma.net
320  REM — http://blog.jafma.net/2020/03/16/efficient-basic-coding-for-the-zx-spectrum-iv/

Notice that we use the RANDOMIZE trick described in the previous post in this series to do the decomposition of a 16-bit number into its two 8-bits parts. After executing the program for different values of L (the size of the memory blocks to copy, i.e., the values sL and sH), we got these results:

As shown, the behaviour is very close to linear in the length of the blocks that are moved (horizontal axis), taking 23 microseconds per byte on average (the slope of the line; the intercept of 0.21 indicates that our measuring loop injects around 210 extra milliseconds to that time). From the point of view of the ZX Spectrum BASIC, this is very very fast, although from the perspective of assembly programming it is almost 4 times slower than a direct LDIR, which would take 21 clock cycles per byte, i.e., 6 microseconds with a 3.5MHz CPU.

The program above can be modified to not do the random calculation in the measuring loop in order to get finer measurements. By doing that, we can deduce that on average, the random calculation takes around 20 milliseconds.

Since the program does 7200 measurements to get the final average time, the Central Limit Theorem suggests that the results in the figure above have a small amount of uncertainty, of around 136 microseconds if we consider as the main source of original uncertainty the [0,20] milliseconds produced by the time discretization of the FRAMES system variable.

Anyway, the conclusion is that this trick is very powerful for moving memory blocks in general (for instance, entire screens), but not so much for putting graphic blocks on the screen or doing scrolls, due to the bitmap layout in the ZX, that requires to make gaps within the copy procedure, something imposible with the trick (to do scroll one line of text upwards, bitmap and attributes, you can use at any time the ROM routine that does precisely that: RANDOMIZE USR 3582).

Measuring time []

Besides making programs time-efficient, often you need to take notice of the time spent in some part of your code. The ROM interpreter is not instrumented to measure the time spent in any BASIC statement, but we have the system variable FRAMES, that, thanks to the ULA, keeps a count of the 20 millisecs ticks (16.7 millisecs in non-european countries) that have passed since the computer was reset the last time.

The problem with reading this variable is that it is a long number formed by 3 bytes. Getting the whole value is slow even using the most direct expression for it: 65536 * PEEK 23674 + 256 * PEEK 23673 + PEEK 23672. There are a bunch of multiplications, sums and memory readings there. Fortunately, it can be simplified if we know that the times to measure are always shorter than 65536 * 20 millisecs (around 22 minutes). In that case, we can POKE a 0 in the three bytes of the variable before starting to count, and then reading only the 2 least significan bytes. Moreover, we can use just the least significant byte if all times to measure will be shorter than 5 seconds: the previous expression becomes then PEEK 23672.

In spite of these limitations, we could think of measuring the time spent in a section of a BASIC program better if we use a good emulator that allows us to do writings at ROM addresses (something imposible with a real ZX Spectrum!). That ROM could be instrumented and extended with some assembly code to gather the measurements.

More concretely, the basic idea would be to change the original ROM routine that is in charge of interpreting and executing individual BASIC statements, called THE STATEMENT LOOP and placed at 0x1b28 within the ROM. Right 5 bytes after that, at 0x1b32, the code that interprets the current statement starts. When that happens, the line number of that statement is in the system variable PPC, and the statement number in SUBPPC. Consequently, if we could write at that ROM address in order to substitute 3 bytes at 0x1b32 by a call to an external machine code (M/C) routine (such a call are exactly 3 bytes) which timestamps the current time when the BASIC program is entering or leaving the section to measure, it would be possible to store those measurements in order to do an offline statistical analysis.

This is precisely what the tool that you can download here does. Notice that this tool is not for beginners; it is only recommended if you have experience in ZX Spectrum programming.

The tool is a M/C program that must be loaded in memory after the BASIC program that we wish to measure is loaded. The M/C must be placed concretely at address 60000 (CLEAR must has been done previously to a smaller address; the M/C program respects the UDGs); in the emulator, you must enable ROM writings, as we have commented.

Once the M/C program is loaded, you have to: a) initialize a data table for it, and create some variables (all of that is shown in the example below); b) define the section of BASIC code you wish to measure, from the first statement to be measured to the first statement NOT to be measured, and store in the data table that information; c) store in the data table the number n of measurements to accumulate on that section, that must be smaller than or equal to 1000, and activate the measurement procedure through RANDOMIZE USR 60000; d) leave the BASIC program to run while the monitoring M/C program gathers one measurement each time the monitored section executes (if the section executes more than once, several measurements are gathered); e) when you wish to stop measuring, restore the original ROM with RANDOMIZE USR 60038; f) stop the BASIC program and gather the measurements for offline analysis, for instance in a tape file.

A case use is the following, where we measure how long takes to calculate and print in the screen the cosine of an integer variable (line 10, statement 2 in the code; statements are numbered from 1 in the ZX Spectrum):

1  CLEAR 59999 :  GOSUB 9999 :  REM load machine code and prepare table
2  LET firstline = 10 :  LET firststat = 2 :  LET endline = 10 :  LET endstat = 3 :  POKE postable + 6 ,  FN l ( n )  :  POKE postable + 7 ,  FN h ( n )  :  POKE postable ,  FN l ( firstline )  :  POKE postable + 1 ,  FN h ( firstline )  :  POKE postable + 2 , firststat :  POKE postable + 3 ,  FN l ( endline )  :  POKE postable + 4 ,  FN h ( endline )  :  POKE postable + 5 , endstat :  RANDOMIZE  USR 60000 :  REM activate
10  FOR f = 1 TO 1000 :  PRINT  AT 0 , 0 ;  COS  ( f )  :  NEXT f
20  RANDOMIZE  USR 60038 :  REM deactivate
30  STOP 
9999  LOAD “” CODE 60000 :  DEFFN h ( w )  =  INT  ( w / 256 )  :  DEFFN l ( w )  = w –  INT  ( w / 256 )  * 256 :  LET n = 1000 :  LET tam = n * 5 + 15 :  LET postable = 65536 –  ( 21 * 8 )  – tam :  POKE postable + 11 ,  FN l ( postable + 15 )  :  POKE postable + 12 ,  FN h ( postable + 15 )  :  POKE postable + 13 ,  FN l ( 60054 )  :  POKE postable + 14 ,  FN h ( 60054 )  :  POKE 23728 ,  FN l ( postable )  :  POKE 23729 ,  FN h ( postable )  :  RETURN 

Line 9999 loads the M/C program and fill the data table for it (step a). Notice that in this example, the BASIC program cannot use user functions FN h or FN l. Steps b and c are implemented in line 2, d in line 10, and e in line 20.

The above example produces 1000 measurements (the number of times that the FOR loop iterates, and the maximum to gather by the M/C program), that are stored from address 60368 up, as it is shown in the memory browser of the fuse emulator after finishing the experiment:

Each measurement consists of 5 bytes (the first one shown above is 03 00 00 60 04): the first 3 bytes are an integer value f that equals the increment of the FRAMES system variable since the starting of the monitored section to its end; the next 2 bytes contain an integer value c that refines that measurement. The total time spent in the monitored BASIC section, once the very time of the M/C routine is discarded, can be calculated from both values using the following formula, designed for the particular implementation of the M/C routine:

P_{ULA} - (35 \cdot c + 30) P_{CPU} + f \cdot P_{ULA} -(323+192)P_{CPU}

where P_ULA is the refresh period of the ULA (20 milliseconds in Europe) and P_CPU the clock period of the CPU (286 nanoseconds if it is a 48K@3.5MHz). This formula is not entirely exact since the value 30 in the second term can actually be within the interval [0,30] and because one interruption of the ULA may occur within the 192 T-states that are in the last term, increasing them by what the ULA service routine takes (which depens on the value of the FRAMES system variable and the state of the keyboard, essentially). However, the probability of that interrupt to occur in such a short interval is 192*P_CPU/0.02 = 0.0027, which is negligible.

The data gathered from our example have been saved in a tape file (SAVE), have been extracted from that file with a binary viewer in Linux (Bless), and have been interpreted with a Matlab script (you could use any other scripting language, such as Python) according to the formula above. The statistical results are as follows:

As you can see, the BASIC interpreter of the ROM takes… 70 milliseconds to calculate and print the cosine of an integer variable! It could only print 14 of these variables in one second…

The influence of the ULA []

In the ZX Spectrum, part of the RAM is accessible directly by two circuitries: the CPU and the ULA (Uncommited Logic Array). The latter is like a graphic card of a modern computer, except that it takes care of many more tasks: sound, cassette, keyboard, etc., even of generating the clock signal for the CPU.

The most important function of the ULA is to generate the video signal from the bitmap and attribute map memories stored in addresses 16384 – 23295 of the RAM (the details are explained in the next post of this series). For that, it has to read bytes from these addresses; since the CPU can also read and write on them, and both might wish to do it at the same time, there are occasions when access conflicts appear. All of them are are always solved by halting the CPU (actually, its clock signal is stalled) during the time that the ULA expects to need for doing its job. This delay ranges from 1 to 6 CPU clock cycles, and it is induced on the CPU work in a quite unpredictable fashion in general (for many more details, you can consult this and this).

Furthermore, the CPU is not only stalled when the conflict occurs in screen memory: to simplify the ZX circuitry, this actualy happens when both access any address within the first 16K of RAM, i.e., in the area 16384 – 32767. Thus, only the higher 32K of RAM (32768 to 65535) and the ROM (0 to 16383) are free of conflict.

This kind of delays are also injected in the CPU when accessing I/O ports, but these are less frequent and important for us.

If there is no external device connected to the ZX that inserts information about channels in the RAM, the BASIC program will start at address 23755. Otherwise, the information of those channels is stored right before the program, displacing it to a higher zone. In any case, the system variable PROG (located at 23635, with 2 bytes) contains the start address in memory of the BASIC program, concretely of its first line.

This means that the BASIC program starts at contended memory. Actually, if it is at 23755, it will have 9013 bytes in that memory before it reaches the higher 32K. It is uncommon that we can lose those 9K in our program (9K is almost 20% of the RAM available or it!), but if we wish to, the only thing to do is to insert a first line in the program that contains a REM statement with a comment of length 9008 (characters), which puts the next line in non-contended memory (if, in addition, we wish that the execution of the rest of lines does not suffer from not being at the beginning, we can change the content of the system variable PROG and make it point to the address of the second line; all of this is detailed in the first post of this series).

Now, how long takes the contended delays in a BASIC program? Well, we can consider these:

The fundamental delays wil be due to the readings of program memory done by the interpreter (get statements to interpret them, searching for lines or variables, etc.), not so much to the accesses of the very program to contended memory: that access time to memory is negligible with respect to the time taken for interpreting and executing the BASIC statements.
Every program is different from any other, but in general the interpreter is always doing the same thing: reading the statement to be executed, interpreting it, and jumping to the next one. Therefore, differences in accesses to the program memory are expected to be small as well, even among different programs, from the point of view of the contention.

Assuming these premises, we can estimate the time caused by contention by executin first a program in contended memory and then the same one displaced to non-contended one, and examining the time differences in their executions. In the second post of this series we did that already, but did not provide the code of the second program. The original one was a loop whose execution time was measured; the loop was doing a simple task of decreasing a byte value stored in memory. Its code is repeated here for your convenience:

10  POKE 16384 , 255 :  POKE 23672 , 0 :  POKE 23673 , 0 :  POKE 23674 , 0
20  POKE 16384 ,  PEEK 16384 – 1 :  IF  PEEK 16384 <> 0 THEN  GOTO 20
30  LET t =  PEEK 23672 + 256 *  PEEK 23673 + 65536 *  PEEK 23674 :  PRINT “time: “ ;  ( t * 0.02 / 255 )  ; ” secs”

The version with the loop displaced to non-contended memory is this:

1  REM The next REMs create 9013 bytes before the main program (line 15), thus placing it at 32768 if this is executed after a hard reset in the ZX, and hence using the non-contended memory for the main loop.1234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123
2  REM 0123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123
3  REM 0123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123
4  REM 0123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123
5  REM 0123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123
6  REM 0123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123
7  REM 0123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123
8  REM 0123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123
9  REM 01234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678
10  POKE 23635 , 0 :  POKE 23636 , 128
15  POKE 65535 , 255 :  POKE 23672 , 0 :  POKE 23673 , 0 :  POKE 23674 , 0
20  POKE 65535 ,  PEEK 65535 – 1 :  IF  PEEK 65535 <> 0 THEN  GOTO 20
30  LET t =  PEEK 23672 + 256 *  PEEK 23673 + 65536 *  PEEK 23674 :  PRINT “time: “ ;  ( t * 0.02 / 255 )  ; ” secs”
40  STOP 

Notice that we do not only put the main loop in non-contended memory, but we also do accesses to non-contended RAM in the loop. Also, instead of just one REM to do the displacement, we use 9 for the sake of clarity in the listing.

After running these programs, the time spent, on average, in one iteration of the original loop is 10.901961 milliseconds, while the one of the non-contended version is 10.823529 milliseconds. The difference is 78.432 microseconds per iteration, or, in other terms, 0.72% of the iteration time in the original program. To be more detailed we can consider that one iteration of the loop consists in : 1) decrementing a byte in memory, 2) checking whether that byte is now 0, and 3) jumping to the beginning of the loop for a new iteration if the byte is not 0. We can simplify the issue by assuming the three statements take the same time, which leads to 3.633987 milliseconds per statement; in that case, the delay per iteration is 2.16% of the time taken by interpreting and executing one statement.

As imprecise as this figure may be, the value can serve as a reference to estimate how long we can expect to be delayed any statement in a BASIC program that is at contended memory. As you can see, that time is pretty short, essentially due to the fact that the time taken by the interpreter to do its internal tasks is much longer than the one employed in reading the BASIC program memory.

Finally, notice that in long programs is more than likely that the variables are in non-contended memory (since they are stored after the BASIC code), thus they will not suffer of contention.