My new emulator

Thommy
Posts: 52
Joined: Wed Sep 28, 2011 6:37 pm

Re: My new emulator

Post by Thommy »

Actually, on further consideration I think your argument is more persuasive. I'm going to move to an explicit bus that resolves my ongoing problem with the extent to which my emulated Z80 should be able to self-manage and ensures I put in the suitable high-impedance states as soon as possible.
Thommy
Posts: 52
Joined: Wed Sep 28, 2011 6:37 pm

Re: My new emulator

Post by Thommy »

Update on this: the decisions involved in switching to a completely generic concept of a bus are a lot tougher than I anticipated from a software engineering perspective. It's mostly real-world performance stuff.

Primary concerns are that, defining a message to be any change in what a component is loading onto the bus, you're in the low 10s of millions of messages a second, which means that you need to deal with each propagation in the low 100s of cycles.
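
To put a rough number on that budget, here is a minimal sketch in C. The 3 GHz host clock is the approximate figure quoted later in the thread; the 20 million messages a second is only an assumed stand-in for "low 10s of millions", and `cycles_per_message` is a hypothetical helper name.

```c
#include <assert.h>
#include <stdint.h>

/* Host cycles available per bus message, given a host clock rate and a
   message rate, both in hertz. The inputs are assumptions, not measurements. */
static uint64_t cycles_per_message(uint64_t host_hz, uint64_t messages_per_second)
{
    assert(messages_per_second > 0);
    return host_hz / messages_per_second;
}
```

With those assumed inputs, `cycles_per_message(3000000000ULL, 20000000ULL)` gives 150 host cycles per message if a whole core is spent on nothing but the bus.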

Having been through several designs, at the minute I have a tree structure with messages propagating instantly (in emulated time) and being restrained to the relevant subtree where possible. The problem is that the test for deciding whether a message should go to a particular component and of a component processing a message is so trivial that the overhead of running the tree — the stuff of function calls and copying relevant things to and from the call stack — is actually far and away the majority of the processing cost. So I think I'm losing out in real-world terms. I'm aware of smarter runtime ways to process a tree, but obviously I'm trying to come to the optimum algorithmic arrangement before I really go all-out on implementation.

I'm considering whether it is sufficiently accurate for all communications to propagate only with the clock line, which would instantly reduce me to a fixed 6.5m messages/second and suggest simpler runtime processing. Though then you're possibly talking about making timings not just accurate to half a cycle but requiring that all changes take half a cycle, which probably won't do at all.

For the record, with generalised messages and a tree structure, my ZX80 emulation now costs something like 85% of a core to run, which is more than a twofold increase over the old approach and, in my opinion, indicative that I need to find another solution.
sirmorris
Posts: 2811
Joined: Thu May 08, 2008 5:45 pm

Re: My new emulator

Post by sirmorris »

What would the z80 do? It's not bothered about all the stuff that happens between clock edge transitions, is it? If you look at the timing diagrams and quantise all the edges the behaviour is still the same. As long as the RAM can ensure its outputs have settled by the time the z80 latches the data who cares that the data was actually settled 37ns ago??

Just saying.

;)
C
Thommy
Posts: 52
Joined: Wed Sep 28, 2011 6:37 pm

Re: My new emulator

Post by Thommy »

sirmorris wrote:What would the z80 do? It's not bothered about all the stuff that happens between clock edge transitions, is it? If you look at the timing diagrams and quantise all the edges the behaviour is still the same. As long as the RAM can ensure its outputs have settled by the time the z80 latches the data who cares that the data was actually settled 37ns ago??

Just saying.
Components that attach themselves to a bus supply (i) a condition that they would like to be satisfied; and (ii) the complete list of lines they may at any point use as output. With the tree structure, the two things together are used to decide who should be seated near whom.

Components themselves are signalled when their condition becomes true and when it becomes false. So they get the leading and trailing edge on the evaluation of their condition.

I keep flip flopping on conditions for performance reasons, but the canonical form is 'these lines changed; and these lines have these values'. You can specify no lines for the first or second limb so as to get just half the test.
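
A minimal sketch of that canonical form in C, for illustration only. It assumes the bus fits in a 32-bit mask with one bit per line, and that "these lines changed" means at least one of the listed lines changed; the type and function names (`bus_condition`, `condition_holds`, `condition_update`) are hypothetical and not taken from the emulator's source.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t bus_lines;   /* one bit per bus line (assumed layout) */

typedef struct {
    bus_lines changed_mask;   /* limb 1: any of these lines changed; 0 = skip limb */
    bus_lines value_mask;     /* limb 2: lines whose values are tested; 0 = skip limb */
    bus_lines value_bits;     /* required levels for the lines in value_mask */
} bus_condition;

typedef enum { EDGE_NONE, EDGE_LEADING, EDGE_TRAILING } condition_edge;

/* Evaluate both limbs against the previous and current bus states. */
static bool condition_holds(const bus_condition *c, bus_lines previous, bus_lines current)
{
    bool changed_ok = c->changed_mask == 0 ||
                      ((previous ^ current) & c->changed_mask) != 0;
    bool values_ok = (current & c->value_mask) == (c->value_bits & c->value_mask);
    return changed_ok && values_ok;
}

/* Report the leading or trailing edge of the condition's evaluation,
   remembering the last result in *was_true. */
static condition_edge condition_update(const bus_condition *c, bool *was_true,
                                       bus_lines previous, bus_lines current)
{
    bool now_true = condition_holds(c, previous, current);
    condition_edge edge = EDGE_NONE;
    if (now_true && !*was_true) {
        edge = EDGE_LEADING;
    } else if (!now_true && *was_true) {
        edge = EDGE_TRAILING;
    }
    *was_true = now_true;
    return edge;
}
```

A condition such as `{ 0, 0x1, 0x1 }`, with bit 0 standing in for the clock line, uses only the second limb and yields a leading edge when the line goes high and a trailing edge when it goes low.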

A bus is just a collection of components. A bus also is a component. So I have one bus that runs most of the system but the ROM component is actually a sub-bus that substitutes the low 9 address bits when relevant conditions are satisfied.

When a component actually changes its output, you need to evaluate who to propagate that to, which is now the main cost of my emulation. The tree structure theoretically speeds that up, since e.g. if the CPU isn't currently signalling refresh or memory request (and that hasn't just changed) then we immediately know not to consider the ROM or the RAM without checking each in turn.

The Z80 is connected as wanting the clock line to be set. So it ends up receiving the leading and trailing edges of the clock signal when its condition toggles between being satisfied and not being satisfied. It obviously doesn't care about the exact timing of events in between clock transitions but the clock signal isn't in any way elevated at present. That's why I'm mulling over elevating it and having messages propagated only upon clock changes.

Part of it is that, you know, I've been writing emulators for more than a decade now but this is the first time that I've decided to elevate the bus to the thing that needs to be correct (within stated bounds, naturally). So I want to get myself onto very solid ground with the whole line of logic.
sirmorris
Posts: 2811
Joined: Thu May 08, 2008 5:45 pm

Re: My new emulator

Post by sirmorris »

Why didn't you just say so before? :lol:

I suppose what I was trying to say was that I was having a hard time working out what would be lost if you were quantising and only propagating messages on clock edge transitions. What kind of messages would appear between the transitions?


C
RetroTechie
Posts: 379
Joined: Tue Nov 01, 2011 12:16 am
Location: Hengelo, NL

Re: My new emulator

Post by RetroTechie »

Thommy wrote:I'm considering whether it is sufficiently accurate for all communications to propagate only with the clock line, which would instantly reduce me to a fixed 6.5m messages/second and suggest simpler runtime processing.
Basically a machine like the ZX81 is a state machine. It has a certain state, and 'external stimuli' make it progress onto next states. Regardless of the low-level physical details (RC-delays for example), any state changes are initiated by transitions of the master (6.5 MHz) clock. If that clock freezes, machine state also freezes. And many internal state changes (memory read/writes etc) will be timed by lower-frequency transitions derived from that master clock (like CPU clock which is 6.5 MHz / 2).

So for emulation purposes, using that master clock as the sole reference for state changes, should be perfectly adequate. You're not interested in how many nanoseconds a bus transition takes, are you? Or what voltage level a bus line settles to. Or at what exact point in time data is clocked into registers... It's enough to know which data goes where (as result of each clock transition).
Thommy
Posts: 52
Joined: Wed Sep 28, 2011 6:37 pm

Re: My new emulator

Post by Thommy »

sirmorris wrote:Why didn't you just say so before? :lol:

I suppose what I was trying to say was that I was having a hard time working out what would be lost if you were quantising and only propagating messages on clock edge transitions. What kind of messages would appear between the transitions?
RetroTechie wrote:Basically a machine like the ZX81 is a state machine. It has a certain state, and 'external stimuli' make it progress onto next states. Regardless of the low-level physical details (RC-delays for example), any state changes are initiated by transitions of the master (6.5 MHz) clock. If that clock freezes, machine state also freezes. And many internal state changes (memory read/writes etc) will be timed by lower-frequency transitions derived from that master clock (like CPU clock which is 6.5 MHz / 2).

So for emulation purposes, using that master clock as the sole reference for state changes, should be perfectly adequate. You're not interested in how many nanoseconds a bus transition takes, are you? Or what voltage level a bus line settles to. Or at what exact point in time data is clocked into registers... It's enough to know which data goes where (as result of each clock transition).
The main concern is messages that are consequential to a transition and which are between components that then talk up to the CPU. In the clock-tied solution I'm imposing a latency of at least half a cycle for any inter-component communications. So, for example, if I were simulating a paged memory unit and I implemented the paging logic as a component on the bus then the process would be:
  • Message one: CPU requests data
  • Message two: paging unit, having determined which chip should respond, messages that chip appropriately
  • Message three: chip with data responds
I'd probably have to be running a really complicated architecture and have ended up with an awkward subdivision of logic into components for the delays to accumulate sufficiently to be a problem but I'd rather not have to think about it at all.

In practice I already have the concept of a sub-bus, as per the address-bus substitutions in front of the ROM, which effectively vests zero-latency logic into the bus and solves the problem in the absence of circular references. It'd also be helpful to have the bus treat the clock signal as something special because then I could add a timing condition to my component conditions and be able to do components that respond only after a fixed number of half cycles without having to write little counters at the component level time and time again.

It's pretty clear I'm overthinking this, I think. A flat list, probably one that distinguishes between components that listen to the clock and those that don't for broad-phase messaging decisions, with outgoing messages documented to take a half cycle to propagate and the sub-bus conceit available where that's a problem, is probably good enough and should get me back into the ~40% range of CPU usage.
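
One possible shape for that flat list, sketched in C under several assumptions: bit 0 stands in for the clock line, outputs are combined with a plain OR, and every name here (`flat_bus`, `bus_add`, `bus_half_cycle`, `count_message`) is hypothetical. Outputs written during one half cycle only become visible on the next, which is the documented half-cycle latency.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_COMPONENTS 8
#define CLOCK_LINE 0x1u          /* assumed position of the clock line */

typedef struct component component;
typedef void (*component_fn)(component *self, uint32_t visible_lines);

struct component {
    component_fn on_message;
    uint32_t next_output;        /* requested output; visible half a cycle later */
    unsigned calls;              /* message count, kept for the example only */
};

typedef struct {
    component *clock_listeners[MAX_COMPONENTS];
    size_t clock_count;
    component *others[MAX_COMPONENTS];
    size_t other_count;
    uint32_t visible;            /* bus state seen during the current half cycle */
} flat_bus;

/* Example callback: just counts the messages it receives. */
static void count_message(component *self, uint32_t visible_lines)
{
    (void)visible_lines;
    self->calls++;
}

static void bus_add(flat_bus *bus, component *c, bool listens_to_clock)
{
    if (listens_to_clock) {
        assert(bus->clock_count < MAX_COMPONENTS);
        bus->clock_listeners[bus->clock_count++] = c;
    } else {
        assert(bus->other_count < MAX_COMPONENTS);
        bus->others[bus->other_count++] = c;
    }
}

/* Advance by half a clock cycle: toggle the clock, publish the outputs
   requested during the previous half cycle, then deliver messages. */
static void bus_half_cycle(flat_bus *bus)
{
    uint32_t previous = bus->visible;
    uint32_t next = (previous ^ CLOCK_LINE) & CLOCK_LINE;
    size_t i;

    for (i = 0; i < bus->clock_count; i++)
        next |= bus->clock_listeners[i]->next_output & ~CLOCK_LINE;
    for (i = 0; i < bus->other_count; i++)
        next |= bus->others[i]->next_output & ~CLOCK_LINE;
    bus->visible = next;

    /* Broad phase: clock listeners always hear about a clock edge. */
    for (i = 0; i < bus->clock_count; i++)
        bus->clock_listeners[i]->on_message(bus->clock_listeners[i], next);

    /* Everything else is consulted only if a non-clock line changed. */
    if (((previous ^ next) & ~CLOCK_LINE) != 0)
        for (i = 0; i < bus->other_count; i++)
            bus->others[i]->on_message(bus->others[i], next);
}
```

In this sketch a component registered without the clock flag is never called on half cycles where only the clock toggled, which is where the broad-phase saving would come from.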
RetroTechie
Posts: 379
Joined: Tue Nov 01, 2011 12:16 am
Location: Hengelo, NL

Re: My new emulator

Post by RetroTechie »

Let me give an example: a binary ripple counter. It consists of a number of bits (flip-flops) representing a binary number. On a clock transition (for example high -> low) the lowest-numbered bit toggles once (1 -> 0 -> 1 -> 0 -> 1 etc). That bit is used as clock input to the next bit, so that next bit will toggle when lower bit makes 1 -> 0 change. And so on for the higher bits. So on each input clock transition, depending on initial counter value, a 'ripple' of changing bits may go through the counter.

Looks complicated, but everything that uses that counter only depends on its value (once that 'ripple' has passed & all bits have stabilized). So for emulation purposes, all you'd need to know is:

clock high->low transition -> counter value = +1
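
The same point as a small C sketch, assuming an 8-bit counter purely for illustration (all names are hypothetical). `ripple_step` models the flip-flop chain bit by bit, while `counter_clock` only tracks the settled value; both end up in the same state after every falling edge.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint8_t value;      /* settled counter value */
    bool last_clock;    /* clock level seen on the previous update */
} counter;

/* Value-level model: count once on each high -> low clock transition. */
static void counter_clock(counter *c, bool clock_level)
{
    if (c->last_clock && !clock_level) {
        c->value = (uint8_t)(c->value + 1);   /* wraps at 8 bits */
    }
    c->last_clock = clock_level;
}

/* Bit-level model of one falling input edge: each flip-flop toggles, and
   the ripple continues only while the bit just toggled went 1 -> 0. */
static uint8_t ripple_step(uint8_t bits)
{
    for (int i = 0; i < 8; i++) {
        uint8_t mask = (uint8_t)(1u << i);
        bits = (uint8_t)(bits ^ mask);
        if (bits & mask) {
            break;      /* went 0 -> 1: no falling edge for the next stage */
        }
    }
    return bits;
}
```

Anything that only reads the counter after it has settled cannot tell the two models apart, so the cheaper one is enough for emulation.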

Same for bits inside ULA, memory, Z80 registers etc. In a full system, there may be some ordering dependencies for evaluating state changes (perhaps ZX80 schematic would be a good place to start & understand when/where that's the case). But mostly the logic itself makes sure that only 1 component does something interesting at a time. Or there's multiple independent parts that don't interfere with each other.
Thommy wrote:It's pretty clear I'm overthinking this, I think.
YES. ;)
PokeMon
Posts: 2264
Joined: Sat Sep 17, 2011 6:48 pm

Re: My new emulator

Post by PokeMon »

Fine Thommy, sounds good what you want to do. 8-)

So everything in the ZX81 or ZX80 is controlled via the clock. The Z80 CPU is clocked at 3.25 MHz, counting full cycles. The CPU always does something on either the rising or the falling edge of the clock, so about 6.5 million messages a second will be correct and sufficient. You should begin with the CPU and ROM only and build an instruction fetch cycle first.

This is the best document describing the Z80 CPU with timing diagrams:
http://www.zilog.com/docs/z80/um0080.pdf

You will find how long before or after a clock edge the data is prepared or expected to be ready from an external component. I think you can discard the exact timing information, as it only describes minimums and maximums used to decide whether logic ICs can work together correctly or whether wait states have to be added for slow devices. But you should maybe make it easier for yourself, I think.

If you have a clock master that generates messages, you can send them to all devices connected to the clock. But you will find that only the CPU gets this message. The CPU, triggered by the clock, then generates new messages on the bus with a duration (I would use only whole half clock cycles), and whether and how a ROM or RAM reacts to these messages is controlled via ROM CS, READ, WRITE, M1, MREQ, IOREQ and so on. I think it is not too difficult to do. But there is no official documentation of what the ULA does, so here you have to refer to documents published on the internet by private users like Wilf Rigter and so on.

I wish you good luck. I think it's not too complicated to realize and that the code can run fast enough. I think you can learn a lot about microarchitecture while implementing this feature. At first you could break things down into the CPU's special work cycles, like

* M1 cycle
* memory read
* memory write
* input
* output
* int cycle
* nmi cycle
* halt cycle

Bus request and power down/power up are not used in the ZX80/ZX81 context, I think.

As these cycles are generally 4 clock cycles long, a single message can replace 8, which makes your emulation faster, if coarser in detail.
Thommy
Posts: 52
Joined: Wed Sep 28, 2011 6:37 pm

Re: My new emulator

Post by Thommy »

To the extent that it helps the conversation: the old build was based on a half-cycle accurate bus, with changes published only on the trailing or leading edge of the clock but maintaining only a single bus state and giving the Z80 more responsibility than is realistic in terms of managing the bus. So, for example, RAM would spot when it was supposed to load the bus and then it would load a value as a single discrete action and never think about it again. The Z80 would know, come the end of a read cycle, that nobody was loading and reset the bus itself.

That's the premise on which the builds already released operate. I have a 2010 i3 iMac, which I think is clocked somewhere around the 3 GHz mark, and that approach cost about 35% of a single core to emulate a ZX80 at full speed. So not enough to be worrisome.

In order to implement that I've collected and examined appropriate datasheets. The Z80 is already signalling entirely correctly, other than where I've made mistakes in the internal scheduling of tasks. I specifically think I may have some of the indexed read/write/modify stuff off, but it's all new code and I'll fix it.

The name 'Clock Signal' does actually come primarily from the original decision that timing and sequencing is accurate only to the nearest half-a-cycle.

The purpose of the changes since then, in research of which I've been ripping large shreds of the source apart, has been to vest logic more accurately. So RAM loads are considered to be a continuous event that the RAM module ensures happens while the correct conditions are met. Which in practice means little more than spotting when the condition goes false and adding a bus actor to do explicit aggregations and to keep and manage local bus states for each component, at least as far as is necessary. The new convention that components declare which lines they intend to use as output actually allows the bus to group components to an extent to reduce the aggregation costs, but that's an implementation detail.
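
For illustration, a minimal C sketch of per-component output state plus explicit aggregation. It assumes undriven lines read as a caller-supplied floating level and that two components driving the same line are simply ORed together; neither detail is confirmed by the description above, and the names (`bus_output`, `output_set`, `bus_aggregate`) are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint32_t output_mask;   /* lines this component may ever drive (declared once) */
    uint32_t driving;       /* lines it is loading right now; 0 = high impedance */
    uint32_t values;        /* levels for the lines in driving */
} bus_output;

/* Update a component's local bus state. Passing driving = 0 releases
   the bus, i.e. the component returns to high impedance. */
static void output_set(bus_output *o, uint32_t driving, uint32_t values)
{
    assert((driving & ~o->output_mask) == 0);   /* only declared lines */
    o->driving = driving;
    o->values = values & driving;
}

/* Combine every component's local state into the visible bus state.
   Lines nobody drives take the assumed floating level. */
static uint32_t bus_aggregate(const bus_output *outputs, size_t count,
                              uint32_t floating_level)
{
    uint32_t driven = 0;
    uint32_t level = 0;
    for (size_t i = 0; i < count; i++) {
        driven |= outputs[i].driving;
        level |= outputs[i].values;
    }
    return (level & driven) | (floating_level & ~driven);
}
```

Under this sketch the RAM component would call `output_set` with its data lines while its condition holds and with zero once the condition goes false, so the bus reverts without the Z80 having to reset it.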

I've decided first to see how things would proceed if there are no special cases. The clock signal is no more special than any other bus line. After a few variations in design I've come to my current tree structure. The processing costs for the actual components are essentially the same as before (well, there's a negligible increase because, as per the example, RAM now has to do something explicit to stop loading the bus rather than leaving that step implicit as before). However the bus propagation costs have grown massively, to be easily the majority of the processing cost. I'm now spending 90%+ on running a ZX80 and I'm not even persuaded it's running at full speed.

In that context I'm obviously keen to find a way to make savings. Tying message propagations back to the clock is probably the correct thing to do, and I'm dithering on the tree structure for propagation. Trees are a broad phase test so they always cost more in the worst case — which gives you a spectrum of expected change in processing costs from less expensive to more expensive depending on your usage patterns. Even if I fix the actual implementation costs of running the tree (and there are a bunch of options in that regard), it still may not be the correct thing to do. Or, more likely, the pragmatic two branch tree with one branch being everything that watches the clock cycle and the other branch being everything that doesn't is likely to be a smart way forward, especially as the data structure is so trivial.

On the plus side, having now written the components properly and having an abstract interface for adding them to a bus all these decisions are concentrated in the bus code. So the time it took to get here isn't representative of what I'll go through if I flip flop on guarantees for message delivery again in the future.

On implementation stuff in general, I feel a bit like I'm wading through treacle because I've decided to do it in vanilla C, after several years working in higher level languages. That's almost certainly also contributing to my tendency to overthink (and, hence, to write interminable posts like this), for which I apologise.