Guestpost: a proposal of a UD like algorithm by Bauke Conijn

The following is the first guest post, answering the invitation made in “Call for analysis of the new UD video”. The author is Bauke Conijn (aka bcmpinc). Reposted on his blog here.

______________

I designed an algorithm that, given an octree and a location within that octree, can compute the cubemap at that location. The algorithm kind of raytraces each face of the cubemap, but does this in such a way that multiple rays are traced simultaneously (aka mass connected processing).

The faces of the cubemap are axis-aligned, which allows us to use linear interpolation rather than perspective-correct interpolation (which includes 2 multiplications and a division). Because of this linear interpolation, the algorithm only needs integer additions, subtractions, and multiplications by a power of two (i.e. bitshifts).
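For concreteness, here is the textbook perspective-correct interpolation formula next to the linear midpoint that suffices when the endpoints lie at the same depth. This is a generic illustration, not code from the renderer:

#include <cstdint>

// Perspective-correct interpolation of an attribute u along a projected
// segment: it needs multiplications and a division per interpolated value.
double perspective_interp(double u0, double z0, double u1, double z1, double t) {
    double num = (1.0 - t) * u0 / z0 + t * u1 / z1;
    double den = (1.0 - t) / z0 + t / z1;
    return num / den;
}

// Linear interpolation at the midpoint, which is all that is needed when the
// endpoints are at the same depth: one integer addition and one shift.
int32_t linear_mid(int32_t u0, int32_t u1) {
    return (u0 + u1) >> 1;
}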

For a more thorough explanation of the algorithm please read this article.

I’ve also written a partial implementation of the algorithm, available on GitHub, which renders the positive z plane of the cubemap. It currently does so at 2 fps, which, considering that I have not yet put any effort into optimization and screen-space occlusion is not yet implemented, seems promising. The data is still fully loaded into memory at startup. As background I render a fixed cubemap.

Due to space considerations, not all models have been uploaded to GitHub, but a small low-resolution model is available, so that after compiling the code you can at least see something. The code is rather messy, as it is still in an experimental state.

Images:
http://imageshack.us/a/img839/7291/voxel2.png (test-scene)
http://imageshack.us/a/img841/1796/voxel3.png (test-scene)
http://imageshack.us/a/img708/221/voxel4.png (part of Mauna Loa)
http://imageshack.us/a/img24/8798/voxel5.png (part of Mauna Loa)
http://imageshack.us/a/img69/2483/voxel6.png (low resolution)

_______________

UPDATE: Here is a movie made by Bauke, with his program.

Moreover, here is a draft article which explains it in detail.

114 thoughts on “Guestpost: a proposal of a UD like algorithm by Bauke Conijn”

  1. I implemented the screen space occlusion [available here]. It seems to get the FPS up to about 8, but the way I implemented it does produce some rendering artifacts (pixels that don’t get rendered).

    1. I edited your comment, the […] part. I don’t see reactions, why?

      [Added later:] Are you at the stage where you can produce a youtube movie which is at least slightly like UD, along with giving the github link?

  2. The vxl files are rather large (they use a plain text format), which is why I’d rather not put them on GitHub. I’ve put two on https://www.box.com/s/fbluf7mrvqullmr0tfdv, which are converted textures + heightmap. Furthermore, the heightmap and convert programs can be used to generate more vxl files, but they indeed lack documentation.

    Only the positive z plane of the cubemap is being rendered. Adding the other planes should not reduce the framerate (8 FPS) and is easy, but cumbersome (it needs 32 nearly identical copies of the same code). I want to convert my rendering/query function into a template function before rendering the remaining 5 planes. Only when that is done can I make a youtube movie.

    1. Thanks, Bauke:
      I’m more interested in the data format, and your approximation is for rendering; your octree is a basic and memory-expensive implementation.
      I understand your code now: your quadtree on a side of the cube simplifies the octree traversal, but then you need to draw every side.
      Sorry, I’m not helping you much, as I haven’t coded any octree renderer yet.
      Anyway, it’s nice to see you try.

      1. Did you use compiler optimization? I profiled my program with gprof and found that 90% of the time is spent in the traverse function.

  3. I updated my code. It now has a single template function that is called and renders all the faces of the cubemap. A few minor bugs have been fixed. I did some optimizations and it is now running at approximately 6 FPS (screen: 1024×768, cubemap: 6x1024x1024).

    It’s now at the point where I can make a movie, though I’m still lacking a proper scene. With my current implementation a proper scene won’t fit in memory, so I need to start designing and implementing the on-disk database.

  4. I’ve implemented the database part. It first sorts the points in Hilbert curve order (this is the slowest step). Then it computes the number of nodes per layer in the octree and allocates a file that is sufficiently large. For each layer, a segment is created, which stores the nodes in that layer in Hilbert curve order. Now my renderer starts up instantly and FPS has improved to 10 (I assume due to better cache usage because of the Hilbert curve sorting). (A rough sketch of this pipeline is given at the end of this comment.)

    Conversion of 150M points took 16 minutes for sorting and 1 minute for the remainder of the conversion. The resulting file is about 2.5G in size.

    A movie will probably take a while longer as I do not know how to make a screen capture movie on Linux. I can generate the movie straight from the rendered images, but that also takes some work and I’d like to do some further speed optimizations.
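    A rough sketch of the conversion pipeline described above (illustrative names only; hilbert_key() is an assumed helper, and this is not the repository’s code):

    #include <algorithm>
    #include <cstdint>
    #include <vector>

    struct Point { uint32_t x, y, z, color; };

    // Assumed helper: index of the point along a Hilbert curve of depth MAX_DEPTH.
    uint64_t hilbert_key(const Point& p);

    const int MAX_DEPTH = 20;

    void convert(std::vector<Point>& pts) {
        // 1. Sort all points along the Hilbert curve (the slow step).
        std::sort(pts.begin(), pts.end(), [](const Point& a, const Point& b) {
            return hilbert_key(a) < hilbert_key(b);
        });

        // 2. Count the nodes per octree layer: a node at depth d corresponds to a
        //    distinct 3*d-bit key prefix, so counting distinct prefixes in the
        //    sorted list gives the layer sizes (and hence the file size and the
        //    offset of each layer's segment).
        std::vector<uint64_t> nodes_per_layer(MAX_DEPTH + 1, 0);
        for (int d = 0; d <= MAX_DEPTH; d++) {
            uint64_t prev = ~0ull;
            int shift = 3 * (MAX_DEPTH - d);
            for (const Point& p : pts) {
                uint64_t prefix = hilbert_key(p) >> shift;
                if (prefix != prev) { nodes_per_layer[d]++; prev = prefix; }
            }
        }

        // 3. Allocate the output file at its final size and write one contiguous
        //    segment per layer, nodes stored in curve order, so that spatially
        //    nearby nodes end up in nearby disk blocks (omitted here).
    }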

  5. This is looking very promising, fantastic effort!
    Would things get faster if you rendered the cubemap in the same way as Euclideon? i.e. fat camera-facing pixels, rather than cubes? The pixels can change in size depending on distance from the camera, and on the gap to their immediate neighbours.

    1. Only if you zoom in a lot. And it would be more difficult than rendering the cubes. The cubes are a direct result of how the algorithm works. In fact the renderer thinks that a cube is an octree node whose children are again that cube.

      I could prune these traversals, but that would only increase performance when you’re zoomed in a lot, and it would additionally degrade the image quality in that case.

      I suspect that my best bet for improving speed is reducing the amount of branching (or to start using multi threading).

    1. Beautiful link Dave.
      In fact, since the GPU is massively parallel, what if we just renounce the “purity” of Dell’s no-GPU approach and simply use the GPU, as Bauke does, when it’s better?

      1. Yeah, it’s a great visual explanation of Morton coding.
        I’d love to do a GPU version of this idea, that is, when I finally visualize what Bauke is actually doing, which, I have to admit, I’m finding a little hard to grasp. 🙂
        The trouble with GPU code is that recursion is not possible, so an iterative approach would be needed.
        Bauke’s already knocking at Atomontage’s door though! 🙂

      2. chorasimilarity :
        I’d say it’s better already, if you compare apples with apples.

        Heh! Well they’ve just posted massive frame-rate claims, but hang on, they’ve been at it for years now!

    2. I haven’t tried using Morton code for sorting the points. However, I do convert the coordinates to Morton code before I transform them into Hilbert code. I believe that Hilbert code has slightly better properties. Furthermore, I think I can get it as fast as Morton (using some kind of k-way bucketsort for sorting in Hilbert order per 1 or 2 layers).

      However, I don’t think it’s worth the effort as I don’t have any plans of converting other scenes (yet).

      I’m also considering compression by common subtree elimination (the octree variant of http://en.wikipedia.org/wiki/Common_subexpression_elimination) combined with wavelet transform (a fancy way of saying that I only store the color difference with the parent node). That would result in a directed acyclic graph, in which sorting is no longer possible.

      For the quadtrees, used for occlusion testing, I do use Morton Code, mostly because it is better for the cache, but also because traversal is easier. (Funny thing is that I got to this by generalizing how a heap stores its tree to 4 children.)
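      For reference, here is the textbook bit-interleaving that turns three coordinates into a Morton code, plus the “heap generalised to 4 children” layout he mentions for the occlusion quadtree (generic illustrations, not necessarily the repository’s exact code):

      #include <cstdint>

      // Spread the 10 low bits of v so that two zero bits follow each of them.
      uint32_t spread3(uint32_t v) {
          v &= 0x3ff;
          v = (v | (v << 16)) & 0xff0000ff;
          v = (v | (v <<  8)) & 0x0300f00f;
          v = (v | (v <<  4)) & 0x030c30c3;
          v = (v | (v <<  2)) & 0x09249249;
          return v;
      }

      // Interleave x, y, z into a single 30-bit Morton code.
      uint32_t morton3(uint32_t x, uint32_t y, uint32_t z) {
          return spread3(x) | (spread3(y) << 1) | (spread3(z) << 2);
      }

      // Heap-style quadtree storage: one flat array, root at index 0,
      // children of node i at 4*i+1 .. 4*i+4, parent of i at (i-1)/4.
      inline uint32_t quadtree_child(uint32_t i, uint32_t c) { return 4 * i + 1 + c; }
      inline uint32_t quadtree_parent(uint32_t i) { return (i - 1) / 4; }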

      1. Bauke, please consider writing a full, detailed explanation of your algorithm; it might be as useful as the program. I’d do that if I were you, and put it on arXiv.

  6. Hello Bauke, I’m finding some errors in the code. I’m trying to compile it for Windows, and I’ve got SDL, SDL_Image, a version of ‘unistd.h’ and ‘getopt.h’ and the OpenGL Glee files. But I’m getting internal errors for missing ‘get_face’ and ‘build()’ (without parameters) in the cubemap class. Is it out of date, or am I missing something?
    Thanks.

    1. What is the SHA1 of the checkout you’re building? If unistd.h and getopt.h are required, you probably also need mmap, which isn’t available on windows.

      I can start using boost for filesystem operations and mmap. That might make the code more portable, though I find boost too large a dependency.

      1. Thanks preda, that showed me that ‘octree.cpp’ was not being used; that was the file calling ‘get_face’, which was the problem I originally had with it.
        It’s a shame that all these libraries and C++11 stuff are being used (sometimes I miss the old days). Having said that, I do love the GLM maths library! 🙂

  7. I was already wondering where the getopt.h include was coming from. I don’t use it.
    If you want, you can create a fork of my GitHub project. That is probably the best way to share the code. Otherwise you can use bcmpinc at users dot sourceforge dot com, though mention that you sent something, because I don’t check it often.

    I was not able to extract the 7z file I got from you (chorasimi sent it to me). I’m not sure what went wrong.

    1. Ok, I managed to extract it using the command line. Runs fine (using wine). Checked the differences; merging should be easy. Though I will need to figure out how to define those constants in a template, or quit using it as a template. As you noticed, I’m using C++11, so I need to convert a few things to use the pre-C++11 tricks. Though I cannot keep the project.dev file up to date, so you will have to use dev-c++ with a custom makefile.

      Thanks btw for the windows mmap implementation.

      Oh, and I’m still wondering where the getopt.h requirement comes from.

  8. The latest version now no longer uses c++11 and compiles fine without optimization (i.e. no undefined reference to quadtree::SIZE errors). Strangely, I did not notice any change in speed with compiler optimizations disabled.

  9. I’m having a nice time with Visual C++ here; I really needed to know which files belong to the main project and which don’t, and having 6 ‘main’s didn’t help! 😦

    1. It is in the makefile: “$(eval $(call target,voxel,main events art timing pointset octree_file octree_draw quadtree))”.
      So for target ‘voxel’, you need main.cpp, events.cpp, art.cpp, timing.cpp, pointset.cpp, octree_file.cpp, octree_draw.cpp and quadtree.cpp.

  10. Branching and data cache issues are certainly tricky! Your data is already in a good order though, isn’t it?

  11. For caching issues it is. For branching it probably isn’t and can’t be. There I probably have to modify my code. I do suspect that branching is currently the bottleneck.

  12. Sorry for being a bit thick, but can you explain how the perspective view is created?

    I think I’m approaching this as something it isn’t, so I’m going to need a bit more of a nudge on how this works, if that’s OK bcmpinc?

    1. I’m with Dave on this! That’s why I say you should try to explain each part in detail; there are advantages to working in an “academic” style, like a LaTeX draft which eventually reaches a stable version after it is “attacked” by everybody. You have a draft already, right? (However, it does not have to be unreadable; after all, we are not killing any trees, so there’s no pressure to make it either short or obscure technobabble.)

      1. Thanks, I’ve read it several times, but it seems like there’s something missing. i.e. I still can’t imagine how you’re getting the perspective positions? I can see that you’re going down a quadtree on the cube’s side with an octree of data, but I can’t figure out how you’re getting the projection correct. Please help! 🙂

      2. That’s not written yet, apparently.

        What do you think about the issue of “color tunneling”, mentioned in section 1 “Database”?
        It looks to me that saj with his mysterious way of constructing the database may have a solution for this.

      3. Which perspective? There are multiple perspectives, one for the camera and one for each side of the cubemap.

        The camera perspective is handled by OpenGL, which I use to render the cubemap. Note that the cubemap can be computed independently of this perspective.

        The perspective correctness of the sides of the cubemap is harder to explain. It is kind of a direct result of how the octree and quadtree (of the cubemap face) are traversed. The important thing here is that the cubemap and octree are axis aligned, which means that I don’t have to do perspective correction.

        The two-stage approach, 1. render octree to cubemap faces, 2. render cubemap to screen, is what allowed me to bypass the perspective correction problem. It is the key discovery that led to my method.

      4. I say it again, with Bauke’s words, but more politely, with [ … ] for my comments:

        – the cubemap can be computed independently of perspective [define your meaning of cubemap, it’s a collection of quadtrees?]

        – The camera perspective is handled by OpenGL, used to render the cubemap [so, at this step there is no more recursion, therefore you can use the advantage of the fast GPU?]

        – the key discovery is the two stage approach, 1. render octree to cubemap faces, 2. render cubemap to screen, which obviates the need to do perspective correction.

        – this is possible because of the perspective correctness of the sides of the cubemap, which is harder to explain, but it has to do with the way the cubemap [collection of quadtrees?] and the octree [database of space in absolute coordinates] are axis aligned [probably if you insist on clarifying this, then the gist of your method will become clear]

      5. A cubemap is a set of 6 square images, which are the cubemap’s faces.

        The quadtree is only used during the rendering of a cubemap face for the purpose of frustum and occlusion culling.

        A cubemap is a special type of texture. Rendering a quad that uses this texture is a task that GPUs have been designed for (and indeed does not require recursion). The gain: a reduction from 15 to 3ms for rendering the cubemap to the screen; as well as some anti-aliasing/bilinear interpolation.

        Perspective correction is explained here: http://en.wikipedia.org/wiki/Texture_mapping#Perspective_correctness.

        To project a 3D point onto a 2D plane, one has to do a division. (Afaik, this cannot be circumvented. My algorithm does this implicitly using a base-2 long division.) If one wants to compute the midpoint of two points, one can use the 3D points and project again (requires a division) or use perspective-correct interpolation (also requires a division). However, if the 3D points have the same distance to the viewing plane, then their midpoint coincides with the midpoint of the projected points. Hence in that case the midpoint can be calculated by computing the midpoint of the projected points, which only involves additions and bitshifts.

        My algorithm computes a lot of midpoints, but because the cubemap and octree are aligned, the points of which the midpoint needs to be computed have the same distance to the viewing plane, which allows us to use the division free method.

        It is difficult to explain this without images. Though drawing some images should help.

        Draw a point O (the camera), draw two lines, a and b, through O (these form the viewing wedge/pyramid/cone). Now draw two line segments, c and d, that each connect a and b (c is your viewing plane and d is the object you are rendering). Now draw a line through O and the middle of d. Do you notice that this line goes through the middle of c if and only if c and d are parallel?

        Aligning the cubemap and octree causes c and d to always be aligned (during the computation of the cubemap’s faces).
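        A tiny numeric check of that midpoint claim (illustrative only, not code from the renderer): for two points at the same distance z from the viewing plane, projecting their 3D midpoint gives the same result as averaging their already-projected coordinates, so the division can be skipped.

        #include <cstdint>
        #include <cstdio>

        int64_t project(int64_t x, int64_t z) { return (x << 16) / z; }  // fixed-point x/z

        int main() {
            int64_t z = 40, ax = 12, bx = 30;                           // same depth z
            int64_t direct   = project((ax + bx) >> 1, z);              // project the 3D midpoint
            int64_t averaged = (project(ax, z) + project(bx, z)) >> 1;  // average the projections
            std::printf("%lld %lld\n", (long long)direct, (long long)averaged);  // both print 34406
            return 0;
        }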

  13. Thank you for more explanation. Say positive Z is the cube side we’re interested in; this displays a 90-degree FOV in that Z direction into the octree. But the very fact that it’s a FOV means you draw the octree cubes at different angles, sizes and distances to the camera, doesn’t it?

    I noticed you traverse the same code for the octree and quadtree – are the octree and quadtree directly connected structurally somehow?

    1. I don’t draw cubes, I draw points. The reason that you see cubes in the movie is because my algorithm considers a leaf node in the model (octree) to be a node whose children are again that leaf node. The algorithm traverses this infinite ‘tree’ until the cube has the size of a single pixel and then it is rendered. (Implementation note: if you’re inside a cube this would give infinite recursion, which I solve by not traversing the nearest node.)

      Once you pick a camera location, each node in the quadtree defines a view pyramid, which intersects nodes in the octree. We traverse these simultaneously to determine the nearest node in the octree for each leaf in the quadtree. The traversal is such that, when the octree node and quadtree pyramid are projected on the cubemap face, they are approximately of equal size.

      1. A quadtree defines a pyramid! – How, it’s only 2D?
        OK, do you mean the quadtree that covers a cube’s side, that starts as large pixels that are close to the camera, and then as it descends the tree the child nodes are further and further away from the camera, and it’s a frustum ‘shape’?

      2. I’ve seen that paper before and had a hard time reading it. I wondered whether it could be the same thing as what I’m doing, until I realized that they do orthogonal rendering.

        If you take the camera position as the top of the pyramid and the square of a node in the octree as its intersection, then you get the pyramid I’m talking about. (I use pyramid to be a frustum without near-plane.)

        The quadtree is for occlusion only. The leaf nodes represent pixels and store whether that pixel still needs to be rendered. Octree nodes are traversed in front to back order, so when the first octree node for that pixel is found we mark it as rendered such that later on other octree nodes do not overwrite our pixel.

        Parent nodes in the quadtree encode whether their subtree contain any pixels that still need to be rendered.
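        A loose paraphrase of the dual traversal described in the last two replies, just to fix the idea (helper names are made up, details such as LOD are omitted, and this is not the actual code):

        // Octree children are visited front to back for the current cubemap face;
        // the quadtree records which parts of that face still need a colour.
        void traverse(const OctreeNode& o, QuadtreeNode& q) {
            if (q.done) return;                           // occlusion: region already filled
            if (!pyramid_intersects(q, o)) return;        // node lies outside this view pyramid
            if (q.is_pixel()) {                           // reached pixel size: first hit wins
                plot(q.pixel(), o.color());
                q.done = true;
                return;
            }
            if (projected_size(o) > q.size()) {
                for (const OctreeNode& child : children_front_to_back(o))
                    traverse(child, q);                   // refine the model
            } else {
                for (QuadtreeNode& child : q.children())
                    traverse(o, child);                   // refine the screen region
                q.done = q.all_children_done();           // parents cache "subtree finished"
            }
        }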

  14. Sorry, I don’t want to be a pedantic ass, but taking the time to explain things well really helps everybody. Nobody’s hurried here, everything is open, and we are all competitive people who choose to share our time and thoughts freely.
    It’s probably an after-effect of the party of comments, but let me tell you what would make me happy: that Jim (or any other guy) comes with another solution, as fast or even faster.
    To close this self-indulgent comment, for the first time your method, Bauke, has my full attention, due to your last comment, because here we are not looking for minute improvements but for really new ideas. It looks like the “cubemap” stuff is such an idea, but only if you really take the time to explain it well.
    Again excuse me if I ruffled any feathers with this comment.

    1. Oh, your comment is fine. I am planning to write a paper on my renderer; however, I’m a bit reluctant due to the amount of work I expect it to be. I also want it to reflect the current state of my implementation, though there is a design flaw I would still like to correct: coordinates are relative to the octree node instead of relative to the camera. The latter allows easier depth pruning (the Menger sponge is currently giving occasional stack overflows) and provides depth information. Furthermore, the discussions here tell me which parts are hard to understand and allow me to try different explanations.

  15. Ok so is this correct? If I’m looking directly at a single wall stretched out left and right, the top and bottom of that wall will be straight and aligned to the screen’s X axis, no matter how far away the wall is. If I render that wall to a square polygon (i.e. the side of a cube), I can rotate that instead, to get the illusion of looking around? And in any one frame I only have to draw parts of three cube sides, because of the viewing angle.

  16. Hm, because the wall is aligned, you can draw it to a side of the cube by simply scaling the wall. If the wall were not aligned, you could also draw it to a side of the cube, but that would be more difficult as you would have to take the perspective into account.

    When rotating and rendering the cube you will get the illusion of looking around, regardless of the orientation of the wall. Having the wall aligned, just makes it simpler.

    Also, if you look at the middle of a side of the cube, you can potentially see 4 cube sides. If you look at the center of a face and rotate by 45 degrees, or if you have a large FOV even 5 is possible.

    1. Dave, that’s the power of cubemaps; that’s why I repeatedly said that using cubemaps disconnects rotations from translations of the camera. It’s JX’s idea; look at this post, which also contains links to JX’s explanations and speculations.
      Bauke, in his way, did it better and went further than JX.

      1. JX’s idea was to store the current view directly into a cube-map and save them directly; Bauke is not doing anything like the same thing, the cube-maps here are only used to render the final image.

      2. Dave, quote from the link, by JX:
        “1) In short: I render cubemaps but not of pixels – it is cubemaps of 3d points visible from some center.
        4) Yes, I too started this project with this idea: “indexing is the key”. You say to the database: “camera position is XYZ, give me the points”. And there’s files in database with separated points, database just picks up few files and gives them to you. It just can’t be slow. It only may be very heavy-weight (impossible to store such many “panoramas”) .

        5) I found that instead of keeping _screen pixels_ (like for panoramas) for each millimeter of camera position it is possible to keep actual _point coordinates_ (like single laser scanner frame) and project them again and again while camera moves and fill holes with other points and camera step between those files may be far bigger than millimeters (like for stereo-pairs to see volumetric image you only need two distant “snapshots”).

        6) By “points linked with each other” I meant bunch of points linked to the some central points. (by points I mean _visible_ points from central point).

        7) What is central point? Consider this as laser scanner frame. Scanner is static and catches points around itself. Point density near scanner is high and vice versa.”

        So, the method resembles Bauke’s in its key point, but is not as evolved.

  17. JX’s cube-maps are pre-rendered and stored as cube-maps. That’s why he gave up, because of the huge amounts of data needed to store them all, plus all the artifacts in rendering. The only similarity here is the use of cube-mapping somewhere, but it’s really very different.

    1. Yes, you are right. But he mentions the possibility of storing not the cubemap literally, but only some useful part of it (which is achieved by Bauke’s cubemaps, which are exactly this, with the advantage that he does not have to store them, but may quickly build them, if I understood well). Moreover, the main advantage of using cubemaps, in any form, is exactly this: to split rotations from translations of the camera, so that first you construct a cubemap along some fixed orientation, and then you “look at the cubemap” and not at the world, from any angle you wish. That’s clever; the day I learned that, I imagined walking around with my head in a box made of computer screens :).
      Or maybe I don’t understand something and Bauke will correct this.

      1. Just spend some time on Google Earth street view to see what a head-box-of-screens looks like! 🙂

  18. bcmpinc :
    Also, if you look at the middle of a side of the cube, you can potentially see 4 cube sides. If you look at the center of a face and rotate by 45 degrees, or if you have a large FOV even 5 is possible.

    If the cube is drawn with a 90-degree FOV and then viewed at less than 90 degrees, say 65, which is reasonable, then there will be no more than a maximum of 3 faces seen at once.
    How are you rendering the cubemap? If you draw a quad covering the screen then you can access it using the following in the fragment shader:-

    vec3 dir = vec3(2.0*texelST-1.0, tanFOV); // Find the screen pixel as -1 to +1
    // dir is the ray going through the screen so some things need to be done to it…
    dir.x *= aspectRatio; // sort out aspect ratio
    dir *= cameraMatrix; // rotate it by camera 3×3 matrix
    vec3 colour = textureCube(cubeMap, dir).xyz;
    gl_FragData[0] = vec4(colour, 1.0);

    Done! 🙂

    1. I use the fixed pipeline of OpenGL 2 or something, but it effectively does the same. Oh, and (each face of) the cube is always drawn with 90 degree FOV.

      1. I’m sure that’s fine though. I’m still a little confused as to how you can see more than 3 sides of the cube, if your viewing FOV is less than or equal to 90 and the cube resolution is a square with side equal to the largest screen dimension. It may save you some CPU cycles, but I’m just trying to help out, in a very minor way of course. 🙂

      2. Call the faces of the cubemap: front-back, left-right and up-down. Look in front of you, then rotate 45deg wrt to the front-back axis. Your screen will contain a part of the front face, but each of the 4 corners of the screen will contain parts of the up, down, left, right faces.

      3. FOV is usually defined in the height direction of the screen. The FOV in the width direction is larger, and the FOV in the diagonal direction is largest. Only if the diagonal FOV <= 90 degrees is it limited to 3 faces.

        From a speed standpoint, this does not matter much, if at all. The time required to transfer the images to the GPU is negligible (about 2ms per face, and practically 0ms when buffers are used). Furthermore, I prepare the quadtree using frustum culling, such that only the points within view are marked as requiring rendering. This takes on average 0.1ms per face.

        The amount of CPU cycles is currently not the problem; I recently reduced it by about 25% without any change in FPS. IO bandwidth isn't the problem either, as the Menger sponge (a 600 byte dataset) also requires 50ms per frame. So I suspect branching and branch prediction. In the best case, this can give a 6x speedup.
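        As a quick sanity check of that rule (plain trigonometry, for illustration, not from the code), a 65-degree vertical FOV at 4:3 already has a diagonal FOV above 90 degrees, so a fourth face can peek into the corners:

        #include <cmath>
        #include <cstdio>

        int main() {
            double fov_y    = 65.0 * M_PI / 180.0;                 // vertical FOV
            double aspect   = 1024.0 / 768.0;                      // 4:3, as in the tests above
            double half_h   = std::tan(fov_y / 2.0);
            double half_d   = half_h * std::sqrt(1.0 + aspect * aspect);
            double fov_diag = 2.0 * std::atan(half_d) * 180.0 / M_PI;
            std::printf("diagonal FOV = %.1f degrees\n", fov_diag); // about 93.4
            return 0;
        }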

    1. Let’s take it easy. I have a million ideas, and I need someone to damp this behaviour, because otherwise most of them are lost. Let’s see what comes out of this joust between Bauke and Jim (that’s how I see it, maybe fantasizing). If it goes well then we pass to the next thing, in the open and polite way of this community of gentlemen.

      1. That’s not fair, I’ve only been asking what I thought were relevant, salient questions about this Guestpost’s subject matter.

  19. chorasimilarity :
    Call the faces of the cubemap: front-back, left-right and up-down. Look in front of you, then rotate 45deg wrt to the front-back axis. Your screen will contain a part of the front face, but each of the 4 corners of the screen will contain parts of the up, down, left, right faces.

    That only happens when you are viewing the cubemap at 90 degree FOV, like the following picture of a cubemap, made with six different tinted sides:-
    http://imageshack.us/photo/my-images/62/hx7u.jpg/

    But if you view it at 70 degrees:-
    http://imageshack.us/photo/my-images/822/qjm0.jpg/

    The zooming in artifacts are not really seen because of the graphic card’s filtering, which is nice.
    You could also render the cube-map bigger to compensate, but it may not be necessary. Or simply stop the camera from rolling too far, like the island demo which has no camera roll at all.

  20. bcmpinc :
    FOV is usually defined in the height direction of the screen. The FOV in the width direction is larger and the FOV in the diagonal direction is largest. Only if the diagonal FOV<=90 degrees, then it is limited to 3 faces.
    From a speed standpoint, this does not matter much, if at all. The time required to transfer the images to the GPU is negligible (about 2ms per face, and practically 0ms when buffers are used). Furthermore, I prepare the quadtree using frustum culling, such that only the points within view are marked as requiring rendering. This takes on average 0.1ms per face.
    The amount of CPU cycles is currently not the problem, I recently reduced it by about 25% without any change in FPS. IO bandwidth isn’t the problem either, as the Menger sponge (a 600 byte dataset) also requires 50ms per frame. So I suspect branching and branch prediction. In the best case, this can give a 6x speedup.

    Oh that’s interesting. Scary though – can the octree be replaced with something stackless?

  21. I still don’t understand how the perspective is calculated with respect to the quadtree and display frustum. Can someone please explain it somehow using 2D drawing if possible? Or basically without presuming we know about all the steps missing from the description.

    I understand the octree descends in view axis order, it’s beautifully simple. I’m just stuck on the scaling of each pixel. Is it faked for each distance, and when are those scales calculated?

    Thanks, still confused,
    Dave.

    1. My algorithm consists of three steps:
      1. Preparing the quadtree. (This process is very similar to frustum culling, I’m basically culling all the leaf nodes of the quadtree that are not in the frustum.) This step can be omitted at the cost of higher CPU usage.
      2. Computing the cubemap. (Or actually only the visible part, if you did step 1.) It uses the quadtree to do occlusion culling. The quadtree is basically an improved z-buffer, though since the octree is descended in front-to-back order, it is sufficient to only store ‘rendered’ or ‘not rendered’. It furthermore allows constant-time depth checks for larger regions.
      3. Rendering the cubemap. This is just the common cubemap rendering method. I do nothing new here.

      My description only explains step 2, as this is the core of my algorithm. Step 1 is an optimization and step 3 is so common that I expect the reader to already know how this step works.

      Step 2 does not use the display frustum at all. It does do the perspective, but does so by computing the nearest octree leaf (for the sake of simplicity I’m ignoring the LOD/mipmap/anti-aliasing here) in the intersection of a cone and the octree model. This is shown in the following 2D images:

      I don’t know what you mean by ‘scaling of each pixel’, but I believe that scaling of pixels only happens in step 3. In steps 1 and 2 I completely ignore that this happens.

      I hope this answers your questions. If not, please tell which of the 3 steps you do not understand.
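      Putting the three steps above into a skeleton (hypothetical names, not the structure of the actual repository):

      void render_frame(const Octree& model, const Camera& cam) {
          for (int face = 0; face < 6; face++) {
              Quadtree occlusion(CUBEMAP_SIZE);
              occlusion.cull_against_frustum(cam, face);     // step 1: optional frustum culling
              render_face(model, cam.position, face,
                          occlusion, cubemap_face[face]);    // step 2: the core octree traversal
          }
          draw_cubemap(cubemap_face, cam.orientation);       // step 3: ordinary cubemap rendering (OpenGL)
      }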

      1. Thank you for making it clearer. I was a bit vague with my question, sorry. What I meant by scaling is what you call ‘doing the perspective’. I see that as including a divide by Z, but you say it doesn’t. Can you explain how the perspective is calculated, as this is my only stumbling block? Thanks again.

      2. I know that perspective is not possible without division. Somewhere hidden in my algorithm is a base-2 long division.

        The quadtree node R encodes the partial result of the long division, x1 and x2 encode the remainder.

        The division is done using a trial-and-error method for multiple octree leaves simultaneously. At each step I check if the computation can still result in an answer I did not get before (occlusion culling) and check if the division can still lead to a valid result. Then I check if I still need to continue the division; if not, I’m done and a pixel can be rendered. If the three previous checks succeed, then I either recurse over all possible next partial results (the quadtree nodes), or over the 8 children of the current octree node, depending on which seems to be most useful.
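        To make “base-2 long division” concrete (a generic illustration of the primitive, not the renderer’s code): the quotient is produced one bit at a time using only comparisons, subtractions and shifts, and per the description above the algorithm spreads this kind of computation over the quadtree recursion.

        #include <cstdint>
        #include <cstdio>

        uint32_t long_div(uint32_t a, uint32_t b) {
            uint32_t q = 0, r = 0;
            for (int i = 31; i >= 0; i--) {
                r = (r << 1) | ((a >> i) & 1);   // bring down the next bit of the dividend
                if (r >= b) {                    // trial subtraction decides this quotient bit
                    r -= b;
                    q |= (1u << i);
                }
            }
            return q;                            // the remainder is left in r
        }

        int main() {
            std::printf("%u\n", long_div(1000, 7));   // prints 142
            return 0;
        }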

      3. Bauke :
        I know that perspective is not possible without division. Somewhere hidden in my algorithm is a base-2 long division.
        The quadtree node R encodes the partial result of the long division, x1 and x2 encode the remainder.
        The division is done using a trial-and-error method for multiple octree leaves simultaneously. At each step I check if the computation can still result in an answer I did not get before (occlusion culling) and check if the division can still lead to a valid result. Then I check if I still need to continue the division; if not, I’m done and a pixel can be rendered. If the three previous checks succeed, then I either recurse over all possible next partial results (the quadtree nodes), or over the 8 children of the current octree node, depending on which seems to be most useful.

        OK, that’s cool. Have you tried not calculating the perspective below some small size like 16 pixels, which I believe the Quake engine does, as well as Meagher’s algorithm? This is where the perspective distortion is not generally seen at a certain level. Perhaps analogous to using linear interpolation of the parent screen perspective calculations, I suppose. Or is that impossible with this method?

  22. OK, after taking a break from it, I’ve got a better idea of how this works now, possibly. 😉
    You’re culling the octree data outside each quad’s frustum until you either hit the end of the octree or you get to a pixel size, at which point you simply render the current octree’s colour.
    Which is beautiful, as there are no projections, just rejections of everything that can’t be in that pixel! Is that correct?
    Here’s another take on it:
    http://dip.sun.ac.za/~henri/quadoctcull.html

      1. That’s a great link, and I think I understand the octree order traversal part, it was just the perspective transform that totally baffled me.
        I may not be reading your code correctly, but it seems possible to do a single compare for each plane by using a plane-to-sphere test instead of the octree corners, which appear to use 16 dot products in total?

      2. I guess you’re referring to these lines:

        if (x+d-(1-DX)*(xp+dp)<=-ONE || ONE<=x-(1+DX)*xp) return false;
        if (y+d-(1-DY)*(yp+dp)<=-ONE || ONE<=y-(1+DY)*yp) return false;

        These are the 4 checks, one for each plane. The numbers DX and DY are template parameters, and are optimized away. The multiplication is also optimized away. So the checks only consist of additions, subtractions and bit shifts. I don't think that sphere checking would be any faster.

        The code in SubFaceRenderer::traverse includes multiplications, but they are all such that the compiler will optimize them away.
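        Spelled out, after substituting the template parameter (a mechanical rewrite of the x checks quoted above; the y checks are analogous):

        // DX = +1: (1-DX) = 0 and (1+DX) = 2, so the first product vanishes
        // and the second becomes a doubling, i.e. a shift:
        if (x+d <= -ONE || ONE <= x-2*xp) return false;

        // DX = -1: the roles swap.
        if (x+d-2*(xp+dp) <= -ONE || ONE <= x) return false;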

  23. bcmpinc :
    I guess you’re referring to these lines:
    if (x+d-(1-DX)*(xp+dp)<=-ONE || ONE<=x-(1+DX)*xp) return false;
    if (y+d-(1-DY)*(yp+dp)<=-ONE || ONE<=y-(1+DY)*yp) return false;
    These are the 4 checks, one for each plane. The numbers DX and DY are template parameters, and are optimized away. The multiplication is also optimized away. So the checks only consist of additions, subtractions and bit shifts. I don’t think that sphere checking would be any faster.
    The code in SubFaceRenderer::traverse includes multiplications, but they are all such that the compiler will optimize them away.

    Hmm, I guess I don’t understand as much as I thought. 😦 What do these lines do?

    // Check if entirely outside of frustum.
    for (int j=0; j<4; j++) {
        if (
            glm::dot(glm::dvec3(x1,y1,SIZE), normals[j])<0 &&
            glm::dot(glm::dvec3(x2,y1,SIZE), normals[j])<0 &&
            glm::dot(glm::dvec3(x1,y2,SIZE), normals[j])<0 &&
            glm::dot(glm::dvec3(x2,y2,SIZE), normals[j])<0 )
        {
            map[i]=0;
            return;
        }
    }

    1. They’re for step 1 of my algorithm. It determines whether a quadtree node is fully outside of the viewing frustum. There are also a few lines that check if the quadtree node is fully inside the frustum. Together, they determine which leaf nodes of the quadtree must be initialized as ‘needing to be rendered’ and which are set to ‘do not need rendering’.

      This step takes about 0.5ms of the 50-100ms processing required per frame, so there is no need to further optimize it.

      1. Right, OK thanks. Sorry for the ongoing questions. : ) So it just checks to see which parts of the cube-map need to be updated, depending on the camera angle.
        Can you explain a little how this checks an octree and a plane? It seems very abstract to me, especially as it appears to only use 2D:-
        if (x+d-(1-DX)*(xp+dp)<=-ONE || ONE<=x-(1+DX)*xp) return false;
        if (y+d-(1-DY)*(yp+dp)<=-ONE || ONE<=y-(1+DY)*yp) return false;

        Thanks, again for your ongoing explanations.

      2. The formulas are simplified so much that you can indeed consider it a 2D check as well.

        There is the octree node: (ONE,ONE,ONE)-(-ONE,-ONE,-ONE),
        and the frustum: (x,y,ONE)-(x+d,y+d,ONE) for the far plane and (x-xp,y-yp,-ONE)-(x-xp+d-dp,y-yp+d-dp,-ONE) for the near plane.

        The check ignores the near and far planes of the frustum, as these coincide with the front and back face of the octree node.

        The remaining 4 planes are used to check whether the octree node and frustum intersect. These checks are done once within the near plane and once within the far plane, and thus are 2D.

        This gives 8 checks, of which 4 can be dropped, because the DX and DY values tell something about the shape of the frustum. The DX value, which is either 1 or -1, tells whether xp >= 0 or xp+dp <= 0, and DY does the same for yp.

  24. I managed to increase the FPS by 30%. It is now rendering at approximately 20 frames per second.

    The occlusion checks:
    if (x+d-(1-DX)*(xp+dp)<=-ONE || ONE<=x-(1+DX)*xp) return false;
    if (y+d-(1-DY)*(yp+dp)<=-ONE || ONE<=y-(1+DY)*yp) return false;
    have been moved to before the recursive call.

    1. Sounds sweet.
      Do you use pixel buffer objects for image uploads?
      In “octree_draw.cpp” I see that with prepare_cubemap() you send the cubemap faces down to the GPU with glTexImage2D. Pixel buffer objects may be used efficiently here for asynchronous data transfer. I mean, while uploading to OpenGL, other stuff can happen on the CPU in the meantime, like HDD streaming for the next frame (a generic sketch of this follows at the end of this comment).

      Any hope for a windows build btw?
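      For reference, the PBO upload path being suggested looks roughly like this (generic OpenGL 2.1 / ARB_pixel_buffer_object usage; face_pixels, face_bytes, width and height are placeholder names, and this is not code from octree_draw.cpp):

      GLuint pbo;
      glGenBuffers(1, &pbo);
      glBindBuffer(GL_PIXEL_UNPACK_BUFFER, pbo);
      glBufferData(GL_PIXEL_UNPACK_BUFFER, face_bytes, NULL, GL_STREAM_DRAW);

      void* dst = glMapBuffer(GL_PIXEL_UNPACK_BUFFER, GL_WRITE_ONLY);
      memcpy(dst, face_pixels, face_bytes);            // CPU fills the buffer
      glUnmapBuffer(GL_PIXEL_UNPACK_BUFFER);

      // With a PBO bound, the data argument is an offset into the buffer and the
      // call can return before the transfer to the texture has finished.
      glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA8, width, height, 0,
                   GL_BGRA, GL_UNSIGNED_BYTE, (void*)0);
      glBindBuffer(GL_PIXEL_UNPACK_BUFFER, 0);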

      1. Using pixel buffer objects gives about a 5 msec gain. This gives a 10% speedup, which is useful, but then there is still the query step, which takes 45 msec on average. Unless I can get that 45 msec down to 30 msec, I do not really feel the need to use pixel buffer objects.

        I’m too lazy to port the code to windows (though others have done so already). A windows build would only be partially useful as I still require the users to generate their datasets themselves, which requires changing and recompiling the code.

        Some tiny datasets are included though.

  25. As in, what is the checklist to be considered unlimited detail? You’re using points in the last video. The Euclideon island ran at only 15-25 fps at a resolution of 1900×600, having rendered twice (reflection); can you not do this in your system? Does yours have limitations on tree depth? What is the agreed “checklist” of necessities to be considered UD?

    1. One small note: Euclideon did not do reflection in the island demo. They had only copied the island upside down in the dataset, so that requires only one rendering per frame.

      Currently my algorithm is running at 20 fps on average, with a resolution of 1024×768. However, for some scenes it can drop to 10 fps and occasionally, frames take >150ms. At full HD, I get ~4 fps.

      There is no limit on tree depth. (The current limit is hard-coded to 20. There is, however, no reason why this cannot be increased to 60, which would require 64-bit integers for the current location. Higher tree depths would require a higher precision of position, or would be pointless otherwise.)

      There is no checklist. However, to definitely call this “unlimited detail” I want:
      – Rendering at full HD resolution: 1920×1080;
      – Rendering reliably at 30 FPS.

      However, this is probably the only open-source implementation that comes close enough to be considered UD. (Feel free to prove me wrong 😛 )

      1. Yes Bauke, yours is the only open-source one 🙂 I think that is different from Dell, though, which in fact is great.
        Maybe b@b has another one?
        I believe that Dave H may have something, but he’s a perfectionist.

  26. Recently I realized that maybe I could replace the occlusion quadtree by a 2D Fenwick tree (aka binary indexed tree), thereby eliminating the quadtree recursion. However, Fenwick tree queries and updates take logarithmic time. So it is really up to which has the larger constant factor.
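    For reference, this is the standard 2D Fenwick tree he is referring to (generic textbook version, not his implementation); for occlusion one would add 1 for each rendered pixel and test whether a rectangle’s sum equals its area, each operation costing O(log² n):

    #include <cstdint>
    #include <vector>

    struct Fenwick2D {
        int n;
        std::vector<int> t;                        // 1-based, (n+1) x (n+1)
        explicit Fenwick2D(int n) : n(n), t((n + 1) * (n + 1), 0) {}

        void add(int x, int y, int v) {            // point update
            for (int i = x; i <= n; i += i & -i)
                for (int j = y; j <= n; j += j & -j)
                    t[i * (n + 1) + j] += v;
        }
        int prefix(int x, int y) const {           // sum over [1..x] x [1..y]
            int s = 0;
            for (int i = x; i > 0; i -= i & -i)
                for (int j = y; j > 0; j -= j & -j)
                    s += t[i * (n + 1) + j];
            return s;
        }
    };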

  27. If you cannot read from a hard drive directly, then you can be sure the traversal is incorrect. I’ve so far got low-level file reading operations to only reach 100k per 250ms;

    a budget much less than 100k is preferred to avoid 4 frames per second. Your traversal uses rays, which is something Dell stated he did not use, and made very clear. Performance-wise, it is nice you’re getting 20fps at 1024×768.. but with rays, you’re always going to be capped on the CPU without something more like multiple core usage.. when you say it doesn’t matter if the depth is at 60 or not, I gather that is because you’re applying adaptive hitting.. meaning if the radius of the node is less than a pixel (or some threshold such as two for lower detail) then you do not need to traverse deeper.. with that, you’re still getting 1024×768 node traversals AT best.. this will never stream from a hard drive unless you come up with something clever..

    Bauke, I feel you will be next. I am terrible at optimization tricks.. unlike what you’ve done in your code.. I wish I could tell you more.

    1. Part of my algorithm is the bundling of rays, such that only the last few nodes are traversed for only a single pixel. And indeed, I break the traversal once I hit an octree node with a size of at most one pixel.

      As for optimization tricks, the octree nodes are sorted in Hilbert curve order for each layer. This allows multiple nodes to be read in a single read (disk seeks are slow, requiring 10ms). Theoretically it should be at most 4 reads per octree level.

      However, my algorithm usually runs from the file system cache residing in RAM. With 8GB, all models I’ve tested would eventually be loaded entirely into RAM. The 20fps is obtained only when the model is entirely in RAM. When any disk access is required, a single frame takes 200-800ms to render.

  28. I see.. sorting into these different orders is pretty important for many other reasons than a better compressed structure. Thanks for the heads-up.

  29. I am looking at your octree.. I thought the point of ordering these things was to get rid of the need for pointers, but in your octant struct I see uint32_t nodes[8]; .. I guess I may never understand the ordering stuff 😛

    1. The sorting is only used for faster reading. Due to how the hardware works, I can only read chunks of 16kB-1MB, and such a read takes about 10ms. The sorting causes nodes that are close together to be close in the file, hence I require fewer reads. The sorting is not used for compression at all. Actually, my conversion algorithm does not compress at all.

  30. I’ve implemented the Fenwick tree, but while trying to use it I found out that the quadtree traversal is used to do the long division. So replacing the quadtree means that I need a different way to compute the division required for the perspective projection. I’m now looking into whether I can eliminate the cubemap.

    1. We really need to stick with algorithmic, 2D-plane rendering of 3D-located pixels rather than esoteric machinations of what it finally looks like. It’s the location of those fat pixels that is relevant, and nothing else, surely?
