Package org.apache.lucene.util.bkd
Class BKDWriter
- java.lang.Object
-
- org.apache.lucene.util.bkd.BKDWriter
-
- All Implemented Interfaces:
java.io.Closeable
,java.lang.AutoCloseable
public class BKDWriter extends java.lang.Object implements java.io.Closeable
Recursively builds a block KD-tree to assign all incoming points in N-dim space to smaller and smaller N-dim rectangles (cells) until the number of points in a given rectangle is <=maxPointsInLeafNode
. The tree is fully balanced, which means the leaf nodes will have between 50% and 100% of the requestedmaxPointsInLeafNode
. Values that fall exactly on a cell boundary may be in either cell.The number of dimensions can be 1 to 8, but every byte[] value is fixed length.
This consumes heap during writing: it allocates a
Long[numLeaves]
, abyte[numLeaves*(1+bytesPerDim)]
and then uses up to the specifiedmaxMBSortInHeap
heap space for writing.NOTE: This can write at most Integer.MAX_VALUE *
maxPointsInLeafNode
/ (1+bytesPerDim) total points.
-
-
Nested Class Summary
Nested Classes Modifier and Type Class Description private static class
BKDWriter.BKDMergeQueue
private static class
BKDWriter.MergeReader
private class
BKDWriter.OneDimensionBKDWriter
-
Field Summary
Fields Modifier and Type Field Description protected int
bytesPerDim
How many bytes each value in each dimension takes.private int
bytesPerDoc
How many bytes each docs takes in the fixed-width offline formatstatic java.lang.String
CODEC_NAME
(package private) int[]
commonPrefixLengths
static float
DEFAULT_MAX_MB_SORT_IN_HEAP
Default maximum heap to use, before spilling to (slower) diskstatic int
DEFAULT_MAX_POINTS_IN_LEAF_NODE
Default maximum number of point in each leaf blockprotected FixedBitSet
docsSeen
private boolean
finished
static int
MAX_DIMS
Maximum number of dimensionsprivate int
maxDoc
(package private) double
maxMBSortInHeap
protected byte[]
maxPackedValue
Maximum per-dim values, packedprotected int
maxPointsInLeafNode
private int
maxPointsSortInHeap
protected byte[]
minPackedValue
Minimum per-dim values, packedprotected int
numDataDims
How many dimensions we are storing at the leaf (data) nodesprotected int
numIndexDims
How many dimensions we are indexing in the internal nodesprotected int
packedBytesLength
numDataDims * bytesPerDimprotected int
packedIndexBytesLength
numIndexDims * bytesPerDimprotected long
pointCount
private PointWriter
pointWriter
(package private) byte[]
scratch1
(package private) byte[]
scratch2
(package private) BytesRef
scratchBytesRef1
(package private) BytesRef
scratchBytesRef2
(package private) byte[]
scratchDiff
private GrowableByteArrayDataOutput
scratchOut
private static int
SPLITS_BEFORE_EXACT_BOUNDS
Number of splits before we compute the exact bounding box of an inner node.(package private) TrackingDirectoryWrapper
tempDir
(package private) java.lang.String
tempFileNamePrefix
private IndexOutput
tempInput
private long
totalPointCount
An upper bound on how many points the caller will add (includes deletions)static int
VERSION_CURRENT
static int
VERSION_LEAF_STORES_BOUNDS
static int
VERSION_LOW_CARDINALITY_LEAVES
static int
VERSION_SELECTIVE_INDEXING
static int
VERSION_START
-
Method Summary
All Methods Static Methods Instance Methods Concrete Methods Modifier and Type Method Description void
add(byte[] packedValue, int docID)
private int
appendBlock(RAMOutputStream writeBuffer, java.util.List<byte[]> blocks)
Appends the current contents of writeBuffer as another block on the growing in-memory fileprivate void
build(int nodeID, int leafNodeOffset, MutablePointValues reader, int from, int to, IndexOutput out, byte[] minPackedValue, byte[] maxPackedValue, int[] parentSplits, byte[] splitPackedValues, long[] leafBlockFPs, int[] spareDocIds)
private void
build(int nodeID, int leafNodeOffset, BKDRadixSelector.PathSlice points, IndexOutput out, BKDRadixSelector radixSelector, byte[] minPackedValue, byte[] maxPackedValue, int[] parentSplits, byte[] splitPackedValues, long[] leafBlockFPs, int[] spareDocIds)
The point writer contains the data that is going to be splitted using radix selection.private void
checkMaxLeafNodeCount(int numLeaves)
void
close()
private void
computeCommonPrefixLength(HeapPointWriter heapPointWriter, byte[] commonPrefix, int from, int to)
private static BytesRef[]
computeMinMax(int count, java.util.function.IntFunction<BytesRef> packedValues, int offset, int length)
Return an array that contains the min and max values for the [offset, offset+length] interval of the givenBytesRef
s.private void
computePackedValueBounds(MutablePointValues values, int from, int to, byte[] minPackedValue, byte[] maxPackedValue, BytesRef scratch)
private void
computePackedValueBounds(BKDRadixSelector.PathSlice slice, byte[] minPackedValue, byte[] maxPackedValue)
long
finish(IndexOutput out)
Writes the BKD tree to the providedIndexOutput
and returns the file offset where index was written.private long
getLeftMostLeafBlockFP(long[] leafBlockFPs, int nodeID)
long
getPointCount()
How many points have been added so farprivate void
initPointWriter()
long
merge(IndexOutput out, java.util.List<MergeState.DocMap> docMaps, java.util.List<BKDReader> readers)
More efficient bulk-add for incomingBKDReader
s.private byte[]
packIndex(long[] leafBlockFPs, byte[] splitPackedValues)
Packs the two arrays, representing a balanced binary tree, into a compact byte[] structure.private int
recursePackIndex(RAMOutputStream writeBuffer, long[] leafBlockFPs, byte[] splitPackedValues, long minBlockFP, java.util.List<byte[]> blocks, int nodeID, byte[] lastSplitValues, boolean[] negativeDeltas, boolean isLeft)
lastSplitValues is per-dimension split value previously seen; we use this to prefix-code the split byte[] on each inner nodeprivate void
rotateToTree(int nodeID, int offset, int count, byte[] index, java.util.List<byte[]> leafBlockStartValues)
private static int
runLen(java.util.function.IntFunction<BytesRef> packedValues, int start, int end, int byteOffset)
protected int
split(byte[] minPackedValue, byte[] maxPackedValue, int[] parentSplits)
Pick the next dimension to split.private HeapPointWriter
switchToHeap(PointWriter source)
Pull a partition back into heap once the point count is low enough while recursing.private boolean
valueInBounds(BytesRef packedValue, byte[] minPackedValue, byte[] maxPackedValue)
Called only in assertprivate boolean
valueInOrder(long ord, int sortedDim, byte[] lastPackedValue, byte[] packedValue, int packedValueOffset, int doc, int lastDoc)
private boolean
valuesInOrderAndBounds(int count, int sortedDim, byte[] minPackedValue, byte[] maxPackedValue, java.util.function.IntFunction<BytesRef> values, int[] docs, int docsOffset)
private java.lang.Error
verifyChecksum(java.lang.Throwable priorException, PointWriter writer)
Called on exception, to check whether the checksum is also corrupt in this source, and add that information (checksum matched or didn't) as a suppressed exception.static void
verifyParams(int numDataDims, int numIndexDims, int maxPointsInLeafNode, double maxMBSortInHeap, long totalPointCount)
private void
writeActualBounds(DataOutput out, int[] commonPrefixLengths, int count, java.util.function.IntFunction<BytesRef> packedValues)
private void
writeCommonPrefixes(DataOutput out, int[] commonPrefixes, byte[] packedValue)
long
writeField(IndexOutput out, java.lang.String fieldName, MutablePointValues reader)
Write a field from aMutablePointValues
.private long
writeField1Dim(IndexOutput out, java.lang.String fieldName, MutablePointValues reader)
private long
writeFieldNDims(IndexOutput out, java.lang.String fieldName, MutablePointValues values)
private void
writeHighCardinalityLeafBlockPackedValues(DataOutput out, int[] commonPrefixLengths, int count, int sortedDim, java.util.function.IntFunction<BytesRef> packedValues, int compressedByteOffset)
private void
writeIndex(IndexOutput out, int countPerLeaf, int numLeaves, byte[] packedIndex)
private void
writeIndex(IndexOutput out, int countPerLeaf, long[] leafBlockFPs, byte[] splitPackedValues)
private void
writeLeafBlockDocs(DataOutput out, int[] docIDs, int start, int count)
private void
writeLeafBlockPackedValues(DataOutput out, int[] commonPrefixLengths, int count, int sortedDim, java.util.function.IntFunction<BytesRef> packedValues, int leafCardinality)
private void
writeLeafBlockPackedValuesRange(DataOutput out, int[] commonPrefixLengths, int start, int end, java.util.function.IntFunction<BytesRef> packedValues)
private void
writeLowCardinalityLeafBlockPackedValues(DataOutput out, int[] commonPrefixLengths, int count, java.util.function.IntFunction<BytesRef> packedValues)
-
-
-
Field Detail
-
CODEC_NAME
public static final java.lang.String CODEC_NAME
- See Also:
- Constant Field Values
-
VERSION_START
public static final int VERSION_START
- See Also:
- Constant Field Values
-
VERSION_LEAF_STORES_BOUNDS
public static final int VERSION_LEAF_STORES_BOUNDS
- See Also:
- Constant Field Values
-
VERSION_SELECTIVE_INDEXING
public static final int VERSION_SELECTIVE_INDEXING
- See Also:
- Constant Field Values
-
VERSION_LOW_CARDINALITY_LEAVES
public static final int VERSION_LOW_CARDINALITY_LEAVES
- See Also:
- Constant Field Values
-
VERSION_CURRENT
public static final int VERSION_CURRENT
- See Also:
- Constant Field Values
-
bytesPerDoc
private final int bytesPerDoc
How many bytes each docs takes in the fixed-width offline format
-
DEFAULT_MAX_POINTS_IN_LEAF_NODE
public static final int DEFAULT_MAX_POINTS_IN_LEAF_NODE
Default maximum number of point in each leaf block- See Also:
- Constant Field Values
-
DEFAULT_MAX_MB_SORT_IN_HEAP
public static final float DEFAULT_MAX_MB_SORT_IN_HEAP
Default maximum heap to use, before spilling to (slower) disk- See Also:
- Constant Field Values
-
MAX_DIMS
public static final int MAX_DIMS
Maximum number of dimensions- See Also:
- Constant Field Values
-
SPLITS_BEFORE_EXACT_BOUNDS
private static final int SPLITS_BEFORE_EXACT_BOUNDS
Number of splits before we compute the exact bounding box of an inner node.- See Also:
- Constant Field Values
-
numDataDims
protected final int numDataDims
How many dimensions we are storing at the leaf (data) nodes
-
numIndexDims
protected final int numIndexDims
How many dimensions we are indexing in the internal nodes
-
bytesPerDim
protected final int bytesPerDim
How many bytes each value in each dimension takes.
-
packedBytesLength
protected final int packedBytesLength
numDataDims * bytesPerDim
-
packedIndexBytesLength
protected final int packedIndexBytesLength
numIndexDims * bytesPerDim
-
tempDir
final TrackingDirectoryWrapper tempDir
-
tempFileNamePrefix
final java.lang.String tempFileNamePrefix
-
maxMBSortInHeap
final double maxMBSortInHeap
-
scratchDiff
final byte[] scratchDiff
-
scratch1
final byte[] scratch1
-
scratch2
final byte[] scratch2
-
scratchBytesRef1
final BytesRef scratchBytesRef1
-
scratchBytesRef2
final BytesRef scratchBytesRef2
-
commonPrefixLengths
final int[] commonPrefixLengths
-
docsSeen
protected final FixedBitSet docsSeen
-
pointWriter
private PointWriter pointWriter
-
finished
private boolean finished
-
tempInput
private IndexOutput tempInput
-
maxPointsInLeafNode
protected final int maxPointsInLeafNode
-
maxPointsSortInHeap
private final int maxPointsSortInHeap
-
minPackedValue
protected final byte[] minPackedValue
Minimum per-dim values, packed
-
maxPackedValue
protected final byte[] maxPackedValue
Maximum per-dim values, packed
-
pointCount
protected long pointCount
-
totalPointCount
private final long totalPointCount
An upper bound on how many points the caller will add (includes deletions)
-
maxDoc
private final int maxDoc
-
scratchOut
private final GrowableByteArrayDataOutput scratchOut
-
-
Constructor Detail
-
BKDWriter
public BKDWriter(int maxDoc, Directory tempDir, java.lang.String tempFileNamePrefix, int numDataDims, int numIndexDims, int bytesPerDim, int maxPointsInLeafNode, double maxMBSortInHeap, long totalPointCount) throws java.io.IOException
- Throws:
java.io.IOException
-
-
Method Detail
-
verifyParams
public static void verifyParams(int numDataDims, int numIndexDims, int maxPointsInLeafNode, double maxMBSortInHeap, long totalPointCount)
-
initPointWriter
private void initPointWriter() throws java.io.IOException
- Throws:
java.io.IOException
-
add
public void add(byte[] packedValue, int docID) throws java.io.IOException
- Throws:
java.io.IOException
-
getPointCount
public long getPointCount()
How many points have been added so far
-
writeField
public long writeField(IndexOutput out, java.lang.String fieldName, MutablePointValues reader) throws java.io.IOException
Write a field from aMutablePointValues
. This way of writing points is faster than regular writes withadd(byte[], int)
since there is opportunity for reordering points before writing them to disk. This method does not use transient disk in order to reorder points.- Throws:
java.io.IOException
-
computePackedValueBounds
private void computePackedValueBounds(MutablePointValues values, int from, int to, byte[] minPackedValue, byte[] maxPackedValue, BytesRef scratch)
-
writeFieldNDims
private long writeFieldNDims(IndexOutput out, java.lang.String fieldName, MutablePointValues values) throws java.io.IOException
- Throws:
java.io.IOException
-
writeField1Dim
private long writeField1Dim(IndexOutput out, java.lang.String fieldName, MutablePointValues reader) throws java.io.IOException
- Throws:
java.io.IOException
-
merge
public long merge(IndexOutput out, java.util.List<MergeState.DocMap> docMaps, java.util.List<BKDReader> readers) throws java.io.IOException
More efficient bulk-add for incomingBKDReader
s. This does a merge sort of the already sorted values and currently only works when numDims==1. This returns -1 if all documents containing dimensional values were deleted.- Throws:
java.io.IOException
-
rotateToTree
private void rotateToTree(int nodeID, int offset, int count, byte[] index, java.util.List<byte[]> leafBlockStartValues)
-
checkMaxLeafNodeCount
private void checkMaxLeafNodeCount(int numLeaves)
-
finish
public long finish(IndexOutput out) throws java.io.IOException
Writes the BKD tree to the providedIndexOutput
and returns the file offset where index was written.- Throws:
java.io.IOException
-
packIndex
private byte[] packIndex(long[] leafBlockFPs, byte[] splitPackedValues) throws java.io.IOException
Packs the two arrays, representing a balanced binary tree, into a compact byte[] structure.- Throws:
java.io.IOException
-
appendBlock
private int appendBlock(RAMOutputStream writeBuffer, java.util.List<byte[]> blocks) throws java.io.IOException
Appends the current contents of writeBuffer as another block on the growing in-memory file- Throws:
java.io.IOException
-
recursePackIndex
private int recursePackIndex(RAMOutputStream writeBuffer, long[] leafBlockFPs, byte[] splitPackedValues, long minBlockFP, java.util.List<byte[]> blocks, int nodeID, byte[] lastSplitValues, boolean[] negativeDeltas, boolean isLeft) throws java.io.IOException
lastSplitValues is per-dimension split value previously seen; we use this to prefix-code the split byte[] on each inner node- Throws:
java.io.IOException
-
getLeftMostLeafBlockFP
private long getLeftMostLeafBlockFP(long[] leafBlockFPs, int nodeID)
-
writeIndex
private void writeIndex(IndexOutput out, int countPerLeaf, long[] leafBlockFPs, byte[] splitPackedValues) throws java.io.IOException
- Throws:
java.io.IOException
-
writeIndex
private void writeIndex(IndexOutput out, int countPerLeaf, int numLeaves, byte[] packedIndex) throws java.io.IOException
- Throws:
java.io.IOException
-
writeLeafBlockDocs
private void writeLeafBlockDocs(DataOutput out, int[] docIDs, int start, int count) throws java.io.IOException
- Throws:
java.io.IOException
-
writeLeafBlockPackedValues
private void writeLeafBlockPackedValues(DataOutput out, int[] commonPrefixLengths, int count, int sortedDim, java.util.function.IntFunction<BytesRef> packedValues, int leafCardinality) throws java.io.IOException
- Throws:
java.io.IOException
-
writeLowCardinalityLeafBlockPackedValues
private void writeLowCardinalityLeafBlockPackedValues(DataOutput out, int[] commonPrefixLengths, int count, java.util.function.IntFunction<BytesRef> packedValues) throws java.io.IOException
- Throws:
java.io.IOException
-
writeHighCardinalityLeafBlockPackedValues
private void writeHighCardinalityLeafBlockPackedValues(DataOutput out, int[] commonPrefixLengths, int count, int sortedDim, java.util.function.IntFunction<BytesRef> packedValues, int compressedByteOffset) throws java.io.IOException
- Throws:
java.io.IOException
-
writeActualBounds
private void writeActualBounds(DataOutput out, int[] commonPrefixLengths, int count, java.util.function.IntFunction<BytesRef> packedValues) throws java.io.IOException
- Throws:
java.io.IOException
-
computeMinMax
private static BytesRef[] computeMinMax(int count, java.util.function.IntFunction<BytesRef> packedValues, int offset, int length)
Return an array that contains the min and max values for the [offset, offset+length] interval of the givenBytesRef
s.
-
writeLeafBlockPackedValuesRange
private void writeLeafBlockPackedValuesRange(DataOutput out, int[] commonPrefixLengths, int start, int end, java.util.function.IntFunction<BytesRef> packedValues) throws java.io.IOException
- Throws:
java.io.IOException
-
runLen
private static int runLen(java.util.function.IntFunction<BytesRef> packedValues, int start, int end, int byteOffset)
-
writeCommonPrefixes
private void writeCommonPrefixes(DataOutput out, int[] commonPrefixes, byte[] packedValue) throws java.io.IOException
- Throws:
java.io.IOException
-
close
public void close() throws java.io.IOException
- Specified by:
close
in interfacejava.lang.AutoCloseable
- Specified by:
close
in interfacejava.io.Closeable
- Throws:
java.io.IOException
-
verifyChecksum
private java.lang.Error verifyChecksum(java.lang.Throwable priorException, PointWriter writer) throws java.io.IOException
Called on exception, to check whether the checksum is also corrupt in this source, and add that information (checksum matched or didn't) as a suppressed exception.- Throws:
java.io.IOException
-
valueInBounds
private boolean valueInBounds(BytesRef packedValue, byte[] minPackedValue, byte[] maxPackedValue)
Called only in assert
-
split
protected int split(byte[] minPackedValue, byte[] maxPackedValue, int[] parentSplits)
Pick the next dimension to split.- Parameters:
minPackedValue
- the min values for all dimensionsmaxPackedValue
- the max values for all dimensionsparentSplits
- how many times each dim has been split on the parent levels- Returns:
- the dimension to split
-
switchToHeap
private HeapPointWriter switchToHeap(PointWriter source) throws java.io.IOException
Pull a partition back into heap once the point count is low enough while recursing.- Throws:
java.io.IOException
-
build
private void build(int nodeID, int leafNodeOffset, MutablePointValues reader, int from, int to, IndexOutput out, byte[] minPackedValue, byte[] maxPackedValue, int[] parentSplits, byte[] splitPackedValues, long[] leafBlockFPs, int[] spareDocIds) throws java.io.IOException
- Throws:
java.io.IOException
-
computePackedValueBounds
private void computePackedValueBounds(BKDRadixSelector.PathSlice slice, byte[] minPackedValue, byte[] maxPackedValue) throws java.io.IOException
- Throws:
java.io.IOException
-
build
private void build(int nodeID, int leafNodeOffset, BKDRadixSelector.PathSlice points, IndexOutput out, BKDRadixSelector radixSelector, byte[] minPackedValue, byte[] maxPackedValue, int[] parentSplits, byte[] splitPackedValues, long[] leafBlockFPs, int[] spareDocIds) throws java.io.IOException
The point writer contains the data that is going to be splitted using radix selection. /* This method is used when we are merging previously written segments, in the numDims > 1 case.- Throws:
java.io.IOException
-
computeCommonPrefixLength
private void computeCommonPrefixLength(HeapPointWriter heapPointWriter, byte[] commonPrefix, int from, int to)
-
valuesInOrderAndBounds
private boolean valuesInOrderAndBounds(int count, int sortedDim, byte[] minPackedValue, byte[] maxPackedValue, java.util.function.IntFunction<BytesRef> values, int[] docs, int docsOffset) throws java.io.IOException
- Throws:
java.io.IOException
-
valueInOrder
private boolean valueInOrder(long ord, int sortedDim, byte[] lastPackedValue, byte[] packedValue, int packedValueOffset, int doc, int lastDoc)
-
-