Weka Wonkiness
Weka has the notion of “dense instances” (just called “instances” pre-3.7) and “sparse instances”, where the former allocates memory for storing a value for every feature, and the latter only for nonzero and missing values. Sounds awesome, right? There are a few gotchas though. First, the constructor signature:
SparseInstance(double weight, double[] attValues, int[] indices, int maxNumValues)
There is no documentation that states exactly what maxNumValues should be set to, but I know it must be at least as big as the number of values you pass in. There is a catch, though. If you pass in values that contain zeros (such as in nominals or otherwise), they will be stripped out of the data. However, instance.numAttributes() will not reflect this change; it will always be equal to the value you pass in for maxNumValues. instance.numValues() will, however, reflect this change, so be careful. Writing a method to iterate over the set values of a SparseInstance is therefore less trivial than perhaps it could be…
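A toy model may make the mismatch between the two counts easier to see. The class SparseRowSketch below is hypothetical and is not Weka's code. It borrows the method names numAttributes(), numValues(), index(int) and valueSparse(int) from Weka's Instance, and assumes they behave the way this post describes.

```java
import java.util.Arrays;

// Toy stand-in for the behaviour described in the post; NOT Weka's implementation.
final class SparseRowSketch {
    private final int[] indices;      // attribute indices that survived construction
    private final double[] values;    // the matching non-zero values
    private final int numAttributes;  // whatever was passed in as maxNumValues

    SparseRowSketch(double[] attValues, int[] attIndices, int maxNumValues) {
        int kept = 0;
        int[] idx = new int[attIndices.length];
        double[] vals = new double[attValues.length];
        for (int i = 0; i < attValues.length; i++) {
            if (attValues[i] != 0) {  // zeros are silently dropped
                idx[kept] = attIndices[i];
                vals[kept] = attValues[i];
                kept++;
            }
        }
        this.indices = Arrays.copyOf(idx, kept);
        this.values = Arrays.copyOf(vals, kept);
        this.numAttributes = maxNumValues;
    }

    int numAttributes() { return numAttributes; }   // unaffected by dropped zeros
    int numValues() { return values.length; }       // counts stored entries only
    int index(int position) { return indices[position]; }
    double valueSparse(int position) { return values[position]; }

    // Walks only the stored entries, mapping each position back to its attribute index.
    static String describe(SparseRowSketch row) {
        StringBuilder sb = new StringBuilder();
        for (int p = 0; p < row.numValues(); p++) {
            if (p > 0) sb.append(", ");
            sb.append(row.index(p)).append('=').append(row.valueSparse(p));
        }
        return sb.toString();
    }
}
```

The point of the sketch is the loop bound: iterate up to numValues() and translate each position through index(position), rather than looping up to numAttributes().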
The next gotcha can be found only in the description of the constructor from above:
Constructor that inititalizes instance variable with given values. Reference to the dataset is set to null. (ie. the instance doesn't have access to information about the attribute types) *Note that the indices need to be sorted in ascending order*. Otherwise things won't work properly.
Note the “indices need to be sorted in ascending order” bit. Otherwise things really don’t work properly.
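One way to guard against unsorted input is to sort the index and value arrays together before handing them to the constructor. The helper below is only a sketch: SortedSparseInput is a hypothetical name, it uses nothing from Weka, and it assumes the two arrays are parallel (values[i] belongs to indices[i]).

```java
import java.util.Arrays;
import java.util.Comparator;

// Hypothetical helper: reorders parallel index/value arrays by ascending index.
final class SortedSparseInput {
    final int[] indices;
    final double[] values;

    private SortedSparseInput(int[] indices, double[] values) {
        this.indices = indices;
        this.values = values;
    }

    static SortedSparseInput of(int[] indices, double[] values) {
        if (indices.length != values.length) {
            throw new IllegalArgumentException("indices and values must have the same length");
        }
        // Sort a permutation of positions by the attribute index found at each position.
        Integer[] order = new Integer[indices.length];
        for (int i = 0; i < order.length; i++) {
            order[i] = i;
        }
        Arrays.sort(order, Comparator.comparingInt(i -> indices[i]));

        // Apply the same permutation to both arrays so the pairs stay together.
        int[] sortedIndices = new int[indices.length];
        double[] sortedValues = new double[values.length];
        for (int k = 0; k < order.length; k++) {
            sortedIndices[k] = indices[order[k]];
            sortedValues[k] = values[order[k]];
        }
        return new SortedSparseInput(sortedIndices, sortedValues);
    }
}
```

The resulting indices and values arrays could then be passed on as the constructor's index and value arguments.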
Another gotcha has to do with string values. Weka uses a lookup table for attributes and stores the index in that lookup table as the value for all nominal and string attributes. When using SparseInstance, this means that any time there is a 0 for these, they will be omitted from the value array. For nominals this may not be so bad, but for strings you probably do not want this behavior (you want to recover the string!). As a workaround, I found a thread that suggested adding a dummy value as the first value for the attribute, so that no instance references it and all real values have an index > 0. However, at least with Weka 3.6.x, this value had to be non-empty for it to stick (?).
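The dummy-value workaround can be illustrated with a toy lookup table. StringTableSketch is a hypothetical class, not Weka's Attribute. It only assumes, as described here, that a string is stored as its table index and that a stored 0 is dropped by the sparse representation.

```java
import java.util.ArrayList;
import java.util.List;

// Toy lookup table; NOT Weka's Attribute. Index 0 is reserved for a dummy entry.
final class StringTableSketch {
    private final List<String> table = new ArrayList<>();

    StringTableSketch(String dummy) {
        // Mirrors the observation in the post that an empty dummy did not stick in 3.6.x.
        if (dummy.isEmpty()) {
            throw new IllegalArgumentException("dummy value must be non-empty");
        }
        table.add(dummy);  // occupies index 0 so that no real string ever gets it
    }

    // Returns the index that would be stored as the instance's (double) value.
    // Adding the dummy string itself would still return 0, so pick a dummy
    // that cannot occur in the data.
    double add(String s) {
        int at = table.indexOf(s);
        if (at < 0) {
            table.add(s);
            at = table.size() - 1;
        }
        return at;
    }

    // Recovers the string from the stored index.
    String lookup(double stored) {
        return table.get((int) stored);
    }
}
```

With the dummy in place, every real string maps to an index of 1 or more, so none of them looks like a zero to the sparse representation.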
Further, though an instance has a stringValue method, this can only be safely called on string attributes. Similarly, the value method will always return the double representation, which may be the index to the actual value if it is nominal or string. Instead, one must use the instance.toString(int) and instance.toString(Attribute) methods to safely get back the real values of attributes for a given instance.