Krangl Roadmap ============== For completed items see [change-log](../CHANGES.md). ## Next release https://github.com/holgerbrandl/krangl/issues * [provide a `tidyr::fill` equivalent](https://github.com/holgerbrandl/krangl/issues/61) * [consider shorter columns cast as in pandas](https://github.com/holgerbrandl/krangl/issues/60) * [Add excel import/export](https://github.com/holgerbrandl/krangl/issues/58) * Better documentation & cheatsheet * Date (column?) support * Factor (column?) support + [Add factor attribute utilities similar to methods in R package `forcats`](https://github.com/holgerbrandl/krangl/issues/47) * better spec out NA * consider use of doublearray for double/int-col along with NaN, see https://pandas.pydata.org/pandas-docs/stable/missing_data.html#working-with-missing-data * add a `pretty_column` helper * improve `iter.toDataFrame()` to include reference, getters, kotlin property getters ## Meta * inconsistenly named reader methods * `krangl.ColumnsKt#map` should have better return type * use/support compressed columns (https://github.com/lemire/JavaFastPFOR) * Better lambda receiver contexts * Performance (indices, avoid list and array copies, compressed columns) * Use dedicated return type for table formula helpers (like `mean`, `rank`) to reduce runtime errors * More bindings to other jvm data-science libraries * `Sequence` vs `Iterable`? * Pluggable backends like native or SQL * should `unfold` be better called `flatten`? * write chapter about timeseries support * https://github.com/signaflo/java-timeseries/wiki/ARIMA-models * learn from https://pandas.pydata.org/pandas-docs/stable/timeseries.html ### IO * Add parquet support https://stackoverflow.com/questions/39728854/create-parquet-files-in-java ### Core * more defined behavior/tests needed for grouped dfs that become empty after filtering ```r require(dplyr) iris %>% group_by(Species) %>% filter(Sepal.Length>100) ``` --- ### API improvements --- ### Performance * misc consider to use kotlin.collections.ArrayAsCollection * Setup up benchmarking suite List copy optimization * use iterable where possible * misc consider to use kotlin.collections.ArrayAsCollection --> get rid of toList which always does a full copy internally. * [ ] 30% flights HOTSPOT: `krangl/Extensions.kt:275` can we get rid fo the array creation? * [ ] `krangl.SimpleDataFrame.addColumn` should avoid `toMutatbleList` * [ ] More consistent use of List vs using arrays as column datastore (see [array vs list](http://stackoverflow.com/questions/716597/array-or-list-in-java-which-is-faster)). This would avoid array conversion which are omnipresent in the API at the moment. * [ ] `get rid of other `toMutableList` and use view instead * Analyze benchmark results with with kravis/krangl :-) * use for column indices to speed up access fast column storage https://github.com/lemire/JavaFastPFOR http://fastutil.di.unimi.it/ http://nd4j.org/ benchmarking https://github.com/mm-mansour/Fast-Pandas Backlog ------- * remove regrouping in core verbs where possible * consider to use invoke for row access (potentially decouple more arguable extensions in different namespace?) * provide equivalent for dplyr::summarize_each and dplyr::mutate_each [#4](https://github.com/holgerbrandl/krangl/issues/4) * `krangl.head` should use view instead of copy; also consider to use views for grouped data (see https://softwarecave.org/2014/03/19/views-in-java-collections-framework/) * koma bindings --> http://koma.kyonifer.com/ * Add a `DataFrame.transpose()` method `as_tibble(cbind(nms = names(df), t(df)))` * Integrate idoms to do enrichment testing with fisher test from [commons-math](http://commons.apache.org/proper/commons-math/apidocs/org/apache/commons/math3/distribution/HypergeometricDistribution.html) * see tablesaw changelog https://jtablesaw.github.io/tablesaw/changes_in_v_0.2 ## Rejected Ideas * directly access values with `it["foo"]` and not just column object. For the latter DataFrame.cols can be used * Not a good idea because all extension function would then be defined for common lists like List etc. It's more important to keep the namespace clear --- Provide adhoc/data class conversion for column model [adhoc](https://kotlinlang.org/docs/reference/object-declarations.html#object-expressions)/data class objects ```kotlin val dataFrame = object : DataFrame() { val x = Factor("sdf", "sdf", "sdfd") val y = DblCol(Double.MAX_VALUE, Double.MIN_VALUE) val z = y + y } val newTable = df.map{ data class Foo(val name:String)} newTable.newCol newTable.src.x ``` --> Can not work because data class is not an expression --- * improve benchmarking by avoid jmv warmup with -XX:CompileThreshold=1 [src](http://stackoverflow.com/questions/1481853/technique-or-utility-to-minimize-java-warm-up-time) --> rather continue with jmh driven benchmarking subproject -- Make use of kotlin.Number to simplify API --> Done by adding `NumberCol` but unclear how to actually benefit from it