========================================== Release Notes for Khiops 10 Version Series ========================================== Version 10.3.0 ============== Linux packages have been improved. Better integration of the file drivers (currently S3 and GCS). Streamlined distributed computing support integration. Version 10.2.4 ============== Bug fix: - Fix MPI executable detection in "Conda-based" environments on Linux and MacOS. Minor improvement: - Reduce launching time of the GUI on Windows. Version 10.2.3 ============== Khiops is now available for Ubuntu 24.04. The GUI now uses all available processor cores if instructed to do so. New version of visualization and covisualization tools, with improvements and bug fixes. Internal improvements linked to packaging, in particular with the systematic use of the 'khiops_env' script. Other minor improvements and bug fixes. Version 10.2.2 ============== On Windows, the Khiops installation program now comes with the open-source Java JRE JustJ instead of Oracle's JDK. Khiops is now available for debian 11 and 12: the library used to parallelize Khiops is now OpenMPI. MPICH remains for conda packages only. New version of visualization and covisualization tools, with many improvements and bug fixes. Minor improvements: - new '-s' flag to display system information (useful for debugging). - change in exit status: return 0 on success, 1 on fatal error (if there is an error or warning in the log file, the exit code is 0). Other minor improvements and bug fixes. Version 10.2.1 ============== Bug fix: - Fix a bug that appears when reading temporary files in multi-machines mode. Minor improvements: - improve java management in linux packages Version 10.2.0 ============== **Announcement**: This is a special release marking the open-sourcing of the Khiops AutoML suite. New Features: - Khiops can now be built in macOS. Currently only a conda package is available for installing (more information at https://khiops.org) Bug fix: - Fix a bug in the calculation of descriptive stats for huge databases - Fix clusters having negative typicalities in coclustering - Fix regex derivation rule not working in multi-table dictionaries - Fix `AddSeconds` rule not correctly parsing large values Version 10.1.5 ============== Bug fix: - KhiopsNativeInterface crashes with java under Linux. Signals are not intercepted anymore in KNI. Version 10.1.4 ============== Bug fix: - in the sorting algorithm: it could produce corrupted files when the separators of the input and output files are different. Version 10.1.3 ============== Bug fix: - better error management on non standard file systems (e.g. s3 or hdfs) - the results directory is correctly build in predictor evaluation on non standard file systems Version 10.1.2 ============== Bug fix: - the results path is correctly build in Khiops-coclustering when the directory is located on non standard file systems (e.g. s3 or hdfs) Version 10.1.1 ============== Bug fixes: - in the sorting algorithm : in case where the field separator in the output file was different from the input file, with the input file having fields surrounded by double-quotes and containing the input field separator - in the construction of trees: in an edge case with a single tree and categorical fields with very large number of values - in a multiple-machine cloud environment: fixed the file path management with URI - in a multiple-machine cloud environment: fixed management of specialized temporary directories per machine Improvements: - better dimensioning of the deployment task in case of multi-table schema with a large number of orphan secondary records - improved detection and diagnostics related to JAVA runtime in Khiops scripts on Windows Version 10.1 ============ Khiops 10.1 is a minor release, with few new features, but many optimization and reliability improvements, to better perform in cloud environments and better integrate Khiops within information systems. Main new features - Visualization tools: - Khiops visualization: new panel "Tree preparation" to visualize the trees. - Khiops covisualization: new tool, replacing the previous one based on the obsolete Flex framework. - Data Table Dictionaries: - A new type TimestampTZ: A timestamp with timezone: - it uses the ISO 8601 standard (see the Khiops Guide for more details). - it is detected automatically in the "Build dictionary from data table" feature. - new derivation rules for this type: - CopyTSTZ, FormatTimestampTZ, AsTimestampTZ, UtcTimestamp, LocalTimestamp, SetTimeZoneMinutes, GetTimeZoneMinutes, DiffTimestampTZ, AddSecondsTSTZ, IsTimestampTZValid, BuildTimestampTZ, GetValueTSTZ. - new automatic construction rule in "Variable construction parameters": - LocalTimestamp, to obtain a local Timestamp from a TimestampTZ - Timestamp type now accepts a format with the character 'T' as separator between date and time. - New derivation rules for ternary operator rules: IfD, IfT, IfTS, IfTSTZ. - Sample percentage field in all database dialog boxes now accepts one digit precision (ex: 2.5%). - Coclustering simplification now accepts a new constraint: the maximum total part number. Main integration improvements - License keys/tokens are not required anymore to execute Khiops: - Note that Khiops is not unlicensed, having the same legal license agreement file as before. - The options -l and -u of the Khiops executables are not available anymore. - Log, progression and output scenario files can now be redirected to /dev/stdout or /dev/stderr on Linux, with lines prefixed by the string "Khiops.log", "Khiops.progression" and Khiops.command respectively. - A new format for progression files that is simpler and easier to parse. - New executable return codes: - 1 if there were fatal errors in the execution - 2 if there were errors but no fatal errors in the execution - 0 on success - A new environment variable KHIOPS_RAW_GUI allows file name selection for URIs; it disables the file chooser dialog box in the user interface (see khiops_env command file in Khiops bin directory). - A data table file format detection more resilient to empty lines among valid lines. - Khiops now accepts UTF-8 data table files with BOM (byte order mark). - Less verbose log files. Main performance improvements - Tree construction is now parallelized. - Quantile sampling in multi-table tasks is now parallelized. - More exact resource estimations, eliminating previous overestimations and execution refusals. - The I/O lower layers have been deeply refactored, to better perform in cloud environments. Many minor corrections, mainly for edge-case bugs. Version 10.0.4 ============== - bug fix in sort algorithm : the field separator in the output file was not correct for files with a huge number of identical lines. Version 10.0.3 ============== - fix an edge-case bug in multi-table schemas, in case of a root table of moderate size and several subtables, some of them very small and some very large, and duplicate records in the root table - fix a problem of wrong line numbers reported in case or warning or errors occuring with very large data files - fix a bug in json reports for trees - the file format detector is now more resilient to empty lines in the analyzed data files Version 10.0.2 ============== - fix a bug for file systems with URI schemes (s3 or hdfs) for correct managment of the "result files directory" Version 10.0.1 ============== Improvements: - better handling of database encodings (ascii, ansi, utf8) for json report files, to make it easier to manipulate reports from pykhiops: cf. "Character encodings" section in the Khiops guide - the "Detect file format" feature now displays a message in the log window with the recognized format: used header line and field separator - the "Build dictionary" function now proposes a categorical format for fields that only contain the values "", "0", or "1" - in modeling dictionaries, the target variable is now set as "Unused" by default to facilitate the deployment of predictors in the case of deployment databases where the target variable is missing - the Accidents sample database is now translated into English, with interpretable variable names and values; a simpler version called AccidentsSummary is available - the khiops and khiops_coclustering shell commands now exploit a common shell command khiops_env which defines all env variables required by Khiops. This new command is self-documented and can be used from a wrapper, such as pykhiops - better messages and documentation in case of unsupported data formats, such as the "classic" Mac OS line endings, deprecated since Max OS X in 1998, or a UTF-8 file with BOM (byte order mark) start characters - slight improvement of the level criterion for the preparation of univariate and bivariate data in some edge cases - the "Build coclustering" button is renamed to "Train coclustering" in the Khiops coclustering tool - the environment variables for managing external resources are now KHIOPS_MEMORY_LIMIT and KHIOPS_TMP_DIR, instead of the now obsolete KhiopsTmpDir. - AUC is now 0 in the case of an empty test database. - the "Specific test database" dialog is now reset when the test database mode has changed. - other minor improvements Issues: - fix a bug in the "Extract keys" function, when the output file is specified without a header line - fix an overestimation of the memory requirement for building the trees in the case of large databases - fix a edge case bug that freezes parallel processes when learning a multi-table scheme - fix a edge case bug in the SNB classifier, in the case of a very large matrix instances x variables beyond two billion values - fix a edge case bug in the sorting functionality, when the input files contain fields between double quotes with internal quotes double quotes and/or an input field separator, and when the output field separator is different from the input one - fix a bug in te user interface, when the "Inspect dictionary" action could be called twice simultaneously, resulting in a crash - fix several resource management issues in the case of a cluster with several heteregeneous machines - Other minor corrections Version 10.0 ============ Khiops is a fully automatic tool for mining large multi-table databases, winner of several data mining challenges. Khiops components - Khiops: supervised analysis (classification, regression) and correlation study - Khiops Visualization: to visualize data preparation, modeling and evaluation results of Khiops - Khiops Coclustering: exploratory analysis using hierarchical coclustering - Khiops Covisualization: to visualize, explore and annotate coclustering results Main features - fully automatic data preparation (variable construction and preprocessing) and modeling - mining of data with single table and multi-table schemas - data preparation and modeling for classification, irrespective of the number of classes - data preparation and modeling for regression - descriptive statistics as well as correlation analysis for unsupervised data exploration - advanced unsupervised analysis via coclustering - enhanced recoding capabilities for data preparation - post-evaluation of trained predictors - robustness and scalability, with multi-gigabytes train datasets and no deployment limit - parallelization of data management, data preparation, modeling and deployment tasks - ease of use via simple user interface - interactive visualization tools for easily interpretable results - easy integration in information systems via batch mode, python library and online deployment library Khiops 10.0 - what's new - New algorithm for Selective Naive Bayes predictor - improved accuracy using a direct optimization of variable weights, - improved interpretability and faster deployment time, with less variables selected, - faster training time, using the new algorithm and exploiting parallelization, - Improved random forests - faster and more accurate preprocessing, - biased random selection of variable to better deal with large numbers of variables, - Management of sparse data - fully automatic, - potentially faster algorithms in case of many constructed variables, - New visualization tool - available on any platform: windows, linux, mac, both in standalone and using a browser - Parallelization on clusters of machine - available on Hadoop (Yarn, HDFS) - New version of pykhiops - more compliant with python PEP8 standard, using snake case - distributed as a python package - new features are available: see pykhiops release notes Backwards compatibility with Khiops 9 - dictionaries of Khiops 9 are readable with Khiops 10 - visualization reports of Khiops 9 are usable with the former visualization tool - python scripts using pykhiops 9 are running, with warnings for the deprecated features - scenarios of Khiops 9 are compatible, with warnings for the deprecated features - removed features in Khiops 10, that still work when used from former python scripts or scenarios: - MAP Naive Bayes is removed - Naive Bayes predictor is removed - Mandatory variable in pairs - Preprocessing options MODLEqualWidth, MODLEqualFrequency and MODLBasic are removed - deprecated features, that still work but will be removed in next versions: - pykhiops 9 is replaced by pykhiops 10 (see migration guide within pykhiops release notes) Warning: - all Khiops 9 deprecated features will be removed after Khiops 10, without upward compatibility - in future version, compatibility will no longer be maintained at the scenario level: it is preferable to use python scripts using pykhiops Detailed evolutions =================== Predictor - the new SNB algorithm directly optimizes the variable weights instead of computing them as a weighted sum over an ensemble of models, - the new SNB algorithm selects far less variables than previously - the MAP Naive Bayes predictor is no longer available - the Naive Bayes predictor is no longer available - in the SNB modeling report, each selected variable comes with three indicators - level: univariate evaluation - weight: multivariate evaluation - importance (new indicator): geometric mean of the weight and the level - the MAP indicator is no longer available - regression: predictors can now deal with missing target values - modeling: predictor are trained on the subset of the train database, containing the records related to actual target values - evaluation: the evaluation criteria are computed using only the the records related to actual target values Management of sparse data: - user impacts, in the user interface - in the deployment dialog box, the user can choose the output format for the output database: - tabular (default): standard tabular format - sparse: extended tabular format, with sparse fields in case of sparse data - technical impacts, most of them for internal purpose only, sparse data is managed throughout the data mining process, from feature construction to model deployment - dictionaries: variables can be organized in blocks of sparse variables - file format: sparse format is exploited in case of blocks of variables - derivation rules: many internal sparse derivation rules have been added - automatic variable construction: sparse derivation rules are generated when necessary Khiops file suffix: - new -khj: Khiops report under the json format -khcj: Khiops Coclustering report under the json format - existing -kdic: data dictionary -kdicj: data dictionary under the json format -khc: Khiops Covisualisation report, for the Covisualization tool -_kh: Khiops script -_khc: Khiops coclustering script - deprecated -khv: Khiops visualisation report, for the former Visualization tool -json: used by Khiops 9, deprecated Khiops Visualization - new visualization tool available to visualize the .khj files or the Khiops 9 .json files on any platform (not windows only as in former tool) - former visualization tool is still delivered to visualize the deprecated Khiops 9 .khv files - Khiops Covisualization tool is unchanged in this version Parallelization on clusters of machines - available on Hadoop (Yarn, HDFS) - delivered as an opened generic package, that needs potential adaptations for any specific Haddop distribution Derivation rules - Translate rule: to replace a list of search values with the corresponding replacement values; useful for example to replace all accented characters - Regex rules: RegexMatch, RegexSearch, RegexReplace, RegexReplaceAll - Sparse rules (internal use only) Pairs of variables: - extended parameters to specify either to analyse all or specific variable pairs KNI: Khiops Native Interface - KNIGetFullVersion: new fonction in API, to get full version of KNI - information on version available by right click on the KNI DLL on Windows Parallelization - now exploits up to n+1 processes on a machine with n cores, and usable on machines with only 2 cores Samples: - new sample 'Accident' from the open data, using a snowflake multi-table schema User interface - refactored panes and new default values - Train database pane: - Sample percentage: 70% by default - Test pane: removed - Inspect or edit the specification of the test database from the train database pane - Parameters pane: fields and actions from the former Preditors and Variable construction panes have been moved to new panes - Predictor pane - Feature engineering: new sub-pane - Max number of constructed variable: 100 by default - Max number of trees: 10 by default - Max number of variable pairs: 0 by default - Advanced predictor parameters: new sub-pane - Baseline predictor - Number of univariate predictors - Button Selective naive bayes parameters - Button Variable construction parameters - date and time construction rules are no longer selected by default - Button Variable pairs parameters - new parameters to specify either all or specific variable pairs to analyse - Recoders pane: new pane with all recoding parameters - Variable construction pane: removed - new helper button "Detect file format" in each database pane, to detect whether there is a header line and what the field separator is - new menu Help, with documentation, license management and about sub-menus - removed options - Pane Parameters/Predictors - MAP Naive Bayes and Naive Bayes predictors are removed - Pane Parameters/Preprocessing - Preprocessing options MODLEqualWdth, MODLEqualFrequency and MODLBasic are removed - renamed labels: some field or action labels have been renamed to for better understandability - improved fluidity, ergonomy and robustness Reports - new fields are stored in reports, to get a better description of the current analysis - shortDescription: new field in the Results pane - logs: section in all json reports, with the warnings and errors that occurred during the tasks - samplePercentage, sampleMode, selectionVariable, selectionValue: from the database panes - featureEngineering: section in preparationReport, with maxNumberOfConstructedVariables, maxNumberOfTrees, maxNumberOfVariablePairs - evaluatedVariablePairs, informativeVariablePairs in summary of bivariatePreparationReport - tree reports in the json report - treePreparationReport: similar to preparationReports, for the tree-based variables - treeDetails: specific json section that described the structure of each tree - json files produced by Khiops, for analysis reports and dictionaries, are now encoded using iso8859-1/windows-1252 unicode for extended ascii characters - evolutions taken into account into pykhiops 10 Khiops Coclustering - post-processing functionalities can now select an input coclustering model in any format (.khc, .json, .khcj) Command line options - new options: -l, -u, -v to manage the license using the command line Packaging - light versions of Khiops can be installed without java as prerequisite - platforms: see web site Performance - tuned resource management, for better use of available RAM and cores - improved scalability, evaluated on huge datasets Khiops web site - the www.khiops.com web site has been redesigned Many minor corrections and improvements