---
pinned: false
title: "From Hadoop to the modern data stack"
description: "I was at the meetup From Hadoop to the Modern Stack with the companies Criteo, Starburst, and Castor, featuring three talks: Hadoop and Data Processing at Criteo, Data Catalog Castor, and Trino on Apache Iceberg."
authors: ["glegoux"]
time_reading_minutes: 3
category: "Data"
---

# Hadoop at Criteo

by [Anthony Rabier](https://fr.linkedin.com/in/anthony-rabier-345108100), Staff Site Reliability Lead Engineer at [Criteo](https://www.criteo.com/) and [William Montaz](https://fr.linkedin.com/in/williammontaz), Senior Staff Site Reliability Engineer at [Criteo](https://www.criteo.com/).

Criteo migrated from [Hadoop](https://hadoop.apache.org/docs/stable/) 2 to Hadoop 3, with a lot of patches, in an impressive way: the runtime (Hadoop YARN and HDFS) was migrated to Hadoop 3 first, progressively and without downtime, with no intervention from the development teams; then the Hadoop projects for Spark, Flink, and Hadoop MapReduce were migrated. A vanilla Hadoop 3 distribution was thus created, able to run projects compiled against both Hadoop 2 and Hadoop 3 thanks to backward-compatibility tricks developed by Criteo and merged into the Hadoop core project.

![](https://miro.medium.com/v2/resize:fit:318/0*Ca5vwq5jxI-Qq2m_)

Migration from Hadoop 2 to Hadoop 3

[Garmadon](https://github.com/criteo/garmadon) is a Java agent deployed alongside all JVM processes running on the Criteo Hadoop cluster to collect metrology data for Spark, Flink, Hadoop MapReduce, and so on. With it, you can build generic Grafana dashboards, create data lineage between your datasets, and audit all operations on HDFS (a minimal sketch of the Java agent mechanism is given at the end of this post).

![](https://miro.medium.com/v2/resize:fit:700/0*OSMVeNYwRB6fQMSC)

Garmadon, a Java agent for metrology on Hadoop

# Data processing at Criteo

by [Miguel Liroz](https://fr.linkedin.com/in/mliroz), Senior Staff Site Reliability Lead Engineer at [Criteo](https://www.criteo.com/) and [Raphael Claude](https://www.linkedin.com/in/raphaelclaude), Staff Site Reliability Engineer at [Criteo](https://www.criteo.com/).

Datadoc is an internal tool used at Criteo to browse the data catalog and to document each dataset.

{% include content/image.html src="https://miro.medium.com/v2/resize:fit:700/1*UD_BHWOi4vMzm_vEQQthvg.png" abs_url=true title="Datadoc" source_author=true %}

[BigDataFlow](https://medium.com/criteo-engineering/scheduling-data-pipelines-at-criteo-part-1-8b257c6c8e55) is an internal tool, written in Scala, that Criteo uses to schedule data processing jobs derived from an extended SQL query (a toy illustration of this idea closes this post).

{% include content/image.html src="https://miro.medium.com/v2/resize:fit:700/1*cRzS5MJ4IiErfXQ82Sz5Ug.png" abs_url=true title="BigDataFlow extended SQL" source_author=true %}

{% include article/read-more.md src="https://medium.com/@glegoux/from-hadoop-to-the-modern-stack-cdb2f10cee31" %}
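
To make the Garmadon paragraph above more concrete, here is a minimal sketch of how a JVM agent attaches to a process. This is not Garmadon's actual code: the class name, the printed metric, and the shutdown-hook reporting are illustrative assumptions; a real metrology agent streams events (JVM statistics, HDFS calls, container metrics) to a collector instead of printing them.

```java
import java.lang.instrument.Instrumentation;

// Hypothetical minimal agent, not Garmadon's implementation.
public class MetrologyAgent {

    // The JVM calls premain() before the application's main() when the
    // process is started with -javaagent:metrology-agent.jar.
    public static void premain(String agentArgs, Instrumentation inst) {
        System.err.println("metrology agent attached, args=" + agentArgs);

        // Illustrative metric: report heap usage when the JVM exits.
        Runtime.getRuntime().addShutdownHook(new Thread(() -> {
            Runtime rt = Runtime.getRuntime();
            long heapUsed = rt.totalMemory() - rt.freeMemory();
            System.err.println("heap used at exit: " + heapUsed + " bytes");
        }));
    }
}
```

Such an agent is packaged as a jar whose manifest declares `Premain-Class: MetrologyAgent`. Adding `-javaagent:metrology-agent.jar` to the JVM options of YARN containers is one way to get it loaded into every Spark, Flink, or MapReduce JVM without touching the application code, which matches the "deployed with all JVM processes" description above.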
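
The BigDataFlow illustration is equally hypothetical: its real syntax and scheduler are internal to Criteo (see the linked article). This sketch, in Java for consistency with the agent above, only shows the core idea of scheduling from SQL: extract the tables each query consumes and produces, then derive the execution order from the resulting dependency graph. The query strings, table names, and regexes are all assumptions.

```java
import java.util.*;
import java.util.regex.*;

// Hypothetical sketch of scheduling jobs from SQL dependencies.
public class SqlScheduler {

    // Naive regexes for the table a query produces (INSERT ... TABLE x /
    // INSERT INTO x) and the tables it reads (FROM x, JOIN x).
    static final Pattern OUT =
        Pattern.compile("INSERT\\s+(?:OVERWRITE\\s+TABLE|INTO)\\s+(\\w+)", Pattern.CASE_INSENSITIVE);
    static final Pattern IN =
        Pattern.compile("(?:FROM|JOIN)\\s+(\\w+)", Pattern.CASE_INSENSITIVE);

    public static void main(String[] args) {
        List<String> queries = List.of(
            "INSERT OVERWRITE TABLE clicks_daily SELECT * FROM clicks_raw",
            "INSERT OVERWRITE TABLE revenue SELECT * FROM clicks_daily JOIN sales"
        );

        // Map each produced table to the set of tables it reads from.
        Map<String, Set<String>> deps = new HashMap<>();
        for (String q : queries) {
            Matcher out = OUT.matcher(q);
            if (!out.find()) continue;
            Set<String> inputs = new HashSet<>();
            Matcher in = IN.matcher(q);
            while (in.find()) inputs.add(in.group(1));
            deps.put(out.group(1), inputs);
        }

        // Topological order: a job runs once every table it reads exists.
        // Assumes no cycles and that every input is either a source dataset
        // or produced by another query.
        List<String> schedule = new ArrayList<>();
        Set<String> done = new HashSet<>(List.of("clicks_raw", "sales")); // source datasets
        while (schedule.size() < deps.size()) {
            for (Map.Entry<String, Set<String>> e : deps.entrySet()) {
                if (!done.contains(e.getKey()) && done.containsAll(e.getValue())) {
                    schedule.add(e.getKey());
                    done.add(e.getKey());
                }
            }
        }
        System.out.println("run order: " + schedule); // [clicks_daily, revenue]
    }
}
```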