swagger: "2.0" info: version: 1.0.0 title: Anuvaad Worfklow Manager - Kafka Contract description: A python based microservice to trigger and orchestrate the workflow of the anuvaad-etl pipeline. This service will expose REST APIs for workflow related activities on the other hand will also be plugged into the system as a consumer that picks records posted onto the Kafka Queue in order to perform workflow related activities. contact: name: Kumar Deepak email: kumar.deepak@tarento.com usecase: intiate-workflow: input-topic: anuvaad-etl-wf-initate description: In order to intiate the job asynchronously, the below mentioned JSON has to be pushed to the above mentioned topic. The wf-manager will consume this data and trigger the workflow. Since it is async, there'll be no response and the status of the workflow can only be fetched using the search API. $ref: '#/definitions/WFInitiateRequest' interrupt-workflow: input-topic: anuvaad-etl-wf-interrupt description: In order to interrupt the job asynchronously, the below mentioned JSON has to be pushed to the above mentioned topic. The wf-manager will consume this data and interrupt all the workflows that belonging to the jobIDs and the taskIDs mentioned in the request. Since it is async, there'll be no response and the status of the workflow can only be fetched using the search API. $ref: '#/definitions/WFInterruptRequest' update-wf-status: input-topic: anuvaad-etl-wf-status-update $ref: '#/definitions/WFStatusUpdateInstance' description:|- In order to update the status of every workflow this topic will be used. It works as follows - * Every step in workflow like ingester, transformer etc (referred to as tasks), will build and push the above mentioned JSON to the above mentioned topic. This JSON will contain all the metadata of that particular task. * Workflow Manager will pick this JSON and store it as the task details against that particular taskID and its parent jobID. WFM will also use this JSON to build a job-level-status-update object and stores it as the job details against the jobID. * The idea is to have 2 tables - job_details and task_details which will not only store data extensively but will also faciliate in fetching either current status of the job or the entire history of the job at any point in time. pipeline-orcherstration: topics: Consumes from all output topics and Produces to all input topics. description: WFM acts as orcherstrator which means - WFM decides the step 'n+1' after step 'n'. This is entirely driven by configs. For example, lets say - * Transformer recieves input on topic 't-in' and produces the output on 't-out' * Extractor recieves input on topic 'e-in' and produces the output on 'e-out' * The WF definition is the config says, perform extraction after transformation. * So the WF handles it this way - * WF will produce to 't-in', Transformer will process it and produce the output to 't-out'. * WF will now consume from 't-out', check the config for next step and finds out that extraction is the next step. * WF will now produce the output of Transformer to the topic 'e-in'. Extractor processes it and produces the output to 'e-out'. * WF will consume from 'e-out', find the next step from config and posts is accordingly. definitions: File: type: object properties: name: type: string description: Name of the file. This will be obtained in the output of the file upload API exposed by the anuvaad system. type: type: string description: type of the file. 
enum: - PDF - IMAGE - TEXT - CSV locale: type: string description: The locale of the file. Meaning, in which language is the uploaded file. For instance, 'en' for English, 'hi' for Hindi etc. format: varchar FileInput: type: object properties: source: type: object description: Details of the source file. $ref: '#/definitions/File' target: type: object description: Details of the target file. $ref: '#/definitions/File' User: type: object properties: name: type: string description: Name of the user id: type: string description: Unique ID of the user. It can be the ID used to log into anuvaad's system. email: type: string description: Email ID of the user. WFInitiateRequest: type: object properties: input: type: object description: Details of the source and target files of which a parallel corp has to be generated $ref: '#/definitions/FileInput' userDetails: type: object description: Details of the user initiating this job. $ref: '#/definitions/User' workflowCode: type: string description: Code of the workflow that has to be picked to process this input. These workflows are configured at the application level. WFStatusUpdateInstance: type: object properties: jobID: type: string description: A unique job ID generated for this workflow. taskID: type: string description: A unique task ID generated for the current on-going task of the workflow. status: type: string description: Current status of the particular task. enum: - SUCCESS - FAIL state: type: string description: Current state of the workflow enum: - INGESTED - TRANSFORMED - PARAGRAPH-EXTRACTED - SENTENCE-TOKENISED - SENTENCES-ALIGNED - PROCESSED output: type: object description: Intermediate or final output of the process. If the status = COMPLETE and the state = PROCESSED, this output is considered to be the final output, intermediate otherwise. $ref: '#/definitions/FileInput' error: type: object description: Incase the job fails due to an error, that error details will be logged here. $ref: '#/definitions/Error' taskStartTime: type: number description: 13 digit epoch of the start time. taskEndTime: type: number description: 13 digit epoch of the end time. WFInterruptRequest: type: object properties: jobIDs: type: array items: type: string description: IDs of the jobs to be interrupted. taskIDs: type: array items: type: string description: IDs of the tasks to be interrupted. Error: type: object properties: errorID: type: string description: Unique ID generated for this error. jobID: type: string description: ID of the job within which the error occured. taskID: type: string description: ID of the task within which the error occured. state: type: string description: Processing state of the job just before the error. httpCode: type: string description: Http Code of the error errorMsg: type: string description: This is the machine generated error message like TypeError, ValueError etc with line number. cause: type: string description: This is the cause of the error in a user understandable format that is defined by the developer. timeStamp: type: number description: epoch time at the instant of error.
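
# ---------------------------------------------------------------------------
# Illustrative sample payloads (not part of the contract itself). All concrete
# values below - IDs, file names, email, epochs and the workflow code - are
# made-up placeholders, added only to show the shape of each message.
# ---------------------------------------------------------------------------
examples:
  # A WFInitiateRequest as it could be pushed to the initiate topic: source
  # and target files plus the user initiating the job. The workflowCode
  # 'WF_SAMPLE_CODE' is hypothetical; real codes are configured at the
  # application level.
  initiate-workflow:
    input:
      source:
        name: judgement_en.pdf
        type: PDF
        locale: en
      target:
        name: judgement_hi.pdf
        type: PDF
        locale: hi
    userDetails:
      name: Sample User
      id: sample-user-001
      email: sample.user@example.com
    workflowCode: WF_SAMPLE_CODE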
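  # A WFInterruptRequest as it could be pushed to the interrupt topic. All
  # workflows belonging to these (placeholder) jobIDs and taskIDs would be
  # interrupted.
  interrupt-workflow:
    jobIDs:
      - JOB-0001
      - JOB-0002
    taskIDs:
      - TASK-0001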
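  # A WFStatusUpdateInstance as a task (here, assumed to be the sentence
  # tokeniser) could push it to the status-update topic after finishing
  # successfully. The 13-digit epochs are placeholder millisecond timestamps.
  update-wf-status:
    jobID: JOB-0001
    taskID: TASK-0002
    status: SUCCESS
    state: SENTENCE-TOKENISED
    output:
      source:
        name: judgement_en_tokenised.txt
        type: TEXT
        locale: en
      target:
        name: judgement_hi_tokenised.txt
        type: TEXT
        locale: hi
    taskStartTime: 1596000000000
    taskEndTime: 1596000060000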
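  # A sketch of what a config-driven workflow definition could look like for
  # the transformer -> extractor flow described in the pipeline-orchestration
  # usecase. The schema here (sequence, order, tool, input-topic,
  # output-topic) is an assumption for illustration, not the actual WFM
  # config format. WFM would consume from every output-topic listed, look up
  # the step after the one that just finished, and produce the payload to
  # that next step's input-topic.
  workflow-config:
    workflowCode: WF_SAMPLE_CODE
    sequence:
      - order: 1
        tool: TRANSFORMER
        input-topic: t-in
        output-topic: t-out
      - order: 2
        tool: EXTRACTOR
        input-topic: e-in
        output-topic: e-out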