June 17, 2019
Written by Michal Drozdzal, Adriana Romero, and Julia Peter
We present a new approach to generating recipes directly from food images that, according to human judgment, produces more compelling recipes than retrieval-based approaches. Evaluated on the large-scale Recipe1M data set, this approach also improves on previous baselines for ingredient prediction. With this work, we aim to provide access to the preparation of a meal simply by inputting a food image.
Generating a recipe from an image requires a simultaneous understanding of the ingredients composing the dish and of any processing they went through, e.g., slicing or blending with other ingredients. Traditionally, the image-to-recipe problem has been formulated as a retrieval task, where a recipe is retrieved from a fixed data set based on the image similarity score in an embedding space. The performance of such systems depends heavily on the size and diversity of the data set, as well as on the quality of the learned embedding. Not surprisingly, these systems fail when a matching recipe for the image query does not exist in the static data set.
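To make the retrieval formulation concrete, here is a minimal sketch of nearest-neighbor lookup in an embedding space. The embeddings, the recipe titles, and the choice of cosine similarity are assumptions made for illustration; a real system would learn the embeddings and could score similarity differently.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def retrieve_recipe(image_embedding, recipe_embeddings):
    """Return (title, score) of the stored recipe most similar to the image."""
    best_title, best_score = None, float("-inf")
    for title, embedding in recipe_embeddings.items():
        score = cosine_similarity(image_embedding, embedding)
        if score > best_score:
            best_title, best_score = title, score
    return best_title, best_score

# Toy, hand-made 3-d embeddings (hypothetical values, not learned).
RECIPES = {
    "pancakes": [0.9, 0.1, 0.0],
    "tomato soup": [0.0, 0.2, 0.9],
}

query = [0.8, 0.3, 0.1]  # pretend embedding of a pancake photo
title, score = retrieve_recipe(query, RECIPES)
print(title, round(score, 3))

# A dish that is absent from the data set still gets *some* stored recipe back,
# which is the failure mode described above.
unseen_title, unseen_score = retrieve_recipe([0.0, 1.0, 0.0], RECIPES)
```

Note that the lookup can only ever return a title that is already stored, however poor the match.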
One way to overcome the data set constraints of retrieval systems is to formulate the image-to-recipe problem as conditional generation. We argue that instead of obtaining the recipe from an image directly, a recipe-generation pipeline would benefit from an intermediate step: predicting the ingredients list. The sequence of instructions would then be generated conditioned on both the image and its corresponding list of ingredients, where the interplay between image and ingredients could provide additional insights into how the latter were processed to produce the resulting dish.
Our image-to-recipe generation system takes a food image as input and outputs a recipe containing a title, ingredients, and cooking instructions. Our method starts by pretraining an image encoder and an ingredients decoder, which predicts a set of ingredients by exploiting visual features extracted from the input image and ingredient co-occurrences. Then we train the ingredient encoder and the instruction decoder, which generate the title and instructions by taking the image’s visual features and the predicted ingredients and feeding them into a state-of-the-art sequence generation model.
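The data flow of this two-stage pipeline can be sketched as follows. This is only an illustration of how the stages connect, not our implementation: every function below is a hypothetical stand-in that returns fixed dummy values in place of a trained neural network, and the names `encode_image`, `decode_ingredients`, and `decode_instructions` are invented for this example.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Recipe:
    title: str
    ingredients: List[str]
    instructions: List[str]

def encode_image(image_path: str) -> List[float]:
    """Stand-in image encoder: returns a fixed dummy feature vector."""
    return [0.2, 0.7, 0.1]

def decode_ingredients(image_features: List[float]) -> List[str]:
    """Stand-in ingredients decoder: returns a fixed, duplicate-free list."""
    return ["flour", "egg", "milk"]

def decode_instructions(image_features: List[float],
                        ingredients: List[str]) -> Recipe:
    """Stand-in instruction decoder, conditioned on BOTH image and ingredients."""
    title = "Dish with " + ", ".join(ingredients)
    steps = [f"Prepare the {name}." for name in ingredients]
    steps.append("Combine and cook.")
    return Recipe(title, list(ingredients), steps)

def image_to_recipe(image_path: str) -> Recipe:
    # Stage 1: visual features -> predicted ingredient set.
    features = encode_image(image_path)
    ingredients = decode_ingredients(features)
    # Stage 2: (visual features, ingredients) -> title and instructions.
    return decode_instructions(features, ingredients)

recipe = image_to_recipe("pancakes.jpg")  # hypothetical path; the file is never opened
print(recipe.title)
for step in recipe.instructions:
    print("-", step)
```

The point of the sketch is that the instruction stage receives the image features a second time, alongside the predicted ingredients, rather than the ingredients alone.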
Food recognition challenges current computer vision systems to go beyond the merely visible. Compared with natural image understanding, visual ingredient prediction requires high-level reasoning and prior knowledge (e.g., that croissants likely contain butter). This poses additional challenges, as food components have high intra-class variability, heavy deformations occur during cooking, and ingredients are frequently occluded in a cooked dish. Our system is a first step toward broader food understanding systems, such as calorie estimation and recipe creation.
Additionally, this two-stage training scheme can be used for any problem that requires predicting long structured text from an image and predicted keywords. The first part of the pipeline (ingredient prediction) could be applied to address broader problems, such as image-to-set prediction.