Research

Using AI to bring children’s drawings to life

December 16, 2021

Children draw fascinatingly unique and inventive characters that push our imaginations and require us to think a little differently to recognize the people and things in their pictures. While it can be fairly simple for a parent or teacher to see what a child’s drawing is meant to show, AI struggles with this task. Kids’ drawings are often constructed in abstract, fanciful ways, so if a figure’s feet are placed precariously or if both arms are on the same side of its body, it confuses even state-of-the-art AI systems that excel at spotting objects in photorealistic images and drawings.

Meta AI researchers are working to overcome this challenge so that AI systems will be better able to recognize drawings of human figures in the wildly varied ways that children create them.

We’re excited to announce a first-of-its-kind method that uses AI to automatically animate children’s hand-drawn figures of people and humanlike characters (i.e., characters with two arms, two legs, a head, etc.), bringing these drawings to life in a matter of minutes. By uploading drawings to our prototype system, parents and children can experience the excitement of watching them become moving characters that dance, skip, and jump. They can even download their animated drawings to share with friends and family. If parents choose, they can also submit those drawings to help improve the AI model.

By teaching AI to work effectively with this quintessential human form of creativity, we hope this project will move us closer to building AI that can understand the world from a human point of view. We also hope this work will spur more research on using AI to enhance people’s creativity and inspire imaginative new uses for this technology.

Why automatic AI animation tools don’t work on children’s drawings

Our goal was to build an AI system that can identify and automatically animate the humanlike figures in children’s drawings with a high success rate and without any human guidance. While many AI tools and techniques are designed to handle realistic images of humans, children’s drawings add a level of variety and unpredictability that makes identifying what’s being portrayed much more complex. “Humans” in children’s drawings come in many different forms, colors, sizes, and scales, with little similarity when it comes to body symmetry, morphology, and point of view. We approached this AI challenge through a four-step process, fine-tuning our approach at each stage to adapt to the enormous variety present in children’s drawings.

Identifying humanlike figures through object detection

The first step in animating children’s drawings of people is distinguishing the human figures from the background and from other types of characters in the picture. Object detection using existing techniques works quite well on children’s drawings, but the segmentation masks aren’t accurate enough to be used for animation. To address this, we instead use the bounding boxes obtained from the object detector and apply a series of morphological operations and image processing steps to obtain masks.

To extract the humanlike characters within a child’s drawing for processing, we use Meta AI’s convolutional neural network–based object detection model, Mask R-CNN, as implemented in Detectron2. Mask R-CNN is pretrained on one of the largest publicly available segmentation data sets, but that data set is made up of photos of real-world objects, not drawings. To work on drawings, the model needed to be fine-tuned, which we did using a ResNet-50+FPN backbone to predict a single class, “human figure.” We invited our colleagues at Meta to share and animate their kids’ artwork using our system, and we obtained approximately 1,000 drawings that helped us train the AI.
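As a rough illustration, fine-tuning a Detectron2 Mask R-CNN for a single “human figure” class might look like the sketch below. The dataset registration, paths, and training hyperparameters are placeholders for illustration, not our actual configuration.

# Minimal sketch (not the production training script): fine-tuning a
# Detectron2 Mask R-CNN (ResNet-50 + FPN) to predict one class, "human figure".
# The dataset name, file paths, and hyperparameters are hypothetical.
from detectron2 import model_zoo
from detectron2.config import get_cfg
from detectron2.engine import DefaultTrainer
from detectron2.data.datasets import register_coco_instances

# Register an (assumed) COCO-format dataset of annotated children's drawings.
register_coco_instances(
    "drawings_train", {}, "annotations/train.json", "images/train"
)

cfg = get_cfg()
cfg.merge_from_file(
    model_zoo.get_config_file("COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml")
)
# Start from weights pretrained on COCO photos, then adapt to drawings.
cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url(
    "COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml"
)
cfg.DATASETS.TRAIN = ("drawings_train",)
cfg.DATASETS.TEST = ()
cfg.MODEL.ROI_HEADS.NUM_CLASSES = 1  # single class: "human figure"
cfg.SOLVER.BASE_LR = 0.00025
cfg.SOLVER.MAX_ITER = 3000

trainer = DefaultTrainer(cfg)
trainer.resume_or_load(resume=False)
trainer.train()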

After the fine-tuning process, the network did a good job of detecting human figures within the test data set. The failure cases we observed fell into four categories: not including the entire figure, not separating the figure from the background, not separating several figures drawn close together, and incorrectly identifying nonhuman figures (such as trees). We believe these types of failures stem from the wide variety of human figures in the training set, and that the model will continue to improve as it gets more drawings to learn from.

Lifting the humanlike figure from the scene using character masking

After identifying and extracting a human figure from a drawing, the next step in preparing for animation is to separate it from other parts of the scene and the background in a process called masking. The mask must closely mirror the contours of the figure because it will be used to create a mesh, which will then be deformed to produce the animation. When properly done, a mask will include all parts of the character and nothing from the background.

Even though Mask R-CNN can output masks, we found they weren’t suitable for animation. The predicted masks often failed to capture the entire figure whenever the body parts varied greatly in appearance, such as in the figure below, which shows a large yellow triangle for a body and a single pencil stroke for the arm. The predicted masks also often failed by leaving out the middle of “hollow” characters, or characters drawn as outlines and not colored in.

Instead, we developed a classical image processing–based approach that is more robust to these variations. With this method, we crop the image to each detected character’s predicted bounding box. We then apply adaptive thresholding and morphological closing/dilating operations, flood fill from the edges of the box, and assume that the mask is the largest polygon untouched by the flood fill. While this method is straightforward and effective at extracting accurate masks suitable for animation, it can fail when the background is cluttered, characters are drawn close together, or the page has wrinkles, tears, or shadows.
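A minimal version of this pipeline can be sketched with OpenCV and NumPy. The kernel sizes and threshold parameters below are illustrative guesses rather than the values we use in production.

# Sketch of the classical masking steps: adaptive thresholding, morphological
# closing/dilation, flood fill from the box edges, and keeping the largest
# region the flood fill cannot reach. Parameters are illustrative only.
import cv2
import numpy as np

def extract_mask(image_bgr, box):
    """Crop to the detected character's bounding box and recover a mask."""
    x0, y0, x1, y1 = box
    crop = cv2.cvtColor(image_bgr[y0:y1, x0:x1], cv2.COLOR_BGR2GRAY)

    # Adaptive thresholding separates pen strokes from the paper.
    binary = cv2.adaptiveThreshold(
        crop, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
        cv2.THRESH_BINARY_INV, 51, 10)  # block size 51, offset 10 (assumed)

    # Morphological closing and dilation bridge small gaps in the outline.
    kernel = np.ones((5, 5), np.uint8)
    binary = cv2.morphologyEx(binary, cv2.MORPH_CLOSE, kernel)
    binary = cv2.dilate(binary, kernel, iterations=1)

    # Flood fill the background from the corners of the box; whatever the
    # fill cannot reach is enclosed by (or part of) the character.
    flood = binary.copy()
    h, w = flood.shape
    ff_mask = np.zeros((h + 2, w + 2), np.uint8)
    for sx, sy in [(0, 0), (w - 1, 0), (0, h - 1), (w - 1, h - 1)]:
        if flood[sy, sx] == 0:
            cv2.floodFill(flood, ff_mask, (sx, sy), 255)
    foreground = cv2.bitwise_or(binary, cv2.bitwise_not(flood))

    # Keep the largest connected component as the character mask.
    n, labels, stats, _ = cv2.connectedComponentsWithStats(foreground)
    if n <= 1:
        return foreground
    largest = 1 + np.argmax(stats[1:, cv2.CC_STAT_AREA])
    return (labels == largest).astype(np.uint8) * 255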

Segmentation masks from Mask R-CNN sometimes failed to closely follow the form of the character (middle, top) or to include all parts of the character, such as stick arms (middle, bottom). In many situations, using an image processing pipeline on Mask R-CNN’s predicted bounding box results in masks more suitable for animation (right).

Prepping for animation via rigging

Children draw figures with a huge variety of body shapes, far beyond the conventional human shape with a head, arms, legs, and a torso. Many children start out depicting humans as what are often called “tadpole people,” with no torso and with arms and legs attached directly to the head. Some children progress to “transitional” figures, which have legs extending from the head and arms extending from the upper legs. We needed a method of rigging that could handle this type of morphological variation.

We use AlphaPose, a model trained for human pose detection, to identify key points on the human figures that can serve as hips, shoulders, elbows, knees, wrists, and ankles. AlphaPose was trained on images of real people, so before it could detect poses in children’s drawings, we had to retrain it to handle the types of variation those drawings contain. We did this by internally collecting and annotating a small data set of children’s drawings of human figures. Then, using the pose detector trained on this initial data set, we created an internal tool that allows parents to upload and animate their children’s drawings and allows us to use the uploaded drawings for additional training. As more data came in, we iteratively retrained the model until we reached high levels of accuracy.
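Conceptually, the detector’s output is then mapped onto the named joints the rig expects. The small sketch below assumes a COCO-style keypoint ordering for illustration; AlphaPose’s actual output format may differ.

# Hypothetical mapping from COCO-style keypoint indices to named rig joints.
COCO_TO_RIG = {
    "l_shoulder": 5, "r_shoulder": 6,
    "l_elbow": 7, "r_elbow": 8,
    "l_wrist": 9, "r_wrist": 10,
    "l_hip": 11, "r_hip": 12,
    "l_knee": 13, "r_knee": 14,
    "l_ankle": 15, "r_ankle": 16,
}

def keypoints_to_joints(keypoints):
    """keypoints: list of (x, y, confidence) triples from the pose detector."""
    return {
        name: {"xy": keypoints[idx][:2], "conf": keypoints[idx][2]}
        for name, idx in COCO_TO_RIG.items()
    }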

Animating 2D figures using 3D motion capture

Once we have mask and joint predictions, we have everything we need to produce the animation. We begin by using the extracted mask to generate a mesh, texturing it with the original drawing. Using the predicted joint locations, we create a skeleton for the character. By rotating the bones and using the new joint locations to deform the mesh, we can move the character into various poses, and by moving it through a series of consecutive poses, we can create an animation. We select which motions to apply depending upon how confident the joint predictions are: In cases where both arms and legs have been predicted correctly, the animation can happen seamlessly. But if a limb is not present in the drawing, its joint confidence values will be low, and we must either forgo animations that require that limb, ask the user to correct the prediction, or declare the animation a failure.
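The reposing step can be illustrated with a simplified 2D forward-kinematics sketch. The joint hierarchy and the per-bone rotation scheme below are illustrative stand-ins, not the full pipeline.

# Simplified sketch: repose a 2D skeleton by rotating each bone about its
# parent joint. Joint names and hierarchy are assumptions for illustration.
import numpy as np

# Parent of each joint; "root" sits at the hips.
PARENTS = {
    "root": None,
    "torso": "root", "head": "torso",
    "l_shoulder": "torso", "l_elbow": "l_shoulder", "l_wrist": "l_elbow",
    "r_shoulder": "torso", "r_elbow": "r_shoulder", "r_wrist": "r_elbow",
    "l_hip": "root", "l_knee": "l_hip", "l_ankle": "l_knee",
    "r_hip": "root", "r_knee": "r_hip", "r_ankle": "r_knee",
}

def rot(theta):
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

def repose(joints, bone_angles):
    """joints: dict of name -> np.array([x, y]) rest-pose positions.
    bone_angles: dict of name -> rotation (radians) applied to that bone."""
    new = {"root": joints["root"]}
    for name, parent in PARENTS.items():
        if parent is None:
            continue
        offset = joints[name] - joints[parent]              # rest-pose bone vector
        offset = rot(bone_angles.get(name, 0.0)) @ offset   # rotate the bone
        new[name] = new[parent] + offset                    # attach to reposed parent
    return new

The reposed joint locations would then drive the deformation of the textured mesh described above, and playing back a sequence of such poses yields the animation.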

To animate the 2D figures using 3D motion capture, we take advantage of the fact that many children draw using what we refer to as a twisted perspective. Children commonly draw body parts from their most identifiable point of view, which may differ from the way those parts would appear on actual humans. For instance, they tend to draw legs and feet from a side view and heads and torsos from a front view.

We take advantage of this perspective in our motion retargeting step. Independently for the lower and upper body, we automatically determine whether the motion is more recognizable from a front view or a side view. Using the selected views, we project the motion onto a single 2D plane and use it to drive the character. We validate the results of such a motion retargeting approach using perceptual user studies run via Mechanical Turk.

Left: Prior to animating, we create a rigged character from the drawing. Right: We repose the character by projecting a frame of motion capture data onto a 2D plane and rotating the character’s limbs to match those of the projection. We can project the motion capture data from the front (top row), from the side (middle row), and from a twisted perspective (bottom row).

Taking the twisted perspective into account is helpful because many types of motion do not cleanly fall onto a single plane of projection. For example, with jumping rope, the arms and wrists tend to move primarily in the frontal plane, while the bending legs tend to move in the sagittal plane. Because of this, we do not determine a single plane of motion for the motion capture pose but determine projection planes for the upper and lower body separately.
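One way to sketch this plane-selection step is to project each body half’s motion onto the frontal (x-y) and sagittal (z-y) planes and drive that half from whichever projection shows more movement. The joint index lists and the variance criterion below are assumptions for illustration, not our exact selection rule.

# Sketch of per-body-half projection-plane selection for motion retargeting.
import numpy as np

UPPER_JOINTS = [5, 6, 7, 8, 9, 10]       # shoulders, elbows, wrists (assumed indices)
LOWER_JOINTS = [11, 12, 13, 14, 15, 16]  # hips, knees, ankles (assumed indices)

def choose_plane(motion, joint_ids):
    """motion: (frames, joints, 3) array of x, y, z joint positions over time."""
    part = motion[:, joint_ids, :]
    frontal = part[..., [0, 1]]   # project onto x-y (front view)
    sagittal = part[..., [2, 1]]  # project onto z-y (side view)
    # Pick the plane in which the projected joints move the most.
    spread_f = frontal.reshape(-1, 2).var(axis=0).sum()
    spread_s = sagittal.reshape(-1, 2).var(axis=0).sum()
    return ("front", frontal) if spread_f >= spread_s else ("side", sagittal)

def retarget(motion):
    upper_view, upper_2d = choose_plane(motion, UPPER_JOINTS)
    lower_view, lower_2d = choose_plane(motion, LOWER_JOINTS)
    return {"upper": (upper_view, upper_2d), "lower": (lower_view, lower_2d)}

Under this kind of criterion, the jump-rope example above would drive the arms from the frontal projection and the legs from the sagittal one.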

Using AI to power more complex animations

AI has become a powerful tool for creativity, empowering artists; inspiring new forms of self-expression, like AR effects; offering fashion advice; and even generating new dance routines. We hope our animation tool will inspire people to experiment with their drawings and take them in uncharted directions.

By sharing our work, we also hope to encourage more computer vision work in the domain of amateur drawings. Future research for this project could focus on identifying and applying more tailored motions to subcategories of figures, such as superheroes, princesses, monsters, and ninjas. A more fine-grained analysis of the parts of a character would also be useful for identifying antennae, tails, and capes, for example, and applying secondary motion elements in order to increase the animation’s appeal. Someday, perhaps, an AI system could take a complex drawing and then instantly create a detailed animated cartoon with multiple fantastical characters interacting with one another and with elements from the background. With AR glasses, those characters could even seem to come to life in the real world, dancing or talking with the child who drew them just moments earlier. The possibilities are as limitless as the human imagination.

We invite you to test out the system’s animation capabilities by uploading your children’s drawings to our prototype. In the coming year, we also hope to release the data set and share more details on our research.

Written By

Jesse Smith

Postdoctoral Researcher