The Fruits of Deep Learning: How Convolutional Neural Networks Support Robotic Harvesting and Yield Mapping
Accurate and efficient fruit detection is of critical importance for robotic harvesting and yield mapping.
No matter how sophisticated their grasping systems are, robots can pick only those fruits that their vision systems detect. Furthermore, fruit detection helps generate yield maps that track the spatially variable output of crop production and serve as decision tools in precision farming.
A number of factors make fruit detection a challenging task: Fruits occur in scenes of varying illumination, can be occluded by other objects and are sometimes hard to visually distinguish from the background.
The ideal fruit detection system is accurate, can be trained on an easily obtainable data set, generates its predictions in real time, adapts to different types of fruits and works day and night using different modalities, such as color images and infrared images.
Over the last few years, deep learning methods have made considerable progress in addressing these requirements. This article highlights some of the challenges and recent milestones in fruit detection.
The first challenge in a computer vision task is to acquire raw data in the form of camera images or video frames. In the context of agriculture, data is frequently collected through the use of drones and robots.
Fruit detection can be formulated as an image segmentation problem. For the computer vision system to learn from the available raw data, pixels that are part of fruits need to be distinguished from pixels representing the background.
Annotating fruit pixels individually is labor-intensive. Fortunately, deep learning systems do not require pixel-wise information and can learn from bounding box annotations instead.
Sa et al. report that it takes close to 100 seconds to label the fruit pixels in one image and around 10 seconds to perform bounding box annotations. In other words, given the same resources, the use of bounding boxes will result in an annotated data set that is ten times larger.
Synthetically generated data sets provide a potential alternative to human annotations. Rahnemoonfar & Sheppard experimented with a relatively simple process that involves filling blank images with green and brown circles, blurring the image and drawing circles of random sizes in random positions. Barth et al. used an elaborate process to synthetically create 10,500 images for 42 different plant models with randomized plant parameters.
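To make the simpler of the two processes more concrete, here is a minimal sketch in plain Python. It is not the authors' code: the image size, the colors, the 3x3 box blur and the function names are all assumptions chosen for illustration. It follows the three steps described above and returns the number of fruit circles as the label.

```python
import random

BACKGROUND_COLORS = [(60, 140, 60), (120, 80, 40)]  # green and brown (assumed values)
FRUIT_COLOR = (200, 30, 30)                          # assumed fruit color


def draw_circle(img, cx, cy, radius, color):
    """Paint a filled circle onto img, clipping at the image border."""
    size = len(img)
    for y in range(max(0, cy - radius), min(size, cy + radius + 1)):
        for x in range(max(0, cx - radius), min(size, cx + radius + 1)):
            if (x - cx) ** 2 + (y - cy) ** 2 <= radius ** 2:
                img[y][x] = color


def box_blur(img):
    """Replace every pixel by the mean of its 3x3 neighbourhood."""
    size = len(img)
    out = []
    for y in range(size):
        row = []
        for x in range(size):
            neighbours = [img[j][i]
                          for j in range(max(0, y - 1), min(size, y + 2))
                          for i in range(max(0, x - 1), min(size, x + 2))]
            row.append(tuple(sum(p[c] for p in neighbours) // len(neighbours)
                             for c in range(3)))
        out.append(row)
    return out


def make_synthetic_image(size=32, n_background=20, max_fruits=5, seed=None):
    """Return (image, fruit_count); image is a size x size grid of RGB tuples."""
    rng = random.Random(seed)
    img = [[(255, 255, 255)] * size for _ in range(size)]  # blank canvas

    # Step 1: fill the canvas with green and brown circles.
    for _ in range(n_background):
        draw_circle(img, rng.randrange(size), rng.randrange(size),
                    rng.randint(3, 8), rng.choice(BACKGROUND_COLORS))

    # Step 2: blur the background.
    img = box_blur(img)

    # Step 3: draw fruit circles of random sizes in random positions.
    fruit_count = rng.randint(1, max_fruits)
    for _ in range(fruit_count):
        draw_circle(img, rng.randrange(size), rng.randrange(size),
                    rng.randint(2, 5), FRUIT_COLOR)
    return img, fruit_count


image, label = make_synthetic_image(seed=0)
```

Because the label is known at generation time, no human annotation is needed. Note that fruit circles may overlap in this sketch, which loosely mimics occlusion.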
In many cases, deep vision systems benefit from transfer learning. Instead of starting each project from scratch, existing convolutional neural networks (CNNs) that have been trained on millions of images and optimized by leading researchers over many years can be fine-tuned for domain-specific tasks.
In particular, many deep learning solutions to the problem of fruit detection are based on a highly successful object detection network named Faster R-CNN. This neural network has been trained in two steps: ImageNet, a data set consisting of 1.2 million images, was used to pre-train the network's convolutional layers on image classification. The network was then fine-tuned for object detection using another data set named PASCAL VOC.
The region proposal module in Faster R-CNN generates a set of candidate bounding boxes that could contain the objects of interest. In the words of the authors, this module tells the network “where to look.” For each proposed box, the region classification step then computes a probability distribution over the classes that the box can belong to.
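As a small illustration of the classification step, the sketch below turns raw per-class scores for each proposed box into a probability distribution with a softmax. The boxes, class names and score values are made up for the example and do not come from a trained network.

```python
import math


def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    highest = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


classes = ["background", "sweet pepper"]  # hypothetical class list

# Hypothetical proposals: box (x1, y1, x2, y2) -> raw score per class.
proposals = {
    (12, 30, 80, 110): [0.2, 2.9],
    (200, 40, 260, 90): [1.5, -0.7],
}

for box, scores in proposals.items():
    probabilities = softmax(scores)
    best = max(range(len(classes)), key=lambda i: probabilities[i])
    print(box, classes[best], round(probabilities[best], 3))
```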
The existing variants of Faster R-CNN differ in the set of convolutional layers used to extract image features for the region proposal part of the network. The VGG-16 variant with its 13 convolutional layers is deeper and more computationally intensive, whereas the ZF variant is shallower but faster.
Prior to the advent of deep neural networks, features in computer vision models were engineered manually for a specific task and the specific conditions under which the data had been collected. Deep learning models automatically learn these features, thus saving a considerable amount of development time.
The publication of the Sa et al. paper on the DeepFruits system marks an important milestone in the development of deep learning approaches to fruit detection. DeepFruits builds upon Faster R-CNN to detect sweet peppers and rock melons in images taken in a greenhouse. The system can be trained in a matter of a few hours, runs on commodity hardware and performs detection in around 400 milliseconds per image.
The authors use the VGG-16 variant of Faster R-CNN and show that filters in the first layer specialize in low-level features such as reddish and greenish colors corresponding to red and green sweet peppers. Regions in higher-level layers with strong activation often correspond to image regions that belong to fruits.
A fruit is considered to be detected when the intersection over union (IoU) score between the predicted and the ground-truth bounding box is greater than 0.4. Using this threshold, the authors report F1 scores between 0.8 and 0.84 for the task of sweet pepper detection. The best-performing approach combines proposals from separately trained models for RGB and near-infrared images using a scoring method termed “late fusion.”
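The evaluation can be sketched as follows. The 0.4 threshold comes from the paper as described above; the greedy one-to-one matching between detections and ground-truth boxes is an assumption of this sketch and may differ from the authors' exact protocol.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def f1_score(detections, ground_truth, threshold=0.4):
    """F1 score with greedy matching: each ground-truth box is matched at most once."""
    unmatched = list(ground_truth)
    true_positives = 0
    for det in detections:
        best = max(unmatched, key=lambda gt: iou(det, gt), default=None)
        if best is not None and iou(det, best) > threshold:
            true_positives += 1
            unmatched.remove(best)
    false_positives = len(detections) - true_positives
    false_negatives = len(unmatched)
    if true_positives == 0:
        return 0.0
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return 2 * precision * recall / (precision + recall)


# Made-up example: one good detection, one false alarm, one missed fruit.
truth = [(0, 0, 10, 10), (20, 20, 30, 30)]
found = [(1, 1, 10, 10), (50, 50, 60, 60)]
print(f1_score(found, truth))
```

In this example precision and recall are both 0.5, so the F1 score, their harmonic mean, is 0.5 as well.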
Interestingly, DeepFruits reaches a respectable F1 score of 0.8 when the Faster R-CNN component is fine-tuned on a mere 20 training images. It should be noted that the data sets used in the experiments were quite small. Each classifier was trained on around 100 images and tested on a few dozen images.
Since the publication of DeepFruits, a number of new papers have built upon it to solve similar and more challenging problems.
One notable example is a paper published by Bargoti & Underwood on fruit detection in orchards. The researchers used a robotic vehicle during daylight hours to capture high-resolution whole tree images for three fruit types: apples, almonds and mangoes. In total, the data set contains more than 2,000 training images and almost 500 test images.
The authors emphasize the increased difficulty due to the low pixel count per fruit and the high fruit count per image. Almond trees, for example, can host 1,000–10,000 almonds, which are smaller than the other fruit varieties.
Both DeepFruits and the system described in the Bargoti & Underwood paper use Faster R-CNN and treat fruit detection as a set of binary problems: one detector is trained for each fruit type. Another commonality is the application of non-maximum suppression (NMS), a procedure designed to handle overlapping regions. NMS eliminates regions associated with lower confidence that are strongly overlapping with high-confidence candidates, as measured by the Intersection over Union score.
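A minimal sketch of greedy NMS in plain Python is shown here. The IoU threshold of 0.5 is an assumed value for illustration; neither paper's setting is reproduced.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def non_maximum_suppression(detections, iou_threshold=0.5):
    """Keep the most confident box of every group of strongly overlapping boxes.

    detections is a list of (box, score) pairs.
    """
    remaining = sorted(detections, key=lambda d: d[1], reverse=True)
    kept = []
    while remaining:
        best = remaining.pop(0)  # highest-confidence candidate left
        kept.append(best)
        # Drop every candidate that overlaps too strongly with the kept box.
        remaining = [d for d in remaining
                     if iou(best[0], d[0]) <= iou_threshold]
    return kept


# Made-up detections: the first two boxes cover the same fruit.
candidates = [((0, 0, 10, 10), 0.9),
              ((1, 1, 11, 11), 0.8),
              ((50, 50, 60, 60), 0.7)]
print(non_maximum_suppression(candidates))
```

The second box overlaps the first with an IoU of about 0.68 and is removed, while the distant third box survives.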
Bargoti & Underwood experimented with different data augmentation techniques and found that the largest boost in performance is achieved by flipping and scaling the available images.
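For object detection, augmenting an image means transforming its bounding boxes as well. The sketch below shows one way to flip and scale an image together with its annotations. The function names, the nested-list image representation and the convention that boxes are (x1, y1, x2, y2) pixel edges are assumptions made for this example.

```python
def flip_horizontal(image, boxes):
    """Mirror the image left to right and mirror the boxes accordingly."""
    width = len(image[0])
    flipped = [row[::-1] for row in image]
    new_boxes = [(width - x2, y1, width - x1, y2) for (x1, y1, x2, y2) in boxes]
    return flipped, new_boxes


def scale(image, boxes, factor):
    """Resize with nearest-neighbour sampling and scale the box coordinates."""
    height, width = len(image), len(image[0])
    new_height = max(1, round(height * factor))
    new_width = max(1, round(width * factor))
    scaled = [[image[min(height - 1, int(y / factor))][min(width - 1, int(x / factor))]
               for x in range(new_width)]
              for y in range(new_height)]
    new_boxes = [tuple(c * factor for c in box) for box in boxes]
    return scaled, new_boxes


# Tiny 2x4 "image" with one box covering the two leftmost pixels of the top row.
image = [[0, 1, 2, 3],
         [4, 5, 6, 7]]
boxes = [(0, 0, 2, 1)]

flipped_image, flipped_boxes = flip_horizontal(image, boxes)
scaled_image, scaled_boxes = scale(image, boxes, 2)
```

After the flip, the box covers the two rightmost pixels of the top row, so the annotation still marks the same content.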
VGG-16 requires 2.5 GB of GPU memory for a 0.25-megapixel image. To overcome this bottleneck, the authors employ an approach that they refer to as “Tiled Faster R-CNN”: detections are performed using a window of appropriate size that slides over the image. To ensure that fruits are not split across tiles, the overlap between two tiles is greater than the maximum fruit size. NMS is then applied to the combined output of all tiles.
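The tiling logic can be sketched roughly as follows. This is an illustration of the idea, not the authors' implementation: the function names are made up, and detect_fn is a hypothetical stand-in for a detector that receives a tile's origin and size and returns (box, score) pairs in tile-local coordinates.

```python
def tile_origins(length, tile, overlap):
    """Start offsets along one axis; neighbouring tiles share at least `overlap` pixels."""
    if tile >= length:
        return [0]
    stride = tile - overlap
    if stride <= 0:
        raise ValueError("overlap must be smaller than the tile size")
    origins = list(range(0, length - tile, stride))
    origins.append(length - tile)  # last tile sits flush with the image border
    return origins


def detect_tiled(width, height, tile, overlap, detect_fn):
    """Run detect_fn on every tile and shift its boxes into image coordinates."""
    detections = []
    for oy in tile_origins(height, tile, overlap):
        for ox in tile_origins(width, tile, overlap):
            for (x1, y1, x2, y2), score in detect_fn(ox, oy, tile):
                detections.append(((x1 + ox, y1 + oy, x2 + ox, y2 + oy), score))
    return detections  # overlapping duplicates still need to be merged by NMS


def fake_detector(ox, oy, tile):
    """Dummy detector that reports one box in the corner of every tile."""
    return [((0, 0, 10, 10), 0.9)]


print(tile_origins(1000, 400, 100))
print(detect_tiled(1000, 400, 400, 100, fake_detector))
```

Choosing the overlap larger than the biggest expected fruit guarantees that every fruit lies completely inside at least one tile. The price is that fruits in the overlap zone are detected twice, which is why the combined detections are passed through NMS afterwards.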
Using this setup, the VGG variant of Faster R-CNN outperformed the shallower ZF variant and achieved F1 scores above 0.9 for apples and mangoes. For the smaller and more numerous almonds, the F1 score is reported to be close to 0.78. Interestingly, detection performance reaches 0.6 for apples with just five images and increases only by 0.01 for the last doubling of training images.
Finally, the authors point out that the right choice of Faster R-CNN variant depends on the task at hand. Computational efficiency, for example, is more important for robotic harvesting than it is for yield mapping, which can be performed offline.
The promising results that have been achieved with DeepFruits and related systems mark an exciting development in agricultural technology. Let’s hope that deep vision systems will make important contributions towards the goal of universal access to affordable, nutritious and delicious food.
Fruit counting, a higher-level task that builds on fruit detection and requires keeping track of fruits that have already been seen in previous frames, is a topic that may be discussed in a future article.
Bargoti, S. and Underwood, J., 2017, May. Deep fruit detection in orchards. In Robotics and Automation (ICRA), 2017 IEEE International Conference on (pp. 3626–3633). IEEE.
Barth, R., IJsselmuiden, J., Hemming, J. and Van Henten, E.J., 2018. Data synthesis methods for semantic segmentation in agriculture: A Capsicum annuum dataset. Computers and Electronics in Agriculture, 144, pp.284–296.
Rahnemoonfar, M. and Sheppard, C., 2017. Deep Count: fruit counting based on deep simulated learning. Sensors, 17(4), p.905.
Ren, S., He, K., Girshick, R. and Sun, J., 2015. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems (pp. 91–99).
Sa, I., Ge, Z., Dayoub, F., Upcroft, B., Perez, T. and McCool, C., 2016. DeepFruits: A fruit detection system using deep neural networks. Sensors, 16(8), p.1222.
Source: Deep Learning on Medium