FCRN Depth Map Prediction Accuracy: Implementing Depth Estimation on iOS Using an FCRN Model

This content is licensed under CC 4.0 BY-SA.

Original article: https://heartbeat.fritz.ai/implement-depth-estimation-on-ios-using-a-fcrn-model-7208c4f7c4d2

This article looks at the accuracy of depth map prediction with the FCRN model on iOS devices and walks through the implementation details.

Computer Vision (iOS)

Depth estimation is a major problem in computer vision, particularly for applications related to augmented reality, robotics, and even autonomous cars.

Traditional 3D sensors typically use stereoscopic vision, movement, or projection of structured light. However, these sensors depend on the environment (sun, texture) or require several peripherals (camera, projector), which leads to very bulky systems.

Many efforts have been made to build compact systems — perhaps the most remarkable are the light field cameras that use a matrix of microlenses in front of the sensor.

Recently, several depth estimation approaches based on deep learning have been proposed. These methods use a single point of view (a single image) and generally optimize a regression on the reference depth map.

The first challenge concerns the network architecture, which usually follows the advances proposed each year in the field of deep learning: VGG16, residual networks (ResNet), and so on.

The second challenge is defining an appropriate loss function for deep regression. Thus, the relationship between networks and objective functions is complex, and their respective influences are difficult to distinguish.

Previous methods exploit the geometric aspects of the scene to deduce the depth. Another known index for depth estimation is defocus blur.

Figure: Depth from Defocus method

However, depth estimation from defocus blur (Depth from Defocus, DFD) with a conventional camera and a single image suffers from two problems: ambiguity about which side of the focal plane an object lies on, and a blind zone within the camera's depth of field, where no blur can be measured. Furthermore, to estimate the depth of an unknown blurred scene, DFD requires a scene model and a blur calibration to relate the measured blur to a depth value.

Why mobile?

Augmented reality consists of inserting computer-generated images over real-world scenes, using a mobile phone camera or special glasses (e.g., HoloLens).

Small cameras located in the middle and outside of each lens send continuous video images to two small screens on the inside of the glasses.

Once connected to a computer, the data is combined with live/filmed reality, creating a unique stereoscopic field of view on the LCD screen, where the computer-generated images are superimposed with those of the real world.

In 2017, Apple had the clever idea of putting a depth sensor in the front-facing iPhone camera, mainly to improve the security and accuracy of Face ID. Alongside this, they also released the first version of ARKit.

Unfortunately, the back cameras lacked that feature. Many developers were eager to have the same depth data on the back cameras so they could understand, and even reconstruct, a 3D representation of the world and insert digital objects in more immersive and realistic ways.

For now, the only way to get depth data from the back camera is to predict the depth of a scene with a neural network, and the input can only be a single image.

FCRN

FCRN, short for Fully Convolutional Residual Networks, is one of the most-used models on iOS for depth prediction. The model is based on a CNN (ResNet-50) and predicts the depth of a scene from a single image; it leverages the residual network with a pre-trained model.

Traditional methods (depth from stereo images) work by taking two or more images and estimating a 3D model of the scene. This is done by finding matching pixels in the images and converting their 2D positions into 3D depths. But this traditional method requires special lenses with expensive equipment.

The stereoscopy process is modeled on human perception, thanks to the two flat images that we perceive from each eye. To put it simply, if two images of the same scene are acquired from different angles, then the depth of the scene creates a geometric disparity between them.

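The geometric relation behind stereo depth can be sketched in a few lines. For two parallel, calibrated cameras, depth equals focal length times baseline divided by disparity. The function name and the numbers below are illustrative assumptions, not values from the article.

```swift
/// Depth (in meters) of a point seen by two parallel cameras.
/// - focalLengthPixels: focal length expressed in pixels
/// - baselineMeters: distance between the two camera centers
/// - disparityPixels: horizontal shift of the matched pixel between the two images
/// Returns nil when the disparity is zero or negative (no usable match).
func stereoDepth(focalLengthPixels: Double,
                 baselineMeters: Double,
                 disparityPixels: Double) -> Double? {
    guard disparityPixels > 0 else { return nil }
    return focalLengthPixels * baselineMeters / disparityPixels
}

// A large disparity means the point is close; a small one means it is far away.
let nearPoint = stereoDepth(focalLengthPixels: 800, baselineMeters: 0.25, disparityPixels: 40) // 5.0 m
let farPoint = stereoDepth(focalLengthPixels: 800, baselineMeters: 0.25, disparityPixels: 10)  // 20.0 m
```

This is why stereo needs two views and reliable pixel matching, whereas the single-image approach described next has to learn depth cues from data.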

Deep learning approaches are quite different. Broadly speaking, we take a single image and predict the depth level for every pixel. The FCRN model is trained on the NYU Depth Dataset V2, which consists of 464 scenes, captured with a Microsoft Kinect, with the official split consisting of 249 training and 215 test scenes.

No need for me to go into details about the network architecture. I think the original research article, "Deeper Depth Prediction with Fully Convolutional Residual Networks" (Laina et al., 2016), is pretty straightforward and easy to understand.

Apple offers a Core ML version on its official website. Actually, there are two versions—the first one stores the full weights of the model using 32-bit precision, and the other is half-precision (16-bit).

I chose the first one because I found it to be the most consistent, but you can use either, depending on the phone you're running inference on. Picking the right one helps you optimize inference speed for the iPhone's compute units.

Build the iOS Application

Now we have our project ready to go. I don’t like using storyboards myself, so the app in this tutorial is built programmatically, which means no buttons or switches to toggle — just pure code 🤗.

To follow this method, you'll have to delete the Main.storyboard file and set up the window in your SceneDelegate.swift file (Xcode 11 only).

With Xcode 11, you’ll have to change the Info.plist file like so:

You need to delete the “Storyboard Name” in the file, and that’s about it.

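For reference, a programmatic SceneDelegate typically looks like the sketch below. It assumes the root view controller is the ViewController class used later in this tutorial; adapt the name if yours differs.

```swift
import UIKit

class SceneDelegate: UIResponder, UIWindowSceneDelegate {

    var window: UIWindow?

    func scene(_ scene: UIScene,
               willConnectTo session: UISceneSession,
               options connectionOptions: UIScene.ConnectionOptions) {
        guard let windowScene = scene as? UIWindowScene else { return }

        // Build the window in code instead of loading it from a storyboard.
        let window = UIWindow(windowScene: windowScene)
        window.rootViewController = ViewController() // assumed name of the main view controller
        window.makeKeyAndVisible()
        self.window = window
    }
}
```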

1. Set up the camera session

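// MARK: - Setup the Capture Session
fileprivate func setupCamera() {
    let captureSession = AVCaptureSession()
    // 640 x 480 is plenty: the model only takes a 304 x 228 image.
    captureSession.sessionPreset = .vga640x480

    // Back dual camera (as on the iPhone X); return early if it is unavailable.
    guard let captureDevice = AVCaptureDevice.default(.builtInDualCamera, for: .video, position: .back),
          let input = try? AVCaptureDeviceInput(device: captureDevice),
          captureSession.canAddInput(input) else { return }
    captureSession.addInput(input)

    // configureDesiredFrameRate(_:) is a project helper on AVCaptureDevice, not an AVFoundation API.
    captureDevice.configureDesiredFrameRate(50)

    // Show the live camera feed in the main view.
    let previewLayer = AVCaptureVideoPreviewLayer(session: captureSession)
    previewLayer.videoGravity = .resizeAspect
    previewLayer.connection?.videoOrientation = .portrait
    previewLayer.frame = view.frame
    view.layer.addSublayer(previewLayer)

    // Deliver frames to captureOutput(_:didOutput:from:) on a background queue.
    let dataOutput = AVCaptureVideoDataOutput()
    dataOutput.setSampleBufferDelegate(self, queue: DispatchQueue(label: "videoQueue"))
    guard captureSession.canAddOutput(dataOutput) else { return }
    captureSession.addOutput(dataOutput)

    // Start the session only after inputs and outputs are attached.
    captureSession.startRunning()
}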

Let's break down the code:

Instantiate an AVCaptureSession().
Set the video quality. I chose a low-resolution preset (640 x 480) because the model only takes a 304 x 228 image.
Set up which camera to use. In my case, I have an iPhone X, so I chose the builtInDualCamera on the back and set it for video.
Add the preview layer as a sublayer of our main view's layer.
Set up the video capture delegate and add the output to the capture session.
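
The call to configureDesiredFrameRate(50) in the code above is not part of AVFoundation, and the article does not show its implementation. One possible version, written here purely as an assumption, locks the device and sets the minimum and maximum frame durations when the active format supports the requested rate:

```swift
import AVFoundation

extension AVCaptureDevice {
    /// Hypothetical helper: pins the capture frame rate if the active format supports it.
    func configureDesiredFrameRate(_ desiredFrameRate: Int) {
        let fps = Double(desiredFrameRate)

        // Only proceed if one of the active format's frame-rate ranges contains the requested rate.
        let isSupported = activeFormat.videoSupportedFrameRateRanges.contains {
            $0.minFrameRate <= fps && fps <= $0.maxFrameRate
        }
        guard isSupported else {
            print("\(desiredFrameRate) fps is not supported by the active format")
            return
        }

        do {
            try lockForConfiguration()
            // A frame duration of 1/fps seconds for both bounds fixes the frame rate.
            let frameDuration = CMTime(value: 1, timescale: CMTimeScale(desiredFrameRate))
            activeVideoMinFrameDuration = frameDuration
            activeVideoMaxFrameDuration = frameDuration
            unlockForConfiguration()
        } catch {
            print("Could not lock the device for configuration: \(error)")
        }
    }
}
```

With this version, a rate the current format cannot deliver is simply ignored, so the session keeps its default frame rate.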

2. Predict

// MARK: - Setup Capture Session Delegate
func captureOutput(_ output: AVCaptureOutput, didOutput sampleBuffer: CMSampleBuffer, from connection: AVCaptureConnection) {
    
    guard let pixelBuffer: CVPixelBuffer = CMSampleBufferGetImageBuffer(sampleBuffer) else { return }
    
    let config = MLModelConfiguration()
    config.computeUnits = .all
    
    guard let myModel = try? MLModel(contentsOf: FCRN.urlOfModelInThisBundle, configuration: config) else {
        fatalError("Unable to load model")
    }
    
    guard let model = try? VNCoreMLModel(for: myModel) else {
        fatalError("Unable to load model")
    }
    
    let request = VNCoreMLRequest(model: model) { (request, error) in
        if let results = request.results as? [VNCoreMLFeatureValueObservation],
            let heatmap = results.first?.featureValue.multiArrayValue {
            
            let start = CFAbsoluteTimeGetCurrent()
            let (convertedHeatmap, convertedHeatmapInt) = self.convertTo2DArray(from: heatmap)
            let diff = CFAbsoluteTimeGetCurrent() - start
            
            print("Conversion to 2D took \(diff) seconds")
            DispatchQueue.main.async { [weak self] in
                self?.drawingView.heatmap = convertedHeatmap
                let start = CFAbsoluteTimeGetCurrent()
                let average = Float32(convertedHeatmapInt.joined().reduce(0, +))/Float32(20480)
                let diff = CFAbsoluteTimeGetCurrent() - start
                print("Average Took \(diff) seconds")
                
                print(average)
                if average > 0.35 {
                    self?.haptic()
                }
            }
        } else {
            fatalError("Model failed to process image")
        }
    }
    
    request.imageCropAndScaleOption = .scaleFill
    
    let handler = VNImageRequestHandler(cvPixelBuffer: pixelBuffer, options: [:])
    DispatchQueue.global().async {
        do {
            try handler.perform([request])
        } catch {
            print(error)
        }
    }
}

A lot of things are happening here, so let’s break it down:

All of the predictions happen inside the capture session delegate, so you need to implement the AVCaptureVideoDataOutputSampleBufferDelegate protocol.
Get a CVPixelBuffer from the sample buffer. This is the image we will feed to the model.
Create an MLModelConfiguration() instance to instruct the model to use all the compute units (CPU, GPU, and ANE). This does not guarantee that the phone will actually use them. (Matthijs Hollemans has a repository that explains this in more detail.)
Instantiate the model with that configuration.
Create an instance of VNCoreMLModel to feed to our Core ML request.
Then create the Core ML request using the model. Its completion handler receives the request, which carries the results, and an error.
The model returns a multiArrayValue, which is a multi-dimensional array with the depth values (the higher the values, the closer the object to the camera).
To create an image, we need to convert it into a 2D array (see part 3 below).
We use the converted array to draw a view with gray pixels.
Then we flatten the thresholded matrix and calculate the average of its black and white pixels. This gives us a number that we'll use to estimate the level of darkness in the image. The darker the image, the closer the objects are to the camera. I chose 0.35 as a threshold, but that's highly debatable, depending on the lighting conditions and also the type of device you're using.
And finally, the Vision image request handler (VNImageRequestHandler) takes our image in the form of a CVPixelBuffer and performs the request with our instance of VNCoreMLRequest.
Voilà!
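
One thing worth noting is that the delegate above rebuilds the MLModel and the VNCoreMLModel for every frame. A minimal sketch of loading them once and reusing them is shown below. It assumes the Xcode-generated FCRN class from the article is part of the target; DepthPredictor and predict(_:completion:) are hypothetical names introduced only for this illustration.

```swift
import CoreML
import Vision

final class DepthPredictor {

    // Built once, on first use, instead of once per frame.
    private lazy var visionModel: VNCoreMLModel? = {
        let config = MLModelConfiguration()
        config.computeUnits = .all
        guard let mlModel = try? MLModel(contentsOf: FCRN.urlOfModelInThisBundle,
                                         configuration: config) else { return nil }
        return try? VNCoreMLModel(for: mlModel)
    }()

    /// Runs the model on one frame and hands back the raw depth array, or nil on failure.
    func predict(_ pixelBuffer: CVPixelBuffer, completion: @escaping (MLMultiArray?) -> Void) {
        guard let visionModel = visionModel else {
            completion(nil)
            return
        }

        let request = VNCoreMLRequest(model: visionModel) { request, _ in
            let observations = request.results as? [VNCoreMLFeatureValueObservation]
            completion(observations?.first?.featureValue.multiArrayValue)
        }
        request.imageCropAndScaleOption = .scaleFill

        let handler = VNImageRequestHandler(cvPixelBuffer: pixelBuffer, options: [:])
        do {
            try handler.perform([request])
        } catch {
            print("Vision request failed: \(error)")
        }
    }
}
```

The delegate method would then keep a single DepthPredictor instance and call predict from captureOutput, leaving the conversion and drawing steps unchanged. Returning nil instead of calling fatalError also keeps a single failed frame from crashing the app.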

3. Convert output to a 2D matrix

Since the model returns a multi-array object, we need to transform it into a plane and return a 2D matrix, with every element being a value between 0 and 1, representing the gray intensity of each pixel:

extension ViewController {
    func convertTo2DArray(from heatmaps: MLMultiArray) -> (Array<Array<Double>>, Array<Array<Int>>) {
        guard heatmaps.shape.count >= 3 else {
            print("heatmap's shape is invalid. \(heatmaps.shape)")
            return ([], [])
        }
        let _/*keypoint_number*/ = heatmaps.shape[0].intValue
        let heatmap_w = heatmaps.shape[1].intValue
        let heatmap_h = heatmaps.shape[2].intValue
        
        var convertedHeatmap: Array<Array<Double>> = Array(repeating: Array(repeating: 0.0, count: heatmap_w), count: heatmap_h)
        
        var minimumValue: Double = Double.greatestFiniteMagnitude
        var maximumValue: Double = -Double.greatestFiniteMagnitude
        
        for i in 0..<heatmap_w {
            for j in 0..<heatmap_h {
                let index = i*(heatmap_h) + j
                let confidence = heatmaps[index].doubleValue
                guard confidence > 0 else { continue }
                convertedHeatmap[j][i] = confidence
                
                if minimumValue > confidence {
                    minimumValue = confidence
                }
                if maximumValue < confidence {
                    maximumValue = confidence
                }
            }
        }
        
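        let minmaxGap = maximumValue - minimumValue
        // A flat or empty map would make the division below produce NaN or infinity.
        guard minmaxGap > 0 else { return ([], []) }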
        
        for i in 0..<heatmap_w {
            for j in 0..<heatmap_h {
                convertedHeatmap[j][i] = (convertedHeatmap[j][i] - minimumValue) / minmaxGap
            }
        }
        
        var convertedHeatmapInt: Array<Array<Int>> = Array(repeating: Array(repeating: 0, count: heatmap_w), count: heatmap_h)
        for i in 0..<heatmap_w {
            for j in 0..<heatmap_h {
                if convertedHeatmap[j][i] >= 0.5 {
                    convertedHeatmapInt[j][i] = Int(1)
                } else {
                    convertedHeatmapInt[j][i] = Int(0)
                }
            }
        }
        
        return (convertedHeatmap,  convertedHeatmapInt)
    }
}

To make the average cheaper to compute, the method also thresholds the normalized map into 0s and 1s. It returns two arrays:

convertedHeatmap: 128 x 160 matrix of grayscale values (double values)
convertedHeatmapInt: 128 x 160 matrix of black and white (binary threshold) values (integers)
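
The normalization and thresholding steps can be illustrated on a flat array, without any Core ML types. This is a simplified sketch rather than the article's method: the function name is hypothetical, and it counts the pixels at or above the threshold directly instead of building a second integer matrix.

```swift
/// Min-max normalizes raw values to 0...1 and returns the fraction of
/// normalized values at or above `threshold` (the "white" pixels).
func normalizedAndBrightFraction(of values: [Double],
                                 threshold: Double = 0.5) -> (normalized: [Double], brightFraction: Double) {
    // A flat or empty input has no usable range, so return all zeros.
    guard let minValue = values.min(), let maxValue = values.max(), maxValue > minValue else {
        return ([Double](repeating: 0, count: values.count), 0)
    }
    let gap = maxValue - minValue
    let normalized = values.map { ($0 - minValue) / gap }
    let brightCount = normalized.filter { $0 >= threshold }.count
    return (normalized, Double(brightCount) / Double(values.count))
}

let result = normalizedAndBrightFraction(of: [1.0, 2.0, 3.0, 5.0])
// result.normalized is [0.0, 0.25, 0.5, 1.0]
// result.brightFraction is 0.5, since two of the four values are at or above 0.5
```

The bright fraction plays the same role as the average compared against 0.35 in the prediction step.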

4. Draw the depth view

// MARK: - Drawing View
var drawingView: DrawingView = {
    let map = DrawingView()
    map.contentMode = .scaleToFill
    map.backgroundColor = .lightGray
    map.autoresizesSubviews = true
    map.clearsContextBeforeDrawing = true
    map.isOpaque = true
    map.translatesAutoresizingMaskIntoConstraints = false
    return map
}()

Pretty straightforward:

Create a UIView class and instantiate a 2D array of double values.
Draw the scene using the converted array (convertedHeatmap) and assign to each pixel a grayscale value using UIColor() white values and an alpha channel of 1.
Then draw the geometry of each pixel using CGRect and UIBezierPath().
Set the color and fill the pixel.
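
The article does not list the DrawingView class itself, so here is a minimal sketch of what the steps above describe. It assumes the heatmap property assigned in the prediction step, and it treats the outer array index as the horizontal position, which matches how convertedHeatmap is filled in part 3; the author's actual class may differ.

```swift
import UIKit

class DrawingView: UIView {

    // Redraw whenever a new depth map arrives.
    var heatmap: [[Double]]? {
        didSet { setNeedsDisplay() }
    }

    override func draw(_ rect: CGRect) {
        guard let heatmap = heatmap,
              let firstColumn = heatmap.first,
              !firstColumn.isEmpty else { return }

        // Size of one depth "pixel" once stretched over the view.
        let cellWidth = bounds.width / CGFloat(heatmap.count)
        let cellHeight = bounds.height / CGFloat(firstColumn.count)

        for (x, column) in heatmap.enumerated() {
            for (y, value) in column.enumerated() {
                let cell = CGRect(x: CGFloat(x) * cellWidth,
                                  y: CGFloat(y) * cellHeight,
                                  width: cellWidth,
                                  height: cellHeight)
                // Normalized value in 0...1 mapped to a gray level, fully opaque.
                UIColor(white: CGFloat(value), alpha: 1).setFill()
                UIBezierPath(rect: cell).fill()
            }
        }
    }
}
```

Filling about 20,000 small rectangles per frame is simple but not cheap, which is part of why the conclusion calls this a proof of concept.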

5. Add haptic feedback

When the average reaches above 0.35, the phone will vibrate to give feedback to the user.

// MARK: - Set and activate the haptic feedback
fileprivate func haptic() {
    let impactFeedbackgenerator = UIImpactFeedbackGenerator(style: .heavy)
    impactFeedbackgenerator.prepare()
    impactFeedbackgenerator.impactOccurred()
}

You can use any feedback you want—I chose to set the UIImpactFeedbackGenerator() to heavy, but you can custom build your own.

Conclusion

The application is just a proof of concept. My iPhone X takes way too long to process each image (around 630 ms). That's far too much time, considering that I also have to convert the output, draw the view, perform the binary threshold calculation, and then compute the average to decide whether an object is close to the phone.

But if you have one of the following phones (iPhone XR, iPhone XS, iPhone XS Max, iPhone 11, iPhone 11 Pro, …), you might get a better result. I estimated that the iPhone 11 Pro Max takes less than 150 ms, which is around 4 times faster than the iPhone X.

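To put those latencies in perspective, here is the arithmetic as a short sketch. It uses only the two figures quoted above (630 ms and 150 ms) and ignores the extra time spent converting and drawing.

```swift
/// Upper bound on frames per second when each prediction takes `latencyMilliseconds`.
func framesPerSecond(latencyMilliseconds: Double) -> Double {
    return 1000.0 / latencyMilliseconds
}

let iPhoneXFPS = framesPerSecond(latencyMilliseconds: 630)          // roughly 1.6 predictions per second
let iPhone11ProMaxFPS = framesPerSecond(latencyMilliseconds: 150)   // roughly 6.7 predictions per second
let speedup = 630.0 / 150.0                                         // 4.2, the "around 4 times" above
```

Even the faster figure is well below the camera's frame rate, so most frames would have to be skipped.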

There’s probably more room for improvement, especially when handling matrices—there are ways to optimize the calculations with smart algorithms. But that’s probably for another article.

If Apple keeps improving the internal components of the iPhone, and with the help of new optimizations on the model side, I can picture this implementation as a way to help people with visual impairments, especially if Apple decides to put the same depth camera used for Face ID on the back of the iPhone. That would be a big step forward for computer vision in the iOS ecosystem.

Thank you for reading this article. If you have any questions, don’t hesitate to send me an email at omarmhaimdat@gmail.com.

This project is available to download from my GitHub account:
