Thursday, August 18, 2016

In Depth: Comma.ai's First Autopilot Paper, Based on Road Video Prediction

Lei Feng's Network note: this article is a compiled translation.

Background on George Hotz and Comma.ai:

In 2007, George Hotz became the first person to carrier-unlock the iPhone, and in 2010 he was the first to crack the Sony PS3. He interned at Google and Facebook and spent four months at SpaceX. In 2015 he worked at the artificial-intelligence startup Vicarious, leaving in July; in September of the same year he founded Comma.ai, researching self-driving technology alone in his garage and announcing that he would challenge Google and Mobileye. In April of this year the company received a $3.1 million investment. On August 6, George Hotz open-sourced the research code and paper. (Paper and source code download)


Original authors: Eder Santana and George Hotz. The compiled full text follows:

Comma.ai's strategy for applying artificial intelligence to self-driving is to build an agent and, through models that predict future events on the road, train it to imitate human driving behavior and driving plans. This paper presents our current research approach to a driving simulator: we study the variational autoencoder (VAE) and the generative adversarial network (GAN) as cost functions for video prediction, and we train a transition model based on a recurrent neural network (RNN) on top of the learned embedding.

Although the transition model is optimized without a cost function in pixel space, it shows us a way to keep the predictions realistic over multiple frames.

| Introduction

Self-driving cars [1] are one of the most promising near-term areas of artificial-intelligence research, because driving generates large amounts of labeled, context-rich data. Given the perceptual and control complexity involved, self-driving technology, once realized, will also open up many interesting topics, such as action recognition from video and driving planning. At this stage, using cameras as the primary sensor, combined with visual processing and artificial intelligence, gives self-driving an advantage in cost.

Thanks to progress in deep recurrent neural network learning and to more convenient interaction with virtual environments, vision-based control and reinforcement learning have had successes, as in references [7][8][9][10]. This interactive setting lets us repeatedly test different policies in a scenario and simulate all possible events to train neural-network-based controllers. For example, AlphaGo [9] uses deep convolutional neural networks (CNNs), trained on its accumulated games of Go, to predict the winning probability of the next move; its game engine simulates the possible continuations of a position and searches them with Monte Carlo tree search. At present, learning to play games such as Torcs [7] or Atari [8] from screen frames requires hours of training.

Because it is difficult for a learning agent to interact exhaustively with reality, there are currently two kinds of solution: one is to hand-build a simulator; the other is to train a model to predict future scenarios. The former involves defining the rules of the physical world and stochastic modeling with real domain expertise, but such expertise already covers most of the information associated with control, as in existing flight simulators [11], robots [12], and so on.

We focus on letting the agent itself learn to simulate and predict real-world scenarios, taking as input the video stream of a front-facing camera mounted on the windshield.


Early work trained controllers in simulations based on the state space of a physical agent [13]; other models relying on visual processing alone could only cope with low-dimensional or texturally simple video, such as Atari games [14][16]. For video with complex textures, passive video prediction has been used to recognize actions [17].

This paper complements the existing video-prediction literature: we let the model learn to predict real video scenes for the controller itself, computing a low-dimensional compression of the frames and its transitions conditioned on the corresponding actions. In the next section we describe the dataset of real road video that we captured for prediction.

| Datasets

We open-sourced part of the self-driving test data used in this paper. The cameras and sensors in the dataset are consistent with those used on Comma.ai's self-driving test platform. A Point Grey camera was mounted on the windshield of our Acura ILX 2016, capturing images of the road at 20 Hz. The released dataset contains a total of 7.25 hours of driving data in 11 videos; the video frames are 160×320-pixel images captured from the recorded video. Besides video, the dataset includes data from several sensors measured at different frequencies and interpolated to 100 Hz; the sampled data include vehicle speed, steering angle, GPS, gyroscope, and IMU. Details of the dataset and the measurement devices can be found on the companion website.

We recorded timestamps for the sensors and for each captured video frame, and linearly interpolated the sensor data over the test time to synchronize it with the video. We also released the raw videos and the sensor data in HDF5 format, a format chosen because it is easy to use from machine-learning and control software.

This article focuses on the video frames, steering angle, and vehicle speed. For preprocessing, we downsampled the raw images to 80×160 and renormalized the pixel values to between -1 and 1. A sample image is shown in Figure 1. In the next section we define the problem this paper sets out to study.
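
To make the preprocessing concrete, here is a minimal sketch in Python. The HDF5 file name and the dataset key "X" are assumptions for illustration, not the released layout.

import h5py
import numpy as np

def preprocess(frames):
    # frames: (N, 160, 320, 3) uint8 images from the camera.
    frames = frames.astype(np.float32)
    # 2x2 mean pooling downsamples 160x320 to 80x160.
    n, h, w, c = frames.shape
    frames = frames.reshape(n, h // 2, 2, w // 2, 2, c).mean(axis=(2, 4))
    # Renormalize pixels from [0, 255] to [-1, 1].
    return frames / 127.5 - 1.0

# Hypothetical file and key names:
with h5py.File("camera.h5", "r") as f:
    batch = preprocess(f["X"][:64])
print(batch.shape)  # (64, 80, 160, 3)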

| Problem Definition

Let x_t denote the t-th frame of the dataset, so that a video of n frames is written:

$$x_{1:n} = \{x_1, x_2, \ldots, x_n\}$$

The control signals s_t are paired directly with the image frames:

$$s_{1:n} = \{s_1, s_2, \ldots, s_n\}$$

where each $s_t \in \mathbb{R}^2$ corresponds to the vehicle speed and steering angle at time t.

For image prediction we define an estimator F:

$$F: \mathbb{R}^{n \times 80 \times 160 \times 3} \times \mathbb{R}^{n \times 2} \to \mathbb{R}^{80 \times 160 \times 3}$$

The prediction for the next frame is then:

$$\hat{x}_{t+1} = F(x_{t+1-n:t}, \; s_{t+1-n:t})$$
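
To make the interface concrete, here is a minimal sketch of the estimator's signature with a trivial "repeat the last frame" baseline; the baseline and the context length are ours, for illustration only.

import numpy as np

N_CTX = 5  # n: number of context frames (illustrative value)

def F_baseline(x_context, s_context):
    # x_context: (n, 80, 160, 3) frames; s_context: (n, 2) speed and steering.
    # Trivial baseline: predict that the scene does not change.
    assert x_context.shape == (N_CTX, 80, 160, 3)
    assert s_context.shape == (N_CTX, 2)
    return x_context[-1]  # predicted next frame, shape (80, 160, 3)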

Note that F is defined over very high-dimensional, mutually correlated spaces; machine-learning problems of this kind are prone to slow convergence or to underfitting the data [26].

Studies have shown [20] that a convolutional dynamic network without appropriate regularization can simulate a single dataset well while predicting the rest of the data poorly.

One approach is to train the estimator F directly on simple, artificial video [14]. Recent papers [20][17] show that video with complex textures can be predicted, but they neither solve motion transfer nor generate a compact intermediate representation of the data. In other words, their models have no downsampling or low-dimensional hidden code, but are implemented entirely with convolutions. Because filtering and control outputs are ill-defined in a dense space [18], a compact intermediate representation is vital to our work.

As far as we know, this is the first paper to predict subsequent video frames from real road scenes. For that reason, we decided to break the function F into pieces so that it can be debugged block by block.

First, we learned an autoencoder that embeds each frame x_t into a Gaussian latent space:

$$z_t = \mathrm{Enc}(x_t), \qquad z_t \in \mathbb{R}^{2048}, \quad z_t \sim \mathcal{N}(0, I_{2048})$$

The dimensionality of 2048 was determined experimentally, and the Gaussian assumption is enforced with variational autoencoding Bayes [1]. This first step simplifies the problem of learning transitions from pixel space to the latent space. Assuming the autoencoder learns the Gaussian latent space correctly, then as long as the transition model never leaves the high-density region of the embedding space, we can simulate realistic video. That high-density region is a hypersphere of radius ρ, a function of the embedding dimensionality and the variance of the Gaussian prior. In the next sections we describe the autoencoder and the transition model in detail.
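
The high-density hypersphere is a standard concentration-of-measure fact: for z ~ N(0, I_d), the norm ||z|| concentrates tightly around sqrt(d). A quick numerical check (ours, for illustration):

import numpy as np

d = 2048
z = np.random.randn(10000, d)     # 10,000 samples from N(0, I_d)
norms = np.linalg.norm(z, axis=1)
print(np.sqrt(d))                 # ~45.25
print(norms.mean(), norms.std())  # mean ~45.25, std ~0.71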

| Driving Simulator

Given the complexity of the problem, we did not attempt end-to-end learning, but used separate networks to learn video prediction. The proposed architecture is based on two models: an autoencoder for dimensionality reduction and an RNN for the transition. The complete model is shown in Figure 2.

[Figure 2: the complete model — an autoencoder for dimensionality reduction and an RNN transition model]

Autoencoder. We chose to learn a model that embeds the data in a latent space with a Gaussian probability distribution, specifically to avoid concentrating the embedding in low-probability, discontinuous regions of the hypersphere around the origin; such regions would prevent the transition model from learning continuous transformations in the latent space. Variational autoencoders [1] and related work [19][21] place a Gaussian prior on the latent space of a generative model of the original data. However, the Gaussian assumption does not hold in the original data space for natural images, so VAE predictions look blurry (see Figure 3). Generative adversarial networks (GAN) [22] and related work [2][3], on the other hand, learn the generative model together with its cost function, so the generator and discriminator networks can be trained alternately.

The generative model maps samples from the latent distribution into the data space; the discriminator network learns to distinguish real samples from the dataset from the samples produced by the generator; and the generator plays the role of fooling the discriminator. The discriminator can therefore be seen as a learned cost function for the generator.

We not only need to learn a generator from the latent space to image space, we also need to be able to encode images back into the latent space, so the VAE and GAN networks have to be combined. Intuitively, a simple way is to use the VAE approach with the GAN cost function directly. Donahue et al. [23] propose a bidirectional GAN that learns a bijective transformation between the two spaces. Lamb et al. [24] propose discriminative generative networks, using feature differences of a previously trained classifier as part of the cost function. Finally, Larsen et al. [25] train the VAE and GAN networks together: the encoder optimizes both the Gaussian prior in the latent space and the similarity of features extracted by the GAN network; the generator takes random latent samples and the encoder's outputs as input, and is optimized both to fool the discriminator and to minimize the feature-similarity error between the original and decoded images. The discriminator, as always, is trained to tell real images from fakes.

We used the method of Larsen et al. [25] to train the autoencoder; the diagram in Figure 2 illustrates the model. As described in their paper [25], the encoder (Enc), generator (Gen), and discriminator (Dis) networks are optimized to minimize the following cost function:

$$\mathcal{L} = \mathcal{L}_{prior} + \mathcal{L}_{llike}^{Dis_l} + \mathcal{L}_{GAN}$$

In the above formula,

$$\mathcal{L}_{prior} = D_{KL}\big(q(z \mid x) \,\|\, p(z)\big)$$

is the Kullback-Leibler divergence between the encoder output distribution q(z|x) and the prior distribution p(z); it is the same regularizer used in VAEs. p(z) is the Gaussian distribution N(0, I), and we optimize this regularizer with the reparametrization trick: throughout training z = μ + ε ⊙ σ, while at test time z = μ (where μ and σ are outputs of the encoder network, and ε is a Gaussian random vector of the same dimension as μ and σ).
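
A minimal sketch of the reparametrization trick and the closed-form Gaussian KL regularizer as they are commonly implemented (PyTorch; ours, for illustration):

import torch

def reparametrize(mu, logvar, training=True):
    # Training: z = mu + eps * sigma; test time: z = mu.
    if not training:
        return mu
    sigma = torch.exp(0.5 * logvar)
    eps = torch.randn_like(sigma)
    return mu + eps * sigma

def l_prior(mu, logvar):
    # D_KL( N(mu, sigma^2) || N(0, I) ), summed over latent dims.
    return -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=1).mean()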

The second term is an error computed from the activations of the l-th hidden layer of the discriminator network, Dis_l; it is calculated between a legal image x from the dataset and its encoded-decoded counterpart Gen(Enc(x)).

Assuming:

$$p\big(\mathrm{Dis}_l(x) \mid z\big) = \mathcal{N}\big(\mathrm{Dis}_l(x) \,\big|\, \mathrm{Dis}_l(\tilde{x}),\, I\big), \qquad \tilde{x} = \mathrm{Gen}(\mathrm{Enc}(x)),$$

we obtain:

$$\mathcal{L}_{llike}^{Dis_l} = -\mathbb{E}_{q(z \mid x)}\big[\log p\big(\mathrm{Dis}_l(x) \mid z\big)\big]$$

During training, to avoid extra complication, the layer l is usually held fixed.

The last term, $\mathcal{L}_{GAN}$, is the standard generative adversarial network cost [22]; it expresses the game between Gen and Dis. While Dis is trained, Enc and Gen are held fixed:

$$\mathcal{L}_{GAN} = \log\big(\mathrm{Dis}(x)\big) + \log\big(1 - \mathrm{Dis}(\mathrm{Gen}(u))\big) + \log\big(1 - \mathrm{Dis}(\mathrm{Gen}(\mathrm{Enc}(x)))\big)$$

Here u is a random variable drawn from the normal distribution N(0, I). The first term of the formula is the log-likelihood of Dis recognizing legal images; the remaining two terms are its log-likelihood of spotting fake image samples, whether decoded from random vectors u or from encodings z = Enc(x).


While Gen is trained, Dis and Enc are held fixed:

$$\mathcal{L}_{GAN} = \log\big(\mathrm{Dis}(\mathrm{Gen}(u))\big) + \log\big(\mathrm{Dis}(\mathrm{Gen}(\mathrm{Enc}(x)))\big)$$

which expresses Gen fooling the discriminator network Dis. In [25], the second term of the equation, the one involving Enc(x), is usually set to 0 during training.
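
A sketch of the two alternating objectives above, written as losses to minimize (PyTorch-style; Dis is assumed to output probabilities, Enc to return a sampled code, and the small eps is ours for numerical safety — a paraphrase of the equations, not the released code):

import torch

EPS = 1e-8

def dis_loss(Dis, Gen, Enc, x, u):
    # Train Dis with Enc and Gen fixed: maximize
    # log Dis(x) + log(1 - Dis(Gen(u))) + log(1 - Dis(Gen(Enc(x)))).
    real = torch.log(Dis(x) + EPS)
    fake_u = torch.log(1 - Dis(Gen(u)) + EPS)
    fake_z = torch.log(1 - Dis(Gen(Enc(x))) + EPS)
    return -(real + fake_u + fake_z).mean()

def gen_loss(Dis, Gen, Enc, x, u):
    # Train Gen with Dis and Enc fixed: maximize
    # log Dis(Gen(u)) + log Dis(Gen(Enc(x))); the Enc(x) term is the one
    # that [25] usually sets to zero during this step.
    fool_u = torch.log(Dis(Gen(u)) + EPS)
    fool_z = torch.log(Dis(Gen(Enc(x))) + EPS)
    return -(fool_u + fool_z).mean()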

We trained the autoencoder for 200 epochs, each epoch consisting of 10,000 gradient updates with a batch size of 64; as described in the previous section, samples were drawn at random from the driving data. We optimized with Adam [4], and our network architectures follow Radford et al. [3]. The generator consists of 4 deconvolutional layers, each followed by batch normalization and a leaky-ReLU activation. The discriminator and the encoder are built from several convolutional layers, each immediately followed by batch normalization, with ReLU activations. Dis_l is the output of the third convolutional layer of the discriminator network, after the batch normalization and ReLU operations. The discriminator's output size is 1 and its cost function is binary cross-entropy; the encoder's output size is 2048, a compact representation of 1/16 of the raw input dimensionality. Details can be seen in Figure 2 or in the companion code; encoded-decoded image samples are shown in Figure 3.
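
For concreteness, a sketch of this kind of architecture in PyTorch. The channel counts and kernel sizes are our guesses, chosen only so that the 80×160×3 ↔ 2048 shapes work out; this is not the released code.

import torch
import torch.nn as nn

class Enc(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(
            # Four stride-2 convolutions: 80x160 -> 5x10.
            nn.Conv2d(3, 64, 4, 2, 1), nn.BatchNorm2d(64), nn.ReLU(),
            nn.Conv2d(64, 128, 4, 2, 1), nn.BatchNorm2d(128), nn.ReLU(),
            nn.Conv2d(128, 256, 4, 2, 1), nn.BatchNorm2d(256), nn.ReLU(),
            nn.Conv2d(256, 256, 4, 2, 1), nn.BatchNorm2d(256), nn.ReLU(),
        )
        self.mu = nn.Linear(256 * 5 * 10, 2048)      # 2048-d code
        self.logvar = nn.Linear(256 * 5 * 10, 2048)

    def forward(self, x):
        h = self.conv(x).flatten(1)
        return self.mu(h), self.logvar(h)

class Gen(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(2048, 256 * 5 * 10)
        self.deconv = nn.Sequential(
            # Four deconvolutional layers with batchnorm and
            # leaky-ReLU: 5x10 -> 80x160.
            nn.ConvTranspose2d(256, 256, 4, 2, 1), nn.BatchNorm2d(256), nn.LeakyReLU(0.2),
            nn.ConvTranspose2d(256, 128, 4, 2, 1), nn.BatchNorm2d(128), nn.LeakyReLU(0.2),
            nn.ConvTranspose2d(128, 64, 4, 2, 1), nn.BatchNorm2d(64), nn.LeakyReLU(0.2),
            nn.ConvTranspose2d(64, 3, 4, 2, 1), nn.Tanh(),  # pixels in [-1, 1]
        )

    def forward(self, z):
        return self.deconv(self.fc(z).view(-1, 256, 5, 10))

class Dis(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = Enc().conv  # same convolutional shape as Enc
        self.out = nn.Sequential(nn.Linear(256 * 5 * 10, 1), nn.Sigmoid())

    def forward(self, x):
        return self.out(self.conv(x).flatten(1))  # probability, output size 1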

After training the autoencoder, we fixed all its weights and used Enc as a preprocessing step for training the transition model, which we discuss in the next section.

| Transition Model

After training the autoencoder, we used Enc to convert the dataset, x_t → z_t, and trained an RNN, (z_t, h_t, c_t) → ẑ_{t+1}, to represent transitions in the code space:

$$h_{t+1} = \tanh\big(W z_t + V h_t + U c_t\big), \qquad \hat{z}_{t+1} = A\, h_{t+1}$$

In the formula, W, V, U, A are the trained weights, h_t is the RNN hidden state, and c_t is the control signal, which directly carries the vehicle speed and steering angle. LSTM and GRU transitions, as well as multiplicative interactions between c_t and z_t, are left for future research. The cost function used to optimize the training weights is the mean squared error (MSE):

$$\mathcal{L}_{MSE} = \big\| z_{t+1} - \hat{z}_{t+1} \big\|^2$$

This cost is well founded here because, when we trained the autoencoder, the distribution of the codes z was constrained to be Gaussian by $\mathcal{L}_{prior}$; in other words, mean squared error is the matching (maximum-likelihood) cost for a normally distributed random variable. Given the predicted code

$$\hat{z}_{t+1} = A\, h_{t+1},$$

the predicted video frame can be expressed as

$$\hat{x}_{t+1} = \mathrm{Gen}(\hat{z}_{t+1})$$
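
A minimal sketch of this transition cell in PyTorch. The formulas follow the equations above; the hidden size of 512 is our assumption for illustration.

import torch
import torch.nn as nn

class Transition(nn.Module):
    # h_{t+1} = tanh(W z_t + V h_t + U c_t);  zhat_{t+1} = A h_{t+1}
    def __init__(self, z_dim=2048, c_dim=2, h_dim=512):
        super().__init__()
        self.W = nn.Linear(z_dim, h_dim, bias=False)
        self.V = nn.Linear(h_dim, h_dim, bias=False)
        self.U = nn.Linear(c_dim, h_dim, bias=False)
        self.A = nn.Linear(h_dim, z_dim, bias=False)

    def forward(self, z_t, h_t, c_t):
        h_next = torch.tanh(self.W(z_t) + self.V(h_t) + self.U(c_t))
        return self.A(h_next), h_next   # (zhat_{t+1}, h_{t+1})

# MSE cost against the true next code:
# loss = ((z_next - zhat_next) ** 2).sum(dim=1).mean()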

We trained the transition model on sequences of 15 video frames, giving the first 5 frames as inputs and letting the network learn to predict the following 10. That is, Enc(x_t) is computed to obtain z_1, ..., z_5; after that, the model's own outputs

$$\hat{z}_6, \hat{z}_7, \ldots, \hat{z}_{15}$$

are fed back in as inputs. In the RNN literature, feeding the outputs back as inputs in this way is called RNN hallucination. To avoid complicating the optimization, we set the gradients of the outputs fed back as inputs to zero.
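
A sketch of that 5-in/10-out training rollout, where .detach() implements the "gradient set to zero" on fed-back outputs (our paraphrase, reusing the Transition cell sketched above):

import torch

def rollout(model, z_seq, c_seq, n_context=5):
    # z_seq: 15 true codes (each of shape (batch, 2048));
    # c_seq: 15 control signals. Returns predictions zhat_6..zhat_15.
    h = torch.zeros(z_seq[0].shape[0], 512)  # assumed hidden size
    z_in = z_seq[0]
    preds = []
    for t in range(14):
        zhat, h = model(z_in, h, c_seq[t])
        if t >= n_context - 1:
            preds.append(zhat)
            z_in = zhat.detach()  # hallucinate: feed back, stop gradients
        else:
            z_in = z_seq[t + 1]   # teacher-forced context frames
    return preds                  # 10 predicted codes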

| Test Results

In this study we spent most of our energy on getting the autoencoder to preserve the texture of the road. As mentioned above, we studied different cost functions; although their mean squared errors are about the same, the GAN-based cost function obtains the best visual results. In Figure 3 we show decoded images from the two models trained with the different cost functions. As expected, the images from the MSE-based neural network are blurry, erroneously merging multiple lane markings into one long single lane.

[Figure 3: frames decoded by autoencoders trained with the MSE-based and GAN-based cost functions]

In addition, the blurry reconstructions cannot preserve the edges of the vehicle ahead, which makes it hard to estimate the distance to it; this is the most important reason this method cannot be adopted. On the other hand, the learning curves of the MSE-based models fall faster than those of the adversarial models. Perhaps encoding steering-angle information together with the pixels could avoid this problem; we keep this issue for future research.

Once we had a good autoencoder, we could begin training the transition model. Predicted frames are shown in Figure 4. We trained the transition model on 5 Hz video; after learning, it consistently preserves the road-frame structure even after 100 frames. With different seed frames sampled for the transition model, we observed driving behavior such as passing lanes, approaching the vehicle in front, and the vehicle in front driving away; but the model does not simulate curves. When we initialize the model with frames from a curve, it quickly straightens the lanes and goes back to simulating straight driving. Even though the cost function was not optimized precisely in pixel space, this model can still learn transitions from the video. We also believe that more powerful transition models (such as deep RNNs, LSTM, GRU) and contextual encoding (conditioning the video samples on sensor data such as the steering angle and speed) will produce more realistic simulations.

The dataset released with this paper contains all the sensor data necessary for the experiments of this method.

| Conclusion

This paper introduced Comma.ai's preliminary research on learning a driving simulator for vehicles: a video-prediction model based on an autoencoder and an RNN. We did not try to learn everything end-to-end; instead, we trained the autoencoder with a GAN-based cost function to obtain realistic images, and then trained the RNN transition model in the embedded space. While the results of the autoencoder and the transition model look quite realistic, more research is needed to simulate all the events associated with driving. To stimulate further research on self-driving, we are releasing a dataset with the sampled video as well as sensor data such as vehicle speed and steering angle, and the source code of the neural networks we are currently training is open source.

"References"

[1] Diederik P Kingma and Max Welling, "Auto-encoding variational bayes," arXiv preprint arXiv:1312.6114, 2013.

[2] Emily L Denton, Soumith Chintala, Rob Fergus, et al., "Deep generative image models using laplacian pyramid of adversarial networks," in Advances in Neural Information Processing Systems, 2015.

[3] Radford, Alec, Luke Metz, and Soumith Chintala. "Unsupervised representation learning with deep convolutional generative adversarial networks." arXiv preprint arXiv:1511.06434, 2015.

[4] Diederik Kingma and Jimmy Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.

[5] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio, "Generative adversarial nets," in Advances in Neural Information Processing Systems, 2014.

[6] Alireza Makhzani, Jonathon Shlens, Navdeep Jaitly, and Ian Goodfellow, "Adversarial Autoencoders," arXiv preprint arXiv:1511.05644, 2015.

[7] Jan Koutník, Giuseppe Cuccu, Jürgen Schmidhuber, and Faustino Gomez, "Evolving large-scale neural networks for vision-based reinforcement learning," Proceedings of the 15th annual conference on Genetic and evolutionary computation, 2013.

[8] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, et al., "Human-level control through deep reinforcement learning," Nature, 2015.

[9] David Silver, Aja Huang, Chris Maddison, et al., "Mastering the game of Go with deep neural networks and tree search," Nature, 2016.

[10] Sergey Levine, Peter Pastor, Alex Krizhevsky, and Deirdre Quillen, "Learning Hand-Eye Coordination for Robotic Grasping with Deep Learning and Large-Scale Data Collection," arXiv preprint arXiv:1603.02199, 2016.

[11] Brian L Stevens, Frank L Lewis and Eric N Johnson, "Aircraft Control and Simulation: Dynamics, Controls Design, and Autonomous Systems," John Wiley & Sons, 2015.

[12] Eric R Westervelt, Jessy W Grizzle, Christine Chevallereau, et al., "Feedback control of dynamic bipedal robot locomotion," CRC press, 2007.

[13] HJ Kim, Michael I Jordan, Shankar Sastry, Andrew Y Ng, "Autonomous helicopter flight via reinforcement learning," Advances in neural information processing systems, 2003.

[14] Junhyuk Oh, Xiaoxiao Guo, Honglak Lee, et al., "Action-conditional video prediction using deep networks in atari games," Advances in Neural Information Processing Systems, 2015.

[15] Manuel Watter, Jost Springenberg, Joschka Boedecker and Martin Riedmiller, "Embed to control: A locally linear latent dynamics model for control from raw images," Advances in Neural Information Processing Systems, 2015.

[16] Jürgen Schmidhuber, "On learning to think: Algorithmic information theory for novel combinations of reinforcement learning controllers and recurrent neural world models," arXiv preprint arXiv:1511.09249, 2015.

[17] Michael Mathieu, Camille Couprie and Yann LeCun, "Deep multi-scale video prediction beyond mean square error," arXiv preprint arXiv:1511.05440, 2015.

[18] Ramon van Handel, "Probability in high dimension," DTIC Document, 2014.

[19] Eder Santana, Matthew Emigh and Jose C Principe, "Information Theoretic-Learning Autoencoder," arXiv preprint arXiv:1603.06653, 2016.

[20] Eder Santana, Matthew Emigh and Jose C Principe, "Exploiting Spatio-Temporal Dynamics for Deep Predictive Coding," Under Review, 2016.

[21] Alireza Makhzani, Jonathon Shlens, Navdeep Jaitly and Ian Goodfellow, "Adversarial Autoencoders", arXiv preprint arXiv:1511.05644, 2015.

[22] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, et al., "Generative adversarial nets," Advances in Neural Information Processing Systems, 2014.

[23] Jeff Donahue, Philipp Krähenbühl and Trevor Darrell, "Adversarial Feature Learning," arXiv preprint arXiv:1605.09782, 2016.

[24] Alex Lamb, Vincent Dumoulin and Aaron Courville, "Discriminative Regularization for Generative Models," arXiv preprint arXiv:1602.03220, 2016.

[25] Anders Boesen Lindbo Larsen, Søren Kaae Sønderby, Hugo Larochelle and Ole Winther, "Autoencoding beyond pixels using a learned similarity metric," arXiv preprint arXiv:1512.09300, 2015.

[26] Jose C Principe, Neil R Euliano and W Curt Lefebvre, "Neural and adaptive systems: fundamentals through simulations with CD-ROM," John Wiley & Sons.

Lei Feng's Network (search for "Lei Feng's Network" to follow our public account) note: please contact us for authorization before reprinting, keep the source and author, and do not delete content.
