Estimating time, a waste of time?

Sander Gielisse
10 min read · Jun 18, 2021

Seminar Computer Vision by Deep Learning Project Blog
by Olaf Braakman and Sander Gielisse

Introduction

What time is it? Can you tell without looking at a clock? You can probably make a reasonable guess. Using vision together with all sorts of other inputs, most of us can roughly estimate the time of day without consulting a clock at all. We decided to ask a computer the same question: how well can a deep neural network tell the time without using its millisecond-accurate internal clock? In this blog we hope to give some insights towards answering that question.

Finding a Suitable Dataset

Before starting the construction of a deep neural network, we first wanted to make sure there was a suitable dataset available. However, finding one was not straightforward. Although there are plenty of high-quality datasets available nowadays, almost none contain timestamps of when the images were taken. So we started our scavenger hunt and eventually settled on the MIRFLICKR dataset (LIACS Medialab, 2010). This dataset contains one million images split over ten directories of 12 GB each. More importantly, an EXIF file is available for most of the images. EXIF data holds metadata about an image file, such as the camera brand and model, shutter speed and aperture settings, and, crucially, the time at which the photo was taken. We extracted this timestamp from each image’s EXIF file as the true label for that image. However, there are no sanity checks on these timestamps: we came across ludicrous values with an hour larger than 23, and some EXIF files did not contain a timestamp at all. After filtering out all such images, we were left with roughly half of the original dataset.
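
As a minimal sketch of this labeling step, the snippet below reads the hour of day from a JPEG’s embedded EXIF data and rejects unusable values. (MIRFLICKR actually ships the EXIF data as separate text files, and the helper name extract_hour is our own, but the filtering idea is the same.)

```python
from PIL import Image

def extract_hour(path):
    """Return the hour (0-23) a photo was taken, or None if unusable."""
    exif = Image.open(path)._getexif() or {}  # flattened EXIF tags (JPEG only)
    stamp = exif.get(36867)  # 36867 = DateTimeOriginal, e.g. '2008:06:18 14:32:07'
    if not stamp:
        return None  # no timestamp at all
    try:
        hour = int(stamp.split(' ')[1].split(':')[0])
    except (IndexError, ValueError):
        return None  # malformed timestamp
    return hour if 0 <= hour <= 23 else None  # drop ludicrous values
```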

Looking for an Approach

Next we had to define a goal. What exactly are we going to try to predict? We ranked our ideas by increasing level of complexity:

  • Four buckets (morning, afternoon, evening, night)
  • Hour-of-day classes (an extension of the buckets, {0,…,23} ⊂ ℕ)
  • Predict an angle (like a clock’s hour hand, [0, 2π) ⊂ ℝ)
  • Von Mises (predict a distribution around a circle, see image below)

For the four-bucket classification approach we found one existing paper (Sharma, 2016), which shows that a vanilla SVM and a three-layer AlexNet achieve around 50 and 60 percent accuracy, respectively. Given these results, we did not seriously consider the extended version with 24 classes, since it is technically the same approach, just with more classes. We really wanted to explore the two regression options: the angle-based approach and the even more complex approach of predicting a distribution. We opted to research the more complex approach first, since we still had plenty of time on our hands and could always fall back on the simpler approach of predicting an angle.

Before trying out random things, we first had to observe a key property of the problem: our data is circular. To demonstrate why some approaches fail here, consider a loss function that ignores this circularity, such as a plain MSE on the raw timestamps. Such a loss would treat 14:00 and 16:00 as two hours apart, which is fine, but it would treat 23:00 and 01:00 as twenty-two hours apart, even though the two times are only two hours apart on the clock.
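
A tiny sketch of the difference, with hours as plain integers (the function names are our own):

```python
def naive_hour_error(true_h, pred_h):
    # ignores circularity: 23 vs 1 gives 22 hours
    return abs(true_h - pred_h)

def circular_hour_error(true_h, pred_h):
    # wraps around midnight: 23 vs 1 gives 2 hours
    d = abs(true_h - pred_h) % 24
    return min(d, 24 - d)

print(naive_hour_error(23, 1))     # 22
print(circular_hour_error(23, 1))  # 2
```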

Next, we looked at a more complex approach: the Von Mises approach, named after the distribution we are predicting. The Von Mises distribution is essentially a Gaussian distribution wrapped around the unit circle (Statistical Odds & Ends, 2020). It is defined by two parameters: a mean and a concentration parameter k, where a higher concentration essentially means a lower variance, and as k approaches zero the distribution approaches a uniform one. The following image depicts a Von Mises distribution with mean π and some value of k.

A Von Mises distribution

The idea behind this approach is as follows. During the daytime the light intensity changes quite a lot, whereas during the nighttime there are no lighting cues at all across a bigger span of hours. By introducing a certainty parameter, in this case the concentration of the distribution, we assumed the network could learn to predict a higher concentration for ‘day time’ images and a lower one for ‘night time’ images, hopefully resulting in a lower total loss. That was our initial intuition, but we did not yet have a clear idea of what the appropriate loss function would be.
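
To get a numerical feel for what k does, here is a quick illustration using SciPy’s vonmises distribution (the k values are chosen arbitrarily):

```python
import numpy as np
from scipy.stats import vonmises

x = np.linspace(-np.pi, np.pi, 5)
for kappa in (0.01, 1.0, 8.0):
    # low k: near-uniform density; high k: sharply peaked at the mean (0 here)
    print(kappa, np.round(vonmises.pdf(x, kappa, loc=0.0), 3))
```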

Initial Network Design

From the Von Mises distribution we can use the negative log-likelihood as a loss function for the network. For predicting μ, we realized that this might suffer from the same problem as mentioned before. If we project the 24 hours of a day onto [0, 2π), it becomes clear that 0.01π is actually quite close to 1.99π. For a neural network, however, these are numerically very different; a single gradient step cannot move the output from 0.01π to 1.99π or vice versa. In an attempt to solve this issue, we came up with predicting the x and y components of μ separately. We restrict the output values to be between -1 and 1 via a tanh activation function. This gives a point P within the unit square, and with some basic trigonometry we can compute the angle between the vector (0, 1) and the vector from (0, 0) through P. The advantage of this approach is that both the predicted x and y components are continuous on the domain [-1, 1] without any jumps; the aforementioned large jump from 0.01π to 1.99π does not show up in the x and y predictions.
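
A minimal PyTorch sketch of this output head and loss, under our own assumptions: the angle is recovered with atan2 rather than the (0, 1)-vector construction described above, and softplus keeps k positive; neither detail is prescribed by the text.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class VonMisesHead(nn.Module):
    """Maps backbone features to (mu, kappa) for a Von Mises over the day."""
    def __init__(self, in_features):
        super().__init__()
        self.fc = nn.Linear(in_features, 3)  # x, y components of mu, plus kappa

    def forward(self, feats):
        out = self.fc(feats)
        xy = torch.tanh(out[:, :2])           # each component in [-1, 1]
        mu = torch.atan2(xy[:, 1], xy[:, 0])  # continuous angle, no 0/2pi jump
        kappa = F.softplus(out[:, 2])         # concentration must be positive
        return mu, kappa

def von_mises_nll(mu, kappa, theta):
    """Negative log-likelihood of the true angle theta under VonMises(mu, kappa)."""
    # log I0(kappa) via the exponentially scaled Bessel function, for stability
    log_i0 = torch.log(torch.special.i0e(kappa)) + kappa
    return (-kappa * torch.cos(theta - mu) + math.log(2 * math.pi) + log_i0).mean()
```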

Von Mises distribution for different k values (concentration)

Architecture

For the architecture we considered multiple approaches. While the time-bin paper mentioned earlier used AlexNet, its scores have been improved upon significantly over the past years. An early example of this is VGG-16/19, but even these are considered somewhat outdated in terms of raw performance. Within the high-accuracy spectrum, three architectures are most often mentioned: Inception, ResNet and DenseNet. While we consider Inception and DenseNet to be more complex than ResNet, we do not expect their performance to be significantly better for our experiment. For that reason, we chose the ResNet architecture as a trade-off between complexity and performance.

For the ResNet architecture, we simply remove the last fully connected layer and replace it with our own; this layer predicts two numbers for μ and an optional third for the k needed in the Von Mises experiments. Next, we investigated the desired complexity among ResNet-18/34/50/101/152, where these numbers correspond to the number of layers. We initially chose ResNet-50 as a nice trade-off between complexity and performance. This worked well for a large dataset like MIRFLICKR, but as we will discuss below, we later used a significantly smaller dataset. There we observed severe overfitting, so we reduced our architecture to ResNet-18 instead. We also saw long training times for our initial runs, and therefore decided to take a transfer learning approach. By initializing the layers with weights from a network pre-trained on ImageNet, the idea is that our training only has to learn the final custom layer, together with some so-called “fine-tuning” of the layers before it. This sped up convergence significantly, suggesting that ImageNet images contain features our model benefits from, as we expected. We use the Adam optimizer with a learning rate of 1e-4 and a batch size of 64; all of these values were found experimentally.
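
Setting this up takes only a few lines in PyTorch; a sketch assuming the torchvision ResNet-18, where the three outputs correspond to the head described above:

```python
import torch
from torchvision import models

# ResNet-18 pre-trained on ImageNet, with the final layer swapped out
model = models.resnet18(pretrained=True)
model.fc = torch.nn.Linear(model.fc.in_features, 3)  # x, y for mu, plus k

# all layers stay trainable so the earlier ones can be fine-tuned
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
```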

Problems with the MIRFLICKR dataset

Since we observed a severe imbalance in the labels, we investigated multiple approaches to resolve it. First, we tried adding a higher weight to the loss of underrepresented samples. While this did improve performance somewhat, it was difficult to estimate the exact weight. We therefore instead binned the data per hour as shown below. The underrepresented bins were simply over-sampled until all bins had the size of the largest bin. While this adds lots of ‘duplicates’ to the dataset, we do not consider this a problem thanks to our data augmentation: after loading an image, we resize it to 256x256, take a random 224x224 crop, randomly flip it and then randomly rotate it.
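
A sketch of both steps, assuming integer hour labels and torchvision transforms (the rotation range is our own guess; the text does not specify one):

```python
import numpy as np
from torchvision import transforms

def oversample_indices(hours, rng=np.random.default_rng()):
    """Repeat-sample each hourly bin up to the size of the largest bin."""
    hours = np.asarray(hours)
    bins, counts = np.unique(hours, return_counts=True)
    target = counts.max()
    idx = [rng.choice(np.where(hours == b)[0], target, replace=True) for b in bins]
    return np.concatenate(idx)

train_tf = transforms.Compose([
    transforms.Resize((256, 256)),
    transforms.RandomCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.RandomRotation(15),  # assumed range; not given in the text
    transforms.ToTensor(),
])
```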

Imbalanced data of the MIRFLICKR dataset; this would induce a bias towards predicting times around hour 15.

After analyzing the results obtained when training on the MIRFLICKR dataset, we concluded that it contains too many ‘unsolvable’ images: images that simply do not contain any useful features for telling the time.

Building a Custom Dataset

We were not content with the amount of irrelevant images in the MIRFLICKR dataset, so we decided to see if we could build a custom dataset consisting mostly of outdoor frames. To that end, we programmed a Flickr crawler that searches for images by tag and keeps only those with a date-time property in the corresponding EXIF data (a sketch of such a crawler is given after the tag list). In total the custom dataset contains around 10k images sampled randomly from the following tags:

‘outside’, ‘morning’, ‘afternoon’, ‘evening’, ‘night’, ‘city’, ‘raining’, ‘winter’, ‘summer’, ‘autumn’, ‘fall’, ‘building’, ‘farm’, ‘street’, ‘beach’, ‘castle’, ‘valley’, ‘mountain’, ‘wildlife’, ‘creek’, ‘lake’, ‘forest’, ‘river’, ‘skyline’, ‘desert’, ‘harbour’, ‘village’, ‘hill’, ‘airport’ and ‘park’.
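
As a rough sketch of what such a crawler can look like, here is a version using the flickrapi package; our actual crawler may differ, the API key values are placeholders, and the per-tag paging is an arbitrary choice:

```python
import flickrapi

API_KEY, API_SECRET = 'your-key', 'your-secret'  # placeholders
flickr = flickrapi.FlickrAPI(API_KEY, API_SECRET, format='parsed-json')

def crawl_tag(tag, pages=2):
    """Yield (image URL, capture timestamp) pairs for one tag."""
    for page in range(1, pages + 1):
        resp = flickr.photos.search(tags=tag, per_page=100, page=page,
                                    extras='date_taken,url_c')
        for photo in resp['photos']['photo']:
            url, taken = photo.get('url_c'), photo.get('datetaken')
            if url and taken:  # keep only photos with a usable timestamp
                yield url, taken
```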

We observe a very similar hourly distribution in the custom dataset, confirming our belief that simply more photos are taken during the late afternoon. We therefore applied the same dataset balancing to our own dataset.

Afterwards we still found issues with our own dataset. Despite being far more appropriate for the problem we are trying to solve, it still contained images for which it is impossible to tell the time. Furthermore, the dataset also contains some black-and-white images. Although it would have been possible to filter these out, we did not have enough time to do so.
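
Filtering the black-and-white images would not take much code; a simple heuristic sketch (the tolerance is an arbitrary choice of ours):

```python
import numpy as np
from PIL import Image

def looks_grayscale(path, tol=5):
    """Treat an image as grayscale if its RGB channels are nearly identical."""
    img = np.asarray(Image.open(path).convert('RGB'), dtype=np.int16)
    r, g, b = img[..., 0], img[..., 1], img[..., 2]
    return abs(r - g).mean() < tol and abs(g - b).mean() < tol
```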

Circular Loss Instead of Von Mises

Earlier we discussed the Von Mises approach; however, the results were not that promising, and we suspected the model was too complex for the limited amount of data in our custom dataset. We therefore decided to dumb our approach down to the angle prediction mentioned earlier. A more standard loss function that does take the circularity of the data into account is the cosine similarity. A convergence plot of such a run is given below.
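
A sketch of this loss, together with a helper for converting the angular error back to hours for the plot below (the (x, y) output head described earlier is assumed; both function names are our own):

```python
import math
import torch
import torch.nn.functional as F

def cosine_loss(pred_xy, true_hour):
    """1 - cosine similarity between the predicted vector and the true angle."""
    theta = true_hour / 24.0 * 2 * math.pi
    target = torch.stack([torch.cos(theta), torch.sin(theta)], dim=-1)
    return (1 - F.cosine_similarity(pred_xy, target, dim=-1)).mean()

def error_in_hours(pred_xy, true_hour):
    """Absolute angular error mapped back to hours (0 to 12)."""
    pred = torch.atan2(pred_xy[..., 1], pred_xy[..., 0])
    true = true_hour / 24.0 * 2 * math.pi
    diff = torch.atan2(torch.sin(pred - true), torch.cos(pred - true))
    return diff.abs() * 24 / (2 * math.pi)
```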

ResNet-18 — fine-tuning on the custom dataset — pre-trained on ImageNet

In the graph above, our loss is given in hours. Initially the network is as good as random guessing, which on a 24-hour circle corresponds to an expected error of six hours. After many experiments, we found the run above to be optimal: three epochs on our chosen settings. The validation loss turns out to be only about a single hour better than random guessing on the 24-hour scale. While this did not initially seem like a great achievement, we were somewhat surprised after looking at the images the network made its predictions for. Below we discuss the first few images of our test dataset (thus randomly selected).

These images reflect some common ‘mistakes’ made by the model. The top-left prediction is simply wrong and too far off. The prediction for the top-right image actually sounds plausible, but its true label is wrong, so we cannot really blame the model. In the bottom left we observe that the model confused a sunrise with a sunset. The prediction for the bottom-right image is not necessarily wrong, but the time is almost impossible to tell from the image itself; so again, the model is not really to blame.

The following series of images shows that the model can in fact also make some pretty good predictions. This could of course be plain luck, but as the convergence graph shows, the test loss is in fact almost one and a half hours better than random guessing.

Some decent results; just luck?

Conclusion

Even though the accuracy is not amazing, we did not really expect it to be in the first place. For indoor images, a human would have no idea either and would probably guess randomly, unless there is a readable clock somewhere; and we do not expect our neural network to implicitly learn the skill of reading clocks. For outdoor images, a moon indicates night, but even this gives no specific hour; a moon at dusk might indicate either evening or sunrise, so there are no definitive conclusions there either. We consider this project more as a fun example of trying to find something that humans do not really know how to do, but a neural network perhaps would. Our hypothesis was that certain positions of the sun, colors of the sky and the surroundings might give away more information about the time than humans consciously notice. While there are some images for which the network makes a nice prediction, we do not observe any surprising results and thus consider this hypothesis not confirmed.

This blog was written as part of the Seminar Computer Vision by Deep Learning course to share the experience gained by realizing a custom computer-vision project from scratch using deep learning. All code is publicly available on GitHub. We would like to thank our TA, Yunqiang Li, for his continuous support throughout the project.

References

LIACS Medialab. (2010). The MIRFLICKR Retrieval Evaluation. http://press.liacs.nl/mirflickr/mirdownload.html

Sharma, P. (2016). Automated Image Timestamp Inference Using Convolutional Neural Networks.

Statistical Odds & Ends. (2020). What is the von Mises distribution? https://statisticaloddsandends.wordpress.com/2020/02/25/what-is-the-von-mises-distribution
