A novel deep-learning based approach to DNS over HTTPS network traffic detection

ABSTRACT


INTRODUCTION
The domain name system (DNS) over hypertext transfer protocol secure (HTTPS) protocol (DoH) is currently supported by the most commonly used web browsers. In DoH, the DNS requests and responses [1] are encapsulated in the data of the HTTPS protocol [2]. This approach allows DoH to be easily forwarded through network firewalls or Internet routers. Because of the encryption, the content of the requests and responses cannot be inspected.
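To make the encapsulation concrete, the following minimal sketch builds a DNS query in wire format and wraps it in a DoH GET URL in the way RFC 8484 describes (unpadded base64url in the `dns` parameter). The resolver URL and the helper names are illustrative assumptions and are not taken from our measurement setup; the code generates no network traffic.

```python
import base64
import struct


def build_dns_query(name: str, qtype: int = 1) -> bytes:
    """Build a minimal DNS query in wire format; qtype 1 is an A record."""
    # Header: ID=0 (recommended by RFC 8484), flags=0x0100 (recursion desired),
    # QDCOUNT=1, ANCOUNT=NSCOUNT=ARCOUNT=0.
    header = struct.pack("!HHHHHH", 0, 0x0100, 1, 0, 0, 0)
    # Question name: length-prefixed labels terminated by a zero byte.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    # QTYPE followed by QCLASS (1 = IN).
    return header + qname + struct.pack("!HH", qtype, 1)


def doh_get_url(resolver_url: str, query: bytes) -> str:
    """Encode the query as unpadded base64url in the `dns` URL parameter."""
    encoded = base64.urlsafe_b64encode(query).rstrip(b"=").decode("ascii")
    return f"{resolver_url}?dns={encoded}"


query = build_dns_query("example.com")
# Hypothetical resolver endpoint, used only for illustration.
print(doh_get_url("https://doh.example/dns-query", query))
```

On the wire, such a request is indistinguishable from any other HTTPS request once TLS is applied, which is why detection has to rely on traffic shape rather than content.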
Internet service providers (ISPs) commonly use their own DNS servers to monitor the domains that users access via the HTTPS protocol. This is practical because, since TLS version 1.3, the transport layer security (TLS) encryption used by HTTPS can completely hide the accessed domain name in the packet header. When users enable the DoH protocol in their web browsers, there is practically no possibility to detect which domains they access. For a specific, potentially dangerous user, it can be helpful to mark or drop the DoH traffic packets and force the user to fall back to the normal unencrypted version of the DNS protocol. Robust, fast and reliable detection of DoH packets within a huge volume of common network traffic is a non-trivial task.
In our work, we first created a specific DoH dataset retrieved from real DoH network traffic coming from DoH servers placed in the Czech network interchange center (CZ.NIC). We then identified a specific DoH network pattern and created two deep-learning-based neural network models. The created models were then used to identify unknown DoH traffic coming from Cloudflare network services. We performed many experiments aimed at creating machine-learning models able to detect DoH traffic within normal network traffic. Our final convolutional models were applied to the identification of unknown DoH network traffic coming from Cloudflare and achieved an accuracy near 95%. The rest of the paper consists of the following four sections: i) Current state-of-the-art: this section introduces the DoH protocol and gives a detailed overview of previously created machine-learning models used for DoH detection; ii) Our approach and proposed solution: this section presents the algorithm used for preprocessing the captured DoH data and describes the dense and convolutional machine-learning models used for DoH detection together with their hyperparameter optimization; iii) Results: this section introduces a new dataset used for training the DoH detection models and gives an overview of the accuracy achieved by the models introduced in the previous section; and iv) Conclusion: this section summarizes the results of our work and proposes possible directions for future work.

CURRENT STATE-OF-THE-ART
Because DoH is a relatively new technology, there are few papers on this topic. The most complete source we have found is [3]. The author mentions the following detection options: i) TLS inspection, ii) profiling of encrypted traffic, iii) checking connections to servers from a list of DoH providers (frequent updates are needed), iv) a whitelist of network applications and their proper configurations, and v) statistical processing of metadata (e.g., number of bytes transmitted, connection length, and average packet length).
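Option v) can be illustrated with a short sketch that derives the three metadata values named above from a list of packets. The input layout (timestamp in seconds, size in bytes) and the names `FlowStats` and `flow_stats` are assumptions made for this illustration; the example values are made up and are not measured data.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class FlowStats:
    total_bytes: int
    duration_s: float
    mean_packet_len: float


def flow_stats(packets: List[Tuple[float, int]]) -> FlowStats:
    """Summarize one flow given (timestamp_seconds, size_bytes) pairs."""
    if not packets:
        return FlowStats(0, 0.0, 0.0)
    times = [t for t, _ in packets]
    sizes = [s for _, s in packets]
    total = sum(sizes)
    # Connection length is approximated by the span between first and last packet.
    return FlowStats(total, max(times) - min(times), total / len(sizes))


# Made-up example values for illustration only.
print(flow_stats([(0.0, 120), (0.5, 300), (2.0, 180)]))
```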
According to the author, TLS inspection is much more difficult since the introduction of TLS 1.3, where certificates are encrypted and therefore no longer transmitted in plaintext as in earlier versions. Another extension of TLS 1.3 encrypts the server name indication (SNI) field, so that it is no longer possible to read the domain name from the TLS handshake. More information about SNI encryption can be found in Rescorla et al. [4].
Another possibility is to use artificial intelligence methods. A very interesting paper on DoH detection using machine learning techniques is [5]. The authors focused on two things: i) simple detection of DoH traffic and ii) identification of the concrete web browser involved in DoH communication. The features mainly concerned statistics of communication delays, sizes of transmitted data and their ratios within one flow. It should be noted that the feature vector did not contain any data on IP addresses and ports, which could be used to easily identify the DoH server and thus DoH traffic. To train and test selected machine learning (ML) models (5-NN, C4.5, random forest (RF), naive Bayes (NB), AdaBoosted decision tree (DT)), the authors created and then published their dataset, including DoH traffic from the currently most used web browsers, Mozilla Firefox and Google Chrome. The results were excellent both for simple DoH detection and for identification of the concrete web browser. An interesting experimental finding of the authors was that the feature that most distinguishes DoH traffic from normal HTTPS traffic is the flow duration, which is much longer in the case of DoH.
Bushart and Rossow [6] and Wang et al. [7] investigated the possibility of recognizing websites visited via DoH with DNS message padding (RFC 8467). They created a mechanism called DNS Sequences, which describes the time sequence of DNS response sizes and gaps (in millisecond log scale) between responses when visiting a particular website. They used the k-nearest neighbors (k-NN) classifier to classify DNS Sequences into websites. They used 10,000 websites from the Tranco list for learning and testing. The presented results show that a classifier would be able to classify about 80% of websites with 90% success (9 out of 10 samples). The authors also suggested possible countermeasures that should radically reduce the success of the recognition of visited websites.
Hynek and Cejka [8] evaluated the possibilities of inferring individual domain names from encrypted DoH connections. The authors based their solution on two findings: i) for each website load, it is possible to observe multiple DNS packet bursts, as each loaded asset might have other dependencies, and ii) although the order of DNS responses can be shuffled, DNS packet sizes remain almost unchanged within one web page load. For each response, they defined 3 neighbor zones according to the time distance from the response (the time distances were determined experimentally). For each zone separately, they then measured statistics (min, max, mean, median, variance) of the size of the DoH responses that fell into the zone. From these statistics, they finally selected a total of 11 as a feature vector. They experimented with several ML models; the best results were achieved using the combination of the AdaBoosted decision tree and the Bagging meta-learning algorithm. According to the authors, their classifier can infer domain names with an accuracy of up to 90% on the HTTP 1.1 protocol and up to 70% on the HTTP 2 protocol.
Classification of DoH traffic into benign or malicious classes was the main goal of the work [9]. The authors created two ensemble learning classifiers. The first one consisted of decision tree (DT), logistic regression (LR) and k-NN; the second one was the RF classifier. The CIRA-CIC-DoHBrw-2020 benchmark dataset [10] was used for training and testing. According to the results, the RF classifier achieved the best possible results (100%) in terms of precision, recall and F1-score, while the composed DT, LR and k-NN ensemble classifier performed only slightly worse.
Testing different kinds of machine learning algorithms (beyond artificial neural networks (ANNs)) for DoH traffic recognition, including its classification into benign vs. malicious, was the focus of the work [11]. Among other things, the authors also focused on feature engineering, in the process of which they identified and removed features that were insignificant for traffic classification. They thus achieved a noticeable reduction in the time required for model training and prediction. The average prediction time of all the tested algorithms was below 1 s, allowing their eventual use in practice. As in the work of Wang et al. [7], the CIRA-CIC-DoHBrw-2020 dataset was used for training and classification.
A novel deep-learning based approach to DNS over HTTPS network traffic detection (Jan Fesl)
Trying to get around complicated feature engineering led Ding et al. [12] to propose an end-to-end anomaly detection model based on a variational autoencoder which incorporates the attention mechanism. They used a bidirectional GRU-based network to automatically learn the feature representations and detect anomalies via reconstruction error. Huang et al. [13] deal with the evaluation of several ML models detecting malicious DoH traffic. The authors optimized hyperparameters of several models and compared them with respect to precision, accuracy, recall and F1-score. They concluded that RF and DT models performed best in comparison with k-NN, 1D convolutional neural network (CNN), 2D CNN and long short-term memory (LSTM) models. Training and testing were conducted on the CIRA-CIC-DoHBrw-2020 dataset. A CNN-oriented approach was also used in [14].
The testing of different machine learning algorithms for distinguishing DoH traffic from classic web traffic was also addressed in [15], [16]. Unlike the other works reported here, the training and testing dataset was generated by the authors themselves, by capturing traffic to the top 20,000 most visited domains from Alexa's list of top 1 million websites. Beyond the DoH traffic detection itself, they also focused on finding techniques that would, in turn, significantly reduce the detection accuracy of the trained ML model used, e.g., on the ISP side.
Casanova and Lin [17] used LSTM and bidirectional long short-term memory (BiLSTM) models for DoH traffic classification. Their aim was to create a generalized, portable model, independent of the target deployment environment. The BiLSTM model achieved better results in terms of accuracy and of training and classification time.
Jha et al. [18] focus on the detection of DoH tunneling. The authors built a test environment and created their own dataset, which they used along with CIRA-CIC-DoHBrw-2020. They observed that many tunneling instances had large packet sizes and long request durations, so they used outlier detection models, namely k-NN, SVM, deep factorization machines (DeepFM) and RF. All the mentioned models, except k-NN, achieved excellent results of over 99% in terms of F1-score.
In addition to achieving high DoH traffic detection accuracy, Zebin et al. [19] and Banadaki [20] focused on the transparency of ML model decision making, which has received increasing attention in recent years in the context of explainable machine learning. The authors constructed a Balanced Stacked Random Forest classifier and used Shapley additive explanations (SHAP) values to illustrate the impact of individual features on the model's decision making. The detection results were also convincing, over 99.9% in terms of F1-score.
Steadman and Scott-Hayward [21] deal with the design of a system architecture for the detection and mitigation of malicious DNS and DoH communication. The DoHxP architecture is based on SDN. Nguyen and Park [22] proposed a Transformer-based system for detecting malicious DoH tunneling. Its main advantage over conventional supervised machine learning approaches is that training requires significantly less labeled data: the authors report around 20% of the amount used in other research, while achieving an F1-score over 99%.
Zhan et al. [23] and Li et al. [24] also focused on the detection of DoH tunneling. They divided the detection into 2 phases. In the first, preliminary phase, they tested TLS data and compared it with fingerprints of DoH clients; according to the authors' findings, fingerprints of DoH clients are often unique. In the second phase, flow-based features were fed to the trained ML classifier. The authors tested a total of 3 classifiers (boosted DT, RF and LR), investigating the effect of location and of the recursive resolver used on detection accuracy. Shatoori et al. [25] came up with the idea of packet clumps (sequences of consecutive packets in a network flow) to find patterns in a limited window of traffic, which can reduce detection latency. They examined and analyzed the dependency of accuracy and response time on the number of packet clumps in a sliding window.
N-shot learning was used by Zou et al. [26]. Their model, called Depl, outputs the websites a user visited. Depl uses BiLSTM to extract features of the input data, which consists of sequences of packet sizes. They achieved remarkable results using a very small number of training samples: only 5 samples were enough for an accuracy of around 86% in a closed environment. Al-Fawa'reh et al. [27] combined bidirectional recurrent neural networks (RNNs) with statistical methods and achieved very good accuracy for a specific dataset.

OUR APPROACH AND PROPOSED SOLUTION
Our team has developed a special platform allowing us to capture and visualize the data retrieved from network traffic probes. A graphical representation of the traffic per packet is depicted in Figure 1; the green spots mean the direction from a client to a server and the red spots the reverse direction. The X-axis represents the time and the Y-axis the packet size in bytes. In our work, we have introduced the term DoH handshake (DoHH) for a DoH connection. A DoHH means one complete DNS request/response over HTTPS. One DoH connection contains an initial TLS handshake and then a sequence of standalone DoHHs, depicted in detail in Figures 2 and 3. This approach proved suitable and was used for the evaluation of practical measurements. The motivation for introducing the DoHH mechanism was the effort to divide the communication between the client and the server into logically related parts, which would subsequently allow easier processing and separation of DoH traffic.
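The idea of cutting a connection into request/response units can be sketched with a small two-state machine. This is only a simplified illustration and not the algorithm used in our platform: it assumes that the TLS handshake packets and pure TCP acknowledgements have already been filtered out, that each packet carries a direction label (`"c2s"` or `"s2c"`, a hypothetical encoding), and that requests are not pipelined.

```python
from typing import List, Tuple

# A packet is (direction, size_bytes); direction is "c2s" or "s2c" (assumed encoding).
Packet = Tuple[str, int]


def split_into_exchanges(packets: List[Packet]) -> List[List[Packet]]:
    """Group packets into request/response exchanges with a two-state machine.

    State "request": collecting client-to-server packets.
    State "response": collecting server-to-client packets.
    A client packet seen in the "response" state closes the current exchange.
    """
    exchanges: List[List[Packet]] = []
    current: List[Packet] = []
    state = "request"
    for direction, size in packets:
        if state == "request":
            if direction == "s2c":
                if not current:
                    continue  # response without a preceding request: ignore it
                state = "response"
        elif direction == "c2s":
            # A new request starts, so the previous exchange is complete.
            exchanges.append(current)
            current = []
            state = "request"
        current.append((direction, size))
    if current and state == "response":
        exchanges.append(current)  # trailing requests without a response are dropped
    return exchanges


trace = [("c2s", 100), ("s2c", 200), ("s2c", 50), ("c2s", 90), ("s2c", 300), ("c2s", 80)]
print(split_into_exchanges(trace))
```

Real traffic is less tidy (retransmissions, interleaved requests, jitter), so a production segmenter needs more states and timing information than this sketch uses.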

ML-model creation, testing, and optimization
The approach to model design and preprocessing of input data has gradually evolved. The first version of the model was based on the basic paradigms associated with neural networks, and fully connected (dense) layers were used as its crucial layers. However, the process of designing a specific model is not described by a general methodology, so this phase was supported by the creation of a specialized software tool forming an extension of the Keras API. All presented models were optimized and trained on an NVIDIA GeForce 2080 Ti GPU. The optimization superstructure can be divided into several layers. The fundamental layer represents hyperparameters and other factors influencing the model's learning (hereafter only parameters). Appropriate classes are defined for individual parameter types, allowing the user to specify the allowed range and the method of selecting a specific parameter value. Each parameter instance of any type is then able to generate a value within the set limits. A specific group of parameters describes the model's internal structure, especially its hidden layers. It is possible to define their number, possibly their type, and other parameters. Above this layer of the optimizer, a list of parameters used to set up the model is then created. Any adjustable parameter of the model can be selected for optimization and for experiments with its value. The final list of parameters forms a mixed-type vector whose coordinates are the values to be optimized. Each coordinate value of this vector is set (generated or otherwise selected) in each optimization step. The configured model is then trained and tested. The loss of accuracy on the test set was chosen as the measure of quality, and we try to minimize this value. Genetic algorithms were chosen to control the optimization process.
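The layered design described above can be sketched as follows. This is a minimal illustration under stated assumptions, not our optimizer: the class names (`IntParam`, `FloatParam`, `ChoiceParam`), the function `evolve`, the selection and crossover scheme, and all numeric settings are hypothetical, and `toy_loss` merely stands in for "configure, train and test the model, then return its loss".

```python
import random
from dataclasses import dataclass
from typing import Sequence


@dataclass
class IntParam:
    name: str
    low: int
    high: int

    def sample(self, rng):
        return rng.randint(self.low, self.high)


@dataclass
class FloatParam:
    name: str
    low: float
    high: float

    def sample(self, rng):
        return rng.uniform(self.low, self.high)


@dataclass
class ChoiceParam:
    name: str
    options: Sequence

    def sample(self, rng):
        return rng.choice(list(self.options))


def evolve(params, fitness, rng, pop_size=12, generations=20, mutation_rate=0.2):
    """Minimize `fitness` over mixed-type parameter vectors with a simple GA."""
    population = [{p.name: p.sample(rng) for p in params} for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=fitness)
        parents = population[: pop_size // 2]  # truncation selection keeps the best half
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            # Uniform crossover: each coordinate is copied from one of the two parents.
            child = {p.name: (a if rng.random() < 0.5 else b)[p.name] for p in params}
            # Mutation: resample a coordinate within its allowed range.
            for p in params:
                if rng.random() < mutation_rate:
                    child[p.name] = p.sample(rng)
            children.append(child)
        population = parents + children
    return min(population, key=fitness)


params = [
    IntParam("hidden_layers", 1, 6),
    FloatParam("dropout", 0.0, 0.5),
    ChoiceParam("activation", ("relu", "tanh")),
]


def toy_loss(ind):
    # Stand-in for training a model and measuring its loss on the test set.
    return abs(ind["hidden_layers"] - 2) + abs(ind["dropout"] - 0.2) + (ind["activation"] != "relu")


best = evolve(params, toy_loss, random.Random(0))
print(best)
```

In a real setting each fitness evaluation means a full training run, so results should be cached per parameter vector rather than recomputed during sorting as in this sketch.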

Developed deep neural network models
Using the optimization mentioned above, it was subsequently found that on the given input data (described below), the given classification task can be satisfactorily solved with a small number of hidden layers and of neurons in them. For instance, the model described in Figure 5 achieved good results. In addition to fully connected dense layers, layers supporting the robustness of network learning were used to improve the model's function (a layer implementing Gaussian noise and a dropout layer randomly zeroing some elements of the previous layer's output). Experiments with the model structure have also shown that networks containing convolutional layers generally achieve better results. Several models with this global architecture were tested, including models containing multiple parallel convolutional layers with different convolution kernels. However, the model shown in Figure 6 achieved the best results in testing.
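The notion of parallel convolutional layers with different kernels can be illustrated in plain Python. The sketch below is not one of our Keras models: the kernels and the input are made up, and the combination of ReLU activation with global max pooling per branch is an assumption chosen to keep the example short. It only shows how several kernels of different sizes are applied to the same input and how their outputs are concatenated.

```python
from typing import List, Sequence


def conv1d_valid(signal: Sequence[float], kernel: Sequence[float]) -> List[float]:
    """1D cross-correlation without padding, the operation CNN layers compute."""
    k = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]


def relu(values: Sequence[float]) -> List[float]:
    return [max(0.0, v) for v in values]


def parallel_conv_features(signal, kernels):
    """Run every kernel over the same input, apply ReLU and global max pooling,
    and concatenate the per-branch results into one feature list.

    Each kernel must not be longer than the signal.
    """
    return [max(relu(conv1d_valid(signal, kernel))) for kernel in kernels]


# Made-up input and kernels of sizes 2, 3 and 2.
signal = [1, 2, 3, 4, 5]
kernels = [[-1, 1], [1, 1, 1], [1, -1]]
print(parallel_conv_features(signal, kernels))
```

Kernels of different lengths respond to patterns spanning different numbers of consecutive packets, which is the motivation for running them side by side.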

Input dataset preprocessing
The original idea was to train the neural network model on three basic parameters of each packet. These parameters were the packet size in bytes, the packet direction, and the time distance from the previous packet. These data were taken from the first 30 packets of each data flow (a data flow here means continuous communication between two IPs using the same ports). The resulting 90-element vector was then used for training and testing the models described above, based on dense and convolutional layers.
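The construction of the 90-element vector (3 parameters for each of the first 30 packets) can be sketched as follows. The packet tuple layout, the encoding of direction as +1/-1, the interleaved ordering of the three values, and the zero-padding of flows shorter than 30 packets are all assumptions made for this illustration; the function name `flow_to_vector` is hypothetical.

```python
from typing import List, Tuple

# (timestamp_seconds, size_bytes, direction); direction is +1 for client-to-server
# and -1 for the reverse direction (assumed encoding).
Packet = Tuple[float, int, int]


def flow_to_vector(packets: List[Packet], n_packets: int = 30) -> List[float]:
    """Turn the first `n_packets` packets of a flow into a flat feature vector."""
    vector: List[float] = []
    prev_time = packets[0][0] if packets else 0.0
    for timestamp, size, direction in packets[:n_packets]:
        # Three values per packet: size, direction, time since the previous packet.
        vector.extend([float(size), float(direction), timestamp - prev_time])
        prev_time = timestamp
    # Zero-pad short flows so every vector has 3 * n_packets elements (assumption).
    vector.extend([0.0] * (3 * n_packets - len(vector)))
    return vector


example = flow_to_vector([(0.0, 100, 1), (0.25, 200, -1)])
print(len(example), example[:6])
```

Calling the same function with `n_packets=20` yields a 60-element vector, which matches the packet count used for DoHH-based inputs.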
Further testing was also performed with input data structured by DoHHs rather than by data flows. The method of determining these DoHHs is given above. In this case, data from the first 20 packets in each DoHH were used. The characteristics of these packets were the same as mentioned above.
The data itself came from two different sources. The first one was the capture of DoH queries within the CZ.NIC network on the model infrastructure. In this case, several groups of datasets were generated. The first group (CZ.NIC S sets) was created by modelling communication with a separate communication flow for each DoH query. The second group (CZ.NIC L sets) was obtained by modelling communication where one communication flow contained several DoH queries. The second data source was the model infrastructure's communication with DoH servers located within the domain cloudflare.com (CLOUDFLARE set). For DoHH-based testing, the CZ.NIC N and CLOUDFLARE N sets were generated from the data flows described above. For the CZ.NIC S and N sets, the obtained data were divided into three essential parts: a training, a validation, and a test set. The first two parts were used in the training of the models. The last part, together with data from the other sets, was used to evaluate the models' generalization abilities in the tests.
The models were always trained on data from the CZ.NIC network for separate queries against DoH servers (from the CZ.NIC S and N sets). The experiments aimed to verify in practice whether models trained on communication with a specific server in the CZ.NIC network are able to detect DoH communication with other servers (e.g., cloudflare.com). That was also a measure of the model quality.

Experimental results and evaluation
Within the project, a large number of experiments were performed with different model structures. Only selected top-quality outputs are listed below. The best results were obtained with the model shown in Figure 6. The models' evaluation is given in Tables 1 and 2 for data from the CZ.NIC network and for data from communication with the cloudflare.com domain.
In Table 1, the columns describe the input data used for testing, the number of test examples and the total input vector dimension. In Table 2, the first two columns describe the model type and the number of packets from which the input data was used. Regarding the model type, the D1 model was based on dense layers (depicted in Figure 5), while the C models were based on convolutional layers (depicted in Figure 6). The following four columns give the test evaluation (true positive, true negative, false positive, false negative). In Figure 7, it is possible to see the achieved quality of all models. The quality is computed as $Q = \frac{TP + TN}{TP + TN + FP + FN}$, where TP, TN, FP and FN are the true positive, true negative, false positive and false negative counts contained in Table 2.
The results show that the D1 model also achieved good results. However, already on the CZ.NIC network data, these results were worse than those of the models involving convolutional layers. Therefore, these models were not tested further. Several experiments were performed with type C models to find the model's optimal setting in terms of its parameters and hyperparameters. Somewhat paradoxically, it turned out that the best results were achieved already by the second tested model, C2, which is also shown in Figure 6. Its strength lies in correctly identifying DoH traffic to the cloudflare.com domain and thus in its robustness. The tests were also performed on DoHH-based inputs (the last two rows of the table). The model's quality was very high on data from the CZ.NIC network. However, its generalization ability tested on data from cloudflare.com was low, even when three parallel CNN layers with different kernel sizes were used. The models trained on the data coming from the provider CZ.NIC were able to detect the DoH connections of another provider, Cloudflare. These results confirmed the generalization ability of our proposed models. All presented models are part of the freely available dataset and can be downloaded via https://gitlab.prf.jcu.cz/root/dohgpu/.
ISSN: 2088-8708, Int J Elec & Comp Eng, Vol. 13, No. 6, December 2023: 6691-6700
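The quality measure used for the comparison, read here as the fraction of correctly classified test examples, (TP + TN) / (TP + TN + FP + FN), can be computed with a few lines of code. The function name `quality` and the example counts are hypothetical and are not values from Table 2.

```python
def quality(tp: int, tn: int, fp: int, fn: int) -> float:
    """Return (TP + TN) / (TP + TN + FP + FN) for one model on one test set."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("at least one test example is required")
    return (tp + tn) / total


# Made-up counts for illustration only.
print(quality(tp=90, tn=100, fp=5, fn=5))
```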
In our future work, we would like to focus on the graphical visualization of DoH traffic and on the use of computer vision methods suitable for its identification. This approach has the great advantage that it is close to human thinking and its results are easy to check. Another gap that deserves attention is the elimination of the jitter of TCP packets, which causes ambiguity in the DoH patterns.

Figure 1. Visualization of DoH network traffic flow. Each spot represents a single packet. The red color marks the direction from source to destination and the green the reverse direction

Figure 2. The visualization of a concrete DoH network traffic flow containing detected DoHHs

Figure 3.

Figure 4. Transformation of TCP packets into DoHHs. The algorithm depicted above works as a state machine

Figure 7. Achieved quality for different neural network models and their comparison

Table 1. Input datasets and selected neural network models (Dense vs CNN)

Table 2. Achieved results for specific datasets and models