Water Quality Classification
Introduction
Water quality refers to the physical, chemical, and biological characteristics of water that determine its suitability for various uses, such as drinking, agriculture, industry, and ecosystem health. Assessing water quality is crucial for protecting human health and the environment. When classifying water quality, the attributes of concern include the amounts of aluminum, ammonia, arsenic, barium, cadmium, chloramine, chromium, copper, fluoride, bacteria, viruses, lead, nitrates, nitrites, mercury, perchlorate, radium, selenium, silver, and uranium. Based on these criteria, we can classify water quality using the attribute 'is_safe'.
Measurable criteria (numeric thresholds) for all of the above attributes are listed below:
Attribute | Dangerous (if greater than)
Aluminum | 2.8
Ammonia | 32.5
Arsenic | 0.01
Barium | 2
Cadmium | 0.005
Chloramine | 4
Chromium | 0.1
Copper | 1.3
Fluoride | 1.5
Bacteria | 0
Viruses | 0
Lead | 0.015
Nitrates | 10
Nitrites | 1
Mercury | 0.002
Perchlorate | 56
Radium | 5
Selenium | 0.5
Silver | 0.1
Uranium | 0.3
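As an illustration of how these thresholds could be applied directly (separate from the neural network approach developed later), the table can be encoded as a simple rule-based check in Python. The attribute names below are taken from the list above and may need renaming to match the actual dataset's column headers.

```python
# Danger thresholds from the table above; keys are assumed attribute names.
THRESHOLDS = {
    "aluminum": 2.8, "ammonia": 32.5, "arsenic": 0.01, "barium": 2.0,
    "cadmium": 0.005, "chloramine": 4.0, "chromium": 0.1, "copper": 1.3,
    "fluoride": 1.5, "bacteria": 0.0, "viruses": 0.0, "lead": 0.015,
    "nitrates": 10.0, "nitrites": 1.0, "mercury": 0.002, "perchlorate": 56.0,
    "radium": 5.0, "selenium": 0.5, "silver": 0.1, "uranium": 0.3,
}

def exceeds_any_threshold(sample: dict) -> bool:
    """Return True if any measured value is above its danger threshold."""
    return any(sample.get(name, 0.0) > limit for name, limit in THRESHOLDS.items())

# Hypothetical measurements: arsenic (0.04) is above its threshold (0.01).
sample = {"aluminum": 1.2, "arsenic": 0.04, "lead": 0.01}
print(exceeds_any_threshold(sample))  # True
```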
Prediction of Water Quality ("is_safe" attribute):
Based on the mentioned attributes, the task is to predict the attribute "is_safe" for water quality. This classification involves analyzing the concentrations or presence of the mentioned attributes to determine whether the water is safe or not.
By employing a neural network model, the relationship between these attributes and the "is_safe" attribute can be learned from available data, enabling the prediction of water quality safety.
- is_safe - class attribute {0 - not safe, 1 - safe}
- The model is referred to as a feed-forward neural network, because inputs are processed only in a forward direction, from the input layer through to the output.
Data Preparation
First, the dataset file should be uploaded into a Jupyter Notebook in Anaconda Navigator or into Google Colab. Then, confirm that the data was loaded correctly by inspecting its first rows, and check the shape (number of rows and columns) of the whole dataset.
After that, check whether the dataset contains any duplicated rows or missing (null) values. If any are found, remove or fix them, since cleaning the data helps improve the accuracy of the resulting model. A sketch of these checks is shown below.
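A minimal sketch of these checks with Pandas, assuming the dataset file is named waterQuality.csv (adjust the path to the actual file):

```python
import pandas as pd

# Load the dataset (file name is an assumption).
df = pd.read_csv("waterQuality.csv")

print(df.head())               # first few rows of data
print(df.shape)                # (number of rows, number of columns)
print(df.duplicated().sum())   # number of duplicated rows
print(df.isnull().sum())       # number of missing (null) values per column

# Remove duplicated rows and rows with missing values, if any were found.
df = df.drop_duplicates().dropna()
```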
Data Pre-processing
First, we drop some unnecessary columns (attributes) from the dataset. From analysing the dataset and available knowledge sources, such as research on water quality, we found that some attributes are not essential for predicting the water quality of a given sample.
Then, check whether the resulting dataset contains any invalid values. If any are found, remove the rows that include invalid entries such as '#NUM!', since they cannot be used later for training and testing. A sketch of this cleaning step is shown below.
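A sketch of this step, continuing from the DataFrame df loaded above and assuming the invalid entries appear as the literal string '#NUM!'; the dropped column is a placeholder:

```python
import numpy as np

# Drop columns judged non-essential for the prediction
# ("some_column" is a placeholder, not a real column name).
# df = df.drop(columns=["some_column"])

# Replace '#NUM!' entries with NaN, force all columns to numeric,
# and drop the rows that contained invalid values.
df = df.replace("#NUM!", np.nan)
df = df.apply(pd.to_numeric, errors="coerce")
df = df.dropna()
print(df.shape)   # row count drops from 7999 to 7996 after cleaning
```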
Dataset size: 7999 rows initially, 7996 rows after pre-processing.
Considering the size of the dataset, the complexity of the problem, and the available resources, we used a ratio of 80% for training and 20% for testing.
Scikit-learn (sklearn):
- Machine learning library for Python.
- Provides a wide range of machine learning algorithms, including supervised learning algorithms.
random_state = 42
- Fixes the random seed so that the same train/test split is generated on every run, making the results reproducible (see the split sketch below).
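A sketch of the split with scikit-learn, assuming the cleaned DataFrame df from above and the 'is_safe' column as the target:

```python
from sklearn.model_selection import train_test_split

X = df.drop(columns=["is_safe"])   # input attributes
y = df["is_safe"]                  # target label: 0 - not safe, 1 - safe

# 80% training / 20% testing, with a fixed seed for reproducibility.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape)
```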
Now, we can check the distribution of output values: 'is_safe' (0 - not safe, 1 - safe)
The graphs below show the distribution of the training and testing data.
Training & testing dataset distribution:
- Biased towards 0 (not safe): more than 50% of the dataset belongs to this class.
Since the input features span very different ranges, both the training and testing sets should be normalized (standardized). Note that standardization puts the features on a common scale; it does not by itself remove the class imbalance seen above.
Standardizing the input features of the dataset:
- Standard technique: subtract the mean and divide by the standard deviation.
- Equalizes the scale of the features.
- Treats all features equally.
- Enables compatibility with algorithms that are sensitive to feature scale (a sketch of this step follows).
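A minimal standardization sketch using scikit-learn's StandardScaler (mean and standard deviation scaling), assuming the X_train and X_test variables from the split above; the scaler is fitted on the training data only and then applied to the test data:

```python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)   # fit on training data only
X_test_scaled = scaler.transform(X_test)         # reuse the same mean/std for the test set
```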
Model Design
Model:
- Layer 1: 8 neurons
  - The output from this layer is the input to the next layer.
- Layer 2: 4 neurons
- Output (last) layer: 1 neuron
  - For classifying into one class (0/1).
Dividing the network into layers allows it to learn more complex relationships between the input attributes and the output.
Here, we chose ReLU and sigmoid as the activation functions for the layers, because studies have shown that ReLU is effective at reducing the vanishing gradient problem, while the sigmoid function maps its input to a value between 0 and 1, making it useful for binary classification. A sketch of this architecture in Keras is shown below.
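A sketch of the described architecture, assuming the scaled training data from the previous step; layer sizes follow the 8-4-1 structure outlined above:

```python
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    keras.Input(shape=(X_train_scaled.shape[1],)),    # one input per attribute
    layers.Dense(8, activation="relu"),                # layer 1: 8 neurons
    layers.Dense(4, activation="relu"),                # layer 2: 4 neurons
    layers.Dense(1, activation="sigmoid"),             # output layer: 1 neuron, value in (0, 1)
])
model.summary()
```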
Hyperparameter Selection
When selecting hyperparameters, we need to consider things such as the number of layers, the activation function for each layer, the loss function, and the optimizer.
Activation Functions:
ReLU
- Used for the hidden layers of the network.
- Computationally efficient.
- Helps to prevent the vanishing gradient problem:
  - Occurs during the training of neural networks when the updates to the weights become very small.
  - Caused by the derivatives of the activation function becoming very small.
Sigmoid
- Used for the output layer.
- Maps the output of the network to a value between 0 and 1 that can be read as a probability.
- Useful for classification problems where each sample is assigned to a particular class.
The layers of this model are Dense layers, which are densely connected (also called fully connected) neural layers. Simple sketches of the two activation functions are shown below.
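Purely illustrative NumPy definitions of the two activation functions (not the Keras implementations):

```python
import numpy as np

def relu(x):
    """ReLU: passes positive values through unchanged, zeroes out negatives."""
    return np.maximum(0.0, x)

def sigmoid(x):
    """Sigmoid: squashes any real value into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

print(relu(np.array([-2.0, 0.5, 3.0])))      # [0.  0.5 3. ]
print(sigmoid(np.array([-2.0, 0.0, 2.0])))   # approx. [0.12 0.5 0.88]
```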
Overview of the Implementation Platform
Anaconda Navigator is a comprehensive implementation platform that provides support for neural network models through the inclusion of popular libraries such as TensorFlow, Keras, and other Python libraries like NumPy and Pandas.
Neural Network Model Support: Anaconda Navigator includes TensorFlow, a widely used deep learning library, and Keras, a high-level neural networks API. These libraries provide a rich set of tools and functionalities for building, training, and deploying neural network models. With Anaconda Navigator, users can easily access and utilize these libraries for various machine learning and deep learning tasks.
NumPy: Anaconda Navigator comes pre-installed with NumPy, a fundamental package for scientific computing in Python. NumPy provides efficient multi-dimensional array operations and mathematical functions, making it a crucial component for data manipulation and numerical computations in neural network modeling.
Pandas: Another essential library included in Anaconda Navigator is Pandas, which offers high-performance data manipulation and analysis capabilities. Pandas provides data structures and functions for handling structured data, making it valuable for preprocessing and data wrangling tasks in neural network modeling.
By having TensorFlow, Keras, NumPy, and Pandas readily available in Anaconda Navigator, users can seamlessly leverage the power of these libraries for building and training neural network models. The integration of these tools within the platform simplifies the setup process and allows users to focus more on developing and experimenting with their neural network architectures.
Training
Optimizer:
- rmsprop
  - A stochastic mini-batch learning method with an adaptive, per-parameter learning rate.
Loss:
- binary_crossentropy
  - Used as the loss function for a binary classification model.
Metrics:
- accuracy
  - The metric monitored during training and testing.
Putting these choices together is shown in the sketch below.
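A compilation sketch, assuming the model defined earlier:

```python
model.compile(
    optimizer="rmsprop",             # RMSprop optimizer
    loss="binary_crossentropy",      # loss for binary classification
    metrics=["accuracy"],            # monitored during training and testing
)
```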
Callbacks
- Used to save the Keras model or model weights at some frequency.
- In Keras, callbacks can also be used to improve the training and performance of deep learning models.
Callback Functions:
- ModelCheckpoint
  - Saves the model (or its weights) at regular intervals during training.
- EarlyStopping
  - Stops training when the model's performance on a validation set stops improving.
  - Can also keep the best weights automatically, based on a criterion (the performance on the validation set).
A sketch of both callbacks is shown after this list.
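In this sketch, the checkpoint file name, monitored quantities, and patience value are assumptions rather than values confirmed by the original experiment:

```python
from tensorflow.keras.callbacks import ModelCheckpoint, EarlyStopping

# Save the best model seen so far (file name/format is an assumption).
checkpoint = ModelCheckpoint(
    "best_model.keras",
    monitor="val_accuracy",
    save_best_only=True,
)

# Stop training once validation loss stops improving and keep the best weights.
early_stop = EarlyStopping(
    monitor="val_loss",
    patience=10,
    restore_best_weights=True,
)
```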
epochs and batch size:
Two important hyperparameters that can affect the performance of a model.
- epochs:
- The number of times that the entire training dataset is passed through the model.
- batch size:
- The number of training samples that are processed at a time.
- 640/640: the total number of batches processed per epoch, as shown in the Keras training log (a training sketch follows).
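A training sketch combining the pieces above. The epoch count and batch size are illustrative; with roughly 6,400 training rows, a batch size of 10 gives about 640 batches per epoch, matching the 640/640 shown in the log. Using the test split as validation data here is also an assumption made for illustration:

```python
history = model.fit(
    X_train_scaled, y_train,
    validation_data=(X_test_scaled, y_test),   # assumed validation data
    epochs=100,                                 # illustrative value
    batch_size=10,                              # illustrative value
    callbacks=[checkpoint, early_stop],
)
```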
Final Optimized Model
Test Results
Output:
Precision, recall, and F1 score are metrics used to evaluate the performance of a machine learning model in binary classification tasks; a short computation sketch follows the definitions below.
- Precision measures the accuracy of positive predictions. It is calculated as the number of true positive predictions divided by the total number of positive predictions, including both true positives and false positives.
- Recall measures the model's ability to find all positive examples. It is calculated as the number of true positive predictions divided by the total number of positive examples, including both true positives and false negatives.
- F1 score is a harmonic mean of precision and recall. It is calculated as 2 * (precision * recall) / (precision + recall).
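These metrics can be computed with scikit-learn, assuming the trained model and the scaled test data from above; the sigmoid outputs are thresholded at 0.5 to obtain class labels:

```python
from sklearn.metrics import classification_report

y_prob = model.predict(X_test_scaled)            # predicted probabilities
y_pred = (y_prob.ravel() > 0.5).astype(int)      # threshold at 0.5 -> class labels

# Precision, recall, and F1 score for both classes.
print(classification_report(y_test, y_pred, target_names=["not safe", "safe"]))
```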
Discussion
- The objective of the developed model was to assess the quality of water samples using deep learning technology, specifically an artificial neural network (ANN). The model achieved an impressive accuracy of 91.62%, indicating its effectiveness in accurately categorizing water quality.
- Throughout the training phase, we encountered several challenges, primarily stemming from limited resources available for the model training environment. To mitigate this constraint, we leveraged Anaconda Navigator, a freely accessible platform for running code on personal computers. However, to achieve more precise outcomes, we had to repeatedly retrain the model.
- It is worth noting that the attained accuracy rate of 91.62% signifies the model's potential in water quality classification. Nevertheless, we acknowledge that further enhancements and refinements are necessary to improve the model's accuracy and robustness.
- To summarize, the developed model exhibits promise in the realm of water quality classification. Despite facing limitations related to resource availability during the training process, the achieved accuracy of the model underscores its potential practicality. To advance the model's performance and address encountered limitations, additional research and advancements are imperative.