Water Quality Classification
Introduction
Water quality refers to the physical, chemical, and biological characteristics of water that determine its suitability for various uses, such as drinking, agriculture, industry, and ecosystem health. Assessing water quality is crucial for protecting human health and the environment. When classifying water quality, the attributes of concern include the amounts of aluminum, ammonia, arsenic, barium, cadmium, chloramine, chromium, copper, fluoride, bacteria, viruses, lead, nitrates, nitrites, mercury, perchlorate, radium, selenium, silver, and uranium. Based on these criteria, we can classify water quality using the attribute 'is_safe'.
Measurable criteria (numeric thresholds) for all of the above attributes are listed below:
Attribute | Dangerous (if greater than)
Aluminum | 2.8
Ammonia | 32.5
Arsenic | 0.01
Barium | 2
Cadmium | 0.005
Chloramine | 4
Chromium | 0.1
Copper | 1.3
Fluoride | 1.5
Bacteria | 0
Viruses | 0
Lead | 0.015
Nitrates | 10
Nitrites | 1
Mercury | 0.002
Perchlorate | 56
Radium | 5
Selenium | 0.5
Silver | 0.1
Uranium | 0.3
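As an illustration of how these thresholds could be applied directly (separate from the neural network approach developed later), the table can be encoded as a simple rule-based check in Python. The attribute names below are taken from the list above and may need renaming to match the actual dataset's column headers.

```python
# Danger thresholds from the table above; keys are assumed attribute names.
THRESHOLDS = {
    "aluminum": 2.8, "ammonia": 32.5, "arsenic": 0.01, "barium": 2.0,
    "cadmium": 0.005, "chloramine": 4.0, "chromium": 0.1, "copper": 1.3,
    "fluoride": 1.5, "bacteria": 0.0, "viruses": 0.0, "lead": 0.015,
    "nitrates": 10.0, "nitrites": 1.0, "mercury": 0.002, "perchlorate": 56.0,
    "radium": 5.0, "selenium": 0.5, "silver": 0.1, "uranium": 0.3,
}

def exceeds_any_threshold(sample: dict) -> bool:
    """Return True if any measured value is above its danger threshold."""
    return any(sample.get(name, 0.0) > limit for name, limit in THRESHOLDS.items())

# Hypothetical measurements: arsenic (0.04) is above its threshold (0.01).
sample = {"aluminum": 1.2, "arsenic": 0.04, "lead": 0.01}
print(exceeds_any_threshold(sample))  # True
```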
Prediction of Water Quality ("is_safe" attribute):
Based on the mentioned attributes, the task is to predict the attribute "is_safe" for water quality. This classification involves analyzing the concentrations or presence of the mentioned attributes to determine whether the water is safe or not.
By employing a neural network model, the relationship between these attributes and the "is_safe" attribute can be learned from available data, enabling the prediction of water quality safety.
- is_safe - class attribute {0 - not safe, 1 - safe}
- The model is referred to as a feed-forward neural network, because inputs are processed only in a forward direction, from the input layer through to the output.
Data Preparation
First, the dataset file should be uploaded into a Jupyter Notebook in Anaconda Navigator or into Google Colab. Then, confirm that the data was loaded correctly by inspecting its first rows, and check the shape (number of rows and columns) of the whole dataset.
After that, check whether the dataset contains any duplicated rows or missing (null) values. If any are found, remove or fix them, since cleaning the data helps improve the accuracy of the resulting model. A sketch of these checks is shown below.
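A minimal sketch of these checks with Pandas, assuming the dataset file is named waterQuality.csv (adjust the path to the actual file):

```python
import pandas as pd

# Load the dataset (file name is an assumption).
df = pd.read_csv("waterQuality.csv")

print(df.head())               # first few rows of data
print(df.shape)                # (number of rows, number of columns)
print(df.duplicated().sum())   # number of duplicated rows
print(df.isnull().sum())       # number of missing (null) values per column

# Remove duplicated rows and rows with missing values, if any were found.
df = df.drop_duplicates().dropna()
```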
Data Pre-processing
First, we drop some unnecessary columns (attributes) from the dataset. From analysing the dataset and available knowledge sources, such as research on water quality, we found that some attributes are not essential for predicting the water quality of a given sample.
Then, check whether the resulting dataset contains any invalid values. If any are found, remove the rows that include invalid entries such as '#NUM!', since they cannot be used later for training and testing. A sketch of this cleaning step is shown below.
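A sketch of this step, continuing from the DataFrame df loaded above and assuming the invalid entries appear as the literal string '#NUM!'; the dropped column is a placeholder:

```python
import numpy as np

# Drop columns judged non-essential for the prediction
# ("some_column" is a placeholder, not a real column name).
# df = df.drop(columns=["some_column"])

# Replace '#NUM!' entries with NaN, force all columns to numeric,
# and drop the rows that contained invalid values.
df = df.replace("#NUM!", np.nan)
df = df.apply(pd.to_numeric, errors="coerce")
df = df.dropna()
print(df.shape)   # row count drops from 7999 to 7996 after cleaning
```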
Dataset size: 7999 rows initially, 7996 rows after pre-processing.
Considering the size of the dataset, the complexity of the problem, and the available resources, we used a ratio of 80% for training and 20% for testing.
Scikit-learn (sklearn):
- Machine learning library for Python.
- Provides a wide range of machine learning algorithms, including supervised learning algorithms.
random_state = 42
- Fixes the random seed so that the same train/test split is generated on every run, making the results reproducible (see the split sketch below).
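A sketch of the split with scikit-learn, assuming the cleaned DataFrame df from above and the 'is_safe' column as the target:

```python
from sklearn.model_selection import train_test_split

X = df.drop(columns=["is_safe"])   # input attributes
y = df["is_safe"]                  # target label: 0 - not safe, 1 - safe

# 80% training / 20% testing, with a fixed seed for reproducibility.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape)
```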
Now, we can check the distribution of output values: 'is_safe' (0 - not safe, 1 - safe)
The graphs below show the distribution of the training and testing data.
Training & testing dataset distribution:
- Biased towards 0 (not safe): more than 50% of the dataset belongs to this class.
Since the input features span very different ranges, both the training and testing sets should be normalized (standardized). Note that standardization puts the features on a common scale; it does not by itself remove the class imbalance seen above.
Standardizing the input features of the dataset:
- Standard technique: subtract the mean and divide by the standard deviation.
- Equalizes the scale of the features.
- Treats all features equally.
- Enables compatibility with algorithms that are sensitive to feature scale (a sketch of this step follows).
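A minimal standardization sketch using scikit-learn's StandardScaler (mean and standard deviation scaling), assuming the X_train and X_test variables from the split above; the scaler is fitted on the training data only and then applied to the test data:

```python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)   # fit on training data only
X_test_scaled = scaler.transform(X_test)         # reuse the same mean/std for the test set
```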
Model Design
Model:
- Layer 1: 8 neurons
  - The output from this layer is the input to the next layer.
- Layer 2: 4 neurons
- Output (last) layer: 1 neuron
  - For classifying into one class (0/1).
Dividing the network into layers allows it to learn more complex relationships between the input attributes and the output.
Here, we chose ReLU and sigmoid as the activation functions for the layers, because studies have shown that ReLU is effective at reducing the vanishing gradient problem, while the sigmoid function maps its input to a value between 0 and 1, making it useful for binary classification. A sketch of this architecture in Keras is shown below.
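A sketch of the described architecture, assuming the scaled training data from the previous step; layer sizes follow the 8-4-1 structure outlined above:

```python
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    keras.Input(shape=(X_train_scaled.shape[1],)),    # one input per attribute
    layers.Dense(8, activation="relu"),                # layer 1: 8 neurons
    layers.Dense(4, activation="relu"),                # layer 2: 4 neurons
    layers.Dense(1, activation="sigmoid"),             # output layer: 1 neuron, value in (0, 1)
])
model.summary()
```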
Hyperparameter Selection
When selecting hyperparameters, we need to consider things such as the number of layers, the activation function for each layer, the loss function, and the optimizer.
Activation Functions:
ReLU
- Used for the hidden layers of the network.
- Computationally efficient.
- Helps to prevent the vanishing gradient problem:
  - Occurs during the training of neural networks when the updates to the weights become very small.
  - Caused by the derivatives of the activation function becoming very small.
Sigmoid
- Used for the output layer.
- Maps the output of the network to a value between 0 and 1 that can be read as a probability.
- Useful for classification problems where each sample is assigned to a particular class.
The layers of this model are Dense layers, which are densely connected (also called fully connected) neural layers. Simple sketches of the two activation functions are shown below.
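Purely illustrative NumPy definitions of the two activation functions (not the Keras implementations):

```python
import numpy as np

def relu(x):
    """ReLU: passes positive values through unchanged, zeroes out negatives."""
    return np.maximum(0.0, x)

def sigmoid(x):
    """Sigmoid: squashes any real value into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

print(relu(np.array([-2.0, 0.5, 3.0])))      # [0.  0.5 3. ]
print(sigmoid(np.array([-2.0, 0.0, 2.0])))   # approx. [0.12 0.5 0.88]
```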
Overview of the Implementation Platform
Anaconda Navigator is a comprehensive implementation platform that provides support for neural network models through the inclusion of popular libraries such as TensorFlow, Keras, and other Python libraries like NumPy and Pandas.
Neural Network Model Support: Anaconda Navigator includes TensorFlow, a widely used deep learning library, and Keras, a high-level neural networks API. These libraries provide a rich set of tools and functionalities for building, training, and deploying neural network models. With Anaconda Navigator, users can easily access and utilize these libraries for various machine learning and deep learning tasks.
NumPy: Anaconda Navigator comes pre-installed with NumPy, a fundamental package for scientific computing in Python. NumPy provides efficient multi-dimensional array operations and mathematical functions, making it a crucial component for data manipulation and numerical computations in neural network modeling.
Pandas: Another essential library included in Anaconda Navigator is Pandas, which offers high-performance data manipulation and analysis capabilities. Pandas provides data structures and functions for handling structured data, making it valuable for preprocessing and data wrangling tasks in neural network modeling.
By having TensorFlow, Keras, NumPy, and Pandas readily available in Anaconda Navigator, users can seamlessly leverage the power of these libraries for building and training neural network models. The integration of these tools within the platform simplifies the setup process and allows users to focus more on developing and experimenting with their neural network architectures.
Training
Optimizer:
- rmsprop
  - A stochastic mini-batch learning method with an adaptive, per-parameter learning rate.
Loss:
- binary_crossentropy
  - Used as the loss function for a binary classification model.
Metrics:
- accuracy
  - The metric monitored during training and testing.
Putting these choices together is shown in the sketch below.
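A compilation sketch, assuming the model defined earlier:

```python
model.compile(
    optimizer="rmsprop",             # RMSprop optimizer
    loss="binary_crossentropy",      # loss for binary classification
    metrics=["accuracy"],            # monitored during training and testing
)
```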
Callbacks
- Used to save the Keras model or model weights at some frequency.
- In Keras, callbacks can also be used to improve the training and performance of deep learning models.
Callback Functions:
- ModelCheckpoint
  - Saves the model (or its weights) at regular intervals during training.
- EarlyStopping
  - Stops training when the model's performance on a validation set stops improving.
  - Can also keep the best weights automatically, based on a criterion (the performance on the validation set).
A sketch of both callbacks is shown after this list.
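In this sketch, the checkpoint file name, monitored quantities, and patience value are assumptions rather than values confirmed by the original experiment:

```python
from tensorflow.keras.callbacks import ModelCheckpoint, EarlyStopping

# Save the best model seen so far (file name/format is an assumption).
checkpoint = ModelCheckpoint(
    "best_model.keras",
    monitor="val_accuracy",
    save_best_only=True,
)

# Stop training once validation loss stops improving and keep the best weights.
early_stop = EarlyStopping(
    monitor="val_loss",
    patience=10,
    restore_best_weights=True,
)
```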
epochs and batch size:
Two important hyperparameters that can affect the performance of a model.
- epochs:
- The number of times that the entire training dataset is passed through the model.
- batch size:
- The number of training samples that are processed at a time.
- 640/640: the total number of batches processed per epoch, as shown in the Keras training log (a training sketch follows).
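A training sketch combining the pieces above. The epoch count and batch size are illustrative; with roughly 6,400 training rows, a batch size of 10 gives about 640 batches per epoch, matching the 640/640 shown in the log. Using the test split as validation data here is also an assumption made for illustration:

```python
history = model.fit(
    X_train_scaled, y_train,
    validation_data=(X_test_scaled, y_test),   # assumed validation data
    epochs=100,                                 # illustrative value
    batch_size=10,                              # illustrative value
    callbacks=[checkpoint, early_stop],
)
```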
Final Optimized Model
Test Results
Output:
Precision, recall, and F1 score are metrics used to evaluate the performance of a machine learning model in binary classification tasks; a short computation sketch follows the definitions below.
- Precision measures the accuracy of positive predictions. It is calculated as the number of true positive predictions divided by the total number of positive predictions, including both true positives and false positives.
- Recall measures the model's ability to find all positive examples. It is calculated as the number of true positive predictions divided by the total number of positive examples, including both true positives and false negatives.
- F1 score is a harmonic mean of precision and recall. It is calculated as 2 * (precision * recall) / (precision + recall).
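These metrics can be computed with scikit-learn, assuming the trained model and the scaled test data from above; the sigmoid outputs are thresholded at 0.5 to obtain class labels:

```python
from sklearn.metrics import classification_report

y_prob = model.predict(X_test_scaled)            # predicted probabilities
y_pred = (y_prob.ravel() > 0.5).astype(int)      # threshold at 0.5 -> class labels

# Precision, recall, and F1 score for both classes.
print(classification_report(y_test, y_pred, target_names=["not safe", "safe"]))
```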
Discussion
- The objective of the developed model was to assess the quality of water samples using deep learning technology, specifically an artificial neural network (ANN). The model achieved an impressive accuracy of 91.62%, indicating its effectiveness in accurately categorizing water quality.
- Throughout the training phase, we encountered several challenges, primarily stemming from limited resources available for the model training environment. To mitigate this constraint, we leveraged Anaconda Navigator, a freely accessible platform for running code on personal computers. However, to achieve more precise outcomes, we had to repeatedly retrain the model.
- It is worth noting that the attained accuracy rate of 91.62% signifies the model's potential in water quality classification. Nevertheless, we acknowledge that further enhancements and refinements are necessary to improve the model's accuracy and robustness.
- To summarize, the developed model exhibits promise in the realm of water quality classification. Despite facing limitations related to resource availability during the training process, the achieved accuracy of the model underscores its potential practicality. To advance the model's performance and address encountered limitations, additional research and advancements are imperative.