Platform Known Error Database

FM-KED-000 Getting Started
The Finmars Known Error Database is a curated registry of recurring and well-understood operational issues observed in Finmars environments, together with documented recovery procedures and practical guidance. 

Its purpose is not to list every possible failure, but to capture patterns that are already known, reproducible, and solvable through established algorithms. Each entry describes the symptoms, scope of impact, severity, and an expected recovery path based on real operational experience.

This document serves three audiences at once:

operators who need fast orientation during incidents
customers who want transparency and predictability
support teams who require a shared operational memory

Every known error in this database is classified by severity and recovery class, allowing readers to understand both the business impact and the typical effort required to restore normal operation. Where possible, step-by-step recovery instructions are provided to reduce response time and avoid unnecessary investigation.

 Issues that are not present in this database are considered unknown by definition. Such cases may require discovery, diagnostics, or architectural analysis and are handled outside the scope of predefined recovery commitments. 

The Finmars Known Error Database is a living document. It evolves with the platform, operational experience, and lessons learned from real incidents. Its goal is clarity, not completeness. Reliability, not illusion.

Each known issue carries two orthogonal markers:

Severity – impact on the business or system
Recovery Class – expected effort and time to restore service

 

 Severity Levels (Impact-based) 

 S1 — Critical 

System unusable or data integrity at risk
No viable workaround
Immediate attention required

 S2 — High 

Core functionality degraded
Workarounds exist but are painful
Business impact is noticeable

 S3 — Medium 

Partial feature loss
Clear workaround available
Limited operational impact

 S4 — Low 

Cosmetic or non-blocking behavior
No impact on business operations

 Severity answers the question: “How bad is this for the user?” 

 

 Recovery Class A — Quick Fix 

Expected resolution: under 1 hour
Fully documented procedure
Typical for routine maintenance issues

 Recovery Class B — Standard Recovery 

Expected resolution: up to 4 hours
Known issue with multiple steps or validation
Usually fits within the monthly support window

 Recovery Class C — Extended Recovery 

Expected resolution: up to 1 business day
High complexity or cross-component dependency
Not fully covered by standard support

 Recovery Class D — Exploratory 

Resolution time unknown
Partial or missing diagnostic data
Discovery and investigation required

 Recovery class answers the question: “How long does it usually take when things go normally?” 

 

How It Looks in the Registry

 Example 

Issue ID: FM-KED-###
Title: Background jobs stuck in pending state
Severity: S2 — High
Recovery Class: B — Standard Recovery
Estimated Recovery Time: up to 4 hours
Covered by Monthly Support: Yes
Resolution Algorithm: documented

 For an unknown issue: 

Severity: S1 — Critical
Recovery Class: D — Exploratory
Covered by Monthly Support: No
Notes: subject to separate agreement

FM-KED-001 — VM Disk Space Exhaustion
Severity: S2 — High
Recovery Class: A — Quick Fix
Covered by Monthly Support: Yes

 

 Description 

 Disk space on a virtual machine reaches critical levels, leading to degraded system behavior, application instability, or failed background operations. 

 This issue is operational, recurrent, and typically caused by uncontrolled growth of logs, containers, temporary files, or Kubernetes artifacts. 

 

 Typical Symptoms 

Services failing to write logs or temporary files
Background jobs failing without explicit errors
Kubernetes pods entering Evicted or Terminating state
System warnings related to low disk space

 

 Diagnostic Checklist 

 Identify Top Disk Consumers 

 sudo du -ahx / | sort -rh | head -n 20 

 

 Recovery Procedure 

Follow the steps below as needed, not necessarily all of them.

 

 1. Clean Package Manager Artifacts 

 sudo apt-get autoremove

sudo du -sh /var/cache/apt

sudo apt-get autoclean

sudo apt-get clean 

 

 2. Clean System Journals 

 sudo journalctl --vacuum-time=3d 

 

 3. Truncate Docker Logs 

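sudo sh -c 'truncate -s 0 /var/lib/docker/containers/*/*-json.log'

The glob is expanded inside the root shell because /var/lib/docker is normally not readable by unprivileged users.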
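Before truncating, it can help to see which container logs are actually large. The following is a minimal sketch, assuming GNU find (for `-printf`). The function name `largest_container_logs` is hypothetical, and the command needs root when pointed at the real Docker directory.

```shell
# List the largest Docker JSON log files under a directory.
# $1: directory to scan (defaults to the usual Docker path, an assumption)
# $2: number of entries to show (default 10)
# Output: "<size in bytes><TAB><path>", largest first.
largest_container_logs() {
  log_dir=${1:-/var/lib/docker/containers}
  top_n=${2:-10}
  find "$log_dir" -type f -name '*-json.log' -printf '%s\t%p\n' \
    | sort -rn | head -n "$top_n"
}

# Example (run as root):
#   largest_container_logs /var/lib/docker/containers 5
```

Truncating only the files at the top of this list is usually enough to recover space while keeping recent logs of quieter containers.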

 

 4. Prune Docker Resources 

 sudo docker system prune 

 

 5. Remove Obsolete Kubernetes ReplicaSets 

 kubectl get rs -A -o wide | tail -n +2 | \

awk '{if ($3 + $4 + $5 == 0) print "kubectl delete rs -n "$1, $2 }' | sh 
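Because the one-liner above pipes generated commands straight into a shell, a review step can be useful. The sketch below separates the filter from the execution so the delete commands can be inspected first. The function name `obsolete_rs_cmds` is hypothetical, and it assumes the default column order of `kubectl get rs -A` (NAMESPACE, NAME, DESIRED, CURRENT, READY).

```shell
# Read `kubectl get rs -A` output on stdin and print one delete command for
# each ReplicaSet whose DESIRED, CURRENT and READY counts are all zero.
obsolete_rs_cmds() {
  awk 'NR > 1 && $3 + $4 + $5 == 0 { print "kubectl delete rs -n " $1 " " $2 }'
}

# Review first:
#   kubectl get rs -A | obsolete_rs_cmds
# Apply once the list looks right:
#   kubectl get rs -A | obsolete_rs_cmds | sh
```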

 

 6. Clear Evicted Kubernetes Pods 

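kubectl get pods | grep Evicted | awk '{print $1}' | xargs -r kubectl delete pod

With explicit kubeconfig:

kubectl --kubeconfig bank.yaml get pods | grep Evicted | \
awk '{print $1}' | xargs -r kubectl --kubeconfig bank.yaml delete pod

Both commands act on the current namespace only; add -n <namespace> to target another one. The -r flag (GNU xargs) skips the delete when no Evicted pods are found.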
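When pods are spread over several namespaces, a cluster-wide listing may be more practical. The sketch below is an illustration only: the function name `pods_with_status` is hypothetical, and it assumes the default column order of `kubectl get pods -A` (NAMESPACE, NAME, READY, STATUS).

```shell
# Read `kubectl get pods -A` output on stdin and print "namespace pod" for
# every pod whose STATUS column equals the requested state.
pods_with_status() {
  awk -v want="$1" 'NR > 1 && $4 == want { print $1, $2 }'
}

# Example: delete Evicted pods in all namespaces.
#   kubectl get pods -A | pods_with_status Evicted |
#     while read -r ns pod; do kubectl delete pod -n "$ns" "$pod"; done
```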

 

 7. Force Remove Stuck Terminating Pods 

for p in $(kubectl --kubeconfig bank.yaml get pods | grep Terminating | awk '{print $1}'); do
  kubectl --kubeconfig bank.yaml delete pod "$p" --grace-period=0 --force
done

 

 Optional Diagnostics 

 Inspect Memory Usage (for runaway processes) 

ps -eo size,pid,user,command --sort -size | \
awk '{ hr=$1/1024 ; printf("%13.2f MB ",hr) } \
{ for ( x=4 ; x<=NF ; x++ ) { printf("%s ",$x) } print "" }'

 

 

 Preventive Notes 

Disk usage monitoring is strongly recommended
Log rotation must be verified after updates
Kubernetes cleanup should be part of routine maintenance
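The monitoring recommendation can be approximated with a small threshold check run from cron or a monitoring agent. This is a minimal sketch, not part of Finmars tooling: the function names `disk_usage_pct` and `check_disk` and the 85% limit in the example are illustrative assumptions.

```shell
# Print the used-space percentage (integer, no % sign) of the filesystem
# that holds the given path, based on POSIX-format df output.
disk_usage_pct() {
  df -P "$1" | awk 'NR == 2 { gsub("%", "", $5); print $5 }'
}

# Return non-zero and print a warning when usage reaches the limit.
# $1: path to check, $2: limit in percent
check_disk() {
  target=$1
  limit=$2
  used=$(disk_usage_pct "$target")
  if [ "$used" -ge "$limit" ]; then
    echo "CRITICAL: $target is ${used}% full (limit ${limit}%)"
    return 1
  fi
  echo "OK: $target is ${used}% full"
}

# Example:
#   check_disk / 85 || echo "start FM-KED-001 recovery"
```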

FM-KED-002 — PostgreSQL Table and Index Bloat (Missing or Ineffective VACUUM)
Severity: S2 — High
Recovery Class: B — Standard Recovery
Covered by Monthly Support: Yes

 

 Description 

 PostgreSQL database performance degrades over time due to table and index bloat caused by insufficient or ineffective VACUUM operations. 

 This issue manifests gradually and is commonly observed on systems with high write activity, long-running transactions, or misconfigured autovacuum settings. 

 

 Typical Symptoms 

Slow queries without obvious query plan changes
Increased disk usage on database volumes
Tables or indexes significantly larger than expected
Elevated I/O usage
Application timeouts under normal load

 

 Diagnostic Checklist 

 Identify Database Size and Largest Tables 

 SELECT

 relname AS table_name,

 pg_size_pretty(pg_total_relation_size(relid)) AS total_size

FROM pg_catalog.pg_statio_user_tables

ORDER BY pg_total_relation_size(relid) DESC

LIMIT 10; 

 Check Autovacuum Activity 

 SELECT

 relname,

 last_vacuum,

 last_autovacuum,

 n_dead_tup

FROM pg_stat_user_tables

ORDER BY n_dead_tup DESC; 
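Raw dead-tuple counts are easier to judge as a share of the table. The sketch below post-processes unaligned `psql` output and lists tables whose dead-tuple share exceeds a chosen percentage. The function name `bloat_candidates` and the 20% default are assumptions for illustration, not recommended thresholds.

```shell
# Read "relname|n_live_tup|n_dead_tup" rows on stdin and print every table
# whose dead tuples make up more than LIMIT percent of all tuples.
# $1: LIMIT in percent (default 20)
bloat_candidates() {
  awk -F'|' -v limit="${1:-20}" '
    $2 + $3 > 0 {
      pct = 100 * $3 / ($2 + $3)
      if (pct > limit) printf "%s %.1f%%\n", $1, pct
    }'
}

# Example (psql -A -t prints pipe-separated rows without headers):
#   psql -At -d db_name \
#     -c "SELECT relname, n_live_tup, n_dead_tup FROM pg_stat_user_tables" \
#     | bloat_candidates 20
```

Tables reported here are reasonable first candidates for the table-level VACUUM step in the recovery procedure.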

 

 Recovery Procedure 

Follow the steps carefully. Some operations are I/O intensive.

 

 1. Run Manual VACUUM (Non-blocking) 

 VACUUM (VERBOSE, ANALYZE); 

 Recommended for moderate bloat and active systems. 

 

 2. Vacuum Specific Tables 

 VACUUM (VERBOSE, ANALYZE) table_name; 

 Use when bloat is localized. 

 

 3. Reclaim Disk Space (Blocking) 

 VACUUM FULL table_name; 

⚠️ Takes an exclusive lock on the table for the duration of the operation
⚠️ Needs free disk space for a full copy of the table while it runs
⚠️ Use during maintenance windows only

 

 4. Reindex Bloated Indexes 

 REINDEX TABLE table_name; 

Or concurrently (PostgreSQL 12 and later), which avoids blocking writes to the table:

 REINDEX INDEX CONCURRENTLY index_name; 

 

 

 Preventive Notes 

Ensure autovacuum is enabled and properly tuned
Monitor n_dead_tup growth over time
Avoid long-running transactions
Schedule periodic maintenance for write-heavy tables

 

 Operational Notes 

Disk space reclaimed by VACUUM is reusable by PostgreSQL, not always returned to the OS
VACUUM FULL physically rewrites tables and should be used sparingly

FM-KED-003 — Network Connectivity Loss on Virtual Machine
Severity: S1 — Critical
Recovery Class: B — Standard Recovery
Covered by Monthly Support: Yes (diagnostics only)

 

 Description 

 A virtual machine becomes partially or fully unreachable due to loss of network connectivity. This may affect administrative access, application availability, or external integrations. 

 The root cause may lie in operating system configuration, firewall rules, cloud security settings, or infrastructure provider issues. 

 

 Typical Symptoms 

SSH access unavailable or unstable
Applications unreachable from external networks
Inability to access external services or the public internet
Timeouts in inter-service communication

 

 Diagnostic Checklist 

 Proceed in order. Each step narrows the responsibility boundary. 

 

 1. Verify SSH Connectivity 

 ssh user@vm_ip 

 If SSH is unreachable: 

Verify correct IP and credentials
Check whether the VM responds to ICMP (ping), if allowed

 

 2. Verify Internet Access from the VM 

 ping -c 3 8.8.8.8

curl https://example.com 

 Distinguish between: 

No outbound connectivity
DNS resolution issues
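The distinction between the two cases can be made explicit by testing a raw IP address and a hostname separately. The sketch below only encodes the decision logic. The function name `classify_outbound` is hypothetical, and the example assumes that ICMP is allowed outbound and that `getent` is available on the VM.

```shell
# Classify the result of two probes:
# $1: exit code of a probe against a raw IP address (no DNS involved)
# $2: exit code of a hostname lookup
classify_outbound() {
  ip_rc=$1
  dns_rc=$2
  if [ "$ip_rc" -ne 0 ]; then
    echo "no outbound connectivity"
  elif [ "$dns_rc" -ne 0 ]; then
    echo "DNS resolution issue"
  else
    echo "outbound connectivity and DNS look healthy"
  fi
}

# Example:
#   ping -c 3 -W 2 8.8.8.8 >/dev/null 2>&1; ip_rc=$?
#   getent hosts example.com >/dev/null 2>&1; dns_rc=$?
#   classify_outbound "$ip_rc" "$dns_rc"
```

If ICMP is blocked by policy, the first probe fails even on a healthy network, so a TCP-based check would be a better substitute in that environment.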

 

 3. Check OS-Level Firewall Rules 

 sudo iptables -L -n

sudo ufw status 

 Verify that required inbound and outbound traffic is allowed. 

 

 4. Check Cloud Security Groups and Network Rules 

Review inbound and outbound rules in the cloud provider console
Confirm correct ports, protocols, and source ranges
Verify network routing and subnet configuration

 

 5. Escalation to Infrastructure Provider 

 If all checks above are inconclusive: 

Collect timestamps, VM identifiers, and observed symptoms
Open a support ticket with the cloud provider
Attach diagnostic evidence and test results

 This step marks the transition beyond Finmars operational control. 

 

 Preventive Notes 

Restrict firewall changes to controlled processes
Audit security group changes regularly
Maintain documented network topology and access rules

 

 Responsibility Boundary 

 Finmars SCSA provides best-effort diagnostics and configuration verification. Network outages caused by infrastructure providers, underlying hardware, or provider-managed networks are outside Finmars SCSA responsibility.

FM-KED-004 — Backup Failure Due to Excessive Backup Size
Severity: S1 — Critical
Recovery Class: B — Standard Recovery
Covered by Monthly Support: Yes

 

 Description 

 Automatic daily backups fail because the generated database dump exceeds available disk capacity on the backup worker or application node. 

 This condition usually indicates uncontrolled growth of historical data and must be addressed immediately to restore backup continuity. 

 

 Typical Symptoms 

Daily backup job not completed or failing repeatedly
Backup files missing or incomplete
Disk space exhaustion on backup or worker node
Alerts indicating failed or skipped backup runs

 

 Root Cause 

Excessive volume of historical records retained in the database
Insufficient disk capacity on the worker node that hosts the Authorizer VM or the Finmars Community Edition installation

 

 Diagnostic Checklist 

 Verify Backup Job Status 

Confirm backup job execution logs
Identify failure reason and timestamps

 Check Disk Space on Backup Node 

 df -h 

 Estimate Database Dump Size 

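pg_dump --format=custom db_name | wc -c

This streams the dump through wc and prints its size in bytes without writing a file, which matters when the disk is already close to full.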
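Given a dump size in bytes (for example from `wc -c` or `stat`), the comparison with free space on the backup volume can be scripted. This is a minimal sketch; the function name `fits_on_disk` and the `/backups` path in the example are hypothetical.

```shell
# Succeed when a file of the given size would fit on the filesystem
# that holds the given path.
# $1: required size in bytes, $2: path on the target filesystem
fits_on_disk() {
  need_kb=$(( ($1 + 1023) / 1024 ))
  free_kb=$(df -Pk "$2" | awk 'NR == 2 { print $4 }')
  [ "$need_kb" -le "$free_kb" ]
}

# Example:
#   dump_bytes=$(pg_dump --format=custom db_name | wc -c)
#   fits_on_disk "$dump_bytes" /backups || echo "backup would not fit"
```

The check ignores any headroom the backup job needs for temporary files, so a safety margin on top of the raw dump size is advisable.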

 

 Recovery Options 

 Choose one or a combination, depending on business requirements. 

 

 Option 1: Remove Obsolete Historical Records 

Identify historical tables with excessive row counts
Confirm data retention requirements
Delete or archive obsolete records
Re-run backup after cleanup

 ⚠️ Data deletion is irreversible and must be explicitly approved. 

 

 Option 2: Increase Disk Capacity on Backup Worker Node 

Extend disk size of the worker node
Ensure sufficient free space for full backup generation
Re-run the backup job and verify completion

 This option preserves all historical data. 

 

 Preventive Notes 

Define and enforce data retention policies
Monitor backup file sizes over time
Monitor free disk space on backup and worker nodes
Periodically validate backup completion, not only existence

 

 Responsibility Boundary 

 Finmars SCSA provides diagnostics, recommendations, and operational guidance. Infrastructure changes such as disk resizing may depend on customer approval or cloud provider action.

FM-KED-005 — SSL/TLS Certificate Expiration
Severity: S1 — Critical
Recovery Class: B — Standard Recovery
Covered by Monthly Support: Yes

 

 Description 

 SSL/TLS certificates used by Finmars services expire, causing secure connections to fail and rendering applications inaccessible over HTTPS. 

 This issue is time-based and entirely recoverable through certificate renewal. 

 

 Typical Symptoms 

Browsers displaying certificate expiration warnings
HTTPS connections rejected by clients or integrations
API calls failing due to TLS handshake errors
Monitoring alerts related to certificate validity

 

 Diagnostic Checklist 

 Verify Certificate Expiration 

openssl s_client -connect domain:443 -servername domain </dev/null 2>/dev/null | openssl x509 -noout -dates
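For proactive checks it is often more convenient to turn the expiry date into a number of remaining days. The sketch below assumes GNU `date` (for `-d`); the function name `days_until` is hypothetical. It takes the date string that `openssl x509 -enddate` prints after `notAfter=`.

```shell
# Print the number of whole days between now and the given date.
# Negative values mean the date is more than a day in the past.
# $1: date string, e.g. "Jun  1 12:00:00 2030 GMT"
days_until() {
  end_ts=$(date -u -d "$1" +%s) || return 1
  now_ts=$(date -u +%s)
  echo $(( (end_ts - now_ts) / 86400 ))
}

# Example:
#   end=$(openssl s_client -connect domain:443 -servername domain \
#           </dev/null 2>/dev/null | openssl x509 -noout -enddate | cut -d= -f2)
#   days_until "$end"
#
# openssl can also answer a yes/no question directly: with -checkend N,
# `openssl x509 -noout -checkend 2592000` exits non-zero when the
# certificate expires within the next 30 days.
```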

 Identify Certificate Termination Point 

Nginx reverse proxy
Kubernetes Ingress controller

 

 Recovery Procedure 

 Follow the procedure relevant to the deployment model. 

 

 Option 1: Renew Certificate in Nginx Proxy 

Generate or obtain renewed certificate
Replace certificate and private key in Nginx configuration
Reload Nginx configuration

 sudo nginx -t

sudo systemctl reload nginx 

 

 Option 2: Renew Certificate in Kubernetes Ingress 

Renew certificate via the configured certificate manager
Update or recreate TLS secret used by the Ingress
Verify Ingress reload and certificate propagation

 

 Preventive Notes 

Track certificate expiration dates
Use automated renewal where possible
Monitor certificate validity proactively

 

 Responsibility Boundary 

Finmars SCSA provides best-effort renewal guidance and validation. Availability of the certificate authority and control of DNS remain customer responsibilities.

FM-KED-006 — Kubernetes Cluster Certificate Expiration
Severity: S1 — Critical
Recovery Class: B — Standard Recovery
Covered by Monthly Support: Yes

 

 Description 

 Internal Kubernetes certificates expire, leading to partial or complete cluster malfunction. This may affect control plane communication, node registration, API access, or workload scheduling. 

 This issue typically appears in long-running clusters where certificate rotation was not automated or monitored. 

 

 Typical Symptoms 

kubectl commands failing with TLS or x509 errors
Nodes switching to NotReady state
Control plane components restarting or failing
Ingress, networking, or admission controllers malfunctioning

 

 Diagnostic Checklist 

 Verify Certificate Expiration 

 On control plane node: 

 sudo kubeadm certs check-expiration 

 If kubeadm is not available, inspect certificates directly: 

 openssl x509 -in /etc/kubernetes/pki/apiserver.crt -noout -dates 

 

 Recovery Procedure 

⚠️ Perform these steps on the control plane node
⚠️ Requires administrative access

 

 1. Renew Kubernetes Certificates 

 sudo kubeadm certs renew all 

 This renews all cluster certificates managed by kubeadm. 

 

 2. Restart Control Plane Components 

 sudo systemctl restart kubelet 

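The control plane runs as static pods managed by kubelet:

kube-apiserver
kube-controller-manager
kube-scheduler

Restarting kubelet does not always restart these containers. If a component keeps serving the old certificate, temporarily move its manifest out of /etc/kubernetes/manifests, wait until the pod stops, and move it back so that kubelet recreates it.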

 

 3. Refresh Local kubeconfig Files 

 sudo cp /etc/kubernetes/admin.conf ~/.kube/config

sudo chown $(id -u):$(id -g) ~/.kube/config 

 Repeat for any other kubeconfig files in use. 

 

 4. Verify Cluster Health 

 kubectl get nodes

kubectl get pods -A 

 Ensure all nodes return to Ready state and system pods stabilize. 
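On larger clusters the node list can be filtered so that only problem nodes remain. A minimal sketch, assuming the default column order of `kubectl get nodes` (NAME, STATUS); the function name `not_ready_nodes` is hypothetical.

```shell
# Read `kubectl get nodes` output on stdin and print "name status" for each
# node whose STATUS is anything other than exactly "Ready".
# Cordoned nodes ("Ready,SchedulingDisabled") are listed on purpose.
not_ready_nodes() {
  awk 'NR > 1 && $2 != "Ready" { print $1, $2 }'
}

# Example:
#   kubectl get nodes | not_ready_nodes
```

An empty result means every node reports Ready.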

 

 Preventive Notes 

Monitor certificate expiration dates regularly
Schedule certificate renewal before expiration
Prefer automated rotation where supported
Avoid running clusters indefinitely without maintenance

 

 Responsibility Boundary 

 Finmars SCSA provides best-effort operational guidance. Clusters not managed via kubeadm or heavily customized may require additional investigation beyond standard support scope.

FM-KED-007 — 502 Bad Gateway Error (Application Unreachable)
Severity: S1 — Critical
Recovery Class: B — Standard Recovery
Covered by Monthly Support: Yes (known causes only)

 

 Description 

 Nginx returns a 502 Bad Gateway error because the Django application backend becomes unreachable. 

 In the majority of observed cases, this is caused by Out Of Memory (OOM) conditions on the virtual machine hosting the application worker, leading to termination of Gunicorn or equivalent application processes. 

 

 Typical Symptoms 

HTTP 502 responses from Nginx
Application intermittently unavailable
Gunicorn workers restarting or disappearing
Kernel logs indicating OOM events

 

 Primary Root Cause 

Application requests producing excessive memory usage
Large datasets loaded into memory
Insufficient RAM on the worker virtual machine

 When memory limits are exceeded, the operating system terminates the application process, leaving Nginx without a valid upstream. 

 

 Diagnostic Checklist 

 Confirm OOM Condition 

 dmesg | grep -i oom

journalctl -k | grep -i kill 
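To see which processes the kernel actually terminated, the kernel log can be reduced to process names and PIDs. This sketch assumes the common kernel message wording "Killed process <pid> (<name>)", which can differ between kernel versions. The function name `oom_victims` is hypothetical.

```shell
# Read kernel log lines on stdin and print "name pid" for each process
# reported as killed by the OOM killer.
oom_victims() {
  sed -n 's/.*Killed process \([0-9][0-9]*\) (\([^)]*\)).*/\2 \1/p'
}

# Example: count OOM kills per process name.
#   journalctl -k --no-pager | oom_victims | awk '{ print $1 }' | sort | uniq -c
```

Repeated entries for Gunicorn workers support the memory-related root cause described above.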

 Check Available Memory 

 free -h 

 Verify Application Server Status 

 systemctl status gunicorn 

 

 Recovery Options 

 Apply one or more of the following, depending on constraints. 

 

 Option 1: Reduce Request Scope 

Apply stricter filters to API requests
Limit requested date ranges
Reduce the number of portfolios, instruments, or entities per request
Avoid bulk data retrieval in a single call

 This reduces memory pressure at the application level. 
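One way to limit date ranges is to split a long period into shorter windows and issue one request per window. The sketch below only generates the windows and assumes GNU `date`. The function name `date_windows` is hypothetical, and the actual API call is left out because its endpoint and parameters depend on the request being made.

```shell
# Split an inclusive date range into consecutive windows.
# $1: start date (YYYY-MM-DD), $2: end date (YYYY-MM-DD),
# $3: window length in days (default 7)
# Output: one "window_start window_end" pair per line.
date_windows() {
  range_end=$2
  step=${3:-7}
  cur=$1
  end_ts=$(date -u -d "$range_end" +%s)
  while [ "$(date -u -d "$cur" +%s)" -le "$end_ts" ]; do
    stop=$(date -u -d "$cur + $((step - 1)) days" +%F)
    if [ "$(date -u -d "$stop" +%s)" -gt "$end_ts" ]; then
      stop=$range_end
    fi
    echo "$cur $stop"
    cur=$(date -u -d "$stop + 1 day" +%F)
  done
}

# Example: weekly windows for the first quarter.
#   date_windows 2024-01-01 2024-03-31 7 |
#     while read -r from to; do echo "request $from .. $to"; done
```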

 

 Option 2: Increase RAM on Worker Virtual Machine 

Increase memory allocation on the worker VM
Restart application services after resizing
Verify stability under previous load

 This addresses the issue at the infrastructure level. 

 

 Escalation and Unknown Issues 

If the issue persists after:

request scope reduction, and
sufficient memory allocation

then the incident is classified as an unknown issue.

 Such cases require investigation, profiling, or architectural analysis and are not covered by the standard monthly support allocation. 

 

 Preventive Notes 

Avoid unbounded API queries
Monitor memory usage trends
Define safe defaults and limits at API level
Prefer asynchronous processing for heavy workloads

 

 Responsibility Boundary 

 Finmars SCSA provides best-effort diagnostics and guidance for known memory-related causes. Application design decisions and infrastructure capacity planning beyond documented scenarios require separate analysis.

FM-KED-008 — HTTP 500 Internal Server Error (Application Error)
Severity: S2 — High
Recovery Class: D — Exploratory
Covered by Monthly Support: No (fix requires product change)

 

 Description 

The application returns an HTTP 500 Internal Server Error, indicating an unhandled exception or logic failure inside the Finmars application.

 In most cases, this represents a software defect rather than an infrastructure or configuration issue. 

 

 Typical Symptoms 

API requests returning HTTP 500
Application stack traces in logs
Errors reproducible with the same input
No corresponding infrastructure or resource alerts

 

 Primary Meaning 

 A 500 error usually means: 

the request reached the application correctly
the application failed while processing it

 This class of error cannot be reliably resolved through operational actions alone. 

 

 Required Action 

 Register an Issue in Finmars Core Repository 

 All confirmed 500 errors should be reported via GitHub: 

 🔗 https://github.com/finmars-platform/finmars-core/issues 

 When creating an issue, include: 

request details (endpoint, parameters)
error messages or stack traces
timestamps
environment details

 

 Resolution Path 

Issue is analyzed by Finmars maintainers
Fix is implemented in the product codebase
Resolution is delivered via a new Finmars release

 Customers must upgrade to the fixed version to resolve the issue. 

 

 Urgency Handling 

The most urgent bugs are addressed on a best-effort basis
Priority depends on severity, reproducibility, and impact
No guaranteed fix timelines are implied

 

 Estimated Recovery Time 

Not predictable
Depends on investigation complexity and release cycle

 

 Preventive Notes 

Keep Finmars versions up to date
Monitor application logs for early signals
Avoid unsupported customizations where possible

 

 Responsibility Boundary 

 Finmars SCSA provides guidance, triage assistance, and escalation support. Bug fixes require product changes and fall outside standard operational support.

FM-KED-009 — Missing Data in Reports
Severity: S3 — Medium
Recovery Class: A — Quick Fix
Covered by Monthly Support: Yes

 

 Description 

 Reports return incomplete results or missing values because required Prices and/or FX Rates are not available for the requested reporting date. 

 This is not an application error. Finmars can only calculate reports based on data that exists in the system. 

 

 Typical Symptoms 

Empty or partially populated report fields
Missing valuations or calculated figures
Reports returning results for some instruments but not others
No application or infrastructure errors present

 

 Primary Meaning 

The reporting engine executed successfully, but the input data set is incomplete.

 Most commonly: 

prices are missing for one or more instruments
FX rates are missing for one or more currencies
data is not available for the exact requested date

 

 Diagnostic Checklist 

 Identify Report Date 

 Confirm the exact date used in the report 

 Verify Price Availability 

 Check that instrument prices exist for the report date 

 Verify FX Rate Availability 

 Check that FX rates exist for all required currency pairs 
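The availability checks come down to comparing what the report needs with what exists for the report date. The sketch below compares two plain-text lists, one identifier per line. Both input files are hypothetical exports, and how they are produced from Finmars is outside this sketch. The function name `missing_entries` is also an assumption.

```shell
# Print identifiers that are required but not available.
# $1: file listing required identifiers (instruments or currencies)
# $2: file listing identifiers that have data for the report date
missing_entries() {
  required_sorted=$(mktemp)
  available_sorted=$(mktemp)
  sort -u "$1" > "$required_sorted"
  sort -u "$2" > "$available_sorted"
  # comm -23 keeps lines that appear only in the first (required) list.
  comm -23 "$required_sorted" "$available_sorted"
  rm -f "$required_sorted" "$available_sorted"
}

# Example:
#   missing_entries required_currencies.txt fx_rates_for_date.txt
```

An empty result means every required identifier has data for the chosen date.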

 

 Recovery Procedure 

 Import Missing Data into Finmars 

Import required Prices for the requested date
Import required FX Rates for the requested date
Re-run the report after data import

 Once data is present, the report will calculate correctly. 

 

 Preventive Notes 

Ensure price and FX data feeds are complete
Monitor data import success regularly
Align report dates with available market data

 

 Responsibility Boundary 

 Finmars SCSA provides diagnostics and guidance. Data availability and correctness depend on customer data imports or connected providers.