markojak committed
Commit 46e6e62 · verified · 1 Parent(s): f7f46aa

Upload folder using huggingface_hub

Files changed (4):
  1. .gradio/certificate.pem +31 -0
  2. README.md +80 -8
  3. creators.py +506 -0
  4. requirements.txt +4 -0
.gradio/certificate.pem ADDED
@@ -0,0 +1,31 @@
+ -----BEGIN CERTIFICATE-----
+ MIIFazCCA1OgAwIBAgIRAIIQz7DSQONZRGPgu2OCiwAwDQYJKoZIhvcNAQELBQAw
+ TzELMAkGA1UEBhMCVVMxKTAnBgNVBAoTIEludGVybmV0IFNlY3VyaXR5IFJlc2Vh
+ cmNoIEdyb3VwMRUwEwYDVQQDEwxJU1JHIFJvb3QgWDEwHhcNMTUwNjA0MTEwNDM4
+ WhcNMzUwNjA0MTEwNDM4WjBPMQswCQYDVQQGEwJVUzEpMCcGA1UEChMgSW50ZXJu
+ ZXQgU2VjdXJpdHkgUmVzZWFyY2ggR3JvdXAxFTATBgNVBAMTDElTUkcgUm9vdCBY
+ MTCCAiIwDQYJKoZIhvcNAQEBBQADggIPADCCAgoCggIBAK3oJHP0FDfzm54rVygc
+ h77ct984kIxuPOZXoHj3dcKi/vVqbvYATyjb3miGbESTtrFj/RQSa78f0uoxmyF+
+ 0TM8ukj13Xnfs7j/EvEhmkvBioZxaUpmZmyPfjxwv60pIgbz5MDmgK7iS4+3mX6U
+ A5/TR5d8mUgjU+g4rk8Kb4Mu0UlXjIB0ttov0DiNewNwIRt18jA8+o+u3dpjq+sW
+ T8KOEUt+zwvo/7V3LvSye0rgTBIlDHCNAymg4VMk7BPZ7hm/ELNKjD+Jo2FR3qyH
+ B5T0Y3HsLuJvW5iB4YlcNHlsdu87kGJ55tukmi8mxdAQ4Q7e2RCOFvu396j3x+UC
+ B5iPNgiV5+I3lg02dZ77DnKxHZu8A/lJBdiB3QW0KtZB6awBdpUKD9jf1b0SHzUv
+ KBds0pjBqAlkd25HN7rOrFleaJ1/ctaJxQZBKT5ZPt0m9STJEadao0xAH0ahmbWn
+ OlFuhjuefXKnEgV4We0+UXgVCwOPjdAvBbI+e0ocS3MFEvzG6uBQE3xDk3SzynTn
+ jh8BCNAw1FtxNrQHusEwMFxIt4I7mKZ9YIqioymCzLq9gwQbooMDQaHWBfEbwrbw
+ qHyGO0aoSCqI3Haadr8faqU9GY/rOPNk3sgrDQoo//fb4hVC1CLQJ13hef4Y53CI
+ rU7m2Ys6xt0nUW7/vGT1M0NPAgMBAAGjQjBAMA4GA1UdDwEB/wQEAwIBBjAPBgNV
+ HRMBAf8EBTADAQH/MB0GA1UdDgQWBBR5tFnme7bl5AFzgAiIyBpY9umbbjANBgkq
+ hkiG9w0BAQsFAAOCAgEAVR9YqbyyqFDQDLHYGmkgJykIrGF1XIpu+ILlaS/V9lZL
+ ubhzEFnTIZd+50xx+7LSYK05qAvqFyFWhfFQDlnrzuBZ6brJFe+GnY+EgPbk6ZGQ
+ 3BebYhtF8GaV0nxvwuo77x/Py9auJ/GpsMiu/X1+mvoiBOv/2X/qkSsisRcOj/KK
+ NFtY2PwByVS5uCbMiogziUwthDyC3+6WVwW6LLv3xLfHTjuCvjHIInNzktHCgKQ5
+ ORAzI4JMPJ+GslWYHb4phowim57iaztXOoJwTdwJx4nLCgdNbOhdjsnvzqvHu7Ur
+ TkXWStAmzOVyyghqpZXjFaH3pO3JLF+l+/+sKAIuvtd7u+Nxe5AW0wdeRlN8NwdC
+ jNPElpzVmbUq4JUagEiuTDkHzsxHpFKVK7q4+63SM1N95R1NbdWhscdCb+ZAJzVc
+ oyi3B43njTOQ5yOf+1CceWxG1bQVs5ZufpsMljq4Ui0/1lvh+wjChP4kqKOJ2qxq
+ 4RgqsahDYVvTH9w7jXbyLeiNdd8XM2w9U/t7y0Ff/9yi0GE44Za4rF2LN9d11TPA
+ mRGunUHBcnWEvgJBQl9nJEiU0Zsnvgc/ubhPgXRR4Xq37Z0j4r7g1SgEEzwxA57d
+ emyPxgcYxn/eR44/KJ4EBs+lVDR3veyJm+kXQ99b21/+jh5Xos1AnX5iItreGCc=
+ -----END CERTIFICATE-----
README.md CHANGED
@@ -1,13 +1,85 @@
  ---
- title: Tt Creators
- emoji: 🦀
- colorFrom: purple
- colorTo: blue
+ title: tt-creators
+ app_file: creators.py
  sdk: gradio
  sdk_version: 5.20.0
- app_file: app.py
- pinned: false
- short_description: TT-Creators Exploration
  ---
-
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+ # TikTok Creator Analyzer
+
+ A Gradio-based tool for analyzing TikTok creator profiles from CSV files.
+
+ ## Features
+
+ - Efficiently loads and processes millions of TikTok creator profiles
+ - Caches data in Parquet format for faster subsequent loads
+ - Tracks processed files to avoid reprocessing the same data
+ - Incrementally updates the database when new files are added
+ - Advanced search with multiple filters:
+   - Follower count range (min/max)
+   - Video count range (min/max)
+   - Keywords in signature
+   - Region filter
+   - "Has Email" filter to find profiles with contact information
+ - Download search results as CSV
+ - Network-accessible interface (binds to 0.0.0.0)
+ - Shareable via a temporary public URL
+
+ ## Installation
+
+ 1. Install the required dependencies:
+
+ ```bash
+ pip install -r requirements.txt
+ ```
+
+ 2. Make sure your CSV files are in the correct location (`../data/tiktok_profiles/`)
+
+ ## Usage
+
+ Run the script:
+
+ ```bash
+ python creators.py
+ ```
+
+ The first run will:
+ 1. Load all CSV files from the data directory
+ 2. Combine them into a single dataset
+ 3. Save the combined data as a Parquet file for faster loading in the future
+ 4. Track which files have been processed to avoid duplicates
+ 5. Launch a Gradio web interface for searching and analyzing the data
+
+ Subsequent runs will:
+ 1. Load the existing data from the Parquet file
+ 2. Check for new CSV files that haven't been processed yet
+ 3. If new files exist, process only those files and update the database
+ 4. Launch the Gradio interface with the updated data
+
+ The interface will be accessible from:
+ - Other machines on your network at `http://your-ip-address:7860`
+ - A temporary public URL printed in the console (thanks to `share=True`), as shown below
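+
+ Both behaviors come from the launch call at the end of `creators.py`:
+
+ ```python
+ interface.launch(share=True, server_name="0.0.0.0")
+ ```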
+
+ ## Maintenance
+
+ The application includes a Maintenance tab that shows:
+ - How many files have been processed
+ - When the database was last updated
+ - An option to force reload all files (useful if you suspect data corruption); a command-line alternative is sketched below
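+
+ If you would rather reset from the command line, deleting the Parquet cache and the processed-files log (paths taken from `creators.py`) makes the next run rebuild everything from the CSVs:
+
+ ```bash
+ rm ../data/tiktok_profiles_combined.parquet ../data/processed_files.json
+ ```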
+
+ ## Data Format
+
+ The CSV files should have the following columns:
+ - id
+ - unique_id
+ - follower_count
+ - nickname
+ - video_count
+ - following_count
+ - signature
+ - email
+ - bio_link
+ - updated_at
+ - tt_seller
+ - region
+ - language
+ - url
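+
+ For illustration only, a minimal input file might look like this (the header matches the columns above; the record itself is invented):
+
+ ```csv
+ id,unique_id,follower_count,nickname,video_count,following_count,signature,email,bio_link,updated_at,tt_seller,region,language,url
+ 6812345678901234567,examplecreator,152000,Example Creator,340,120,Daily cooking videos,hello@example.com,https://example.com,2025-01-15,false,US,en,https://www.tiktok.com/@examplecreator
+ ```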
creators.py ADDED
@@ -0,0 +1,506 @@
+ #!/usr/bin/env python3
+ import os
+ import glob
+ import pandas as pd
+ import gradio as gr
+ import time
+ import pyarrow as pa
+ import pyarrow.parquet as pq
+ import json
+ from pathlib import Path
+
+ # Configuration
+ DATA_DIR = Path("../data/tiktok_profiles")
+ CACHE_FILE = Path("../data/tiktok_profiles_combined.parquet")
+ PROCESSED_FILES_LOG = Path("../data/processed_files.json")
+ COLUMNS = [
+     "id",
+     "unique_id",
+     "follower_count",
+     "nickname",
+     "video_count",
+     "following_count",
+     "signature",
+     "email",
+     "bio_link",
+     "updated_at",
+     "tt_seller",
+     "region",
+     "language",
+     "url",
+ ]
+
+
+ def get_processed_files():
+     """
+     Get the list of already processed files from the log.
+     Returns a set of filenames that have been processed.
+     """
+     if PROCESSED_FILES_LOG.exists():
+         with open(PROCESSED_FILES_LOG, "r") as f:
+             return set(json.load(f))
+     return set()
+
+
+ def update_processed_files(processed_files):
+     """
+     Update the log of processed files.
+     """
+     PROCESSED_FILES_LOG.parent.mkdir(exist_ok=True)
+     with open(PROCESSED_FILES_LOG, "w") as f:
+         json.dump(list(processed_files), f)
+
+
+ def load_data(force_reload=False):
+     """
+     Load data from either the cache file or from individual CSV files.
+     Only processes new files that haven't been processed before.
+     Returns a pandas DataFrame with all the data.
+
+     Args:
+         force_reload: If True, reprocess all files regardless of whether they've been processed before.
+     """
+     start_time = time.time()
+
+     # Get all available CSV files
+     all_csv_files = {file.name: file for file in DATA_DIR.glob("*.csv")}
+
+     # If cache exists and we're not forcing a reload, load from cache
+     if CACHE_FILE.exists() and not force_reload:
+         print(f"Loading data from cache file: {CACHE_FILE}")
+         df = pd.read_parquet(CACHE_FILE)
+
+         # Check for new files
+         processed_files = get_processed_files()
+         new_files = [
+             all_csv_files[name] for name in all_csv_files if name not in processed_files
+         ]
+
+         if not new_files:
+             print(
+                 f"No new files to process. Data loaded in {time.time() - start_time:.2f} seconds"
+             )
+             return df
+
+         print(f"Found {len(new_files)} new files to process")
+
+         # Process only the new files
+         new_dfs = []
+         for i, file in enumerate(new_files):
+             print(f"Loading new file {i+1}/{len(new_files)}: {file.name}")
+
+             # Read CSV with optimized settings
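+             # ("Int64" with a capital I is pandas' nullable integer dtype,
+             # so rows with missing counts stay <NA> instead of becoming floats)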
+             chunk_df = pd.read_csv(
+                 file,
+                 dtype={
+                     "id": "str",
+                     "unique_id": "str",
+                     "follower_count": "Int64",
+                     "nickname": "str",
+                     "video_count": "Int64",
+                     "following_count": "Int64",
+                     "signature": "str",
+                     "email": "str",
+                     "bio_link": "str",
+                     "updated_at": "str",
+                     "tt_seller": "str",
+                     "region": "str",
+                     "language": "str",
+                     "url": "str",
+                 },
+                 low_memory=False,
+             )
+             new_dfs.append(chunk_df)
+             processed_files.add(file.name)
+
+         if new_dfs:
+             # Combine new data with existing data
+             print("Combining new data with existing data...")
+             new_data = pd.concat(new_dfs, ignore_index=True)
+             df = pd.concat([df, new_data], ignore_index=True)
+
+             # Remove duplicates based on unique_id
+             df = df.drop_duplicates(subset=["unique_id"], keep="last")
+
+             # Save updated data to cache file
+             print(f"Saving updated data to {CACHE_FILE}")
+             df.to_parquet(CACHE_FILE, index=False)
+
+             # Update the processed files log
+             update_processed_files(processed_files)
+
+         print(f"Data loaded and updated in {time.time() - start_time:.2f} seconds")
+         return df
+
+     # If no cache file or force_reload is True, process all files
+     print(f"Loading data from CSV files in {DATA_DIR}")
+
+     # Get all CSV files
+     csv_files = list(all_csv_files.values())
+     total_files = len(csv_files)
+     print(f"Found {total_files} CSV files")
+
+     # Load data in chunks
+     dfs = []
+     processed_files = set()
+
+     for i, file in enumerate(csv_files):
+         if i % 10 == 0:
+             print(f"Loading file {i+1}/{total_files}: {file.name}")
+
+         # Read CSV with optimized settings
+         chunk_df = pd.read_csv(
+             file,
+             dtype={
+                 "id": "str",
+                 "unique_id": "str",
+                 "follower_count": "Int64",
+                 "nickname": "str",
+                 "video_count": "Int64",
+                 "following_count": "Int64",
+                 "signature": "str",
+                 "email": "str",
+                 "bio_link": "str",
+                 "updated_at": "str",
+                 "tt_seller": "str",
+                 "region": "str",
+                 "language": "str",
+                 "url": "str",
+             },
+             low_memory=False,
+         )
+         dfs.append(chunk_df)
+         processed_files.add(file.name)
+
+     # Combine all dataframes
+     print("Combining all dataframes...")
+     df = pd.concat(dfs, ignore_index=True)
+
+     # Remove duplicates based on unique_id
+     df = df.drop_duplicates(subset=["unique_id"], keep="last")
+
+     # Save to cache file
+     print(f"Saving combined data to {CACHE_FILE}")
+     CACHE_FILE.parent.mkdir(exist_ok=True)
+     df.to_parquet(CACHE_FILE, index=False)
+
+     # Update the processed files log
+     update_processed_files(processed_files)
+
+     print(f"Data loaded and cached in {time.time() - start_time:.2f} seconds")
+     return df
+
+
+ def search_by_username(df, username):
+     """Search for profiles by username (unique_id)"""
+     if not username:
+         return pd.DataFrame()
+
+     # Case-insensitive search
+     results = df[df["unique_id"].str.lower().str.contains(username.lower(), na=False)]
+     return results.head(100)  # Limit results to prevent UI overload
+
+
+ def search_by_nickname(df, nickname):
+     """Search for profiles by nickname"""
+     if not nickname:
+         return pd.DataFrame()
+
+     # Case-insensitive search
+     results = df[df["nickname"].str.lower().str.contains(nickname.lower(), na=False)]
+     return results.head(100)  # Limit results to prevent UI overload
+
+
+ def search_by_follower_count(df, min_followers, max_followers):
+     """Search for profiles by follower count range"""
+     if min_followers is None:
+         min_followers = 0
+     if max_followers is None:
+         max_followers = df["follower_count"].max()
+
+     results = df[
+         (df["follower_count"] >= min_followers)
+         & (df["follower_count"] <= max_followers)
+     ]
+     return results.head(100)  # Limit results to prevent UI overload
+
+
+ def format_results(df):
+     """Format the results for display"""
+     if df.empty:
+         # Return an empty DataFrame with the same columns instead of a string
+         return pd.DataFrame(columns=df.columns)
+
+     # Format the DataFrame for display
+     display_df = df.copy()
+
+     # Convert follower count to human-readable format
+     def format_number(num):
+         if pd.isna(num):
+             return "N/A"
+         if num >= 1_000_000:
+             return f"{num/1_000_000:.1f}M"
+         elif num >= 1_000:
+             return f"{num/1_000:.1f}K"
+         return str(num)
+
+     display_df["follower_count"] = display_df["follower_count"].apply(format_number)
+     display_df["video_count"] = display_df["video_count"].apply(format_number)
+     display_df["following_count"] = display_df["following_count"].apply(format_number)
+
+     return display_df
+
+
+ def combined_search(
+     df,
+     min_followers,
+     max_followers,
+     min_videos,
+     max_videos,
+     signature_query,
+     region,
+     has_email,
+ ):
+     """Combined search function using all criteria"""
+     results = df.copy()
+
+     # Apply each filter if provided
+     if min_followers is not None:
+         results = results[results["follower_count"] >= min_followers]
+
+     if max_followers is not None:
+         results = results[results["follower_count"] <= max_followers]
+
+     if min_videos is not None:
+         results = results[results["video_count"] >= min_videos]
+
+     if max_videos is not None:
+         results = results[results["video_count"] <= max_videos]
+
+     if signature_query:
+         results = results[
+             results["signature"]
+             .str.lower()
+             .str.contains(signature_query.lower(), na=False)
+         ]
+
+     if region:
+         results = results[results["region"].str.lower() == region.lower()]
+
+     # Filter for profiles with email
+     if has_email:
+         results = results[results["email"].notna() & (results["email"] != "")]
+
+     return results.head(1000)  # Limit to 1000 results to prevent UI overload
+
+
+ def create_interface(df):
+     """Create the Gradio interface"""
+     # Get min and max follower counts for slider
+     min_followers_global = max(1000, int(df["follower_count"].min()))
+     max_followers_global = min(10000000, int(df["follower_count"].max()))
+
+     # Get min and max video counts for slider
+     min_videos_global = max(1, int(df["video_count"].min()))
+     max_videos_global = min(10000, int(df["video_count"].max()))
+
+     # Get unique regions for dropdown
+     regions = sorted(df["region"].dropna().unique().tolist())
+     regions = [""] + regions  # Add empty option
+
+     with gr.Blocks(title="TikTok Creator Analyzer") as interface:
+         gr.Markdown("# TikTok Creator Analyzer")
+         gr.Markdown(f"Database contains {len(df):,} creator profiles")
+
+         # Show top 100 profiles by default
+         top_profiles = df.sort_values(by="follower_count", ascending=False).head(100)
+         default_view = format_results(top_profiles)
+
+         with gr.Tab("Overview"):
+             gr.Markdown("## Top 100 Profiles by Follower Count")
+             overview_results = gr.Dataframe(value=default_view, label="Top Profiles")
+
+             refresh_btn = gr.Button("Refresh")
+             refresh_btn.click(
+                 fn=lambda: format_results(
+                     df.sort_values(by="follower_count", ascending=False).head(100)
+                 ),
+                 inputs=[],
+                 outputs=overview_results,
+             )
+
+         with gr.Tab("Advanced Search"):
+             with gr.Row():
+                 with gr.Column(scale=1):
+                     gr.Markdown("### Follower Count")
+                     min_followers_slider = gr.Slider(
+                         minimum=min_followers_global,
+                         maximum=max_followers_global,
+                         value=min_followers_global,
+                         step=1000,
+                         label="Minimum Followers",
+                         interactive=True,
+                     )
+                     max_followers_slider = gr.Slider(
+                         minimum=min_followers_global,
+                         maximum=max_followers_global,
+                         value=max_followers_global,
+                         step=1000,
+                         label="Maximum Followers",
+                         interactive=True,
+                     )
+
+                     gr.Markdown("### Video Count")
+                     min_videos_slider = gr.Slider(
+                         minimum=min_videos_global,
+                         maximum=max_videos_global,
+                         value=min_videos_global,
+                         step=10,
+                         label="Minimum Videos",
+                         interactive=True,
+                     )
+                     max_videos_slider = gr.Slider(
+                         minimum=min_videos_global,
+                         maximum=max_videos_global,
+                         value=max_videos_global,
+                         step=10,
+                         label="Maximum Videos",
+                         interactive=True,
+                     )
+
+                 with gr.Column(scale=1):
+                     signature_input = gr.Textbox(label="Keywords in Signature")
+                     region_input = gr.Dropdown(label="Region", choices=regions)
+                     has_email_checkbox = gr.Checkbox(label="Has Email", value=False)
+                     search_btn = gr.Button("Search", variant="primary", size="lg")
+
+             results_count = gr.Markdown("### Results: 0 profiles found")
+
+             # Create a dataframe with download button
+             with gr.Row():
+                 search_results = gr.Dataframe(label="Results")
+                 download_btn = gr.Button("Download Results as CSV")
+
+             # Function to update results count
+             def update_results_count(results_df):
+                 count = len(results_df)
+                 return f"### Results: {count:,} profiles found"
+
+             # Function to perform search and update results
+             def perform_search(
+                 min_followers,
+                 max_followers,
+                 min_videos,
+                 max_videos,
+                 signature,
+                 region,
+                 has_email,
+             ):
+                 results = combined_search(
+                     df,
+                     min_followers,
+                     max_followers,
+                     min_videos,
+                     max_videos,
+                     signature,
+                     region,
+                     has_email,
+                 )
+                 formatted_results = format_results(results)
+                 count_text = update_results_count(results)
+                 return formatted_results, count_text
+
+             # Function to download results as CSV
+             def download_results(results_df):
+                 if results_df.empty:
+                     return None
+
+                 # Convert back to original format for download
+                 download_df = df[df["unique_id"].isin(results_df["unique_id"])]
+
+                 # Save to temporary CSV file
+                 temp_csv = "temp_results.csv"
+                 download_df.to_csv(temp_csv, index=False)
+                 return temp_csv
+
+             # Connect the search button
+             search_btn.click(
+                 fn=perform_search,
+                 inputs=[
+                     min_followers_slider,
+                     max_followers_slider,
+                     min_videos_slider,
+                     max_videos_slider,
+                     signature_input,
+                     region_input,
+                     has_email_checkbox,
+                 ],
+                 outputs=[search_results, results_count],
+             )
+
+             # Connect the download button
+             download_btn.click(
+                 fn=download_results,
+                 inputs=[search_results],
+                 outputs=[gr.File(label="Download")],
+             )
+
+         with gr.Tab("Statistics"):
+             gr.Markdown("## Database Statistics")
+
+             # Calculate some basic statistics
+             total_creators = len(df)
+             total_followers = df["follower_count"].sum()
+             avg_followers = df["follower_count"].mean()
+             median_followers = df["follower_count"].median()
+             max_followers = df["follower_count"].max()
+
+             stats_md = f"""
+             - Total Creators: {total_creators:,}
+             - Total Followers: {total_followers:,}
+             - Average Followers: {avg_followers:,.2f}
+             - Median Followers: {median_followers:,}
+             - Max Followers: {max_followers:,}
+             """
+
+             gr.Markdown(stats_md)
+
+         with gr.Tab("Maintenance"):
+             gr.Markdown("## Database Maintenance")
+
+             # Get processed files info
+             processed_files = get_processed_files()
+
+             maintenance_md = f"""
+             - Total processed files: {len(processed_files)}
+             - Last update: {time.ctime(CACHE_FILE.stat().st_mtime) if CACHE_FILE.exists() else 'Never'}
+             """
+
+             gr.Markdown(maintenance_md)
+
+             with gr.Row():
+                 force_reload_btn = gr.Button("Force Reload All Files")
+                 reload_status = gr.Markdown("Click to reload all files from scratch")
+
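+             # Note: this handler only updates the status text; the rebuild itself
+             # is not triggered here. load_data(force_reload=True) would perform it,
+             # but main() does not currently pass force_reload, so a restart only
+             # picks up new files.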
+             def reload_all_files():
+                 return "Reloading all files... This may take a while. Please restart the application."
+
+             force_reload_btn.click(
+                 fn=reload_all_files, inputs=[], outputs=reload_status
+             )
+
+     return interface
+
+
+ def main():
+     print("Loading TikTok creator data...")
+     df = load_data()
+     print(f"Loaded {len(df):,} creator profiles")
+
+     # Create and launch the interface
+     interface = create_interface(df)
+     interface.launch(share=True, server_name="0.0.0.0")
+
+
+ if __name__ == "__main__":
+     main()
requirements.txt ADDED
@@ -0,0 +1,4 @@
+ pandas
+ gradio
+ pyarrow
+ pip-chillpython